# RECON Operations Runbook

## Service Info

- **Host:** recon-vm (VM 131 on data node), migrated from CT 130 on 2026-04-19
- **IP:** 192.168.1.130 / 100.64.0.24
- **Install:** /opt/recon/
- **User:** zvx
- **Services:** `recon.service`, `recon-watchdog.service`, `kiwix.service` (systemd)

## Service Management

```bash
ssh zvx@100.64.0.24
sudo systemctl status recon    # also: start | stop | restart
journalctl -u recon -f
```

## Health Check

```bash
curl -s http://100.64.0.24:8420/api/health | python3 -m json.tool
# Returns: healthy (200), degraded/unhealthy (503)
# Checks: Qdrant, TEI, NFS, Gemini keys, pipeline counts
```

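For scripted checks (cron, monitoring), the HTTP status code alone is enough to branch on. The sketch below maps the two codes listed above to a label and an exit code. The function names and the "unreachable" fallback for any other code are assumptions of this example, not part of RECON.

```bash
#!/usr/bin/env bash

# Map the HTTP status of /api/health to a label and an exit code.
# 200 -> healthy and 503 -> degraded/unhealthy come from the runbook;
# treating every other code as "unreachable" is an assumption of this sketch.
classify_health() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "degraded-or-unhealthy"; return 1 ;;
    *)   echo "unreachable"; return 2 ;;
  esac
}

# Print only the status code. curl prints 000 when the connection fails.
health_code() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' \
    "${1:-http://100.64.0.24:8420/api/health}"
}
```

For example, `classify_health "$(health_code)"` prints the label and returns non-zero whenever the service is not healthy.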
## Pipeline Status

```bash
ssh zvx@100.64.0.24
cd /opt/recon && source venv/bin/activate
python3 recon.py status            # Summary counts
python3 recon.py failures          # Failed documents
python3 recon.py search "query"    # Test search
```

## Dashboard

- **URL:** http://100.64.0.24:8420
- Shows: pipeline progress, per-source breakdown, Qdrant stats
- Auto-refreshes every 30s

## Common Operations

```bash
cd /opt/recon && source venv/bin/activate

# Add a PDF
python3 recon.py upload --file /path/to.pdf --category "Reference"

# Add web content
python3 recon.py ingest-url "https://example.com/article" --process

# Crawl a website
python3 recon.py crawl "https://docs.example.com" --process

# Manual pipeline run (normally automatic via service)
python3 recon.py extract
python3 recon.py enrich
python3 recon.py embed

# Scan library for new PDFs (normally hourly via service)
python3 recon.py scan
python3 recon.py queue
```

## Dependencies

| Service | Host | Port | Purpose |
|---------|------|------|---------|
| Qdrant | cortex | 6333 | Vector DB (recon_knowledge collection) |
| TEI | cortex | 8090 | Text embeddings (bge-m3, 1024-dim) |
| Ollama | cortex | 11434 | Chat model for Aurora RAG |
| NFS | pi-nas | n/a | /mnt/library (PDF source) |
| Gemini API | Google | n/a | Enrichment + vision OCR (4 keys in .env) |
| Contabo VPS | 100.64.0.1 | n/a | Backup destination |

## Backups

- **Destination:** `root@100.64.0.1:/opt/backups/recon/`
- **Full sync (concepts, text, DB, config):** every 6 hours via cron
- **DB snapshot only:** every 2 hours via cron
- **Script:** `/opt/recon/scripts/backup.sh`

### Verify backups

```bash
ssh root@100.64.0.1 'ls -lh /opt/backups/recon/recon_*.db && du -sh /opt/backups/recon/'
```

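Listing the files shows that backups exist, but not whether the newest one is recent enough for the 2-hour and 6-hour schedules above. A minimal freshness check might look like the following. The function name `is_fresh` is hypothetical, and the timestamps are passed as epoch seconds to keep the logic easy to test.

```bash
#!/usr/bin/env bash

# Succeed when a backup timestamp is within max_age seconds of "now".
# All three arguments are integers: mtime and now in epoch seconds,
# max_age in seconds.
is_fresh() {
  local mtime=$1 now=$2 max_age=$3
  (( now - mtime <= max_age ))
}
```

On the VPS this could be called as `is_fresh "$(stat -c %Y /opt/backups/recon/recon_latest.db)" "$(date +%s)" $((2 * 3600))`, assuming GNU `stat` is available. Allowing some slack beyond the exact cron interval is probably sensible.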
## Troubleshooting

### Pipeline stalled (no progress)

```bash
journalctl -u recon -n 50                      # Check errors
curl -s http://100.64.0.24:8420/api/health     # Check dependencies
sudo systemctl restart recon                   # Restart
```

### Gemini rate limits (429 errors)

Built-in handling: exponential backoff (5s → 10s → 20s → 40s → 80s) with jitter. A window that still fails is skipped and processing continues, since partial enrichment beats none.

If 429s persist, reduce `enrich_workers` in `config.yaml` and restart `recon`.

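To make the retry schedule concrete, the sketch below reproduces the 5s to 80s doubling sequence. It is an illustration only, not RECON's implementation: the function names, the cap at 80s for later attempts, and the jitter range (up to 25% of the base delay) are all assumptions.

```bash
#!/usr/bin/env bash

# Base delay in seconds for retry attempts 0..4: 5, 10, 20, 40, 80.
# Attempts beyond 4 stay at 80s (the cap is an assumption of this sketch).
backoff_base() {
  local attempt=$1
  if (( attempt > 4 )); then
    attempt=4
  fi
  echo $(( 5 * (1 << attempt) ))
}

# Base delay plus random jitter of 0..25% of the base
# (the jitter range is an assumption; the runbook only says "with jitter").
backoff_delay() {
  local base
  base=$(backoff_base "$1")
  echo $(( base + RANDOM % (base / 4 + 1) ))
}
```

Jitter spreads retries from parallel workers so they do not all hit the API again at the same instant, which is one reason lowering `enrich_workers` also helps.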
### Qdrant down

```bash
ssh zvx@cortex
docker ps | grep qdrant
docker restart qdrant
# If data lost: ssh zvx@100.64.0.24 'cd /opt/recon && source venv/bin/activate && python3 recon.py rebuild'
```

### TEI down

```bash
ssh zvx@cortex
docker ps | grep tei
docker restart tei
```

### NFS mount lost

```bash
ssh zvx@100.64.0.24
mount | grep library
sudo mount -a
sudo systemctl restart recon
```

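The `mount | grep library` check matches any line containing "library", so a stricter test compares the mount target exactly. A minimal sketch, assuming a Linux host with `/proc/mounts`; the function name `is_mounted` is hypothetical, and the optional second argument exists only so the logic can be tested against a sample file.

```bash
#!/usr/bin/env bash

# Succeed if mount_point appears as a mount target (second field)
# in a mounts table. The table defaults to /proc/mounts.
is_mounted() {
  local mount_point=$1 table=${2:-/proc/mounts}
  awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$table"
}
```

With this, the remount step can be made conditional: `is_mounted /mnt/library || sudo mount -a`.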
### Reset stuck documents

```bash
cd /opt/recon && source venv/bin/activate
# Find stuck transitional states
sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents WHERE status IN ('extracting','enriching','embedding') GROUP BY status;"
# Reset them
sqlite3 data/recon.db "UPDATE documents SET status='queued' WHERE status='extracting';"
sqlite3 data/recon.db "UPDATE documents SET status='extracted' WHERE status='enriching';"
sqlite3 data/recon.db "UPDATE documents SET status='enriched' WHERE status='embedding';"
```

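Each transitional status falls back to the stable status just before it. The sketch below encodes that mapping once and emits the three resets as a single transaction, so they apply together or not at all. The function names are hypothetical. Stopping `recon.service` first may be prudent so that no worker is mid-transition, although the runbook does not say whether this is required.

```bash
#!/usr/bin/env bash

# Map a transitional status to the stable status it falls back to.
rollback_status() {
  case "$1" in
    extracting) echo "queued" ;;
    enriching)  echo "extracted" ;;
    embedding)  echo "enriched" ;;
    *)          return 1 ;;   # not a transitional status
  esac
}

# Print the reset statements wrapped in one transaction.
reset_sql() {
  local s
  echo "BEGIN;"
  for s in extracting enriching embedding; do
    echo "UPDATE documents SET status='$(rollback_status "$s")' WHERE status='$s';"
  done
  echo "COMMIT;"
}
```

The output can be reviewed first and then applied with `reset_sql | sqlite3 data/recon.db`.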
### Full recovery from Contabo backup

```bash
ssh zvx@100.64.0.24
sudo systemctl stop recon
rsync -av root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
rsync -av root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
# Pick the latest DB backup
rsync -av root@100.64.0.1:/opt/backups/recon/recon_latest.db /opt/recon/data/recon.db
cd /opt/recon && source venv/bin/activate
python3 recon.py rebuild    # Rebuilds Qdrant from concept JSONs
sudo systemctl start recon
```

## Key Files

| Path | Purpose |
|------|---------|
| `/opt/recon/config.yaml` | All configuration |
| `/opt/recon/.env` | Gemini API keys (GEMINI_KEY_1 through GEMINI_KEY_4) |
| `/opt/recon/data/recon.db` | SQLite status DB |
| `/opt/recon/data/concepts/` | Gemini extraction results (CRITICAL: costs $ to regenerate) |
| `/opt/recon/data/text/` | Extracted page text (regenerable from PDFs) |
| `/opt/recon/PROJECT-BIBLE.md` | Full system documentation |
| `/opt/recon/scripts/backup.sh` | Backup script |
| `/opt/recon/scripts/validate.py` | Pipeline consistency checker |
| `/opt/recon/scripts/rebuild_qdrant.py` | Nuclear Qdrant rebuild |

## Pipeline Architecture

```
/mnt/library/ (NFS)
     │
     ▼ hourly scan
[Catalogue] → [Queue] → [Extract] → [Enrich] → [Embed] → [Complete]
                        4 workers   16 workers 4 workers
                        PyPDF2      Gemini     TEI+Qdrant
                        pdftotext   2.0 Flash  bge-m3
                        Tesseract              1024-dim
                        Gemini Vision
```

---

*Last updated: 2026-04-19 (updated for CT 130 → VM 131 migration)*