# RECON Operations Runbook
## Service Info
- **Host:** recon-vm (VM 131 on data node) — migrated from CT 130 on 2026-04-19
- **IP:** 192.168.1.130 / 100.64.0.24
- **Install:** /opt/recon/
- **User:** zvx
- **Services:** `recon.service`, `recon-watchdog.service`, `kiwix.service` (systemd)
## Service Management
```bash
ssh zvx@100.64.0.24
sudo systemctl status recon   # or: start | stop | restart
journalctl -u recon -f
```
## Health Check
```bash
curl -s http://100.64.0.24:8420/api/health | python3 -m json.tool
# Returns: healthy (200), degraded/unhealthy (503)
# Checks: Qdrant, TEI, NFS, Gemini keys, pipeline counts
```
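
For scripted checks, the status code alone is enough to tell the states apart. A minimal sketch, assuming only the 200/503 behaviour documented above: the function names `health_state` and `recon_health` are hypothetical, and treating any other code (including curl's `000` when the connection fails) as unreachable is an assumption.

```bash
# Map an HTTP status code from /api/health to a state label.
health_state() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "degraded-or-unhealthy" ;;
    *)   echo "unreachable" ;;   # assumption: anything else means no usable answer
  esac
}

# Fetch only the status code; curl prints 000 if it cannot connect.
recon_health() {
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "http://100.64.0.24:8420/api/health")
  health_state "$code"
}
```

Calling `recon_health` prints a single label, which is convenient in cron jobs or alert scripts.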
## Pipeline Status
```bash
ssh zvx@100.64.0.24
cd /opt/recon && source venv/bin/activate
python3 recon.py status # Summary counts
python3 recon.py failures # Failed documents
python3 recon.py search "query" # Test search
```
## Dashboard
- **URL:** http://100.64.0.24:8420
- Shows: pipeline progress, per-source breakdown, Qdrant stats
- Auto-refreshes every 30s
## Common Operations
```bash
cd /opt/recon && source venv/bin/activate
# Add a PDF
python3 recon.py upload --file /path/to.pdf --category "Reference"
# Add web content
python3 recon.py ingest-url "https://example.com/article" --process
# Crawl a website
python3 recon.py crawl "https://docs.example.com" --process
# Manual pipeline run (normally automatic via service)
python3 recon.py extract
python3 recon.py enrich
python3 recon.py embed
# Scan library for new PDFs (normally hourly via service)
python3 recon.py scan
python3 recon.py queue
```
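
When the stages are run by hand, each one works on the output of the previous stage, so it makes sense to stop at the first failure. A small sketch: `run_stages` is a hypothetical helper, not part of RECON, and it assumes `recon.py` exits non-zero when a stage fails.

```bash
# Run extract, enrich, embed in order using the command given as arguments.
# Stops at the first stage that exits non-zero.
run_stages() {
  for stage in extract enrich embed; do
    "$@" "$stage" || { echo "stage '$stage' failed" >&2; return 1; }
  done
}
```

From `/opt/recon` with the venv active, this would be invoked as `run_stages python3 recon.py`.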
## Dependencies
| Service | Host | Port | Purpose |
|---------|------|------|---------|
| Qdrant | cortex | 6333 | Vector DB (recon_knowledge collection) |
| TEI | cortex | 8090 | Text embeddings (bge-m3, 1024-dim) |
| Ollama | cortex | 11434 | Chat model for Aurora RAG |
| NFS | pi-nas | — | /mnt/library (PDF source) |
| Gemini API | Google | — | Enrichment + vision OCR (4 keys in .env) |
| Contabo VPS | 100.64.0.1 | — | Backup destination |
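
The three HTTP dependencies on cortex can be probed directly when the health endpoint reports a problem. A rough sketch: `recon_deps` and `check_deps` are hypothetical helper names, and it assumes each service answers plain HTTP on its listed port, so any HTTP response counts as up.

```bash
# Name, host and port for each HTTP dependency, taken from the table above.
recon_deps() {
  printf '%s\n' "qdrant cortex 6333" "tei cortex 8090" "ollama cortex 11434"
}

# Print "up" or "down" per dependency; curl exits 0 on any HTTP response.
check_deps() {
  recon_deps | while read -r name host port; do
    if curl -s -o /dev/null --max-time 3 "http://$host:$port/"; then
      echo "$name up"
    else
      echo "$name down"
    fi
  done
}
```

Run `check_deps` from recon-vm so the probe follows the same network path the service uses.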
## Backups
- **Destination:** `root@100.64.0.1:/opt/backups/recon/`
- **Full sync (concepts, text, DB, config):** every 6 hours via cron
- **DB snapshot only:** every 2 hours via cron
- **Script:** `/opt/recon/scripts/backup.sh`
### Verify backups
```bash
ssh root@100.64.0.1 'ls -lh /opt/backups/recon/recon_*.db && du -sh /opt/backups/recon/'
```
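
Listing the files shows that backups exist, but not whether they are recent. A sketch of a freshness check, assuming the DB snapshot is written to `recon_latest.db` (the file used in the recovery procedure) and that GNU `stat` is available on the VPS. `is_stale`, `check_backup_age` and the 3-hour threshold are illustrative choices.

```bash
# Succeed (exit 0) when a file modified at $1 is older than $3 seconds at time $2.
# Arguments are integers: mtime in epoch seconds, now in epoch seconds, max age in seconds.
is_stale() {
  [ $(( $2 - $1 )) -gt "$3" ]
}

# Snapshots run every 2 hours, so allow 3 hours before warning (assumed margin).
check_backup_age() {
  mtime=$(ssh root@100.64.0.1 'stat -c %Y /opt/backups/recon/recon_latest.db') || return 2
  if is_stale "$mtime" "$(date +%s)" 10800; then
    echo "STALE: latest DB snapshot is older than 3 hours"
  else
    echo "OK: latest DB snapshot is fresh"
  fi
}
```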
## Troubleshooting
### Pipeline stalled (no progress)
```bash
journalctl -u recon -n 50 # Check errors
curl -s http://100.64.0.24:8420/api/health # Check dependencies
sudo systemctl restart recon # Restart
```
### Gemini rate limits (429 errors)
RECON retries with built-in exponential backoff (5s → 10s → 20s → 40s → 80s, with jitter). A window that still fails is skipped and enrichment continues with the next one, since partial enrichment beats none.
If the 429s are sustained, reduce `enrich_workers` in `config.yaml` and restart the service.
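
The retry schedule doubles from a 5-second base. As an illustration only (this is not RECON's code: the function names and the jitter range of up to 20% are assumptions, since the jitter formula is not documented here):

```bash
# Base delay in seconds for a zero-indexed retry attempt: 5, 10, 20, 40, 80.
backoff_base() {
  echo $(( 5 * (1 << $1) ))
}

# Base delay plus a random whole-second jitter of up to 20% (assumed range).
backoff_delay() {
  base=$(backoff_base "$1")
  max_jitter=$(( base / 5 ))
  jitter=$(awk -v max="$max_jitter" 'BEGIN { srand(); print int(rand() * (max + 1)) }')
  echo $(( base + jitter ))
}
```

Five attempts add up to at least 155 seconds of waiting per window, which is worth keeping in mind when judging whether the pipeline is stalled or merely backing off.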
### Qdrant down
```bash
ssh zvx@cortex
docker ps | grep qdrant
docker restart qdrant
# If data was lost, rebuild the collection from the concept JSONs:
# ssh zvx@100.64.0.24 'cd /opt/recon && source venv/bin/activate && python3 recon.py rebuild'
```
### TEI down
```bash
ssh zvx@cortex
docker ps | grep tei
docker restart tei
```
### NFS mount lost
```bash
ssh zvx@100.64.0.24
mount | grep library
sudo mount -a
sudo systemctl restart recon
```
### Reset stuck documents
```bash
cd /opt/recon && source venv/bin/activate
# Find stuck transitional states
sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents WHERE status IN ('extracting','enriching','embedding') GROUP BY status;"
# Reset them
sqlite3 data/recon.db "UPDATE documents SET status='queued' WHERE status='extracting';"
sqlite3 data/recon.db "UPDATE documents SET status='extracted' WHERE status='enriching';"
sqlite3 data/recon.db "UPDATE documents SET status='enriched' WHERE status='embedding';"
```
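
Each transitional state rolls back to the stable state that precedes it in the pipeline. A sketch that emits the same three updates wrapped in a single transaction, so the reset is applied all-or-nothing. `rollback_state` and `reset_sql` are hypothetical helpers; the table and status names are the ones used above.

```bash
# Stable state a stuck document should return to.
rollback_state() {
  case "$1" in
    extracting) echo "queued" ;;
    enriching)  echo "extracted" ;;
    embedding)  echo "enriched" ;;
    *)          return 1 ;;   # not a transitional state
  esac
}

# Emit one transaction that resets every transitional state.
reset_sql() {
  echo "BEGIN;"
  for s in extracting enriching embedding; do
    echo "UPDATE documents SET status='$(rollback_state "$s")' WHERE status='$s';"
  done
  echo "COMMIT;"
}
```

Review the output of `reset_sql` first, then apply it with `reset_sql | sqlite3 data/recon.db`.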
### Full recovery from Contabo backup
```bash
ssh zvx@100.64.0.24
sudo systemctl stop recon
rsync -av root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
rsync -av root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
# Pick the latest DB backup
rsync -av root@100.64.0.1:/opt/backups/recon/recon_latest.db /opt/recon/data/recon.db
cd /opt/recon && source venv/bin/activate
python3 recon.py rebuild # Rebuilds Qdrant from concept JSONs
sudo systemctl start recon
```
## Key Files
| Path | Purpose |
|------|---------|
| `/opt/recon/config.yaml` | All configuration |
| `/opt/recon/.env` | Gemini API keys (GEMINI_KEY_1 through GEMINI_KEY_4) |
| `/opt/recon/data/recon.db` | SQLite status DB |
| `/opt/recon/data/concepts/` | Gemini extraction results (CRITICAL — costs $ to regenerate) |
| `/opt/recon/data/text/` | Extracted page text (regenerable from PDFs) |
| `/opt/recon/PROJECT-BIBLE.md` | Full system documentation |
| `/opt/recon/scripts/backup.sh` | Backup script |
| `/opt/recon/scripts/validate.py` | Pipeline consistency checker |
| `/opt/recon/scripts/rebuild_qdrant.py` | Nuclear Qdrant rebuild |
## Pipeline Architecture
```
/mnt/library/ (NFS)
     ▼ hourly scan
[Catalogue] → [Queue] → [Extract] → [Enrich] → [Embed] → [Complete]
                        4 workers   16 workers 4 workers
                        PyPDF2      Gemini     TEI+Qdrant
                        pdftotext   2.0 Flash  bge-m3
                        Tesseract              1024-dim
                        Gemini Vision
```
---
*Last updated: 2026-04-19 — Updated for CT 130 → VM 131 migration*