Mirror of https://github.com/zvx-echo6/recon.git, synced 2026-05-20 06:34:40 +02:00
- Python 86.8%
- HTML 6.1%
- JavaScript 5.4%
- CSS 1%
- Shell 0.7%
This merge integrates the complete refactor effort spanning Phases 0-6k,
bringing RECON from its initial baseline into production-grade form.
Pipeline architecture
---------------------
- Phases 0-2: foundation cleanup, removed dead code, standardized logging
- Phase 3: dispatcher rewrite — watches data/acquired/<subfolder>/ for
{hash}.txt + {hash}.meta.json pairs, atomic .tmp rename, idempotent
- Phase 4: content processors for PDF (PyPDF2 -> pdftotext -> Tesseract ->
Gemini Vision fallback chain), transcript, and text formats
- Phase 5: enrichment, embedding, and filing daemons split into
independently restartable threads
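
The Phase 3 hand-off (a `{hash}.txt` + `{hash}.meta.json` pair published via an atomic `.tmp` rename) might look roughly like the sketch below. It is not taken from the RECON code: the function names `publish_pair` and `ready_pairs` are hypothetical, and hashing the text with SHA-256 is an assumption, since the notes above do not say which hash is used.

```python
import hashlib
import json
import os
from pathlib import Path


def publish_pair(folder: Path, text: str, meta: dict) -> str:
    """Write {hash}.txt and {hash}.meta.json so a watcher never sees partial files."""
    folder.mkdir(parents=True, exist_ok=True)
    # Assumption: the hash is the SHA-256 of the text content.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    targets = {
        folder / f"{digest}.txt": text,
        folder / f"{digest}.meta.json": json.dumps(meta, sort_keys=True),
    }
    for final_path, payload in targets.items():
        tmp_path = final_path.with_name(final_path.name + ".tmp")
        tmp_path.write_text(payload, encoding="utf-8")
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, final_path)
    return digest


def ready_pairs(folder: Path) -> list[str]:
    """Return hashes for which both halves of the pair are present."""
    hashes = []
    for txt in sorted(folder.glob("*.txt")):
        if txt.with_name(txt.stem + ".meta.json").exists():
            hashes.append(txt.stem)
    return hashes
```

Because the file names derive from the content hash, publishing the same text twice overwrites the pair with identical bytes, which is one simple way to get the idempotent behaviour mentioned above.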
PeerTube acquisition
--------------------
- Phase 6a-6c: PeerTube channel watcher, caption acquisition with rate
limiting (429 handling), 0.5s rate_limit_delay enforced
- Phase 6d: multi-instance support
- Phase 6e: rewired then reverted dashboard PeerTube endpoint to live
in acquisition module
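
The rate limiting described for Phase 6a-6c (a 0.5s delay plus 429 handling) could be sketched as follows. This is an illustration under assumptions, not the acquisition module's code: `fetch_with_backoff`, the injected `fetch` callable, the retry count, and the exponential backoff schedule are all hypothetical; only the 0.5s delay and the 429 status come from the notes above.

```python
import time
from typing import Callable, Tuple

RATE_LIMIT_DELAY = 0.5  # seconds between requests, as stated in the notes above


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), pausing after each request and backing off on HTTP 429.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        sleep(RATE_LIMIT_DELAY)  # fixed pause after every request
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected status {status} for {url}")
        if attempt < max_retries:
            # Assumed schedule: 1s, 2s, 4s, ... of extra wait before retrying.
            sleep(RATE_LIMIT_DELAY * 2 ** (attempt + 1))
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of network and timing dependencies, so the retry logic can be exercised with fakes.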
Format handling & library cleanup
---------------------------------
- Phase 6f: text processor for .txt ingestion
- Phase 6f-2: format normalizer in dispatcher
- Phase 6g-6j: library reorg — ghost domain cleanup, SCL moved to
dedicated domain folder, pi-nas fully decommissioned as a storage
target (NFS-only now), ~73 GB reclaimed
- Phase 6k: hash-identical dedup — 2,477 duplicate PDFs removed,
22.05 GB freed, catalogue/documents/Qdrant payloads updated
coherently, 226 empty domain subdirs pruned
- 16,340 transcripts remain un-filed pending title-match review
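
A hash-identical dedup pass like the one in Phase 6k can be approximated by grouping files on a content hash. The sketch below only finds duplicates; it does not delete anything or touch the catalogue, documents, or Qdrant payloads. The names `file_sha256` and `find_duplicates` are hypothetical, and SHA-256 is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path, pattern: str = "*.pdf") -> dict[str, list[Path]]:
    """Map each content hash to its redundant copies.

    Paths are visited in sorted order; the first path per hash is treated
    as the copy to keep and is left out of the result.
    """
    by_hash = defaultdict(list)
    for path in sorted(root.rglob(pattern)):
        by_hash[file_sha256(path)].append(path)
    return {h: paths[1:] for h, paths in by_hash.items() if len(paths) > 1}
```

Keeping detection separate from removal makes it possible to review the candidate list, and to update any database rows that reference a duplicate path, before files are deleted.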
Dashboard & metadata
--------------------
- Gemini "null" string bug fixed in pdf_processor metadata voting
- Dashboard upload migrated to pipeline with multi-format support
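
The notes above do not describe the Gemini "null" string bug in detail. One plausible reading is that the literal string "null" was counted as a real vote during metadata voting. Under that assumption, a guard could look like the sketch below; `clean`, `vote`, and the `NULL_STRINGS` set are hypothetical and not taken from pdf_processor.

```python
from collections import Counter
from typing import Optional

# Assumed set of placeholder values that should not count as votes.
NULL_STRINGS = {"", "null", "none", "n/a"}


def clean(value: object) -> Optional[str]:
    """Turn None and placeholder strings such as "null" into a real None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_STRINGS else text


def vote(candidates: list) -> Optional[str]:
    """Return the most common non-null candidate, or None if there is none."""
    votes = Counter(v for v in map(clean, candidates) if v is not None)
    return votes.most_common(1)[0][0] if votes else None
```

For example, `vote(["null", "null", "Field Manual"])` returns `"Field Manual"` instead of letting the two placeholder strings win the vote.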
State at release
----------------
- 7 daemon threads: dispatcher, enrich, embed, filing, peertube-acq,
progress, dashboard
- 29,201 documents in catalogue / documents tables (UNIQUE on hash PK)
- ~2.1M Qdrant vectors in recon_knowledge_hybrid (cortex:6333)
- ~67 GB library on /mnt/library (NFS from pi-nas)
- files.echo6.co serving 9,397 deduped PDFs
- recon.echo6.co dashboard + API on :8420
See cleanup-log.md for the full backlog and resolution history.
| Name |
|---|
| lib |
| scripts |
| static |
| templates |
| .gitignore |
| api.py |
| config.yaml |
| enricher.py |
| migrate_paths.py |
| PROJECT-BIBLE.md |
| README.md |
| recon.py |
| requirements.txt |
| run-pipeline-now.sh |
| sweep_gated.sh |
RECON -- Knowledge Extraction Pipeline
Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.
Quick Start
# Activate
cd /opt/recon && source venv/bin/activate
# Scan library for new PDFs
recon scan
# Queue and process
recon queue
recon extract
recon enrich
recon embed
# Or run full pipeline
recon run
# Ingest a web page
recon ingest-url "https://example.com/article" --category "Category" --process
# Crawl an entire docs site
recon crawl "https://docs.example.com" --include /docs/ --category "Category" --process
# Upload a PDF
recon upload --file /path/to/document.pdf --category "Category"
# Search
recon search "water purification methods"
# Check status
recon status
recon failures
Dashboard
Services
| Service | Location | Purpose |
|---|---|---|
| RECON Dashboard | recon:8420 | Pipeline management + API |
| Qdrant | cortex:6333 | Vector database |
| TEI | cortex:8090 | Embeddings (1,711/sec) |
| Ollama | cortex:11434 | Chat + fallback embeddings |
| OpenWebUI | cortex:8080 (ai.echo6.co) | Aurora chat with RAG |
| File Server | recon:8888 (files.echo6.co) | PDF downloads |
Key Paths
| Path | Contents |
|---|---|
| /opt/recon/ | Application code |
| /opt/recon/data/concepts/ | Gemini extractions (CRITICAL -- back these up) |
| /opt/recon/data/text/ | Extracted text |
| /opt/recon/data/recon.db | SQLite status DB |
| /mnt/library/ | PDF library (NFS from pi-nas) |
Backups
Backups run automatically every 6 hours to a Contabo VPS via /opt/recon/scripts/backup.sh.
Concept JSONs are the most valuable data ($130+ of Gemini API work).
Qdrant is NOT backed up; it is rebuilt from the concept JSONs in ~10 minutes via recon rebuild.
Monitoring
# Pipeline status
recon status
# Tail logs
tail -f /opt/recon/logs/recon.log
# Pipeline run log
tail -f /opt/recon/pipeline.log
# Validate consistency
recon validate --deep
Full Documentation
See PROJECT-BIBLE.md for complete system documentation.