matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 06:34:40 +02:00


Python 86.8%
HTML 6.1%
JavaScript 5.4%
CSS 1%
Shell 0.7%


Matt 8d54ff165d Merge refactor branch: RECON v1.0.0	2026-04-16 18:20:25 +00:00

This merge integrates the complete refactor effort spanning Phases 0-6k, bringing RECON from its initial baseline into production-grade form.

Pipeline architecture
---------------------
- Phases 0-2: foundation cleanup, removed dead code, standardized logging
- Phase 3: dispatcher rewrite; watches data/acquired/<subfolder>/ for {hash}.txt + {hash}.meta.json pairs, atomic .tmp rename, idempotent
- Phase 4: content processors for PDF (PyPDF2 -> pdftotext -> Tesseract -> Gemini Vision fallback chain), transcript, and text formats
- Phase 5: enrichment, embedding, and filing daemons split into independently restartable threads

PeerTube acquisition
--------------------
- Phase 6a-6c: PeerTube channel watcher, caption acquisition with rate limiting (429 handling), 0.5s rate_limit_delay enforced
- Phase 6d: multi-instance support
- Phase 6e: rewired then reverted dashboard PeerTube endpoint to live in acquisition module

Format handling & library cleanup
---------------------------------
- Phase 6f: text processor for .txt ingestion
- Phase 6f-2: format normalizer in dispatcher
- Phase 6g-6j: library reorg; ghost domain cleanup, SCL moved to dedicated domain folder, pi-nas fully decommissioned as a storage target (NFS-only now), ~73 GB reclaimed
- Phase 6k: hash-identical dedup; 2,477 duplicate PDFs removed, 22.05 GB freed, catalogue/documents/Qdrant payloads updated coherently, 226 empty domain subdirs pruned
- 16,340 transcripts remain un-filed pending title-match review

Dashboard & metadata
--------------------
- Gemini "null" string bug fixed in pdf_processor metadata voting
- Dashboard upload migrated to pipeline with multi-format support

State at release
----------------
- 7 daemon threads: dispatcher, enrich, embed, filing, peertube-acq, progress, dashboard
- 29,201 documents in catalogue / documents tables (UNIQUE on hash PK)
- ~2.1M Qdrant vectors in recon_knowledge_hybrid (cortex:6333)
- ~67 GB library on /mnt/library (NFS from pi-nas)
- files.echo6.co serving 9,397 deduped PDFs
- recon.echo6.co dashboard + API on :8420

See cleanup-log.md for the full backlog and resolution history.
lib	Migrate dashboard upload to pipeline with multi-format support	2026-04-16 02:18:45 +00:00
scripts	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
static	Phase 6b: fix dashboard Untitled/WEB bug for transcripts	2026-04-14 23:05:29 +00:00
templates	Migrate dashboard upload to pipeline with multi-format support	2026-04-16 02:18:45 +00:00
.gitignore	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
api.py	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
config.yaml	Phase 6f: text processor for .txt file ingestion	2026-04-15 22:39:31 +00:00
enricher.py	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
migrate_paths.py	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
PROJECT-BIBLE.md	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
README.md	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
recon.py	Phase 6d: PeerTube acquisition module + service thread	2026-04-15 03:08:51 +00:00
requirements.txt	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
run-pipeline-now.sh	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
sweep_gated.sh	Initial commit: RECON codebase baseline	2026-04-14 14:57:23 +00:00
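
The release commit message above describes the Phase 3 dispatcher as watching data/acquired/<subfolder>/ for {hash}.txt + {hash}.meta.json pairs, with atomic .tmp renames and idempotent handling. The following is only a minimal sketch of that pattern, not the code in this repository: the function names (write_atomic, find_ready_pairs, dispatch) and the in-memory `seen` set are hypothetical, and the real dispatcher may track processed hashes differently (for example in the SQLite DB).

```python
import json
import os
from pathlib import Path


def write_atomic(path: Path, data: str) -> None:
    """Write to a sibling .tmp file, then rename it into place.

    os.replace is atomic when source and target are on the same filesystem,
    so a watcher never observes a half-written file under the final name.
    """
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(data, encoding="utf-8")
    os.replace(tmp, path)


def find_ready_pairs(acquired_dir: Path) -> list[tuple[Path, Path]]:
    """Return (text, meta) pairs where both {hash}.txt and {hash}.meta.json exist."""
    pairs = []
    # "*/*.txt" skips in-progress "*.txt.tmp" files because the suffix differs.
    for txt in sorted(acquired_dir.glob("*/*.txt")):
        meta = txt.with_name(txt.stem + ".meta.json")
        if meta.exists():
            pairs.append((txt, meta))
    return pairs


def dispatch(acquired_dir: Path, seen: set[str]) -> list[str]:
    """Handle each complete pair once; repeated calls with the same `seen` are no-ops."""
    handled = []
    for txt, meta in find_ready_pairs(acquired_dir):
        doc_hash = txt.stem
        if doc_hash in seen:
            continue  # idempotent: already processed
        json.loads(meta.read_text(encoding="utf-8"))  # metadata must parse as JSON
        seen.add(doc_hash)
        handled.append(doc_hash)
    return handled
```

Writing under a .tmp name and renaming afterwards means the watcher can rely on file existence alone as the "ready" signal, without locks or size polling.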

README.md

RECON -- Knowledge Extraction Pipeline

Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.

Quick Start

# Activate
cd /opt/recon && source venv/bin/activate

# Scan library for new PDFs
recon scan

# Queue and process
recon queue
recon extract
recon enrich
recon embed

# Or run full pipeline
recon run

# Ingest a web page
recon ingest-url "https://example.com/article" --category "Category" --process

# Crawl an entire docs site
recon crawl "https://docs.example.com" --include /docs/ --category "Category" --process

# Upload a PDF
recon upload --file /path/to/document.pdf --category "Category"

# Search
recon search "water purification methods"

# Check status
recon status
recon failures
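
The staged commands above can also be driven from a script. The sketch below runs them in order and stops at the first failure. It assumes the recon CLI is on PATH and signals failure with a non-zero exit code, which the README does not state; run_cli and run_stages are hypothetical helper names, and this is not necessarily what recon run does internally.

```python
import subprocess
from typing import Callable, Sequence

# Subcommand names taken from the Quick Start.
STAGES = ("scan", "queue", "extract", "enrich", "embed")


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(args), check=False).returncode


def run_stages(
    stages: Sequence[str] = STAGES,
    runner: Callable[[Sequence[str]], int] = run_cli,
) -> list[tuple[str, int]]:
    """Run `recon <stage>` for each stage, stopping after the first non-zero exit."""
    results = []
    for stage in stages:
        code = runner(["recon", stage])
        results.append((stage, code))
        if code != 0:  # assumption: non-zero means the stage failed
            break
    return results
```

Calling run_stages() inside the activated venv returns a list of (stage, exit code) tuples; a shorter list than STAGES indicates where the run stopped, at which point recon failures is the natural next command.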

Dashboard

http://100.64.0.24:8420

Services

Service	Location	Purpose
RECON Dashboard	recon:8420	Pipeline management + API
Qdrant	cortex:6333	Vector database
TEI	cortex:8090	Embeddings (1,711/sec)
Ollama	cortex:11434	Chat + fallback embeddings
OpenWebUI	cortex:8080 (ai.echo6.co)	Aurora chat with RAG
File Server	recon:8888 (files.echo6.co)	PDF downloads
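
A quick way to confirm that the services in the table are listening is a plain TCP connect to each host:port. This is a rough sketch, assuming the short hostnames (recon, cortex) resolve from the machine running it; parse_location and is_reachable are hypothetical helpers, and a successful connect only shows that the port is open, not that the service is healthy.

```python
import socket

# Locations copied from the Services table.
SERVICES = {
    "RECON Dashboard": "recon:8420",
    "Qdrant": "cortex:6333",
    "TEI": "cortex:8090",
    "Ollama": "cortex:11434",
    "OpenWebUI": "cortex:8080",
    "File Server": "recon:8888",
}


def parse_location(location: str) -> tuple[str, int]:
    """Split 'host:port' into (host, port)."""
    host, _, port = location.rpartition(":")
    return host, int(port)


def is_reachable(location: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    host, port = parse_location(location)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_all(services: dict[str, str] = SERVICES) -> dict[str, bool]:
    """Map each service name to its reachability."""
    return {name: is_reachable(loc) for name, loc in services.items()}
```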

Key Paths

Path	Contents
/opt/recon/	Application code
/opt/recon/data/concepts/	Gemini extractions (CRITICAL -- back these up)
/opt/recon/data/text/	Extracted text
/opt/recon/data/recon.db	SQLite status DB
/mnt/library/	PDF library (NFS from pi-nas)

Backups

Backups run automatically every 6 hours to a Contabo VPS via /opt/recon/scripts/backup.sh. The concept JSONs are the most valuable data ($130+ of Gemini API work). Qdrant is NOT backed up; it is rebuilt from the concept JSONs in ~10 minutes via recon rebuild.

Monitoring

# Pipeline status
recon status

# Tail logs
tail -f /opt/recon/logs/recon.log

# Pipeline run log
tail -f /opt/recon/pipeline.log

# Validate consistency
recon validate --deep

Full Documentation

See PROJECT-BIBLE.md for complete system documentation.