# RECON — Knowledge Extraction Pipeline

## Overview

RECON extracts knowledge from PDFs and web content into a searchable vector database. The scanner picks up PDFs from an NFS-mounted library, text is extracted (with a Gemini Vision fallback for scanned documents), concepts are enriched via Gemini, and embeddings are stored in Qdrant. Aurora (Open WebUI) queries the knowledge base through a RAG filter.

## Location

- **Host:** recon LXC (CT 130 on data node, 192.168.1.240)
- **IP:** 192.168.1.130 / 100.64.0.24 (Tailscale)
- **Install:** `/opt/recon/`
- **User:** zvx
- **Service:** `recon.service` (systemd, Type=simple, Restart=on-failure)
- **Dashboard:** https://recon.echo6.co (internal: http://100.64.0.24:8420)
- **Health:** https://recon.echo6.co/api/health

## Stack

| Component | Technology | Location |
|-----------|-----------|----------|
| Pipeline + CLI | Python 3.12, argparse | /opt/recon/recon.py |
| Dashboard + API | Flask | /opt/recon/lib/api.py (port 8420) |
| Status DB | SQLite (WAL mode) | /opt/recon/data/recon.db |
| Vector DB | Qdrant | cortex:6333 (Docker) |
| Embeddings | TEI (bge-m3, 1024-dim) | cortex:8090 (Docker) |
| Enrichment | Gemini 2.0 Flash | Google API (4 keys) |
| Vision OCR | Gemini 2.0 Flash | Google API (shared keys) |
| Text extraction | PyPDF2, poppler-utils, Tesseract | Local |
| PDF source | NFS | pi-nas:/export/library → /mnt/library |
| File server | nginx | localhost:8888 → files.echo6.co |

## Pipeline Stages

All stages run concurrently as daemon threads in the service:

1. **Scanner** (hourly) — walks /mnt/library, catalogues new PDFs, queues them
2. **Extract** (4 workers) — PyPDF2 → pdftotext → Tesseract → Gemini Vision per page
3. **Enrich** (16 workers, 4 API keys) — Gemini extracts structured concepts from text windows
4. **Embed** (4 workers) — TEI generates vectors, upserted to Qdrant

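The stage layout above can be sketched with the standard library. This is a minimal illustration rather than RECON's actual code: `STAGES`, `run_stage`, and `start_stages` are hypothetical names, and the assumption that every worker polls on the `stage_poll_interval` (30 s) while the scanner uses `scan_interval` (3600 s) is inferred from the configuration values in this document.

```python
import threading

# (worker count, poll interval in seconds) per stage. Counts and intervals
# follow the figures in this document; the dict itself is illustrative.
STAGES = {
    "scanner": (1, 3600),
    "extract": (4, 30),
    "enrich": (16, 30),
    "embed": (4, 30),
}


def run_stage(work, poll_interval, stop):
    """Call work() until stop is set, pausing poll_interval seconds between calls."""
    while not stop.is_set():
        work()
        # Event.wait doubles as an interruptible sleep: it returns early on stop.
        stop.wait(poll_interval)


def start_stages(work_fns, stop):
    """Start one daemon thread per worker and return the list of threads."""
    threads = []
    for name, (workers, poll_interval) in STAGES.items():
        for index in range(workers):
            thread = threading.Thread(
                target=run_stage,
                args=(work_fns[name], poll_interval, stop),
                name=f"{name}-{index}",
                daemon=True,  # daemon threads never block interpreter exit
            )
            thread.start()
            threads.append(thread)
    return threads
```

Calling `stop.set()` wakes every worker immediately, which is why the sketch waits on the event instead of calling `time.sleep`.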
## Extraction Chain

The chain runs per page, in order. Each method runs only if the previous one returned fewer than 50 characters:

1. **PyPDF2** — fast, free, works on text-based PDFs
2. **pdftotext** (poppler) — handles some PDFs PyPDF2 misses
3. **Tesseract OCR** — renders page to image, runs local OCR
4. **Gemini Vision** — renders page to PNG, sends to Gemini 2.0 Flash vision API

Method tracking is saved in `data/text/{hash}/meta.json` as an `ocr_methods` dict.

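The fallback rule above can be sketched as a loop over extraction callables. This is an illustration under stated assumptions: `extract_page`, `MIN_CHARS`, and the method names are hypothetical, the real extractors are replaced by plain functions, and two behaviours (treating an exception like an empty result, and keeping the longest short result) are guesses that the source does not confirm.

```python
from typing import Callable, Optional, Sequence, Tuple

MIN_CHARS = 50  # threshold from the rule above

Extractor = Callable[[int], str]  # page number -> extracted text


def extract_page(
    page: int, chain: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, Optional[str]]:
    """Try each method in order and stop at the first with >= MIN_CHARS of text.

    Returns (text, method_name). If every method falls short, the longest
    result seen is returned so partial text is not thrown away (assumption).
    """
    best_text, best_method = "", None
    for name, method in chain:
        try:
            text = method(page).strip()
        except Exception:
            continue  # assumption: a failing method hands over to the next one
        if len(text) >= MIN_CHARS:
            return text, name
        if len(text) > len(best_text):
            best_text, best_method = text, name
    return best_text, best_method
```

The returned method name is the kind of value a per-page `ocr_methods` entry could hold, although the actual layout of `meta.json` is not documented here.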
## Scale (as of Feb 2026)

- ~10,162 documents in pipeline
- ~95,000+ vectors in Qdrant (HNSW index, <10ms search latency)
- Collection: `recon_knowledge`
- ~13,239 PDFs catalogued from NFS library

## Resilience

- **Enricher**: Exponential backoff (5s→80s) for transient errors (429, 500, 503). Window-level failure isolation — partial enrichment beats zero.
- **Extractor**: Per-page timeout (30s), per-document timeout (1800s). Partial extractions saved.
- **Embedder**: Skip-on-failure per concept, batch processing.
- **Service**: Restart=on-failure, RestartSec=30, MemoryMax=3G.

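The enricher's backoff schedule follows from a 5-second base delay doubled across 5 retries, which gives 5, 10, 20, 40 and 80 seconds. A minimal sketch of that policy, assuming a hypothetical `ApiError` exception that carries the HTTP status (the real enricher's error types and function names are not shown in this document):

```python
import time

TRANSIENT_STATUS = {429, 500, 503}  # status codes listed above


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delays(base_delay=5.0, max_retries=5):
    """Delays double on each retry: 5, 10, 20, 40, 80 seconds by default."""
    return [base_delay * 2 ** attempt for attempt in range(max_retries)]


def call_with_backoff(fn, base_delay=5.0, max_retries=5, sleep=time.sleep):
    """Run fn(), retrying transient failures with exponential backoff."""
    for delay in backoff_delays(base_delay, max_retries):
        try:
            return fn()
        except ApiError as exc:
            if exc.status not in TRANSIENT_STATUS:
                raise  # non-transient errors are not retried
            sleep(delay)
    return fn()  # last attempt: any error now propagates to the caller
```

Wrapping each text window in its own `call_with_backoff` call is one way to get the window-level isolation described above: a window that exhausts its retries fails alone while the others still produce concepts.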
## Configuration

**Config file:** `/opt/recon/config.yaml`

Key sections:

- `processing.extract_workers` (4), `enrich_workers` (16), `embed_workers` (4)
- `processing.extract_timeout` (1800s), `page_timeout` (30s)
- `processing.enrich_max_retries` (5), `enrich_base_delay` (5.0)
- `gemini.model` (gemini-2.0-flash), `gemini.response_mime_type` (application/json)
- `service.scan_interval` (3600), `stage_poll_interval` (30)

**API keys:** `/opt/recon/.env` — `GEMINI_KEY_1` through `GEMINI_KEY_4`

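With 16 enrich workers sharing 4 keys, one plausible arrangement is a round-robin assignment of keys to workers. The sketch below is an assumption about how that could look, not a description of RECON's code: `load_gemini_keys` and `key_for_worker` are hypothetical helpers, and the keys are expected to already be in the process environment (for example loaded from `.env` by the service).

```python
import os


def load_gemini_keys(env=None, count=4):
    """Read GEMINI_KEY_1..GEMINI_KEY_<count> and fail loudly if one is missing."""
    env = os.environ if env is None else env
    keys = []
    for index in range(1, count + 1):
        name = f"GEMINI_KEY_{index}"
        value = env.get(name, "").strip()
        if not value:
            raise KeyError(f"{name} is not set")
        keys.append(value)
    return keys


def key_for_worker(worker_index, keys):
    """Round-robin: worker 0 gets key 0, worker 4 wraps around to key 0 again."""
    return keys[worker_index % len(keys)]
```

Under this scheme each key serves 4 of the 16 workers, which spreads rate-limit pressure evenly across the keys.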
## API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | GET | Dashboard HTML |
| `/api/knowledge-stats` | GET | Full pipeline stats, per-source breakdown |
| `/api/health` | GET | Health check (Qdrant, TEI, NFS, Gemini, pipeline) |
| `/api/search` | GET | Vector search (`?q=query&limit=5`) |
| `/api/upload` | POST | Upload PDF (multipart: file + category) |
| `/api/upload/<hash>/status` | GET | Upload status tracking |
| `/api/upload/categories` | GET | Available upload categories |
| `/api/ingest` | POST | Ingest intel JSON data |
| `/api/peertube/channels` | GET | List all channels from channel-map.json with video counts from PeerTube DB |
| `/api/peertube/channels/stats` | GET | Channel count, total videos, downloader status |
| `/api/peertube/channels/add` | POST | Add channel: resolve YT URL, create PeerTube channel, update JSON |
| `/api/peertube/channels/<name>` | DELETE | Remove channel from JSON and optionally from PeerTube |

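As an example of calling the API, the sketch below queries `/api/search` with the `q` and `limit` parameters from the table, using only the standard library. `build_search_url` and `search` are hypothetical helper names, and the response is returned as parsed JSON without assuming its fields, since the response schema is not documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://recon.echo6.co"  # dashboard URL from the Location section


def build_search_url(query, limit=5, base_url=BASE_URL):
    """Build the /api/search URL with a properly escaped query string."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def search(query, limit=5, base_url=BASE_URL, timeout=10):
    """Run a vector search and return the decoded JSON body as-is."""
    with urlopen(build_search_url(query, limit, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

From inside the tailnet the same helpers should work against the internal address, for example `search("water purification", base_url="http://100.64.0.24:8420")`.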
## Backups

- **Destination:** `root@100.64.0.1:/opt/backups/recon/`
- **Full sync:** every 6 hours (concepts, text, DB, config)
- **DB snapshot:** every 2 hours
- **Recovery:** restore from Contabo → `recon rebuild` (reconstructs Qdrant from concept JSONs)
- **Critical data:** `data/concepts/` — Gemini extraction work, costs money to regenerate

## Key Files

```
/opt/recon/
├── recon.py              # CLI entry point + service command
├── config.yaml           # Configuration
├── .env                  # Gemini API keys
├── PROJECT-BIBLE.md      # Full documentation
├── lib/
│   ├── api.py            # Flask dashboard + API
│   ├── extractor.py      # PDF → text (4-method chain)
│   ├── enricher.py       # Text → concepts (Gemini)
│   ├── embedder.py       # Concepts → vectors (TEI/Qdrant)
│   ├── status.py         # SQLite DB (WAL, thread-safe)
│   └── utils.py          # Config, hashing, logging
├── scripts/
│   ├── backup.sh         # Backup to Contabo
│   ├── validate.py       # Pipeline consistency checker
│   └── rebuild_qdrant.py # Nuclear Qdrant rebuild
└── data/
    ├── recon.db          # SQLite status DB
    ├── concepts/{hash}/  # Enriched concept JSONs
    └── text/{hash}/      # Extracted page text
```

---

*Last updated: 2026-02-16 — Initial creation*