echo6-docs/docs/software/recon.md
Matt Johnson e9231ac24a Migration: consolidate Echo6 docs to cortex with full infrastructure cleanup sync
2026-04-13 06:02:16 +00:00


RECON — Knowledge Extraction Pipeline

Overview

RECON extracts knowledge from PDFs and web content into a searchable vector database. It scans PDFs from an NFS-mounted library, extracts their text (falling back to Gemini Vision for scanned documents), enriches the text into structured concepts via Gemini, and stores the resulting embeddings in Qdrant. Aurora (Open WebUI) queries the knowledge base through a RAG filter.

Location

Stack

| Component | Technology | Location |
|---|---|---|
| Pipeline + CLI | Python 3.12, argparse | /opt/recon/recon.py |
| Dashboard + API | Flask | /opt/recon/lib/api.py (port 8420) |
| Status DB | SQLite (WAL mode) | /opt/recon/data/recon.db |
| Vector DB | Qdrant | cortex:6333 (Docker) |
| Embeddings | TEI (bge-m3, 1024-dim) | cortex:8090 (Docker) |
| Enrichment | Gemini 2.0 Flash | Google API (4 keys) |
| Vision OCR | Gemini 2.0 Flash | Google API (shared keys) |
| Text extraction | PyPDF2, poppler-utils, Tesseract | Local |
| PDF source | NFS | pi-nas:/export/library → /mnt/library |
| File server | nginx | localhost:8888 → files.echo6.co |
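
The status DB is listed as SQLite in WAL mode. A minimal sketch of opening such a database with Python's standard `sqlite3` module is shown here; the function name `open_status_db` and the 30-second busy timeout are assumptions for illustration, not taken from RECON's `status.py`.

```python
import sqlite3


def open_status_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database file and switch it to write-ahead logging."""
    # timeout: how long a connection waits on a locked database (assumed value).
    conn = sqlite3.connect(path, timeout=30.0)
    # In WAL mode readers are not blocked while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, journal_mode is {mode!r}")
    return conn
```

WAL mode is stored in the database file, so it only needs to be set once, but re-issuing the pragma on every open is harmless.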

Pipeline Stages

All stages run concurrently as daemon threads in the service:

  1. Scanner (hourly) — walks /mnt/library, catalogues new PDFs, queues them
  2. Extract (4 workers) — PyPDF2 → pdftotext → Tesseract → Gemini Vision per page
  3. Enrich (16 workers, 4 API keys) — Gemini extracts structured concepts from text windows
  4. Embed (4 workers) — TEI generates vectors, upserted to Qdrant
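
The stage layout above can be sketched as pools of daemon threads connected by queues. This is only an illustration under stated assumptions: `start_stage` and the lambda handlers are hypothetical, and RECON itself may hand work between stages by polling the status DB rather than through in-memory queues.

```python
import queue
import threading


def start_stage(name, workers, handler, inbox, outbox=None):
    """Start `workers` daemon threads that move items from inbox to outbox."""

    def loop():
        while True:
            item = inbox.get()
            try:
                result = handler(item)
            except Exception as err:
                # One bad item must not kill the worker thread.
                print(f"[{name}] failed on {item!r}: {err}")
            else:
                if outbox is not None:
                    outbox.put(result)
            finally:
                inbox.task_done()

    threads = []
    for i in range(workers):
        t = threading.Thread(target=loop, name=f"{name}-{i}", daemon=True)
        t.start()
        threads.append(t)
    return threads


if __name__ == "__main__":
    to_extract, to_enrich, to_embed, done = (queue.Queue() for _ in range(4))
    # Worker counts mirror the list above; the handlers are placeholders.
    start_stage("extract", 4, lambda doc: doc + ":text", to_extract, to_enrich)
    start_stage("enrich", 16, lambda doc: doc + ":concepts", to_enrich, to_embed)
    start_stage("embed", 4, lambda doc: doc + ":vectors", to_embed, done)
    to_extract.put("example.pdf")
    print(done.get(timeout=5))
```

Because the threads are daemons, they do not keep the process alive on shutdown; the service process exits without joining them.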

Extraction Chain

Methods are tried per page, in order. Each method runs only if the previous one returned fewer than 50 characters:

  1. PyPDF2 — fast, free, works on text-based PDFs
  2. pdftotext (poppler) — handles some PDFs PyPDF2 misses
  3. Tesseract OCR — renders page to image, runs local OCR
  4. Gemini Vision — renders page to PNG, sends to Gemini 2.0 Flash vision API

The methods used are recorded in data/text/{hash}/meta.json as the ocr_methods dict.
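
The fallback rule can be sketched as a loop over extraction methods that stops at the first one returning at least 50 characters. The sketch assumes each method is a callable taking a page number; `extract_page`, `extract_document`, the whitespace stripping before counting, keeping the longest text when no method reaches the threshold, and the page-to-method shape of `ocr_methods` are all illustrative assumptions rather than RECON's actual code.

```python
from typing import Callable, Dict, Sequence, Tuple

MIN_CHARS = 50  # threshold from the text above

Extractor = Tuple[str, Callable[[int], str]]


def extract_page(page_no: int, chain: Sequence[Extractor]) -> Tuple[str, str]:
    """Return (text, method_name) from the first method that yields enough text."""
    best_text, best_method = "", "none"
    for name, method in chain:
        try:
            text = method(page_no) or ""
        except Exception:
            text = ""  # a failing method falls through to the next one
        if len(text.strip()) > len(best_text.strip()):
            best_text, best_method = text, name
        if len(text.strip()) >= MIN_CHARS:
            break
    return best_text, best_method


def extract_document(num_pages: int, chain: Sequence[Extractor]):
    """Extract every page and build a meta dict recording the method per page."""
    pages: Dict[int, str] = {}
    ocr_methods: Dict[str, str] = {}
    for page_no in range(1, num_pages + 1):
        text, method = extract_page(page_no, chain)
        pages[page_no] = text
        ocr_methods[str(page_no)] = method  # assumed shape: page -> method
    return pages, {"ocr_methods": ocr_methods}
```

Ordering the chain from cheapest to most expensive means the paid Gemini Vision call is reached only for pages where every local method failed.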

Scale (as of Feb 2026)

  • ~10,162 documents in pipeline
  • 95,000+ vectors in Qdrant (HNSW index, <10ms search latency)
  • Collection: recon_knowledge
  • ~13,239 PDFs catalogued from NFS library

Resilience

  • Enricher: Exponential backoff (5s→80s) for transient errors (429, 500, 503). Window-level failure isolation — partial enrichment beats zero.
  • Extractor: Per-page timeout (30s), per-document timeout (1800s). Partial extractions saved.
  • Embedder: Skip-on-failure per concept, batch processing.
  • Service: Restart=on-failure, RestartSec=30, MemoryMax=3G.
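
The enricher's backoff schedule (5s doubling to 80s over five retries) can be sketched as follows. `ApiError`, `backoff_delays`, and `call_with_backoff` are hypothetical names introduced for illustration; only the retryable status codes, the base delay, and the retry count come from this document.

```python
import time

RETRYABLE = {429, 500, 503}  # transient statuses named above


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delays(base: float = 5.0, retries: int = 5):
    """Delays double each time: 5, 10, 20, 40, 80 seconds with the defaults."""
    return [base * 2 ** i for i in range(retries)]


def call_with_backoff(fn, base: float = 5.0, retries: int = 5, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    delays = backoff_delays(base, retries)
    for attempt in range(retries + 1):
        try:
            return fn()
        except ApiError as err:
            # Give up on non-transient errors or once retries are exhausted.
            if err.status not in RETRYABLE or attempt == retries:
                raise
            sleep(delays[attempt])
```

Wrapping each text window's request separately, rather than the whole document, is what gives window-level failure isolation: one window exhausting its retries does not discard the concepts already extracted from the others.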

Configuration

Config file: /opt/recon/config.yaml

Key settings (values in parentheses):

  • processing.extract_workers (4), enrich_workers (16), embed_workers (4)
  • processing.extract_timeout (1800s), page_timeout (30s)
  • processing.enrich_max_retries (5), enrich_base_delay (5.0)
  • gemini.model (gemini-2.0-flash), gemini.response_mime_type (application/json)
  • service.scan_interval (3600), stage_poll_interval (30)

API keys: /opt/recon/.env — GEMINI_KEY_1 through GEMINI_KEY_4
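
With 16 enrich workers sharing 4 API keys, the keys need to be handed out in a thread-safe way. The document does not say how RECON assigns keys to workers; the round-robin `KeyRing` below is an assumed strategy, and the sketch also assumes the `.env` values are already present in the process environment.

```python
import itertools
import os
import threading


def load_gemini_keys(env=os.environ, count: int = 4):
    """Collect GEMINI_KEY_1..GEMINI_KEY_<count>, skipping unset entries."""
    keys = [env.get(f"GEMINI_KEY_{i}") for i in range(1, count + 1)]
    return [k for k in keys if k]


class KeyRing:
    """Thread-safe round-robin over a fixed list of API keys (assumed strategy)."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("no API keys configured")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # The lock keeps concurrent workers from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)
```

Each worker would call `next_key()` before a request, which spreads load evenly so that no single key absorbs all the rate limiting.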

API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| / | GET | Dashboard HTML |
| /api/knowledge-stats | GET | Full pipeline stats, per-source breakdown |
| /api/health | GET | Health check (Qdrant, TEI, NFS, Gemini, pipeline) |
| /api/search | GET | Vector search (?q=query&limit=5) |
| /api/upload | POST | Upload PDF (multipart: file + category) |
| /api/upload/<hash>/status | GET | Upload status tracking |
| /api/upload/categories | GET | Available upload categories |
| /api/ingest | POST | Ingest intel JSON data |
| /api/peertube/channels | GET | List all channels from channel-map.json with video counts from PeerTube DB |
| /api/peertube/channels/stats | GET | Channel count, total videos, downloader status |
| /api/peertube/channels/add | POST | Add channel: resolve YT URL, create PeerTube channel, update JSON |
| /api/peertube/channels/<name> | DELETE | Remove channel from JSON and optionally from PeerTube |
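
As a usage sketch, the search endpoint can be called with Python's standard library alone. The base URL `http://localhost:8420` is an assumption (the port comes from the Stack section, the host does not), and the response being JSON is assumed as well; `search_url` and `search` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8420"  # assumed host; port 8420 is from the Stack section


def search_url(query: str, limit: int = 5, base_url: str = BASE_URL) -> str:
    """Build the /api/search URL with properly encoded query parameters."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def search(query: str, limit: int = 5, base_url: str = BASE_URL, timeout: float = 10.0):
    """Run a vector search and decode the response (assumed to be JSON)."""
    with urlopen(search_url(query, limit, base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(search_url("water purification", limit=3))
```

Encoding the query with `urlencode` avoids broken requests when the search text contains spaces, ampersands, or non-ASCII characters.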

Backups

  • Destination: root@100.64.0.1:/opt/backups/recon/
  • Full sync: every 6 hours (concepts, text, DB, config)
  • DB snapshot: every 2 hours
  • Recovery: restore from Contabo, then recon rebuild (which reconstructs Qdrant from the concept JSONs)
  • Critical data: data/concepts/ holds the Gemini extraction output, which costs money to regenerate

Key Files

/opt/recon/
├── recon.py              # CLI entry point + service command
├── config.yaml           # Configuration
├── .env                  # Gemini API keys
├── PROJECT-BIBLE.md      # Full documentation
├── lib/
│   ├── api.py            # Flask dashboard + API
│   ├── extractor.py      # PDF → text (4-method chain)
│   ├── enricher.py       # Text → concepts (Gemini)
│   ├── embedder.py       # Concepts → vectors (TEI/Qdrant)
│   ├── status.py         # SQLite DB (WAL, thread-safe)
│   └── utils.py          # Config, hashing, logging
├── scripts/
│   ├── backup.sh         # Backup to Contabo
│   ├── validate.py       # Pipeline consistency checker
│   └── rebuild_qdrant.py # Nuclear Qdrant rebuild
└── data/
    ├── recon.db           # SQLite status DB
    ├── concepts/{hash}/   # Enriched concept JSONs
    └── text/{hash}/       # Extracted page text

Last updated: 2026-02-16 — Initial creation