Matt Johnson e9231ac24a Migration: consolidate Echo6 docs to cortex with full infrastructure cleanup sync

- Documents recent infrastructure cleanup (8 CTs destroyed, 35 DNS records removed, Headscale cleanup)
- Adds 24 new runbooks covering Authentik, PeerTube, Meshtastic, RECON, Proxmox, Mailcow, Internet Archive, GPU routing
- Adds project documentation for headscale, vaultwarden, peertube, matrix, mmud, advbbs, arr stack
- Updates services.md, environment.md, caddy.md, authentik.md to match live infrastructure
- Removes 4 deprecated runbook duplicates (canonical versions live in projects/)
- Adds .gitignore for binary archives and editor temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-13 06:02:16 +00:00


RECON — Knowledge Extraction Pipeline

Overview

RECON extracts knowledge from PDFs and web content into a searchable vector database. PDFs are scanned from an NFS-mounted library, text is extracted (with a Gemini Vision fallback for scanned documents), concepts are enriched via Gemini, and embeddings are stored in Qdrant. Aurora (Open WebUI) queries the knowledge base through a RAG filter.

Location

Host: recon LXC (CT 130 on the data node, 192.168.1.240)
IP: 192.168.1.130 (LAN) / 100.64.0.24 (Tailscale)
Install: /opt/recon/
User: zvx
Service: recon.service (systemd, Type=simple, Restart=on-failure)
Dashboard: https://recon.echo6.co (internal: http://100.64.0.24:8420)
Health: https://recon.echo6.co/api/health

Stack

Component	Technology	Location
Pipeline + CLI	Python 3.12, argparse	/opt/recon/recon.py
Dashboard + API	Flask	/opt/recon/lib/api.py (port 8420)
Status DB	SQLite (WAL mode)	/opt/recon/data/recon.db
Vector DB	Qdrant	cortex:6333 (Docker)
Embeddings	TEI (bge-m3, 1024-dim)	cortex:8090 (Docker)
Enrichment	Gemini 2.0 Flash	Google API (4 keys)
Vision OCR	Gemini 2.0 Flash	Google API (shared keys)
Text extraction	PyPDF2, poppler-utils, Tesseract	Local
PDF source	NFS	pi-nas:/export/library → /mnt/library
File server	nginx	localhost:8888 → files.echo6.co

Pipeline Stages

All stages run concurrently as daemon threads in the service:

Scanner (hourly) — walks /mnt/library, catalogues new PDFs, queues them
Extract (4 workers) — PyPDF2 → pdftotext → Tesseract → Gemini Vision per page
Enrich (16 workers, 4 API keys) — Gemini extracts structured concepts from text windows
Embed (4 workers) — TEI generates vectors, upserted to Qdrant
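The stage layout above can be sketched as a set of polling loops, each running in its own daemon thread. This is a minimal illustration, not the code in recon.py: `run_stage`, `start_stages`, and the `(name, work, poll_interval, workers)` tuple shape are hypothetical names chosen for the example.

```python
import threading


def run_stage(name, work, poll_interval, stop_event):
    """Run work() in a loop until stop_event is set, pausing between passes."""
    while not stop_event.is_set():
        try:
            work()
        except Exception as exc:
            # Assumption: a failing pass is logged and the stage keeps running.
            print(f"[{name}] pass failed: {exc}")
        # wait() returns early once stop_event is set, so shutdown is prompt.
        stop_event.wait(poll_interval)


def start_stages(stages, stop_event):
    """Start `workers` daemon threads for each (name, work, poll_interval, workers)."""
    threads = []
    for name, work, poll_interval, workers in stages:
        for i in range(workers):
            thread = threading.Thread(
                target=run_stage,
                args=(name, work, poll_interval, stop_event),
                name=f"{name}-{i}",
                daemon=True,  # daemon threads do not block process exit
            )
            thread.start()
            threads.append(thread)
    return threads
```

With the figures in this document, the scanner would be one thread with a 3600 second interval, and the extract, enrich, and embed stages would use 4, 16, and 4 threads polling every 30 seconds.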

Extraction Chain

Per page, in order. Each method runs only if the previous one returned fewer than 50 characters:

PyPDF2 — fast, free, works on text-based PDFs
pdftotext (poppler) — handles some PDFs PyPDF2 misses
Tesseract OCR — renders page to image, runs local OCR
Gemini Vision — renders page to PNG, sends to Gemini 2.0 Flash vision API

Method tracking is saved in data/text/{hash}/meta.json as the ocr_methods dict.
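The fallback chain can be sketched as a loop over extraction callables that stops at the first result reaching the 50-character threshold. This is an illustration under assumptions: `extract_page` and the `(name, fn)` pairs are hypothetical, whitespace is stripped before counting, and when no method reaches the threshold the longest result is kept. The real extractor.py may differ on each of these points.

```python
MIN_CHARS = 50  # threshold stated in the text above


def extract_page(page_no, methods):
    """Try each (name, fn) in order and return (method_name, text).

    Stops at the first method whose text has at least MIN_CHARS characters.
    If none does, returns the longest text seen (an assumption).
    """
    best_name, best_text = None, ""
    for name, fn in methods:
        try:
            text = (fn(page_no) or "").strip()
        except Exception:
            # Assumption: a failing method counts as an empty result.
            text = ""
        if len(text) > len(best_text):
            best_name, best_text = name, text
        if len(text) >= MIN_CHARS:
            break
    return best_name, best_text


def extract_document(page_numbers, methods):
    """Return (texts, ocr_methods), both keyed by page number."""
    texts, ocr_methods = {}, {}
    for page_no in page_numbers:
        method, text = extract_page(page_no, methods)
        texts[page_no] = text
        ocr_methods[page_no] = method
    return texts, ocr_methods
```

Here `methods` would list the four extractors in the order given above. The page-number keying of `ocr_methods` is a guess at the shape stored in meta.json.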

Scale (as of Feb 2026)

~10,162 documents in pipeline
95,000+ vectors in Qdrant (HNSW index, search latency under 10 ms)
Collection: recon_knowledge
~13,239 PDFs catalogued from NFS library

Resilience

Enricher: Exponential backoff (5s→80s) for transient errors (429, 500, 503). Window-level failure isolation — partial enrichment beats zero.
Extractor: Per-page timeout (30s), per-document timeout (1800s). Partial extractions saved.
Embedder: Skip-on-failure per concept, batch processing.
Service: Restart=on-failure, RestartSec=30, MemoryMax=3G.
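The enricher's retry behaviour can be illustrated with a doubling delay schedule. A base delay of 5 seconds doubled over 5 retries gives 5, 10, 20, 40, and 80 seconds, which matches the 5s→80s range above. The sketch below is hypothetical: `TransientError`, `backoff_delays`, and `call_with_backoff` are names invented for the example, and the real enricher.py may add jitter or handle errors differently.

```python
import time

RETRYABLE_STATUS = {429, 500, 503}  # transient errors listed above


class TransientError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delays(base_delay=5.0, max_retries=5):
    """Return the delay before each retry: base_delay doubled each time."""
    return [base_delay * 2 ** attempt for attempt in range(max_retries)]


def call_with_backoff(fn, base_delay=5.0, max_retries=5, sleep=time.sleep):
    """Call fn(), retrying on retryable statuses with exponential backoff."""
    for delay in backoff_delays(base_delay, max_retries):
        try:
            return fn()
        except TransientError as exc:
            if exc.status not in RETRYABLE_STATUS:
                raise  # non-transient errors are not retried
            sleep(delay)
    # Final attempt after the last delay; any error propagates to the caller.
    return fn()
```

Wrapping each text window in its own `call_with_backoff` call, and catching the final exception per window, is one way to get the window-level isolation described above.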

Configuration

Config file: /opt/recon/config.yaml

Key sections:

processing.extract_workers (4), enrich_workers (16), embed_workers (4)
processing.extract_timeout (1800s), page_timeout (30s)
processing.enrich_max_retries (5), enrich_base_delay (5.0)
gemini.model (gemini-2.0-flash), gemini.response_mime_type (application/json)
service.scan_interval (3600), stage_poll_interval (30)

API keys: /opt/recon/.env — GEMINI_KEY_1 through GEMINI_KEY_4
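As an illustration of how 4 keys might be shared across 16 enrich workers, the sketch below reads GEMINI_KEY_1 through GEMINI_KEY_4 from an environment mapping and assigns them round-robin. Both the round-robin policy and the function names are assumptions, and the sketch presumes the values in .env have already been exported into the process environment. The source does not say how RECON distributes keys.

```python
import os


def load_gemini_keys(env=None, count=4):
    """Read GEMINI_KEY_1..GEMINI_KEY_<count> from env (default: os.environ)."""
    env = os.environ if env is None else env
    keys = [env.get(f"GEMINI_KEY_{i}") for i in range(1, count + 1)]
    missing = [f"GEMINI_KEY_{i}" for i, key in enumerate(keys, start=1) if not key]
    if missing:
        raise RuntimeError(f"missing API keys: {', '.join(missing)}")
    return keys


def assign_keys(keys, workers=16):
    """Assign keys to workers round-robin (hypothetical policy)."""
    return [keys[i % len(keys)] for i in range(workers)]
```

With 16 workers and 4 keys, each key would serve 4 workers under this policy.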

API Endpoints

Endpoint	Method	Purpose
`/`	GET	Dashboard HTML
`/api/knowledge-stats`	GET	Full pipeline stats, per-source breakdown
`/api/health`	GET	Health check (Qdrant, TEI, NFS, Gemini, pipeline)
`/api/search`	GET	Vector search (`?q=query&limit=5`)
`/api/upload`	POST	Upload PDF (multipart: file + category)
`/api/upload/<hash>/status`	GET	Upload status tracking
`/api/upload/categories`	GET	Available upload categories
`/api/ingest`	POST	Ingest intel JSON data
`/api/peertube/channels`	GET	List all channels from channel-map.json with video counts from PeerTube DB
`/api/peertube/channels/stats`	GET	Channel count, total videos, downloader status
`/api/peertube/channels/add`	POST	Add channel: resolve YT URL, create PeerTube channel, update JSON
`/api/peertube/channels/<name>`	DELETE	Remove channel from JSON and optionally from PeerTube
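A minimal client sketch for the search endpoint, using only the Python standard library. The `q` and `limit` parameters come from the table above. The function names are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this document.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://recon.echo6.co"  # dashboard URL from the Location section


def search_url(query, limit=5, base_url=BASE_URL):
    """Build the /api/search URL with URL-encoded q and limit parameters."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{base_url}/api/search?{params}"


def search(query, limit=5, base_url=BASE_URL, timeout=10):
    """GET /api/search and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(search_url(query, limit, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

From inside the tailnet, passing `base_url="http://100.64.0.24:8420"` would target the internal address instead.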

Backups

Destination: root@100.64.0.1:/opt/backups/recon/
Full sync: every 6 hours (concepts, text, DB, config)
DB snapshot: every 2 hours
Recovery: restore from Contabo → recon rebuild (reconstructs Qdrant from concept JSONs)
Critical data: data/concepts/ — Gemini extraction work, costs money to regenerate

Key Files

/opt/recon/
├── recon.py              # CLI entry point + service command
├── config.yaml           # Configuration
├── .env                  # Gemini API keys
├── PROJECT-BIBLE.md      # Full documentation
├── lib/
│   ├── api.py            # Flask dashboard + API
│   ├── extractor.py      # PDF → text (4-method chain)
│   ├── enricher.py       # Text → concepts (Gemini)
│   ├── embedder.py       # Concepts → vectors (TEI/Qdrant)
│   ├── status.py         # SQLite DB (WAL, thread-safe)
│   └── utils.py          # Config, hashing, logging
├── scripts/
│   ├── backup.sh         # Backup to Contabo
│   ├── validate.py       # Pipeline consistency checker
│   └── rebuild_qdrant.py # Nuclear Qdrant rebuild
└── data/
    ├── recon.db           # SQLite status DB
    ├── concepts/{hash}/   # Enriched concept JSONs
    └── text/{hash}/       # Extracted page text

Last updated: 2026-02-16 — Initial creation
