Initial commit: RECON codebase baseline

Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete). Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-20 06:34:40 +02:00 · 2026-04-14 14:57:23 +00:00 · 2026-04-14 14:57:23 +00:00 · 563c16bb71
commit 563c16bb71
59 changed files with 18327 additions and 0 deletions
--- a/PROJECT-BIBLE.md
+++ b/PROJECT-BIBLE.md
@ -0,0 +1,785 @@
+# RECON Project Bible v2.0
+
+*Last updated: 2026-02-16*
+
+---
+
+## 1. Mission Statement
+
+RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.
+
+**The core loop:** Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.
+
+---
+
+## 2. Infrastructure
+
+### Hosts
+
+| Host | IP (Tailscale) | Role |
+|------|---------------|------|
+| recon LXC | 100.64.0.24 (CT 130 on toc) | RECON application, dashboard, pipeline |
+| cortex VM | 100.64.0.14 (VM 150 on toc) | Qdrant, TEI, Ollama, OpenWebUI |
+| pi-nas | 100.64.0.21 (192.168.1.245) | NFS file server for PDF library |
+| Contabo VPS | 100.64.0.1 (5.189.158.149) | Backup destination |
+
+### Services on cortex (100.64.0.14)
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| Qdrant | 6333 | Vector database (recon_knowledge collection) |
+| TEI (text-embeddings-inference) | 8090 | Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec) |
+| Ollama | 11434 | LLM server + fallback embeddings (~8 emb/sec) |
+| OpenWebUI | 8080 | Aurora chat interface (ai.echo6.co) |
+
+### Services on recon LXC (100.64.0.24)
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| RECON Dashboard | 8420 | Web UI + API for pipeline management |
+| File Server | 8888 | PDF downloads (files.echo6.co) |
+
+### NFS Mount
+
+```
+pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)
+```
+
+Contains ~13,000+ PDFs across:
+- `Survival-Companion-Library/` (~12,900 PDFs in ~220 subdirectories)
+- `Army_Pubs/` (~160 military field manuals)
+- Other: `Gaming/`, `Reference/`, `Technical/`
+
+---
+
+## 3. Architecture Overview
+
+```
+                    /mnt/library/ (NFS)
+                         |
+                    [recon scan]
+                         |
+                    catalogue (SQLite)
+                         |
+                    [recon queue]
+                         |
+    +-----------+   [recon extract]   +-----------+
+    |  PyPDF2   |-->  data/text/      |  Gemini   |
+    | pdftotext |   {hash}/page_N.txt |  Flash    |
+    | tesseract |        |            |  4 keys   |
+    +-----------+   [recon enrich]    +-----------+
+                         |
+                    data/concepts/
+                    {hash}/window_N.json
+                         |
+                    [recon embed]
+                         |
+              +----------+-----------+
+              |   TEI (primary)      |
+              |   bge-m3, 1024-dim   |
+              |   1,711 emb/sec      |
+              +----------+-----------+
+                         |
+                    Qdrant (cortex:6333)
+                    recon_knowledge collection
+                         |
+                    Aurora (OpenWebUI)
+                    RAG search + citations
+```
+
+### Web Content Path
+
+```
+    URL(s) ──> [recon ingest-url / crawl]
+                         |
+                    trafilatura extraction
+                    chunk into ~2000-word pages
+                         |
+                    data/text/{hash}/page_N.txt
+                    (enters at "extracted" status)
+                         |
+                    [enrich] -> [embed]
+                    (same as PDF path)
+```
+
+---
+
+## 4. Pipeline Stages
+
+### Status Flow
+
+```
+catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
+                                                                                    \-> failed
+```
+
+Web content enters at `extracted` status (text already extracted by trafilatura).
+
+### Stage Details
+
+| Stage | Tool | Input | Output | Speed |
+|-------|------|-------|--------|-------|
+| Scan | `recon scan` | /mnt/library/*.pdf | catalogue table | ~13K PDFs in ~30 min |
+| Queue | `recon queue` | catalogue entries | documents table (status=queued) | Instant |
+| Extract | `recon extract` | PDF files | data/text/{hash}/page_NNNN.txt | 4 workers, ~200/hr |
+| Enrich | `recon enrich` | Text pages (10-page windows) | data/concepts/{hash}/window_N.json | 16 workers, 4 Gemini keys |
+| Embed | `recon embed` | Concept JSONs | Qdrant vectors | TEI: 1,711 emb/sec |
+
+### Extraction Fallback Chain
+
+1. **PyPDF2** (fast, clean text) -> 2. **pdftotext** (handles complex layouts) -> 3. **Tesseract OCR** (scanned documents)
+
+### Enrichment Details
+
+- Model: `gemini-2.0-flash`
+- Window size: 10 pages per API call (configurable)
+- Workers: 16 concurrent (4 API keys x 4 workers each)
+- Output format: JSON array of concept objects
+- **CRITICAL**: Concept JSONs are saved to disk BEFORE any database operations
+- Key rotation via `KeyRotator` class distributing across 4 Gemini API keys
+
+### Embedding Details
+
+- **Primary**: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
+- **Fallback**: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
+- Batch size: 128 embeddings per TEI request
+- Distance metric: Cosine similarity
+- **CRITICAL**: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
+
+---
+
+## 5. Directory Structure
+
+```
+/opt/recon/                          # Application root
+  recon.py                           # CLI entry point
+  config.yaml                        # Central configuration
+  .env                               # Gemini API keys (4 keys)
+  requirements.txt                   # Python dependencies
+  PROJECT-BIBLE.md                   # This file
+  README.md                          # Quick-start reference
+  run-full-pipeline.sh               # Background pipeline runner
+
+  lib/                               # Core modules
+    __init__.py
+    api.py                           # Flask web dashboard + API (port 8420)
+    crawler.py                       # Site crawler (sitemap + BFS link-following)
+    embedder.py                      # Concept -> vector embedding (TEI/Ollama -> Qdrant)
+    enricher.py                      # Text -> concept extraction (Gemini)
+    extractor.py                     # PDF -> text extraction (PyPDF2/pdftotext/OCR)
+    ingester.py                      # ARGUS intel feed intake
+    status.py                        # SQLite DB operations (catalogue + documents)
+    utils.py                         # Config, hashing, URL generation, logging
+    web_scraper.py                   # URL -> text extraction (trafilatura)
+
+  scripts/                           # Operational scripts
+    backup.sh                        # Automated backup to Contabo (cron every 6h)
+    rebuild_qdrant.py                # Nuclear recovery: re-embed all concepts
+    validate.py                      # Pipeline consistency validation
+
+  data/                              # Pipeline data (on local disk)
+    recon.db                         # SQLite status database
+    text/                            # Extracted text
+      {content_hash}/
+        meta.json                    # Document metadata
+        page_0001.txt                # Page text (4-digit, 1-indexed)
+        page_0002.txt
+        ...
+    concepts/                        # Enriched concepts (**BACK THESE UP**)
+      {content_hash}/
+        window_1.json                # Concept JSON array (10-page window)
+        window_2.json
+        ...
+    intel/                           # ARGUS intel feeds
+
+  logs/                              # Application logs
+    recon.log                        # Main rotating log
+    backup.log                       # Backup operation log
+    backup_cron.log                  # Cron backup log
+
+  venv/                              # Python virtual environment
+```
+
+---
+
+## 6. Database Schema
+
+### SQLite (data/recon.db)
+
+Two tables in WAL mode with thread-local connections.
+
+#### catalogue
+
+| Column | Type | Description |
+|--------|------|-------------|
+| hash | TEXT PK | MD5 content hash |
+| filename | TEXT | Original filename |
+| path | TEXT | Full filesystem path |
+| size_bytes | INTEGER | File size |
+| source | TEXT | Top-level directory (e.g., "Survival-Companion-Library") |
+| category | TEXT | Second-level directory (e.g., "Bushcraft") |
+| status | TEXT | "catalogued" or "processed" |
+| discovered_at | TEXT | ISO timestamp |
+
+#### documents
+
+| Column | Type | Description |
+|--------|------|-------------|
+| hash | TEXT PK | MD5 content hash |
+| filename | TEXT | Original filename |
+| path | TEXT | Full path or URL |
+| size_bytes | INTEGER | File/content size |
+| page_count | INTEGER | Number of text pages |
+| book_title | TEXT | Gemini-extracted title |
+| book_author | TEXT | Gemini-extracted author |
+| status | TEXT | Pipeline status |
+| pages_extracted | INTEGER | Pages extracted |
+| concepts_extracted | INTEGER | Concepts generated |
+| vectors_inserted | INTEGER | Vectors in Qdrant |
+| error_message | TEXT | Last error (if failed) |
+| retry_count | INTEGER | Failure retry count |
+| created_at | TEXT | ISO timestamp |
+| updated_at | TEXT | ISO timestamp |
+
+### Qdrant (cortex:6333)
+
+Collection: `recon_knowledge`
+
+| Field | Type | Description |
+|-------|------|-------------|
+| vector | float[1024] | BGE-M3 embedding |
+| doc_hash | keyword | Links to SQLite document |
+| filename | keyword | Source filename |
+| book_title | keyword | Document title |
+| book_author | keyword | Author name |
+| source_type | keyword | "document", "web", or "intel_feed" |
+| download_url | keyword | files.echo6.co URL or source URL |
+| content | text | Concept text (searchable) |
+| summary | text | Concept summary |
+| title | keyword | Concept title |
+| domain | keyword | Knowledge domain |
+| subdomain | keyword | Knowledge subdomain |
+| keywords | keyword[] | Concept keywords |
+| skill_level | keyword | beginner/intermediate/advanced/expert |
+| key_facts | text[] | Key facts list |
+| scenario_applicable | text[] | Applicable scenarios |
+| cross_domain_tags | keyword[] | Cross-references |
+| chapter | keyword | Source chapter |
+| page_ref | keyword | Source page reference |
+| notes | text | Additional notes |
+| _window | integer | Source window number |
+| _start_page | integer | Starting page in document |
+| verification_status | keyword | "unverified" (default) |
+| credibility_score | float | 0.7 (default) |
+| language | keyword | "en" (default) |
+
+---
+
+## 7. CLI Reference
+
+```
+recon <command> [options]
+```
+
+| Command | Description | Key Options |
+|---------|-------------|-------------|
+| `scan` | Scan library, catalogue new PDFs | `--path` |
+| `queue` | Queue catalogued docs for processing | `--hash`, `--source`, `--category`, `--limit` |
+| `extract` | Extract text from queued PDFs | `--workers` |
+| `enrich` | Enrich extracted text via Gemini | `--workers`, `--limit` |
+| `embed` | Embed concepts into Qdrant | `--workers`, `--limit` |
+| `run` | Full pipeline (extract->enrich->embed) | `--workers`, `--enrich-workers`, `--limit` |
+| `status` | Show pipeline status counts | |
+| `catalogue` | Browse catalogue | `--sources`, `--categories`, `--source`, `--limit` |
+| `failures` | Show failed documents | `--retry` |
+| `search` | Semantic search | `query`, `--limit` |
+| `upload` | Upload PDFs | `--file`, `--dir`, `--category` |
+| `ingest-url` | Ingest web content | `url`, `--file`, `--category`, `--process` |
+| `crawl` | Crawl a site | `url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process` |
+| `validate` | Check pipeline consistency | `--deep` |
+| `rebuild` | Rebuild Qdrant from concept JSONs | |
+| `serve` | Start web dashboard (port 8420) | |
+| `ingest` | Ingest ARGUS intel JSON | `--file`, `--directory` |
+
+### Common Workflows
+
+```bash
+# Full library processing
+recon scan && recon queue && recon run
+
+# Ingest a single web page with full processing
+recon ingest-url "https://example.com/article" --category "Reference" --process
+
+# Dry-run crawl to preview URLs
+recon crawl "https://docs.example.com" --include /docs/ --dry-run
+
+# Full crawl with processing
+recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
+
+# Upload a PDF
+recon upload --file /path/to/document.pdf --category "Technical"
+
+# Check what failed and retry
+recon failures
+recon failures --retry
+```
+
+---
+
+## 8. Web Dashboard
+
+### URL
+
+```
+http://100.64.0.24:8420
+```
+
+### Pages
+
+| Route | Page | Description |
+|-------|------|-------------|
+| `/` | Dashboard | Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status |
+| `/search` | Search | Semantic search with score bars, Web/PDF badges, download links |
+| `/catalogue` | Catalogue | Browse all catalogued PDFs with source/category filters |
+| `/upload` | Upload | PDF upload form with category datalist, recent uploads table |
+| `/web-ingest` | Web Ingest | Two tabs: Single/Batch URL ingest, Site Crawl with preview |
+| `/failures` | Failures | Failed documents with error messages and retry button |
+
+### API Endpoints
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/search?q=...&limit=N` | Semantic search |
+| GET | `/api/catalogue?source=...&limit=N` | Browse catalogue |
+| GET | `/api/knowledge-stats` | Dashboard aggregation (totals, sources, domains, skills, Qdrant health) |
+| POST | `/api/upload` | Upload PDF (multipart: file + category) |
+| GET | `/api/upload/<hash>/status` | Check upload processing status |
+| GET | `/api/upload/categories` | List available categories |
+| POST | `/api/ingest-url` | Ingest single URL (json: url, category, process) |
+| POST | `/api/ingest-urls` | Ingest multiple URLs (json: urls, category, process) |
+| POST | `/api/crawl` | Crawl a site (json: url, category, include, exclude, max_pages, dry_run) |
+| GET | `/api/crawl/<id>/status` | Poll crawl/pipeline progress |
+| POST | `/api/failures/retry` | Re-queue all failed documents |
+
+### Dashboard Features
+
+- **Auto-refresh**: Every 30 seconds via JavaScript fetch
+- **Knowledge cards**: Total documents, concepts, vectors, pages
+- **Source table**: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
+- **Domain distribution**: Horizontal bars showing top knowledge domains
+- **Skill level breakdown**: beginner/intermediate/advanced/expert percentages
+- **Qdrant health**: Connection status, points count, segments
+- **Pipeline status**: Compact display of documents in each stage
+- **Crawl polling**: Real-time stage tracking (ingesting -> enriching -> embedding)
+
+---
+
+## 9. Concept JSON Schema
+
+Each window file (`data/concepts/{hash}/window_N.json`) contains a JSON array of concept objects:
+
+```json
+[
+  {
+    "title": "Water Purification Methods",
+    "content": "Detailed text about the concept...",
+    "summary": "Brief summary of the concept",
+    "domain": "Survival",
+    "subdomain": "Water",
+    "keywords": ["purification", "filtration", "boiling"],
+    "skill_level": "beginner",
+    "key_facts": ["Boiling kills 99.9% of pathogens", "..."],
+    "scenario_applicable": ["wilderness survival", "disaster preparedness"],
+    "cross_domain_tags": ["health", "camping"],
+    "chapter": "Chapter 3",
+    "page_ref": "pp. 45-48",
+    "notes": "Additional context or caveats",
+    "_window": 1,
+    "_start_page": 1
+  }
+]
+```
+
+---
+
+## 10. Web Ingestion
+
+### Single URL
+
+```bash
+recon ingest-url "https://example.com/article" --category "Reference" --process
+```
+
+Or via API:
+```bash
+curl -X POST http://100.64.0.24:8420/api/ingest-url \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com/article", "category": "Reference", "process": true}'
+```
+
+### Site Crawl
+
+```bash
+# Preview what would be crawled
+recon crawl "https://docs.example.com" --include /docs/ --dry-run
+
+# Full crawl
+recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
+```
+
+### How It Works
+
+1. **URL discovery** (crawler.py):
+   - Tries sitemap.xml first (preferred, finds all pages)
+   - Falls back to BFS link-following if no sitemap
+   - Filters by include/exclude patterns
+
+2. **Content extraction** (web_scraper.py):
+   - Uses trafilatura for clean text extraction
+   - Chunks into ~2,000-word pages
+   - Same output format as PDF extractor: `data/text/{hash}/page_NNNN.txt`
+   - Content hash is MD5 of extracted text (deduplication)
+
+3. **Pipeline integration**:
+   - Web content enters at `extracted` status (no PDF extraction needed)
+   - Enrichment and embedding proceed identically to PDF content
+   - Qdrant vectors get `source_type: "web"` and `download_url` pointing to source URL
+
+---
+
+## 11. Configuration Reference
+
+### config.yaml
+
+```yaml
+# Root path for the PDF library (NFS mount from pi-nas)
+library_root: /mnt/library
+
+processing:
+  extract_workers: 4        # Concurrent PDF extraction threads
+  enrich_workers: 16         # Concurrent Gemini enrichment threads (4 keys x 4)
+  embed_workers: 4           # Concurrent embedding threads
+  enrich_window_size: 5      # Pages per enrichment window (sent to Gemini)
+  embed_batch_size: 500      # Vectors per Qdrant upsert batch
+  rate_limit_delay: 0.1      # Delay between Gemini API calls (seconds)
+  max_retries: 5             # Max retries for failed documents
+
+embedding:
+  backend: tei               # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
+  tei_host: 100.64.0.14      # TEI server (cortex)
+  tei_port: 8090             # TEI HTTP port
+  ollama_host: 100.64.0.14   # Ollama server (cortex) — fallback only
+  ollama_port: 11434         # Ollama HTTP port
+  model: bge-m3              # Embedding model name
+  dimensions: 1024           # CRITICAL: bge-m3 is 1024-dim, NOT 384
+  batch_size: 128            # Embeddings per TEI batch request
+
+vector_db:
+  host: 100.64.0.14          # Qdrant server (cortex)
+  port: 6333                 # Qdrant HTTP port
+  collection: recon_knowledge  # Collection name
+
+gemini:
+  model: gemini-2.0-flash    # Gemini model for enrichment
+  response_mime_type: application/json  # Force JSON output
+
+web:
+  port: 8420                 # Dashboard HTTP port
+  host: 0.0.0.0              # Bind to all interfaces
+
+paths:
+  base: /opt/recon           # Application root
+  data: /opt/recon/data      # Data directory
+  text: /opt/recon/data/text  # Extracted text output
+  concepts: /opt/recon/data/concepts  # Enriched concept JSONs
+  intel: /opt/recon/data/intel  # ARGUS intel feeds
+  logs: /opt/recon/logs      # Log files
+  db: /opt/recon/data/recon.db  # SQLite database
+
+book_server:
+  base_url: https://files.echo6.co  # Public URL prefix for PDF downloads
+  strip_prefix: /mnt/library  # Path prefix to strip when generating URLs
+
+upload_paths:                 # Category -> filesystem path mapping for uploads
+  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
+  Military Doctrine: /mnt/library/Army_Pubs/Uploads
+  Gaming: /mnt/library/Gaming
+  Reference: /mnt/library/Reference
+  Technical: /mnt/library/Technical
+  default: /mnt/library      # Fallback for unknown categories
+
+web_scraper:
+  words_per_page: 2000       # Target words per page chunk
+  fetch_timeout: 30          # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0      # Delay between URL fetches (seconds)
+  max_batch_size: 50         # Max URLs per batch ingest
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+
+crawler:
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+  fetch_timeout: 30          # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0      # Delay between page fetches (seconds)
+  max_pages: 500             # Max pages to discover per crawl
+  max_depth: 3               # Max link-following depth (BFS only)
+  default_exclude:            # URL patterns to always skip
+    - /search
+    - /404
+    - /login
+    - /signup
+    - /auth/
+    - /api/
+    - /assets/
+    - /static/
+```
+
+### .env
+
+```
+GEMINI_KEY_1=<key>
+GEMINI_KEY_2=<key>
+GEMINI_KEY_3=<key>
+GEMINI_KEY_4=<key>
+```
+
+Four Gemini API keys rotated across 16 enrichment workers via `KeyRotator`.
+
+---
+
+## 12. Aurora RAG Integration
+
+Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).
+
+### How It Works
+
+1. User asks a question in OpenWebUI
+2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
+3. Searches Qdrant `recon_knowledge` collection for similar concepts
+4. Top results are injected into the prompt as context
+5. JOSIEFIED Qwen3 8B generates an answer with citations
+6. Citations include `download_url` links (PDF files via files.echo6.co, web content via source URL)
+
+### Key Components
+
+- **Embedding**: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
+- **Search**: Cosine similarity, top-5 results by default
+- **LLM**: `goekdenizguelmez/JOSIEFIED-Qwen3:8b` on Ollama (cortex:11434)
+- **Citations**: Each result includes `download_url` — either `https://files.echo6.co/...` for PDFs or the original URL for web content
+
+---
+
+## 13. Backup & Recovery
+
+### Automated Backups
+
+**Script**: `/opt/recon/scripts/backup.sh`
+**Destination**: Contabo VPS (`root@100.64.0.1:/opt/backups/recon/`)
+**Schedule** (cron):
+- Every 6 hours: Full backup (concepts, text, DB, config, intel)
+- Every 2 hours (off-hours): SQLite DB snapshot only
+
+### What's Backed Up
+
+| Component | Size | Priority | Notes |
+|-----------|------|----------|-------|
+| data/concepts/ | ~11M | **CRITICAL** | $130+ of Gemini API work |
+| data/text/ | ~203M | High | Hours to regenerate |
+| data/recon.db | ~6.5M | **CRITICAL** | All pipeline state |
+| config.yaml + .env | ~2K | Important | Configuration |
+| data/intel/ | ~4K | Low | Intel feed data |
+
+### What's NOT Backed Up
+
+- **Qdrant vectors**: Rebuilt from concept JSONs in ~10 minutes via `recon rebuild`
+- **PDF library**: Lives on pi-nas NFS, backed up separately
+- **venv/**: Recreated from requirements.txt
+
+### Recovery Procedures
+
+```bash
+# Restore from backup
+scp -r root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
+scp -r root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
+scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db
+
+# Rebuild Qdrant vectors from concept JSONs
+cd /opt/recon && source venv/bin/activate
+python3 scripts/rebuild_qdrant.py
+# Type REBUILD when prompted
+```
+
+---
+
+## 14. Embedding Performance
+
+### TEI (Primary) vs Ollama (Fallback)
+
+| Metric | TEI (cortex:8090) | Ollama (cortex:11434) |
+|--------|-------------------|----------------------|
+| Speed | ~1,711 emb/sec | ~8 emb/sec |
+| Model | bge-m3 | bge-m3 |
+| Dimensions | 1024 | 1024 |
+| Batch size | 128 | 1 |
+| Cosine similarity | 0.999900 | 0.999900 |
+
+TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
+
+### Qdrant Configuration
+
+- Collection: `recon_knowledge`
+- Distance: Cosine
+- HNSW indexing threshold: 20,000 (below this, brute-force search is used)
+- Current state: Brute-force (under 20K vectors) — this is normal and performant at current scale
+
+---
+
+## 15. Content Hashing
+
+- **PDF content**: `MD5(file_bytes)` — stable across renames, detects exact duplicates
+- **Web content**: `MD5(extracted_text)` — deduplicates by content, not URL
+- Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
+
+---
+
+## 16. Source Type Handling
+
+| Source | Path Format | source_type | download_url | Badge |
+|--------|-------------|-------------|--------------|-------|
+| PDF | `/mnt/library/...` | document | `https://files.echo6.co/...` | PDF |
+| Web | `https://...` | web | Original URL | Web |
+| Intel | JSON feed | intel_feed | — | — |
+
+The `generate_download_url()` function in utils.py handles the routing:
+- URLs starting with `http://` or `https://` are returned as-is
+- File paths are converted to `files.echo6.co` URLs
+
+---
+
+## 17. Lessons Learned
+
+### RECON Rebuild Lessons
+
+1. **Verify infrastructure before writing code.** Check Qdrant, TEI, Ollama connectivity first.
+2. **Dimensions are 1024, NOT 384.** BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
+3. **TEI >> Ollama for embeddings.** 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
+4. **Dynamic discovery over hardcoded paths.** Let the pipeline discover what's on disk rather than maintaining static file lists.
+5. **Web content uses the same pipeline.** After text extraction, web and PDF content follow identical enrichment and embedding paths.
+6. **Sitemap > link-following.** Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
+7. **Save to disk before DB operations.** Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
+8. **NFS over large file sets is slow.** Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
+
+### Operational Gotchas
+
+- `recon scan` can appear stuck on large PDFs over NFS — it's hashing, not hung
+- Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
+- Gemini rate limits hit with 16 workers — the `KeyRotator` distributes across 4 keys to mitigate
+- `iptables-persistent` hangs on interactive prompts in LXC containers — use manual persistence
+- The recon LXC has no tmux/screen — use `nohup` for long-running background tasks
+
+---
+
+## 18. Monitoring
+
+### Pipeline Status
+
+```bash
+# Quick status
+recon status
+
+# Dashboard
+http://100.64.0.24:8420
+
+# Tail logs
+tail -f /opt/recon/logs/recon.log
+
+# Pipeline run log (when running full background pipeline)
+tail -f /opt/recon/pipeline.log
+```
+
+### Health Checks
+
+```bash
+# Qdrant
+curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool
+
+# TEI
+curl -s http://100.64.0.14:8090/info
+
+# Ollama
+curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool
+
+# NFS mount
+df -h /mnt/library
+
+# Backup logs
+tail -20 /opt/recon/logs/backup.log
+```
+
+### Validation
+
+```bash
+# Quick validation
+recon validate
+
+# Deep validation (checks all files on disk)
+recon validate --deep
+```
+
+---
+
+## 19. Current State
+
+*As of 2026-02-16*
+
+### Pipeline Progress
+
+| Status | Count |
+|--------|-------|
+| Catalogued | 10,162 |
+| Queued | 8,982 |
+| Extracted | 872 |
+| Complete | 302 |
+| Failed | 2 |
+
+### Vector Database
+
+- Qdrant points: 4,661 (3,144 PDF + 1,517 web)
+- Segments: 8
+- Indexing: Brute-force (under 20K threshold)
+
+### Active Processing
+
+Full pipeline running in background via `nohup` — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.
+
+### Backups
+
+- Schedule: Every 6 hours (full) + every 2 hours (DB only)
+- Destination: Contabo VPS (`/opt/backups/recon/`)
+- Last verified: 2026-02-16 (220M total backup size)
+
+---
+
+## 20. Dependencies
+
+### System Packages
+
+- Python 3.11+
+- pdftotext (poppler-utils)
+- tesseract-ocr
+- sqlite3
+
+### Python Packages (key)
+
+| Package | Version | Purpose |
+|---------|---------|---------|
+| Flask | 3.1.2 | Web dashboard |
+| google-generativeai | 0.8.6 | Gemini API for enrichment |
+| qdrant-client | 1.16.2 | Vector database client |
+| PyPDF2 | 3.0.1 | PDF text extraction |
+| trafilatura | 2.0.0 | Web content extraction |
+| beautifulsoup4 | 4.14.3 | HTML parsing for crawler |
+| lxml | 6.0.2 | XML/HTML parsing |
+| pytesseract | 0.3.13 | OCR fallback |
+| requests | 2.32.5 | HTTP client |
+| PyYAML | 6.0.3 | Config file parsing |
+
+Full list in `requirements.txt`.