Initial commit: RECON codebase baseline

Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete). Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-10 00:44:37 +02:00 · 2026-04-14 14:57:23 +00:00 · 2026-04-14 14:57:23 +00:00 · 563c16bb71
commit 563c16bb71
59 changed files with 18327 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,26 @@
+# Python
+venv/
+__pycache__/
+*.pyc
+*.pyo
+
+# Secrets
+.env
+
+# Runtime data
+data/
+logs/
+pipeline.log
+recon.db
+
+# Backups
+*.bak
+*.bak-*
+*.bak.*
+*.bak2.*
+
+# Junk
+-.png
+
+# OS
+.DS_Store
--- a/PROJECT-BIBLE.md
+++ b/PROJECT-BIBLE.md
@@ -0,0 +1,785 @@
+# RECON Project Bible v2.0
+
+*Last updated: 2026-02-16*
+
+---
+
+## 1. Mission Statement
+
+RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.
+
+**The core loop:** Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.
+
+---
+
+## 2. Infrastructure
+
+### Hosts
+
+| Host | IP (Tailscale) | Role |
+|------|---------------|------|
+| recon LXC | 100.64.0.24 (CT 130 on toc) | RECON application, dashboard, pipeline |
+| cortex VM | 100.64.0.14 (VM 150 on toc) | Qdrant, TEI, Ollama, OpenWebUI |
+| pi-nas | 100.64.0.21 (192.168.1.245) | NFS file server for PDF library |
+| Contabo VPS | 100.64.0.1 (5.189.158.149) | Backup destination |
+
+### Services on cortex (100.64.0.14)
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| Qdrant | 6333 | Vector database (recon_knowledge collection) |
+| TEI (text-embeddings-inference) | 8090 | Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec) |
+| Ollama | 11434 | LLM server + fallback embeddings (~8 emb/sec) |
+| OpenWebUI | 8080 | Aurora chat interface (ai.echo6.co) |
+
+### Services on recon LXC (100.64.0.24)
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| RECON Dashboard | 8420 | Web UI + API for pipeline management |
+| File Server | 8888 | PDF downloads (files.echo6.co) |
+
+### NFS Mount
+
+```
+pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)
+```
+
+Contains roughly 13,000 PDFs across:
+- `Survival-Companion-Library/` (~12,900 PDFs in ~220 subdirectories)
+- `Army_Pubs/` (~160 military field manuals)
+- Other: `Gaming/`, `Reference/`, `Technical/`
+
+---
+
+## 3. Architecture Overview
+
+```
+                    /mnt/library/ (NFS)
+                         |
+                    [recon scan]
+                         |
+                    catalogue (SQLite)
+                         |
+                    [recon queue]
+                         |
+    +-----------+   [recon extract]   +-----------+
+    |  PyPDF2   |-->  data/text/      |  Gemini   |
+    | pdftotext |   {hash}/page_N.txt |  Flash    |
+    | tesseract |        |            |  4 keys   |
+    +-----------+   [recon enrich]    +-----------+
+                         |
+                    data/concepts/
+                    {hash}/window_N.json
+                         |
+                    [recon embed]
+                         |
+              +----------+-----------+
+              |   TEI (primary)      |
+              |   bge-m3, 1024-dim   |
+              |   1,711 emb/sec      |
+              +----------+-----------+
+                         |
+                    Qdrant (cortex:6333)
+                    recon_knowledge collection
+                         |
+                    Aurora (OpenWebUI)
+                    RAG search + citations
+```
+
+### Web Content Path
+
+```
+    URL(s) ──> [recon ingest-url / crawl]
+                         |
+                    trafilatura extraction
+                    chunk into ~2000-word pages
+                         |
+                    data/text/{hash}/page_N.txt
+                    (enters at "extracted" status)
+                         |
+                    [enrich] -> [embed]
+                    (same as PDF path)
+```
+
+---
+
+## 4. Pipeline Stages
+
+### Status Flow
+
+```
+catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
+                                                                                    \-> failed
+```
+
+Web content enters at `extracted` status (text already extracted by trafilatura).
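The status flow above can be read as a small transition table. The following is a minimal sketch, not RECON's actual code: the status names come from the diagram, while `NEXT_STATUS`, `advance`, and the assumption that any stage may end in `failed` are illustrative.

```python
# Hypothetical transition table; status names are taken from the diagram above.
NEXT_STATUS = {
    "catalogued": "queued",
    "queued": "extracting",
    "extracting": "extracted",
    "extracted": "enriching",
    "enriching": "enriched",
    "enriched": "embedding",
    "embedding": "complete",
}

# Web content skips PDF extraction and starts here.
WEB_ENTRY_STATUS = "extracted"


def advance(status: str, ok: bool = True) -> str:
    """Return the next status, or 'failed' if the current stage reported an error.

    Assumption: any stage may fail; the diagram only draws the branch once.
    """
    if status not in NEXT_STATUS:
        raise ValueError(f"no transition defined from {status!r}")
    return NEXT_STATUS[status] if ok else "failed"
```

For example, `advance(WEB_ENTRY_STATUS)` yields `enriching`, which matches web content joining the pipeline at the enrichment stage.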
+
+### Stage Details
+
+| Stage | Tool | Input | Output | Speed |
+|-------|------|-------|--------|-------|
+| Scan | `recon scan` | /mnt/library/*.pdf | catalogue table | ~13K PDFs in ~30 min |
+| Queue | `recon queue` | catalogue entries | documents table (status=queued) | Instant |
+| Extract | `recon extract` | PDF files | data/text/{hash}/page_NNNN.txt | 4 workers, ~200/hr |
+| Enrich | `recon enrich` | Text pages (10-page windows) | data/concepts/{hash}/window_N.json | 16 workers, 4 Gemini keys |
+| Embed | `recon embed` | Concept JSONs | Qdrant vectors | TEI: 1,711 emb/sec |
+
+### Extraction Fallback Chain
+
+1. **PyPDF2** (fast, clean text) -> 2. **pdftotext** (handles complex layouts) -> 3. **Tesseract OCR** (scanned documents)
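A fallback chain like this can be expressed as an ordered list of extractor callables. The sketch below is an assumption about the shape of the logic, not the contents of `extractor.py`: `extract_with_fallback` is a hypothetical helper, and the real PyPDF2, pdftotext, and Tesseract wrappers would be passed in as the callables.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Iterable[tuple[str, Extractor]]
) -> tuple[Optional[str], str]:
    """Try each extractor in order and return (name, text) from the first
    one that produces non-empty text. Returns (None, "") if all of them fail."""
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # Illustrative policy: any error moves on to the next extractor.
            continue
        if text and text.strip():
            return name, text
    return None, ""
```

Treating empty output the same as an exception is a design choice in this sketch: a scanned PDF often parses without error but yields no text, which is the case OCR exists for.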
+
+### Enrichment Details
+
+- Model: `gemini-2.0-flash`
+- Window size: 10 pages per API call (configurable via `processing.enrich_window_size`; the sample config in section 11 sets it to 5)
+- Workers: 16 concurrent (4 API keys x 4 workers each)
+- Output format: JSON array of concept objects
+- **CRITICAL**: Concept JSONs are saved to disk BEFORE any database operations
+- Key rotation via `KeyRotator` class distributing across 4 Gemini API keys
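To make the windowing and key rotation concrete, here is a minimal sketch. The class name `KeyRotator` comes from the text above, but this body is a hypothetical round-robin implementation; the real class may rotate differently (for example, by tracking rate-limit errors). `page_windows` is an illustrative helper as well.

```python
import itertools
import threading


class KeyRotator:
    """Hypothetical thread-safe round-robin rotator over API keys."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self):
        # The lock keeps concurrent workers from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)


def page_windows(page_count, window_size=10):
    """Yield (window_number, first_page, last_page), all 1-indexed,
    matching the window_N.json and page_NNNN.txt naming."""
    starts = range(1, page_count + 1, window_size)
    for number, start in enumerate(starts, start=1):
        yield number, start, min(start + window_size - 1, page_count)
```

With four keys and 16 workers, each call to `next_key()` hands out the next key in sequence, so load spreads evenly as long as workers make requests at similar rates.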
+
+### Embedding Details
+
+- **Primary**: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
+- **Fallback**: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
+- Batch size: 128 embeddings per TEI request
+- Distance metric: Cosine similarity
+- **CRITICAL**: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
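One way to keep a dimension mismatch from failing silently is to check every returned vector before it reaches Qdrant. The sketch below assumes an `embed_fn` callable that takes a list of texts and returns one vector per text; that callable, `batched`, and `embed_all` are hypothetical names, and the actual TEI or Ollama request is left out on purpose.

```python
from typing import Callable, Iterator, Sequence

EXPECTED_DIM = 1024  # bge-m3 output size, per the notes above
BATCH_SIZE = 128     # embeddings per TEI request, per the notes above


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def embed_all(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed texts batch by batch, failing loudly on a count or dimension mismatch."""
    vectors: list[list[float]] = []
    for batch in batched(texts):
        result = embed_fn(batch)
        if len(result) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(result)}")
        for vec in result:
            if len(vec) != EXPECTED_DIM:
                raise ValueError(f"expected {EXPECTED_DIM} dims, got {len(vec)}")
        vectors.extend(result)
    return vectors
```

Because the backend is injected, the same check covers both the primary and the fallback path.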
+
+---
+
+## 5. Directory Structure
+
+```
+/opt/recon/                          # Application root
+  recon.py                           # CLI entry point
+  config.yaml                        # Central configuration
+  .env                               # Gemini API keys (4 keys)
+  requirements.txt                   # Python dependencies
+  PROJECT-BIBLE.md                   # This file
+  README.md                          # Quick-start reference
+  run-full-pipeline.sh               # Background pipeline runner
+
+  lib/                               # Core modules
+    __init__.py
+    api.py                           # Flask web dashboard + API (port 8420)
+    crawler.py                       # Site crawler (sitemap + BFS link-following)
+    embedder.py                      # Concept -> vector embedding (TEI/Ollama -> Qdrant)
+    enricher.py                      # Text -> concept extraction (Gemini)
+    extractor.py                     # PDF -> text extraction (PyPDF2/pdftotext/OCR)
+    ingester.py                      # ARGUS intel feed intake
+    status.py                        # SQLite DB operations (catalogue + documents)
+    utils.py                         # Config, hashing, URL generation, logging
+    web_scraper.py                   # URL -> text extraction (trafilatura)
+
+  scripts/                           # Operational scripts
+    backup.sh                        # Automated backup to Contabo (cron every 6h)
+    rebuild_qdrant.py                # Nuclear recovery: re-embed all concepts
+    validate.py                      # Pipeline consistency validation
+
+  data/                              # Pipeline data (on local disk)
+    recon.db                         # SQLite status database
+    text/                            # Extracted text
+      {content_hash}/
+        meta.json                    # Document metadata
+        page_0001.txt                # Page text (4-digit, 1-indexed)
+        page_0002.txt
+        ...
+    concepts/                        # Enriched concepts (**BACK THESE UP**)
+      {content_hash}/
+        window_1.json                # Concept JSON array (10-page window)
+        window_2.json
+        ...
+    intel/                           # ARGUS intel feeds
+
+  logs/                              # Application logs
+    recon.log                        # Main rotating log
+    backup.log                       # Backup operation log
+    backup_cron.log                  # Cron backup log
+
+  venv/                              # Python virtual environment
+```
+
+---
+
+## 6. Database Schema
+
+### SQLite (data/recon.db)
+
+Two tables. The database runs in WAL mode and is accessed through thread-local connections.
+
+#### catalogue
+
+| Column | Type | Description |
+|--------|------|-------------|
+| hash | TEXT PK | MD5 content hash |
+| filename | TEXT | Original filename |
+| path | TEXT | Full filesystem path |
+| size_bytes | INTEGER | File size |
+| source | TEXT | Top-level directory (e.g., "Survival-Companion-Library") |
+| category | TEXT | Second-level directory (e.g., "Bushcraft") |
+| status | TEXT | "catalogued" or "processed" |
+| discovered_at | TEXT | ISO timestamp |
+
+#### documents
+
+| Column | Type | Description |
+|--------|------|-------------|
+| hash | TEXT PK | MD5 content hash |
+| filename | TEXT | Original filename |
+| path | TEXT | Full path or URL |
+| size_bytes | INTEGER | File/content size |
+| page_count | INTEGER | Number of text pages |
+| book_title | TEXT | Gemini-extracted title |
+| book_author | TEXT | Gemini-extracted author |
+| status | TEXT | Pipeline status |
+| pages_extracted | INTEGER | Pages extracted |
+| concepts_extracted | INTEGER | Concepts generated |
+| vectors_inserted | INTEGER | Vectors in Qdrant |
+| error_message | TEXT | Last error (if failed) |
+| retry_count | INTEGER | Failure retry count |
+| created_at | TEXT | ISO timestamp |
+| updated_at | TEXT | ISO timestamp |
+
+### Qdrant (cortex:6333)
+
+Collection: `recon_knowledge`
+
+| Field | Type | Description |
+|-------|------|-------------|
+| vector | float[1024] | BGE-M3 embedding |
+| doc_hash | keyword | Links to SQLite document |
+| filename | keyword | Source filename |
+| book_title | keyword | Document title |
+| book_author | keyword | Author name |
+| source_type | keyword | "document", "web", or "intel_feed" |
+| download_url | keyword | files.echo6.co URL or source URL |
+| content | text | Concept text (searchable) |
+| summary | text | Concept summary |
+| title | keyword | Concept title |
+| domain | keyword | Knowledge domain |
+| subdomain | keyword | Knowledge subdomain |
+| keywords | keyword[] | Concept keywords |
+| skill_level | keyword | beginner/intermediate/advanced/expert |
+| key_facts | text[] | Key facts list |
+| scenario_applicable | text[] | Applicable scenarios |
+| cross_domain_tags | keyword[] | Cross-references |
+| chapter | keyword | Source chapter |
+| page_ref | keyword | Source page reference |
+| notes | text | Additional notes |
+| _window | integer | Source window number |
+| _start_page | integer | Starting page in document |
+| verification_status | keyword | "unverified" (default) |
+| credibility_score | float | 0.7 (default) |
+| language | keyword | "en" (default) |
+
+---
+
+## 7. CLI Reference
+
+```
+recon <command> [options]
+```
+
+| Command | Description | Key Options |
+|---------|-------------|-------------|
+| `scan` | Scan library, catalogue new PDFs | `--path` |
+| `queue` | Queue catalogued docs for processing | `--hash`, `--source`, `--category`, `--limit` |
+| `extract` | Extract text from queued PDFs | `--workers` |
+| `enrich` | Enrich extracted text via Gemini | `--workers`, `--limit` |
+| `embed` | Embed concepts into Qdrant | `--workers`, `--limit` |
+| `run` | Full pipeline (extract->enrich->embed) | `--workers`, `--enrich-workers`, `--limit` |
+| `status` | Show pipeline status counts | |
+| `catalogue` | Browse catalogue | `--sources`, `--categories`, `--source`, `--limit` |
+| `failures` | Show failed documents | `--retry` |
+| `search` | Semantic search | `query`, `--limit` |
+| `upload` | Upload PDFs | `--file`, `--dir`, `--category` |
+| `ingest-url` | Ingest web content | `url`, `--file`, `--category`, `--process` |
+| `crawl` | Crawl a site | `url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process` |
+| `validate` | Check pipeline consistency | `--deep` |
+| `rebuild` | Rebuild Qdrant from concept JSONs | |
+| `serve` | Start web dashboard (port 8420) | |
+| `ingest` | Ingest ARGUS intel JSON | `--file`, `--directory` |
+
+### Common Workflows
+
+```bash
+# Full library processing
+recon scan && recon queue && recon run
+
+# Ingest a single web page with full processing
+recon ingest-url "https://example.com/article" --category "Reference" --process
+
+# Dry-run crawl to preview URLs
+recon crawl "https://docs.example.com" --include /docs/ --dry-run
+
+# Full crawl with processing
+recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
+
+# Upload a PDF
+recon upload --file /path/to/document.pdf --category "Technical"
+
+# Check what failed and retry
+recon failures
+recon failures --retry
+```
+
+---
+
+## 8. Web Dashboard
+
+### URL
+
+```
+http://100.64.0.24:8420
+```
+
+### Pages
+
+| Route | Page | Description |
+|-------|------|-------------|
+| `/` | Dashboard | Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status |
+| `/search` | Search | Semantic search with score bars, Web/PDF badges, download links |
+| `/catalogue` | Catalogue | Browse all catalogued PDFs with source/category filters |
+| `/upload` | Upload | PDF upload form with category datalist, recent uploads table |
+| `/web-ingest` | Web Ingest | Two tabs: Single/Batch URL ingest, Site Crawl with preview |
+| `/failures` | Failures | Failed documents with error messages and retry button |
+
+### API Endpoints
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/search?q=...&limit=N` | Semantic search |
+| GET | `/api/catalogue?source=...&limit=N` | Browse catalogue |
+| GET | `/api/knowledge-stats` | Dashboard aggregation (totals, sources, domains, skills, Qdrant health) |
+| POST | `/api/upload` | Upload PDF (multipart: file + category) |
+| GET | `/api/upload/<hash>/status` | Check upload processing status |
+| GET | `/api/upload/categories` | List available categories |
+| POST | `/api/ingest-url` | Ingest single URL (json: url, category, process) |
+| POST | `/api/ingest-urls` | Ingest multiple URLs (json: urls, category, process) |
+| POST | `/api/crawl` | Crawl a site (json: url, category, include, exclude, max_pages, dry_run) |
+| GET | `/api/crawl/<id>/status` | Poll crawl/pipeline progress |
+| POST | `/api/failures/retry` | Re-queue all failed documents |
+
+### Dashboard Features
+
+- **Auto-refresh**: Every 30 seconds via JavaScript fetch
+- **Knowledge cards**: Total documents, concepts, vectors, pages
+- **Source table**: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
+- **Domain distribution**: Horizontal bars showing top knowledge domains
+- **Skill level breakdown**: beginner/intermediate/advanced/expert percentages
+- **Qdrant health**: Connection status, points count, segments
+- **Pipeline status**: Compact display of documents in each stage
+- **Crawl polling**: Real-time stage tracking (ingesting -> enriching -> embedding)
+
+---
+
+## 9. Concept JSON Schema
+
+Each window file (`data/concepts/{hash}/window_N.json`) contains a JSON array of concept objects:
+
+```json
+[
+  {
+    "title": "Water Purification Methods",
+    "content": "Detailed text about the concept...",
+    "summary": "Brief summary of the concept",
+    "domain": "Survival",
+    "subdomain": "Water",
+    "keywords": ["purification", "filtration", "boiling"],
+    "skill_level": "beginner",
+    "key_facts": ["Boiling kills 99.9% of pathogens", "..."],
+    "scenario_applicable": ["wilderness survival", "disaster preparedness"],
+    "cross_domain_tags": ["health", "camping"],
+    "chapter": "Chapter 3",
+    "page_ref": "pp. 45-48",
+    "notes": "Additional context or caveats",
+    "_window": 1,
+    "_start_page": 1
+  }
+]
+```
+
+---
+
+## 10. Web Ingestion
+
+### Single URL
+
+```bash
+recon ingest-url "https://example.com/article" --category "Reference" --process
+```
+
+Or via API:
+```bash
+curl -X POST http://100.64.0.24:8420/api/ingest-url \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com/article", "category": "Reference", "process": true}'
+```
+
+### Site Crawl
+
+```bash
+# Preview what would be crawled
+recon crawl "https://docs.example.com" --include /docs/ --dry-run
+
+# Full crawl
+recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
+```
+
+### How It Works
+
+1. **URL discovery** (crawler.py):
+   - Tries sitemap.xml first (preferred, finds all pages)
+   - Falls back to BFS link-following if no sitemap
+   - Filters by include/exclude patterns
+
+2. **Content extraction** (web_scraper.py):
+   - Uses trafilatura for clean text extraction
+   - Chunks into ~2,000-word pages
+   - Same output format as PDF extractor: `data/text/{hash}/page_NNNN.txt`
+   - Content hash is MD5 of extracted text (deduplication)
+
+3. **Pipeline integration**:
+   - Web content enters at `extracted` status (no PDF extraction needed)
+   - Enrichment and embedding proceed identically to PDF content
+   - Qdrant vectors get `source_type: "web"` and `download_url` pointing to source URL
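The chunking and naming described in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the code in `web_scraper.py`: `chunk_words`, `page_filename`, and `plan_pages` are hypothetical helpers, and splitting on whitespace is a simplification that discards the original line breaks.

```python
import hashlib


def chunk_words(text, words_per_page=2000):
    """Split text into chunks of at most `words_per_page` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_filename(number):
    """4-digit, 1-indexed page filename, e.g. page_0001.txt."""
    return f"page_{number:04d}.txt"


def plan_pages(text, words_per_page=2000):
    """Return (content_hash, {filename: page_text}) for the extracted text.

    The hash is MD5 of the extracted text, so the same article fetched
    from two URLs maps to the same directory.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    chunks = chunk_words(text, words_per_page)
    pages = {page_filename(i): chunk for i, chunk in enumerate(chunks, start=1)}
    return digest, pages
```

The resulting files would live under `data/text/{hash}/`, the same layout the PDF extractor produces, which is what lets enrichment treat both sources identically.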
+
+---
+
+## 11. Configuration Reference
+
+### config.yaml
+
+```yaml
+# Root path for the PDF library (NFS mount from pi-nas)
+library_root: /mnt/library
+
+processing:
+  extract_workers: 4        # Concurrent PDF extraction threads
+  enrich_workers: 16         # Concurrent Gemini enrichment threads (4 keys x 4)
+  embed_workers: 4           # Concurrent embedding threads
+  enrich_window_size: 5      # Pages per enrichment window (sent to Gemini)
+  embed_batch_size: 500      # Vectors per Qdrant upsert batch
+  rate_limit_delay: 0.1      # Delay between Gemini API calls (seconds)
+  max_retries: 5             # Max retries for failed documents
+
+embedding:
+  backend: tei               # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
+  tei_host: 100.64.0.14      # TEI server (cortex)
+  tei_port: 8090             # TEI HTTP port
+  ollama_host: 100.64.0.14   # Ollama server (cortex) — fallback only
+  ollama_port: 11434         # Ollama HTTP port
+  model: bge-m3              # Embedding model name
+  dimensions: 1024           # CRITICAL: bge-m3 is 1024-dim, NOT 384
+  batch_size: 128            # Embeddings per TEI batch request
+
+vector_db:
+  host: 100.64.0.14          # Qdrant server (cortex)
+  port: 6333                 # Qdrant HTTP port
+  collection: recon_knowledge  # Collection name
+
+gemini:
+  model: gemini-2.0-flash    # Gemini model for enrichment
+  response_mime_type: application/json  # Force JSON output
+
+web:
+  port: 8420                 # Dashboard HTTP port
+  host: 0.0.0.0              # Bind to all interfaces
+
+paths:
+  base: /opt/recon           # Application root
+  data: /opt/recon/data      # Data directory
+  text: /opt/recon/data/text  # Extracted text output
+  concepts: /opt/recon/data/concepts  # Enriched concept JSONs
+  intel: /opt/recon/data/intel  # ARGUS intel feeds
+  logs: /opt/recon/logs      # Log files
+  db: /opt/recon/data/recon.db  # SQLite database
+
+book_server:
+  base_url: https://files.echo6.co  # Public URL prefix for PDF downloads
+  strip_prefix: /mnt/library  # Path prefix to strip when generating URLs
+
+upload_paths:                 # Category -> filesystem path mapping for uploads
+  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
+  Military Doctrine: /mnt/library/Army_Pubs/Uploads
+  Gaming: /mnt/library/Gaming
+  Reference: /mnt/library/Reference
+  Technical: /mnt/library/Technical
+  default: /mnt/library      # Fallback for unknown categories
+
+web_scraper:
+  words_per_page: 2000       # Target words per page chunk
+  fetch_timeout: 30          # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0      # Delay between URL fetches (seconds)
+  max_batch_size: 50         # Max URLs per batch ingest
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+
+crawler:
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+  fetch_timeout: 30          # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0      # Delay between page fetches (seconds)
+  max_pages: 500             # Max pages to discover per crawl
+  max_depth: 3               # Max link-following depth (BFS only)
+  default_exclude:            # URL patterns to always skip
+    - /search
+    - /404
+    - /login
+    - /signup
+    - /auth/
+    - /api/
+    - /assets/
+    - /static/
+```
+
+### .env
+
+```
+GEMINI_KEY_1=<key>
+GEMINI_KEY_2=<key>
+GEMINI_KEY_3=<key>
+GEMINI_KEY_4=<key>
+```
+
+Four Gemini API keys rotated across 16 enrichment workers via `KeyRotator`.
+
+---
+
+## 12. Aurora RAG Integration
+
+Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).
+
+### How It Works
+
+1. User asks a question in OpenWebUI
+2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
+3. Searches Qdrant `recon_knowledge` collection for similar concepts
+4. Top results are injected into the prompt as context
+5. JOSIEFIED Qwen3 8B generates an answer with citations
+6. Citations include `download_url` links (PDF files via files.echo6.co, web content via source URL)
+
+### Key Components
+
+- **Embedding**: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
+- **Search**: Cosine similarity, top-5 results by default
+- **LLM**: `goekdenizguelmez/JOSIEFIED-Qwen3:8b` on Ollama (cortex:11434)
+- **Citations**: Each result includes `download_url` — either `https://files.echo6.co/...` for PDFs or the original URL for web content
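The search step amounts to ranking stored vectors by cosine similarity against the query vector and keeping the top five. Qdrant does this server-side; the brute-force sketch below only illustrates the ranking, and `cosine` and `top_k` are hypothetical helpers rather than part of Aurora or the Qdrant client.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, points, k=5):
    """Rank (vector, payload) pairs by similarity to `query`, best first."""
    scored = [(cosine(query, vector), payload) for vector, payload in points]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Each payload in the result would carry fields such as `title` and `download_url`, which is where the citation links come from.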
+
+---
+
+## 13. Backup & Recovery
+
+### Automated Backups
+
+**Script**: `/opt/recon/scripts/backup.sh`
+**Destination**: Contabo VPS (`root@100.64.0.1:/opt/backups/recon/`)
+**Schedule** (cron):
+- Every 6 hours: Full backup (concepts, text, DB, config, intel)
+- Every 2 hours (off-hours): SQLite DB snapshot only
+
+### What's Backed Up
+
+| Component | Size | Priority | Notes |
+|-----------|------|----------|-------|
+| data/concepts/ | ~11M | **CRITICAL** | $130+ of Gemini API work |
+| data/text/ | ~203M | High | Hours to regenerate |
+| data/recon.db | ~6.5M | **CRITICAL** | All pipeline state |
+| config.yaml + .env | ~2K | Important | Configuration |
+| data/intel/ | ~4K | Low | Intel feed data |
+
+### What's NOT Backed Up
+
+- **Qdrant vectors**: Rebuilt from concept JSONs in ~10 minutes via `recon rebuild`
+- **PDF library**: Lives on pi-nas NFS, backed up separately
+- **venv/**: Recreated from requirements.txt
+
+### Recovery Procedures
+
+```bash
+# Restore from backup
+scp -r root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
+scp -r root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
+scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db
+
+# Rebuild Qdrant vectors from concept JSONs
+cd /opt/recon && source venv/bin/activate
+python3 scripts/rebuild_qdrant.py
+# Type REBUILD when prompted
+```
+
+---
+
+## 14. Embedding Performance
+
+### TEI (Primary) vs Ollama (Fallback)
+
+| Metric | TEI (cortex:8090) | Ollama (cortex:11434) |
+|--------|-------------------|----------------------|
+| Speed | ~1,711 emb/sec | ~8 emb/sec |
+| Model | bge-m3 | bge-m3 |
+| Dimensions | 1024 | 1024 |
+| Batch size | 128 | 1 |
+| Cosine similarity | 0.999900 | 0.999900 |
+
+TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
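The speedup figure follows directly from the two throughput numbers in the table. As a quick worked check, using the 4,661 Qdrant points reported in section 19 as the workload (the function name is illustrative, and the estimate covers embedding time only, not network or upsert overhead):

```python
TEI_RATE = 1711    # embeddings/sec, from the table above
OLLAMA_RATE = 8    # embeddings/sec, from the table above


def seconds_to_embed(count, rate):
    """Pure embedding time at a steady rate, ignoring network and upsert cost."""
    return count / rate


speedup = TEI_RATE / OLLAMA_RATE               # 213.875, i.e. ~214x
tei_seconds = seconds_to_embed(4661, TEI_RATE)       # under 3 seconds
ollama_seconds = seconds_to_embed(4661, OLLAMA_RATE)  # just under 10 minutes
```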
+
+### Qdrant Configuration
+
+- Collection: `recon_knowledge`
+- Distance: Cosine
+- HNSW indexing threshold: 20,000 (below this, brute-force search is used)
+- Current state: Brute-force (under 20K vectors) — this is normal and performant at current scale
+
+---
+
+## 15. Content Hashing
+
+- **PDF content**: `MD5(file_bytes)` — stable across renames, detects exact duplicates
+- **Web content**: `MD5(extracted_text)` — deduplicates by content, not URL
+- Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
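Hashing file bytes rather than the path is what makes the key stable across renames. A minimal sketch of a streaming hash, assuming a helper named `file_md5` (hypothetical; the real helper in `utils.py` may be organized differently):

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """MD5 of a file's bytes, read in chunks so large PDFs are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Reading in chunks keeps memory flat, though every byte still has to cross the NFS mount, which is why scanning the full library is slow.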
+
+---
+
+## 16. Source Type Handling
+
+| Source | Path Format | source_type | download_url | Badge |
+|--------|-------------|-------------|--------------|-------|
+| PDF | `/mnt/library/...` | document | `https://files.echo6.co/...` | PDF |
+| Web | `https://...` | web | Original URL | Web |
+| Intel | JSON feed | intel_feed | — | — |
+
+The `generate_download_url()` function in utils.py handles the routing:
+- URLs starting with `http://` or `https://` are returned as-is
+- File paths are converted to `files.echo6.co` URLs
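The routing rule can be sketched in a few lines. This is an assumption about the behavior, not the body of `generate_download_url()`: the function name `download_url_for` is hypothetical, the defaults mirror `book_server.base_url` and `book_server.strip_prefix` from the sample config, and percent-encoding the path is a choice made for this sketch.

```python
from urllib.parse import quote


def download_url_for(
    path,
    base_url="https://files.echo6.co",
    strip_prefix="/mnt/library",
):
    """Return web URLs unchanged; map library file paths onto the file server."""
    if path.startswith(("http://", "https://")):
        return path
    relative = path[len(strip_prefix):] if path.startswith(strip_prefix) else path
    # quote() keeps "/" by default and escapes spaces and other unsafe characters.
    return base_url.rstrip("/") + "/" + quote(relative.lstrip("/"))
```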
+
+---
+
+## 17. Lessons Learned
+
+### RECON Rebuild Lessons
+
+1. **Verify infrastructure before writing code.** Check Qdrant, TEI, Ollama connectivity first.
+2. **Dimensions are 1024, NOT 384.** BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
+3. **TEI >> Ollama for embeddings.** 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
+4. **Dynamic discovery over hardcoded paths.** Let the pipeline discover what's on disk rather than maintaining static file lists.
+5. **Web content uses the same pipeline.** After text extraction, web and PDF content follow identical enrichment and embedding paths.
+6. **Sitemap > link-following.** Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
+7. **Save to disk before DB operations.** Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
+8. **NFS over large file sets is slow.** Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
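Lesson 7 is an ordering guarantee: the JSON file must exist on disk before the database claims the work is done. Below is a minimal sketch of that ordering, assuming a hypothetical helper `save_concepts_then_update_db`; the column names come from the `documents` schema in section 6, while the temp-file-plus-rename step is an addition for this sketch rather than something the source describes.

```python
import json
import os
import sqlite3
import tempfile


def save_concepts_then_update_db(concepts, out_path, conn, doc_hash):
    """Write the concept JSON to disk first, then record progress in SQLite."""
    directory = os.path.dirname(os.path.abspath(out_path))
    os.makedirs(directory, exist_ok=True)

    # Step 1: write to a temp file and rename, so a crash never leaves
    # a half-written window file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(concepts, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_path, out_path)

    # Step 2: only now touch the database. If this fails, the JSON survives
    # and the counts can be rebuilt from disk.
    with conn:
        conn.execute(
            "UPDATE documents "
            "SET concepts_extracted = COALESCE(concepts_extracted, 0) + ?, "
            "    updated_at = datetime('now') "
            "WHERE hash = ?",
            (len(concepts), doc_hash),
        )
```

If the process dies between the two steps, the worst case is a stale counter, which costs nothing to repair; the reverse order could lose paid Gemini output.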
+
+### Operational Gotchas
+
+- `recon scan` can appear stuck on large PDFs over NFS — it's hashing, not hung
+- Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
+- Gemini rate limits hit with 16 workers — the `KeyRotator` distributes across 4 keys to mitigate
+- `iptables-persistent` hangs on interactive prompts in LXC containers — use manual persistence
+- The recon LXC has no tmux/screen — use `nohup` for long-running background tasks
+
+---
+
+## 18. Monitoring
+
+### Pipeline Status
+
+```bash
+# Quick status
+recon status
+
+# Dashboard
+http://100.64.0.24:8420
+
+# Tail logs
+tail -f /opt/recon/logs/recon.log
+
+# Pipeline run log (when running full background pipeline)
+tail -f /opt/recon/pipeline.log
+```
+
+### Health Checks
+
+```bash
+# Qdrant
+curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool
+
+# TEI
+curl -s http://100.64.0.14:8090/info
+
+# Ollama
+curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool
+
+# NFS mount
+df -h /mnt/library
+
+# Backup logs
+tail -20 /opt/recon/logs/backup.log
+```
+
+### Validation
+
+```bash
+# Quick validation
+recon validate
+
+# Deep validation (checks all files on disk)
+recon validate --deep
+```
+
+---
+
+## 19. Current State
+
+*As of 2026-02-16*
+
+### Pipeline Progress
+
+| Status | Count |
+|--------|-------|
+| Catalogued | 10,162 |
+| Queued | 8,982 |
+| Extracted | 872 |
+| Complete | 302 |
+| Failed | 2 |
+
+### Vector Database
+
+- Qdrant points: 4,661 (3,144 PDF + 1,517 web)
+- Segments: 8
+- Indexing: Brute-force (under 20K threshold)
+
+### Active Processing
+
+Full pipeline running in background via `nohup` — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.
+
+### Backups
+
+- Schedule: Every 6 hours (full) + every 2 hours (DB only)
+- Destination: Contabo VPS (`/opt/backups/recon/`)
+- Last verified: 2026-02-16 (220M total backup size)
+
+---
+
+## 20. Dependencies
+
+### System Packages
+
+- Python 3.11+
+- pdftotext (poppler-utils)
+- tesseract-ocr
+- sqlite3
+
+### Python Packages (key)
+
+| Package | Version | Purpose |
+|---------|---------|---------|
+| Flask | 3.1.2 | Web dashboard |
+| google-generativeai | 0.8.6 | Gemini API for enrichment |
+| qdrant-client | 1.16.2 | Vector database client |
+| PyPDF2 | 3.0.1 | PDF text extraction |
+| trafilatura | 2.0.0 | Web content extraction |
+| beautifulsoup4 | 4.14.3 | HTML parsing for crawler |
+| lxml | 6.0.2 | XML/HTML parsing |
+| pytesseract | 0.3.13 | OCR fallback |
+| requests | 2.32.5 | HTTP client |
+| PyYAML | 6.0.3 | Config file parsing |
+
+Full list in `requirements.txt`.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,89 @@
+# RECON -- Knowledge Extraction Pipeline
+
+Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.
+
+## Quick Start
+
+```bash
+# Activate
+cd /opt/recon && source venv/bin/activate
+
+# Scan library for new PDFs
+recon scan
+
+# Queue and process
+recon queue
+recon extract
+recon enrich
+recon embed
+
+# Or run full pipeline
+recon run
+
+# Ingest a web page
+recon ingest-url "https://example.com/article" --category "Category" --process
+
+# Crawl an entire docs site
+recon crawl "https://docs.example.com" --include /docs/ --category "Category" --process
+
+# Upload a PDF
+recon upload --file /path/to/document.pdf --category "Category"
+
+# Search
+recon search "water purification methods"
+
+# Check status
+recon status
+recon failures
+```
+
+## Dashboard
+
+http://100.64.0.24:8420
+
+## Services
+
+| Service | Location | Purpose |
+|---------|----------|---------|
+| RECON Dashboard | recon:8420 | Pipeline management + API |
+| Qdrant | cortex:6333 | Vector database |
+| TEI | cortex:8090 | Embeddings (1,711/sec) |
+| Ollama | cortex:11434 | Chat + fallback embeddings |
+| OpenWebUI | cortex:8080 (ai.echo6.co) | Aurora chat with RAG |
+| File Server | recon:8888 (files.echo6.co) | PDF downloads |
+
+## Key Paths
+
+| Path | Contents |
+|------|----------|
+| /opt/recon/ | Application code |
+| /opt/recon/data/concepts/ | Gemini extractions (**CRITICAL -- back these up**) |
+| /opt/recon/data/text/ | Extracted text |
+| /opt/recon/data/recon.db | SQLite status DB |
+| /mnt/library/ | PDF library (NFS from pi-nas) |
+
+## Backups
+
+Automated every 6 hours to Contabo VPS via `/opt/recon/scripts/backup.sh`.
+Concept JSONs are the most valuable data ($130+ of Gemini API work).
+Qdrant is NOT backed up -- rebuilt from JSONs in ~10 minutes via `recon rebuild`.
+
+## Monitoring
+
+```bash
+# Pipeline status
+recon status
+
+# Tail logs
+tail -f /opt/recon/logs/recon.log
+
+# Pipeline run log
+tail -f /opt/recon/pipeline.log
+
+# Validate consistency
+recon validate --deep
+```
+
+## Full Documentation
+
+See [PROJECT-BIBLE.md](PROJECT-BIBLE.md) for complete system documentation.
--- a/api.py
+++ b/api.py
@@ -0,0 +1,348 @@
+import json
+import os
+
+import requests as http_requests
+from flask import Flask, request, jsonify, redirect
+from qdrant_client import QdrantClient
+from qdrant_client.models import Filter, FieldCondition, MatchValue
+
+from .utils import get_config, content_hash, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.api')
+
+app = Flask(__name__)
+
+HTML_TEMPLATE = """<!DOCTYPE html>
+<html>
+<head>
+<title>RECON</title>
+<meta charset="utf-8">
+<style>
+* { margin: 0; padding: 0; box-sizing: border-box; }
+body { font-family: 'Courier New', monospace; background: #0a0a0a; color: #c0c0c0; }
+.header { background: #111; border-bottom: 1px solid #333; padding: 12px 24px; display: flex; justify-content: space-between; align-items: center; }
+.header h1 { color: #00ff41; font-size: 18px; letter-spacing: 2px; }
+.header .stats { font-size: 12px; color: #666; }
+.nav { background: #0d0d0d; border-bottom: 1px solid #222; padding: 8px 24px; }
+.nav a { color: #888; text-decoration: none; margin-right: 16px; font-size: 13px; }
+.nav a:hover, .nav a.active { color: #00ff41; }
+.content { padding: 24px; max-width: 1400px; margin: 0 auto; }
+.search-box { width: 100%; padding: 10px 16px; background: #111; border: 1px solid #333; color: #c0c0c0; font-family: inherit; font-size: 14px; margin-bottom: 16px; }
+.search-box:focus { outline: none; border-color: #00ff41; }
+table { width: 100%; border-collapse: collapse; font-size: 13px; }
+th { background: #111; color: #00ff41; text-align: left; padding: 8px 12px; border-bottom: 1px solid #333; }
+td { padding: 6px 12px; border-bottom: 1px solid #1a1a1a; }
+tr:hover { background: #111; }
+.status { padding: 2px 8px; border-radius: 3px; font-size: 11px; }
+.status-complete { color: #00ff41; }
+.status-enriched { color: #00bfff; }
+.status-extracted { color: #ffa500; }
+.status-failed { color: #ff4444; }
+.status-queued { color: #888; }
+.stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
+.stat-card { background: #111; border: 1px solid #222; padding: 16px; }
+.stat-card .label { color: #666; font-size: 11px; text-transform: uppercase; }
+.stat-card .value { color: #00ff41; font-size: 28px; margin-top: 4px; }
+.result { background: #111; border: 1px solid #222; padding: 16px; margin-bottom: 12px; }
+.result .title { color: #00ff41; font-size: 14px; margin-bottom: 4px; }
+.result .meta { color: #666; font-size: 11px; margin-bottom: 8px; }
+.result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
+.result .score { color: #ffa500; font-size: 12px; float: right; }
+.btn { background: #1a1a1a; border: 1px solid #333; color: #c0c0c0; padding: 6px 14px; cursor: pointer; font-family: inherit; font-size: 12px; }
+.btn:hover { border-color: #00ff41; color: #00ff41; }
+.domain-tag { display: inline-block; background: #1a1a1a; border: 1px solid #333; padding: 1px 6px; margin: 1px; font-size: 10px; color: #888; }
+</style>
+</head>
+<body>
+<div class="header">
+    <h1>RECON</h1>
+    <div class="stats">Knowledge Base Management System</div>
+</div>
+<div class="nav">
+    <a href="/" id="nav-dash">Dashboard</a>
+    <a href="/search" id="nav-search">Search</a>
+    <a href="/catalogue" id="nav-cat">Catalogue</a>
+    <a href="/failures" id="nav-fail">Failures</a>
+</div>
+<div class="content" id="main">
+    {{CONTENT}}
+</div>
+</body>
+</html>"""
+
+
+def render(content):
+    return HTML_TEMPLATE.replace('{{CONTENT}}', content)
+
+
+@app.route('/')
+def dashboard():
+    db = StatusDB()
+    counts = db.get_status_counts()
+    cat = counts.get('catalogue', {})
+    doc = counts.get('documents', {})
+
+    total_cat = sum(cat.values())
+    total_doc = sum(doc.values())
+    complete = doc.get('complete', 0)
+    failed = doc.get('failed', 0)
+
+    stats = f"""
+    <div class="stat-grid">
+        <div class="stat-card"><div class="label">Catalogued PDFs</div><div class="value">{total_cat}</div></div>
+        <div class="stat-card"><div class="label">In Pipeline</div><div class="value">{total_doc}</div></div>
+        <div class="stat-card"><div class="label">Complete</div><div class="value">{complete}</div></div>
+        <div class="stat-card"><div class="label">Failed</div><div class="value">{failed}</div></div>
+    </div>
+    <h3 style="color:#00ff41;margin-bottom:12px;">Pipeline Status</h3>
+    <table>
+    <tr><th>Status</th><th>Count</th></tr>
+    """
+    for status in ['queued', 'extracting', 'extracted', 'enriching', 'enriched', 'embedding', 'complete', 'failed']:
+        count = doc.get(status, 0)
+        stats += f'<tr><td><span class="status status-{status}">{status}</span></td><td>{count}</td></tr>\n'
+
+    stats += "</table>"
+
+    sources = db.source_breakdown()
+    if sources:
+        stats += '<h3 style="color:#00ff41;margin:24px 0 12px;">Sources</h3><table><tr><th>Source</th><th>Count</th><th>Size</th></tr>'
+        for s in sources:
+            size_mb = (s.get('total_bytes', 0) or 0) / (1024 * 1024)
+            stats += f"<tr><td>{s['source']}</td><td>{s['count']}</td><td>{size_mb:.1f} MB</td></tr>"
+        stats += "</table>"
+
+    return render(stats)
+
+
+@app.route('/search')
+def search_page():
+    query = request.args.get('q', '')
+    if not query:
+        content = """
+        <h3 style="color:#00ff41;margin-bottom:16px;">Semantic Search</h3>
+        <form method="get" action="/search">
+            <input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." autofocus>
+        </form>
+        <p style="color:#666;font-size:12px;margin-top:8px;">Enter a query to search across all embedded concepts.</p>
+        """
+        return render(content)
+
+    config = get_config()
+    limit = request.args.get('limit', 20, type=int)  # falls back to 20 on non-numeric input
+    source_filter = request.args.get('source_type', None)
+
+    try:
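+        # /api/embed is the Ollama endpoint; config.yaml names its keys ollama_host/ollama_port
+        url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"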
+        resp = http_requests.post(url, json={
+            "model": config['embedding']['model'],
+            "input": query
+        }, timeout=120)
+        resp.raise_for_status()
+        query_vector = resp.json()['embeddings'][0]
+
+        qdrant = QdrantClient(
+            host=config['vector_db']['host'],
+            port=config['vector_db']['port'],
+            timeout=60
+        )
+
+        search_filter = None
+        if source_filter:
+            search_filter = Filter(must=[
+                FieldCondition(key="source_type", match=MatchValue(value=source_filter))
+            ])
+
+        results = qdrant.query_points(
+            collection_name=config['vector_db']['collection'],
+            query=query_vector,
+            limit=limit,
+            query_filter=search_filter
+        ).points
+
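+        from html import escape  # stdlib; local import keeps this fix self-contained
+        safe_query = escape(query)  # user input: escape before echoing it into HTML
+        content = f"""
+        <h3 style="color:#00ff41;margin-bottom:16px;">Results for: {safe_query}</h3>
+        <form method="get" action="/search">
+            <input type="text" name="q" class="search-box" value="{safe_query}">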
+        </form>
+        <p style="color:#666;font-size:12px;margin-bottom:16px;">{len(results)} results</p>
+        """
+
+        for r in results:
+            p = r.payload
+            title = p.get('title', 'Untitled')
+            summary = p.get('summary', p.get('content', '')[:200])
+            score = r.score
+            domains = p.get('domain', [])
+            book = p.get('book_title', p.get('filename', ''))
+            source_type = p.get('source_type', 'document')
+
+            domain_tags = ''.join(f'<span class="domain-tag">{d}</span>' for d in (domains if isinstance(domains, list) else []))
+
+            content += f"""
+            <div class="result">
+                <span class="score">{score:.4f}</span>
+                <div class="title">{title}</div>
+                <div class="meta">{book} | {source_type} | {p.get('skill_level', 'unknown')}</div>
+                <div class="content-text">{summary}</div>
+                <div style="margin-top:6px;">{domain_tags}</div>
+            </div>
+            """
+
+        return render(content)
+
+    except Exception as e:
+        return render(f'<p style="color:#ff4444;">Search error: {e}</p>')
+
+
+@app.route('/catalogue')
+def catalogue_page():
+    db = StatusDB()
+    source = request.args.get('source', None)
+    category = request.args.get('category', None)
+    limit = request.args.get('limit', 100, type=int)  # falls back to 100 on non-numeric input
+
+    docs = db.get_all_documents(source=source, category=category, limit=limit)
+
+    content = '<h3 style="color:#00ff41;margin-bottom:16px;">Document Catalogue</h3>'
+
+    sources = db.get_sources()
+    if sources:
+        content += '<div style="margin-bottom:12px;">'
+        content += '<a href="/catalogue" class="btn" style="margin-right:4px;">All</a>'
+        for s in sources:
+            content += f'<a href="/catalogue?source={s}" class="btn" style="margin-right:4px;">{s}</a>'
+        content += '</div>'
+
+    content += """<table>
+    <tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>"""
+
+    for d in docs:
+        status = d.get('status', 'unknown')
+        content += f"""<tr>
+            <td>{d.get('filename', '?')}</td>
+            <td>{d.get('source', '')}</td>
+            <td><span class="status status-{status}">{status}</span></td>
+            <td>{d.get('pages_extracted', 0)}</td>
+            <td>{d.get('concepts_extracted', 0)}</td>
+            <td>{d.get('vectors_inserted', 0)}</td>
+        </tr>"""
+
+    content += "</table>"
+    return render(content)
+
+
+@app.route('/failures')
+def failures_page():
+    db = StatusDB()
+    failures = db.get_failures()
+
+    content = '<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>'
+
+    if not failures:
+        content += '<p style="color:#666;">No failures.</p>'
+        return render(content)
+
+    content += '<table><tr><th>Filename</th><th>Error</th><th>Retries</th><th>Actions</th></tr>'
+    for f in failures:
+        content += f"""<tr>
+            <td>{f.get('filename', '?')}</td>
+            <td style="color:#ff4444;font-size:11px;">{(f.get('error_message') or 'unknown')[:100]}</td>
+            <td>{f.get('retry_count', 0)}</td>
+            <td><form method="post" action="/api/retry/{f['hash']}" style="display:inline;">
+                <button class="btn" type="submit">Retry</button>
+            </form></td>
+        </tr>"""
+
+    content += "</table>"
+    return render(content)
+
+
+@app.route('/api/search', methods=['POST'])
+def api_search():
+    config = get_config()
+    data = request.get_json(silent=True)  # None instead of an HTTP error on a bad or non-JSON body
+    if not isinstance(data, dict) or 'query' not in data:
+        return jsonify({'error': 'Missing query'}), 400
+
+    query = data['query']
+    limit = data.get('limit', 20)
+    source_type = data.get('source_type', None)
+
+    try:
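+        # /api/embed is the Ollama endpoint; config.yaml names its keys ollama_host/ollama_port
+        url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"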
+        resp = http_requests.post(url, json={
+            "model": config['embedding']['model'],
+            "input": query
+        }, timeout=120)
+        resp.raise_for_status()
+        query_vector = resp.json()['embeddings'][0]
+
+        qdrant = QdrantClient(
+            host=config['vector_db']['host'],
+            port=config['vector_db']['port'],
+            timeout=60
+        )
+
+        search_filter = None
+        if source_type:
+            search_filter = Filter(must=[
+                FieldCondition(key="source_type", match=MatchValue(value=source_type))
+            ])
+
+        results = qdrant.query_points(
+            collection_name=config['vector_db']['collection'],
+            query=query_vector,
+            limit=limit,
+            query_filter=search_filter
+        ).points
+
+        return jsonify({
+            'query': query,
+            'results': [
+                {
+                    'score': r.score,
+                    'payload': r.payload
+                }
+                for r in results
+            ]
+        })
+
+    except Exception as e:
+        return jsonify({'error': str(e)}), 500
+
+
+@app.route('/api/status')
+def api_status():
+    db = StatusDB()
+    return jsonify(db.get_status_counts())
+
+
+@app.route('/api/retry/<file_hash>', methods=['POST'])
+def api_retry(file_hash):
+    db = StatusDB()
+    db.increment_retry(file_hash)
+    return redirect('/failures')
+
+
+@app.route('/api/ingest', methods=['POST'])
+def api_ingest():
+    from .ingester import ingest_intel
+    data = request.get_json(silent=True)  # None instead of an HTTP error on a bad or non-JSON body
+    if not data:
+        return jsonify({'error': 'No JSON body'}), 400
+
+    config = get_config()
+    result = ingest_intel(data, config)
+    if result is not None:
+        return jsonify({'intel_id': result})
+    return jsonify({'error': 'Ingestion failed'}), 500
+
+
+def run_server():
+    config = get_config()
+    host = config['web']['host']
+    port = config['web']['port']
+    logger.info(f"Starting RECON web dashboard on {host}:{port}")
+    app.run(host=host, port=port, debug=False)
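
The page handlers in api.py build HTML with f-strings, so values such as the search query or a payload title reach the browser verbatim. A minimal sketch of escaping at render time with the standard library's `html.escape`. `render_result_card` is a hypothetical helper, and its markup only mirrors the `.result` block in `search_page`.

```python
from html import escape


def render_result_card(title, summary, score):
    """Return one search-result card with every dynamic text value HTML-escaped."""
    return (
        '<div class="result">'
        f'<span class="score">{score:.4f}</span>'
        f'<div class="title">{escape(str(title))}</div>'
        f'<div class="content-text">{escape(str(summary))}</div>'
        '</div>'
    )


# A hostile title and a summary containing quotes and an ampersand.
card = render_result_card("<script>alert(1)</script>", 'Boil water & "filter" it', 0.9123)
```

`html.escape` also escapes quotes by default, so the same call is safe for values placed inside attributes such as `value="..."`.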
--- a/config.yaml
+++ b/config.yaml
@ -0,0 +1,440 @@
+# RECON Configuration
+# See PROJECT-BIBLE.md Section 11 for full documentation
+
+# Root path for the PDF library (NFS mount from pi-nas)
+library_root: /mnt/library
+
+processing:
+  max_pdf_size_mb: 2000         # Raised from 200MB default for large scanned books
+  extract_workers: 4          # Concurrent PDF extraction threads
+  enrich_workers: 16          # Concurrent Gemini enrichment threads (4 keys x 4)
+  embed_workers: 4            # Concurrent embedding threads
+  enrich_window_size: 5       # Pages per enrichment window (sent to Gemini)
+  embed_batch_size: 500       # Vectors per Qdrant upsert batch
+  rate_limit_delay: 0.1       # Delay between Gemini API calls (seconds)
+  max_retries: 5              # Max retries for failed documents
+  extract_timeout: 1800      # Max seconds per document extraction (30 min, allows vision OCR)
+  page_timeout: 30           # Max seconds per page extraction
+  enrich_max_retries: 5        # Max retries per enrichment window
+  enrich_base_delay: 5.0       # Base backoff delay (seconds) — ~5s, 10s, 20s, 40s, 80s
+  enrich_max_delay: 120.0      # Maximum backoff delay cap (seconds)
+
+embedding:
+  backend: tei                # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
+  tei_host: 100.64.0.14       # TEI server (cortex)
+  tei_port: 8090              # TEI HTTP port
+  ollama_host: 100.64.0.14    # Ollama server (cortex) — fallback only
+  ollama_port: 11434          # Ollama HTTP port
+  model: bge-m3               # Embedding model name
+  dimensions: 1024            # CRITICAL: bge-m3 is 1024-dim, NOT 384
+  batch_size: 128             # Embeddings per TEI batch request
+
+sparse_embedding:
+  enabled: true
+  host: 100.64.0.14            # Sparse embedding service (cortex)
+  port: 8091                   # Sparse embedding HTTP port
+
+vector_db:
+  host: 100.64.0.14           # Qdrant server (cortex)
+  port: 6333                  # Qdrant HTTP port
+  collection: recon_knowledge_hybrid # Collection name
+
+gemini:
+  model: gemini-2.0-flash     # Gemini model for enrichment
+  response_mime_type: application/json  # Force JSON output from Gemini
+
+web:
+  port: 8420                  # Dashboard HTTP port
+  host: 0.0.0.0               # Bind address (all interfaces)
+
+paths:
+  base: /opt/recon             # Application root
+  data: /opt/recon/data        # Data directory
+  text: /opt/recon/data/text   # Extracted text output (data/text/{hash}/page_NNNN.txt)
+  concepts: /opt/recon/data/concepts  # Enriched concept JSONs (data/concepts/{hash}/window_N.json)
+  intel: /opt/recon/data/intel # ARGUS intel feeds
+  logs: /opt/recon/logs        # Log files
+  db: /opt/recon/data/recon.db # SQLite database (WAL mode)
+
+book_server:
+  base_url: https://files.echo6.co   # Public URL prefix for PDF downloads
+  strip_prefix: /mnt/library         # Path prefix stripped when generating download URLs
+
+upload_paths:                  # Category -> filesystem path mapping for uploads
+  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
+  Military Doctrine: /mnt/library/Army_Pubs/Uploads
+  Gaming: /mnt/library/Gaming
+  Reference: /mnt/library/Reference
+  Technical: /mnt/library/Technical
+  default: /mnt/library        # Fallback for unknown categories
+
+web_scraper:
+  words_per_page: 2000         # Target words per page chunk for web content
+  fetch_timeout: 30            # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0        # Delay between URL fetches (seconds)
+  max_batch_size: 50           # Max URLs per batch ingest
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+
+crawler:
+  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
+  fetch_timeout: 30            # HTTP request timeout (seconds)
+  rate_limit_delay: 1.0        # Delay between page fetches (seconds)
+  max_pages: 500               # Max pages to discover per crawl
+  max_depth: 3                 # Max link-following depth (BFS only, not sitemap)
+  inter_site_cooldown: 30      # Seconds to wait between crawling different sites
+  recrawl_interval_days: 7     # Skip sites crawled within this many days
+
+  default_exclude:             # URL patterns always excluded from crawling
+    - /search
+    - /404
+    - /login
+    - /signup
+    - /auth/
+    - /api/
+    - /assets/
+    - /static/
+    - /cart
+    - /checkout
+    - /account
+    - /register
+    - /subscribe
+    - /membership
+    - /shop
+    - /store
+    - /product
+    - /wp-admin
+    - /feed
+    - /wp-json
+    - /xmlrpc
+    - /.well-known
+    - /cdn-cgi
+
+  # ─── Crawl Targets ─────────────────────────────────────────────
+  # Sites are crawled by the scheduler loop in tier order (1 first).
+  # Per-site delay overrides global rate_limit_delay for that site.
+  # Per-site max_pages/max_depth override global defaults.
+
+  # Disabled 2026-04-14 for refactor — see refactored-recon repo for context
+  sites: []
+
+  # sites:
+  #
+  # # ═══ TIER 1 — Free, authoritative, high-density ═══
+  #
+  # - url: https://hesperian.org/all-hesperian-health-guides
+  # category: Medical
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Free health guides — WTIND, midwives, community health"
+  #
+  # - url: https://swsbm.com
+  # category: Medical
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Michael Moore's entire free clinical herbal library — PDFs"
+  #
+  # - url: https://swsbm.henriettesherbal.com
+  # category: Medical
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Mirror of Moore's library — grab both"
+  #
+  # - url: https://nchfp.uga.edu
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 2.0
+  # tier: 1
+  # notes: "USDA canning/preservation safety authority"
+  #
+  # - url: https://extension.uidaho.edu
+  # category: Foundational Skills
+  # max_depth: 3
+  # delay: 2.0
+  # tier: 1
+  # notes: "Idaho-specific — soil, water, crops, livestock"
+  #
+  # - url: https://extension.usu.edu
+  # category: Foundational Skills
+  # max_depth: 3
+  # delay: 2.0
+  # tier: 1
+  # notes: "Utah State — Idaho-adjacent climate"
+  #
+  # - url: https://attra.ncat.org
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "ATTRA sustainable ag — hundreds of free publications"
+  #
+  # - url: https://pfaf.org
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Plants For A Future — 7,000+ edible/medicinal plant profiles"
+  #
+  # - url: https://eattheweeds.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Green Deane — 1,000+ foraging plant articles"
+  #
+  # - url: https://lowtechmagazine.com
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Exceptional low-tech systems analysis"
+  #
+  # - url: https://appropedia.org
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Appropriate technology wiki"
+  #
+  # - url: https://journeytoforever.org
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "VITA manuals, biodiesel, biogas, hand tools archive"
+  #
+  # - url: https://cd3wd.com
+  # category: Off-Grid Systems
+  # max_depth: 2
+  # delay: 3.0
+  # tier: 1
+  # notes: "1,050+ appropriate technology eBooks — index pages only"
+  #
+  # - url: https://practicalselfreliance.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Ashley Adamant — foraging, preservation, homesteading"
+  #
+  # - url: https://open.oregonstate.edu/permaculture
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Millison's free permaculture textbook"
+  #
+  # - url: https://open.oregonstate.edu/permaculturedesign
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Millison's advanced permaculture textbook"
+  #
+  # - url: https://mushroomexpert.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 1
+  # notes: "Michael Kuo — mushroom ID, taxonomy, regional coverage"
+  #
+  # # ═══ TIER 2 — High value, second pass ═══
+  #
+  # - url: https://motherearthnews.com
+  # category: Foundational Skills
+  # max_depth: 2
+  # max_pages: 200
+  # delay: 8.0
+  # tier: 2
+  # notes: "50 years of homesteading archive — large commercial site, be polite"
+  #
+  # - url: https://permacultureresearchinstitute.com
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Geoff Lawton — articles, case studies"
+  #
+  # - url: https://learnyourland.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Adam Haritan — foraging articles"
+  #
+  # - url: https://herbswithRosalee.com
+  # category: Medical
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Rosalee de la Foret — clinical herbalism articles"
+  #
+  # - url: https://commonwealthherbs.com
+  # category: Medical
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Katja and Ryn — clinical herbalism"
+  #
+  # - url: https://soilfoodweb.com
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Elaine Ingham soil biology — archive before it goes dark"
+  #
+  # - url: https://rocketstoves.com
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 2
+  # notes: "Ianto Evans — rocket mass heater designs and PDFs"
+  #
+  # - url: https://farmsteadmeatsmith.com
+  # category: Sustainment Systems
+  # max_depth: 2
+  # delay: 5.0
+  # tier: 2
+  # notes: "Brandon Sheard — butchering articles (free content only)"
+  #
+  # - url: https://deeranddeerhunting.com
+  # category: Sustainment Systems
+  # max_depth: 2
+  # delay: 5.0
+  # tier: 2
+  # notes: "Field dressing, processing, hunting technique library"
+  #
+  # # ═══ TIER 3 — Government (authoritative) ═══
+  #
+  # - url: https://plants.usda.gov
+  # category: Sustainment Systems
+  # max_depth: 2
+  # delay: 2.0
+  # tier: 3
+  # notes: "USDA native plant database"
+  #
+  # - url: https://ars.usda.gov
+  # category: Sustainment Systems
+  # max_depth: 2
+  # delay: 2.0
+  # tier: 3
+  # notes: "USDA Agricultural Research publications"
+  #
+  # - url: https://nrcs.usda.gov
+  # category: Off-Grid Systems
+  # max_depth: 2
+  # delay: 2.0
+  # tier: 3
+  # notes: "Soil surveys, conservation practice standards"
+  #
+  # - url: https://ready.gov
+  # category: Scenario Playbooks
+  # max_depth: 3
+  # delay: 2.0
+  # tier: 3
+  # notes: "FEMA emergency preparedness guides"
+  #
+  # - url: https://emergency.cdc.gov
+  # category: Medical
+  # max_depth: 3
+  # delay: 2.0
+  # tier: 3
+  # notes: "Public health emergency references"
+  #
+  # - url: https://agri.idaho.gov
+  # category: Foundational Skills
+  # max_depth: 2
+  # delay: 2.0
+  # tier: 3
+  # notes: "Idaho Dept of Agriculture — local relevance"
+  #
+  # - url: https://driveonwood.com
+  # category: Off-Grid Systems
+  # max_depth: 3
+  # delay: 3.0
+  # tier: 3
+  # notes: "Wood gasification — FEMA manual + modern improvements"
+  #
+  # # ═══ TIER 4 — Selective scrape (specific sections only) ═══
+  #
+  # - url: https://richsoil.com
+  # category: Off-Grid Systems
+  # max_depth: 2
+  # delay: 5.0
+  # tier: 4
+  # notes: "Paul Wheaton — rocket mass heaters, natural building"
+  #
+  # - url: https://wildfoodgirl.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 4
+  # notes: "Colorado foraging — Mountain West species"
+  #
+  # - url: https://foragersharvest.com
+  # category: Sustainment Systems
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 4
+  # notes: "Sam Thayer's site — articles"
+  #
+  # - url: https://mountainroseherbs.com/blog
+  # category: Medical
+  # max_depth: 2
+  # delay: 5.0
+  # tier: 4
+  # notes: "Herb profiles and preparations — blog section only"
+  #
+  # - url: https://herbalprepper.com
+  # category: Medical
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 4
+  # notes: "Cat Ellis — grid-down herbalism"
+  #
+  # - url: https://prolongedfieldcare.org
+  # category: Medical
+  # max_depth: 3
+  # delay: 5.0
+  # tier: 4
+  # notes: "PFC Collective — austere medical protocols"
+  #
+service:
+  scan_interval: 3600          # Seconds between library scans (1 hour)
+  stage_poll_interval: 30      # Seconds stages sleep when idle
+  progress_interval: 60        # Seconds between progress log lines
+
+peertube:
+  api_base: http://192.168.1.170       # Internal PeerTube API (CT 110 nginx)
+  public_url: https://stream.echo6.co  # Public URL for video links
+  fetch_timeout: 30                     # HTTP timeout for API/VTT requests
+  rate_limit_delay: 0.5                 # Delay between video ingestions (seconds)
+
+# Stream B: New Library Pipeline
+new_pipeline:
+  # Disabled 2026-04-14 for refactor — see refactored-recon repo for context
+  enabled: false
+  acquired_dir: /mnt/library/_acquired
+  ingest_dir: /mnt/library/_ingest
+  duplicates_dir: /mnt/library/_ingest/_duplicates
+  failed_dir: /mnt/library/_ingest/_failed
+  poll_interval: 60
+  mtime_stability: 10
+  pilot_domain: "Civil Organization"
+  spaces_to_underscores: true
+
+# Refactored pipeline configuration (2026-04-14)
+# See https://forge.echo6.co/matt/refactored-recon for design
+pipeline:
+  acquired_root: /opt/recon/data/acquired
+  processing_root: /opt/recon/data/processing
+  # Subfolder name -> processor module mapping
+  # Processors do not exist yet; this is scaffolding for Phase 3+
+  dispatch:
+    pdf: pdf_processor
+    stream: transcript_processor
+    html: html_processor
+  # mtime stability threshold for picking up files from acquired/
+  mtime_stability_seconds: 10
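
The `processing` block of config.yaml documents the enrichment backoff as roughly 5s, 10s, 20s, 40s, 80s, capped by `enrich_max_delay`. A minimal sketch of that schedule, assuming plain doubling without jitter. `backoff_delays` is a hypothetical helper, and the enricher in this commit may compute its delays differently.

```python
def backoff_delays(base_delay, max_delay, max_retries):
    """Return the wait before each retry: base_delay doubled per attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


# Values taken from the processing block of config.yaml.
delays = backoff_delays(base_delay=5.0, max_delay=120.0, max_retries=5)

# With more retries than configured, the cap takes over from the sixth attempt.
capped = backoff_delays(base_delay=5.0, max_delay=120.0, max_retries=7)
```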
--- a/enricher.py
+++ b/enricher.py
@ -0,0 +1,264 @@
+import json
+import os
+import re
+import time
+import traceback
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+import google.generativeai as genai
+
+from .utils import get_config, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.enricher')
+
+
+def repair_json(text):
+    """Attempt to repair common LLM JSON output issues including truncation."""
+    # Remove control characters except newlines and tabs
+    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
+    # Remove trailing commas before } or ]
+    text = re.sub(r',\s*([}\]])', r'\1', text)
+
+    # Handle truncated JSON: try to find the last complete object in the array
+    try:
+        json.loads(text, strict=False)
+        return text
+    except json.JSONDecodeError:
+        pass
+
+    # Scan forward, tracking string and escape state, to find the last
+    # top-level } that closed cleanly; the array is then closed after it
+    last_complete = -1
+    depth_brace = 0
+    depth_bracket = 0
+    in_string = False
+    escape = False
+
+    for i, ch in enumerate(text):
+        if escape:
+            escape = False
+            continue
+        if ch == '\\' and in_string:
+            escape = True
+            continue
+        if ch == '"' and not escape:
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == '{':
+            depth_brace += 1
+        elif ch == '}':
+            depth_brace -= 1
+            if depth_brace == 0:
+                last_complete = i
+                arrays_open = depth_bracket  # arrays still open at this point
+        elif ch == '[':
+            depth_bracket += 1
+        elif ch == ']':
+            depth_bracket -= 1
+
+    if last_complete > 0:
+        truncated = text[:last_complete + 1].rstrip().rstrip(',')
+        # Close the arrays still open at the cut; unlike str.count, the scan
+        # above ignores brackets that appear inside string values
+        truncated += ']' * max(arrays_open, 0)
+        return truncated
+
+    return text
+
+ENRICH_PROMPT = """Extract knowledge concepts from this document text.
+
+A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
+
+For each concept, provide ALL fields:
+
+Required:
+- content: Full text of the concept (complete procedure, definition, etc.)
+- summary: 1-2 sentence summary
+- title: Brief descriptive title
+- domain: Array of 1-5 from: Foundational Skills, Sustainment Systems, Defense & Tactics, Off-Grid Systems, Communications, Scenario Playbooks, Reference
+- subdomain: Array of specific subcategories (up to 10)
+- keywords: Array of 3-30 searchable terms
+- skill_level: novice | intermediate | advanced
+- key_facts: Array of specific extractable claims, measurements, data points
+
+Optional (include when present):
+- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
+- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
+- chapter: Chapter name if identifiable
+- page_ref: Page reference
+- notes: Any additional context
+
+Return JSON array. If no extractable concepts, return [].
+
+Document text:
+"""
+
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self.index = 0
+
+    def next(self):
+        if not self.keys:
+            raise ValueError("No Gemini API keys configured")
+        key = self.keys[self.index % len(self.keys)]
+        self.index += 1
+        return key
+
+
+def enrich_window(text, key, config):
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        config['gemini']['model'],
+        generation_config={"response_mime_type": config['gemini']['response_mime_type']}
+    )
+    response = model.generate_content(ENRICH_PROMPT + text)
+    raw = response.text
+    try:
+        return json.loads(raw, strict=False)
+    except json.JSONDecodeError:
+        repaired = repair_json(raw)
+        return json.loads(repaired, strict=False)
+
+
+def enrich_single(file_hash, db, config, key_rotator):
+    doc = db.get_document(file_hash)
+    if not doc:
+        return False
+
+    text_dir = os.path.join(config['paths']['text'], file_hash)
+    concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
+    window_size = config['processing']['enrich_window_size']
+    delay = config['processing']['rate_limit_delay']
+    max_retries = config['processing']['max_retries']
+
+    if not os.path.exists(text_dir):
+        db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
+        return False
+
+    db.update_status(file_hash, 'enriching')
+
+    try:
+        os.makedirs(concepts_dir, exist_ok=True)
+
+        page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
+        if not page_files:
+            db.mark_failed(file_hash, "No page files found")
+            return False
+
+        pages_text = []
+        for pf in page_files:
+            with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
+                pages_text.append(f.read())
+
+        windows = []
+        for i in range(0, len(pages_text), window_size):
+            window_pages = pages_text[i:i + window_size]
+            combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
+            windows.append((i, combined))
+
+        total_concepts = 0
+        for w_idx, (start_page, window_text) in enumerate(windows):
+            window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")
+
+            if os.path.exists(window_file):
+                with open(window_file, encoding='utf-8') as f:
+                    existing = json.load(f)
+                total_concepts += len(existing)
+                logger.debug(f"  Window {w_idx+1} already exists, skipping")
+                continue
+
+            if len(window_text.strip()) < 50:
+                with open(window_file, 'w') as f:
+                    json.dump([], f)
+                continue
+
+            concepts = None
+            for attempt in range(max_retries):
+                try:
+                    key = key_rotator.next()
+                    concepts = enrich_window(window_text, key, config)
+                    break
+                except Exception as e:
+                    logger.warning(f"  Window {w_idx+1} attempt {attempt+1} failed: {e}")
+                    if attempt < max_retries - 1:
+                        time.sleep(delay * (attempt + 1) * 2)
+
+            if concepts is None:
+                db.mark_failed(file_hash, f"All retries failed for window {w_idx+1}")
+                return False
+
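+            if not isinstance(concepts, list):
+                concepts = [concepts] if isinstance(concepts, dict) else []
+            # Drop non-dict items so the metadata tagging below cannot raise TypeError
+            concepts = [c for c in concepts if isinstance(c, dict)]
+
+            for concept in concepts: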
+                concept['_window'] = w_idx + 1
+                concept['_start_page'] = start_page + 1
+                concept['_doc_hash'] = file_hash
+
+            # Write the window file before updating counters so an interrupted run can resume from it
+            with open(window_file, 'w', encoding='utf-8') as f:
+                json.dump(concepts, f, indent=2, ensure_ascii=False)
+
+            total_concepts += len(concepts)
+            logger.debug(f"  Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
+            time.sleep(delay)
+
+        meta = {
+            'hash': file_hash,
+            'total_windows': len(windows),
+            'total_concepts': total_concepts,
+            'window_size': window_size,
+            'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
+        }
+        with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
+            json.dump(meta, f, indent=2)
+
+        db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts)
+        logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts from {len(windows)} windows")
+        return True
+
+    except Exception as e:
+        logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
+        db.mark_failed(file_hash, str(e))
+        return False
+
+
+def run_enrichment(workers=None, limit=None):
+    config = get_config()
+    db = StatusDB()
+    workers = workers or config['processing']['enrich_workers']
+
+    keys = config.get('gemini_keys', [])
+    if not keys:
+        logger.error("No Gemini API keys configured in .env")
+        return 0
+
+    key_rotator = KeyRotator(keys)
+
+    extracted = db.get_by_status('extracted', limit=limit)
+    if not extracted:
+        logger.info("No extracted documents to enrich")
+        return 0
+
+    logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
+    success = 0
+
+    with ThreadPoolExecutor(max_workers=workers) as pool:
+        futures = {
+            pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
+            for doc in extracted
+        }
+        for future in as_completed(futures):
+            doc = futures[future]
+            try:
+                if future.result():
+                    success += 1
+            except Exception as e:
+                logger.error(f"Worker error for {doc['hash']}: {e}")
+
+    logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
+    return success
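
The page-windowing step inside `enrich_single` can be looked at in isolation. The sketch below is a hypothetical standalone helper (`build_windows` is not part of the codebase) that mirrors the loop above, assuming the page texts are already loaded as strings. It shows how the 0-based start index and the 1-based page markers relate.

```python
def build_windows(pages_text, window_size):
    """Group page texts into fixed-size windows with page markers.

    Returns a list of (start_index, combined_text) tuples, where
    start_index is the 0-based index of the window's first page.
    """
    windows = []
    for i in range(0, len(pages_text), window_size):
        window_pages = pages_text[i:i + window_size]
        # Page markers are 1-based, matching the "--- Page N ---" headers
        combined = "\n\n".join(
            f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages)
        )
        windows.append((i, combined))
    return windows


# Illustrative input only; real pages come from page_*.txt files
pages = ["alpha", "beta", "gamma", "delta", "epsilon"]
windows = build_windows(pages, window_size=2)
for start, text in windows:
    print(start, repr(text))
```

The last window may hold fewer pages than `window_size`. Here the fifth page ends up alone in the third window.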
--- a/lib/__init__.py
+++ b/lib/__init__.py
--- a/lib/api.py
+++ b/lib/api.py
--- a/lib/crawler.py
+++ b/lib/crawler.py
@ -0,0 +1,432 @@
+"""
+RECON Site Crawler — URL discovery for bulk web ingestion.
+
+Two discovery strategies:
+1. Sitemap-based (preferred) — parses sitemap.xml for all URLs
+2. Link-following (fallback) — crawls from root URL following internal links
+
+Discovered URLs are fed into web_scraper.ingest_url() for processing.
+"""
+
+import re
+import time
+from collections import deque
+from urllib.parse import urlparse, urljoin, urldefrag
+
+import requests
+from lxml import etree
+
+from .utils import get_config, setup_logging
+
+logger = setup_logging('recon.crawler')
+
+
+def _get_crawler_config(config=None):
+    """Load crawler config with defaults."""
+    if config is None:
+        config = get_config()
+    crawler_cfg = config.get('crawler', {})
+    web_cfg = config.get('web_scraper', {})
+    return {
+        'user_agent': (
+            crawler_cfg.get('user_agent') or
+            web_cfg.get('user_agent') or
+            'Mozilla/5.0 (compatible; RECON/1.0)'
+        ),
+        'fetch_timeout': crawler_cfg.get('fetch_timeout', 30),
+        'rate_limit_delay': crawler_cfg.get('rate_limit_delay', 1.0),
+        'max_pages': crawler_cfg.get('max_pages', 500),
+        'max_depth': crawler_cfg.get('max_depth', 3),
+        'default_exclude': crawler_cfg.get('default_exclude', [
+            '/search', '/404', '/login', '/signup', '/auth/', '/api/', '/assets/', '/static/'
+        ]),
+    }
+
+
+# ─── Sitemap Discovery ─────────────────────────────────────────────
+
+def discover_sitemap_url(base_url, config=None):
+    """
+    Find the sitemap URL for a site.
+
+    Checks: robots.txt Sitemap: directive, /sitemap.xml,
+    /sitemap_index.xml, /sitemap-0.xml.
+
+    Returns sitemap URL or None.
+    """
+    cfg = _get_crawler_config(config)
+    headers = {'User-Agent': cfg['user_agent']}
+    parsed = urlparse(base_url)
+    root = f"{parsed.scheme}://{parsed.netloc}"
+
+    # Check robots.txt first
+    try:
+        resp = requests.get(
+            f"{root}/robots.txt",
+            headers=headers,
+            timeout=cfg['fetch_timeout']
+        )
+        if resp.status_code == 200:
+            for line in resp.text.splitlines():
+                if line.strip().lower().startswith('sitemap:'):
+                    # split(':', 1) cuts only at the first colon, so the URL's own
+                    # "https://" survives intact in the second element.
+                    sitemap_url = line.split(':', 1)[1].strip()
+                    if not sitemap_url:
+                        continue
+                    sitemap_url = urljoin(root, sitemap_url)  # resolve relative paths
+                    logger.info(f"Found sitemap in robots.txt: {sitemap_url}")
+                    return sitemap_url
+    except Exception as e:
+        logger.debug(f"robots.txt fetch failed: {e}")
+
+    # Try common sitemap locations
+    candidates = [
+        f"{root}/sitemap.xml",
+        f"{root}/sitemap_index.xml",
+        f"{root}/sitemap-0.xml",
+    ]
+
+    for url in candidates:
+        try:
+            resp = requests.head(
+                url,
+                headers=headers,
+                timeout=cfg['fetch_timeout'],
+                allow_redirects=True
+            )
+            if resp.status_code == 200:
+                logger.info(f"Found sitemap at: {url}")
+                return url
+        except Exception:
+            continue
+
+    logger.warning(f"No sitemap found for {base_url}")
+    return None
+
+
+def parse_sitemap(sitemap_url, config=None):
+    """
+    Parse a sitemap XML and return all page URLs.
+
+    Handles standard sitemaps (<urlset>) and sitemap indexes
+    (<sitemapindex>) with recursive sub-sitemap fetching.
+    """
+    cfg = _get_crawler_config(config)
+    headers = {'User-Agent': cfg['user_agent']}
+    all_urls = []
+
+    def _fetch_and_parse(url, depth=0):
+        if depth > 3:
+            return
+
+        try:
+            resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
+            resp.raise_for_status()
+        except Exception as e:
+            logger.error(f"Failed to fetch sitemap {url}: {e}")
+            return
+
+        try:
+            root = etree.fromstring(resp.content)
+        except etree.XMLSyntaxError as e:
+            logger.error(f"Invalid XML in sitemap {url}: {e}")
+            return
+
+        nsmap = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
+
+        # Check if this is a sitemap index
+        sitemap_locs = root.findall('.//ns:sitemap/ns:loc', nsmap)
+        if sitemap_locs:
+            logger.info(f"Sitemap index at {url} — {len(sitemap_locs)} sub-sitemaps")
+            for loc in sitemap_locs:
+                if loc.text:
+                    _fetch_and_parse(loc.text.strip(), depth + 1)
+            return
+
+        # Standard sitemap — extract URLs
+        url_locs = root.findall('.//ns:loc', nsmap)
+
+        # Fallback: try without namespace
+        if not url_locs:
+            url_locs = root.findall('.//loc')
+
+        for loc in url_locs:
+            if loc.text:
+                all_urls.append(loc.text.strip())
+
+        logger.info(f"Parsed {len(url_locs)} URLs from {url}")
+
+    _fetch_and_parse(sitemap_url)
+
+    # Deduplicate preserving order
+    seen = set()
+    unique = []
+    for url in all_urls:
+        url_clean = urldefrag(url)[0]
+        if url_clean not in seen:
+            seen.add(url_clean)
+            unique.append(url_clean)
+
+    logger.info(f"Total unique URLs from sitemap: {len(unique)}")
+    return unique
+
+
+# ─── Link-Following Discovery (Fallback) ───────────────────────────
+
+def crawl_links(base_url, max_depth=3, max_pages=500, config=None):
+    """
+    Discover URLs by following internal links (BFS).
+    Fallback when no sitemap is available.
+    """
+    from bs4 import BeautifulSoup
+
+    cfg = _get_crawler_config(config)
+    headers = {'User-Agent': cfg['user_agent']}
+
+    parsed_base = urlparse(base_url)
+    base_domain = parsed_base.netloc
+
+    discovered = []
+    visited = set()
+    queue = deque([(base_url, 0)])
+
+    skip_extensions = (
+        '.pdf', '.png', '.jpg', '.jpeg', '.gif', '.svg',
+        '.css', '.js', '.zip', '.tar', '.gz', '.mp4', '.mp3',
+        '.ico', '.woff', '.woff2', '.ttf', '.eot',
+    )
+    skip_paths = (
+        '/tag/', '/tags/', '/page/', '/feed/', '/rss/',
+        '/wp-json/', '/wp-admin/', '/wp-includes/',
+    )
+
+    while queue and len(discovered) < max_pages:
+        url, depth = queue.popleft()
+        url = urldefrag(url)[0]
+
+        if url in visited:
+            continue
+        if depth > max_depth:
+            continue
+
+        visited.add(url)
+        discovered.append(url)
+
+        if depth >= max_depth:
+            continue
+
+        try:
+            resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
+            if resp.status_code != 200:
+                continue
+            if 'text/html' not in resp.headers.get('content-type', ''):
+                continue
+        except Exception:
+            continue
+
+        try:
+            soup = BeautifulSoup(resp.text, 'lxml')
+        except Exception:
+            continue
+
+        for a_tag in soup.find_all('a', href=True):
+            href = a_tag['href']
+            full_url = urljoin(url, href)
+            full_url = urldefrag(full_url)[0]
+
+            parsed = urlparse(full_url)
+            if parsed.netloc != base_domain:
+                continue
+            if any(parsed.path.lower().endswith(ext) for ext in skip_extensions):
+                continue
+            if any(skip in parsed.path.lower() for skip in skip_paths):
+                continue
+
+            if full_url not in visited:
+                queue.append((full_url, depth + 1))
+
+        time.sleep(cfg['rate_limit_delay'])
+
+    logger.info(f"Link crawl: {len(discovered)} URLs (visited {len(visited)}, depth {max_depth})")
+    return discovered
+
+
+# ─── URL Filtering ──────────────────────────────────────────────────
+
+def filter_urls(urls, include=None, exclude=None):
+    """
+    Filter URLs by path prefix include/exclude rules.
+
+    include: URL must match at least one prefix (if provided)
+    exclude: URL must not match any prefix
+    """
+    filtered = []
+
+    for url in urls:
+        path = urlparse(url).path
+
+        if include:
+            if not any(path.startswith(prefix) for prefix in include):
+                continue
+
+        if exclude:
+            if any(path.startswith(prefix) for prefix in exclude):
+                continue
+
+        filtered.append(url)
+
+    logger.info(f"Filtered {len(urls)} -> {len(filtered)} URLs "
+                f"(include={include}, exclude={exclude})")
+    return filtered
+
+
+# ─── Main Crawl Orchestrator ────────────────────────────────────────
+
+def crawl_site(
+    base_url,
+    category='Web',
+    source=None,
+    include=None,
+    exclude=None,
+    max_pages=None,
+    max_depth=None,
+    delay=None,
+    dry_run=False,
+    use_sitemap=True,
+    use_links=True,
+    config=None,
+):
+    """
+    Crawl a site and ingest all discovered pages.
+
+    1. Discover URLs via sitemap or link-following
+    2. Apply include/exclude filters
+    3. Feed each URL through web_scraper.ingest_url()
+
+    Returns summary dict with counts and per-URL results.
+    """
+    if config is None:
+        config = get_config()
+    cfg = _get_crawler_config(config)
+
+    if max_pages is None:
+        max_pages = cfg['max_pages']
+    if max_depth is None:
+        max_depth = cfg['max_depth']
+    if delay is None:
+        delay = cfg['rate_limit_delay']
+    if source is None:
+        source = urlparse(base_url).netloc
+
+    logger.info(f"Crawling {base_url} (category={category}, max_pages={max_pages})")
+
+    # ── Phase 1: Discover URLs ──
+
+    urls = []
+    discovery_method = None
+
+    if use_sitemap:
+        sitemap_url = discover_sitemap_url(base_url, config)
+        if sitemap_url:
+            urls = parse_sitemap(sitemap_url, config)
+            discovery_method = 'sitemap'
+
+    if not urls and use_links:
+        logger.info("No sitemap URLs, falling back to link crawl...")
+        urls = crawl_links(base_url, max_depth=max_depth, max_pages=max_pages, config=config)
+        discovery_method = 'link_crawl'
+
+    if not urls:
+        logger.warning(f"No URLs discovered for {base_url}")
+        return {
+            'site': base_url,
+            'discovery_method': None,
+            'urls_discovered': 0,
+            'urls_after_filter': 0,
+            'results': [],
+            'summary': {'total': 0, 'succeeded': 0, 'duplicates': 0, 'failed': 0},
+        }
+
+    # ── Phase 2: Filter URLs ──
+
+    all_exclude = list(cfg['default_exclude'])
+    if exclude:
+        all_exclude.extend(exclude)
+
+    urls = filter_urls(urls, include=include, exclude=all_exclude)
+
+    if len(urls) > max_pages:
+        logger.info(f"Limiting to {max_pages} pages (discovered {len(urls)})")
+        urls = urls[:max_pages]
+
+    logger.info(f"After filtering: {len(urls)} URLs to process")
+
+    # ── Dry run ──
+
+    if dry_run:
+        return {
+            'site': base_url,
+            'discovery_method': discovery_method,
+            'dry_run': True,
+            'urls_discovered': len(urls),
+            'urls': urls,
+        }
+
+    # ── Phase 3: Ingest each URL ──
+
+    from .web_scraper import ingest_url
+
+    results = []
+    total = len(urls)
+
+    for i, url in enumerate(urls, 1):
+        logger.info(f"[{i}/{total}] Ingesting: {url}")
+
+        try:
+            result = ingest_url(url, category=category, source=source, config=config)
+            result['url'] = url
+            results.append(result)
+
+            status = result.get('status', 'unknown')
+            title = result.get('title', '')
+            if status == 'duplicate':
+                logger.info(f"  DUPLICATE: {title}")
+            else:
+                logger.info(f"  OK: {title} ({result.get('page_count', 0)} pages)")
+
+        except Exception as e:
+            logger.error(f"  FAILED: {url} -- {e}")
+            results.append({
+                'url': url,
+                'status': 'failed',
+                'error': str(e),
+            })
+
+        if i < total and delay > 0:
+            time.sleep(delay)
+
+    # ── Summary ──
+
+    succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
+    duplicates = sum(1 for r in results if r.get('status') == 'duplicate')
+    failed = sum(1 for r in results if r.get('status') == 'failed')
+
+    summary = {
+        'total': len(results),
+        'succeeded': succeeded,
+        'duplicates': duplicates,
+        'failed': failed,
+    }
+
+    logger.info(f"Crawl complete: {succeeded} new, {duplicates} duplicates, {failed} failed out of {total}")
+
+    return {
+        'site': base_url,
+        'domain': urlparse(base_url).netloc,
+        'category': category,
+        'discovery_method': discovery_method,
+        'urls_discovered': total,
+        'results': results,
+        'summary': summary,
+    }
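
The fragment deduplication in `parse_sitemap` and the prefix rules in `filter_urls` combine as sketched below. This is a minimal standalone illustration, not code from the module: `dedupe_and_filter` and the `example.org` URLs are hypothetical. One consequence worth noting is that matching uses `str.startswith` on the path, so an exclude prefix such as `/search` also drops `/searching-tips`.

```python
from urllib.parse import urlparse, urldefrag


def dedupe_and_filter(urls, include=None, exclude=None):
    """Strip fragments, drop duplicates (order preserved), apply prefix rules."""
    seen = set()
    kept = []
    for url in urls:
        clean = urldefrag(url)[0]  # remove "#fragment"
        if clean in seen:
            continue
        seen.add(clean)
        path = urlparse(clean).path
        # include: path must start with at least one prefix (if any given)
        if include and not any(path.startswith(p) for p in include):
            continue
        # exclude: path must not start with any prefix
        if exclude and any(path.startswith(p) for p in exclude):
            continue
        kept.append(clean)
    return kept


urls = [
    "https://example.org/docs/intro",
    "https://example.org/docs/intro#setup",   # duplicate once the fragment is removed
    "https://example.org/search?q=x",         # excluded by prefix
    "https://example.org/searching-tips",     # also excluded: plain prefix match
    "https://example.org/blog/post",
]
kept = dedupe_and_filter(urls, exclude=["/search"])
print(kept)
```

If dropping `/searching-tips` is not wanted, a trailing slash in the prefix (`/search/`) narrows the match, at the cost of no longer matching the bare `/search` path.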
--- a/lib/embedder.py
+++ b/lib/embedder.py
@ -0,0 +1,430 @@
+"""
+RECON Embedder
+
+Concepts to vectors via TEI (primary, 1024-dim bge-m3, ~1,711 emb/sec)
+or Ollama (fallback, ~8 emb/sec). Inserts into Qdrant on cortex:6333.
+
+Supports hybrid dense+sparse vectors when sparse_embedding service is configured.
+
+Dependencies: requests, qdrant-client
+Config: embedding, vector_db, processing.embed_workers
+"""
+import json
+import os
+import time
+import traceback
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+import requests as http_requests
+from qdrant_client import QdrantClient
+from qdrant_client.models import PointStruct, SparseVector
+
+from .utils import get_config, concept_id, generate_download_url, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.embedder')
+
+# ── Classification allowlists ───────────────────────────────────────────────
+VALID_DOMAINS = {
+    'Agriculture & Livestock', 'Civil Organization', 'Communications',
+    'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
+    'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
+    'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
+    'Vehicles', 'Water Systems', 'Wilderness Skills',
+}
+VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
+VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}
+
+DOMAIN_FALLBACK = 'Foundational Skills'
+KNOWLEDGE_TYPE_FALLBACK = 'foundational'
+COMPLEXITY_FALLBACK = 'basic'
+
+
+def _validate_classification(payload):
+    """Validate domain, knowledge_type, complexity before upsert.
+
+    Logs WARNING and applies safe fallback for any invalid values.
+    Returns the payload (modified in place if needed).
+    """
+    title = payload.get('title', payload.get('filename', '?'))
+
+    # ── domain ──────────────────────────────────────────────────────────
+    domain = payload.get('domain')
+    if isinstance(domain, list):
+        valid = [d for d in domain if d in VALID_DOMAINS]
+        if valid:
+            payload['domain'] = valid[0]
+        else:
+            logger.warning(f"Invalid domain {domain} for '{title}', fallback → {DOMAIN_FALLBACK}")
+            payload['domain'] = DOMAIN_FALLBACK
+    elif isinstance(domain, str):
+        if domain not in VALID_DOMAINS:
+            logger.warning(f"Invalid domain '{domain}' for '{title}', fallback → {DOMAIN_FALLBACK}")
+            payload['domain'] = DOMAIN_FALLBACK
+    else:
+        logger.warning(f"Missing or non-string domain for '{title}', fallback → {DOMAIN_FALLBACK}")
+        payload['domain'] = DOMAIN_FALLBACK
+
+    # ── knowledge_type ──────────────────────────────────────────────────
+    kt = payload.get('knowledge_type', '')
+    if isinstance(kt, str):
+        kt = kt.lower().strip()
+    else:
+        kt = ''
+    if kt not in VALID_KNOWLEDGE_TYPES:
+        logger.warning(f"Invalid knowledge_type '{kt}' for '{title}', fallback → {KNOWLEDGE_TYPE_FALLBACK}")
+        payload['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
+    else:
+        payload['knowledge_type'] = kt
+
+    # ── complexity ──────────────────────────────────────────────────────
+    cx = payload.get('complexity', '')
+    if isinstance(cx, str):
+        cx = cx.lower().strip()
+    else:
+        cx = ''
+    if cx not in VALID_COMPLEXITIES:
+        logger.warning(f"Invalid complexity '{cx}' for '{title}', fallback → {COMPLEXITY_FALLBACK}")
+        payload['complexity'] = COMPLEXITY_FALLBACK
+    else:
+        payload['complexity'] = cx
+
+    return payload
+
+
+def get_embedding_single(text, config):
+    """Get a single embedding — uses TEI or Ollama depending on config."""
+    backend = config['embedding'].get('backend', 'ollama')
+
+    if backend == 'tei':
+        url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
+        resp = http_requests.post(url, json={"inputs": text}, timeout=120)
+        resp.raise_for_status()
+        return resp.json()[0]
+    else:
+        url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
+        resp = http_requests.post(url, json={
+            "model": config['embedding']['model'],
+            "input": text
+        }, timeout=120)
+        resp.raise_for_status()
+        return resp.json()['embeddings'][0]
+
+
+def get_embeddings_batch(texts, config):
+    """Get embeddings for a batch of texts via TEI. On error, splits the batch in half and retries each half recursively."""
+    url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
+
+    try:
+        resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
+        resp.raise_for_status()
+        return resp.json()
+    except Exception as e:
+        if len(texts) <= 1:
+            raise
+        # Split batch in half and retry each half
+        mid = len(texts) // 2
+        logger.warning(f"  Batch of {len(texts)} failed ({e}), splitting in half")
+        left = get_embeddings_batch(texts[:mid], config)
+        right = get_embeddings_batch(texts[mid:], config)
+        return left + right
+
+
+def get_sparse_embeddings_batch(texts, config):
+    """Get sparse embeddings from the sparse embedding service on cortex.
+
+    Returns a list of dicts with 'indices' and 'values' keys, or None on failure.
+    """
+    sparse_cfg = config.get('sparse_embedding')
+    if not sparse_cfg or not sparse_cfg.get('enabled', False):
+        return None
+
+    url = f"http://{sparse_cfg['host']}:{sparse_cfg['port']}/embed_sparse"
+
+    try:
+        resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
+        resp.raise_for_status()
+        return resp.json()
+    except Exception as e:
+        logger.warning(f"  Sparse embedding failed for batch of {len(texts)}: {e}")
+        return None
+
+
+def _validate_content(content):
+    """Validate and normalize concept content for embedding. Returns clean string or None."""
+    if content is None:
+        return None
+    if not isinstance(content, str):
+        content = str(content)
+    content = content.strip()
+    if len(content) < 10:
+        return None
+    # Truncate to 8192 characters, a conservative cap (bge-m3's 8192 limit is counted in tokens)
+    if len(content) > 8192:
+        content = content[:8192]
+    return content
+
+
+def _build_payload(doc, concept, idx, source, download_url, source_type, page_timestamps):
+    """Build and validate payload for a single concept point."""
+    start_page = concept.get('_start_page', 0)
+
+    payload = {
+        'doc_hash': doc.get('hash', ''),
+        'filename': doc['filename'],
+        'book_title': doc.get('book_title', ''),
+        'book_author': doc.get('book_author', ''),
+        'source': source,
+        'download_url': download_url,
+        'source_type': source_type,
+        'verification_status': 'unverified',
+        'credibility_score': 0.7,
+        'language': 'en',
+    }
+
+    for field in ['content', 'summary', 'title', 'domain', 'subdomain',
+                  'keywords', 'knowledge_type', 'complexity',
+                  'key_facts', 'scenario_applicable',
+                  'cross_domain_tags', 'chapter', 'page_ref', 'notes',
+                  '_window', '_start_page']:
+        if field in concept:
+            payload[field] = concept[field]
+
+    # Add video timestamp for transcript sources
+    if source_type == 'transcript' and page_timestamps:
+        page_key = f"page_{start_page:04d}"
+        if page_key in page_timestamps:
+            payload['video_timestamp'] = page_timestamps[page_key]
+
+    # Validate classification fields before returning
+    payload = _validate_classification(payload)
+
+    return payload
+
+
+def _build_point(point_id, dense_vector, sparse_vec, payload, config):
+    """Build a PointStruct with dense vector and optional sparse vector."""
+    sparse_cfg = config.get('sparse_embedding')
+    if sparse_cfg and sparse_cfg.get('enabled', False) and sparse_vec:
+        vector = {
+            "": dense_vector,
+            "bge-m3-sparse": SparseVector(
+                indices=sparse_vec['indices'],
+                values=sparse_vec['values'],
+            ),
+        }
+    else:
+        vector = {"": dense_vector}
+
+    return PointStruct(id=point_id, vector=vector, payload=payload)
+
+
+def embed_single(file_hash, db, config):
+    doc = db.get_document(file_hash)
+    if not doc:
+        return False
+
+    concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
+    if not os.path.exists(concepts_dir):
+        db.mark_failed(file_hash, f"Concepts directory not found: {concepts_dir}")
+        return False
+
+    db.update_status(file_hash, 'embedding')
+
+    try:
+        qdrant = QdrantClient(
+            host=config['vector_db']['host'],
+            port=config['vector_db']['port'],
+            timeout=60
+        )
+        collection = config['vector_db']['collection']
+        qdrant_batch_size = config['processing']['embed_batch_size']
+        embed_batch_size = config['embedding'].get('batch_size', 128)
+        backend = config['embedding'].get('backend', 'ollama')
+
+        window_files = sorted([
+            f for f in os.listdir(concepts_dir)
+            if f.startswith('window_') and f.endswith('.json')
+        ])
+
+        if not window_files:
+            db.mark_failed(file_hash, "No window files found")
+            return False
+
+        all_concepts = []
+        for wf in window_files:
+            with open(os.path.join(concepts_dir, wf), encoding='utf-8') as f:
+                concepts = json.load(f)
+            if isinstance(concepts, list):
+                all_concepts.extend([c for c in concepts if isinstance(c, dict)])
+
+        if not all_concepts:
+            db.update_status(file_hash, 'complete', vectors_inserted=0)
+            logger.info(f"No concepts to embed for {doc['filename']}")
+            return True
+
+        # Look up source from catalogue once per doc
+        cat_conn = db._get_conn()
+        cat_row = cat_conn.execute(
+            "SELECT source FROM catalogue WHERE hash = ?", (file_hash,)
+        ).fetchone()
+        source = dict(cat_row)['source'] if cat_row else ''
+
+        download_url = ''
+        is_web = (doc.get('path') or '').startswith(('http://', 'https://'))  # tolerate a NULL path
+        source_type = 'web' if is_web else 'document'
+
+        # Check meta.json for explicit source_type (e.g. 'transcript')
+        text_dir = os.path.join(config['paths']['text'], file_hash)
+        meta_path = os.path.join(text_dir, 'meta.json')
+        page_timestamps = {}
+        if os.path.exists(meta_path):
+            try:
+                with open(meta_path, encoding='utf-8') as mf:
+                    meta = json.load(mf)
+                if meta.get('source_type'):
+                    source_type = meta['source_type']
+                if not download_url and meta.get('url'):
+                    download_url = meta['url']
+                if meta.get('page_timestamps'):
+                    page_timestamps = meta['page_timestamps']
+            except Exception as e:
+                logger.debug(f"  Could not read {meta_path}: {e}")
+        if doc.get('path'):
+            download_url = generate_download_url(
+                doc['path'], config.get('library_root', '/mnt/library')
+            ) or download_url  # keep the meta.json URL if nothing was generated
+
+        # Build list of valid concepts with their indices
+        valid = []
+        skipped = 0
+        for idx, concept in enumerate(all_concepts):
+            content = _validate_content(concept.get('content', ''))
+            if content is None:
+                skipped += 1
+                continue
+            valid.append((idx, concept, content))
+
+        if skipped > 0:
+            logger.info(f"  Skipped {skipped} concepts with invalid/empty content")
+
+        if not valid:
+            db.update_status(file_hash, 'complete', vectors_inserted=0)
+            logger.info(f"No valid concepts to embed for {doc['filename']}")
+            return True
+
+        points = []
+        embedded_count = 0
+
+        if backend == 'tei':
+            # TEI: batch embedding
+            for batch_start in range(0, len(valid), embed_batch_size):
+                batch = valid[batch_start:batch_start + embed_batch_size]
+                texts = [content for _, _, content in batch]
+
+                try:
+                    vectors = get_embeddings_batch(texts, config)
+                except Exception as e:
+                    logger.error(f"  Batch embedding failed at offset {batch_start}: {e}")
+                    # Skip the entire batch: its concepts are not embedded, and the document still completes with fewer vectors
+                    continue
+
+                # Get sparse embeddings for the same batch
+                sparse_results = get_sparse_embeddings_batch(texts, config)
+
+                for i, ((idx, concept, content), vector) in enumerate(zip(batch, vectors)):
+                    start_page = concept.get('_start_page', 0)
+                    point_id = concept_id(file_hash, start_page, idx)
+
+                    payload = _build_payload(
+                        doc, concept, idx, source, download_url,
+                        source_type, page_timestamps
+                    )
+
+                    sparse_vec = sparse_results[i] if sparse_results and i < len(sparse_results) else None
+                    points.append(_build_point(point_id, vector, sparse_vec, payload, config))
+                    embedded_count += 1
+
+                    if len(points) >= qdrant_batch_size:
+                        qdrant.upsert(collection_name=collection, points=points)
+                        logger.debug(f"  Upserted batch of {len(points)} points")
+                        points = []
+
+        else:
+            # Ollama: one-at-a-time with retry
+            for idx, concept, content in valid:
+                try:
+                    vector = get_embedding_single(content, config)
+                except Exception as e:
+                    logger.warning(f"  Embedding failed for concept {idx}: {e}")
+                    time.sleep(2)
+                    try:
+                        vector = get_embedding_single(content, config)
+                    except Exception as e2:
+                        logger.error(f"  Embedding retry failed for concept {idx}: {e2}")
+                        continue
+
+                # Get sparse embedding for single text
+                sparse_results = get_sparse_embeddings_batch([content], config)
+                sparse_vec = sparse_results[0] if sparse_results else None
+
+                start_page = concept.get('_start_page', 0)
+                point_id = concept_id(file_hash, start_page, idx)
+
+                payload = _build_payload(
+                    doc, concept, idx, source, download_url,
+                    source_type, page_timestamps
+                )
+
+                points.append(_build_point(point_id, vector, sparse_vec, payload, config))
+                embedded_count += 1
+
+                if len(points) >= qdrant_batch_size:
+                    qdrant.upsert(collection_name=collection, points=points)
+                    logger.debug(f"  Upserted batch of {len(points)} points")
+                    points = []
+
+        if points:
+            qdrant.upsert(collection_name=collection, points=points)
+            logger.debug(f"  Upserted final batch of {len(points)} points")
+
+        db.update_status(file_hash, 'complete', vectors_inserted=embedded_count)
+        logger.info(f"Embedded {doc['filename']}: {embedded_count} vectors ({skipped} skipped)")
+        return True
+
+    except Exception as e:
+        logger.error(f"Embedding failed for {file_hash}: {e}\n{traceback.format_exc()}")
+        db.mark_failed(file_hash, str(e))
+        return False
+
+
+def run_embedding(workers=None, limit=None):
+    config = get_config()
+    db = StatusDB()
+    workers = workers or config['processing']['embed_workers']
+
+    enriched = db.get_by_status('enriched', limit=limit)
+    if not enriched:
+        logger.info("No enriched documents to embed")
+        return 0
+
+    backend = config['embedding'].get('backend', 'ollama')
+    sparse_cfg = config.get('sparse_embedding')
+    sparse_status = "enabled" if (sparse_cfg and sparse_cfg.get('enabled')) else "disabled"
+    logger.info(f"Embedding {len(enriched)} documents with {workers} workers (backend: {backend}, sparse: {sparse_status})")
+    success = 0
+
+    with ThreadPoolExecutor(max_workers=workers) as pool:
+        futures = {
+            pool.submit(embed_single, doc['hash'], StatusDB(), config): doc
+            for doc in enriched
+        }
+        for future in as_completed(futures):
+            doc = futures[future]
+            try:
+                if future.result():
+                    success += 1
+            except Exception as e:
+                logger.error(f"Worker error for {doc['hash']}: {e}")
+
+    logger.info(f"Embedding complete: {success}/{len(enriched)} succeeded")
+    return success
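Both embedding branches in `embed_single` above use the same accumulate-and-flush pattern: append a point, upsert once the buffer reaches `qdrant_batch_size`, then upsert whatever remains after the loop. The following is a minimal standalone sketch of that pattern. It assumes a hypothetical helper `flush_in_batches` and a plain callable `sink` standing in for `qdrant.upsert`; neither name exists in the codebase.

```python
def flush_in_batches(items, batch_size, sink):
    """Feed items to sink in lists of at most batch_size; return the item count.

    Hypothetical helper for illustration only; `sink` stands in for a call
    such as qdrant.upsert(collection_name=..., points=batch).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch = []
    count = 0
    for item in items:
        batch.append(item)
        count += 1
        if len(batch) >= batch_size:
            sink(batch)
            batch = []  # start a fresh list so the sink keeps its own batch
    if batch:
        sink(batch)  # final partial batch
    return count


received = []
total = flush_in_batches(range(7), batch_size=3, sink=received.append)
print(total, received)
```

Factoring the flush into one place would let the TEI and Ollama branches share it, at the cost of one more indirection.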
--- a/lib/enricher.py
+++ b/lib/enricher.py
@ -0,0 +1,561 @@
+"""
+RECON Enricher
+
+Text to structured concepts via Gemini API. Saves JSON to data/concepts/{hash}/
+BEFORE any DB operations. Uses 10-page windows, 4 API keys, 16 workers.
+
+Resilience:
+  - Exponential backoff with jitter for transient errors (429, 500, 503, timeout)
+  - Permanent errors (JSON parse, auth) fail immediately without wasting retries
+  - Window failures skip that window and continue — partial enrichment beats zero
+  - Document marked enriched if any concepts were extracted or no window failed; failed otherwise
+
+Dependencies: google-generativeai
+Config: processing.enrich_workers, processing.enrich_window_size, gemini, paths.concepts
+"""
+import json
+import os
+import random
+import re
+import time
+import traceback
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+import google.generativeai as genai
+
+from .utils import get_config, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.enricher')
+
+# Docs stuck in "enriching" longer than this get reset to "extracted" for retry
+STALE_ENRICHING_HOURS = 2
+
+# ── Classification allowlists ───────────────────────────────────────────────
+VALID_DOMAINS = {
+    'Agriculture & Livestock', 'Civil Organization', 'Communications',
+    'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
+    'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
+    'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
+    'Vehicles', 'Water Systems', 'Wilderness Skills',
+}
+VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
+VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}
+
+DOMAIN_FALLBACK = 'Foundational Skills'
+KNOWLEDGE_TYPE_FALLBACK = 'foundational'
+COMPLEXITY_FALLBACK = 'basic'
+
+
+def repair_json(text):
+    """Attempt to repair common LLM JSON output issues including truncation."""
+    # Remove control characters except newlines and tabs
+    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
+    # Fix invalid JSON escape sequences (e.g. \e, \p, \c from Gemini)
+    # Valid JSON escapes: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX
+    text = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
+    # Remove trailing commas before } or ]
+    text = re.sub(r',\s*([}\]])', r'\1', text)
+
+    # Handle truncated JSON: try to find the last complete object in the array
+    try:
+        json.loads(text, strict=False)
+        return text
+    except json.JSONDecodeError:
+        pass
+
+    # Scan forward, tracking string and nesting state, to find the last object
+    # that closed at brace depth 0; then cut there and close any open arrays
+    last_complete = -1
+    brackets_at_last = 0
+    depth_brace = 0
+    depth_bracket = 0
+    in_string = False
+    escape = False
+
+    for i, ch in enumerate(text):
+        if escape:
+            escape = False
+            continue
+        if ch == '\\' and in_string:
+            escape = True
+            continue
+        if ch == '"':
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == '{':
+            depth_brace += 1
+        elif ch == '}':
+            depth_brace -= 1
+            if depth_brace == 0:
+                last_complete = i
+                brackets_at_last = depth_bracket
+        elif ch == '[':
+            depth_bracket += 1
+        elif ch == ']':
+            depth_bracket -= 1
+
+    if last_complete > 0:
+        truncated = text[:last_complete + 1].rstrip().rstrip(',')
+        # Close arrays still open at the cut point (brackets inside strings are not counted)
+        truncated += ']' * max(brackets_at_last, 0)
+        return truncated
+
+    return text
+
+ENRICH_PROMPT = """Extract knowledge concepts from this document text.
+
+A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
+
+For each concept, provide ALL fields:
+
+Required:
+- content: Full text of the concept (complete procedure, definition, etc.)
+- summary: 1-2 sentence summary
+- title: Brief descriptive title
+- domain: must be exactly one of: Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills — return ONLY this exact string, no variations, no new domains, no underscores, no synonyms
+  CRITICAL: Medical content (first aid, anatomy, pharmacology, herbs, veterinary, austere medicine) → Medical
+  CRITICAL: Food growing, farming, animal husbandry, livestock → Agriculture & Livestock
+  CRITICAL: Foraging, hunting, fishing, bushcraft, wilderness survival → Wilderness Skills
+  CRITICAL: Food preservation, storage, canning, dehydration, processing → Preservation & Storage
+  CRITICAL: Solar, wind, hydro, batteries, generators → Power Systems
+  CRITICAL: Water sourcing, filtration, sanitation, purification → Water Systems
+  CRITICAL: Building, carpentry, structural construction, shelter → Shelter & Construction
+  CRITICAL: Tactical operations, mission execution, combat maneuvers, search & rescue → Operations
+  CRITICAL: Governance, civil administration, community leadership → Civil Organization
+  CRITICAL: Electronics, IT, computing, engineering → Technology
+  CRITICAL: Hand tools, power tools, equipment maintenance → Tools & Equipment
+  CRITICAL: Motor vehicles, aircraft, watercraft, vehicle maintenance → Vehicles
+  CRITICAL: Radio, signals, networking, comms equipment → Communications
+  CRITICAL: Supply chain, transport, distribution, inventory → Logistics
+  CRITICAL: Physical security, OPSEC, threat assessment → Security
+  CRITICAL: Map reading, orienteering, GPS, celestial navigation → Navigation
+  CRITICAL: Cooking methods, food production, recipes, nutrition → Food Systems
+- subdomain: Array of specific subcategories (up to 10)
+- keywords: Array of 3-30 searchable terms
+- knowledge_type: foundational | procedural | operational
+    foundational — concepts, definitions, theory, background knowledge, explanations of how things work
+    procedural — step-by-step techniques, instructions, how-to skills, methods you execute
+    operational — application under real conditions, decision-making, mission execution, judgment calls in context
+    Valid values are ONLY: foundational, procedural, operational — do not use any other values
+- complexity: basic | intermediate | advanced
+    basic — requires little or no prior knowledge, introductory material, simple concepts
+    intermediate — requires some domain familiarity, assumes foundational knowledge is in place
+    advanced — requires significant experience or expertise, high-stakes or highly technical material
+    Valid values are ONLY: basic, intermediate, advanced — do not use any other values
+- key_facts: Array of specific extractable claims, measurements, data points
+
+Optional (include when present):
+- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
+- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
+- chapter: Chapter name if identifiable
+- page_ref: Page reference
+- notes: Any additional context
+
+EXAMPLES (knowledge_type + complexity):
+- "Needle chest decompression procedure" → knowledge_type: "procedural", complexity: "advanced"
+- "What is soil texture and why does it matter" → knowledge_type: "foundational", complexity: "basic"
+- "Coordinating a fire team withdrawal under contact" → knowledge_type: "operational", complexity: "advanced"
+
+Return JSON array. If no extractable concepts, return [].
+
+Document text:
+"""
+
+
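+class KeyRotator:
+    """Round-robin over API keys.
+
+    Not synchronized: concurrent workers may occasionally draw the same key,
+    which only skews the rotation.
+    """
+    def __init__(self, keys):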
+        self.keys = keys
+        self.index = 0
+
+    def next(self):
+        if not self.keys:
+            raise ValueError("No Gemini API keys configured")
+        key = self.keys[self.index % len(self.keys)]
+        self.index += 1
+        return key
+
+
+def enrich_window(text, key, config):
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        config['gemini']['model'],
+        generation_config={"response_mime_type": config['gemini']['response_mime_type']}
+    )
+    response = model.generate_content(ENRICH_PROMPT + text)
+    raw = response.text
+    try:
+        result = json.loads(raw, strict=False)
+    except json.JSONDecodeError:
+        repaired = repair_json(raw)
+        result = json.loads(repaired, strict=False)
+    # Filter out non-dict items (nested lists from truncated responses)
+    if isinstance(result, list):
+        result = [c for c in result if isinstance(c, dict)]
+    return result
+
+
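+def _is_transient(error_str):
+    """Classify whether an error is transient (worth retrying) or permanent.
+
+    Matching is by plain substring, so short signals such as 'rate' or '500'
+    can also match unrelated text (e.g. 'generate').
+    """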
+    s = error_str.lower()
+    transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
+                         '500', '503', 'unavailable', 'timeout',
+                         'connection', 'reset by peer', 'broken pipe']
+    return any(sig in s for sig in transient_signals)
+
+
+def _retry_with_backoff(fn, max_retries=5, base_delay=5.0, max_delay=120.0):
+    """Retry with exponential backoff + jitter for transient errors.
+
+    Backoff: ~5s, ~10s, ~20s, ~40s, each plus up to base_delay of jitter (about 75-95s of
+    sleep in total with the defaults). No sleep follows the final attempt.
+    Permanent errors (JSON parse, auth) raise immediately without retrying.
+    """
+    last_exc = None
+    for attempt in range(max_retries):
+        try:
+            return fn()
+        except Exception as e:
+            last_exc = e
+            err = str(e)
+            if not _is_transient(err):
+                raise  # permanent — don't waste retries
+            if attempt < max_retries - 1:
+                delay = min(base_delay * (2 ** attempt) + random.uniform(0, base_delay), max_delay)
+                logger.info(f"    Transient error (attempt {attempt+1}/{max_retries}), "
+                            f"retrying in {delay:.0f}s: {err[:120]}")
+                time.sleep(delay)
+            else:
+                logger.warning(f"    Transient error, max retries exhausted: {err[:150]}")
+    raise last_exc
+
+
+def _reclassify_field(field_name, allowlist, concept, key, config, max_retries=3):
+    """Retry Gemini up to max_retries to get a valid value for a specific field."""
+    content = concept.get('content', concept.get('summary', ''))
+    if isinstance(content, str):
+        content = content[:400]
+    else:
+        content = str(content)[:400]
+    title = concept.get('title', '(untitled)')
+    allowlist_str = ', '.join(sorted(allowlist))
+
+    for attempt in range(max_retries):
+        try:
+            prompt = (
+                f"Your previous response for '{field_name}' was invalid. "
+                f"You must return ONLY one of these exact strings: {allowlist_str}\n\n"
+                f"Title: {title}\n"
+                f"Content: {content}\n\n"
+                f"Return ONLY the exact string, nothing else. No explanation, no punctuation, no quotes."
+            )
+            genai.configure(api_key=key)
+            model = genai.GenerativeModel(
+                config['gemini']['model'],
+                generation_config={"response_mime_type": "text/plain"}
+            )
+            resp = model.generate_content(prompt)
+            value = resp.text.strip().strip('"').strip("'").strip()
+            if value in allowlist:
+                return value
+            # Try case-insensitive match for knowledge_type/complexity
+            for valid in allowlist:
+                if value.lower() == valid.lower():
+                    return valid
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ['429', 'quota', 'rate', '503']):
+                time.sleep(min(3 * (2 ** attempt) + random.uniform(0, 2), 30))
+            else:
+                logger.warning(f"  Reclassify retry {attempt+1} for {field_name} failed: {e}")
+    return None
+
+
+def validate_and_fix_concepts(concepts, key, config):
+    """Validate domain, knowledge_type, complexity on each concept.
+
+    For invalid values: retry Gemini up to 3 times, then apply safe fallback.
+    """
+    for concept in concepts:
+        if not isinstance(concept, dict):
+            continue
+
+        # ── Validate domain ─────────────────────────────────────────────
+        domain = concept.get('domain')
+        if isinstance(domain, list):
+            # Legacy array format — find first valid or reclassify
+            valid = [d for d in domain if d in VALID_DOMAINS]
+            if valid:
+                concept['domain'] = valid[0]
+            else:
+                new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
+                if new_val:
+                    concept['domain'] = new_val
+                else:
+                    logger.warning(f"Invalid domain {domain} for '{concept.get('title', '?')}', using fallback")
+                    concept['domain'] = DOMAIN_FALLBACK
+        elif isinstance(domain, str):
+            if domain not in VALID_DOMAINS:
+                new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
+                if new_val:
+                    concept['domain'] = new_val
+                else:
+                    logger.warning(f"Invalid domain '{domain}' for '{concept.get('title', '?')}', using fallback")
+                    concept['domain'] = DOMAIN_FALLBACK
+        else:
+            concept['domain'] = DOMAIN_FALLBACK
+
+        # ── Validate knowledge_type ─────────────────────────────────────
+        kt = concept.get('knowledge_type', '')
+        if isinstance(kt, str):
+            kt = kt.lower().strip()
+        else:
+            kt = ''
+        if kt not in VALID_KNOWLEDGE_TYPES:
+            new_val = _reclassify_field('knowledge_type', VALID_KNOWLEDGE_TYPES, concept, key, config)
+            if new_val:
+                concept['knowledge_type'] = new_val
+            else:
+                logger.warning(f"Invalid knowledge_type '{kt}' for '{concept.get('title', '?')}', using fallback")
+                concept['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
+        else:
+            concept['knowledge_type'] = kt
+
+        # ── Validate complexity ─────────────────────────────────────────
+        cx = concept.get('complexity', '')
+        if isinstance(cx, str):
+            cx = cx.lower().strip()
+        else:
+            cx = ''
+        if cx not in VALID_COMPLEXITIES:
+            new_val = _reclassify_field('complexity', VALID_COMPLEXITIES, concept, key, config)
+            if new_val:
+                concept['complexity'] = new_val
+            else:
+                logger.warning(f"Invalid complexity '{cx}' for '{concept.get('title', '?')}', using fallback")
+                concept['complexity'] = COMPLEXITY_FALLBACK
+        else:
+            concept['complexity'] = cx
+
+    return concepts
+
+
+def enrich_single(file_hash, db, config, key_rotator):
+    doc = db.get_document(file_hash)
+    if not doc:
+        return False
+
+    text_dir = os.path.join(config['paths']['text'], file_hash)
+    concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
+    window_size = config['processing']['enrich_window_size']
+    delay = config['processing']['rate_limit_delay']
+    proc = config.get('processing', {})
+    max_retries = proc.get('enrich_max_retries', proc.get('max_retries', 5))
+    base_delay = proc.get('enrich_base_delay', 5.0)
+    max_delay = proc.get('enrich_max_delay', 120.0)
+
+    if not os.path.exists(text_dir):
+        db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
+        return False
+
+    db.update_status(file_hash, 'enriching')
+
+    try:
+        os.makedirs(concepts_dir, exist_ok=True)
+
+        page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
+        if not page_files:
+            db.mark_failed(file_hash, "No page files found")
+            return False
+
+        pages_text = []
+        for pf in page_files:
+            with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
+                pages_text.append(f.read())
+
+        windows = []
+        for i in range(0, len(pages_text), window_size):
+            window_pages = pages_text[i:i + window_size]
+            combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
+            windows.append((i, combined))
+
+        total_concepts = 0
+        failed_windows = []
+
+        for w_idx, (start_page, window_text) in enumerate(windows):
+            window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")
+
+            if os.path.exists(window_file):
+                with open(window_file, encoding='utf-8') as f:
+                    existing = json.load(f)
+                total_concepts += len(existing)
+                logger.debug(f"  Window {w_idx+1} already exists, skipping")
+                continue
+
+            if len(window_text.strip()) < 50:
+                with open(window_file, 'w', encoding='utf-8') as f:
+                    json.dump([], f)
+                continue
+
+            # Attempt enrichment with backoff — failures skip the window, not the doc
+            try:
+                key = key_rotator.next()
+                concepts = _retry_with_backoff(
+                    lambda k=key: enrich_window(window_text, k, config),
+                    max_retries=max_retries,
+                    base_delay=base_delay,
+                    max_delay=max_delay,
+                )
+            except Exception as e:
+                failed_windows.append((w_idx + 1, str(e)[:100]))
+                logger.warning(f"  Window {w_idx+1}/{len(windows)} failed: {e}")
+                continue  # skip this window, keep going
+
+            if not isinstance(concepts, list):
+                concepts = [concepts] if isinstance(concepts, dict) else []
+            concepts = [c for c in concepts if isinstance(c, dict)]
+
+            # Validate domain, knowledge_type, complexity — retry then fallback
+            validation_key = key_rotator.next()
+            concepts = validate_and_fix_concepts(concepts, validation_key, config)
+
+            for concept in concepts:
+                concept['_window'] = w_idx + 1
+                concept['_start_page'] = start_page + 1
+                concept['_doc_hash'] = file_hash
+
+            # JSON FIRST: save before anything else
+            with open(window_file, 'w', encoding='utf-8') as f:
+                json.dump(concepts, f, indent=2, ensure_ascii=False)
+
+            total_concepts += len(concepts)
+            logger.debug(f"  Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
+            time.sleep(delay)
+
+        # Decide document status based on results
+        meta = {
+            'hash': file_hash,
+            'total_windows': len(windows),
+            'total_concepts': total_concepts,
+            'failed_windows': len(failed_windows),
+            'window_size': window_size,
+            'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
+        }
+        with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
+            json.dump(meta, f, indent=2)
+
+        if total_concepts > 0 or not failed_windows:
+            # Some concepts extracted, or all windows were empty — mark enriched
+            error_msg = None
+            if total_concepts == 0 and doc.get('page_count', 0) >= 3:
+                error_msg = (f"0 concepts from {doc.get('page_count', '?')} pages — "
+                             f"likely image-only PDF, may need manual review")
+                logger.warning(f"  {doc['filename']}: {error_msg}")
+            elif failed_windows:
+                wins = ', '.join(str(w) for w, _ in failed_windows[:10])
+                error_msg = (f"Partial: {len(failed_windows)}/{len(windows)} "
+                             f"windows failed (windows {wins})")
+                logger.warning(f"  {doc['filename']}: {error_msg}")
+            db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts,
+                             error_message=error_msg)
+            fw_note = f", {len(failed_windows)} windows failed" if failed_windows else ""
+            logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts "
+                        f"from {len(windows)} windows{fw_note}")
+            return True
+        else:
+            # No concepts and at least one window failed: mark the document failed
+            first_err = failed_windows[0][1] if failed_windows else 'unknown'
+            db.mark_failed(file_hash,
+                           f"{len(failed_windows)}/{len(windows)} windows failed, "
+                           f"0 concepts extracted: {first_err}")
+            logger.error(f"  {doc['filename']}: {len(failed_windows)}/{len(windows)} "
+                         f"windows failed, 0 concepts extracted")
+            return False
+
+    except Exception as e:
+        logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
+        db.mark_failed(file_hash, str(e))
+        return False
+
+
+def _recover_stale_enriching(db, max_hours=STALE_ENRICHING_HOURS):
+    """Reset docs stuck in enriching back to extracted so they get retried.
+
+    This handles the case where a previous enrichment run crashed mid-document.
+    The enricher skips already-completed window files, so no work is lost.
+    """
+    from datetime import datetime, timezone
+
+    conn = db._get_conn()
+    rows = conn.execute(
+        "SELECT hash, filename FROM documents WHERE status = 'enriching'",
+    ).fetchall()
+    if not rows:
+        return
+
+    # extracted_at is used as a proxy for when enriching started: reset the doc
+    # if that timestamp is missing, unparseable, or more than max_hours old
+    now = datetime.now(timezone.utc)
+    reset = []
+    for row in rows:
+        doc = db.get_document(row['hash'])
+        extracted_at = doc.get('extracted_at', '')
+        if not extracted_at:
+            reset.append(row)
+            continue
+        try:
+            ts = datetime.fromisoformat(extracted_at)
+            if ts.tzinfo is None:
+                ts = ts.replace(tzinfo=timezone.utc)
+            age_hours = (now - ts).total_seconds() / 3600
+            if age_hours > max_hours:
+                reset.append(row)
+        except Exception:
+            reset.append(row)
+
+    for row in reset:
+        conn.execute(
+            "UPDATE documents SET status = 'extracted' WHERE hash = ?",
+            (row['hash'],)
+        )
+        logger.warning(f"Recovered stale enriching doc: {row['filename']} ({row['hash'][:12]}...)")
+    if reset:
+        conn.commit()
+        logger.info(f"Reset {len(reset)} stale enriching docs back to extracted")
+
+
+def run_enrichment(workers=None, limit=None):
+    config = get_config()
+    db = StatusDB()
+    workers = workers or config['processing']['enrich_workers']
+
+    # Recover docs orphaned by previous crashed enrichment runs
+    _recover_stale_enriching(db)
+
+    keys = config.get('gemini_keys', [])
+    if not keys:
+        logger.error("No Gemini API keys configured in .env")
+        return 0
+
+    key_rotator = KeyRotator(keys)
+
+    extracted = db.get_by_status('extracted', limit=limit)
+    if not extracted:
+        logger.info("No extracted documents to enrich")
+        return 0
+
+    logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
+    success = 0
+
+    with ThreadPoolExecutor(max_workers=workers) as pool:
+        futures = {
+            pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
+            for doc in extracted
+        }
+        for future in as_completed(futures):
+            doc = futures[future]
+            try:
+                if future.result():
+                    success += 1
+            except Exception as e:
+                logger.error(f"Worker error for {doc['hash']}: {e}")
+
+    logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
+    return success
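The delay formula in `_retry_with_backoff` (lib/enricher.py, above) can be worked through numerically. The sketch below assumes a hypothetical helper named `backoff_bounds`, which is not part of the codebase. It computes the minimum and maximum sleep before each retry with the same formula and default arguments, and it reflects that a sleep only happens after a failed attempt that is not the last one.

```python
def backoff_bounds(max_retries=5, base_delay=5.0, max_delay=120.0):
    """Return (min, max) sleep seconds for each retry gap.

    Hypothetical helper for illustration; mirrors
    min(base_delay * 2**attempt + uniform(0, base_delay), max_delay).
    """
    bounds = []
    # The last attempt is never followed by a sleep, hence max_retries - 1 gaps.
    for attempt in range(max_retries - 1):
        floor = base_delay * (2 ** attempt)
        low = min(floor, max_delay)                # jitter = 0
        high = min(floor + base_delay, max_delay)  # jitter = base_delay
        bounds.append((low, high))
    return bounds


bounds = backoff_bounds()
total_low = sum(low for low, _ in bounds)
total_high = sum(high for _, high in bounds)
print(bounds)
print(total_low, total_high)
```

With the defaults this gives four sleeps totalling between 75 and 95 seconds, and the 120-second cap is never reached.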
--- a/lib/extractor.py
+++ b/lib/extractor.py
@ -0,0 +1,601 @@
+"""
+RECON Text Extractor
+
+PDF to text via PyPDF2 -> pdftotext -> Tesseract -> Gemini Vision fallback chain.
+Saves to data/text/{hash}/page_NNNN.txt (4-digit zero-padded, 1-indexed).
+
+Safety guards:
+  - Layer 1: Pre-flight size check (max_pdf_size_mb, default 200)
+  - Layer 2: Per-document timeout (extract_timeout, default 300s)
+  - Layer 3: Per-page timeout (page_timeout, default 30s)
+  - Partial extractions saved as 'extracted' with error_message noting incompleteness
+
+Fallback chain per page:
+  1. PyPDF2 (fast, free, text-based PDFs)
+  2. pdftotext/poppler (handles some PDFs PyPDF2 misses)
+  3. Tesseract OCR (renders page → local OCR)
+  4. Gemini Vision (renders page → cloud vision API, last resort for scanned docs)
+
+Dependencies: PyPDF2, pdftotext (poppler-utils), pytesseract, google-generativeai
+Config: processing.extract_workers, processing.max_pdf_size_mb,
+        processing.extract_timeout, processing.page_timeout
+"""
+import base64
+import json
+import os
+import random
+import subprocess
+import tempfile
+import threading
+import time
+import traceback
+from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeoutError
+from pathlib import Path
+
+import google.generativeai as genai
+from PyPDF2 import PdfReader
+
+from .utils import get_config, content_hash, clean_filename_to_title, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.extractor')
+
+# ── Gemini Vision singleton (lazy, thread-safe) ──
+
+_vision_keys = None
+_vision_key_index = 0
+_vision_lock = threading.Lock()
+
+
+def _get_vision_keys():
+    """Load Gemini API keys once from .env (same keys the enricher uses)."""
+    global _vision_keys
+    if _vision_keys is not None:
+        return _vision_keys
+
+    with _vision_lock:
+        if _vision_keys is not None:
+            return _vision_keys
+
+        keys = []
+        env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
+        if os.path.exists(env_path):
+            with open(env_path) as f:
+                for line in f:
+                    line = line.strip()
+                    if not line or line.startswith('#') or '=' not in line:
+                        continue
+                    key_name, val = line.split('=', 1)
+                    val = val.strip().strip('"').strip("'")
+                    if key_name.strip().startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
+                        keys.append(val)
+
+        _vision_keys = keys
+        if keys:
+            logger.info(f"Gemini vision OCR: {len(keys)} API key(s) available")
+        else:
+            logger.warning("No Gemini API keys found — vision OCR fallback disabled")
+        return keys
+
+
+def _next_vision_key():
+    """Round-robin through available Gemini keys."""
+    global _vision_key_index
+    keys = _get_vision_keys()
+    if not keys:
+        return None
+    with _vision_lock:
+        key = keys[_vision_key_index % len(keys)]
+        _vision_key_index += 1
+    return key
+
+
+def _is_transient(error_str):
+    """Classify whether an error is transient (worth retrying)."""
+    s = error_str.lower()
+    transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
+                         '500', '503', 'unavailable', 'timeout',
+                         'connection', 'reset by peer', 'broken pipe']
+    return any(sig in s for sig in transient_signals)
+
+
+def _render_page_to_png(pdf_path, page_num_1indexed, dpi=200, timeout=30):
+    """Render a single PDF page to PNG bytes using pdftoppm.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_num_1indexed: 1-indexed page number
+        dpi: Resolution (200 = readable text, reasonable file size)
+        timeout: Subprocess timeout in seconds
+
+    Returns:
+        bytes or None: PNG image data, or None if render fails/blank
+    """
+    with tempfile.TemporaryDirectory() as tmpdir:
+        prefix = os.path.join(tmpdir, 'page')
+        try:
+            subprocess.run(
+                ['pdftoppm', '-f', str(page_num_1indexed), '-l', str(page_num_1indexed),
+                 '-png', '-r', str(dpi), pdf_path, prefix],
+                capture_output=True, timeout=timeout, check=True
+            )
+            png_files = list(Path(tmpdir).glob('*.png'))
+            if not png_files:
+                return None
+
+            img_data = png_files[0].read_bytes()
+
+            # Skip blank pages (tiny image = solid white/blank page)
+            if len(img_data) < 5000:
+                return None
+
+            return img_data
+
+        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
+            return None
+
+
+def _try_gemini_vision(pdf_path, page_num_1indexed, page_timeout=60):
+    """Last-resort OCR: render page to image, send to Gemini vision.
+
+    Only called when PyPDF2, pdftotext, AND Tesseract all failed.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_num_1indexed: 1-indexed page number
+        page_timeout: Caps the page-render subprocess timeout (at most 30s);
+            the Gemini API call itself is not bounded by this value
+
+    Returns:
+        str: Extracted text, or empty string if vision fails
+    """
+    api_key = _next_vision_key()
+    if api_key is None:
+        return ''
+
+    # Render page to PNG
+    img_data = _render_page_to_png(pdf_path, page_num_1indexed, timeout=min(page_timeout, 30))
+    if img_data is None:
+        return ''
+
+    # Call Gemini vision with retry for transient errors
+    last_exc = None
+    for attempt in range(3):
+        try:
+            genai.configure(api_key=api_key)
+            model = genai.GenerativeModel('gemini-2.0-flash')
+            response = model.generate_content([
+                {
+                    'mime_type': 'image/png',
+                    'data': base64.b64encode(img_data).decode('utf-8')
+                },
+                "Extract ALL text from this scanned document page exactly as written. "
+                "Preserve headings, lists, numbered items, tables, and paragraph structure. "
+                "Return ONLY the extracted text, no commentary or markdown formatting."
+            ])
+            if response and response.text:
+                text = response.text.strip()
+                if len(text) > 10:
+                    return text
+            return ''
+
+        except Exception as e:
+            last_exc = e
+            if not _is_transient(str(e)):
+                break  # permanent error — don't retry
+            if attempt < 2:
+                delay = 5.0 * (2 ** attempt) + random.uniform(0, 3)
+                time.sleep(delay)
+                # Rotate to the next key before retrying any transient error
+                api_key = _next_vision_key() or api_key
+
+    if last_exc:
+        logger.debug(f"  Vision OCR failed page {page_num_1indexed}: {last_exc}")
+    return ''
+
+
+
+def _get_page_count(pdf_path):
+    """Get page count using pdfinfo (poppler) as fallback when PdfReader fails."""
+    try:
+        result = subprocess.run(
+            ['pdfinfo', pdf_path],
+            capture_output=True, text=True, timeout=30
+        )
+        if result.returncode == 0:
+            for line in result.stdout.splitlines():
+                if line.startswith('Pages:'):
+                    return int(line.split(':', 1)[1].strip())
+    except Exception:
+        pass
+    return 0
+
+
+def _extract_page_without_reader(pdf_path, page_num_0indexed, page_timeout=30):
+    """Extract text from a single page WITHOUT PyPDF2 reader.
+
+    Used when PdfReader() fails entirely (corrupt/encrypted PDFs).
+    Runs the pdftotext -> Tesseract -> Gemini Vision fallback chain.
+
+    Returns:
+        tuple: (text, ocr_method)
+    """
+    text = ''
+
+    # Method 1: pdftotext (poppler)
+    try:
+        result = subprocess.run(
+            ['pdftotext', '-f', str(page_num_0indexed + 1),
+             '-l', str(page_num_0indexed + 1), pdf_path, '-'],
+            capture_output=True, text=True, timeout=page_timeout
+        )
+        if result.returncode == 0:
+            text = result.stdout
+    except Exception:
+        pass
+
+    if len(text.strip()) >= 50:
+        return text, 'pdftotext'
+
+    # Method 2: pdftoppm + Tesseract OCR
+    try:
+        from PIL import Image
+        import pytesseract
+
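+        # No output prefix is given, so pdftoppm writes the PNG to stdout.
+        # Passing '-' as the prefix makes it write a file named '-.png' instead.
+        result = subprocess.run(
+            ['pdftoppm', '-f', str(page_num_0indexed + 1),
+             '-l', str(page_num_0indexed + 1),
+             '-png', '-singlefile', pdf_path],
+            capture_output=True, timeout=page_timeout * 2
+        )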
+        if result.returncode == 0 and result.stdout:
+            with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
+                tmp.write(result.stdout)
+                tmp.flush()
+                img = Image.open(tmp.name)
+                ocr_text = pytesseract.image_to_string(img)
+                if len(ocr_text.strip()) > len(text.strip()):
+                    text = ocr_text
+    except Exception:
+        pass
+
+    if len(text.strip()) >= 50:
+        return text, 'tesseract'
+
+    # Method 3: Gemini Vision (last resort)
+    vision_text = _try_gemini_vision(pdf_path, page_num_0indexed + 1,
+                                     page_timeout=page_timeout * 2)
+    if len(vision_text.strip()) > len(text.strip()):
+        text = vision_text
+
+    if len(text.strip()) >= 10:
+        return text, 'gemini_vision'
+
+    return text, 'none'
+
+
+# ── Core extraction functions ──
+
+def _pypdf2_extract(reader, page_num):
+    """Extract text from a PyPDF2 page object. Runs inside a thread for timeout."""
+    return reader.pages[page_num].extract_text() or ''
+
+
+def extract_text_from_page(reader, page_num, pdf_path, page_timeout=30):
+    """Extract text from a single page with fallback chain.
+
+    Returns:
+        tuple: (text, ocr_method) where ocr_method is one of:
+            'pypdf2', 'pdftotext', 'tesseract', 'gemini_vision', 'none'
+    """
+    # Method 1: PyPDF2 (wrapped in thread for timeout — extract_text() can hang)
+    text = ''
+    try:
+        ex = ThreadPoolExecutor(1)
+        future = ex.submit(_pypdf2_extract, reader, page_num)
+        try:
+            text = future.result(timeout=page_timeout)
+        except FuturesTimeoutError:
+            logger.warning(f"  PyPDF2 timeout on page {page_num + 1}")
+            text = ''
+        finally:
+            ex.shutdown(wait=False, cancel_futures=True)
+    except Exception:
+        text = ''
+
+    if len(text.strip()) >= 50:
+        return text, 'pypdf2'
+
+    # Method 2: pdftotext via subprocess (inherently timeout-safe)
+    try:
+        result = subprocess.run(
+            ['pdftotext', '-f', str(page_num + 1), '-l', str(page_num + 1), pdf_path, '-'],
+            capture_output=True, text=True, timeout=page_timeout
+        )
+        if result.returncode == 0 and len(result.stdout.strip()) > len(text.strip()):
+            text = result.stdout
+    except Exception:
+        pass
+
+    if len(text.strip()) >= 50:
+        return text, 'pdftotext'
+
+    # Method 3: pdftoppm + Tesseract OCR
+    try:
+        from PIL import Image
+        import pytesseract
+
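+        # No output prefix is given, so pdftoppm writes the PNG to stdout.
+        # Passing '-' as the prefix makes it write a file named '-.png' instead.
+        result = subprocess.run(
+            ['pdftoppm', '-f', str(page_num + 1), '-l', str(page_num + 1),
+             '-png', '-singlefile', pdf_path],
+            capture_output=True, timeout=page_timeout * 2
+        )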
+        if result.returncode == 0 and result.stdout:
+            with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
+                tmp.write(result.stdout)
+                tmp.flush()
+                img = Image.open(tmp.name)
+                ocr_text = pytesseract.image_to_string(img)
+                if len(ocr_text.strip()) > len(text.strip()):
+                    text = ocr_text
+    except Exception:
+        pass
+
+    if len(text.strip()) >= 50:
+        return text, 'tesseract'
+
+    # Method 4: Gemini Vision (last resort — costs API calls but handles scanned docs)
+    vision_text = _try_gemini_vision(pdf_path, page_num + 1, page_timeout=page_timeout * 2)
+    if len(vision_text.strip()) > len(text.strip()):
+        text = vision_text
+
+    if len(text.strip()) >= 10:
+        return text, 'gemini_vision'
+
+    return text, 'none'
+
+
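+def extract_book_metadata(first_page_text, config):
+    """Ask Gemini for the book title and author found in the first page text.
+
+    Returns:
+        tuple: (title, author); either may be None. Returns (None, None) when
+        no key is configured, the text is under 20 characters, or the call fails.
+    """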
+    keys = config.get('gemini_keys', [])
+    if not keys or len(first_page_text.strip()) < 20:
+        return None, None
+
+    try:
+        genai.configure(api_key=keys[0])
+        model = genai.GenerativeModel(
+            config['gemini']['model'],
+            generation_config={"response_mime_type": config['gemini']['response_mime_type']}
+        )
+        prompt = f"""Extract the book title and author from this first page text.
+Return JSON: {{"title": "...", "author": "..."}}
+If unknown, use null for that field.
+
+Text:
+{first_page_text[:3000]}"""
+
+        response = model.generate_content(prompt)
+        data = json.loads(response.text)
+        return data.get('title'), data.get('author')
+    except Exception as e:
+        logger.warning(f"Metadata extraction failed: {e}")
+        return None, None
+
+
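+def extract_single(file_hash, db, config):
+    """Extract one document's text page by page into <text dir>/<hash>/page_NNNN.txt.
+
+    Applies three safeguards: a file size limit, a total time budget for the
+    document, and a per-page fallback chain with its own timeout.
+
+    Returns:
+        bool: True if the document was marked 'extracted' (fully or partially),
+        False if it was missing or marked failed.
+    """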
+    doc = db.get_document(file_hash)
+    if not doc:
+        return False
+
+    pdf_path = doc['path']
+    filename = doc['filename']
+    text_dir = os.path.join(config['paths']['text'], file_hash)
+
+    if not os.path.exists(pdf_path):
+        db.mark_failed(file_hash, f"File not found: {pdf_path}")
+        return False
+
+    # Layer 1: Pre-flight size check
+    proc = config.get('processing', {})
+    max_size_mb = proc.get('max_pdf_size_mb', 200)
+    try:
+        file_size_mb = os.path.getsize(pdf_path) / 1048576
+    except OSError as e:
+        db.mark_failed(file_hash, f"Cannot stat file: {e}")
+        return False
+
+    if file_size_mb > max_size_mb:
+        msg = f"Skipped: {file_size_mb:.0f}MB exceeds {max_size_mb}MB limit"
+        logger.warning(f"SIZE SKIP: {filename} — {msg}")
+        db.mark_failed(file_hash, msg)
+        return False
+
+    db.update_status(file_hash, 'extracting')
+
+    # Layer 2/3 setup
+    max_doc_seconds = proc.get('extract_timeout', 300)
+    page_timeout = proc.get('page_timeout', 30)
+    start_time = time.time()
+    page_count = 0
+    pages_extracted = 0
+    skipped_pages = 0
+    ocr_pages = []
+    ocr_methods = {'pypdf2': 0, 'pdftotext': 0, 'tesseract': 0, 'gemini_vision': 0, 'none': 0}
+
+    try:
+        os.makedirs(text_dir, exist_ok=True)
+        # Try PyPDF2 first; fall back to poppler-only extraction if it fails
+        reader = None
+        use_reader = True
+        try:
+            reader = PdfReader(pdf_path)
+            page_count = len(reader.pages)
+        except Exception as pdf_err:
+            logger.warning(f"PdfReader failed for {filename}: {pdf_err} — using poppler fallback")
+            use_reader = False
+            page_count = _get_page_count(pdf_path)
+            if page_count == 0:
+                db.mark_failed(file_hash, f"PdfReader failed and pdfinfo returned 0 pages: {str(pdf_err)[:200]}")
+                return False
+
+        for i in range(page_count):
+            # Layer 2: Check total document time budget
+            elapsed = time.time() - start_time
+            if elapsed > max_doc_seconds:
+                msg = f"Timed out after {elapsed:.0f}s at page {i}/{page_count}"
+                logger.warning(f"TIMEOUT: {filename} — {msg}")
+                if pages_extracted > 0:
+                    _save_partial(file_hash, db, doc, config, text_dir,
+                                  page_count, pages_extracted, ocr_pages,
+                                  f"Partial: {pages_extracted}/{page_count} pages "
+                                  f"(timed out after {elapsed:.0f}s)",
+                                  ocr_methods=ocr_methods)
+                    return True
+                else:
+                    db.mark_failed(file_hash, msg)
+                    return False
+
+            # Layer 3: Per-page extraction with fallback chain
+            try:
+                if use_reader:
+                    text, method = extract_text_from_page(reader, i, pdf_path, page_timeout)
+                else:
+                    text, method = _extract_page_without_reader(pdf_path, i, page_timeout)
+                ocr_methods[method] += 1
+                if method in ('tesseract', 'gemini_vision'):
+                    ocr_pages.append(i + 1)
+            except Exception as e:
+                logger.warning(f"  Page {i+1}/{page_count} failed: {e} — skipping")
+                text = ''
+                skipped_pages += 1
+                ocr_methods['none'] += 1
+
+            page_file = os.path.join(text_dir, f"page_{i+1:04d}.txt")
+            with open(page_file, 'w', encoding='utf-8') as f:
+                f.write(text)
+
+            if text.strip():
+                pages_extracted += 1
+
+            # Progress logging every 50 pages (more frequent since vision is slower)
+            if (i + 1) % 50 == 0:
+                el = time.time() - start_time
+                rate = (i + 1) / el if el > 0 else 0
+                vision_n = ocr_methods['gemini_vision']
+                vision_note = f", {vision_n} vision" if vision_n else ""
+                logger.info(f"  {filename}: page {i+1}/{page_count} "
+                            f"({rate:.1f} pages/sec, {skipped_pages} skipped{vision_note})")
+
+        # Full extraction complete — save metadata
+        first_page_text = ''
+        first_page_file = os.path.join(text_dir, 'page_0001.txt')
+        if os.path.exists(first_page_file):
+            with open(first_page_file, encoding='utf-8') as f:
+                first_page_text = f.read()
+
+        book_title, book_author = extract_book_metadata(first_page_text, config)
+
+        if not book_title:
+            book_title = clean_filename_to_title(filename)
+
+        meta = {
+            'hash': file_hash,
+            'filename': filename,
+            'page_count': page_count,
+            'ocr_pages': ocr_pages,
+            'skipped_pages': skipped_pages,
+            'ocr_methods': ocr_methods,
+        }
+        with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
+            json.dump(meta, f, indent=2)
+
+        kwargs = {
+            'page_count': page_count,
+            'pages_extracted': pages_extracted,
+            'book_title': book_title,
+        }
+        if book_author:
+            kwargs['book_author'] = book_author
+        if skipped_pages > 0:
+            kwargs['error_message'] = (f"Partial: {pages_extracted}/{page_count} pages "
+                                       f"({skipped_pages} pages failed and were skipped)")
+
+        elapsed = time.time() - start_time
+        db.update_status(file_hash, 'extracted', **kwargs)
+        ocr_note = f", {len(ocr_pages)} OCR" if ocr_pages else ""
+        skip_note = f", {skipped_pages} skipped" if skipped_pages > 0 else ""
+        vision_note = f", {ocr_methods['gemini_vision']} vision" if ocr_methods['gemini_vision'] else ""
+        logger.info(f"Extracted {filename}: {pages_extracted}/{page_count} pages "
+                     f"({elapsed:.1f}s{ocr_note}{vision_note}{skip_note})")
+        return True
+
+    except Exception as e:
+        logger.error(f"Extraction failed for {file_hash}: {e}\n{traceback.format_exc()}")
+        if pages_extracted > 0:
+            _save_partial(file_hash, db, doc, config, text_dir,
+                          page_count, pages_extracted, ocr_pages,
+                          f"Partial: {pages_extracted}/{page_count} pages "
+                          f"({str(e)[:150]})",
+                          ocr_methods=ocr_methods)
+            return True
+        db.mark_failed(file_hash, str(e)[:500])
+        return False
+
+
+def _save_partial(file_hash, db, doc, config, text_dir, page_count,
+                  pages_extracted, ocr_pages, error_msg, ocr_methods=None):
+    """Save metadata and mark a partial extraction as 'extracted'."""
+    book_title = clean_filename_to_title(doc['filename'])
+
+    first_page_file = os.path.join(text_dir, 'page_0001.txt')
+    if os.path.exists(first_page_file):
+        with open(first_page_file, encoding='utf-8') as f:
+            first_text = f.read()
+        if len(first_text.strip()) > 20:
+            title, _ = extract_book_metadata(first_text, config)
+            if title:
+                book_title = title
+
+    meta = {
+        'hash': file_hash,
+        'filename': doc['filename'],
+        'page_count': page_count,
+        'ocr_pages': ocr_pages,
+        'partial': True,
+    }
+    if ocr_methods:
+        meta['ocr_methods'] = ocr_methods
+    with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
+        json.dump(meta, f, indent=2)
+
+    db.update_status(file_hash, 'extracted',
+                     page_count=page_count,
+                     pages_extracted=pages_extracted,
+                     book_title=book_title,
+                     error_message=error_msg)
+    logger.info(f"  Saved partial extraction: {pages_extracted}/{page_count} pages")
+
+
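+def run_extraction(workers=None):
+    """Extract all documents with status 'queued' using a thread pool.
+
+    Args:
+        workers: Worker thread count; defaults to processing.extract_workers
+            from the config.
+
+    Returns:
+        int: Number of documents extracted successfully (partial extractions
+        count as successes).
+    """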
+    config = get_config()
+    db = StatusDB()
+    workers = workers or config['processing']['extract_workers']
+
+    queued = db.get_by_status('queued')
+    if not queued:
+        logger.info("No queued documents to extract")
+        return 0
+
+    logger.info(f"Extracting {len(queued)} documents with {workers} workers")
+    success = 0
+
+    with ThreadPoolExecutor(max_workers=workers) as pool:
+        futures = {pool.submit(extract_single, doc['hash'], StatusDB(), config): doc for doc in queued}
+        for future in as_completed(futures):
+            doc = futures[future]
+            try:
+                if future.result():
+                    success += 1
+            except Exception as e:
+                logger.error(f"Worker error for {doc['hash']}: {e}")
+
+    logger.info(f"Extraction complete: {success}/{len(queued)} succeeded")
+    return success
--- a/lib/ingester.py
+++ b/lib/ingester.py
@ -0,0 +1,159 @@
+"""
+RECON Intel Ingester
+
+ARGUS intelligence feed intake. Embeds intel JSON and inserts into Qdrant
+with source_type='intel_feed'.
+
+Dependencies: requests, qdrant-client
+Config: embedding, vector_db
+"""
+import json
+import os
+import time
+import traceback
+
+import requests as http_requests
+from qdrant_client import QdrantClient
+from qdrant_client.models import PointStruct
+
+from .utils import get_config, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.ingester')
+
+
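+def ingest_intel(intel_data, config=None):
+    """Store one intel item in SQLite, embed its content, and upsert it into Qdrant.
+
+    Args:
+        intel_data: Dict with required keys 'source', 'category', 'content';
+            other fields fall back to defaults.
+        config: Loaded config; fetched with get_config() when None.
+
+    Returns:
+        int or None: The intel row id, or None if validation or ingestion fails.
+    """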
+    if config is None:
+        config = get_config()
+
+    db = StatusDB()
+
+    required = ['source', 'category', 'content']
+    for field in required:
+        if field not in intel_data:
+            logger.error(f"Missing required field: {field}")
+            return None
+
+    try:
+        conn = db._get_conn()
+        cursor = conn.execute(
+            """INSERT INTO intel (source, timestamp, region, category, content,
+               summary, key_facts, credibility_score, verification_status)
+               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+            (
+                intel_data.get('source', 'unknown'),
+                intel_data.get('timestamp', time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())),
+                intel_data.get('region', 'unknown'),
+                intel_data['category'],
+                intel_data['content'],
+                intel_data.get('summary', ''),
+                json.dumps(intel_data.get('key_facts', [])),
+                intel_data.get('credibility_score', 0.5),
+                intel_data.get('verification_status', 'unverified'),
+            )
+        )
+        intel_id = cursor.lastrowid
+        conn.commit()
+
+        url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
+        resp = http_requests.post(url, json={
+            "model": config['embedding']['model'],
+            "input": intel_data['content']
+        }, timeout=120)
+        resp.raise_for_status()
+        vector = resp.json()['embeddings'][0]
+
+        qdrant = QdrantClient(
+            host=config['vector_db']['host'],
+            port=config['vector_db']['port'],
+            timeout=60
+        )
+
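+        # Offset intel row ids into a high range, presumably so they do not
+        # collide with other point ids in the shared collection.
+        point_id = intel_id + 2**60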
+
+        payload = {
+            'source_type': 'intel_feed',
+            'intel_id': intel_id,
+            'source': intel_data.get('source', 'unknown'),
+            'region': intel_data.get('region', 'unknown'),
+            'category': intel_data['category'],
+            'content': intel_data['content'],
+            'summary': intel_data.get('summary', ''),
+            'key_facts': intel_data.get('key_facts', []),
+            'credibility_score': intel_data.get('credibility_score', 0.5),
+            'verification_status': intel_data.get('verification_status', 'unverified'),
+            'timestamp': intel_data.get('timestamp', ''),
+            'language': 'en',
+        }
+
+        qdrant.upsert(
+            collection_name=config['vector_db']['collection'],
+            points=[PointStruct(id=point_id, vector=vector, payload=payload)]
+        )
+
+        conn.execute("UPDATE intel SET vector_id = ? WHERE id = ?", (point_id, intel_id))
+        conn.commit()
+
+        logger.info(f"Ingested intel #{intel_id} from {intel_data.get('source', 'unknown')}")
+        return intel_id
+
+    except Exception as e:
+        logger.error(f"Intel ingestion failed: {e}\n{traceback.format_exc()}")
+        return None
+
+
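+def ingest_file(filepath, config=None):
+    """Ingest a JSON file holding either one intel item or a list of items.
+
+    Returns:
+        list: One ingest_intel() result per item (row id or None); an empty
+        list if the file cannot be read or parsed.
+    """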
+    if config is None:
+        config = get_config()
+
+    try:
+        with open(filepath, encoding='utf-8') as f:
+            data = json.load(f)
+
+        if isinstance(data, list):
+            results = []
+            for item in data:
+                result = ingest_intel(item, config)
+                results.append(result)
+            success = sum(1 for r in results if r is not None)
+            logger.info(f"Ingested {success}/{len(data)} items from {filepath}")
+            return results
+        else:
+            return [ingest_intel(data, config)]
+
+    except Exception as e:
+        logger.error(f"Failed to ingest file {filepath}: {e}")
+        return []
+
+
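+def run_ingestion(directory=None):
+    """Ingest every *.json file in the intel directory.
+
+    A file is moved to the 'processed' subdirectory once at least one of its
+    items has been ingested.
+
+    Returns:
+        int: Total number of items ingested.
+    """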
+    config = get_config()
+    intel_dir = directory or config['paths']['intel']
+
+    if not os.path.exists(intel_dir):
+        logger.info(f"Intel directory does not exist: {intel_dir}")
+        return 0
+
+    json_files = sorted([
+        f for f in os.listdir(intel_dir)
+        if f.endswith('.json') and not f.startswith('.')
+    ])
+
+    if not json_files:
+        logger.info("No intel files to ingest")
+        return 0
+
+    total = 0
+    for jf in json_files:
+        filepath = os.path.join(intel_dir, jf)
+        results = ingest_file(filepath, config)
+        ingested = sum(1 for r in results if r is not None)
+        total += ingested
+
+        if ingested > 0:
+            done_dir = os.path.join(intel_dir, 'processed')
+            os.makedirs(done_dir, exist_ok=True)
+            os.rename(filepath, os.path.join(done_dir, jf))
+
+    logger.info(f"Intel ingestion complete: {total} items ingested")
+    return total
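+
+
+# Illustrative sketch (not part of the original commit): the minimal shape of
+# an intel item accepted by ingest_intel(), inferred from the field lookups
+# above. All values are placeholders, not real feed data.
+#
+#     example_item = {
+#         'source': 'example-feed',        # required
+#         'category': 'infrastructure',    # required
+#         'content': 'Free-text report body that gets embedded.',  # required
+#         'region': 'unknown',             # optional, defaults to 'unknown'
+#         'key_facts': [],                 # optional, stored as JSON in SQLite
+#         'credibility_score': 0.5,        # optional, defaults to 0.5
+#     }
+#     intel_id = ingest_intel(example_item)  # row id, or None on failure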
--- a/lib/key_manager.py
+++ b/lib/key_manager.py
@ -0,0 +1,270 @@
+"""
+RECON Key Manager - Thread-safe API key management with hot-reload.
+
+Provides a singleton KeyManager that workers (enricher, extractor) read from
+instead of loading .env directly. Dashboard can update keys at runtime without
+restarting the service.
+
+Dependencies: None beyond stdlib + requests (already in requirements.txt)
+Config: Reads/writes /opt/recon/.env
+"""
+
+import os
+import re
+import time
+import logging
+import threading
+import requests
+
+logger = logging.getLogger('recon.key_manager')
+
+class KeyManager:
+    """Thread-safe API key store with hot-reload and validation."""
+    
+    _instance = None
+    _lock = threading.Lock()
+    
+    def __new__(cls):
+        if cls._instance is None:
+            with cls._lock:
+                if cls._instance is None:
+                    cls._instance = super().__new__(cls)
+                    cls._instance._initialized = False
+        return cls._instance
+    
+    def __init__(self):
+        if self._initialized:
+            return
+        self._keys_lock = threading.RLock()
+        self._gemini_keys = []
+        self._env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
+        self._last_loaded = None
+        self._key_stats = {}  # key_index -> {calls, errors, last_used, valid, last_validated}
+        self._load_from_env()
+        self._initialized = True
+        logger.info(f"KeyManager initialized with {len(self._gemini_keys)} Gemini key(s)")
+    
+    # ── Read Operations ──
+    
+    def get_gemini_keys(self):
+        """Return a copy of current Gemini keys. Thread-safe."""
+        with self._keys_lock:
+            return list(self._gemini_keys)
+    
+    def get_gemini_key(self, index=0):
+        """Get a single Gemini key by index. Returns None if out of range."""
+        with self._keys_lock:
+            if 0 <= index < len(self._gemini_keys):
+                return self._gemini_keys[index]
+            return None
+    
+    def get_gemini_key_count(self):
+        """Return number of loaded Gemini keys."""
+        with self._keys_lock:
+            return len(self._gemini_keys)
+    
+    def get_masked_keys(self):
+        """Return keys masked for display: first 8 + ... + last 4 chars."""
+        with self._keys_lock:
+            result = []
+            for i, key in enumerate(self._gemini_keys):
+                if len(key) > 16:
+                    masked = key[:8] + '...' + key[-4:]
+                elif len(key) > 8:
+                    masked = key[:4] + '...' + key[-2:]
+                else:
+                    masked = '****'
+                stats = self._key_stats.get(i, {})
+                result.append({
+                    'index': i,
+                    'masked': masked,
+                    'length': len(key),
+                    'calls': stats.get('calls', 0),
+                    'errors': stats.get('errors', 0),
+                    'last_used': stats.get('last_used', None),
+                    'valid': stats.get('valid', None),
+                    'last_validated': stats.get('last_validated', None),
+                })
+            return result
+    
+    # ── Write Operations (all persist to .env) ──
+    
+    def set_gemini_keys(self, keys):
+        """Replace all Gemini keys. Persists to .env. Returns success bool."""
+        # Filter empty strings
+        keys = [k.strip() for k in keys if k.strip()]
+        with self._keys_lock:
+            self._gemini_keys = keys
+            self._key_stats = {}  # Reset stats on full replace
+            self._persist_to_env()
+            logger.info(f"Gemini keys replaced: {len(keys)} key(s) loaded")
+        return True
+    
+    def add_gemini_key(self, key):
+        """Add a single Gemini key. Persists to .env. Returns new index."""
+        key = key.strip()
+        if not key:
+            raise ValueError("Key cannot be empty")
+        with self._keys_lock:
+            # Check for duplicates
+            if key in self._gemini_keys:
+                raise ValueError("Key already exists")
+            self._gemini_keys.append(key)
+            idx = len(self._gemini_keys) - 1
+            self._persist_to_env()
+            logger.info(f"Gemini key added at index {idx}")
+            return idx
+    
+    def remove_gemini_key(self, index):
+        """Remove a Gemini key by index. Persists to .env. Returns removed key (masked)."""
+        with self._keys_lock:
+            if index < 0 or index >= len(self._gemini_keys):
+                raise IndexError(f"Key index {index} out of range (have {len(self._gemini_keys)} keys)")
+            if len(self._gemini_keys) <= 1:
+                raise ValueError("Cannot remove last key — pipeline needs at least 1 Gemini key")
+            key = self._gemini_keys.pop(index)
+            # Rebuild stats with shifted indices
+            new_stats = {}
+            for i, stats in self._key_stats.items():
+                if i < index:
+                    new_stats[i] = stats
+                elif i > index:
+                    new_stats[i - 1] = stats
+            self._key_stats = new_stats
+            self._persist_to_env()
+            masked = key[:8] + '...' + key[-4:] if len(key) > 16 else '****'
+            logger.info(f"Gemini key removed at index {index}: {masked}")
+            return masked
+    
+    def replace_gemini_key(self, index, new_key):
+        """Replace a single Gemini key at index. Persists to .env."""
+        new_key = new_key.strip()
+        if not new_key:
+            raise ValueError("Key cannot be empty")
+        with self._keys_lock:
+            if index < 0 or index >= len(self._gemini_keys):
+                raise IndexError(f"Key index {index} out of range")
+            # Check duplicate (but allow replacing with same key)
+            if new_key in self._gemini_keys and self._gemini_keys[index] != new_key:
+                raise ValueError("Key already exists at another index")
+            self._gemini_keys[index] = new_key
+            if index in self._key_stats:
+                self._key_stats[index] = {}  # Reset stats for replaced key
+            self._persist_to_env()
+            logger.info(f"Gemini key replaced at index {index}")
+    
+    # ── Validation ──
+    
+    def validate_key(self, key):
+        """
+        Test a Gemini API key by listing models.
+        Returns (valid: bool, message: str).
+        """
+        try:
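+            # Send the key in a header rather than the query string so it does
+            # not end up in logged URLs or exception messages.
+            resp = requests.get(
+                "https://generativelanguage.googleapis.com/v1beta/models",
+                headers={"x-goog-api-key": key},
+                timeout=10
+            )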
+            if resp.status_code == 200 and 'models' in resp.text:
+                return True, "Valid — API responded"
+            elif resp.status_code == 400:
+                return False, f"Invalid key (HTTP {resp.status_code})"
+            elif resp.status_code == 403:
+                return False, "Key disabled or quota exhausted"
+            elif resp.status_code == 429:
+                return True, "Valid — but currently rate-limited"
+            else:
+                return False, f"Unexpected response (HTTP {resp.status_code})"
+        except requests.Timeout:
+            return False, "Timeout — could not reach Gemini API"
+        except requests.ConnectionError:
+            return False, "Connection error — check network"
+        except Exception as e:
+            return False, f"Error: {str(e)}"
+    
+    def validate_all(self):
+        """Validate all loaded Gemini keys. Returns list of results."""
+        results = []
+        with self._keys_lock:
+            keys_copy = list(enumerate(self._gemini_keys))
+        
+        for i, key in keys_copy:
+            valid, message = self.validate_key(key)
+            with self._keys_lock:
+                if i not in self._key_stats:
+                    self._key_stats[i] = {}
+                self._key_stats[i]['valid'] = valid
+                self._key_stats[i]['last_validated'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
+            results.append({'index': i, 'valid': valid, 'message': message})
+            time.sleep(0.2)  # Don't hammer the API
+        
+        return results
+    
+    # ── Stats tracking (called by enricher/extractor) ──
+    
+    def record_usage(self, key_index, success=True):
+        """Record a key usage event. Called by workers after each Gemini call."""
+        with self._keys_lock:
+            if key_index not in self._key_stats:
+                self._key_stats[key_index] = {'calls': 0, 'errors': 0}
+            self._key_stats[key_index]['calls'] = self._key_stats[key_index].get('calls', 0) + 1
+            if not success:
+                self._key_stats[key_index]['errors'] = self._key_stats[key_index].get('errors', 0) + 1
+            self._key_stats[key_index]['last_used'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
+    
+    # ── Internal ──
+    
+    def _load_from_env(self):
+        """Load Gemini keys from .env file."""
+        keys = []
+        if os.path.exists(self._env_path):
+            with open(self._env_path, 'r') as f:
+                for line in f:
+                    line = line.strip()
+                    if line and not line.startswith('#'):
+                        match = re.match(r'^GEMINI_KEY(?:_\d+)?=(.+)$', line)
+                        if match:
+                            val = match.group(1).strip().strip('"').strip("'")
+                            if val:
+                                keys.append(val)
+        self._gemini_keys = keys
+        self._last_loaded = time.time()
+    
+    def _persist_to_env(self):
+        """Write current keys back to .env file, preserving non-Gemini lines."""
+        other_lines = []
+        if os.path.exists(self._env_path):
+            with open(self._env_path, 'r') as f:
+                for line in f:
+                    stripped = line.strip()
+                    if stripped and not re.match(r'^GEMINI_KEY(?:_\d+)?=', stripped):
+                        other_lines.append(line.rstrip('\n'))
+        
+        with open(self._env_path, 'w') as f:
+            # Write non-Gemini lines first
+            for line in other_lines:
+                f.write(line + '\n')
+            # Write Gemini keys
+            for i, key in enumerate(self._gemini_keys, 1):
+                f.write(f'GEMINI_KEY_{i}={key}\n')
+        
+        self._last_loaded = time.time()
+        logger.info(f"Persisted {len(self._gemini_keys)} Gemini key(s) to {self._env_path}")
+    
+    def reload_from_env(self):
+        """Force reload from .env (e.g., if edited externally)."""
+        with self._keys_lock:
+            self._load_from_env()
+            logger.info(f"Reloaded {len(self._gemini_keys)} Gemini key(s) from .env")
+        return len(self._gemini_keys)
+
+
+# Module-level convenience — import and use anywhere
+_manager = None
+
+def get_key_manager():
+    """Get the singleton KeyManager instance."""
+    global _manager
+    if _manager is None:
+        _manager = KeyManager()
+    return _manager
--- a/lib/new_pipeline.py
+++ b/lib/new_pipeline.py
--- a/lib/organizer.py
+++ b/lib/organizer.py
@ -0,0 +1,374 @@
+"""
+RECON Library Organizer
+
+After a document completes the pipeline (extract -> enrich -> embed),
+this module classifies it by dominant domain and moves it into the
+correct Domain/Subdomain/ folder with a sanitized filename.
+
+Two modes:
+  1. Per-document: determine_dominant_domain() from on-disk concept JSONs
+  2. Bulk manifest: organize_from_manifest() using pre-built manifest JSON
+
+Path updates trigger the existing catalogue.path_updated_at mechanism,
+which sync_qdrant_paths() propagates to Qdrant payloads.
+"""
+import json
+import logging
+import os
+import shutil
+from collections import Counter
+
+from .utils import sanitize_filename
+
+logger = logging.getLogger('recon.organizer')
+
+# ── Domain folder mapping (canonical) ───────────────────────────────────
+# Keys = exact domain strings from Gemini enrichment
+# Values = filesystem-safe folder names
+
+DOMAIN_FOLDERS = {
+    'Agriculture & Livestock': 'Agriculture-and-Livestock',
+    'Civil Organization': 'Civil-Organization',
+    'Communications': 'Communications',
+    'Food Systems': 'Food-Systems',
+    'Foundational Skills': 'Foundational-Skills',
+    'Logistics': 'Logistics',
+    'Medical': 'Medical',
+    'Navigation': 'Navigation',
+    'Operations': 'Operations',
+    'Power Systems': 'Power-Systems',
+    'Preservation & Storage': 'Preservation-and-Storage',
+    'Security': 'Security',
+    'Shelter & Construction': 'Shelter-and-Construction',
+    'Technology': 'Technology',
+    'Tools & Equipment': 'Tools-and-Equipment',
+    'Vehicles': 'Vehicles',
+    'Water Systems': 'Water-Systems',
+    'Wilderness Skills': 'Wilderness-Skills',
+}
+
+
+def normalize_folder_name(name):
+    """Normalize a domain/subdomain name to a folder-safe string.
+
+    Examples:
+        'Edible Plants & Foraging' -> 'Edible-Plants-and-Foraging'
+        'emergency medicine' -> 'Emergency-Medicine'
+    """
+    if not name:
+        return 'Uncategorized'
+    name = name.strip()
+    name = name.replace('&', 'and')
+    words = name.split()
+    titled = []
+    for w in words:
+        if w.lower() in ('and', 'of', 'the', 'to', 'for', 'in', 'on', 'at'):
+            titled.append(w.lower())
+        else:
+            titled.append(w.capitalize())
+    return '-'.join(titled)
+
+
+def determine_dominant_domain(doc_hash, data_dir):
+    """Determine a document's dominant domain from on-disk concept JSONs.
+
+    Reads all /data/concepts/{hash}/window_*.json files, counts domain
+    occurrences across all concepts, returns the top domain.
+
+    Args:
+        doc_hash: Document hash
+        data_dir: Path to /opt/recon/data
+
+    Returns:
+        (domain, subdomain, confidence) tuple.
+        domain/subdomain are strings, or None when there are no concepts or the top domain is ambiguous.
+        confidence is float 0-1 (top domain count / total concepts).
+    """
+    concepts_dir = os.path.join(data_dir, 'concepts', doc_hash)
+    if not os.path.isdir(concepts_dir):
+        return (None, None, 0.0)
+
+    domain_counter = Counter()
+    subdomain_counter = Counter()
+    total_concepts = 0
+
+    for fname in os.listdir(concepts_dir):
+        if not fname.startswith('window_') or not fname.endswith('.json'):
+            continue
+        fpath = os.path.join(concepts_dir, fname)
+        try:
+            with open(fpath, 'r') as f:
+                concepts = json.load(f)
+        except (json.JSONDecodeError, OSError):
+            continue
+
+        if not isinstance(concepts, list):
+            continue
+
+        for concept in (c for c in concepts if isinstance(c, dict)):  # skip malformed entries
+            total_concepts += 1
+            # domain is usually a list with one element
+            dom = concept.get('domain')
+            if isinstance(dom, list):
+                for d in dom:
+                    if isinstance(d, str):
+                        domain_counter[d] += 1
+            elif isinstance(dom, str):
+                domain_counter[dom] += 1
+
+            sub = concept.get('subdomain')
+            if isinstance(sub, list):
+                for s in sub:
+                    if isinstance(s, str):
+                        subdomain_counter[s] += 1
+            elif isinstance(sub, str):
+                subdomain_counter[sub] += 1
+
+    if total_concepts == 0 or not domain_counter:
+        return (None, None, 0.0)
+
+    top_domains = domain_counter.most_common(2)
+    dom_name = top_domains[0][0]
+    dom_count = top_domains[0][1]
+    confidence = dom_count / total_concepts
+
+    # Check ambiguity
+    is_ambiguous = False
+    if len(top_domains) >= 2:
+        dom2_count = top_domains[1][1]
+        if dom2_count >= dom_count * 0.8:
+            is_ambiguous = True
+    if confidence < 0.4:
+        is_ambiguous = True
+
+    if is_ambiguous:
+        return (None, None, confidence)
+
+    top_sub = subdomain_counter.most_common(1)
+    sub_name = top_sub[0][0] if top_sub else None
+
+    return (dom_name, sub_name, confidence)
+
+
+def _build_target_path(library_root, domain, subdomain, filename, doc_hash):
+    """Build the target path for a document, handling domain mapping and collisions.
+
+    Returns:
+        (target_path, sanitized_filename) tuple
+    """
+    san_name = sanitize_filename(filename, doc_hash=doc_hash)
+
+    if domain is None:
+        # Unclassified — leave in place (don't move to Review folder for pipeline)
+        return (None, san_name)
+
+    domain_folder = DOMAIN_FOLDERS.get(domain)
+    if not domain_folder:
+        domain_folder = normalize_folder_name(domain)
+
+    if subdomain:
+        sub_folder = normalize_folder_name(subdomain)
+    else:
+        sub_folder = 'General'
+
+    target_dir = os.path.join(library_root, domain_folder, sub_folder)
+    target_path = os.path.join(target_dir, san_name)
+
+    # Handle collision at target
+    if os.path.exists(target_path):
+        stem, ext = os.path.splitext(san_name)
+        h6 = doc_hash[:6]
+        new_name = '{} [{}]{}'.format(stem, h6, ext)
+        if len(new_name) > 120:
+            max_stem = 120 - len(ext) - 9
+            stem = stem[:max_stem].rstrip('. -,')
+            new_name = '{} [{}]{}'.format(stem, h6, ext)
+        san_name = new_name
+        target_path = os.path.join(target_dir, san_name)
+
+    return (target_path, san_name)
+
+
+def organize_document(doc_hash, db, config, dry_run=False):
+    """Organize a single document: classify, rename, and move.
+
+    Args:
+        doc_hash: Document hash
+        db: StatusDB instance
+        config: RECON config dict
+        dry_run: If True, don't actually move files
+
+    Returns:
+        dict with keys: hash, action, before_path, after_path, domain, subdomain, error
+    """
+    library_root = config['library_root']
+    data_dir = config['paths']['data']
+
+    result = {
+        'hash': doc_hash,
+        'action': 'skip',
+        'before_path': None,
+        'after_path': None,
+        'domain': None,
+        'subdomain': None,
+        'error': None,
+    }
+
+    # Look up current path from catalogue
+    conn = db._get_conn()
+    row = conn.execute(
+        "SELECT path, filename FROM catalogue WHERE hash = ?", (doc_hash,)
+    ).fetchone()
+    if not row:
+        result['error'] = 'Not in catalogue'
+        return result
+
+    current_path = row['path']
+    current_filename = row['filename']
+    result['before_path'] = current_path
+
+    # Verify file exists on disk
+    if not dry_run and not os.path.exists(current_path):
+        result['error'] = 'File not found on disk'
+        return result
+
+    # Determine domain from concept JSONs
+    domain, subdomain, confidence = determine_dominant_domain(doc_hash, data_dir)
+    result['domain'] = domain
+    result['subdomain'] = subdomain
+
+    if domain is None:
+        result['action'] = 'skip_unclassified'
+        return result
+
+    # Build target path
+    target_path, san_name = _build_target_path(
+        library_root, domain, subdomain, current_filename, doc_hash
+    )
+
+    if target_path is None:
+        result['action'] = 'skip_unclassified'
+        return result
+
+    result['after_path'] = target_path
+
+    # Already at target?
+    if os.path.abspath(current_path) == os.path.abspath(target_path):
+        result['action'] = 'already_organized'
+        # Still mark as organized
+        if not dry_run:
+            db.mark_organized(doc_hash)
+        return result
+
+    if dry_run:
+        result['action'] = 'would_move'
+        return result
+
+    # Move the file
+    try:
+        target_dir = os.path.dirname(target_path)
+        os.makedirs(target_dir, exist_ok=True)
+        shutil.move(current_path, target_path)
+
+        # Update catalogue (triggers path_updated_at for Qdrant sync)
+        db.update_catalogue_path(doc_hash, target_path, san_name)
+        db.mark_organized(doc_hash)
+
+        result['action'] = 'moved'
+        logger.info("Organized %s -> %s [%s/%s]",
+                     doc_hash[:8], target_path, domain, subdomain)
+    except Exception as e:
+        result['action'] = 'error'
+        result['error'] = str(e)
+        logger.error("Failed to organize %s: %s", doc_hash[:8], e)
+
+    return result
+
+
+def organize_from_manifest(manifest_path, db, config, dry_run=False):
+    """Bulk migration using a pre-built manifest JSON.
+
+    The manifest is produced by recon_manifest_builder.py and contains
+    entries with current_path, sanitized_path, sanitized_filename, hash, etc.
+
+    Args:
+        manifest_path: Path to manifest JSON file
+        db: StatusDB instance
+        config: RECON config dict
+        dry_run: If True, don't actually move files
+
+    Returns:
+        dict with summary stats: total, moved (would-move count in dry run), skipped, already_organized, errors, not_found
+    """
+    with open(manifest_path, 'r') as f:
+        entries = json.load(f)
+
+    stats = {
+        'total': len(entries),
+        'moved': 0,
+        'skipped': 0,
+        'already_organized': 0,
+        'errors': 0,
+        'not_found': 0,
+    }
+
+    for i, entry in enumerate(entries):
+        doc_hash = entry['hash']
+        current_path = entry['current_path']
+        target_path = entry.get('sanitized_path', entry.get('proposed_path'))
+        san_name = entry.get('sanitized_filename', entry.get('filename'))
+
+        if not target_path or not san_name:
+            stats['skipped'] += 1
+            continue
+
+        # Skip ambiguous entries
+        if entry.get('ambiguous'):
+            stats['skipped'] += 1
+            continue
+
+        # Already at target?
+        if os.path.abspath(current_path) == os.path.abspath(target_path):
+            stats['already_organized'] += 1
+            if not dry_run:
+                db.mark_organized(doc_hash)
+            continue
+
+        if dry_run:
+            stats['moved'] += 1
+            continue
+
+        # Verify source exists
+        if not os.path.exists(current_path):
+            stats['not_found'] += 1
+            logger.warning("Manifest: file not found: %s [%s]", current_path, doc_hash[:8])
+            continue
+
+        try:
+            target_dir = os.path.dirname(target_path)
+            os.makedirs(target_dir, exist_ok=True)
+
+            # Check for collision at target (different file already there)
+            if os.path.exists(target_path):
+                stem, ext = os.path.splitext(san_name)
+                h6 = doc_hash[:6]
+                san_name = '{} [{}]{}'.format(stem, h6, ext)
+                target_path = os.path.join(target_dir, san_name)
+
+            shutil.move(current_path, target_path)
+
+            # Update catalogue + mark organized
+            db.update_catalogue_path(doc_hash, target_path, san_name)
+            db.mark_organized(doc_hash)
+            stats['moved'] += 1
+
+        except Exception as e:
+            stats['errors'] += 1
+            logger.error("Manifest: failed to move %s: %s", doc_hash[:8], e)
+
+        # Progress reporting
+        if (i + 1) % 1000 == 0:
+            logger.info("Manifest progress: %d / %d (moved=%d, errors=%d)",
+                        i + 1, stats['total'], stats['moved'], stats['errors'])
+
+    return stats
--- a/lib/peertube_collector.py
+++ b/lib/peertube_collector.py
@ -0,0 +1,137 @@
+"""
+RECON Metrics Collector
+
+Background daemon thread that snapshots pipeline metrics every 2 minutes
+to the metrics_snapshots SQLite table. Used for time-series charts.
+"""
+import json
+import time
+import threading
+import logging
+
+logger = logging.getLogger('recon.collector')
+
+
+def start_collector(stop_event=None):
+    """Start the metrics collector in a daemon thread."""
+    def _run():
+        from .status import StatusDB
+        from .utils import get_config
+        import requests as req
+
+        interval = 120  # 2 minutes
+        logger.info(f"Metrics collector started (interval: {interval}s)")
+
+        while True:
+            if stop_event and stop_event.is_set():
+                break
+            try:
+                _snapshot(StatusDB(), get_config(), req)
+            except Exception as e:
+                logger.error(f"Metrics snapshot failed: {e}")
+
+            # Wait with stop check
+            if stop_event:
+                stop_event.wait(interval)
+                if stop_event.is_set():
+                    break
+            else:
+                time.sleep(interval)
+
+        logger.info("Metrics collector stopped")
+
+    t = threading.Thread(target=_run, daemon=True, name='metrics-collector')
+    t.start()
+    return t
+
+
+def _snapshot(db, config, req):
+    """Take a single metrics snapshot."""
+    from datetime import datetime, timezone, timedelta
+
+    conn = db._get_conn()
+    ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:00Z')  # Truncate to the minute
+
+    # Knowledge pipeline stats
+    try:
+        totals = conn.execute("""
+            SELECT
+                COUNT(*) as total,
+                SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as complete,
+                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
+                SUM(CASE WHEN status NOT IN ('complete', 'failed') THEN 1 ELSE 0 END) as in_pipeline,
+                SUM(COALESCE(concepts_extracted, 0)) as concepts,
+                SUM(COALESCE(vectors_inserted, 0)) as vectors
+            FROM documents
+        """).fetchone()
+
+        knowledge_data = {
+            'total': totals['total'],
+            'complete': totals['complete'],
+            'failed': totals['failed'],
+            'in_pipeline': totals['in_pipeline'],
+            'concepts': totals['concepts'],
+            'vectors': totals['vectors'],
+        }
+
+        conn.execute(
+            "INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
+            (ts, 'knowledge', json.dumps(knowledge_data))
+        )
+        conn.commit()
+    except Exception as e:
+        logger.debug(f"Knowledge snapshot failed: {e}")
+
+    # PeerTube pipeline stats (via SSH)
+    try:
+        import subprocess
+        result = subprocess.run(
+            ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=5',
+             'zvx@192.168.1.170',
+             'sudo -u peertube psql peertube_prod -t -A -c "SELECT state, COUNT(*) FROM video GROUP BY state;" 2>/dev/null; '
+             'echo "---"; '
+             'for d in staging completed transcoded failed; do '
+             '  dir="/opt/bulk-import/$d"; '
+             '  files=$(find -L "$dir" -type f 2>/dev/null | wc -l); '
+             '  echo "$d|$files"; '
+             'done'],
+            capture_output=True, text=True, timeout=20
+        )
+        if result.returncode == 0 or result.stdout.strip():
+            sections = result.stdout.split('---')
+            video_states = {}
+            if len(sections) > 0:
+                for line in sections[0].strip().split('\n'):
+                    if '|' in line:
+                        parts = line.split('|')
+                        if len(parts) == 2 and parts[1].isdigit():
+                            video_states[parts[0]] = int(parts[1])
+            pipeline_files = {}
+            if len(sections) > 1:
+                for line in sections[1].strip().split('\n'):
+                    if '|' in line:
+                        parts = line.split('|')
+                        if len(parts) == 2:
+                            pipeline_files[parts[0]] = int(parts[1]) if parts[1].isdigit() else 0
+
+            pt_data = {
+                'video_states': video_states,
+                'pipeline_files': pipeline_files,
+                'published': video_states.get('1', 0),
+                'backlog': sum(pipeline_files.values()),
+            }
+            conn.execute(
+                "INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
+                (ts, 'peertube', json.dumps(pt_data))
+            )
+            conn.commit()
+    except Exception as e:
+        logger.debug(f"PeerTube snapshot failed: {e}")
+
+    # Prune old snapshots (> 7 days); cutoff uses the same format as stored timestamps
+    try:
+        cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime('%Y-%m-%dT%H:%M:00Z')
+        conn.execute("DELETE FROM metrics_snapshots WHERE timestamp < ?", (cutoff,))
+        conn.commit()
+    except Exception:
+        pass
--- a/lib/peertube_scraper.py
+++ b/lib/peertube_scraper.py
@ -0,0 +1,580 @@
+"""
+RECON PeerTube Scraper — Video transcript ingestion.
+
+Fetches WebVTT captions from a PeerTube instance, converts to plain text,
+chunks into pages, and feeds into the standard RECON enrichment pipeline.
+
+Output format matches lib/web_scraper.py so the enricher and embedder
+process transcript content identically to web content.
+"""
+
+import hashlib
+import io
+import json
+import os
+import bisect
+import re
+import time
+from datetime import datetime, timezone
+from urllib.parse import quote
+
+import requests
+import webvtt
+
+from .utils import get_config, setup_logging
+from .status import StatusDB
+from .web_scraper import chunk_text
+
+logger = setup_logging('recon.peertube_scraper')
+
+# Module-level stop flag — set by service thread for graceful shutdown
+_stop_check = None
+
+def set_stop_check(fn):
+    """Register a callable that returns True when shutdown is requested."""
+    global _stop_check
+    _stop_check = fn
+
+# Defaults (overridden by config.yaml peertube section)
+DEFAULT_API_BASE = 'http://192.168.1.170'
+DEFAULT_PUBLIC_URL = 'https://stream.echo6.co'
+DEFAULT_FETCH_TIMEOUT = 30
+DEFAULT_RATE_LIMIT_DELAY = 0.5
+
+
+def _get_pt_config(config=None):
+    """Get PeerTube settings from config, with defaults."""
+    if config is None:
+        config = get_config()
+    pt = config.get('peertube', {})
+    return {
+        'api_base': pt.get('api_base', DEFAULT_API_BASE),
+        'public_url': pt.get('public_url', DEFAULT_PUBLIC_URL),
+        'fetch_timeout': pt.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
+        'rate_limit_delay': pt.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
+    }
+
+
+def _api_get(path, config=None, params=None):
+    """Make a GET request to the PeerTube API."""
+    ptc = _get_pt_config(config)
+    url = f"{ptc['api_base']}{path}"
+    resp = requests.get(url, params=params, timeout=ptc['fetch_timeout'])
+    resp.raise_for_status()
+    return resp.json()
+
+
+def get_videos(channel=None, since=None, config=None):
+    """
+    Paginate through all published videos on the PeerTube instance.
+
+    Args:
+        channel: Filter to this channel actor_name (e.g., 'mental-outlaw')
+        since: ISO date string — only return videos published after this date
+        config: RECON config dict
+
+    Returns list of video dicts with keys: uuid, name, duration,
+    channel_name, channel_display, publishedAt, description (first 500 chars).
+    """
+    ptc = _get_pt_config(config)
+    videos = []
+    start = 0
+    count = 100  # PeerTube supports up to 100 per page
+
+    while True:
+        if channel:
+            path = f"/api/v1/video-channels/{channel}/videos"
+        else:
+            path = "/api/v1/videos"
+
+        data = _api_get(path, config, params={
+            'count': count,
+            'start': start,
+            'sort': '-publishedAt',
+        })
+
+        total = data.get('total', 0)
+        batch = data.get('data', [])
+
+        if not batch:
+            break
+
+        for v in batch:
+            published = v.get('publishedAt', '')
+
+            # Filter by since date
+            if since and published < since:
+                # Videos are sorted by publishedAt desc, so once we pass
+                # the since threshold, all remaining are older — stop
+                return videos
+
+            videos.append({
+                'uuid': v['uuid'],
+                'name': v['name'],
+                'duration': v.get('duration', 0),
+                'channel_name': v.get('channel', {}).get('name', ''),
+                'channel_display': v.get('channel', {}).get('displayName', ''),
+                'publishedAt': published,
+                'description': (v.get('description') or '')[:500],
+            })
+
+        start += count
+        if start >= total:
+            break
+
+        # Check for shutdown during pagination
+        if _stop_check and _stop_check():
+            logger.info(f"Shutdown requested during video listing — returning {len(videos)} collected so far")
+            return videos
+
+        # Rate limit pagination requests
+        time.sleep(ptc['rate_limit_delay'])
+
+    return videos
+
+
+def get_captions(uuid, config=None):
+    """Get caption list for a video. Returns list of caption dicts."""
+    data = _api_get(f"/api/v1/videos/{uuid}/captions", config)
+    return data.get('data', [])
+
+
+def fetch_vtt(caption_path, config=None):
+    """Fetch raw VTT file content from PeerTube."""
+    ptc = _get_pt_config(config)
+    url = f"{ptc['api_base']}{caption_path}"
+    resp = requests.get(url, timeout=ptc['fetch_timeout'])
+    resp.raise_for_status()
+    return resp.text
+
+
+
+def _parse_vtt_time(time_str):
+    """Parse VTT timestamp string (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
+    parts = time_str.split(':')
+    if len(parts) == 3:
+        h, m, s = parts
+        return int(h) * 3600 + int(m) * 60 + float(s)
+    elif len(parts) == 2:
+        m, s = parts
+        return int(m) * 60 + float(s)
+    return 0.0
+
+
+def vtt_to_text(vtt_content):
+    """
+    Convert WebVTT content to clean plain text with timestamp tracking.
+
+    Strips timestamps, de-duplicates consecutive identical cues (common with
+    Whisper output), removes HTML tags, and joins cues with spaces (not
+    newlines — Whisper cues break mid-sentence).
+
+    Returns (text, cue_timestamps) where:
+    - text: clean prose string
+    - cue_timestamps: list of (start_seconds, char_offset) tuples tracking
+      where each VTT cue begins in the output text
+    """
+    buf = io.StringIO(vtt_content)
+    try:
+        captions = webvtt.read_buffer(buf)
+    except Exception:
+        # Fallback: manual regex parse if webvtt-py fails
+        return _vtt_to_text_fallback(vtt_content)
+
+    prev_text = None
+    segments = []
+    raw_timestamps = []  # (start_seconds, segment_index)
+
+    for caption in captions:
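+        # Strip HTML tags and collapse whitespace per cue, so the char
+        # offsets computed below line up with the joined text
+        text = re.sub(r'<[^>]+>', '', caption.text)
+        text = re.sub(r'\s+', ' ', text).strip()
+        if not text:
+            continue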
+
+        # De-duplicate consecutive identical cues
+        if text == prev_text:
+            continue
+        prev_text = text
+
+        start_seconds = _parse_vtt_time(caption.start)
+        raw_timestamps.append((start_seconds, len(segments)))
+        segments.append(text)
+
+    # Join with spaces — VTT cues break mid-sentence
+    raw = ' '.join(segments)
+
+    # Clean up double spaces and whitespace
+    raw = re.sub(r'\s+', ' ', raw).strip()
+
+    # Compute char offsets for each tracked segment
+    seg_offsets = []
+    pos = 0
+    for i, seg in enumerate(segments):
+        seg_offsets.append(pos)
+        pos += len(seg) + 1  # +1 for space separator
+
+    cue_timestamps = []
+    for start_secs, seg_idx in raw_timestamps:
+        if seg_idx < len(seg_offsets):
+            cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
+
+    return raw, cue_timestamps
+
+
+def _vtt_to_text_fallback(vtt_content):
+    """Regex-based VTT parser as fallback. Returns (text, cue_timestamps)."""
+    lines = vtt_content.split('\n')
+    prev_text = None
+    segments = []
+    raw_timestamps = []
+    last_time = 0.0
+
+    for line in lines:
+        line = line.strip()
+        if not line or line == 'WEBVTT':
+            continue
+        if '-->' in line:
+            # Parse start time from "00:01:23.456 --> 00:01:25.789"
+            time_part = line.split('-->')[0].strip()
+            last_time = _parse_vtt_time(time_part)
+            continue
+        if line.isdigit():
+            continue
+
+        text = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', '', line)).strip()
+        if not text or text == prev_text:
+            continue
+        prev_text = text
+        raw_timestamps.append((last_time, len(segments)))
+        segments.append(text)
+
+    raw = ' '.join(segments)
+    raw = re.sub(r'\s+', ' ', raw).strip()
+
+    # Compute char offsets
+    seg_offsets = []
+    pos = 0
+    for seg in segments:
+        seg_offsets.append(pos)
+        pos += len(seg) + 1
+
+    cue_timestamps = []
+    for start_secs, seg_idx in raw_timestamps:
+        if seg_idx < len(seg_offsets):
+            cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
+
+    return raw, cue_timestamps
+
+
+
+def _map_page_timestamps(pages, full_text, cue_timestamps):
+    """
+    Map page numbers to video timestamps.
+
+    For each page, finds its approximate start position in the full text,
+    then looks up the nearest VTT cue timestamp via binary search.
+
+    Returns dict: {"page_0001": 0.0, "page_0002": 312.5, ...}
+    """
+    if not cue_timestamps:
+        return {}
+
+    offsets = [ct[1] for ct in cue_timestamps]
+    times = [ct[0] for ct in cue_timestamps]
+
+    page_ts = {}
+    search_start = 0
+
+    for i, page_text in enumerate(pages):
+        page_name = f"page_{i+1:04d}"
+
+        # Find where this page starts in the full text
+        snippet = page_text[:200].strip()
+        pos = full_text.find(snippet, search_start)
+        if pos < 0:
+            pos = search_start  # fallback
+
+        # Binary search for nearest cue at or before this position
+        idx = bisect.bisect_right(offsets, pos) - 1
+        if idx < 0:
+            idx = 0
+
+        page_ts[page_name] = round(times[idx], 1)
+        search_start = pos + len(snippet)
+
+    return page_ts
+
+def _content_hash(text):
+    """MD5 hash of text content — same as web_scraper."""
+    return hashlib.md5(text.encode('utf-8')).hexdigest()
+
+
+def ingest_video(uuid, video_meta, config=None):
+    """
+    Ingest a single PeerTube video transcript.
+
+    Fetches captions, converts VTT to text, chunks into pages,
+    saves to data/text/{hash}/, and sets status to 'extracted'.
+
+    Args:
+        uuid: Video UUID
+        video_meta: Dict with name, duration, channel_name, channel_display,
+                    publishedAt, description
+        config: RECON config dict
+
+    Returns a result dict (status 'extracted' or 'duplicate'), or None if the video has no captions or the transcript is under 50 chars.
+    """
+    if config is None:
+        config = get_config()
+    ptc = _get_pt_config(config)
+    db = StatusDB()
+
+    # Get captions
+    captions = get_captions(uuid, config)
+    if not captions:
+        return None
+
+    # Prefer English caption
+    caption = None
+    for c in captions:
+        if c.get('language', {}).get('id') == 'en':
+            caption = c
+            break
+    if caption is None:
+        caption = captions[0]
+
+    # Fetch VTT
+    vtt_content = fetch_vtt(caption['captionPath'], config)
+
+    # Convert to plain text with timestamp tracking
+    text, cue_timestamps = vtt_to_text(vtt_content)
+    if not text or len(text) < 50:
+        logger.warning(f"Transcript too short for {video_meta['name']} ({uuid}): {len(text)} chars")
+        return None
+
+    # Hash the text content
+    doc_hash = _content_hash(text)
+
+    # Check for duplicate
+    conn = db._get_conn()
+    existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
+    if existing:
+        doc = db.get_document(doc_hash)
+        existing_status = doc['status'] if doc else existing['status']
+        logger.debug(f"Duplicate transcript (hash {doc_hash[:12]}...) — {video_meta['name']}")
+        return {
+            'hash': doc_hash,
+            'status': 'duplicate',
+            'title': video_meta['name'],
+            'existing_status': existing_status,
+        }
+
+    # Chunk into pages
+    words_per_page = config.get('web_scraper', {}).get('words_per_page', 2000)
+    pages = chunk_text(text, words_per_page)
+
+    # Compute page-to-timestamp mapping
+    page_timestamps = _map_page_timestamps(pages, text, cue_timestamps)
+
+    # Save text files
+    text_dir = os.path.join(config['paths']['text'], doc_hash)
+    os.makedirs(text_dir, exist_ok=True)
+
+    for i, page_text in enumerate(pages, 1):
+        page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
+        with open(page_file, 'w', encoding='utf-8') as f:
+            f.write(page_text)
+
+    # Save meta.json
+    video_url = f"{ptc['public_url']}/w/{uuid}"
+    meta = {
+        'hash': doc_hash,
+        'source_type': 'transcript',
+        'url': video_url,
+        'title': video_meta['name'],
+        'author': video_meta.get('channel_display', ''),
+        'channel': video_meta.get('channel_name', ''),
+        'duration': video_meta.get('duration', 0),
+        'date': video_meta.get('publishedAt', ''),
+        'description': video_meta.get('description', ''),
+        'sitename': 'stream.echo6.co',
+        'page_count': len(pages),
+        'text_length': len(text),
+        'page_timestamps': page_timestamps,
+        'fetched_at': datetime.now(timezone.utc).isoformat(),
+    }
+    with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
+        json.dump(meta, f, indent=2)
+
+    # Display filename for catalogue
+    display_name = re.sub(r'[^\w\s._-]', '', video_meta['name'])[:200].strip()
+    if not display_name:
+        display_name = uuid
+
+    # Add to catalogue
+    db.add_to_catalogue(
+        doc_hash, display_name, video_url,
+        len(text), 'stream.echo6.co', video_meta.get('channel_name', 'unknown')
+    )
+
+    # Queue + advance to extracted
+    db.queue_document(doc_hash)
+    db.update_status(doc_hash, 'extracted',
+                     page_count=len(pages),
+                     pages_extracted=len(pages),
+                     book_title=video_meta['name'],
+                     book_author=video_meta.get('channel_display', ''))
+
+    logger.info(
+        f"Ingested transcript: {video_meta['name']} ({uuid[:8]}...) "
+        f"-> {doc_hash[:12]}... ({len(pages)} pages, {len(text)} chars)"
+    )
+
+    return {
+        'hash': doc_hash,
+        'status': 'extracted',
+        'title': video_meta['name'],
+        'page_count': len(pages),
+        'text_length': len(text),
+        'page_timestamps': page_timestamps,
+        'channel': video_meta.get('channel_name', ''),
+        'duration': video_meta.get('duration', 0),
+        'url': video_url,
+    }
+
+
+def ingest_channel(channel_name, config=None, since=None):
+    """
+    Ingest all captioned videos from a specific channel.
+
+    Returns summary dict.
+    """
+    if config is None:
+        config = get_config()
+    ptc = _get_pt_config(config)
+
+    logger.info(f"Ingesting channel: {channel_name}")
+    videos = get_videos(channel=channel_name, since=since, config=config)
+    return _ingest_video_list(videos, config, ptc)
+
+
+def ingest_all(config=None, since=None):
+    """
+    Ingest all captioned videos from the entire PeerTube instance.
+
+    Returns summary dict.
+    """
+    if config is None:
+        config = get_config()
+    ptc = _get_pt_config(config)
+
+    logger.info("Ingesting all PeerTube videos with captions")
+    videos = get_videos(since=since, config=config)
+    return _ingest_video_list(videos, config, ptc)
+
+
+def _ingest_video_list(videos, config, ptc):
+    """Process a list of videos — shared logic for ingest_channel and ingest_all."""
+    results = []
+    skipped_no_captions = 0
+    skipped_duplicate = 0
+    failed = 0
+    ingested = 0
+    total_pages = 0
+
+    total = len(videos)
+    logger.info(f"Found {total} videos to check for captions")
+
+    for i, video in enumerate(videos, 1):
+        if _stop_check and _stop_check():
+            logger.info(f"Shutdown requested — stopping after {i-1}/{total} videos")
+            break
+        uuid = video['uuid']
+
+        try:
+            result = ingest_video(uuid, video, config)
+
+            if result is None:
+                skipped_no_captions += 1
+            elif result['status'] == 'duplicate':
+                skipped_duplicate += 1
+            else:
+                ingested += 1
+                total_pages += result.get('page_count', 0)
+                results.append(result)
+
+        except Exception as e:
+            logger.error(f"[{i}/{total}] Failed: {video['name']} ({uuid}) — {e}")
+            failed += 1
+
+        # Check for shutdown
+        if _stop_check and _stop_check():
+            logger.info(f"Shutdown requested — stopping after {i}/{total} videos")
+            break
+
+        # Rate limit
+        if i < total:
+            time.sleep(ptc['rate_limit_delay'])
+
+        # Progress logging every 50 videos
+        if i % 50 == 0:
+            logger.info(
+                f"Progress: {i}/{total} checked — "
+                f"{ingested} ingested, {skipped_no_captions} no captions, "
+                f"{skipped_duplicate} dupes, {failed} failed"
+            )
+
+    logger.info(
+        f"PeerTube ingestion complete: {ingested} ingested ({total_pages} pages), "
+        f"{skipped_no_captions} no captions, {skipped_duplicate} duplicates, "
+        f"{failed} failed out of {total} videos"
+    )
+
+    return {
+        'results': results,
+        'summary': {
+            'total_checked': total,
+            'ingested': ingested,
+            'skipped_no_captions': skipped_no_captions,
+            'skipped_duplicate': skipped_duplicate,
+            'failed': failed,
+            'total_pages': total_pages,
+        }
+    }
+
+
+def get_instance_stats(config=None):
+    """Get PeerTube instance statistics for the dashboard."""
+    if config is None:
+        config = get_config()
+    db = StatusDB()
+
+    # Total videos on instance
+    try:
+        data = _api_get("/api/v1/videos", config, params={'count': 1})
+        total_videos = data.get('total', 0)
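+    except Exception as e:
+        # Fall back to 0 so the dashboard still renders, but record why
+        logger.warning(f"Could not fetch PeerTube video count: {e}")
+        total_videos = 0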
+
+    # Videos ingested into RECON (from catalogue)
+    conn = db._get_conn()
+    ingested = conn.execute(
+        "SELECT count(*) FROM catalogue WHERE source = 'stream.echo6.co'"
+    ).fetchone()[0]
+
+    # Status breakdown
+    status_rows = conn.execute(
+        "SELECT d.status, count(*) as cnt FROM documents d "
+        "JOIN catalogue c ON d.hash = c.hash "
+        "WHERE c.source = 'stream.echo6.co' "
+        "GROUP BY d.status"
+    ).fetchall()
+    status_breakdown = {row['status']: row['cnt'] for row in status_rows}
+
+    return {
+        'total_videos': total_videos,
+        'ingested': ingested,
+        'status_breakdown': status_breakdown,
+    }
--- a/lib/status.py
+++ b/lib/status.py
@ -0,0 +1,508 @@
+"""
+RECON Status Tracker
+
+SQLite operations for catalogue and documents tables. WAL mode, thread-local connections.
+Status flow: catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete.
+
+Config: paths.db
+"""
+import os
+import sqlite3
+import threading
+from datetime import datetime, timezone
+
+from .utils import get_config
+
+_local = threading.local()
+
+
+class StatusDB:
+    def __init__(self, db_path=None):
+        if db_path is None:
+            db_path = get_config()['paths']['db']
+        self.db_path = db_path
+        os.makedirs(os.path.dirname(db_path), exist_ok=True)
+        self._init_db()
+
+    def _get_conn(self):
+        if not hasattr(_local, 'conn') or _local.conn is None:
+            _local.conn = sqlite3.connect(self.db_path, timeout=30)
+            _local.conn.row_factory = sqlite3.Row
+            _local.conn.execute("PRAGMA journal_mode=WAL")
+            # Note: this PRAGMA overrides the 30 s busy timeout set by connect(timeout=30)
+            _local.conn.execute("PRAGMA busy_timeout=5000")
+        return _local.conn
+
+    def _init_db(self):
+        conn = self._get_conn()
+        conn.executescript("""
+            CREATE TABLE IF NOT EXISTS catalogue (
+                hash TEXT PRIMARY KEY,
+                filename TEXT NOT NULL,
+                path TEXT NOT NULL,
+                size_bytes INTEGER,
+                source TEXT,
+                category TEXT,
+                status TEXT DEFAULT 'catalogued',
+                discovered_at TEXT DEFAULT CURRENT_TIMESTAMP
+            );
+
+            CREATE TABLE IF NOT EXISTS documents (
+                hash TEXT PRIMARY KEY,
+                filename TEXT NOT NULL,
+                path TEXT,
+                size_bytes INTEGER,
+                page_count INTEGER,
+                book_title TEXT,
+                book_author TEXT,
+                collection TEXT DEFAULT 'survival',
+                status TEXT DEFAULT 'pending',
+                pages_extracted INTEGER DEFAULT 0,
+                concepts_extracted INTEGER DEFAULT 0,
+                vectors_inserted INTEGER DEFAULT 0,
+                discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
+                extracted_at TEXT,
+                enriched_at TEXT,
+                embedded_at TEXT,
+                error_message TEXT,
+                retry_count INTEGER DEFAULT 0
+            );
+
+            CREATE TABLE IF NOT EXISTS intel (
+                id INTEGER PRIMARY KEY AUTOINCREMENT,
+                source TEXT,
+                timestamp TEXT,
+                region TEXT,
+                category TEXT,
+                content TEXT,
+                summary TEXT,
+                key_facts TEXT,
+                credibility_score REAL,
+                verification_status TEXT,
+                vector_id INTEGER,
+                ingested_at TEXT DEFAULT CURRENT_TIMESTAMP
+            );
+
+            CREATE TABLE IF NOT EXISTS metrics_snapshots (
+                id INTEGER PRIMARY KEY AUTOINCREMENT,
+                timestamp TEXT NOT NULL,
+                metric_type TEXT NOT NULL,
+                data TEXT NOT NULL,
+                UNIQUE(timestamp, metric_type)
+            );
+
+            CREATE INDEX IF NOT EXISTS idx_catalogue_status ON catalogue(status);
+            CREATE INDEX IF NOT EXISTS idx_catalogue_source ON catalogue(source);
+            CREATE INDEX IF NOT EXISTS idx_documents_status ON documents(status);
+        """)
+        # Migration: add path_updated_at column if missing
+        try:
+            conn.execute("ALTER TABLE catalogue ADD COLUMN path_updated_at TEXT")
+        except sqlite3.OperationalError:
+            pass  # column already exists
+        # Migration: add organized_at column to documents if missing
+        try:
+            conn.execute("ALTER TABLE documents ADD COLUMN organized_at TEXT")
+        except sqlite3.OperationalError:
+            pass  # column already exists
+
+        # Stream B: file_operations + duplicate_review tables
+        conn.executescript("""
+            CREATE TABLE IF NOT EXISTS file_operations (
+                id INTEGER PRIMARY KEY AUTOINCREMENT,
+                doc_hash TEXT NOT NULL,
+                operation TEXT NOT NULL,
+                source_path TEXT NOT NULL,
+                target_path TEXT NOT NULL,
+                source_filename TEXT NOT NULL,
+                target_filename TEXT NOT NULL,
+                original_filename TEXT,
+                collision_step INTEGER,
+                qdrant_points_updated INTEGER DEFAULT 0,
+                performed_at TEXT DEFAULT CURRENT_TIMESTAMP,
+                reversed_at TEXT,
+                notes TEXT
+            );
+            CREATE INDEX IF NOT EXISTS idx_fileops_hash ON file_operations(doc_hash);
+
+            CREATE TABLE IF NOT EXISTS duplicate_review (
+                id INTEGER PRIMARY KEY AUTOINCREMENT,
+                doc_hash TEXT NOT NULL,
+                original_filename TEXT NOT NULL,
+                sanitized_filename TEXT NOT NULL,
+                collision_with_hash TEXT,
+                collision_path TEXT,
+                duplicate_path TEXT NOT NULL,
+                domain TEXT,
+                subdomain TEXT,
+                book_author TEXT,
+                book_title TEXT,
+                status TEXT DEFAULT 'pending',
+                resolution TEXT,
+                discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
+                resolved_at TEXT
+            );
+            CREATE INDEX IF NOT EXISTS idx_dupreview_status ON duplicate_review(status);
+        """)
+        conn.commit()
+
+    def add_to_catalogue(self, file_hash, filename, path, size_bytes, source, category):
+        conn = self._get_conn()
+        conn.execute(
+            """INSERT INTO catalogue (hash, filename, path, size_bytes, source, category)
+               VALUES (?, ?, ?, ?, ?, ?)
+               ON CONFLICT(hash) DO UPDATE SET
+                   path = excluded.path,
+                   filename = excluded.filename,
+                   source = excluded.source,
+                   category = excluded.category,
+                   path_updated_at = CASE
+                       WHEN catalogue.path != excluded.path THEN CURRENT_TIMESTAMP
+                       ELSE catalogue.path_updated_at
+                   END""",
+            (file_hash, filename, path, size_bytes, source, category)
+        )
+        conn.commit()
+
+    def queue_document(self, file_hash):
+        conn = self._get_conn()
+        row = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (file_hash,)).fetchone()
+        if not row:
+            return False
+        conn.execute("UPDATE catalogue SET status = 'queued' WHERE hash = ?", (file_hash,))
+        conn.execute(
+            """INSERT INTO documents (hash, filename, path, size_bytes, status)
+               VALUES (?, ?, ?, ?, 'queued')
+               ON CONFLICT(hash) DO UPDATE SET
+                   path = excluded.path,
+                   filename = excluded.filename""",
+            (row['hash'], row['filename'], row['path'], row['size_bytes'])
+        )
+        conn.commit()
+        return True
+
+    def update_status(self, file_hash, status, **kwargs):
+        conn = self._get_conn()
+        sets = ["status = ?"]
+        vals = [status]
+
+        ts_field = {
+            'extracted': 'extracted_at',
+            'enriched': 'enriched_at',
+            'complete': 'embedded_at',
+        }.get(status)
+        if ts_field:
+            sets.append(f"{ts_field} = ?")
+            vals.append(datetime.now(timezone.utc).isoformat())
+
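+        for k, v in kwargs.items():
+            # Column names are interpolated into the SQL text, so accept identifiers only
+            if not k.isidentifier():
+                raise ValueError(f"Invalid column name: {k!r}")
+            sets.append(f"{k} = ?")
+            vals.append(v)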
+
+        vals.append(file_hash)
+        conn.execute(f"UPDATE documents SET {', '.join(sets)} WHERE hash = ?", vals)
+        conn.commit()
+
+    def get_by_status(self, status, limit=None):
+        conn = self._get_conn()
+        q = "SELECT * FROM documents WHERE status = ? ORDER BY discovered_at"
+        if limit:
+            q += f" LIMIT {int(limit)}"
+        return [dict(r) for r in conn.execute(q, (status,)).fetchall()]
+
+    def get_catalogued(self, source=None, category=None, limit=None):
+        conn = self._get_conn()
+        q = "SELECT * FROM catalogue WHERE status = 'catalogued'"
+        params = []
+        if source:
+            q += " AND source = ?"
+            params.append(source)
+        if category:
+            q += " AND category = ?"
+            params.append(category)
+        q += " ORDER BY discovered_at"
+        if limit:
+            q += f" LIMIT {int(limit)}"
+        return [dict(r) for r in conn.execute(q, params).fetchall()]
+
+    def get_document(self, file_hash):
+        conn = self._get_conn()
+        row = conn.execute("SELECT * FROM documents WHERE hash = ?", (file_hash,)).fetchone()
+        return dict(row) if row else None
+
+    def get_status_counts(self):
+        conn = self._get_conn()
+        cat_counts = {}
+        for row in conn.execute("SELECT status, COUNT(*) as cnt FROM catalogue GROUP BY status"):
+            cat_counts[row['status']] = row['cnt']
+
+        doc_counts = {}
+        for row in conn.execute("SELECT status, COUNT(*) as cnt FROM documents GROUP BY status"):
+            doc_counts[row['status']] = row['cnt']
+
+        return {'catalogue': cat_counts, 'documents': doc_counts}
+
+    def get_failures(self):
+        conn = self._get_conn()
+        return [dict(r) for r in conn.execute(
+            "SELECT * FROM documents WHERE status = 'failed' ORDER BY discovered_at"
+        ).fetchall()]
+
+    def mark_failed(self, file_hash, error_msg):
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE documents SET status = 'failed', error_message = ? WHERE hash = ?",
+            (str(error_msg)[:1000], file_hash)
+        )
+        conn.commit()
+
+    def increment_retry(self, file_hash):
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE documents SET retry_count = retry_count + 1, status = 'queued', error_message = NULL WHERE hash = ?",
+            (file_hash,)
+        )
+        conn.commit()
+
+    def get_sources(self):
+        conn = self._get_conn()
+        return [r[0] for r in conn.execute(
+            "SELECT DISTINCT source FROM catalogue ORDER BY source"
+        ).fetchall()]
+
+    def get_categories(self, source=None):
+        conn = self._get_conn()
+        if source:
+            return [r[0] for r in conn.execute(
+                "SELECT DISTINCT category FROM catalogue WHERE source = ? ORDER BY category", (source,)
+            ).fetchall()]
+        return [r[0] for r in conn.execute(
+            "SELECT DISTINCT category FROM catalogue ORDER BY category"
+        ).fetchall()]
+
+    def get_all_documents(self, status=None, source=None, category=None, limit=None, offset=None):
+        conn = self._get_conn()
+        q = """SELECT d.*, c.source, c.category FROM documents d
+               LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
+        params = []
+        if status:
+            q += " AND d.status = ?"
+            params.append(status)
+        if source:
+            q += " AND c.source = ?"
+            params.append(source)
+        if category:
+            q += " AND c.category = ?"
+            params.append(category)
+        q += " ORDER BY d.discovered_at DESC"
+        if limit:
+            q += f" LIMIT {int(limit)}"
+        if offset:
+            q += f" OFFSET {int(offset)}"
+        return [dict(r) for r in conn.execute(q, params).fetchall()]
+
+    def count_documents(self, source=None, category=None):
+        """Count documents matching optional source/category filters."""
+        conn = self._get_conn()
+        q = """SELECT COUNT(*) FROM documents d
+               LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
+        params = []
+        if source:
+            q += " AND c.source = ?"
+            params.append(source)
+        if category:
+            q += " AND c.category = ?"
+            params.append(category)
+        return conn.execute(q, params).fetchone()[0]
+
+    def catalogue_count(self):
+        conn = self._get_conn()
+        return conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
+
+    def source_breakdown(self):
+        conn = self._get_conn()
+        return [dict(r) for r in conn.execute(
+            "SELECT source, COUNT(*) as count, SUM(size_bytes) as total_bytes FROM catalogue GROUP BY source ORDER BY count DESC"
+        ).fetchall()]
+
+    def category_breakdown(self, source=None):
+        conn = self._get_conn()
+        if source:
+            return [dict(r) for r in conn.execute(
+                "SELECT category, COUNT(*) as count FROM catalogue WHERE source = ? GROUP BY category ORDER BY count DESC",
+                (source,)
+            ).fetchall()]
+        return [dict(r) for r in conn.execute(
+            "SELECT source, category, COUNT(*) as count FROM catalogue GROUP BY source, category ORDER BY source, count DESC"
+        ).fetchall()]
+
+    def get_path_updates(self):
+        """Get catalogue entries where path was updated since last sync."""
+        conn = self._get_conn()
+        return [dict(r) for r in conn.execute(
+            "SELECT hash, filename, path, source, category FROM catalogue "
+            "WHERE path_updated_at IS NOT NULL"
+        ).fetchall()]
+
+    def clear_path_update(self, file_hash):
+        """Clear path_updated_at flag after Qdrant sync."""
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE catalogue SET path_updated_at = NULL WHERE hash = ?",
+            (file_hash,)
+        )
+        conn.commit()
+
+    def sync_document_path(self, file_hash, path, filename):
+        """Update path and filename in documents table."""
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE documents SET path = ?, filename = ? WHERE hash = ?",
+            (path, filename, file_hash)
+        )
+        conn.commit()
+
+    def status_breakdown(self):
+        conn = self._get_conn()
+        rows = conn.execute(
+            "SELECT status, COUNT(*) as count FROM catalogue GROUP BY status ORDER BY count DESC"
+        ).fetchall()
+        return [dict(r) for r in rows]
+
+    def get_unorganized(self, limit=None):
+        """Get completed documents that haven't been organized yet."""
+        conn = self._get_conn()
+        q = "SELECT hash, filename, path FROM documents WHERE status = 'complete' AND organized_at IS NULL ORDER BY embedded_at"
+        if limit:
+            q += " LIMIT {}".format(int(limit))
+        return [dict(r) for r in conn.execute(q).fetchall()]
+
+    def get_ingest_pending(self, ingest_dir, limit=50):
+        """Get completed docs in _ingest/ that haven't been organized."""
+        conn = self._get_conn()
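+        # Escape LIKE wildcards so '_' and '%' in the directory name match literally
+        escaped = ingest_dir.replace('\\', '\\\\').replace('%', '\\%').replace('_', '\\_')
+        pattern = escaped + '%'
+        return [dict(r) for r in conn.execute(
+            "SELECT hash, filename, path FROM documents "
+            "WHERE status = 'complete' AND organized_at IS NULL "
+            "AND path LIKE ? ESCAPE '\\' "
+            "ORDER BY embedded_at LIMIT ?",
+            (pattern, limit)
+        ).fetchall()]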
+
+    def mark_organized(self, file_hash):
+        """Mark a document as organized (sets organized_at timestamp)."""
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
+            (file_hash,)
+        )
+        conn.commit()
+
+    def update_catalogue_path(self, file_hash, new_path, new_filename):
+        """Update catalogue path/filename and flag for Qdrant sync."""
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE catalogue SET path = ?, filename = ?, path_updated_at = CURRENT_TIMESTAMP WHERE hash = ?",
+            (new_path, new_filename, file_hash)
+        )
+        conn.commit()
+
+    # ── Stream B: File Operations ───────────────────────────────────
+
+    def log_file_operation(self, doc_hash, operation, source_path, target_path,
+                           source_filename, target_filename, original_filename=None,
+                           collision_step=None, qdrant_points_updated=0, notes=None):
+        """Log a file move/rename operation for audit trail and rollback."""
+        conn = self._get_conn()
+        cur = conn.execute(
+            """INSERT INTO file_operations
+               (doc_hash, operation, source_path, target_path,
+                source_filename, target_filename, original_filename,
+                collision_step, qdrant_points_updated, notes)
+               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+            (doc_hash, operation, source_path, target_path,
+             source_filename, target_filename, original_filename,
+             collision_step, qdrant_points_updated, notes)
+        )
+        conn.commit()
+        return cur.lastrowid
+
+    def get_file_operations(self, doc_hash=None, limit=50):
+        """Get file operations, optionally filtered by doc_hash."""
+        conn = self._get_conn()
+        if doc_hash:
+            return [dict(r) for r in conn.execute(
+                "SELECT * FROM file_operations WHERE doc_hash = ? ORDER BY performed_at DESC LIMIT ?",
+                (doc_hash, limit)
+            ).fetchall()]
+        return [dict(r) for r in conn.execute(
+            "SELECT * FROM file_operations WHERE reversed_at IS NULL ORDER BY performed_at DESC LIMIT ?",
+            (limit,)
+        ).fetchall()]
+
+    def get_file_operation(self, op_id):
+        """Get a single file operation by ID."""
+        conn = self._get_conn()
+        row = conn.execute("SELECT * FROM file_operations WHERE id = ?", (op_id,)).fetchone()
+        return dict(row) if row else None
+
+    def mark_operation_reversed(self, op_id):
+        """Mark a file operation as reversed."""
+        conn = self._get_conn()
+        conn.execute(
+            "UPDATE file_operations SET reversed_at = CURRENT_TIMESTAMP WHERE id = ?",
+            (op_id,)
+        )
+        conn.commit()
+
+    def queue_duplicate_review(self, doc_hash, original_filename, sanitized_filename,
+                                collision_with_hash=None, collision_path=None,
+                                duplicate_path='', domain=None, subdomain=None,
+                                book_author=None, book_title=None):
+        """Queue a file for human duplicate review."""
+        conn = self._get_conn()
+        conn.execute(
+            """INSERT INTO duplicate_review
+               (doc_hash, original_filename, sanitized_filename,
+                collision_with_hash, collision_path, duplicate_path,
+                domain, subdomain, book_author, book_title)
+               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+            (doc_hash, original_filename, sanitized_filename,
+             collision_with_hash, collision_path, duplicate_path,
+             domain, subdomain, book_author, book_title)
+        )
+        conn.commit()
+
+    def get_duplicate_reviews(self, status='pending', limit=50):
+        """Get duplicate review queue."""
+        conn = self._get_conn()
+        return [dict(r) for r in conn.execute(
+            "SELECT * FROM duplicate_review WHERE status = ? ORDER BY discovered_at DESC LIMIT ?",
+            (status, limit)
+        ).fetchall()]
+
+    def get_pipeline_stats(self):
+        """Get Stream B pipeline statistics."""
+        conn = self._get_conn()
+        ops = conn.execute(
+            "SELECT operation, COUNT(*) as cnt FROM file_operations WHERE reversed_at IS NULL GROUP BY operation"
+        ).fetchall()
+        dupes = conn.execute(
+            "SELECT status, COUNT(*) as cnt FROM duplicate_review GROUP BY status"
+        ).fetchall()
+        acquired = 0
+        ingest = 0
+        try:
+            acquired_dir = get_config().get('new_pipeline', {}).get('acquired_dir', '')
+            ingest_dir = get_config().get('new_pipeline', {}).get('ingest_dir', '')
+            if acquired_dir and os.path.isdir(acquired_dir):
+                acquired = len([f for f in os.listdir(acquired_dir) if f.lower().endswith('.pdf')])
+            if ingest_dir and os.path.isdir(ingest_dir):
+                ingest = len([f for f in os.listdir(ingest_dir) if f.lower().endswith('.pdf')])
+        except Exception:
+            pass
+        return {
+            'operations': {r['operation']: r['cnt'] for r in ops},
+            'duplicates': {r['status']: r['cnt'] for r in dupes},
+            'acquired_pending': acquired,
+            'ingest_pending': ingest,
+        }
--- a/lib/utils.py
+++ b/lib/utils.py
@ -0,0 +1,390 @@
+"""
+RECON Utilities
+
+Content hashing (MD5), config loading (YAML), download URL generation,
+source/category derivation, logging setup, filename sanitization.
+
+Config: Loads and caches config.yaml
+"""
+import hashlib
+import logging
+import os
+import re
+import unicodedata
+from urllib.parse import quote
+
+import yaml
+from logging.handlers import RotatingFileHandler
+
+_config = None
+
+
+def get_config():
+    global _config
+    if _config is not None:
+        return _config
+
+    config_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'config.yaml')
+    with open(config_path, encoding='utf-8') as f:
+        _config = yaml.safe_load(f)
+
+    # Load Gemini keys from .env
+    env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
+    _config['gemini_keys'] = []
+    if os.path.exists(env_path):
+        with open(env_path) as f:
+            for line in f:
+                line = line.strip()
+                if line and not line.startswith('#') and '=' in line:
+                    key, val = line.split('=', 1)
+                    if key.startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
+                        _config['gemini_keys'].append(val)
+
+    return _config
+
+
+def content_hash(filepath):
+    h = hashlib.md5()
+    with open(filepath, 'rb') as f:
+        for chunk in iter(lambda: f.read(8192), b''):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def concept_id(doc_hash, page_num, concept_index):
+    raw = f"{doc_hash}:{page_num}:{concept_index}"
+    h = hashlib.md5(raw.encode()).hexdigest()[:15]
+    return int(h, 16)
+
+
+def setup_logging(name='recon'):
+    config = get_config()
+    log_dir = config['paths']['logs']
+    os.makedirs(log_dir, exist_ok=True)
+    os.makedirs(os.path.join(log_dir, 'errors'), exist_ok=True)
+
+    logger = logging.getLogger(name)
+    if logger.handlers:
+        return logger
+    logger.setLevel(logging.DEBUG)
+
+    fmt = logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
+
+    fh = RotatingFileHandler(os.path.join(log_dir, 'recon.log'), maxBytes=10*1024*1024, backupCount=5)
+    fh.setLevel(logging.DEBUG)
+    fh.setFormatter(fmt)
+    logger.addHandler(fh)
+
+    eh = RotatingFileHandler(os.path.join(log_dir, 'errors', 'errors.log'), maxBytes=5*1024*1024, backupCount=3)
+    eh.setLevel(logging.ERROR)
+    eh.setFormatter(fmt)
+    logger.addHandler(eh)
+
+    ch = logging.StreamHandler()
+    ch.setLevel(logging.INFO)
+    ch.setFormatter(fmt)
+    logger.addHandler(ch)
+
+    return logger
+
+
+def derive_source_and_category(filepath, library_root):
+    rel = os.path.relpath(filepath, library_root)
+    parts = rel.split(os.sep)
+    source = parts[0] if parts else 'unknown'
+    category = parts[1] if len(parts) > 2 else source
+    return source, category
+
+
+def clean_filename_to_title(filename):
+    """Convert a PDF filename into a human-readable title."""
+    # Strip extension
+    name = os.path.splitext(filename)[0]
+    # Remove common PDF download suffixes (with or without parens)
+    name = re.sub(r'[\s_]*\(?\s*PDFDrive\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
+    name = re.sub(r'[\s_]*\(?\s*z-lib\.org\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
+    # Handle military manual prefixes: FM_23_10 -> FM 23-10, ATP_3_21 -> ATP 3-21
+    name = re.sub(
+        r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
+        lambda m: f"{m.group(1)} {m.group(2)}-{m.group(3)}",
+        name
+    )
+    # Fix common abbreviations: U_S -> U.S., etc.
+    name = re.sub(r'(?<![A-Za-z])U[_\s]S(?=[_\s]|$)', 'U.S.', name)
+    # Replace underscores and hyphens with spaces (but not in manual numbers like FM 23-10)
+    name = re.sub(r'(?<!\d)[-_](?!\d)', ' ', name)
+    name = name.replace('_', ' ')
+    # Move a bracketed year like [1990] to a trailing "(1990)" suffix
+    year_match = re.search(r'\[(\d{4})\]', name)
+    year_suffix = f" ({year_match.group(1)})" if year_match else ''
+    name = re.sub(r'\s*\[\d{4}\]\s*', ' ', name)
+    # Collapse multiple spaces
+    name = re.sub(r'\s+', ' ', name).strip()
+    # Title-case, but preserve uppercase military abbreviations
+    words = name.split()
+    titled = []
+    for w in words:
+        if w.isupper() and len(w) >= 2:
+            titled.append(w)
+        elif re.match(r'^\d', w):
+            titled.append(w)
+        else:
+            titled.append(w.capitalize() if w.islower() else w)
+    name = ' '.join(titled) + year_suffix
+    name = name.strip()
+    if len(name) < 3:
+        return os.path.splitext(filename)[0]
+    return name
+
+
+# ── Mojibake fix table ──────────────────────────────────────────────
+_MOJIBAKE = {
+    '\u00e2\u0080\u0099': "'",       # â€™ → '  (right single quote)
+    '\u00e2\u0080\u0098': "'",       # â€˜ → '  (left single quote)
+    '\u00e2\u0080\u009c': '"',       # â€œ → "  (left double quote)
+    '\u00e2\u0080\u009d': '"',       # â€ → "   (right double quote)
+    '\u00e2\u0080\u0093': '-',       # â€" → -  (en dash)
+    '\u00e2\u0080\u0094': '-',       # â€" → -  (em dash)
+    '\u00e2\u0080\u00a6': '...',     # â€¦ → ... (ellipsis)
+    '\u00c3\u00a9': 'e',             # Ã© → e   (e-acute)
+    '\u00c3\u00a8': 'e',             # Ã¨ → e   (e-grave)
+    '\u00c3\u00b6': 'o',             # Ã¶ → o   (o-umlaut)
+    '\u00c3\u00bc': 'u',             # Ã¼ → u   (u-umlaut)
+    '\u00c3\u00a4': 'a',             # Ã¤ → a   (a-umlaut)
+    '\u00c3\u00b1': 'n',             # Ã± → n   (n-tilde)
+    '\u00c3\u00ad': 'i',             # Ã → i   (i-acute)
+    '\u00c3\u00a1': 'a',             # Ã¡ → a   (a-acute)
+    '\u00c3\u00ba': 'u',             # Ãº → u   (u-acute)
+    '\u00c3\u00b3': 'o',             # Ã³ → o   (o-acute)
+    '\u00c2\u00ae': '',              # Â® → (registered)
+    '\u00c2\u00a9': '',              # Â© → (copyright)
+    '\u00c2\u00ab': '"',             # Â« → "   (guillemet left)
+    '\u00c2\u00bb': '"',             # Â» → "   (guillemet right)
+}
+
+# Pre-compile: replace longer sequences first to avoid partial matches
+_MOJIBAKE_PATTERN = re.compile(
+    '|'.join(re.escape(k) for k in sorted(_MOJIBAKE.keys(), key=len, reverse=True))
+)
+
+
+def sanitize_filename(filename, doc_hash=None):
+    """Sanitize a PDF filename for cross-platform filesystem safety.
+
+    Six-phase pipeline:
+      1. Strip source-site metadata (Anna's Archive, PDFDrive, z-lib, torrent tags)
+      2. Strip embedded identifiers (ISBN, MD5 hash, z-lib hex suffix)
+      3. Fix character encoding (mojibake, NFKD normalization)
+      4. Normalize structure (military prefixes, period-separated words, underscores)
+      5. Clean characters (Windows-illegal, control chars, collapse whitespace)
+      6. Validate and truncate (120 char max, word-boundary break)
+
+    Args:
+        filename: Original filename (with extension)
+        doc_hash: Optional doc_hash to verify z-lib suffix matches
+
+    Returns:
+        Sanitized filename (with extension preserved)
+    """
+    stem, ext = os.path.splitext(filename)
+    ext = ext.lower()
+    if not ext:
+        ext = '.pdf'
+
+    # ── Phase 1: Strip source-site metadata ─────────────────────────
+    # Anna's Archive pattern: Title -- Authors -- Edition -- ISBN -- Hash -- Source
+    segments = stem.split(' -- ')
+    if len(segments) >= 3:
+        stem = segments[0]
+    elif len(segments) == 2:
+        second = segments[1]
+        if re.search(r'97[89]\d{10}|[0-9a-f]{32}|(?:19|20)\d{2}|[Aa]nna', second):
+            stem = segments[0]
+
+    # PDFDrive tags
+    stem = re.sub(r'\s*\(\s*PDFDrive\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
+    stem = re.sub(r'\s*_PDFDrive_\s*', ' ', stem, flags=re.IGNORECASE)
+
+    # z-lib tags
+    stem = re.sub(r'\s*\(\s*z-lib\.org\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
+    stem = re.sub(r'\s*_z-lib\.org_\s*', ' ', stem, flags=re.IGNORECASE)
+
+    # Torrent tags in curly braces
+    stem = re.sub(r'\s*\{[A-Za-z0-9]+\}\s*', ' ', stem)
+
+    # ── Phase 2: Strip embedded identifiers ─────────────────────────
+    # ISBN-13 (with optional dashes/spaces)
+    stem = re.sub(r'\s*97[89][\s-]?\d[\s-]?\d{2}[\s-]?\d{5,6}[\s-]?\d\s*', ' ', stem)
+    # ISBN-10 with dashes
+    stem = re.sub(r'\s*\d[\s-]\d{2}[\s-]\d{5,6}[\s-][\dXx]\s*', ' ', stem)
+    # MD5 hashes (32 hex chars, standalone)
+    stem = re.sub(r'\s*\b[0-9a-f]{32}\b\s*', ' ', stem)
+    # z-lib 8-char hex suffix like _4d969c3c
+    if doc_hash:
+        # Only strip if it matches the doc_hash prefix
+        match = re.search(r'_([0-9a-f]{8})$', stem)
+        if match and doc_hash.startswith(match.group(1)):
+            stem = stem[:match.start()]
+    else:
+        # Strip any trailing 8-char hex suffix after underscore
+        stem = re.sub(r'_[0-9a-f]{8}$', '', stem)
+
+    # ── Phase 3: Fix character encoding ─────────────────────────────
+    # Fix known mojibake sequences
+    stem = _MOJIBAKE_PATTERN.sub(lambda m: _MOJIBAKE[m.group()], stem)
+
+    # Partial mojibake remnants (lead characters without a known third character)
+    stem = stem.replace('\u00e2\u0080', '-')  # partial em/en dash mojibake
+    stem = stem.replace('H_', 'H. ')  # Anna's Archive initial abbreviation pattern
+
+    # NFKD normalize: decompose accented chars, strip combining marks
+    nfkd = unicodedata.normalize('NFKD', stem)
+    cleaned = []
+    for ch in nfkd:
+        cat = unicodedata.category(ch)
+        if cat.startswith('M'):  # combining mark — skip
+            continue
+        if cat.startswith('C') and ch not in (' ', '\t'):  # control char — skip
+            continue
+        # Keep ASCII as-is; keep Latin-range letters/digits; map or drop everything else
+        cp = ord(ch)
+        if cp < 128:
+            cleaned.append(ch)
+        elif cat.startswith('L') or cat.startswith('N'):
+            # Letter or number outside ASCII — try to keep if Latin-ish
+            if cp < 0x0250:  # Latin Extended range
+                cleaned.append(ch)
+            # else: drop CJK, Cyrillic, etc.
+        elif cat.startswith('P') or cat.startswith('S'):
+            # Punctuation/symbol — map to ASCII equivalent
+            if ch in ('\u2018', '\u2019', '\u201a', '\u0060'):
+                cleaned.append("'")
+            elif ch in ('\u201c', '\u201d', '\u201e'):
+                cleaned.append('"')
+            elif ch in ('\u2013', '\u2014', '\u2012'):
+                cleaned.append('-')
+            elif ch == '\u2026':
+                cleaned.append('...')
+            elif ch in ('\u00ab', '\u00bb'):
+                cleaned.append('"')
+            else:
+                cleaned.append(' ')
+        elif cat.startswith('Z'):
+            cleaned.append(' ')
+    stem = ''.join(cleaned)
+
+    # ── Phase 4: Normalize structure ────────────────────────────────
+    # Detect URL-derived filenames — skip aggressive normalization
+    is_url_derived = bool(re.match(r'[a-z0-9-]+\.[a-z]{2,}[_/]', stem))
+
+    if not is_url_derived:
+        # Military manual prefixes: FM_23_10 -> FM 23-10
+        stem = re.sub(
+            r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
+            lambda m: '{} {}-{}'.format(m.group(1), m.group(2), m.group(3)),
+            stem
+        )
+        # Period-separated words (4+ segments = likely word-separated, not abbreviations like U.S.)
+        if stem.count('.') >= 4:
+            stem = re.sub(r'\.(?=[A-Za-z])', ' ', stem)
+
+    # Underscores to spaces (always)
+    stem = stem.replace('_', ' ')
+
+    # ── Phase 5: Clean characters ───────────────────────────────────
+    # Remove Windows-illegal chars and control chars
+    stem = re.sub(r'[<>:"|?*\\\/]', '', stem)
+    stem = re.sub(r'[\x00-\x1f\x7f]', '', stem)
+
+    # Collapse runs of spaces and hyphens (underscores were converted to spaces above)
+    stem = re.sub(r' {2,}', ' ', stem)
+    stem = re.sub(r'-{2,}', '-', stem)
+
+    # Strip leading/trailing dots, spaces, dashes
+    stem = stem.strip('. -')
+
+    # ── Phase 6: Validate and truncate ──────────────────────────────
+    stem = stem.strip()
+    if not stem or len(stem) < 2:
+        stem = 'untitled'
+
+    max_stem = 120 - len(ext)
+    if len(stem) > max_stem:
+        # Break at word boundary
+        truncated = stem[:max_stem]
+        last_space = truncated.rfind(' ')
+        if last_space > max_stem * 0.6:
+            truncated = truncated[:last_space]
+        stem = truncated.rstrip('. -,')
+
+    return stem + ext
+
+
+def filename_needs_sanitization(filename, doc_hash=None):
+    """Return True if sanitize_filename() would change the filename."""
+    return sanitize_filename(filename, doc_hash) != filename
+
+
+def resolve_collisions(entries):
+    """Resolve filename collisions after sanitization.
+
+    Args:
+        entries: list of dicts, each with 'sanitized_filename', 'proposed_dir', 'hash'
+
+    Returns:
+        Tuple (entries, collision_count). Each entry gets a 'collision' key
+        (True/False) and, on collision, an updated 'sanitized_filename'.
+    """
+    from collections import defaultdict
+
+    # Group by (dir, lowercase filename) to find collisions
+    groups = defaultdict(list)
+    for i, e in enumerate(entries):
+        key = (e['proposed_dir'], e['sanitized_filename'].lower())
+        groups[key].append(i)
+
+    collision_count = 0
+    for key, indices in groups.items():
+        if len(indices) <= 1:
+            for i in indices:
+                entries[i]['collision'] = False
+            continue
+
+        # Collision — add hash suffix to all but the first
+        collision_count += len(indices) - 1
+        entries[indices[0]]['collision'] = False
+
+        for i in indices[1:]:
+            e = entries[i]
+            h6 = e['hash'][:6]
+            stem, ext = os.path.splitext(e['sanitized_filename'])
+            new_name = '{} [{}]{}'.format(stem, h6, ext)
+            # Re-check length
+            if len(new_name) > 120:
+                max_stem = 120 - len(ext) - 9  # 9 = len(' [XXXXXX]')
+                stem = stem[:max_stem].rstrip('. -,')
+                new_name = '{} [{}]{}'.format(stem, h6, ext)
+            e['sanitized_filename'] = new_name
+            e['collision'] = True
+
+    return entries, collision_count
+
+
+def generate_download_url(filepath, library_root='/mnt/library', base_url='https://files.echo6.co'):
+    """Generate a download/source URL from a document path.
+
+    For web URLs (http/https): returns the URL directly -- it's already a link.
+    For file paths: converts to files.echo6.co URL.
+    """
+    if not filepath:
+        return ''
+
+    # Web content -- path IS the source URL
+    if filepath.startswith(('http://', 'https://')):
+        return filepath
+
+    # File content -- convert to files.echo6.co URL
+    rel = os.path.relpath(filepath, library_root)
+    parts = rel.split(os.sep)
+    encoded = '/'.join(quote(p) for p in parts)
+    return f"{base_url}/{encoded}"
--- a/lib/web_scraper.py
+++ b/lib/web_scraper.py
@ -0,0 +1,324 @@
+"""
+RECON Web Scraper — URL-based content ingestion.
+
+Fetches web pages, extracts clean text, chunks into pages,
+and feeds into the standard RECON enrichment pipeline.
+
+Output format matches lib/extractor.py so the enricher
+processes web content identically to PDF content.
+"""
+
+import hashlib
+import json
+import os
+import re
+import time
+from datetime import datetime, timezone
+from urllib.parse import urlparse, unquote
+
+import requests
+import trafilatura
+
+from .utils import get_config, setup_logging
+from .status import StatusDB
+
+logger = setup_logging('recon.web_scraper')
+
+# Defaults (overridden by config.yaml web_scraper section)
+DEFAULT_WORDS_PER_PAGE = 2000
+DEFAULT_FETCH_TIMEOUT = 30
+DEFAULT_USER_AGENT = 'RECON/1.0 (Knowledge Extraction Pipeline)'
+DEFAULT_RATE_LIMIT_DELAY = 1.0
+
+
+def _get_scraper_config(config=None):
+    """Get web scraper settings from config, with defaults."""
+    if config is None:
+        config = get_config()
+    ws = config.get('web_scraper', {})
+    return {
+        'words_per_page': ws.get('words_per_page', DEFAULT_WORDS_PER_PAGE),
+        'fetch_timeout': ws.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
+        'user_agent': ws.get('user_agent', DEFAULT_USER_AGENT),
+        'rate_limit_delay': ws.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
+        'max_batch_size': ws.get('max_batch_size', 50),
+    }
+
+
+def fetch_url(url, config=None):
+    """
+    Fetch a URL and extract clean text + metadata using trafilatura.
+
+    Returns dict with: text, title, author, date, description, url,
+    sitename, raw_length, text_length.
+
+    Raises ValueError if fetch or extraction fails.
+    """
+    sc = _get_scraper_config(config)
+    logger.info(f"Fetching URL: {url}")
+
+    try:
+        response = requests.get(
+            url,
+            headers={'User-Agent': sc['user_agent']},
+            timeout=sc['fetch_timeout'],
+            allow_redirects=True
+        )
+        response.raise_for_status()
+    except requests.RequestException as e:
+        raise ValueError(f"Failed to fetch {url}: {e}") from e
+
+    raw_html = response.text
+    if not raw_html or len(raw_html) < 100:
+        raise ValueError(f"Empty or too-short response from {url}")
+
+    text = trafilatura.extract(
+        raw_html,
+        include_comments=False,
+        include_tables=True,
+        include_links=False,
+        include_images=False,
+        favor_precision=False,
+        deduplicate=True
+    )
+
+    if not text or len(text.strip()) < 50:
+        raise ValueError(f"No meaningful text extracted from {url}")
+
+    metadata = trafilatura.extract_metadata(raw_html)
+
+    result = {
+        'text': text.strip(),
+        'title': '',
+        'author': '',
+        'date': '',
+        'description': '',
+        'url': url,
+        'sitename': '',
+        'raw_length': len(raw_html),
+        'text_length': len(text.strip()),
+    }
+
+    if metadata:
+        result['title'] = metadata.title or ''
+        result['author'] = metadata.author or ''
+        result['date'] = metadata.date or ''
+        result['description'] = metadata.description or ''
+        result['sitename'] = metadata.sitename or ''
+
+    if not result['title']:
+        result['title'] = _title_from_url(url)
+
+    logger.info(f"Extracted {result['text_length']} chars from {url} — \"{result['title']}\"")
+    return result
+
+
+def _title_from_url(url):
+    """Generate a readable title from a URL as fallback."""
+    parsed = urlparse(url)
+    path = unquote(parsed.path).strip('/')
+    if path:
+        segment = path.split('/')[-1]
+        segment = re.sub(r'[-_]', ' ', segment)
+        segment = re.sub(r'\.\w+$', '', segment)
+        return segment.title() if segment else parsed.netloc
+    return parsed.netloc
+
+
+def chunk_text(text, words_per_page=DEFAULT_WORDS_PER_PAGE):
+    """
+    Split text into page-sized chunks for enrichment windows.
+
+    Breaks at paragraph boundaries; paragraphs over 1.5x words_per_page are
+    split at sentence boundaries. Returns list of strings (each is one "page").
+    """
+    paragraphs = text.split('\n\n')
+    pages = []
+    current_page = []
+    current_words = 0
+
+    for para in paragraphs:
+        para = para.strip()
+        if not para:
+            continue
+
+        para_words = len(para.split())
+
+        if para_words > words_per_page * 1.5:
+            if current_page:
+                pages.append('\n\n'.join(current_page))
+                current_page = []
+                current_words = 0
+
+            for sentence in re.split(r'(?<=[.!?])\s+', para):
+                sentence_words = len(sentence.split())
+                if current_words + sentence_words > words_per_page and current_page:
+                    pages.append(' '.join(current_page))
+                    current_page = [sentence]
+                    current_words = sentence_words
+                else:
+                    current_page.append(sentence)
+                    current_words += sentence_words
+            current_page = [' '.join(current_page)]  # rejoin this paragraph's leftover sentences with spaces
+        elif current_words + para_words > words_per_page and current_page:
+            pages.append('\n\n'.join(current_page))
+            current_page = [para]
+            current_words = para_words
+        else:
+            current_page.append(para)
+            current_words += para_words
+
+    if current_page:
+        pages.append('\n\n'.join(current_page))
+
+    if not pages:
+        pages = [text]
+
+    return pages
+
+
+def _content_hash(text):
+    """MD5 hash of text content — same hash type as PDF pipeline."""
+    return hashlib.md5(text.encode('utf-8')).hexdigest()
+
+
+def _display_filename(url):
+    """Create a display filename from a URL."""
+    parsed = urlparse(url)
+    name = f"{parsed.netloc}_{parsed.path.strip('/').replace('/', '_')}"
+    name = re.sub(r'[^\w._-]', '_', name)[:200]
+    if not name.endswith('.html'):
+        name += '.html'
+    return name
+
+
+def ingest_url(url, category='Web', source='web', config=None):
+    """
+    Full URL ingestion: fetch -> extract -> chunk -> save -> catalogue -> queue as extracted.
+
+    Returns dict with hash, title, page_count, status.
+    Raises ValueError on failure.
+    """
+    if config is None:
+        config = get_config()
+    sc = _get_scraper_config(config)
+    db = StatusDB()
+
+    # Fetch and extract
+    extracted = fetch_url(url, config)
+
+    # Hash the extracted text content
+    doc_hash = _content_hash(extracted['text'])
+
+    # Check for duplicate in catalogue
+    conn = db._get_conn()
+    existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
+    if existing:
+        # Also check documents table for status
+        doc = db.get_document(doc_hash)
+        existing_status = doc['status'] if doc else existing['status']
+        logger.info(f"Duplicate content (hash {doc_hash[:12]}...) — already exists as '{existing['filename']}'")
+        return {
+            'hash': doc_hash,
+            'status': 'duplicate',
+            'title': (doc['book_title'] or '') if doc else existing['filename'],
+            'existing_status': existing_status,
+        }
+
+    # Chunk into pages
+    pages = chunk_text(extracted['text'], sc['words_per_page'])
+
+    # Save text files in extractor-compatible format:
+    # data/text/{hash}/page_0001.txt, page_0002.txt, ... + meta.json
+    text_dir = os.path.join(config['paths']['text'], doc_hash)
+    os.makedirs(text_dir, exist_ok=True)
+
+    for i, page_text in enumerate(pages, 1):
+        page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
+        with open(page_file, 'w', encoding='utf-8') as f:
+            f.write(page_text)
+
+    meta = {
+        'hash': doc_hash,
+        'source_type': 'web',
+        'url': url,
+        'title': extracted['title'],
+        'author': extracted['author'],
+        'date': extracted['date'],
+        'description': extracted['description'],
+        'sitename': extracted['sitename'],
+        'page_count': len(pages),
+        'text_length': extracted['text_length'],
+        'fetched_at': datetime.now(timezone.utc).isoformat(),
+    }
+    with open(os.path.join(text_dir, 'meta.json'), 'w', encoding='utf-8') as f:
+        json.dump(meta, f, indent=2)
+
+    display_name = _display_filename(url)
+
+    # Add to catalogue
+    db.add_to_catalogue(doc_hash, display_name, url, extracted['text_length'], source, category)
+
+    # Queue (creates documents entry as 'queued')
+    db.queue_document(doc_hash)
+
+    # Advance directly to 'extracted' — text is already saved, skip PDF extraction
+    db.update_status(doc_hash, 'extracted',
+                     page_count=len(pages),
+                     pages_extracted=len(pages),
+                     book_title=extracted['title'],
+                     book_author=extracted['author'] or None)
+
+    logger.info(f"Ingested URL: {url} -> {doc_hash[:12]}... ({len(pages)} pages, \"{extracted['title']}\")")
+
+    return {
+        'hash': doc_hash,
+        'status': 'extracted',
+        'title': extracted['title'],
+        'author': extracted['author'],
+        'page_count': len(pages),
+        'url': url,
+    }
+
+
+def ingest_urls(urls, category='Web', source='web', delay=None, config=None):
+    """
+    Batch URL ingestion with rate limiting.
+    Returns list of result dicts (one per URL).
+    """
+    if config is None:
+        config = get_config()
+    if delay is None:
+        delay = _get_scraper_config(config)['rate_limit_delay']
+
+    results = []
+    total = len(urls)
+
+    for i, url in enumerate(urls, 1):
+        url = url.strip()
+        if not url or url.startswith('#'):
+            continue
+
+        logger.info(f"[{i}/{total}] Processing: {url}")
+
+        try:
+            result = ingest_url(url, category=category, source=source, config=config)
+            result['url'] = url
+            results.append(result)
+        except Exception as e:
+            logger.error(f"[{i}/{total}] Failed: {url} — {e}")
+            results.append({
+                'url': url,
+                'status': 'failed',
+                'error': str(e),
+            })
+
+        if i < total and delay > 0:
+            time.sleep(delay)
+
+    succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
+    failed = sum(1 for r in results if r.get('status') == 'failed')
+    dupes = sum(1 for r in results if r.get('status') == 'duplicate')
+    logger.info(f"Batch complete: {succeeded} new, {dupes} duplicates, {failed} failed out of {total}")
+
+    return results
--- a/migrate_paths.py
+++ b/migrate_paths.py
@ -0,0 +1,72 @@
+#!/usr/bin/env python3
+"""One-time migration: rescan library to detect moved files and sync paths to Qdrant.
+
+This rescans all PDFs in the library. The upsert in add_to_catalogue() will
+detect any files whose paths changed since they were originally catalogued,
+and flag them with path_updated_at. Then sync_qdrant_paths() propagates
+those path changes to Qdrant download_url payloads.
+
+Usage: cd /opt/recon && source venv/bin/activate && python3 migrate_paths.py [--dry-run]
+"""
+import sys
+import os
+
+sys.path.insert(0, '/opt/recon')
+
+from recon import scan_library, sync_qdrant_paths
+from lib.status import StatusDB
+from lib.utils import setup_logging
+
+logger = setup_logging('recon.migrate')
+
+
+def main():
+    dry_run = '--dry-run' in sys.argv
+
+    db = StatusDB()
+    conn = db._get_conn()
+
+    total_cat = conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
+    total_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
+    print(f"Before: {total_cat} catalogue entries, {total_docs} documents")
+
+    # Rescan library — upsert will detect and flag path changes
+    print("\nScanning library (this will re-hash all files)...")
+    count = scan_library()
+    print(f"Scanned {count} PDFs")
+
+    # Check how many paths changed
+    updates = db.get_path_updates()
+    print(f"\nDetected {len(updates)} path changes")
+
+    if not updates:
+        print("No paths need syncing — all up to date")
+        return 0
+
+    # Show what changed
+    for row in updates[:20]:
+        print(f"  {row['hash'][:8]} {row['filename']}")
+    if len(updates) > 20:
+        print(f"  ... and {len(updates) - 20} more")
+
+    if dry_run:
+        print(f"\n[DRY RUN] Would sync {len(updates)} paths to Qdrant. Re-run without --dry-run to apply.")
+        return 0
+
+    # Sync to Qdrant
+    print(f"\nSyncing {len(updates)} paths to Qdrant...")
+    synced = sync_qdrant_paths()
+    print(f"Synced {synced} document paths to Qdrant")
+
+    # Verify
+    remaining = db.get_path_updates()
+    if remaining:
+        print(f"\nWARNING: {len(remaining)} paths still pending (Qdrant sync may have partially failed)")
+    else:
+        print("\nAll paths synced successfully")
+
+    return 0
+
+
+if __name__ == '__main__':
+    sys.exit(main())
--- a/recon.py
+++ b/recon.py
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,69 @@
+annotated-types==0.7.0
+anyio==4.12.1
+babel==2.18.0
+beautifulsoup4==4.14.3
+blinker==1.9.0
+certifi==2026.1.4
+cffi==2.0.0
+charset-normalizer==3.4.4
+click==8.3.1
+courlan==1.3.2
+cryptography==46.0.5
+dateparser==1.3.0
+Flask==3.1.2
+google-ai-generativelanguage==0.6.15
+google-api-core==2.29.0
+google-api-python-client==2.190.0
+google-auth==2.48.0
+google-auth-httplib2==0.3.0
+google-generativeai==0.8.6
+googleapis-common-protos==1.72.0
+grpcio==1.78.0
+grpcio-status==1.71.2
+h11==0.16.0
+h2==4.3.0
+hpack==4.1.0
+htmldate==1.9.4
+httpcore==1.0.9
+httplib2==0.31.2
+httpx==0.28.1
+hyperframe==6.1.0
+idna==3.11
+itsdangerous==2.2.0
+Jinja2==3.1.6
+jusText==3.0.2
+lxml==6.0.2
+lxml_html_clean==0.4.3
+MarkupSafe==3.0.3
+numpy==2.4.2
+packaging==26.0
+pillow==12.1.1
+portalocker==3.2.0
+proto-plus==1.27.1
+protobuf==5.29.6
+pyasn1==0.6.2
+pyasn1_modules==0.4.2
+pycparser==3.0
+pydantic==2.12.5
+pydantic_core==2.41.5
+pyparsing==3.3.2
+PyPDF2==3.0.1
+pytesseract==0.3.13
+python-dateutil==2.9.0.post0
+pytz==2025.2
+PyYAML==6.0.3
+qdrant-client==1.16.2
+regex==2026.1.15
+requests==2.32.5
+rsa==4.9.1
+six==1.17.0
+soupsieve==2.8.3
+tld==0.13.1
+tqdm==4.67.3
+trafilatura==2.0.0
+typing-inspection==0.4.2
+typing_extensions==4.15.0
+tzlocal==5.3.1
+uritemplate==4.2.0
+urllib3==2.6.3
+Werkzeug==3.1.5
--- a/run-pipeline-now.sh
+++ b/run-pipeline-now.sh
@ -0,0 +1,67 @@
+#!/bin/bash
+# RECON Pipeline — Skip scan, run extract + enrich in parallel, then embed
+# Scan already completed (10,162 catalogued). 6,211 extracted, 3,603 queued.
+
+set -euo pipefail
+cd /opt/recon
+source venv/bin/activate
+
+LOGDIR="logs"
+mkdir -p "$LOGDIR"
+TS=$(date +%Y%m%d_%H%M%S)
+MAIN_LOG="$LOGDIR/pipeline_${TS}.log"
+
+log() {
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$MAIN_LOG"
+}
+
+log "=== RECON Pipeline (parallel extract+enrich) ==="
+log "Skipping scan (already done). Starting extract + enrich concurrently."
+
+# Reset any stuck docs from previous kill
+sqlite3 data/recon.db "UPDATE documents SET status='queued' WHERE status='extracting';"
+sqlite3 data/recon.db "UPDATE documents SET status='extracted' WHERE status='enriching';"
+sqlite3 data/recon.db "UPDATE documents SET status='enriched' WHERE status='embedding';"
+
+# Status before
+log "Before:"
+sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | while read -r line; do log "  $line"; done
+
+# Start extract and enrich in parallel
+log "--- Starting Extract (4 workers) + Enrich (16 workers) ---"
+
+python3 recon.py extract --workers 4 >> "$LOGDIR/extract_${TS}.log" 2>&1 &
+EXTRACT_PID=$!
+log "  Extract PID: $EXTRACT_PID"
+
+sleep 3
+
+python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1 &
+ENRICH_PID=$!
+log "  Enrich PID: $ENRICH_PID"
+
+# Monitor loop — report progress every 5 minutes
+while kill -0 "$EXTRACT_PID" 2>/dev/null || kill -0 "$ENRICH_PID" 2>/dev/null; do
+    sleep 300
+    STATS=$(sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | tr '\n' ' ')
+    log "  Progress: $STATS"
+done
+
+log "  Extract + Enrich finished"
+
+# Second enrich pass (catch docs extracted during first enrich)
+REMAINING=$(sqlite3 data/recon.db "SELECT COUNT(*) FROM documents WHERE status='extracted';")
+if [ "$REMAINING" -gt 0 ]; then
+    log "--- Enrich pass 2: $REMAINING remaining ---"
+    python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1
+    log "  Pass 2 complete"
+fi
+
+# Embed
+log "--- Embed ---"
+python3 recon.py embed --workers 4 >> "$LOGDIR/embed_${TS}.log" 2>&1
+log "  Embed complete"
+
+log "=== Pipeline Complete ==="
+python3 recon.py status 2>&1 | tee -a "$MAIN_LOG"
+log "Finished: $(date)"
--- a/scripts/init.py
+++ b/scripts/init.py
--- a/scripts/aa_download.py
+++ b/scripts/aa_download.py
@ -0,0 +1,373 @@
+#!/usr/bin/env python3
+"""
+aa_download.py — Anna's Archive bulk downloader for RECON library acquisition.
+
+For each target book:
+  1. Searches Anna's Archive (host set by BASE_AA) for the title + author
+  2. Extracts the best PDF match (verified by author/page count)
+  3. Gets the MD5 from the book page
+  4. Attempts download from Libgen mirrors in order
+  5. Verifies downloaded file is a valid PDF
+  6. Writes full acquisition report
+
+Usage:
+  python3 /opt/recon/scripts/aa_download.py [--dry-run] [--limit N]
+
+Report output: ~/projects/recon/aa_acquisition_report.md
+"""
+
+import json
+import time
+import random
+import hashlib
+import logging
+import argparse
+from pathlib import Path
+from datetime import datetime
+
+import requests
+from bs4 import BeautifulSoup
+
+REPORT_PATH = Path.home() / "projects/recon/aa_acquisition_report.md"
+LOG_FILE    = Path("/opt/recon/logs/aa_download.log")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("aa_download")
+
+SESSION = requests.Session()
+SESSION.headers.update({
+    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
+    "Accept-Language": "en-US,en;q=0.9",
+})
+
+BASE_AA = "https://annas-archive.gl"
+
+# Download attempt order — try fastest mirrors first
+LIBGEN_MIRRORS = [
+    "https://libgen.is/get.php?md5={md5}",
+    "https://libgen.rs/get.php?md5={md5}",
+    "https://libgen.st/get.php?md5={md5}",
+    "https://libgen.li/ads.php?md5={md5}",
+]
+
+# ── Target book list ──────────────────────────────────────────────────────────
+TARGETS = [
+    # (title, author, dest_dir)
+
+    # Medical — Herbalism
+    ("Medical Herbalism",                          "David Hoffmann",             "Medical/Herbalism"),
+    ("Making Plant Medicine",                      "Richo Cech",                 "Medical/Herbalism"),
+    ("The Earthwise Herbal Volume 1",              "Matthew Wood",               "Medical/Herbalism"),
+    ("The Earthwise Herbal Volume 2",              "Matthew Wood",               "Medical/Herbalism"),
+    ("Herbal Antibiotics",                         "Stephen Buhner",             "Medical/Herbalism"),
+    ("Herbal Antivirals",                          "Stephen Buhner",             "Medical/Herbalism"),
+    ("The Herbal Medicine-Maker's Handbook",       "James Green",                "Medical/Herbalism"),
+    ("Rosemary Gladstar's Medicinal Herbs",        "Rosemary Gladstar",          "Medical/Herbalism"),
+
+    # Medical — Austere
+    ("Wilderness Medicine",                        "Paul Auerbach",              "Medical/Austere"),
+    ("Medicine for Mountaineering",                "James Wilkerson",            "Medical/Austere"),
+
+    # Medical — Veterinary
+    ("The Chicken Health Handbook",                "Gail Damerow",               "Medical/Veterinary"),
+    ("Goat Husbandry",                             "David Mackenzie",            "Medical/Veterinary"),
+
+    # Power Systems
+    ("The Renewable Energy Handbook",              "William Kemp",               "Power"),
+    ("Homebrew Wind Power",                        "Dan Bartmann",               "Power"),
+    ("Wind Energy Basics",                         "Paul Gipe",                  "Power"),
+    ("12-Volt Bible",                              "Brotherton",                 "Power"),
+    ("Wiring a House",                             "Rex Cauldwell",              "Power"),
+
+    # Navigation
+    ("Wilderness Navigation",                      "Bob Burns",                  "Navigation"),
+    ("Be Expert with Map and Compass",             "Bjorn Kjellstrom",           "Navigation"),
+    ("Emergency Navigation",                       "David Burch",                "Navigation"),
+    ("The Natural Navigator",                      "Tristan Gooley",             "Navigation"),
+    ("The Essential Wilderness Navigator",         "David Seidman",              "Navigation"),
+
+    # Water Systems
+    ("Rainwater Harvesting for Drylands Volume 1", "Brad Lancaster",             "Water"),
+    ("Rainwater Harvesting for Drylands Volume 2", "Brad Lancaster",             "Water"),
+    ("Rainwater Harvesting for Drylands Volume 3", "Brad Lancaster",             "Water"),
+    ("Water Storage",                              "Art Ludwig",                 "Water"),
+    ("The Home Water Supply",                      "Stu Campbell",               "Water"),
+
+    # Food Systems
+    ("The Art of Fermentation",                    "Sandor Katz",                "Food"),
+    ("Fermented Vegetables",                       "Kirsten Shockey",            "Food"),
+    ("Mastering Artisan Cheesemaking",             "Gianaclis Caldwell",         "Food"),
+    ("Home Cheese Making",                         "Ricki Carroll",              "Food"),
+    ("The Art of Natural Cheesemaking",            "David Asher",                "Food"),
+
+    # Permaculture
+    ("Edible Forest Gardens Volume 1",             "Dave Jacke",                 "Permaculture"),
+    ("Edible Forest Gardens Volume 2",             "Dave Jacke",                 "Permaculture"),
+    ("Creating a Forest Garden",                   "Martin Crawford",            "Permaculture"),
+    ("Sepp Holzer's Permaculture",                 "Sepp Holzer",                "Permaculture"),
+    ("The Permaculture Handbook",                  "Peter Bane",                 "Permaculture"),
+    ("The Market Gardener",                        "Jean-Martin Fortier",        "Permaculture"),
+
+    # Scenario / Emergency
+    ("SAS Survival Handbook",                      "John Wiseman",               "Scenario"),
+    ("Pocket Ref",                                 "Thomas Glover",              "Scenario"),
+    ("Deep Survival",                              "Laurence Gonzales",          "Scenario"),
+
+    # Foundational Skills
+    ("Back to Basics",                             "Reader's Digest",            "Skills"),
+    ("A Pattern Language",                         "Christopher Alexander",      "Skills"),
+]
+
+BASE_LIB = Path("/mnt/library/Acquired")
+
+
+def search_aa(title, author):
+    """Search Anna's Archive and return list of candidate result dicts."""
+    query = f"{title} {author}"
+    url = f"{BASE_AA}/search"
+    params = {"q": query, "ext": "pdf", "lang": "en"}
+    try:
+        r = SESSION.get(url, params=params, timeout=20)
+        r.raise_for_status()
+    except Exception as e:
+        log.warning(f"Search failed for '{title}': {e}")
+        return []
+
+    soup = BeautifulSoup(r.text, "html.parser")
+    results = []
+
+    seen_md5 = set()
+    for item in soup.select("a[href^='/md5/']"):
+        href = item.get("href", "")
+        md5 = href.split("/md5/")[-1].split("/")[0].split("?")[0].strip()
+        if not md5 or len(md5) != 32:
+            continue
+        text = item.get_text(" ", strip=True)
+        if not text or md5 in seen_md5:
+            continue
+        seen_md5.add(md5)
+        results.append({"md5": md5, "text": text, "href": href})
+        if len(results) >= 5:
+            break
+
+    return results
+
+
+def get_book_details(md5):
+    """Fetch the book detail page and extract useful metadata."""
+    url = f"{BASE_AA}/md5/{md5}"
+    try:
+        r = SESSION.get(url, timeout=20)
+        r.raise_for_status()
+        soup = BeautifulSoup(r.text, "html.parser")
+        text = soup.get_text(" ", strip=True)
+        # Heuristic page count: the first standalone integer between 51 and 4999 on the page (may be wrong)
+        pages = None
+        for word in text.split():
+            if word.isdigit() and 50 < int(word) < 5000:
+                pages = int(word)
+                break
+        return {"pages": pages, "text": text[:500]}
+    except Exception as e:
+        log.warning(f"Detail fetch failed for md5={md5}: {e}")
+        return {}
+
+
+def try_download(md5, dest_path):
+    """Try each libgen mirror until one works. Returns True on success."""
+    for mirror_tpl in LIBGEN_MIRRORS:
+        url = mirror_tpl.format(md5=md5)
+        try:
+            r = SESSION.get(url, timeout=60, stream=True, allow_redirects=True)
+            content_type = r.headers.get("content-type", "")
+            if r.status_code != 200:
+                continue
+            # Some mirrors return an HTML ads page before the real file
+            if "text/html" in content_type:
+                # Parse redirect link from ads page
+                soup = BeautifulSoup(r.text, "html.parser")
+                dl_link = soup.select_one("a[href*='.pdf']")
+                if not dl_link:
+                    dl_link = soup.select_one("a[href*='get.php']")
+                if not dl_link:
+                    continue
+                actual_url = dl_link["href"]
+                if not actual_url.startswith("http"):
+                    actual_url = f"https://libgen.is{actual_url}"
+                r = SESSION.get(actual_url, timeout=120, stream=True)
+                if r.status_code != 200:
+                    continue
+
+            # Stream to disk
+            dest_path.parent.mkdir(parents=True, exist_ok=True)
+            with open(dest_path, "wb") as f:
+                for chunk in r.iter_content(8192):
+                    f.write(chunk)
+
+            # Verify it's a real PDF
+            with open(dest_path, "rb") as f:
+                header = f.read(4)
+            if header == b"%PDF":
+                size_mb = dest_path.stat().st_size / 1024 / 1024
+                log.info(f"  [OK] {dest_path.name} ({size_mb:.1f}MB) via {url}")
+                return True
+            else:
+                log.warning(f"  [BAD] Not a PDF from {url}")
+                dest_path.unlink(missing_ok=True)
+
+        except Exception as e:
+            log.warning(f"  Mirror failed {url}: {e}")
+            continue
+
+    return False
+
+
+def process_book(title, author, subdir, dry_run):
+    """Full search + download pipeline for one book."""
+    log.info(f"[SEARCH] '{title}' — {author}")
+    result = {
+        "title": title,
+        "author": author,
+        "status": "NOT FOUND",
+        "md5": "",
+        "pages": "",
+        "file": "",
+        "notes": "",
+    }
+
+    candidates = search_aa(title, author)
+    if not candidates:
+        result["notes"] = "No results from AA search"
+        return result
+
+    # Pick best candidate: prefer one whose text contains the author's last name
+    best = None
+    for c in candidates:
+        if author.split()[-1].lower() in c["text"].lower():
+            best = c
+            break
+    if not best:
+        best = candidates[0]  # take first result if no author match
+
+    md5 = best["md5"]
+    result["md5"] = md5
+
+    details = get_book_details(md5)
+    result["pages"] = details.get("pages") or ""  # avoid "None" in the report when no page count was found
+
+    if dry_run:
+        result["status"] = "DRY RUN — found"
+        result["notes"] = f"MD5: {md5}"
+        return result
+
+    # Build destination path
+    safe_title = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
+    safe_author = author.split()[-1]
+    filename = f"{safe_title}_{safe_author}.pdf"
+    dest = BASE_LIB / subdir / filename
+
+    if dest.exists():
+        result["status"] = "ALREADY EXISTS"
+        result["file"] = str(dest)
+        return result
+
+    log.info(f"  MD5: {md5} — attempting download...")
+    ok = try_download(md5, dest)
+
+    if ok:
+        result["status"] = "DOWNLOADED"
+        result["file"] = str(dest)
+    else:
+        result["status"] = "MD5 ONLY"
+        result["notes"] = f"All mirrors failed. MD5: {md5}"
+
+    return result
+
+
+def write_report(results):
+    REPORT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    downloaded   = [r for r in results if r["status"] == "DOWNLOADED"]
+    md5_only     = [r for r in results if r["status"] == "MD5 ONLY"]
+    not_found    = [r for r in results if r["status"] == "NOT FOUND"]
+    already_have = [r for r in results if r["status"] == "ALREADY EXISTS"]
+
+    lines = [
+        "# Anna's Archive Acquisition Report",
+        f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
+        f"**Total searched:** {len(results)}",
+        "",
+        "| Status | Count |",
+        "|--------|-------|",
+        f"| Downloaded | {len(downloaded)} |",
+        f"| MD5 only (mirrors failed) | {len(md5_only)} |",
+        f"| Not found on AA | {len(not_found)} |",
+        f"| Already in library | {len(already_have)} |",
+        "",
+    ]
+
+    if downloaded:
+        lines += ["## Downloaded", ""]
+        lines += ["| Title | Author | Pages | File |", "|-------|--------|-------|------|"]
+        for r in downloaded:
+            lines.append(f"| {r['title']} | {r['author']} | {r['pages']} | `{Path(r['file']).name}` |")
+        lines.append("")
+
+    if md5_only:
+        lines += ["## Found on AA — Download Failed (use MD5 for manual retrieval)", ""]
+        lines += ["| Title | Author | MD5 | Notes |", "|-------|--------|-----|-------|"]
+        for r in md5_only:
+            lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` | {r['notes']} |")
+        lines.append("")
+
+    if not_found:
+        lines += ["## Not Found on Anna's Archive", ""]
+        lines += ["| Title | Author | Notes |", "|-------|--------|-------|"]
+        for r in not_found:
+            lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
+        lines.append("")
+
+    if already_have:
+        lines += ["## Already in Library", ""]
+        lines += ["| Title | Author |", "|-------|--------|"]
+        for r in already_have:
+            lines.append(f"| {r['title']} | {r['author']} |")
+        lines.append("")
+
+    REPORT_PATH.write_text("\n".join(lines))
+    log.info(f"Report written to {REPORT_PATH}")
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true")
+    parser.add_argument("--limit", type=int, default=None)
+    args = parser.parse_args()
+
+    targets = TARGETS[:args.limit] if args.limit else TARGETS
+    log.info(f"Starting AA acquisition: {len(targets)} books | dry_run={args.dry_run}")
+
+    results = []
+    for i, (title, author, subdir) in enumerate(targets, 1):
+        log.info(f"[{i}/{len(targets)}]")
+        result = process_book(title, author, subdir, args.dry_run)
+        results.append(result)
+        log.info(f"  -> {result['status']}")
+        # Polite delay between requests
+        time.sleep(random.uniform(8, 15))
+
+    write_report(results)
+
+    print("\n-- Summary -----------------------------------------------")
+    for status in ["DOWNLOADED", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN — found"]:
+        count = sum(1 for r in results if r["status"] == status)
+        if count:
+            print(f"  {status:<35} {count:>3}")
+    print(f"  Report: {REPORT_PATH}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/aa_download_pass2.py
+++ b/scripts/aa_download_pass2.py
@ -0,0 +1,478 @@
+#!/usr/bin/env python3
+"""
+aa_download_pass2.py — Second-pass downloader for books that failed in pass 1.
+
+Uses the MD5s hardcoded in PASS1_FAILURES (falling back to the pass 1 report) and tries:
+  1. Z-Library search by title/author (separate catalog from Libgen)
+  2. Fresh AA search for an MD5, only when pass 1 recorded none
+  3. IPFS gateways using CIDs scraped from the AA detail page (a CID is not the MD5)
+  4. Alternative Libgen mirrors not tried in pass 1
+
+Checkpoint: saves progress to /opt/recon/data/aa_pass2_checkpoint.json
+  so interrupted runs resume where they left off.
+
+Usage:
+  python3 /opt/recon/scripts/aa_download_pass2.py [--dry-run]
+"""
+
+import json
+import time
+import random
+import logging
+import hashlib
+import argparse
+from pathlib import Path
+from datetime import datetime
+
+import requests
+from bs4 import BeautifulSoup
+
+LOG_FILE       = Path("/opt/recon/logs/aa_download_pass2.log")
+REPORT_IN      = Path.home() / "projects/recon/aa_acquisition_report.md"
+REPORT_OUT     = Path.home() / "projects/recon/aa_acquisition_report_pass2.md"
+CHECKPOINT     = Path("/opt/recon/data/aa_pass2_checkpoint.json")
+BASE_LIB       = Path("/mnt/library/Acquired")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("aa_pass2")
+
+SESSION = requests.Session()
+SESSION.headers.update({
+    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
+    "Accept-Language": "en-US,en;q=0.9",
+})
+
+# ── Mirrors to try in order ───────────────────────────────────────────────────
+MIRRORS = [
+    # Libgen alternatives
+    "https://libgen.li/ads.php?md5={md5}",
+    "https://library.lol/main/{md5}",
+    "https://libgen.rocks/get.php?md5={md5}",
+    # Z-Library direct MD5 endpoint (sometimes works)
+    "https://z-library.se/md5/{md5}",
+    # IPFS public gateways (an MD5 is not an IPFS CID, so these are not expected to resolve; real CIDs go through try_ipfs_download)
+    "https://cloudflare-ipfs.com/ipfs/{md5}",
+    "https://ipfs.io/ipfs/{md5}",
+    "https://gateway.pinata.cloud/ipfs/{md5}",
+]
+
+# ── Books that failed in pass 1 — title, author, md5, subdir ─────────────────
+PASS1_FAILURES = [
+    # Medical/Herbalism
+    ("The Earthwise Herbal Volume 1",         "Matthew Wood",         "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),
+    ("The Earthwise Herbal Volume 2",         "Matthew Wood",         "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),  # NOTE: same MD5 as Volume 1; likely a pass 1 mismatch, verify
+    ("Herbal Antibiotics",                    "Stephen Buhner",       "5839dab78edfdff0d7986fac62b814da", "Medical/Herbalism"),
+    ("The Herbal Medicine-Maker's Handbook",  "James Green",          "27e8e8a3585705ed194029b69c7d61b1", "Medical/Herbalism"),
+    ("Rosemary Gladstar's Medicinal Herbs",   "Rosemary Gladstar",    "9b1966f20a32ab4331bfece167be1dd0", "Medical/Herbalism"),
+
+    # Medical/Austere
+    ("Wilderness Medicine",                   "Paul Auerbach",        "957818eaa4ec40527bb05902f9ef7c51", "Medical/Austere"),
+    ("Medicine for Mountaineering",           "James Wilkerson",      "39cb07998f2034206f0c9472e44cb0b4", "Medical/Austere"),
+
+    # Medical/Veterinary
+    ("The Chicken Health Handbook",           "Gail Damerow",         "0ba42fbea034b9a08ec8e2f8d7606efe", "Medical/Veterinary"),
+
+    # Power
+    ("The Renewable Energy Handbook",         "William Kemp",         "475d89fa80aea6c45aa4b1b4b9c5e274", "Power"),
+    ("Homebrew Wind Power",                   "Dan Bartmann",         "0578696d5b1b6bceb3e5e3302c1a31aa", "Power"),
+    ("Wind Energy Basics",                    "Paul Gipe",            "ccbe9d22e0a5e32d61921d20d66a8e05", "Power"),
+    ("12-Volt Bible",                         "Brotherton",           "3f964fa6d730fdf2c3d3e231e87cf692", "Power"),
+    ("Wiring a House",                        "Rex Cauldwell",        "5efcb53450e9eb560210eee40678adcf", "Power"),
+
+    # Navigation
+    ("Emergency Navigation",                  "David Burch",          "25e4def9e777b3fa9ca935134732ff9d", "Navigation"),
+
+    # Water
+    ("Water Storage",                         "Art Ludwig",           "17c965ec15c6cf4f09b5377b599a5266", "Water"),
+    ("The Home Water Supply",                 "Stu Campbell",         "9b22677d2f8e8b39f7a6bf032187295b", "Water"),
+
+    # Food
+    ("Fermented Vegetables",                  "Kirsten Shockey",      "74d3bde876b4c17be66c21fdfa85213e", "Food"),
+    ("The Art of Natural Cheesemaking",       "David Asher",          "bc0e0829d701fea9beca912d39f8cc74", "Food"),
+
+    # Permaculture
+    ("Edible Forest Gardens Volume 1",        "Dave Jacke",           "6b069c3bb077fdd89d487a363c070fbb", "Permaculture"),
+    ("Edible Forest Gardens Volume 2",        "Dave Jacke",           "699255bfde7f69285c132a94ec291bf4", "Permaculture"),
+    ("Creating a Forest Garden",              "Martin Crawford",      "96d71d70dba31ae86e14845f913e557e", "Permaculture"),
+    ("Sepp Holzer's Permaculture",            "Sepp Holzer",          "32be55a9fce3e31cacd6912069abb410", "Permaculture"),
+    ("The Permaculture Handbook",             "Peter Bane",           "08cb4492739fda4d01b5a868a408e4a0", "Permaculture"),
+    ("The Market Gardener",                   "Jean-Martin Fortier",  "ac69f6c8c22305b42b539482dc761c19", "Permaculture"),
+
+    # Scenario
+    ("SAS Survival Handbook",                 "John Wiseman",         "fa967fd5fcbeb3c9887e22f73e590c64", "Scenario"),
+    ("Pocket Ref",                            "Thomas Glover",        "8e4988ce513a4aa75e7e6c00ee36692b", "Scenario"),
+    ("Deep Survival",                         "Laurence Gonzales",    "9a907ab13b81ea597407fffdb8ea1b04", "Scenario"),
+
+    # Skills
+    ("A Pattern Language",                    "Christopher Alexander","7f5cc06b5399b65a278c4005ccd8d476", "Skills"),
+]
+
+
+def load_checkpoint():
+    """Load checkpoint: dict of {title: result_dict} for completed books."""
+    if CHECKPOINT.exists():
+        try:
+            return json.loads(CHECKPOINT.read_text())
+        except Exception as e:
+            log.warning(f"Checkpoint unreadable, starting fresh: {e}")
+    return {}
+
+
+def save_checkpoint(completed):
+    """Save checkpoint after each book."""
+    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
+    tmp = str(CHECKPOINT) + ".tmp"
+    with open(tmp, "w") as f:
+        json.dump(completed, f, indent=2)
+    Path(tmp).replace(CHECKPOINT)
+
+
+def load_md5s_from_report():
+    """Parse MD5 hashes from the pass 1 report into a {lowercased title: md5} map (fallback for PASS1_FAILURES)."""
+    if not REPORT_IN.exists():
+        return {}
+    md5_map = {}
+    for line in REPORT_IN.read_text().splitlines():
+        if "`" in line and len(line) > 30:
+            parts = line.split("|")
+            if len(parts) >= 4:
+                title = parts[1].strip()
+                md5_cell = parts[3].strip().strip("`")
+                if len(md5_cell) == 32 and all(c in "0123456789abcdefABCDEF" for c in md5_cell):
+                    md5_map[title.lower()] = md5_cell
+    return md5_map
+
+
+def search_zlib(title, author):
+    """Try Z-Library search endpoint."""
+    try:
+        url = "https://z-library.se/s/"
+        params = {"q": f"{title} {author}", "extension[]": "pdf"}
+        r = SESSION.get(url, params=params, timeout=15)
+        if r.status_code != 200:
+            return None
+        soup = BeautifulSoup(r.text, "html.parser")
+        # Z-lib book links contain /book/
+        for a in soup.select("a[href*='/book/']")[:3]:
+            href = a.get("href", "")
+            if href:
+                book_url = f"https://z-library.se{href}" if href.startswith("/") else href
+                return book_url
+    except Exception as e:
+        log.debug(f"Zlib search failed: {e}")
+    return None
+
+
+def try_zlib_download(book_url, dest_path):
+    """Download from Z-Library book page."""
+    try:
+        r = SESSION.get(book_url, timeout=15)
+        soup = BeautifulSoup(r.text, "html.parser")
+        dl = soup.select_one("a.addDownloadedBook, a[href*='/dl/'], a.btn-primary[href*='download']")
+        if not dl:
+            return False
+        dl_url = dl["href"]
+        if not dl_url.startswith("http"):
+            dl_url = f"https://z-library.se{dl_url}"
+        r2 = SESSION.get(dl_url, timeout=120, stream=True)
+        if r2.status_code != 200:
+            return False
+        dest_path.parent.mkdir(parents=True, exist_ok=True)
+        with open(dest_path, "wb") as f:
+            for chunk in r2.iter_content(8192):
+                f.write(chunk)
+        with open(dest_path, "rb") as f:
+            if f.read(4) == b"%PDF":
+                return True
+        dest_path.unlink(missing_ok=True)
+    except Exception as e:
+        log.debug(f"Zlib download failed: {e}")
+    return False
+
+
+def try_mirrors(md5, dest_path):
+    """Try all mirrors with the MD5."""
+    import re as _re
+    for tpl in MIRRORS:
+        url = tpl.format(md5=md5)
+        try:
+            r = SESSION.get(url, timeout=20, stream=True, allow_redirects=True)
+            if r.status_code != 200:
+                continue
+            ctype = r.headers.get("content-type", "")
+            if "html" in ctype:
+                soup = BeautifulSoup(r.text, "html.parser")
+                # For libgen.li ads page, look for get.php with key
+                dl = None
+                match = _re.search(r'href="(get\.php\?md5=[^"]+)"', r.text)
+                if match:
+                    actual = f"https://libgen.li/{match.group(1)}"
+                else:
+                    dl = (soup.select_one("a[href*='.pdf']") or
+                          soup.select_one("a[href*='get.php']") or
+                          soup.select_one("a[href*='/get/']"))
+                    if not dl:
+                        continue
+                    actual = dl["href"]
+                    if not actual.startswith("http"):
+                        base = url.split("/")[0] + "//" + url.split("/")[2]
+                        actual = base + ("/" if not actual.startswith("/") else "") + actual
+
+                r = SESSION.get(actual, timeout=60, stream=True)
+                if r.status_code != 200:
+                    continue
+
+            dest_path.parent.mkdir(parents=True, exist_ok=True)
+            with open(dest_path, "wb") as f:
+                for chunk in r.iter_content(8192):
+                    f.write(chunk)
+            with open(dest_path, "rb") as f:
+                if f.read(4) == b"%PDF":
+                    size_mb = dest_path.stat().st_size / 1024 / 1024
+                    log.info(f"    [OK] {size_mb:.1f}MB via {url}")
+                    return True
+            dest_path.unlink(missing_ok=True)
+        except Exception as e:
+            log.debug(f"Mirror {url} failed: {e}")
+        time.sleep(2)
+    return False
+
+
+def get_ipfs_cids(md5):
+    """Fetch IPFS CIDs from AA book detail page."""
+    import re as _re
+    cids = []
+    try:
+        r = SESSION.get(f"https://annas-archive.gl/md5/{md5}", timeout=20)
+        if r.status_code == 200:
+            for m in _re.finditer(r'ipfs_cid[:\s]+([A-Za-z0-9]{46,})', r.text):
+                cids.append(m.group(1))
+            # Also check for CIDs in href attributes
+            for m in _re.finditer(r'ipfs://([A-Za-z0-9]{46,})', r.text):
+                if m.group(1) not in cids:
+                    cids.append(m.group(1))
+    except Exception as e:
+        log.debug(f"IPFS CID fetch failed: {e}")
+    return cids
+
+
+def try_ipfs_download(cids, dest_path):
+    """Try downloading via IPFS public gateways."""
+    gateways = [
+        "https://cloudflare-ipfs.com/ipfs/{}",
+        "https://dweb.link/ipfs/{}",
+    ]
+    for cid in cids[:3]:  # limit to first 3 CIDs
+        for gw_tpl in gateways:
+            url = gw_tpl.format(cid)
+            try:
+                r = SESSION.get(url, timeout=15, stream=True)
+                if r.status_code != 200:
+                    continue
+                dest_path.parent.mkdir(parents=True, exist_ok=True)
+                with open(dest_path, "wb") as f:
+                    for chunk in r.iter_content(8192):
+                        f.write(chunk)
+                with open(dest_path, "rb") as f:
+                    if f.read(4) == b"%PDF":
+                        size_mb = dest_path.stat().st_size / 1024 / 1024
+                        log.info(f"    [OK] {size_mb:.1f}MB via IPFS {url[:60]}...")
+                        return True
+                dest_path.unlink(missing_ok=True)
+            except Exception as e:
+                log.debug(f"IPFS {url} failed: {e}")
+            time.sleep(1)
+    return False
+
+
+def search_aa_fresh(title, author):
+    """Fresh AA search across the .gl, .se and .org domains for books with no MD5 from pass 1."""
+    for domain in ["annas-archive.gl", "annas-archive.se", "annas-archive.org"]:
+        try:
+            url = f"https://{domain}/search"
+            params = {"q": f"{title} {author}", "ext": "pdf", "lang": "en"}
+            r = SESSION.get(url, params=params, timeout=15)
+            if r.status_code != 200:
+                continue
+            soup = BeautifulSoup(r.text, "html.parser")
+            for a in soup.select("a[href^='/md5/']"):
+                text = a.get_text(" ", strip=True)
+                if not text:
+                    continue
+                md5 = a["href"].split("/md5/")[-1].split("/")[0].strip()
+                if len(md5) == 32:
+                    if author.split()[-1].lower() in text.lower() or title.split()[0].lower() in text.lower():
+                        return md5
+        except Exception:
+            continue
+    return None
+
+
+def process_book(title, author, md5_hint, subdir, dry_run):
+    result = {
+        "title": title, "author": author,
+        "status": "NOT FOUND", "md5": md5_hint,
+        "file": "", "notes": "",
+    }
+
+    safe_title  = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
+    safe_author = author.split()[-1]
+    dest = BASE_LIB / subdir / f"{safe_title}_{safe_author}.pdf"
+
+    if dest.exists():
+        result["status"] = "ALREADY EXISTS"
+        result["file"] = str(dest)
+        return result
+
+    if dry_run:
+        result["status"] = "DRY RUN"
+        return result
+
+    # 1. Try Z-Library first (different catalog)
+    log.info("  Trying Z-Library...")
+    zlib_url = search_zlib(title, author)
+    if zlib_url:
+        if try_zlib_download(zlib_url, dest):
+            result["status"] = "DOWNLOADED (Z-Library)"
+            result["file"] = str(dest)
+            return result
+
+    # 2. If no MD5 from pass 1, do a fresh AA search
+    md5 = md5_hint
+    if not md5:
+        log.info("  Searching AA for fresh MD5...")
+        md5 = search_aa_fresh(title, author)
+        if md5:
+            result["md5"] = md5
+            log.info(f"  Found MD5: {md5}")
+
+    # 3. Try IPFS with real CIDs from AA detail page
+    if md5:
+        log.info("  Fetching IPFS CIDs from AA...")
+        cids = get_ipfs_cids(md5)
+        if cids:
+            log.info(f"  Found {len(cids)} IPFS CID(s), trying gateways...")
+            if try_ipfs_download(cids, dest):
+                result["status"] = "DOWNLOADED (IPFS)"
+                result["file"] = str(dest)
+                return result
+
+    # 4. Try all mirrors with MD5
+    if md5:
+        log.info(f"  Trying mirrors with MD5 {md5}...")
+        if try_mirrors(md5, dest):
+            result["status"] = "DOWNLOADED (mirror)"
+            result["file"] = str(dest)
+            return result
+        result["status"] = "MD5 ONLY"
+        result["notes"] = f"MD5 confirmed, all mirrors failed: {md5}"
+    else:
+        result["notes"] = "Not found on AA or Z-Library"
+
+    return result
+
+
+def write_report(results):
+    downloaded = [r for r in results if "DOWNLOADED" in r["status"]]
+    md5_only   = [r for r in results if r["status"] == "MD5 ONLY"]
+    not_found  = [r for r in results if r["status"] == "NOT FOUND"]
+    existing   = [r for r in results if r["status"] == "ALREADY EXISTS"]
+
+    lines = [
+        "# AA Acquisition Report -- Pass 2",
+        f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
+        f"**Searched:** {len(results)} | **Downloaded:** {len(downloaded)} | "
+        f"**MD5 only:** {len(md5_only)} | **Not found:** {len(not_found)}",
+        "",
+    ]
+    if downloaded:
+        lines += ["## Downloaded", "",
+                  "| Title | Author | Via | File |",
+                  "|-------|--------|-----|------|"]
+        for r in downloaded:
+            lines.append(f"| {r['title']} | {r['author']} | {r['status']} | `{Path(r['file']).name}` |")
+        lines.append("")
+
+    if existing:
+        lines += ["## Already in Library", "",
+                  "| Title | Author |",
+                  "|-------|--------|"]
+        for r in existing:
+            lines.append(f"| {r['title']} | {r['author']} |")
+        lines.append("")
+
+    if md5_only:
+        lines += ["## MD5 Known -- All Mirrors Failed", "",
+                  "| Title | Author | MD5 |",
+                  "|-------|--------|-----|"]
+        for r in md5_only:
+            lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` |")
+        lines.append("")
+
+    if not_found:
+        lines += ["## Not Found Anywhere", "",
+                  "| Title | Author | Notes |",
+                  "|-------|--------|-------|"]
+        for r in not_found:
+            lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
+        lines.append("")
+
+    REPORT_OUT.parent.mkdir(parents=True, exist_ok=True)
+    REPORT_OUT.write_text("\n".join(lines))
+    log.info(f"Report written to {REPORT_OUT}")
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true")
+    args = parser.parse_args()
+
+    # Load any MD5s captured in pass 1
+    md5_map = load_md5s_from_report()
+    targets = []
+    for title, author, md5_hint, subdir in PASS1_FAILURES:
+        md5 = md5_hint or md5_map.get(title.lower(), "")
+        targets.append((title, author, md5, subdir))
+
+    # Load checkpoint
+    completed = load_checkpoint()
+    if completed:
+        log.info(f"Resuming: {len(completed)} books already processed in previous run")
+
+    log.info(f"Pass 2: {len(targets)} books | dry_run={args.dry_run}")
+    results = []
+    for i, (title, author, md5, subdir) in enumerate(targets, 1):
+        # Check checkpoint — skip already-processed books
+        if title in completed and not args.dry_run:
+            result = completed[title]
+            results.append(result)
+            log.info(f"[{i}/{len(targets)}] {title} — SKIPPED (checkpoint: {result['status']})")
+            continue
+
+        log.info(f"[{i}/{len(targets)}] {title} -- {author}")
+        result = process_book(title, author, md5, subdir, args.dry_run)
+        results.append(result)
+        log.info(f"  -> {result['status']}")
+
+        # Save checkpoint after each book (not in dry-run)
+        if not args.dry_run:
+            completed[title] = result
+            save_checkpoint(completed)
+
+        time.sleep(random.uniform(6, 12))
+
+    write_report(results)
+    print("\n-- Pass 2 Summary ----------------------------------------")
+    for status in ["DOWNLOADED (Z-Library)", "DOWNLOADED (IPFS)", "DOWNLOADED (mirror)", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN"]:
+        count = sum(1 for r in results if r["status"] == status)
+        if count:
+            print(f"  {status:<35} {count:>3}")
+    print(f"  Report: {REPORT_OUT}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/backup.sh
+++ b/scripts/backup.sh
@ -0,0 +1,64 @@
+#!/bin/bash
+# RECON Backup Script
+# Backs up the precious data: SQLite DB, concept JSONs, text extracts, intel feeds, config
+# Qdrant is NOT backed up — rebuilt from JSONs via `recon rebuild`
+# Destination: Contabo VPS (100.64.0.1) via rsync+SSH
+
+set -euo pipefail
+
+RECON_DIR="/opt/recon"
+DATA_DIR="$RECON_DIR/data"
+LOG_FILE="$RECON_DIR/logs/backup.log"
+DATE=$(date +%Y%m%d_%H%M%S)
+
+BACKUP_HOST="root@100.64.0.1"
+BACKUP_BASE="/opt/backups/recon"
+
+log() {
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
+}
+
+mkdir -p "$RECON_DIR/logs"
+
+log "=== RECON Backup Starting ==="
+
+# ── 1. SQLite DB (small, fast, critical) ──
+log "Backing up recon.db..."
+LOCAL_DB_BACKUP="/tmp/recon_${DATE}.db"
+sqlite3 "$DATA_DIR/recon.db" ".backup '$LOCAL_DB_BACKUP'"
+rsync -az "$LOCAL_DB_BACKUP" "$BACKUP_HOST:$BACKUP_BASE/recon_${DATE}.db"
+rm -f "$LOCAL_DB_BACKUP"
+# Keep last 7 daily DB backups on remote
+ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/recon_*.db 2>/dev/null | tail -n +8 | xargs rm -f 2>/dev/null || true"
+log "  recon.db backed up"
+
+# ── 2. Concept JSONs (THE PRECIOUS DATA — $130+ of Gemini work) ──
+log "Syncing concept JSONs..."
+rsync -az --delete "$DATA_DIR/concepts/" "$BACKUP_HOST:$BACKUP_BASE/concepts/"
+CONCEPT_COUNT=$(find "$DATA_DIR/concepts/" -name "*.json" 2>/dev/null | wc -l)
+log "  concepts synced ($CONCEPT_COUNT JSON files)"
+
+# ── 3. Text extracts (regenerable but expensive in time) ──
+log "Syncing text extracts..."
+rsync -az --delete "$DATA_DIR/text/" "$BACKUP_HOST:$BACKUP_BASE/text/"
+TEXT_COUNT=$(find "$DATA_DIR/text/" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l)
+log "  text synced ($TEXT_COUNT document dirs)"
+
+# ── 4. Intel feeds ──
+if [ -d "$DATA_DIR/intel" ]; then
+    log "Syncing intel feeds..."
+    rsync -az --delete "$DATA_DIR/intel/" "$BACKUP_HOST:$BACKUP_BASE/intel/"
+    log "  intel synced"
+fi
+
+# ── 5. Config files ──
+log "Backing up config..."
+rsync -az "$RECON_DIR/config.yaml" "$BACKUP_HOST:$BACKUP_BASE/config_${DATE}.yaml"
+rsync -az "$RECON_DIR/.env" "$BACKUP_HOST:$BACKUP_BASE/env_${DATE}" 2>/dev/null || true
+ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/config_*.yaml 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
+ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/env_* 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
+log "  config backed up"
+
+# ── Summary ──
+BACKUP_SIZE=$(ssh "$BACKUP_HOST" "du -sh $BACKUP_BASE" | cut -f1)
+log "=== Backup Complete: $BACKUP_SIZE on Contabo ==="
--- a/scripts/cleanup_outliers.py
+++ b/scripts/cleanup_outliers.py
@ -0,0 +1,449 @@
+#!/usr/bin/env python3
+"""
+cleanup_outliers.py — Three-pass cleanup of RECON concept data.
+
+Pass 1: Remap ~160 non-canonical domain strings in concept JSONs + Qdrant payloads
+Pass 2: Re-enrich 434 concepts with empty domain arrays via Gemini
+Pass 3: Purge junk/noise URLs from Qdrant + SQLite DB
+
+Usage:
+  python3 /opt/recon/scripts/cleanup_outliers.py [--dry-run] [--skip-pass N]
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import threading
+import sqlite3
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+from qdrant_client import QdrantClient
+from qdrant_client.models import FieldCondition, MatchAny, Filter
+
+import sys
+sys.path.insert(0, '/opt/recon')
+from lib.utils import get_config
+
+LOG_FILE = Path("/opt/recon/logs/cleanup_outliers.log")
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("cleanup_outliers")
+
+CONCEPTS_DIR = Path("/opt/recon/data/concepts")
+DB_PATH = Path("/opt/recon/data/recon.db")
+
+CANONICAL_DOMAINS = {
+    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
+    "Foundational Skills", "Communications", "Medical", "Food Systems",
+    "Navigation", "Logistics", "Power Systems", "Leadership",
+    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
+}
+
+# Non-canonical → canonical remap
+OUTLIER_MAP = {
+    "Zoology":                  "Sustainment Systems",
+    "Botany":                   "Sustainment Systems",
+    "Nature Lore":              "Sustainment Systems",
+    "Ecology":                  "Sustainment Systems",
+    "Navigational Astronomy":   "Navigation",
+    "Troubleshooting":          "Foundational Skills",
+    "Chemistry":                "Foundational Skills",
+    "Metallurgy":               "Foundational Skills",
+    "Weird Science":            "Foundational Skills",
+    "Philosophy of physics":    "Foundational Skills",
+    "Physics":                  "Foundational Skills",
+    "Cell biology":             "Foundational Skills",
+    "Economics":                "Leadership",
+    "Business":                 "Leadership",
+    "Safety":                   "Security",
+    "Law Enforcement":          "Security",
+    "Security & Intelligence":  "Security",
+    "Fire Weather":             "Scenario Playbooks",
+    "Legal":                    "Leadership",
+    # Noise labels with no good fit: fold into the closest canonical domain
+    "Site News":                "Foundational Skills",
+    "Paleogeography":           "Foundational Skills",
+    "Chemical Manipulation":    "Foundational Skills",
+}
+
+# Junk URL patterns — pages with no knowledge value
+JUNK_URL_PATTERNS = [
+    # rocketstoves.com nav/template garbage
+    "rocketstoves.com/favicon",
+    "rocketstoves.com/cropped-favicon",
+    "rocketstoves.com/layouts/",
+    "rocketstoves.com/sample",
+    "rocketstoves.com/templates/",
+    "rocketstoves.com/hello-world",
+    "rocketstoves.com/blog-forthcoming",
+    "rocketstoves.com/contact",
+    "rocketstoves.com/acknowledgements",
+    "rocketstoves.com/ja3",
+    "rocketstoves.com/juxtapositions",
+    "rocketstoves.com/no-name-soi",
+    "rocketstoves.com/big4",
+    "rocketstoves.com/roof",
+    "rocketstoves.com/rmh_dloadcover",
+    "rocketstoves.com/pedcover",
+    "rocketstoves.com/laundry-to-landscape",
+    "rocketstoves.com/barreloven",
+    # NRCS calendar/event noise
+    "nrcs.usda.gov/events/",
+    "nrcs.usda.gov/state-offices/massachusetts",
+    "nrcs.usda.gov/state-offices/nebraska",
+    "nrcs.usda.gov/state-offices/oklahoma",
+    "nrcs.usda.gov/state-offices/utah",
+    "nrcs.usda.gov/conservation-basics/natural-resource-concerns/soil/western-call-for-abstracts",
+    # deeranddeerhunting trophy hunt videos (no knowledge value)
+    "deeranddeerhunting.com/trophy-whitetails-exclusive-videos/",
+    # eattheweeds non-content pages
+    "eattheweeds.com/media-interviews-with-green-deane",
+    "eattheweeds.com/motorcycles-and-mushrooms",
+    "eattheweeds.com/sunny-savage",
+    # foragersharvest nav pages
+    "foragersharvest.com/contact",
+    "foragersharvest.com/podcasts",
+    # motherearthnews classifieds/nav
+    "motherearthnews.com/classifieds/",
+    "motherearthnews.com/biographies/",
+]
+
+CLASSIFY_PROMPT = """\
+Classify this knowledge concept into one or more domains.
+
+VALID DOMAINS (use ONLY these exact strings):
+  Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
+  Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
+  Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination
+
+Concept title: {title}
+Concept tags: {subdomain}
+Concept preview: {content}
+
+Return ONLY valid JSON, no markdown:
+{{"domain": ["Domain Name"]}}
+
+Rules:
+- Never return empty domain list
+- Medical content, herbs, first aid, veterinary → Medical
+- Food growing, foraging, hunting, livestock → Sustainment Systems
+- Food preservation, canning, storage → Food Systems
+- Solar, wind, batteries, generators → Power Systems
+- Water sourcing, filtration, sanitation → Water Systems
+"""
+
+def load_gemini_keys():
+    keys = []
+    for line in Path("/opt/recon/.env").read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    return keys
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+def classify_concept(title, subdomains, content, key):
+    prompt = CLASSIFY_PROMPT.format(
+        title=title or "(untitled)",
+        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
+        content=str(content)[:300] if content else "(none)",
+    )
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "application/json"}
+    )
+    for attempt in range(4):
+        try:
+            resp = model.generate_content(prompt)
+            data = json.loads(resp.text)
+            domains = [d for d in data.get("domain", []) if d in CANONICAL_DOMAINS]
+            if domains:
+                return domains
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate", "503"]):
+                time.sleep(min(5 * (2 ** attempt) + random.uniform(0, 3), 60))
+            else:
+                break
+    return ["Foundational Skills"]
+
+# ── PASS 1: Remap outlier domains ────────────────────────────────────────────
+
+def remap_concept_domains(domains):
+    """Remap any outlier domain names in a domain list."""
+    result = set()
+    changed = False
+    for d in domains:
+        if d in CANONICAL_DOMAINS:
+            result.add(d)
+        elif d in OUTLIER_MAP:
+            result.add(OUTLIER_MAP[d])
+            changed = True
+        else:
+            changed = True  # drop unknown
+    return sorted(result), changed  # sorted so rewritten JSON is deterministic
+
+def pass1_remap_outliers(qdrant, collection, dry_run):
+    log.info("=== PASS 1: Remapping non-canonical outlier domains ===")
+    outlier_names = list(OUTLIER_MAP.keys())
+    stats = defaultdict(int)
+
+    # Scroll through Qdrant finding affected vectors
+    offset = None
+    affected_points = []
+
+    while True:
+        results, offset = qdrant.scroll(
+            collection_name=collection,
+            scroll_filter=Filter(
+                must=[FieldCondition(
+                    key="domain",
+                    match=MatchAny(any=outlier_names)
+                )]
+            ),
+            limit=500,
+            with_payload=True,
+            with_vectors=False,
+            offset=offset,
+        )
+        affected_points.extend(results)
+        if offset is None:
+            break
+
+    log.info(f"Found {len(affected_points)} Qdrant points with outlier domains")
+
+    for point in affected_points:
+        payload = point.payload
+        old_domains = payload.get("domain", [])
+        if isinstance(old_domains, str):
+            old_domains = [old_domains]
+
+        new_domains, changed = remap_concept_domains(old_domains)
+        if not new_domains:
+            new_domains = ["Foundational Skills"]
+
+        if changed:
+            stats["qdrant_updated"] += 1
+            if not dry_run:
+                qdrant.set_payload(
+                    collection_name=collection,
+                    payload={"domain": new_domains},
+                    points=[point.id],
+                )
+
+    # Also fix concept JSON files on disk
+    json_fixed = 0
+    for window_file in CONCEPTS_DIR.rglob("window_*.json"):
+        try:
+            with open(window_file, "r", encoding="utf-8") as f:
+                concepts = json.load(f)
+        except Exception as e:
+            log.warning(f"Skipping unreadable concept file {window_file}: {e}")
+            continue
+
+        if not isinstance(concepts, list):
+            continue
+
+        file_changed = False
+        for concept in concepts:
+            if not isinstance(concept, dict):
+                continue
+            raw = concept.get("domain", [])
+            if isinstance(raw, str):
+                raw = [raw]
+            new, changed = remap_concept_domains(raw)
+            if changed:
+                concept["domain"] = new if new else ["Foundational Skills"]
+                file_changed = True
+
+        if file_changed:
+            json_fixed += 1
+            if not dry_run:
+                with open(window_file, "w", encoding="utf-8") as f:
+                    json.dump(concepts, f, indent=2, ensure_ascii=False)
+
+    log.info(f"Pass 1 complete: {stats['qdrant_updated']} Qdrant points updated, {json_fixed} JSON files updated")
+    return stats
+
+# ── PASS 2: Re-enrich empty domain concepts ──────────────────────────────────
+
+def pass2_empty_domains(qdrant, collection, key_rotator, dry_run):
+    log.info("=== PASS 2: Re-enriching empty domain concepts ===")
+    stats = defaultdict(int)
+
+    # Find empty domain points in Qdrant
+    offset = None
+    empty_points = []
+    while True:
+        results, offset = qdrant.scroll(
+            collection_name=collection,
+            limit=500,
+            with_payload=True,
+            with_vectors=False,
+            offset=offset,
+        )
+        for r in results:
+            d = r.payload.get("domain", [])
+            if not d or d == [""]:
+                empty_points.append(r)
+        if offset is None:
+            break
+
+    log.info(f"Found {len(empty_points)} points with empty domains")
+
+    for point in empty_points:
+        payload = point.payload
+        title = payload.get("title", "")
+        subdomains = payload.get("subdomain", [])
+        content = payload.get("content", payload.get("summary", ""))
+
+        if dry_run:
+            # Classification costs a Gemini API call, so a dry run only counts candidates.
+            stats["would_classify"] += 1
+            continue
+
+        key = key_rotator.next()
+        new_domains = classify_concept(title, subdomains, content, key)
+        stats["classified"] += 1
+
+        qdrant.set_payload(
+            collection_name=collection,
+            payload={"domain": new_domains},
+            points=[point.id],
+        )
+
+        # Also update the concept JSON on disk
+        doc_hash = payload.get("doc_hash", "")
+        if doc_hash:
+            doc_concepts_dir = CONCEPTS_DIR / doc_hash
+            if doc_concepts_dir.exists():
+                for wf in doc_concepts_dir.glob("window_*.json"):
+                    try:
+                        with open(wf, "r", encoding="utf-8") as f:
+                            concepts = json.load(f)
+                        changed = False
+                        for c in concepts:
+                            if isinstance(c, dict) and c.get("title") == title:
+                                d = c.get("domain", [])
+                                if not d or d == [""]:
+                                    c["domain"] = new_domains
+                                    changed = True
+                        if changed and not dry_run:
+                            with open(wf, "w", encoding="utf-8") as f:
+                                json.dump(concepts, f, indent=2, ensure_ascii=False)
+                    except Exception as e:
+                        log.warning(f"Could not update {wf}: {e}")
+
+        time.sleep(0.05)
+
+    log.info(f"Pass 2 complete: {stats['classified']} concepts re-classified")
+    return stats
+
+# ── PASS 3: Purge junk URLs ──────────────────────────────────────────────────
+
+def is_junk_url(url):
+    url_lower = url.lower()
+    return any(pattern.lower() in url_lower for pattern in JUNK_URL_PATTERNS)
+
+def pass3_purge_junk(qdrant, collection, dry_run):
+    log.info("=== PASS 3: Purging junk URLs ===")
+    stats = defaultdict(int)
+
+    # Scroll all web-source points and find junk
+    offset = None
+    junk_point_ids = []
+    junk_doc_hashes = set()
+
+    while True:
+        results, offset = qdrant.scroll(
+            collection_name=collection,
+            scroll_filter=Filter(
+                must=[FieldCondition(key="source_type", match=MatchAny(any=["web"]))]
+            ),
+            limit=500,
+            with_payload=True,
+            with_vectors=False,
+            offset=offset,
+        )
+        for r in results:
+            filename = r.payload.get("filename", "")
+            doc_hash = r.payload.get("doc_hash", "")
+            if is_junk_url(filename):
+                junk_point_ids.append(r.id)
+                if doc_hash:
+                    junk_doc_hashes.add(doc_hash)
+        if offset is None:
+            break
+
+    log.info(f"Found {len(junk_point_ids)} junk vectors across {len(junk_doc_hashes)} documents")
+
+    if not dry_run and junk_point_ids:
+        # Delete in batches
+        batch_size = 500
+        for i in range(0, len(junk_point_ids), batch_size):
+            batch = junk_point_ids[i:i + batch_size]
+            qdrant.delete(collection_name=collection, points_selector=batch)
+        log.info(f"Deleted {len(junk_point_ids)} junk vectors from Qdrant")
+
+        # Mark junk docs as skipped in SQLite
+        conn = sqlite3.connect(str(DB_PATH))
+        for doc_hash in junk_doc_hashes:
+            conn.execute(
+                "UPDATE documents SET status = 'skipped', error_message = 'junk content purged' WHERE hash = ?",
+                (doc_hash,)
+            )
+        conn.commit()
+        conn.close()
+        log.info(f"Marked {len(junk_doc_hashes)} documents as skipped in DB")
+
+    stats["junk_vectors"] = len(junk_point_ids)
+    stats["junk_docs"] = len(junk_doc_hashes)
+    verb = "flagged (dry run)" if dry_run else "purged"
+    log.info(f"Pass 3 complete: {stats['junk_vectors']} vectors, {stats['junk_docs']} docs {verb}")
+    return stats
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true")
+    parser.add_argument("--skip-pass", type=int, action="append", default=[])
+    args = parser.parse_args()
+
+    config = get_config()
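+    keys = load_gemini_keys()
+    if not keys and 2 not in args.skip_pass:
+        # KeyRotator.next() would divide by zero with an empty key list.
+        log.error("No Gemini keys found in .env; pass 2 needs at least one (or use --skip-pass 2)")
+        return
+    rotator = KeyRotator(keys)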
+
+    qdrant = QdrantClient(
+        host=config['vector_db']['host'],
+        port=config['vector_db']['port'],
+        timeout=60
+    )
+    collection = config['vector_db']['collection']
+
+    log.info(f"Starting cleanup | dry_run={args.dry_run} | skipping passes: {args.skip_pass}")
+
+    if 1 not in args.skip_pass:
+        pass1_remap_outliers(qdrant, collection, args.dry_run)
+
+    if 2 not in args.skip_pass:
+        pass2_empty_domains(qdrant, collection, rotator, args.dry_run)
+
+    if 3 not in args.skip_pass:
+        pass3_purge_junk(qdrant, collection, args.dry_run)
+
+    log.info("All passes complete.")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/domain_reenrich.py
+++ b/scripts/domain_reenrich.py
@ -0,0 +1,215 @@
+#!/usr/bin/env python3
+"""
+domain_reenrich.py — Re-enriches solo-Reference concepts that domain_remap.py
+couldn't fix via subdomain lookup. Reads remap_unknowns.jsonl, calls Gemini
+with a lightweight classification-only prompt, updates domain in-place.
+
+Usage:
+  python3 /opt/recon/scripts/domain_reenrich.py [--workers 16] [--limit N]
+
+Reads:  /opt/recon/data/remap_unknowns.jsonl
+Writes: domain field in-place in window JSON files
+Log:    /opt/recon/logs/domain_reenrich.log
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+
+UNKNOWNS_FILE = Path("/opt/recon/data/remap_unknowns.jsonl")
+LOG_FILE = Path("/opt/recon/logs/domain_reenrich.log")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[
+        logging.FileHandler(LOG_FILE),
+        logging.StreamHandler(),
+    ]
+)
+log = logging.getLogger("domain_reenrich")
+
+CANONICAL_DOMAINS = [
+    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
+    "Foundational Skills", "Communications", "Medical", "Food Systems",
+    "Navigation", "Logistics", "Power Systems", "Leadership",
+    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
+]
+
+DOMAIN_SET = set(CANONICAL_DOMAINS)
+
+CLASSIFY_PROMPT = """\
+Classify this knowledge concept into one or more domains.
+
+VALID DOMAINS (use ONLY these exact strings, no others):
+{domains}
+
+Concept title: {title}
+Concept tags: {subdomain}
+Concept preview: {content}
+
+Return ONLY valid JSON, no markdown, no explanation:
+{{"domain": ["Domain Name"]}}
+
+Rules:
+- Use only the domain strings listed above, spelled exactly
+- If genuinely multi-domain assign all that apply
+- Never return empty domain list — pick the closest match
+- Medical content, herbs, first aid, veterinary → Medical
+- Food growing, foraging, hunting, livestock → Sustainment Systems
+- Food preservation, canning, storage → Food Systems
+- Solar, wind, batteries, generators → Power Systems
+- Water sourcing, filtration, sanitation → Water Systems
+"""
+
+def load_gemini_keys():
+    env = Path("/opt/recon/.env")
+    keys = []
+    for line in env.read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    return keys
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+def classify_concept(title, subdomains, content, key):
+    prompt = CLASSIFY_PROMPT.format(
+        domains="\n".join(f"  {d}" for d in CANONICAL_DOMAINS),
+        title=title or "(untitled)",
+        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
+        content=content[:300] if content else "(none)",
+    )
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "application/json"}
+    )
+    for attempt in range(4):
+        try:
+            resp = model.generate_content(prompt)
+            data = json.loads(resp.text)
+            domains = [d for d in data.get("domain", []) if d in DOMAIN_SET]
+            if domains:
+                return domains
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
+                delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
+                time.sleep(delay)
+            else:
+                break
+    return ["Foundational Skills"]  # last-resort fallback
+
+# One lock per window file: several unknowns can live in the same file, and an
+# unguarded read-modify-write from two workers would silently drop one update.
+_file_locks = defaultdict(threading.Lock)
+_file_locks_guard = threading.Lock()
+
+def process_unknown(item, key_rotator):
+    filepath = Path(item["filepath"])
+    title = item.get("title", "")
+    subdomains = item.get("subdomain", [])
+    content = item.get("content_preview", "")
+
+    if not filepath.exists():
+        return "file_missing"
+
+    with _file_locks_guard:
+        file_lock = _file_locks[str(filepath)]
+
+    # Hold the lock across read, classify, and write so concurrent workers
+    # never overwrite each other's changes to the same file.
+    with file_lock:
+        try:
+            with open(filepath, "r", encoding="utf-8") as f:
+                concepts = json.load(f)
+        except Exception:
+            return "read_error"
+
+        if not isinstance(concepts, list):
+            return "not_list"
+
+        # Find this concept by title and update its domain
+        matched = False
+        for concept in concepts:
+            if not isinstance(concept, dict):
+                continue
+            if concept.get("title", "") == title:
+                raw = concept.get("domain", [])
+                if isinstance(raw, str):
+                    raw = [raw]
+                # Only re-enrich if still stuck on Reference
+                if raw == ["Reference"] or raw == []:
+                    key = key_rotator.next()
+                    new_domains = classify_concept(title, subdomains, content, key)
+                    concept["domain"] = new_domains
+                    concept["_reenriched"] = True
+                    matched = True
+                    break
+
+        if not matched:
+            return "already_fixed"
+
+        try:
+            with open(filepath, "w", encoding="utf-8") as f:
+                json.dump(concepts, f, indent=2, ensure_ascii=False)
+        except Exception:
+            return "write_error"
+
+    return "ok"
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--workers", type=int, default=16)
+    parser.add_argument("--limit", type=int, default=None)
+    args = parser.parse_args()
+
+    keys = load_gemini_keys()
+    if not keys:
+        log.error("No Gemini keys found in .env")
+        return
+    rotator = KeyRotator(keys)
+
+    unknowns = []
+    with open(UNKNOWNS_FILE, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                unknowns.append(json.loads(line))
+
+    if args.limit:
+        unknowns = unknowns[:args.limit]
+
+    total = len(unknowns)
+    log.info(f"Re-enriching {total:,} concepts | {args.workers} workers | {len(keys)} API keys")
+    log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f} (conservative)")
+
+    results = defaultdict(int)
+    lock = threading.Lock()
+    done = 0
+
+    with ThreadPoolExecutor(max_workers=args.workers) as ex:
+        futures = {ex.submit(process_unknown, item, rotator): item for item in unknowns}
+        for future in as_completed(futures):
+            status = future.result()
+            with lock:
+                results[status] += 1
+                done += 1
+                if done % 5000 == 0:
+                    pct = done / total * 100
+                    log.info(f"  Progress: {done:,}/{total:,} ({pct:.1f}%) | {dict(results)}")
+            time.sleep(0.05)
+
+    log.info("── Final Results ─────────────────────────────────────────────")
+    for status, count in sorted(results.items(), key=lambda x: -x[1]):
+        log.info(f"  {status:<25} {count:>10,}")
+    log.info(f"  Total: {total:,}")
+
+if __name__ == "__main__":
+    main()
--- a/scripts/domain_remap.py
+++ b/scripts/domain_remap.py
@ -0,0 +1,428 @@
+#!/usr/bin/env python3
+"""
+domain_remap.py — Fix RECON concept domain classifications without API calls.
+
+What this does:
+  1. Strips "Reference" from concepts that have other real domains
+  2. Remaps variant domain spellings to canonical names
+  3. Reclassifies solo-Reference concepts using their subdomain tags
+  4. Writes a JSONL file of true unknowns for API re-enrichment
+
+Each window file is a JSON array of concept dicts.
+Field names: "domain" (list), "subdomain" (list)
+
+Usage:
+  python3 /opt/recon/scripts/domain_remap.py --dry-run   # report only
+  python3 /opt/recon/scripts/domain_remap.py             # apply fixes
+  python3 /opt/recon/scripts/domain_remap.py --workers 16
+"""
+
+import json
+import argparse
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+CONCEPTS_DIR = Path("/opt/recon/data/concepts")
+UNKNOWNS_OUTPUT = Path("/opt/recon/data/remap_unknowns.jsonl")
+
+CANONICAL_DOMAINS = {
+    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
+    "Foundational Skills", "Communications", "Medical", "Food Systems",
+    "Navigation", "Logistics", "Power Systems", "Leadership",
+    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
+}
+
+# Variant → Canonical mapping
+VARIANT_MAP = {
+    # Defense & Tactics
+    "Tactical Ops": "Defense & Tactics",
+    "Tactical_Ops": "Defense & Tactics",
+    "Tactical Operations": "Defense & Tactics",
+    "Tactical": "Defense & Tactics",
+    "Tactical Skills": "Defense & Tactics",
+    "Tactics": "Defense & Tactics",
+    "Tactics & Defense": "Defense & Tactics",
+    "Reconnaissance": "Defense & Tactics",
+    "Fire Support": "Defense & Tactics",
+    "Improvised Munitions": "Defense & Tactics",
+    "Military Intelligence": "Defense & Tactics",
+    "Military History": "Defense & Tactics",
+    "Military Engineering": "Defense & Tactics",
+    # Medical
+    "Medical Care": "Medical",
+    "Medical Alternatives": "Medical",
+    "Medical/Dental": "Medical",
+    "Medical & Dental": "Medical",
+    "medical": "Medical",
+    "Medical Awareness": "Medical",
+    "Medical Disasters": "Medical",
+    "Medical Emergency Survival": "Medical",
+    "Medical Procedures": "Medical",
+    "Medical Treatment": "Medical",
+    "Medical Science": "Medical",
+    "Medical History": "Medical",
+    "Medical Diagnosis": "Medical",
+    "Medical Skills": "Medical",
+    "Medical Supply": "Medical",
+    "Medical Gear": "Medical",
+    "Medical Kits": "Medical",
+    "Medical Logistics": "Logistics",
+    "Medical First Aid": "Medical",
+    "Medical Ethics": "Medical",
+    "Medical Reference Ranges": "Medical",
+    "Medical andSurgical Hints": "Medical",
+    "Medical Aspects of Radiation Injury": "Medical",
+    "Medical Uses": "Medical",
+    "Medical Care in Developing Countries": "Medical",
+    "Survival Medicine": "Medical",
+    "Emergency War Surgery": "Medical",
+    "First Aid": "Medical",
+    "First Aid and Life Saving": "Medical",
+    "Veterinary Medicine": "Medical",
+    "Veterinary Hygiene": "Medical",
+    "Veterinary": "Medical",
+    "Pharmacology": "Medical",
+    "Public Health": "Medical",
+    "Health": "Medical",
+    # Food Systems
+    "Food_Systems": "Food Systems",
+    "Food_systems": "Food Systems",
+    "food_systems": "Food Systems",
+    "Food Preservation": "Food Systems",
+    "Food Safety": "Food Systems",
+    "Food Security": "Food Systems",
+    "Food & Nutrition": "Food Systems",
+    "Diet & Nutrition": "Food Systems",
+    "Culinary Arts": "Food Systems",
+    "Foodprocessing": "Food Systems",
+    "Food": "Food Systems",
+    # Sustainment Systems
+    "Sustainment_Systems": "Sustainment Systems",
+    "Agriculture": "Sustainment Systems",
+    "Agriculture & Natural Resources": "Sustainment Systems",
+    "Agriculture and Natural Resources": "Sustainment Systems",
+    "Horticulture": "Sustainment Systems",
+    "Gardening": "Sustainment Systems",
+    "Hydroponics": "Sustainment Systems",
+    "Survival Skills": "Sustainment Systems",
+    # Foundational Skills
+    "Foundational_Skills": "Foundational Skills",
+    "Primitive Living Skills": "Foundational Skills",
+    "Woodcraft": "Foundational Skills",
+    "Home Workshop": "Foundational Skills",
+    "Science": "Foundational Skills",
+    "Engineering": "Foundational Skills",
+    "Construction": "Foundational Skills",
+    "Industrial Processes": "Foundational Skills",
+    "Machine Technology": "Foundational Skills",
+    "Training": "Foundational Skills",
+    "Education": "Foundational Skills",
+    # Off-Grid Systems
+    "Off-Grid_Systems": "Off-Grid Systems",
+    "Appropriate Technology": "Off-Grid Systems",
+    # Power Systems
+    "Homebrewed Electricity": "Power Systems",
+    "Renewable Energy": "Power Systems",
+    "Renewable Energy FAQs": "Power Systems",
+    "Alternative Fuels": "Power Systems",
+    "Power_Systems": "Power Systems",
+    # Water Systems
+    "Water_Systems": "Water Systems",
+    # Community Coordination
+    "Community_Coordination": "Community Coordination",
+    "Community_coordination": "Community Coordination",
+    "Community": "Community Coordination",
+    # Leadership
+    "Leadership & Planning": "Leadership",
+    "Planning": "Leadership",
+    "Administration": "Leadership",
+    "Governance": "Leadership",
+    "Government": "Leadership",
+    # Communications
+    "Emergency Communications": "Communications",
+    # Security
+    "Security Systems": "Security",
+    # Logistics
+    "Transportation": "Logistics",
+    # Scenario Playbooks
+    "General Preparedness": "Scenario Playbooks",
+    "Emergency Preparedness": "Scenario Playbooks",
+    "Emergency Management": "Scenario Playbooks",
+    "Wilderness Preparedness": "Scenario Playbooks",
+    "Urban Preparedness": "Scenario Playbooks",
+    "Winter Preparedness": "Scenario Playbooks",
+    # Discard (noise domains)
+    "Humor": None,
+    "Recreation": None,
+    "Business": None,
+    "Finance": None,
+    "Economics": None,
+    "Economics/Finances": None,
+    "Weird Science": None,
+}
+
+# Subdomain keyword → canonical domain (for solo-Reference reclassification)
+SUBDOMAIN_MAP = {
+    "first aid": "Medical",
+    "emergency care": "Medical",
+    "emergency medicine": "Medical",
+    "trauma": "Medical",
+    "anatomy": "Medical",
+    "oral rehydration": "Medical",
+    "ors": "Medical",
+    "pharmacology": "Medical",
+    "toxicology": "Medical",
+    "antidote": "Medical",
+    "nerve agent": "Defense & Tactics",
+    "chemical warfare": "Defense & Tactics",
+    "biological warfare": "Defense & Tactics",
+    "nbc": "Defense & Tactics",
+    "infectious disease": "Medical",
+    "microbiology": "Medical",
+    "virology": "Medical",
+    "bacteriology": "Medical",
+    "pediatric": "Medical",
+    "surgery": "Medical",
+    "wound care": "Medical",
+    "veterinary": "Medical",
+    "dental": "Medical",
+    "dentistry": "Medical",
+    "herbal": "Medical",
+    "medicinal plant": "Medical",
+    "medicinal herb": "Medical",
+    "herbalism": "Medical",
+    "food preservation": "Food Systems",
+    "canning": "Food Systems",
+    "fermentation": "Food Systems",
+    "food storage": "Food Systems",
+    "food safety": "Food Systems",
+    "cooking": "Food Systems",
+    "food processing": "Food Systems",
+    "agriculture": "Sustainment Systems",
+    "soil": "Sustainment Systems",
+    "permaculture": "Sustainment Systems",
+    "agroforestry": "Sustainment Systems",
+    "livestock": "Sustainment Systems",
+    "animal husbandry": "Sustainment Systems",
+    "beekeeping": "Sustainment Systems",
+    "foraging": "Sustainment Systems",
+    "hunting": "Sustainment Systems",
+    "fishing": "Sustainment Systems",
+    "gardening": "Sustainment Systems",
+    "mycology": "Sustainment Systems",
+    "mushroom": "Sustainment Systems",
+    "water purification": "Water Systems",
+    "water filtration": "Water Systems",
+    "water sanitation": "Water Systems",
+    "water disinfection": "Water Systems",
+    "water storage": "Water Systems",
+    "well construction": "Water Systems",
+    "rainwater": "Water Systems",
+    "solar": "Power Systems",
+    "wind turbine": "Power Systems",
+    "battery": "Power Systems",
+    "batteries": "Power Systems",
+    "generator": "Power Systems",
+    "photovoltaic": "Power Systems",
+    "charge controller": "Power Systems",
+    "inverter": "Power Systems",
+    "biogas": "Off-Grid Systems",
+    "biomass": "Off-Grid Systems",
+    "wood gasification": "Off-Grid Systems",
+    "rocket stove": "Off-Grid Systems",
+    "mechanical system": "Off-Grid Systems",
+    "power transmission": "Off-Grid Systems",
+    "radio": "Communications",
+    "ham radio": "Communications",
+    "amateur radio": "Communications",
+    "antenna": "Communications",
+    "meshtastic": "Communications",
+    "encryption": "Communications",
+    "navigation": "Navigation",
+    "celestial navigation": "Navigation",
+    "land navigation": "Navigation",
+    "map reading": "Navigation",
+    "compass": "Navigation",
+    "pottery": "Foundational Skills",
+    "ceramics": "Foundational Skills",
+    "blacksmithing": "Foundational Skills",
+    "woodworking": "Foundational Skills",
+    "leatherwork": "Foundational Skills",
+    "textile": "Foundational Skills",
+    "masonry": "Foundational Skills",
+    "metalworking": "Foundational Skills",
+    "historical technology": "Foundational Skills",
+    "weapons": "Defense & Tactics",
+    "firearms": "Defense & Tactics",
+    "ballistics": "Defense & Tactics",
+    "tactics": "Defense & Tactics",
+    "perimeter": "Security",
+    "surveillance": "Security",
+    "supply chain": "Logistics",
+    "logistics": "Logistics",
+    "leadership": "Leadership",
+    "governance": "Leadership",
+    "community": "Community Coordination",
+    "emergency preparedness": "Scenario Playbooks",
+    "disaster": "Scenario Playbooks",
+    "evacuation": "Scenario Playbooks",
+}
+
+
+def remap_domains(domains):
+    """Remap a list of domain strings — variants to canonical, strip Reference."""
+    result = set()
+    for d in domains:
+        if d == "Reference":
+            continue
+        if d in CANONICAL_DOMAINS:
+            result.add(d)
+        elif d in VARIANT_MAP:
+            mapped = VARIANT_MAP[d]
+            if mapped:  # None means discard
+                result.add(mapped)
+        # Unknown non-canonical domains: drop them
+    return list(result)
+
+
+def classify_by_subdomain(subdomains):
+    """Try to infer canonical domain(s) from subdomain keyword matching."""
+    found = set()
+    for sd in subdomains:
+        sd_lower = sd.lower().strip()
+        for key, domain in SUBDOMAIN_MAP.items():
+            if key in sd_lower:
+                found.add(domain)
+    return list(found) if found else None
+
+
+def process_window_file(filepath, dry_run):
+    """Process one window JSON file (array of concepts). Returns per-file stats."""
+    stats = defaultdict(int)
+    unknowns = []
+
+    try:
+        with open(filepath, "r", encoding="utf-8") as f:
+            concepts = json.load(f)
+    except (OSError, ValueError):  # unreadable file, bad encoding, or invalid JSON
+        return {"parse_error": 1}, []
+
+    if not isinstance(concepts, list):
+        return {"skip_not_list": 1}, []
+
+    modified = False
+
+    for concept in concepts:
+        if not isinstance(concept, dict):
+            continue
+
+        raw_domains = concept.get("domain", [])
+        if isinstance(raw_domains, str):
+            raw_domains = [raw_domains]
+
+        subdomains = concept.get("subdomain", [])
+        if isinstance(subdomains, str):
+            subdomains = [subdomains]
+
+        has_reference = "Reference" in raw_domains
+        non_reference = [d for d in raw_domains if d != "Reference"]
+
+        if not has_reference:
+            # No Reference — just fix any variant names
+            remapped = remap_domains(raw_domains)
+            if set(remapped) != set(raw_domains):
+                concept["domain"] = remapped
+                modified = True
+                stats["variant_remapped"] += 1
+            else:
+                stats["no_change"] += 1
+            continue
+
+        # Has Reference — what else does it have?
+        remapped_others = remap_domains(non_reference)
+
+        if remapped_others:
+            # Reference + real domains: drop Reference, keep the rest
+            concept["domain"] = remapped_others
+            modified = True
+            stats["reference_stripped"] += 1
+            continue
+
+        # Solo Reference (or Reference + only-noise): try subdomain lookup
+        inferred = classify_by_subdomain(subdomains)
+        if inferred:
+            concept["domain"] = inferred
+            concept["_reclassified_from_reference"] = True
+            modified = True
+            stats["subdomain_reclassified"] += 1
+            continue
+
+        # True unknown — needs API re-enrichment
+        unknowns.append({
+            "filepath": str(filepath),
+            "title": concept.get("title", ""),
+            "subdomain": subdomains,
+            "content_preview": str(concept.get("content", concept.get("summary", "")))[:300],
+        })
+        stats["needs_enrichment"] += 1
+
+    if modified and not dry_run:
+        with open(filepath, "w", encoding="utf-8") as f:
+            json.dump(concepts, f, indent=2, ensure_ascii=False)
+
+    return dict(stats), unknowns
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Remap RECON concept domains")
+    parser.add_argument("--dry-run", action="store_true", help="Report without writing")
+    parser.add_argument("--workers", type=int, default=16)
+    args = parser.parse_args()
+
+    print(f"[REMAP] Scanning {CONCEPTS_DIR}")
+    print(f"[REMAP] Dry run: {args.dry_run} | Workers: {args.workers}")
+
+    window_files = list(CONCEPTS_DIR.rglob("window_*.json"))
+    print(f"[REMAP] Found {len(window_files):,} window files")
+
+    total_stats = defaultdict(int)
+    all_unknowns = []
+    lock = threading.Lock()
+    done = 0
+
+    with ThreadPoolExecutor(max_workers=args.workers) as ex:
+        futures = {ex.submit(process_window_file, f, args.dry_run): f for f in window_files}
+        for future in as_completed(futures):
+            file_stats, unknowns = future.result()
+            with lock:
+                for k, v in file_stats.items():
+                    total_stats[k] += v
+                all_unknowns.extend(unknowns)
+                done += 1
+                if done % 5000 == 0:
+                    print(f"  {done:,}/{len(window_files):,} files processed...")
+
+    print("\n── Results ─────────────────────────────────────────────────")
+    for status, count in sorted(total_stats.items(), key=lambda x: -x[1]):
+        print(f"  {status:<35} {count:>10,}")
+
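+    # parse_error / skip_not_list count files, not concepts, so exclude them here
+    file_level = ("parse_error", "skip_not_list")
+    total_concepts = sum(v for k, v in total_stats.items() if k not in file_level)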
+    print(f"\n  Total concepts processed:       {total_concepts:>10,}")
+    print(f"  True unknowns for re-enrichment:{len(all_unknowns):>10,}")
+
+    if not args.dry_run and all_unknowns:
+        with open(UNKNOWNS_OUTPUT, "w", encoding="utf-8") as f:
+            for item in all_unknowns:
+                f.write(json.dumps(item) + "\n")
+        print(f"\n  Unknowns written to: {UNKNOWNS_OUTPUT}")
+
+    if args.dry_run:
+        print("\n  [DRY RUN] No files were modified.")
+
+
+if __name__ == "__main__":
+    main()
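
The order in which the script above resolves a `Reference` domain can be sketched in isolation. This is a minimal illustration only: `CANONICAL`, `VARIANTS`, `SUBDOMAIN_KEYWORDS`, and `resolve` are hypothetical stand-ins with a handful of entries, not the script's real tables or functions.

```python
# Hypothetical miniature tables; the real script defines much larger ones.
CANONICAL = {"Medical", "Food Systems", "Water Systems"}
VARIANTS = {"Medicine": "Medical", "Misc": None}  # None means discard
SUBDOMAIN_KEYWORDS = {"canning": "Food Systems", "rainwater": "Water Systems"}


def remap(domains):
    """Map variants to canonical names; drop Reference and unknown labels."""
    out = set()
    for d in domains:
        if d in CANONICAL:
            out.add(d)
        elif VARIANTS.get(d):
            out.add(VARIANTS[d])
    return sorted(out)


def infer(subdomains):
    """Substring keyword lookup over the subdomain strings."""
    if isinstance(subdomains, str):
        subdomains = [subdomains]
    return sorted({dom for sd in subdomains
                   for kw, dom in SUBDOMAIN_KEYWORDS.items() if kw in sd.lower()})


def resolve(concept):
    """Return (domains, outcome) using the same three-step order as the script."""
    domains = concept.get("domain", [])
    if isinstance(domains, str):
        domains = [domains]
    if "Reference" not in domains:
        return remap(domains), "variant_remapped_or_unchanged"
    others = remap([d for d in domains if d != "Reference"])
    if others:                      # Reference plus a real domain: keep the rest
        return others, "reference_stripped"
    inferred = infer(concept.get("subdomain", []))
    if inferred:                    # solo Reference: fall back to subdomains
        return inferred, "subdomain_reclassified"
    return domains, "needs_enrichment"  # left untouched for API re-enrichment
```

A concept tagged `["Reference", "Medicine"]` keeps only `Medical`, while a solo `Reference` concept is reclassified only if one of its subdomains contains a known keyword.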
--- a/scripts/migrate_domains.py
+++ b/scripts/migrate_domains.py
@ -0,0 +1,469 @@
+#!/usr/bin/env python3
+"""
+migrate_domains.py — Reclassify 5 legacy domains via Gemini Flash.
+
+Targets: Sustainment Systems, Off-Grid Systems, Defense & Tactics,
+         Community Coordination, Leadership
+
+Maps each to one of the 18 approved domains. 16 parallel workers,
+checkpoint file, crash-safe, incremental saves, progress every 5,000.
+
+Usage:
+  python3 /tmp/migrate_domains.py [--dry-run] [--workers 16] [--limit N]
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+from qdrant_client import QdrantClient
+from qdrant_client.models import FieldCondition, MatchValue, Filter
+
+# Suppress noisy HTTP logs
+logging.getLogger("httpx").setLevel(logging.WARNING)
+logging.getLogger("qdrant_client").setLevel(logging.WARNING)
+
+LOG_FILE = Path("/opt/recon/logs/migrate_domains.log")
+CHECKPOINT_FILE = Path("/opt/recon/data/migrate_domains_checkpoint.json")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("migrate_domains")
+
+# ── Constants ───────────────────────────────────────────────────────────────
+
+VALID_DOMAINS = {
+    'Agriculture & Livestock', 'Civil Organization', 'Communications',
+    'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
+    'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
+    'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
+    'Vehicles', 'Water Systems', 'Wilderness Skills',
+}
+
+SOURCE_DOMAINS = {
+    'Sustainment Systems', 'Off-Grid Systems', 'Defense & Tactics',
+    'Community Coordination', 'Leadership',
+}
+
+DOMAIN_LIST_STR = ', '.join(sorted(VALID_DOMAINS))
+
+CLASSIFY_PROMPT = """\
+Classify this knowledge concept into exactly one domain from this list:
+Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills
+
+Return ONLY the exact domain string, nothing else. No explanation, no punctuation, no quotes.
+
+Content: {content}
+Summary: {summary}
+Subdomain: {subdomain}
+"""
+
+DOMAIN_FALLBACK = 'Foundational Skills'
+
+# ── Key management ──────────────────────────────────────────────────────────
+
+def load_gemini_keys():
+    keys = []
+    env_path = Path("/opt/recon/.env")
+    if not env_path.exists():
+        raise FileNotFoundError(f"{env_path} not found")
+    for line in env_path.read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    if not keys:
+        raise ValueError("No GEMINI_KEY_* found in .env")
+    return keys
+
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+
+# ── Classification ──────────────────────────────────────────────────────────
+
+def classify_domain(content, summary, subdomains, key):
+    """Call Gemini Flash to classify into one of 18 domains."""
+    prompt = CLASSIFY_PROMPT.format(
+        content=str(content)[:400] if content else "(none)",
+        summary=str(summary)[:200] if summary else "(none)",
+        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
+    )
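+    # NOTE: genai.configure() sets process-wide state, so with several worker
+    # threads a call may run under a key configured by a different thread.
+    genai.configure(api_key=key)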
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "text/plain"}
+    )
+
+    for retry in range(4):
+        try:
+            resp = model.generate_content(prompt)
+            # Normalise: Gemini sometimes adds quotes or a trailing period
+            value = resp.text.strip().strip('"').strip("'").rstrip('.').strip()
+            if value in VALID_DOMAINS:
+                return value
+            # Try case-insensitive match
+            for valid in VALID_DOMAINS:
+                if value.lower() == valid.lower():
+                    return valid
+            # Invalid — retry with stricter prompt
+            if retry < 3:
+                prompt = (
+                    f"Your previous response '{value}' was invalid. "
+                    f"You must return ONLY one of these exact strings: {DOMAIN_LIST_STR}\n\n"
+                    f"Content: {str(content)[:300]}\n"
+                    f"Return ONLY the exact domain string."
+                )
+                continue
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
+                time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
+            else:
+                log.warning(f"Gemini error (attempt {retry+1}): {e}")
+                if retry >= 2:
+                    break
+
+    return heuristic_fallback(content, summary, subdomains)
+
+
+def heuristic_fallback(content, summary, subdomains):
+    """Last-resort heuristic when Gemini fails or returns invalid."""
+    text = f"{summary or ''} {' '.join(subdomains or [])} {str(content or '')[:200]}".lower()
+
+    mapping = [
+        (["farming", "agriculture", "livestock", "animal husbandry", "poultry",
+          "cattle", "crop", "soil fertility", "irrigation for crops"], "Agriculture & Livestock"),
+        (["foraging", "hunting", "fishing", "bushcraft", "wilderness", "survival skill",
+          "fire starting", "shelter building", "trapping", "tracking"], "Wilderness Skills"),
+        (["food preservation", "canning", "dehydration", "smoking", "pickling",
+          "fermentation", "food storage", "freeze dry"], "Preservation & Storage"),
+        (["cooking", "recipe", "nutrition", "food preparation", "baking",
+          "food production", "meal"], "Food Systems"),
+        (["first aid", "medical", "trauma", "surgery", "anatomy", "pharmacology",
+          "wound", "triage", "diagnosis", "disease", "infection", "veterinary",
+          "herbal medicine", "medicinal plant"], "Medical"),
+        (["radio", "antenna", "ham radio", "communication", "signal",
+          "networking", "meshtastic", "comms"], "Communications"),
+        (["solar", "battery", "generator", "wind turbine", "hydroelectric",
+          "power grid", "inverter", "photovoltaic", "electricity"], "Power Systems"),
+        (["water purification", "water filter", "well", "rainwater",
+          "sanitation", "water treatment", "desalination"], "Water Systems"),
+        (["navigation", "compass", "map reading", "gps", "celestial",
+          "orienteering", "land nav"], "Navigation"),
+        (["security", "opsec", "perimeter", "surveillance", "threat",
+          "intrusion detection", "physical security"], "Security"),
+        (["vehicle", "engine", "motor", "aircraft", "boat", "motorcycle",
+          "truck", "maintenance", "diesel", "transmission"], "Vehicles"),
+        (["tool", "equipment", "wrench", "saw", "drill", "hammer",
+          "hand tool", "power tool", "blade", "sharpening"], "Tools & Equipment"),
+        (["construction", "building", "shelter", "carpentry", "masonry",
+          "roofing", "concrete", "framing", "plumbing"], "Shelter & Construction"),
+        (["electronics", "computer", "software", "circuit", "programming",
+          "technology", "digital", "engineering"], "Technology"),
+        (["supply chain", "logistics", "transport", "distribution",
+          "inventory", "supply", "stockpile"], "Logistics"),
+        (["governance", "civil", "community", "administration", "organization",
+          "council", "democratic", "municipal"], "Civil Organization"),
+        (["tactics", "combat", "military", "mission", "patrol", "ambush",
+          "defensive position", "fire team", "maneuver", "engagement",
+          "search and rescue", "sar", "reconnaissance"], "Operations"),
+    ]
+
+    for keywords, domain in mapping:
+        if any(kw in text for kw in keywords):
+            return domain
+
+    return DOMAIN_FALLBACK
+
+
+# ── Checkpoint ──────────────────────────────────────────────────────────────
+
+class Checkpoint:
+    """Thread-safe checkpoint tracker for crash recovery."""
+    def __init__(self, path):
+        self.path = path
+        self._lock = threading.Lock()
+        self._completed = set()
+        self._dirty = 0
+        self._load()
+
+    def _load(self):
+        if self.path.exists():
+            try:
+                data = json.loads(self.path.read_text())
+                self._completed = set(data.get("completed", []))
+                log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
+            except Exception:
+                self._completed = set()
+
+    def is_done(self, point_id):
+        return point_id in self._completed
+
+    def mark_done(self, point_id):
+        with self._lock:
+            self._completed.add(point_id)
+            self._dirty += 1
+            if self._dirty >= 1000:
+                self._flush()
+
+    def _flush(self):
+        tmp = self.path.with_suffix('.tmp')
+        tmp.write_text(json.dumps({"completed": list(self._completed)}))
+        tmp.rename(self.path)
+        self._dirty = 0
+
+    def flush(self):
+        with self._lock:
+            self._flush()
+
+    def count(self):
+        return len(self._completed)
+
+
+# ── Per-point processing ───────────────────────────────────────────────────
+
+_STATS_LOCK = threading.Lock()  # guards the stats dict shared by worker threads
+
+
+def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run, stats):
+    point_id = point.id
+    if checkpoint.is_done(point_id):
+        return "skipped"
+
+    payload = point.payload
+    content = payload.get("content", payload.get("summary", ""))
+    summary = payload.get("summary", "")
+    subdomains = payload.get("subdomain", [])
+    if isinstance(subdomains, str):
+        subdomains = [subdomains]
+    old_domain = payload.get("domain", [])
+    if isinstance(old_domain, list):
+        old_domain_str = old_domain[0] if old_domain else "(empty)"
+    else:
+        old_domain_str = str(old_domain)
+
+    key = key_rotator.next()
+    new_domain = classify_domain(content, summary, subdomains, key)
+
+    # Track the mapping; the read-modify-write must not interleave across threads
+    stats_key = f"{old_domain_str} -> {new_domain}"
+    with _STATS_LOCK:
+        stats[stats_key] = stats.get(stats_key, 0) + 1
+
+    if dry_run:
+        return f"would: {old_domain_str} -> {new_domain}"
+
+    # Write new domain as single string
+    qdrant.set_payload(
+        collection_name=collection,
+        payload={"domain": new_domain},
+        points=[point_id],
+    )
+
+    checkpoint.mark_done(point_id)
+    return "ok"
+
+
+# ── Main loop ───────────────────────────────────────────────────────────────
+
+SCROLL_BATCH = 5000
+
+
+def count_source_domains(qdrant, collection):
+    """Count vectors with source domains."""
+    counts = {}
+    for domain in SOURCE_DOMAINS:
+        result = qdrant.count(
+            collection_name=collection,
+            count_filter=Filter(
+                must=[FieldCondition(key="domain", match=MatchValue(value=domain))]
+            ),
+            exact=True,
+        )
+        counts[domain] = result.count
+    return counts
+
+
+def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None, dry_run=False):
+    """Scroll source domains in batches, process with thread pool."""
+    lock = threading.Lock()
+    done = 0
+    skipped_checkpoint = 0
+    start = time.time()
+    stats = {}  # shared mapping stats
+
+    for source_domain in sorted(SOURCE_DOMAINS):
+        log.info(f"\n--- Processing domain: {source_domain} ---")
+        offset = None
+        domain_done = 0
+
+        while True:
+            scroll_results, offset = qdrant.scroll(
+                collection_name=collection,
+                limit=SCROLL_BATCH,
+                with_payload=True,
+                with_vectors=False,
+                offset=offset,
+                scroll_filter=Filter(
+                    must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
+                ),
+            )
+
+            if not scroll_results:
+                if offset is None:
+                    break
+                continue
+
+            # Filter already checkpointed
+            pending = [p for p in scroll_results if not checkpoint.is_done(p.id)]
+            skipped_checkpoint += len(scroll_results) - len(pending)
+
+            if pending:
+                with ThreadPoolExecutor(max_workers=workers) as ex:
+                    futures = {
+                        ex.submit(process_point, p, qdrant, collection, rotator,
+                                  checkpoint, dry_run, stats): p
+                        for p in pending
+                    }
+                    for future in as_completed(futures):
+                        try:
+                            future.result()
+                        except Exception as e:
+                            log.error(f"Worker error: {e}")
+                        with lock:
+                            done += 1
+                            domain_done += 1
+                            if done % 5000 == 0:
+                                elapsed = time.time() - start
+                                rate = done / elapsed * 60
+                                log.info(f"  {done:,} done | {rate:.0f}/min | "
+                                         f"elapsed {elapsed/60:.1f}min")
+                                checkpoint.flush()
+                        time.sleep(0.02)
+
+            if limit and done >= limit:
+                break
+            if offset is None:
+                break
+
+        log.info(f"  {source_domain}: {domain_done:,} vectors processed")
+
+        if limit and done >= limit:
+            break
+
+    checkpoint.flush()
+    return done, skipped_checkpoint, stats, start
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true",
+                        help="Classify 20 samples without writing")
+    parser.add_argument("--workers", type=int, default=16)
+    parser.add_argument("--limit", type=int, default=None)
+    args = parser.parse_args()
+
+    keys = load_gemini_keys()
+    rotator = KeyRotator(keys)
+
+    qdrant = QdrantClient(host="localhost", port=6333, timeout=120)
+    collection = "recon_knowledge"
+    checkpoint = Checkpoint(CHECKPOINT_FILE)
+
+    # Count source domains
+    counts = count_source_domains(qdrant, collection)
+    total_source = sum(counts.values())
+    pre_checkpoint = checkpoint.count()
+
+    log.info("Source domain counts:")
+    for domain, count in sorted(counts.items(), key=lambda x: -x[1]):
+        log.info(f"  {domain:30s} {count:>10,}")
+    log.info(f"  {'TOTAL':30s} {total_source:>10,}")
+    log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
+    log.info(f"Workers: {args.workers} | Keys: {len(keys)}")
+
+    # Cost estimate. Migrated points no longer match SOURCE_DOMAINS, so the live count is already net.
+    remaining = total_source
+    input_tokens = remaining * 200
+    output_tokens = remaining * 5
+    input_cost = input_tokens / 1_000_000 * 0.10
+    output_cost = output_tokens / 1_000_000 * 0.40
+    total_cost = input_cost + output_cost
+    log.info("\nEstimated Gemini 2.0 Flash cost:")
+    log.info(f"  Vectors to process: {remaining:,}")
+    log.info(f"  Input:  ~{input_tokens/1_000_000:.1f}M tokens = ${input_cost:.2f}")
+    log.info(f"  Output: ~{output_tokens/1_000_000:.1f}M tokens = ${output_cost:.2f}")
+    log.info(f"  TOTAL:  ~${total_cost:.2f}")
+
+    if args.dry_run:
+        log.info("\nDRY RUN: classifying 20 samples...\n")
+        for source_domain in sorted(SOURCE_DOMAINS):
+            scroll_results, _ = qdrant.scroll(
+                collection_name=collection,
+                limit=5,
+                with_payload=True,
+                with_vectors=False,
+                scroll_filter=Filter(
+                    must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
+                ),
+            )
+            for p in scroll_results[:4]:
+                pay = p.payload
+                title = pay.get("title", "(no title)")
+                content = pay.get("content", pay.get("summary", ""))
+                summary = pay.get("summary", "")
+                subdomains = pay.get("subdomain", [])
+                if isinstance(subdomains, str):
+                    subdomains = [subdomains]
+
+                key = rotator.next()
+                new_domain = classify_domain(content, summary, subdomains, key)
+
+                old = pay.get("domain", [])
+                if isinstance(old, list):
+                    old = old[0] if old else "?"
+                print(f"  [{old:25s}] -> [{new_domain:25s}]  {title[:60]}")
+
+        print(f"\nDRY RUN complete. ~{remaining:,} vectors would be migrated.")
+        print(f"Estimated cost: ~${total_cost:.2f}")
+        return
+
+    # ── Full migration ──────────────────────────────────────────────────
+    log.info("\nStarting full migration...")
+
+    done, skipped_ckpt, stats, start = stream_and_process(
+        qdrant, collection, rotator, checkpoint, args.workers, args.limit
+    )
+
+    elapsed = time.time() - start
+    log.info(f"\n{'='*70}")
+    log.info(f"MIGRATION COMPLETE in {elapsed/60:.1f}min:")
+    log.info(f"  Processed:            {done:,}")
+    log.info(f"  Skipped (checkpoint): {skipped_ckpt:,}")
+    log.info(f"  Rate:                 {done/elapsed*60:.0f}/min")
+    log.info(f"\nMapping distribution:")
+    for mapping, count in sorted(stats.items(), key=lambda x: -x[1])[:30]:
+        log.info(f"  {mapping:<55s} {count:>8,}")
+
+
+if __name__ == "__main__":
+    main()
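
The keyword table in `heuristic_fallback` above is matched by plain substring, so a short keyword such as `sar` or `well` also fires inside `necessary` or `dwelling`. One possible alternative, sketched under assumptions, is word-bounded matching with `re`. `MAPPING`, `compile_rules`, and `match_domain` are hypothetical names and the two-row table is illustrative; this is not the script's implementation.

```python
import re

# Hypothetical two-row table; the script's mapping has many more rows.
MAPPING = [
    (["well", "rainwater"], "Water Systems"),
    (["sar", "patrol"], "Operations"),
]


def compile_rules(mapping):
    """Build one case-insensitive, word-bounded regex per domain."""
    rules = []
    for keywords, domain in mapping:
        alternation = "|".join(re.escape(k) for k in keywords)
        rules.append((re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE), domain))
    return rules


RULES = compile_rules(MAPPING)


def match_domain(text, default="Foundational Skills"):
    """Return the first domain whose keywords appear as whole words."""
    for pattern, domain in RULES:
        if pattern.search(text):
            return domain
    return default
```

The trade-off is that whole-word matching no longer catches inflected forms such as `wells`, so plural or stemmed variants would need to be listed explicitly.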
--- a/scripts/migrate_skill_level.py
+++ b/scripts/migrate_skill_level.py
@ -0,0 +1,469 @@
+#!/usr/bin/env python3
+"""
+migrate_skill_level.py — Replaces skill_level with knowledge_type + complexity
+on all vectors in Qdrant and on-disk concept JSONs.
+
+Scrolls entire collection, classifies each concept via Gemini Flash,
+writes knowledge_type + complexity, deletes skill_level.
+
+Crash-safe: completed point IDs tracked in checkpoint file.
+
+Usage:
+  python3 /opt/recon/scripts/migrate_skill_level.py [--dry-run] [--workers 16] [--limit N]
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+from qdrant_client import QdrantClient
+from qdrant_client.models import FieldCondition, MatchValue, Filter
+
+import sys
+sys.path.insert(0, '/opt/recon')
+from lib.utils import get_config, setup_logging
+
+# Suppress noisy HTTP request logging from qdrant_client/httpx
+logging.getLogger("httpx").setLevel(logging.WARNING)
+logging.getLogger("qdrant_client").setLevel(logging.WARNING)
+
+LOG_FILE = Path("/opt/recon/logs/migrate_skill_level.log")
+CHECKPOINT_FILE = Path("/opt/recon/data/migrate_skill_level_checkpoint.json")
+CONCEPTS_DIR = Path("/opt/recon/data/concepts")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("migrate_skill_level")
+
+# ── Prompt ──────────────────────────────────────────────────────────────────
+
+CLASSIFY_PROMPT = """\
+You are a knowledge classification engine. Given a concept, assign two fields:
+
+knowledge_type — what KIND of knowledge this is:
+  foundational — concepts, definitions, theory, background knowledge, explanations of how things work
+  procedural — step-by-step techniques, instructions, how-to skills, methods you execute
+  operational — application under real conditions, decision-making, mission execution, judgment calls in context
+
+complexity — how much prior knowledge is needed:
+  basic — requires little or no prior knowledge, introductory material, simple concepts
+  intermediate — requires some domain familiarity, assumes foundational knowledge is in place
+  advanced — requires significant experience or expertise, high-stakes or highly technical material
+
+EXAMPLES:
+- "Needle chest decompression procedure" → procedural, advanced
+- "What is soil texture and why does it matter" → foundational, basic
+- "Coordinating a fire team withdrawal under contact" → operational, advanced
+- "How to start a campfire with a ferro rod" → procedural, basic
+- "Antenna gain and radiation patterns explained" → foundational, intermediate
+- "Triage decision-making in a mass casualty event" → operational, advanced
+- "Step-by-step: building a Dakota fire hole" → procedural, intermediate
+- "Understanding the water cycle" → foundational, basic
+
+Concept title: {title}
+Concept domain: {domain}
+Concept subdomain: {subdomain}
+Concept content: {content}
+
+Return ONLY valid JSON, no markdown, no explanation:
+{{"knowledge_type": "foundational|procedural|operational", "complexity": "basic|intermediate|advanced"}}
+"""
+
+VALID_KNOWLEDGE_TYPES = {"foundational", "procedural", "operational"}
+VALID_COMPLEXITIES = {"basic", "intermediate", "advanced"}
+
+# ── Key management ──────────────────────────────────────────────────────────
+
+def load_gemini_keys():
+    keys = []
+    for line in Path("/opt/recon/.env").read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    if not keys:
+        raise ValueError("No GEMINI_KEY_* found in .env")
+    return keys
+
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+# ── Classification ──────────────────────────────────────────────────────────
+
+def classify(title, domains, subdomains, content, key):
+    """Call Gemini Flash to classify knowledge_type + complexity."""
+    prompt = CLASSIFY_PROMPT.format(
+        title=title or "(untitled)",
+        domain=", ".join(domains[:5]) if domains else "(none)",
+        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
+        content=str(content)[:400] if content else "(none)",
+    )
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "application/json"}
+    )
+    for retry in range(4):
+        try:
+            resp = model.generate_content(prompt)
+            data = json.loads(resp.text)
+            kt = data.get("knowledge_type", "").lower().strip()
+            cx = data.get("complexity", "").lower().strip()
+            if kt in VALID_KNOWLEDGE_TYPES and cx in VALID_COMPLEXITIES:
+                return kt, cx
+            # Invalid values: retry once, then fall back to the heuristic
+            if retry >= 1:
+                break
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
+                time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
+            else:
+                break
+
+    # Fallback heuristic based on keywords in title, subdomains, and content
+    return heuristic_fallback(title, subdomains, content)
+
+
+def heuristic_fallback(title, subdomains, content):
+    """Last-resort heuristic when Gemini fails."""
+    text = f"{title or ''} {' '.join(subdomains or [])} {str(content or '')[:200]}".lower()
+
+    # Knowledge type heuristic
+    procedural_signals = ["how to", "step-by-step", "procedure", "instructions",
+                          "method", "technique", "build", "make", "construct",
+                          "install", "assemble", "recipe", "prepare"]
+    operational_signals = ["decision", "coordinate", "execute", "deploy",
+                           "mission", "triage", "under fire", "in the field",
+                           "real-world", "scenario", "assessment", "plan"]
+
+    if any(s in text for s in operational_signals):
+        kt = "operational"
+    elif any(s in text for s in procedural_signals):
+        kt = "procedural"
+    else:
+        kt = "foundational"
+
+    # Complexity heuristic — default intermediate (safest middle ground)
+    cx = "intermediate"
+    basic_signals = ["introduction", "what is", "basic", "beginner", "overview",
+                     "definition", "simple", "fundamentals"]
+    advanced_signals = ["advanced", "expert", "complex", "critical", "high-stakes",
+                        "surgery", "trauma", "tactical", "classified"]
+    if any(s in text for s in basic_signals):
+        cx = "basic"
+    elif any(s in text for s in advanced_signals):
+        cx = "advanced"
+
+    return kt, cx
+
+# ── Checkpoint management ───────────────────────────────────────────────────
+
+class Checkpoint:
+    """Thread-safe checkpoint tracker for crash recovery."""
+    def __init__(self, path):
+        self.path = path
+        self._lock = threading.Lock()
+        self._completed = set()
+        self._dirty = 0
+        self._load()
+
+    def _load(self):
+        if self.path.exists():
+            try:
+                data = json.loads(self.path.read_text())
+                self._completed = set(data.get("completed", []))
+                log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
+            except Exception as e:
+                log.warning(f"Ignoring unreadable checkpoint {self.path}: {e}")
+
+    def is_done(self, point_id):
+        return point_id in self._completed
+
+    def mark_done(self, point_id):
+        with self._lock:
+            self._completed.add(point_id)
+            self._dirty += 1
+            if self._dirty >= 1000:
+                self._flush()
+
+    def _flush(self):
+        tmp = self.path.with_suffix('.tmp')
+        tmp.write_text(json.dumps({"completed": list(self._completed)}))
+        tmp.rename(self.path)
+        self._dirty = 0
+
+    def flush(self):
+        with self._lock:
+            self._flush()
+
+    def count(self):
+        return len(self._completed)
+
+# ── Concept JSON update ────────────────────────────────────────────────────
+
+def update_concept_json(doc_hash, title, knowledge_type, complexity):
+    """Update on-disk concept JSON: add knowledge_type + complexity, remove skill_level."""
+    doc_dir = CONCEPTS_DIR / doc_hash
+    if not doc_dir.exists():
+        return False
+    for wf in doc_dir.glob("window_*.json"):
+        try:
+            with open(wf, "r", encoding="utf-8") as f:
+                concepts = json.load(f)
+            changed = False
+            for c in concepts:
+                if not isinstance(c, dict):
+                    continue
+                if c.get("title") == title:
+                    c["knowledge_type"] = knowledge_type
+                    c["complexity"] = complexity
+                    c.pop("skill_level", None)
+                    changed = True
+            if changed:
+                with open(wf, "w", encoding="utf-8") as f:
+                    json.dump(concepts, f, indent=2, ensure_ascii=False)
+                return True
+        except Exception as e:
+            log.warning(f"Failed to update concept JSON {wf}: {e}")
+    return False
+
+# ── Per-point processing ───────────────────────────────────────────────────
+
+def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run):
+    point_id = point.id
+    if checkpoint.is_done(point_id):
+        return "skipped"
+
+    payload = point.payload
+    title = payload.get("title", "")
+    domains = payload.get("domain", [])
+    if isinstance(domains, str):
+        domains = [domains]
+    subdomains = payload.get("subdomain", [])
+    if isinstance(subdomains, str):
+        subdomains = [subdomains]
+    content = payload.get("content", payload.get("summary", ""))
+    doc_hash = payload.get("doc_hash", "")
+
+    key = key_rotator.next()
+    knowledge_type, complexity = classify(title, domains, subdomains, content, key)
+
+    if dry_run:
+        return f"kt={knowledge_type}, cx={complexity}"
+
+    # Write new fields
+    qdrant.set_payload(
+        collection_name=collection,
+        payload={"knowledge_type": knowledge_type, "complexity": complexity},
+        points=[point_id],
+    )
+
+    # Delete old field
+    qdrant.delete_payload(
+        collection_name=collection,
+        keys=["skill_level"],
+        points=[point_id],
+    )
+
+    # Update JSON on disk
+    if doc_hash:
+        update_concept_json(doc_hash, title, knowledge_type, complexity)
+
+    checkpoint.mark_done(point_id)
+    return "ok"
+
+# ── Streaming batch processor ───────────────────────────────────────────────
+
+SCROLL_BATCH = 5000  # vectors per scroll batch — keeps memory bounded (~50MB)
+
+
+def count_collection(qdrant, collection):
+    """Quick count of total vectors via collection info."""
+    info = qdrant.get_collection(collection)
+    return info.points_count
+
+
+def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None):
+    """Scroll in batches, process each batch with thread pool, then discard.
+
+    Memory-bounded: only holds SCROLL_BATCH payloads at any time (~50MB).
+    """
+    results_agg = defaultdict(int)
+    lock = threading.Lock()
+    done = 0
+    skipped_checkpoint = 0
+    skipped_no_skill = 0
+    total_estimate = count_collection(qdrant, collection)
+    start = time.time()
+
+    offset = None
+    batch_num = 0
+
+    while True:
+        batch_num += 1
+        scroll_results, offset = qdrant.scroll(
+            collection_name=collection,
+            limit=SCROLL_BATCH,
+            with_payload=True,
+            with_vectors=False,
+            offset=offset,
+        )
+
+        # Filter to points needing migration
+        pending = []
+        for p in scroll_results:
+            if "skill_level" not in p.payload:
+                skipped_no_skill += 1
+                continue
+            if checkpoint.is_done(p.id):
+                skipped_checkpoint += 1
+                continue
+            pending.append(p)
+
+        if pending:
+            with ThreadPoolExecutor(max_workers=workers) as ex:
+                futures = {
+                    ex.submit(process_point, p, qdrant, collection, rotator, checkpoint, False): p
+                    for p in (pending[:max(0, limit - done)] if limit else pending)  # honor --limit within a batch
+                }
+                for future in as_completed(futures):
+                    try:
+                        status = future.result()
+                    except Exception as e:
+                        status = f"error: {str(e)[:80]}"
+                        log.error(f"Worker error: {e}")
+                    with lock:
+                        results_agg[status] += 1
+                        done += 1
+                        if done % 5000 == 0:
+                            elapsed = time.time() - start
+                            rate = done / elapsed * 60
+                            remaining = total_estimate - done - skipped_checkpoint - skipped_no_skill
+                            eta = remaining / (done / elapsed) / 60 if done > 0 else 0
+                            log.info(f"  {done:,} done | {rate:.0f}/min | "
+                                     f"ETA ~{eta:.0f}min | {dict(results_agg)}")
+                            checkpoint.flush()
+                    time.sleep(0.02)
+
+        if limit and done >= limit:
+            break
+        if offset is None:
+            break
+
+    checkpoint.flush()
+    return done, skipped_checkpoint, skipped_no_skill, results_agg, start
+
+
+# ── Main ────────────────────────────────────────────────────────────────────
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true",
+                        help="Classify 20 samples without writing anything")
+    parser.add_argument("--workers", type=int, default=16)
+    parser.add_argument("--limit", type=int, default=None)
+    args = parser.parse_args()
+
+    config = get_config()
+    keys = load_gemini_keys()
+    rotator = KeyRotator(keys)
+
+    qdrant = QdrantClient(
+        host=config['vector_db']['host'],
+        port=config['vector_db']['port'],
+        timeout=120
+    )
+    collection = config['vector_db']['collection']
+    checkpoint = Checkpoint(CHECKPOINT_FILE)
+
+    total_vectors = count_collection(qdrant, collection)
+    pre_checkpoint = checkpoint.count()
+
+    log.info(f"Collection has {total_vectors:,} vectors")
+    log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
+    log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
+    log.info(f"Estimated Gemini Flash cost: ~${(total_vectors - pre_checkpoint) * 0.0004:.2f}")
+    log.info(f"Streaming in batches of {SCROLL_BATCH:,} (memory-bounded)")
+
+    if args.dry_run:
+        # Scroll one batch, classify 20 diverse samples
+        log.info("\nDRY RUN: classifying up to 20 samples...\n")
+        scroll_results, _ = qdrant.scroll(
+            collection_name=collection,
+            limit=200,
+            with_payload=True,
+            with_vectors=False,
+        )
+        samples = []
+        seen_domains = set()
+        for p in scroll_results:
+            if "skill_level" not in p.payload:
+                continue
+            domains = p.payload.get("domain", [])
+            if isinstance(domains, str):
+                domains = [domains]
+            d_key = tuple(sorted(domains[:2]))
+            if d_key not in seen_domains:
+                samples.append(p)
+                seen_domains.add(d_key)
+            if len(samples) >= 20:
+                break
+
+        for i, p in enumerate(samples, 1):
+            pay = p.payload
+            title = pay.get("title", "(no title)")
+            domains = pay.get("domain", [])
+            old_skill = pay.get("skill_level", "?")
+            subdomains = pay.get("subdomain", [])
+            if isinstance(subdomains, str):
+                subdomains = [subdomains]
+            content = pay.get("content", pay.get("summary", ""))
+
+            key = rotator.next()
+            kt, cx = classify(title, domains, subdomains, content, key)
+
+            print(f"\n--- Sample {i}/{len(samples)} ---")
+            print(f"  Title:          {title}")
+            print(f"  Domain:         {domains}")
+            print(f"  Old skill:      {old_skill}")
+            print(f"  → knowledge_type: {kt}")
+            print(f"  → complexity:     {cx}")
+        est = total_vectors - pre_checkpoint
+        print(f"\nDRY RUN complete. ~{est:,} vectors would be migrated.")
+        print(f"Estimated Gemini Flash cost: ~${est * 0.0004:.2f}")
+        return
+
+    # ── Full migration run (streaming) ──────────────────────────────────────
+    done, skipped_ckpt, skipped_no_skill, results, start = stream_and_process(
+        qdrant, collection, rotator, checkpoint, args.workers, args.limit
+    )
+
+    elapsed = time.time() - start
+    log.info(f"\nComplete in {elapsed/60:.1f}min:")
+    log.info(f"  Processed:           {done:,}")
+    log.info(f"  Skipped (checkpoint): {skipped_ckpt:,}")
+    log.info(f"  Skipped (no skill):   {skipped_no_skill:,}")
+    for status, count in sorted(results.items(), key=lambda x: -x[1]):
+        log.info(f"  {status:<30} {count:>10,}")
+
+
+if __name__ == "__main__":
+    main()
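Note on `Checkpoint._flush` in the script above: it writes the checkpoint to a `.tmp` file and then renames it over the real file, so a crash in the middle of a write should leave the previous checkpoint readable. A minimal standalone sketch of that pattern follows. It is not the script's code: `atomic_write_json` is a hypothetical helper, and it assumes `os.replace` plus an `fsync` as stand-ins for the `Path.rename` call used above.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, obj):
    """Write obj as JSON to path via a temp file and os.replace (hypothetical helper)."""
    path = Path(path)
    # Keep the temp file next to the target so the final replace stays on one filesystem.
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before swapping the files
    os.replace(tmp, path)  # overwrites an existing target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "checkpoint.json"
        atomic_write_json(target, {"completed": [1, 2, 3]})
        atomic_write_json(target, {"completed": [1, 2, 3, 4]})
        print(json.loads(target.read_text())["completed"])
```

Readers of the checkpoint see either the old file or the new one, never a half-written mix, which is the property crash recovery depends on.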
--- a/scripts/rebuild_qdrant.py
+++ b/scripts/rebuild_qdrant.py
@@ -0,0 +1,227 @@
+"""
+RECON Qdrant Rebuilder — patched for headless parallel execution
+
+Deletes and recreates the Qdrant collection, then re-embeds ALL concept JSONs
+from disk using parallel workers. Pass --confirm to skip interactive prompt.
+
+Usage:
+  python3 scripts/rebuild_qdrant.py --confirm [--workers 8]
+"""
+
+import json
+import os
+import sys
+import time
+import argparse
+import threading
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import requests as http_requests
+from qdrant_client import QdrantClient
+from qdrant_client.models import VectorParams, Distance, PointStruct
+
+from lib.utils import get_config, concept_id, setup_logging
+from lib.status import StatusDB
+
+logger = setup_logging('recon.rebuild')
+
+
+def embed_content(config, content):
+    try:
+        tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
+        resp = http_requests.post(tei_url, json={"inputs": content}, timeout=120)
+        resp.raise_for_status()
+        return resp.json()[0]
+    except Exception as tei_err:
+        logger.debug(f"TEI failed, trying Ollama: {tei_err}")
+
+    ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
+    resp = http_requests.post(ollama_url, json={
+        "model": config['embedding']['model'],
+        "input": content
+    }, timeout=120)
+    resp.raise_for_status()
+    return resp.json()['embeddings'][0]
+
+
+def process_doc(doc_hash, config, db, qdrant, collection):
+    """Embed and upsert all concepts for a single document. Returns (inserted, failed)."""
+    doc_dir = os.path.join(config['paths']['concepts'], doc_hash)
+    doc = db.get_document(doc_hash)
+    filename = doc['filename'] if doc else doc_hash[:8]
+
+    window_files = sorted([
+        f for f in os.listdir(doc_dir)
+        if f.startswith('window_') and f.endswith('.json')
+    ])
+
+    all_concepts = []
+    for wf in window_files:
+        path = os.path.join(doc_dir, wf)
+        try:
+            with open(path, encoding='utf-8') as f:
+                concepts = json.load(f)
+            if isinstance(concepts, list):
+                all_concepts.extend(concepts)
+        except Exception as e:
+            logger.warning(f"Skipping corrupted window {wf} in {doc_hash}: {e}")
+
+    if not all_concepts:
+        return 0, 0
+
+    is_web = doc.get('path', '').startswith(('http://', 'https://')) if doc else False
+
+    # Check meta.json for explicit source_type (e.g. 'transcript')
+    source_type = 'web' if is_web else 'document'
+    text_dir = os.path.join(config['paths']['text'], doc_hash)
+    meta_path = os.path.join(text_dir, 'meta.json')
+    if os.path.exists(meta_path):
+        try:
+            with open(meta_path) as mf:
+                meta = json.load(mf)
+            if meta.get('source_type'):
+                source_type = meta['source_type']
+        except Exception:
+            pass
+
+    points = []
+    failed = 0
+    batch_size = config['processing']['embed_batch_size']
+
+    for idx, concept in enumerate(all_concepts):
+        content = concept.get('content', '')
+        if not content or len(content.strip()) < 10:
+            continue
+        try:
+            vector = embed_content(config, content)
+        except Exception as e:
+            logger.warning(f"Embedding failed {doc_hash}:{idx}: {e}")
+            failed += 1
+            continue
+
+        start_page = concept.get('_start_page', 0)
+        point_id = concept_id(doc_hash, start_page, idx)
+
+        payload = {
+            'doc_hash': doc_hash,
+            'filename': filename,
+            'book_title': doc.get('book_title', '') if doc else '',
+            'book_author': doc.get('book_author', '') if doc else '',
+            'source_type': source_type,
+            'verification_status': 'unverified',
+            'credibility_score': 0.7,
+            'language': 'en',
+        }
+        for field in ['content', 'summary', 'title', 'domain', 'subdomain',
+                      'keywords', 'skill_level', 'knowledge_type', 'complexity',
+                      'key_facts', 'scenario_applicable', 'cross_domain_tags',
+                      'chapter', 'page_ref', 'notes', '_window', '_start_page']:
+            if field in concept:
+                payload[field] = concept[field]
+
+        points.append(PointStruct(id=point_id, vector=vector, payload=payload))
+
+        if len(points) >= batch_size:
+            qdrant.upsert(collection_name=collection, points=points)
+            points = []
+
+    if points:
+        qdrant.upsert(collection_name=collection, points=points)
+
+    inserted = sum(1 for c in all_concepts if len((c.get('content') or '').strip()) >= 10) - failed
+    if doc:
+        db.update_status(doc_hash, 'complete', vectors_inserted=inserted)
+
+    return inserted, failed
+
+
+def run_rebuild(workers=8):
+    config = get_config()
+    db = StatusDB()
+
+    qdrant = QdrantClient(
+        host=config['vector_db']['host'],
+        port=config['vector_db']['port'],
+        timeout=60
+    )
+    collection = config['vector_db']['collection']
+
+    # Delete and recreate
+    try:
+        qdrant.delete_collection(collection)
+        logger.info(f"Deleted collection: {collection}")
+    except Exception as e:
+        logger.info(f"No existing collection deleted ({collection}): {e}")
+
+    qdrant.create_collection(
+        collection_name=collection,
+        vectors_config=VectorParams(
+            size=config['embedding']['dimensions'],
+            distance=Distance.COSINE
+        )
+    )
+    logger.info(f"Created collection: {collection} ({config['embedding']['dimensions']}d, Cosine)")
+
+    concepts_root = config['paths']['concepts']
+    doc_dirs = sorted([
+        d for d in os.listdir(concepts_root)
+        if os.path.isdir(os.path.join(concepts_root, d))
+    ])
+    logger.info(f"Found {len(doc_dirs)} document concept directories | {workers} workers")
+
+    total_inserted = 0
+    total_failed = 0
+    done = 0
+    lock = threading.Lock()
+    start = time.time()
+
+    with ThreadPoolExecutor(max_workers=workers) as ex:
+        futures = {
+            ex.submit(process_doc, h, config, StatusDB(), qdrant, collection): h
+            for h in doc_dirs
+        }
+        for future in as_completed(futures):
+            doc_hash = futures[future]
+            try:
+                inserted, failed = future.result()
+            except Exception as e:
+                logger.error(f"Worker error {doc_hash}: {e}")
+                inserted, failed = 0, 0
+
+            with lock:
+                total_inserted += inserted
+                total_failed += failed
+                done += 1
+                if done % 500 == 0:
+                    elapsed = time.time() - start
+                    rate = total_inserted / elapsed if elapsed > 0 else 0
+                    remaining = (len(doc_dirs) - done) / (done / elapsed) if elapsed > 0 else 0
+                    logger.info(
+                        f"  [{done}/{len(doc_dirs)}] "
+                        f"{total_inserted:,} vectors | "
+                        f"{rate:.0f}/sec | "
+                        f"ETA {remaining/60:.0f}min"
+                    )
+
+    elapsed = time.time() - start
+    logger.info(f"\nRebuild complete in {elapsed/60:.1f} min: "
+                f"{total_inserted:,} inserted, {total_failed:,} failed")
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--confirm', action='store_true', help='Skip interactive prompt')
+    parser.add_argument('--workers', type=int, default=8)
+    args = parser.parse_args()
+
+    if not args.confirm:
+        print("WARNING: This will DELETE and RECREATE the Qdrant collection.")
+        confirm = input("Type 'REBUILD' to proceed: ")
+        if confirm != 'REBUILD':
+            print("Aborted.")
+            sys.exit(0)
+
+    run_rebuild(workers=args.workers)
--- a/scripts/reenrich_reference.py
+++ b/scripts/reenrich_reference.py
@@ -0,0 +1,314 @@
+#!/usr/bin/env python3
+"""
+reenrich_reference.py — Re-classifies all remaining Reference-tagged concepts.
+
+Scrolls Qdrant for vectors with domain == ["Reference"] or containing "Reference",
+calls Gemini with a hardened prompt that rejects Reference as a valid response,
+updates both Qdrant payload and concept JSON on disk.
+
+Usage:
+  python3 /opt/recon/scripts/reenrich_reference.py [--dry-run] [--workers 16] [--limit N]
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+from qdrant_client import QdrantClient
+from qdrant_client.models import FieldCondition, MatchAny, Filter
+
+import sys
+sys.path.insert(0, '/opt/recon')
+from lib.utils import get_config
+
+LOG_FILE = Path("/opt/recon/logs/reenrich_reference.log")
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
+)
+log = logging.getLogger("reenrich_reference")
+
+CONCEPTS_DIR = Path("/opt/recon/data/concepts")
+
+CANONICAL_DOMAINS = {
+    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
+    "Foundational Skills", "Communications", "Medical", "Food Systems",
+    "Navigation", "Logistics", "Power Systems", "Leadership",
+    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
+}
+
+# Hardened prompt — Reference explicitly forbidden, classification rules detailed
+CLASSIFY_PROMPT = """\
+You are a knowledge classification engine. Classify this concept into its correct domain.
+
+VALID DOMAINS — use ONLY these exact strings:
+  Defense & Tactics
+  Sustainment Systems
+  Off-Grid Systems
+  Foundational Skills
+  Communications
+  Medical
+  Food Systems
+  Navigation
+  Logistics
+  Power Systems
+  Leadership
+  Scenario Playbooks
+  Water Systems
+  Security
+  Community Coordination
+
+FORBIDDEN: Do NOT output "Reference" under any circumstances. It is not a valid domain.
+FORBIDDEN: Do NOT output an empty domain list.
+
+CLASSIFICATION RULES:
+- First aid, anatomy, pharmacology, herbs, veterinary, austere medicine, wound care → Medical
+- Food growing, foraging, hunting, fishing, animal husbandry, livestock → Sustainment Systems
+- Food preservation, canning, fermentation, food storage, dehydrating → Food Systems
+- Solar, wind, hydro, batteries, generators, inverters, charge controllers → Power Systems
+- Water sourcing, filtration, purification, sanitation, wells, rainwater → Water Systems
+- Radio, antennas, mesh networking, SIGINT, amateur radio → Communications
+- Weapons, tactics, NBC, security operations, field craft → Defense & Tactics
+- Permaculture, soil science, agroforestry, composting → Sustainment Systems
+- Shelter, construction, masonry, blacksmithing, woodworking, crafts → Foundational Skills
+- Navigation, land nav, celestial nav, map reading, compass → Navigation
+- Emergency planning, disaster prep, scenario planning → Scenario Playbooks
+- Leadership, governance, community organization → Leadership
+- Supply chain, transportation, inventory → Logistics
+- Physical security, perimeter, surveillance → Security
+- Community building, cooperation, mutual aid → Community Coordination
+- Biogas, wood gasification, rocket stoves, appropriate technology → Off-Grid Systems
+
+If uncertain between two domains, pick the most actionable one for a self-reliant household.
+
+Concept title: {title}
+Concept subdomain tags: {subdomain}
+Concept content: {content}
+
+Return ONLY valid JSON, no markdown, no explanation:
+{{"domain": ["Domain Name"]}}
+"""
+
+def load_gemini_keys():
+    keys = []
+    for line in Path("/opt/recon/.env").read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    return keys
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+def classify(title, subdomains, content, key, attempt=0):
+    """Call Gemini. Rejects Reference. Falls back to subdomain heuristic if needed."""
+    prompt = CLASSIFY_PROMPT.format(
+        title=title or "(untitled)",
+        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
+        content=str(content)[:400] if content else "(none)",
+    )
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "application/json"}
+    )
+    for retry in range(4):
+        try:
+            resp = model.generate_content(prompt)
+            data = json.loads(resp.text)
+            domains = [
+                d for d in data.get("domain", [])
+                if d in CANONICAL_DOMAINS  # strips Reference automatically
+            ]
+            if domains:
+                return domains
+            # Gemini returned Reference or empty: retry once with the same prompt, then fall back
+            if retry >= 1:
+                break
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
+                time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
+            else:
+                break
+
+    # Last resort: subdomain keyword heuristic
+    return subdomain_fallback(subdomains)
+
+SUBDOMAIN_FALLBACK_MAP = [
+    (["first aid", "trauma", "wound", "anatomy", "pharmacol", "herbal", "medicin", "veterinar", "dental", "surgery"], "Medical"),
+    (["foraging", "hunting", "fishing", "livestock", "permaculture", "soil", "agroforestry", "mycolog", "mushroom"], "Sustainment Systems"),
+    (["canning", "preservation", "fermentation", "food storage", "dehydrat"], "Food Systems"),
+    (["solar", "battery", "generator", "inverter", "wind turbine", "photovoltaic"], "Power Systems"),
+    (["water purif", "filtration", "sanitation", "well", "rainwater"], "Water Systems"),
+    (["radio", "antenna", "mesh", "sigint", "amateur radio", "meshtastic"], "Communications"),
+    (["weapon", "firearm", "tactic", "nbc", "chemical warfare", "ballistic"], "Defense & Tactics"),
+    (["navigation", "compass", "land nav", "celestial"], "Navigation"),
+    (["blacksmith", "woodwork", "masonry", "construct", "craft", "pottery"], "Foundational Skills"),
+    (["biogas", "gasif", "rocket stove", "appropriate tech"], "Off-Grid Systems"),
+    (["disaster", "emergency prep", "evacuation", "scenario"], "Scenario Playbooks"),
+    (["leadership", "governance", "community"], "Leadership"),
+    (["logistics", "supply chain", "transport"], "Logistics"),
+    (["security", "perimeter", "surveillance"], "Security"),
+]
+
+def subdomain_fallback(subdomains):
+    combined = " ".join(s.lower() for s in subdomains)
+    for keywords, domain in SUBDOMAIN_FALLBACK_MAP:
+        if any(kw in combined for kw in keywords):
+            return [domain]
+    return ["Foundational Skills"]  # absolute last resort
+
+def update_concept_json(doc_hash, title, new_domains):
+    """Update domain in concept JSON files on disk."""
+    doc_dir = CONCEPTS_DIR / doc_hash
+    if not doc_dir.exists():
+        return False
+    for wf in doc_dir.glob("window_*.json"):
+        try:
+            with open(wf, "r", encoding="utf-8") as f:
+                concepts = json.load(f)
+            changed = False
+            for c in concepts:
+                if not isinstance(c, dict):
+                    continue
+                if c.get("title") == title:
+                    raw = c.get("domain", [])
+                    if isinstance(raw, str):
+                        raw = [raw]
+                    if "Reference" in raw or not [d for d in raw if d in CANONICAL_DOMAINS]:
+                        c["domain"] = new_domains
+                        changed = True
+            if changed:
+                with open(wf, "w", encoding="utf-8") as f:
+                    json.dump(concepts, f, indent=2, ensure_ascii=False)
+                return True
+        except Exception:
+            pass
+    return False
+
+def process_point(point, qdrant, collection, key_rotator, dry_run):
+    payload = point.payload
+    title = payload.get("title", "")
+    subdomains = payload.get("subdomain", [])
+    if isinstance(subdomains, str):
+        subdomains = [subdomains]
+    content = payload.get("content", payload.get("summary", ""))
+    doc_hash = payload.get("doc_hash", "")
+
+    key = key_rotator.next()
+    new_domains = classify(title, subdomains, content, key)
+
+    if dry_run:
+        return "would_classify"
+
+    # Update Qdrant payload
+    qdrant.set_payload(
+        collection_name=collection,
+        payload={"domain": new_domains},
+        points=[point.id],
+    )
+
+    # Update JSON on disk
+    if doc_hash:
+        update_concept_json(doc_hash, title, new_domains)
+
+    return "ok"
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true")
+    parser.add_argument("--workers", type=int, default=16)
+    parser.add_argument("--limit", type=int, default=None)
+    args = parser.parse_args()
+
+    config = get_config()
+    keys = load_gemini_keys()
+    rotator = KeyRotator(keys)
+
+    qdrant = QdrantClient(
+        host=config['vector_db']['host'],
+        port=config['vector_db']['port'],
+        timeout=60
+    )
+    collection = config['vector_db']['collection']
+
+    log.info("Scrolling Qdrant for Reference-tagged concepts...")
+
+    # Scroll all points containing Reference in domain
+    offset = None
+    reference_points = []
+    while True:
+        results, offset = qdrant.scroll(
+            collection_name=collection,
+            scroll_filter=Filter(
+                must=[FieldCondition(
+                    key="domain",
+                    match=MatchAny(any=["Reference"])
+                )]
+            ),
+            limit=1000,
+            with_payload=True,
+            with_vectors=False,
+            offset=offset,
+        )
+        reference_points.extend(results)
+        if args.limit and len(reference_points) >= args.limit:
+            reference_points = reference_points[:args.limit]
+            break
+        if offset is None:
+            break
+
+    total = len(reference_points)
+    log.info(f"Found {total:,} Reference-tagged vectors")
+    log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
+    log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f}")
+
+    if args.dry_run:
+        log.info(f"DRY RUN: would re-classify {total:,} concepts. Exiting.")
+        return
+
+    results = defaultdict(int)
+    lock = threading.Lock()
+    done = 0
+    start = time.time()
+
+    with ThreadPoolExecutor(max_workers=args.workers) as ex:
+        futures = {
+            ex.submit(process_point, p, qdrant, collection, rotator, False): p
+            for p in reference_points
+        }
+        for future in as_completed(futures):
+            status = f"error: {str(future.exception())[:80]}" if future.exception() else future.result()
+            with lock:
+                results[status] += 1
+                done += 1
+                if done % 5000 == 0:
+                    elapsed = time.time() - start
+                    rate = done / elapsed * 60
+                    eta = (total - done) / (done / elapsed) / 60
+                    log.info(f"  {done:,}/{total:,} | {rate:.0f}/min | ETA {eta:.0f}min | {dict(results)}")
+            time.sleep(0.02)
+
+    elapsed = time.time() - start
+    log.info(f"\nComplete in {elapsed/60:.1f}min:")
+    for status, count in sorted(results.items(), key=lambda x: -x[1]):
+        log.info(f"  {status:<20} {count:>10,}")
+
+if __name__ == "__main__":
+    main()
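Note on the scroll loop in `main()` above: each `qdrant.scroll` call returns one page of points plus the offset of the next page, and a `None` offset means the data is exhausted. The sketch below shows that pagination shape against an in-memory list. `collect_with_limit` and `make_fake_scroll` are hypothetical names, and the fake pager only imitates the `(page, next_offset)` return shape; it is not the Qdrant client.

```python
def collect_with_limit(fetch_page, limit=None):
    """Collect items from a paginated source, truncating to `limit` when one is given."""
    items = []
    offset = None
    while True:
        page, offset = fetch_page(offset)
        items.extend(page)
        # Test the limit before the end-of-data check so an overshooting last page is trimmed too.
        if limit is not None and len(items) >= limit:
            return items[:limit]
        if offset is None:
            return items


def make_fake_scroll(data, page_size):
    """Build a stand-in pager over a list; returns (page, next_offset) like a scroll call."""
    def fetch_page(offset):
        start = offset or 0
        end = start + page_size
        next_offset = end if end < len(data) else None
        return data[start:end], next_offset
    return fetch_page


if __name__ == "__main__":
    scroll = make_fake_scroll(list(range(10)), page_size=4)
    print(collect_with_limit(scroll, limit=9))
    print(collect_with_limit(scroll))
```

Ordering the two exit checks this way means a limit that falls inside the final page is still respected.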
--- a/scripts/repair_corrupted.py
+++ b/scripts/repair_corrupted.py
@@ -0,0 +1,315 @@
+#!/usr/bin/env python3
+"""
+repair_corrupted.py — Repairs window files corrupted by concurrent writes.
+
+Strategy:
+  1. Read corrupted_windows.txt to get the list of bad files
+  2. For each bad file, identify the parent doc hash from the path
+  3. Check if the text directory still exists for that doc
+  4. If yes: re-run Gemini enrichment on just that window
+  5. If no text: mark as unrecoverable
+  6. Report summary
+
+Usage:
+  python3 /opt/recon/scripts/repair_corrupted.py [--dry-run] [--workers 8]
+"""
+
+import json
+import time
+import random
+import logging
+import argparse
+import re
+import threading
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from collections import defaultdict
+
+import google.generativeai as genai
+
+CORRUPTED_LIST = Path("/opt/recon/data/corrupted_windows.txt")
+TEXT_DIR = Path("/opt/recon/data/text")
+CONCEPTS_DIR = Path("/opt/recon/data/concepts")
+LOG_FILE = Path("/opt/recon/logs/repair_corrupted.log")
+UNRECOVERABLE_LOG = Path("/opt/recon/data/unrecoverable_windows.txt")
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[
+        logging.FileHandler(LOG_FILE),
+        logging.StreamHandler(),
+    ]
+)
+log = logging.getLogger("repair_corrupted")
+
+CANONICAL_DOMAINS = [
+    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
+    "Foundational Skills", "Communications", "Medical", "Food Systems",
+    "Navigation", "Logistics", "Power Systems", "Leadership",
+    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
+]
+
+ENRICH_PROMPT = """Extract knowledge concepts from this document text.
+
+A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
+
+For each concept, provide ALL fields:
+
+Required:
+- content: Full text of the concept (complete procedure, definition, etc.)
+- summary: 1-2 sentence summary
+- title: Brief descriptive title
+- domain: Array of 1-5 from ONLY these exact strings (no others):
+    Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
+    Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
+    Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination
+  CRITICAL: Do NOT use "Reference". Every concept belongs somewhere specific.
+- subdomain: Array of specific subcategories (up to 10)
+- keywords: Array of 3-30 searchable terms
+- skill_level: novice | intermediate | advanced
+- key_facts: Array of specific extractable claims, measurements, data points
+
+Optional (include when present):
+- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
+- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
+- chapter: Chapter name if identifiable
+- page_ref: Page reference
+
+Return JSON array. If no extractable concepts, return [].
+
+Document text:
+"""
+
+def load_gemini_keys():
+    env = Path("/opt/recon/.env")
+    keys = []
+    for line in env.read_text().splitlines():
+        if line.startswith("GEMINI_KEY_"):
+            keys.append(line.split("=", 1)[1].strip())
+    return keys
+
+class KeyRotator:
+    def __init__(self, keys):
+        self.keys = keys
+        self._i = 0
+        self._lock = threading.Lock()
+    def next(self):
+        with self._lock:
+            key = self.keys[self._i % len(self.keys)]
+            self._i += 1
+            return key
+
+def repair_json_truncated(text):
+    """Last-ditch attempt to salvage a truncated JSON array."""
+    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
+    text = re.sub(r',\s*([}\]])', r'\1', text)
+    try:
+        return json.loads(text)
+    except Exception:
+        pass
+    # Find last complete object
+    last_close = -1
+    depth = 0
+    in_str = False
+    esc = False
+    for i, ch in enumerate(text):
+        if esc:
+            esc = False; continue
+        if ch == '\\' and in_str:
+            esc = True; continue
+        if ch == '"' and not esc:
+            in_str = not in_str; continue
+        if in_str:
+            continue
+        if ch == '{': depth += 1
+        elif ch == '}':
+            depth -= 1
+            if depth == 0:
+                last_close = i
+    if last_close > 0:
+        trimmed = text[:last_close + 1].rstrip().rstrip(',')
+        open_brackets = trimmed.count('[') - trimmed.count(']')
+        try:
+            return json.loads(trimmed + ']' * open_brackets)
+        except Exception:
+            pass
+    return None
+
+def enrich_window_text(text, key):
+    """Call Gemini on raw window text, return concepts list."""
+    genai.configure(api_key=key)
+    model = genai.GenerativeModel(
+        "gemini-2.0-flash",
+        generation_config={"response_mime_type": "application/json"}
+    )
+    for attempt in range(4):
+        try:
+            resp = model.generate_content(ENRICH_PROMPT + text)
+            raw = resp.text
+            try:
+                result = json.loads(raw)
+            except Exception:
+                result = repair_json_truncated(raw)
+            if isinstance(result, list):
+                return [c for c in result if isinstance(c, dict)]
+            elif isinstance(result, dict):
+                return [result]
+            return None  # unparseable response: report failure rather than writing []
+        except Exception as e:
+            err = str(e).lower()
+            if any(s in err for s in ["429", "quota", "rate limit", "503", "unavailable"]):
+                delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
+                time.sleep(delay)
+            else:
+                log.warning(f"  Non-transient error: {e}")
+                break
+    return None  # failed
+
+def get_window_text(doc_hash, window_filename):
+    """Reconstruct window text from page files."""
+    # Window filename: window_NNNN.json -> window index is NNNN
+    try:
+        w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
+    except (IndexError, ValueError):
+        return None
+
+    text_path = TEXT_DIR / doc_hash
+    if not text_path.exists():
+        return None
+
+    page_files = sorted([
+        f for f in text_path.iterdir()
+        if f.name.startswith('page_') and f.name.endswith('.txt')
+    ])
+    if not page_files:
+        return None
+
+    # Re-derive which pages this window covered (window_size=5 from config)
+    window_size = 5
+    start = w_idx * window_size
+    window_pages = page_files[start:start + window_size]
+    if not window_pages:
+        return None
+
+    parts = []
+    for j, pf in enumerate(window_pages):
+        try:
+            text = pf.read_text(encoding='utf-8')
+            parts.append(f"--- Page {start + j + 1} ---\n{text}")
+        except Exception:
+            pass
+    return "\n\n".join(parts) if parts else None
+
+def repair_file(corrupted_path, key_rotator, dry_run):
+    """Attempt to repair a single corrupted window file."""
+    path = Path(corrupted_path)
+
+    # Sanity check -- maybe it fixed itself somehow
+    try:
+        with open(path, encoding='utf-8') as f:
+            json.load(f)
+        return "already_valid"
+    except Exception:
+        pass
+
+    # Extract doc hash and window name from path structure
+    # Expected: /opt/recon/data/concepts/{hash}/window_NNNN.json
+    doc_hash = path.parent.name
+    window_filename = path.name
+
+    # Get source text for this window
+    window_text = get_window_text(doc_hash, window_filename)
+    if not window_text:
+        return "no_source_text"
+
+    if dry_run:
+        return "would_repair"
+
+    # Re-enrich from source text
+    key = key_rotator.next()
+    concepts = enrich_window_text(window_text, key)
+
+    if concepts is None:
+        return "enrichment_failed"
+
+    # Tag concepts with metadata
+    try:
+        w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
+        window_size = 5
+        start_page = w_idx * window_size + 1
+    except Exception:
+        w_idx = 0
+        start_page = 0
+
+    for c in concepts:
+        c['_window'] = w_idx + 1
+        c['_start_page'] = start_page
+        c['_doc_hash'] = doc_hash
+        c['_repaired'] = True
+
+    # Write repaired file
+    try:
+        with open(path, 'w', encoding='utf-8') as f:
+            json.dump(concepts, f, indent=2, ensure_ascii=False)
+        return "repaired"
+    except Exception as e:
+        log.warning(f"  Write failed for {path}: {e}")
+        return "write_error"
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dry-run", action="store_true")
+    parser.add_argument("--workers", type=int, default=8)
+    args = parser.parse_args()
+
+    if not CORRUPTED_LIST.exists():
+        log.error(f"Corrupted list not found: {CORRUPTED_LIST}")
+        log.error("Run Task 1 first to generate it.")
+        return
+
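+    keys = load_gemini_keys()
+    if not keys and not args.dry_run:
+        log.error("No GEMINI_KEY_* entries found in /opt/recon/.env")
+        return
+    rotator = KeyRotator(keys)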
+
+    corrupted = []
+    with open(CORRUPTED_LIST) as f:
+        for line in f:
+            parts = line.strip().split('\t')
+            if parts[0]:  # skip blank lines; split() never returns an empty list
+                corrupted.append(parts[0])
+
+    log.info(f"Repairing {len(corrupted):,} corrupted window files")
+    log.info(f"Dry run: {args.dry_run} | Workers: {args.workers} | Keys: {len(keys)}")
+
+    results = defaultdict(int)
+    unrecoverable = []
+    lock = threading.Lock()
+
+    with ThreadPoolExecutor(max_workers=args.workers) as ex:
+        futures = {ex.submit(repair_file, p, rotator, args.dry_run): p for p in corrupted}
+        done = 0
+        for future in as_completed(futures):
+            path = futures[future]
+            status = future.result()
+            with lock:
+                results[status] += 1
+                if status in ("no_source_text", "enrichment_failed", "write_error"):
+                    unrecoverable.append((path, status))
+                done += 1
+                if done % 100 == 0:
+                    log.info(f"  {done:,}/{len(corrupted):,} | {dict(results)}")
+            time.sleep(0.05)
+
+    log.info("── Results ─────────────────────────────────────────────────")
+    for status, count in sorted(results.items(), key=lambda x: -x[1]):
+        log.info(f"  {status:<25} {count:>8,}")
+
+    if unrecoverable:
+        with open(UNRECOVERABLE_LOG, 'w') as f:
+            for path, reason in unrecoverable:
+                f.write(f"{path}\t{reason}\n")
+        log.info(f"\n  Unrecoverable: {len(unrecoverable)} — logged to {UNRECOVERABLE_LOG}")
+    else:
+        log.info("\n  No unrecoverable files.")
+
+if __name__ == "__main__":
+    main()
--- a/scripts/validate.py
+++ b/scripts/validate.py
@ -0,0 +1,178 @@
+#!/usr/bin/env python3
+"""
+RECON Pipeline Validator
+
+Checks pipeline consistency: paths, DB state, file integrity, and service connectivity.
+Validates TEI, Ollama, and Qdrant are reachable. Deep mode checks every document on disk.
+
+Usage: python3 scripts/validate.py [--deep]
+"""
+
+import json
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from lib.utils import get_config, setup_logging
+from lib.status import StatusDB
+
+logger = setup_logging('recon.validate')
+
+
+def run_validation(deep=False):
+    config = get_config()
+    db = StatusDB()
+
+    issues = []
+    warnings = []
+
+    print("=== RECON Validation ===\n")
+
+    # Check paths
+    for name, path in config['paths'].items():
+        if name == 'db':
+            if not os.path.exists(path):
+                issues.append(f"Database not found: {path}")
+        else:
+            if not os.path.exists(path):
+                warnings.append(f"Directory missing: {name} = {path}")
+
+    # Check library
+    if not os.path.exists(config['library_root']):
+        issues.append(f"Library root not found: {config['library_root']}")
+
+    # Check Gemini keys
+    keys = config.get('gemini_keys', [])
+    if not keys:
+        warnings.append("No Gemini API keys configured in .env")
+    else:
+        print(f"  Gemini keys: {len(keys)} configured")
+
+    # DB status counts
+    counts = db.get_status_counts()
+    cat = counts.get('catalogue', {})
+    doc = counts.get('documents', {})
+
+    print(f"  Catalogue: {sum(cat.values())} entries")
+    print(f"  Documents: {sum(doc.values())} entries")
+    print(f"  Complete: {doc.get('complete', 0)}")
+    print(f"  Failed: {doc.get('failed', 0)}")
+
+    if deep:
+        print("\n--- Deep Validation ---\n")
+
+        # Check every document in pipeline has corresponding files
+        all_docs = db.get_all_documents()
+        text_dir = config['paths']['text']
+        concepts_dir = config['paths']['concepts']
+
+        for d in all_docs:
+            h = d['hash']
+            status = d['status']
+
+            if status in ('extracted', 'enriched', 'complete'):
+                doc_text_dir = os.path.join(text_dir, h)
+                if not os.path.exists(doc_text_dir):
+                    issues.append(f"[{h[:8]}] {d['filename']}: text dir missing but status={status}")
+                else:
+                    pages = [f for f in os.listdir(doc_text_dir) if f.startswith('page_')]
+                    if not pages:
+                        issues.append(f"[{h[:8]}] {d['filename']}: no page files in text dir")
+
+            if status in ('enriched', 'complete'):
+                doc_concepts_dir = os.path.join(concepts_dir, h)
+                if not os.path.exists(doc_concepts_dir):
+                    issues.append(f"[{h[:8]}] {d['filename']}: concepts dir missing but status={status}")
+                else:
+                    windows = [f for f in os.listdir(doc_concepts_dir) if f.startswith('window_')]
+                    if not windows:
+                        issues.append(f"[{h[:8]}] {d['filename']}: no window files in concepts dir")
+                    else:
+                        for wf in windows:
+                            try:
+                                with open(os.path.join(doc_concepts_dir, wf)) as f:
+                                    data = json.load(f)
+                                if not isinstance(data, list):
+                                    issues.append(f"[{h[:8]}] {wf}: not a JSON array")
+                            except (json.JSONDecodeError, UnicodeDecodeError):
+                                issues.append(f"[{h[:8]}] {wf}: invalid JSON")
+
+        # Check orphaned directories
+        doc_hashes = {d['hash'] for d in all_docs}
+        if os.path.exists(text_dir):
+            for dirname in os.listdir(text_dir):
+                if dirname not in doc_hashes:
+                    warnings.append(f"Orphaned text dir: {dirname}")
+
+        if os.path.exists(concepts_dir):
+            for dirname in os.listdir(concepts_dir):
+                if dirname not in doc_hashes:
+                    warnings.append(f"Orphaned concepts dir: {dirname}")
+
+        print(f"  Checked {len(all_docs)} documents")
+
+    # Connectivity checks
+    print("\n--- Connectivity ---\n")
+
+    import requests as http_requests
+
+    # Check TEI (primary embedding backend)
+    try:
+        tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/info"
+        resp = http_requests.get(tei_url, timeout=10)
+        if resp.status_code == 200:
+            print(f"  TEI: OK (bge-m3 at {config['embedding']['tei_host']}:{config['embedding']['tei_port']})")
+        else:
+            issues.append(f"TEI: HTTP {resp.status_code}")
+    except Exception as e:
+        issues.append(f"TEI: unreachable ({e})")
+
+    # Check Ollama (fallback)
+    try:
+        ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/tags"
+        resp = http_requests.get(ollama_url, timeout=10)
+        if resp.status_code == 200:
+            print(f"  Ollama: OK (fallback at {config['embedding']['ollama_host']}:{config['embedding']['ollama_port']})")
+        else:
+            warnings.append(f"Ollama: HTTP {resp.status_code}")
+    except Exception as e:
+        warnings.append(f"Ollama: unreachable ({e}) — fallback only, not critical")
+
+    try:
+        from qdrant_client import QdrantClient
+        qdrant = QdrantClient(
+            host=config['vector_db']['host'],
+            port=config['vector_db']['port'],
+            timeout=10
+        )
+        collections = [c.name for c in qdrant.get_collections().collections]
+        target = config['vector_db']['collection']
+        if target in collections:
+            info = qdrant.get_collection(target)
+            print(f"  Qdrant: OK ({target}: {info.points_count} points)")
+        else:
+            issues.append(f"Qdrant: collection {target} not found")
+    except Exception as e:
+        issues.append(f"Qdrant: unreachable ({e})")
+
+    # Summary
+    print("\n--- Summary ---\n")
+
+    if warnings:
+        print(f"Warnings ({len(warnings)}):")
+        for w in warnings:
+            print(f"  ⚠ {w}")
+
+    if issues:
+        print(f"\nIssues ({len(issues)}):")
+        for i in issues:
+            print(f"  ✗ {i}")
+        print(f"\nValidation FAILED: {len(issues)} issue(s)")
+    else:
+        print("Validation PASSED")
+
+    # Report the outcome to the caller so the script can set its exit status.
+    return not issues
+
+
+if __name__ == '__main__':
+    deep = '--deep' in sys.argv
+    sys.exit(0 if run_validation(deep=deep) else 1)
--- a/static/css/recon.css
+++ b/static/css/recon.css
@ -0,0 +1,316 @@
+/* RECON Design System
+ * Knowledge Extraction Pipeline — Dashboard CSS
+ */
+
+:root {
+    --bg-primary: #0a0a0a;
+    --bg-secondary: #111;
+    --bg-tertiary: #1a1a1a;
+    --border: #222;
+    --border-light: #333;
+    --text-primary: #c0c0c0;
+    --text-muted: #888;
+    --text-dim: #666;
+    --text-faint: #555;
+    --green: #00ff41;
+    --green-dim: #16a34a;
+    --red: #ff4444;
+    --red-dim: #dc2626;
+    --orange: #ffa500;
+    --blue: #00bfff;
+    --blue-sky: #0ea5e9;
+    --blue-dark: #0284c7;
+    --purple: #7c3aed;
+    --yellow: #fbbf24;
+
+    /* Pipeline colors */
+    --pipe-queued: #555;
+    --pipe-extracting: #b45309;
+    --pipe-extracted: #d97706;
+    --pipe-enriching: #0284c7;
+    --pipe-enriched: #0ea5e9;
+    --pipe-embedding: #7c3aed;
+    --pipe-complete: #16a34a;
+    --pipe-failed: #dc2626;
+
+    --font-mono: 'Courier New', monospace;
+    --radius: 3px;
+    --radius-md: 4px;
+}
+
+* { margin: 0; padding: 0; box-sizing: border-box; }
+body { font-family: var(--font-mono); background: var(--bg-primary); color: var(--text-primary); }
+
+/* ── Header ── */
+.header {
+    background: var(--bg-secondary);
+    border-bottom: 1px solid var(--border-light);
+    padding: 10px 24px;
+    flex-shrink: 0;
+    display: flex;
+    justify-content: space-between;
+    align-items: center;
+}
+.header-left {
+    display: flex;
+    align-items: baseline;
+    gap: 12px;
+}
+.header-subtitle {
+    font-size: 11px;
+    color: var(--text-dim);
+    letter-spacing: 1px;
+    text-transform: uppercase;
+}
+.header h1 { color: var(--green); font-size: 18px; font-weight: 700; letter-spacing: 3px; }
+.header .stats { font-size: 12px; color: var(--text-dim); }
+.header .quick-stats { font-size: 11px; color: var(--text-muted); display: flex; gap: 16px; }
+.header .quick-stats span { white-space: nowrap; }
+
+/* Heartbeat indicator */
+.heartbeat {
+    display: inline-block;
+    width: 8px;
+    height: 8px;
+    border-radius: 50%;
+    background: var(--green);
+    margin-right: 6px;
+    vertical-align: middle;
+    animation: pulse 2s ease-in-out infinite;
+}
+.heartbeat.dead {
+    background: var(--red);
+    animation: none;
+}
+@keyframes pulse {
+    0%, 100% { opacity: 1; }
+    50% { opacity: 0.4; }
+}
+
+/* ── Navigation ── */
+.nav-domain {
+    background: #0d0d0d;
+    border-bottom: 1px solid var(--border);
+    padding: 0 24px;
+    display: flex;
+    gap: 0;
+    flex-shrink: 0;
+}
+.nav-domain a {
+    color: var(--text-muted);
+    text-decoration: none;
+    font-size: 13px;
+    text-transform: uppercase;
+    letter-spacing: 1px;
+    padding: 10px 16px;
+    border-bottom: 2px solid transparent;
+    transition: color 0.15s, border-color 0.15s;
+}
+.nav-domain a:hover { color: var(--text-primary); }
+.nav-domain a.active {
+    color: var(--green);
+    border-bottom-color: var(--green);
+}
+
+.nav-sub {
+    background: var(--bg-primary);
+    border-bottom: 1px solid var(--border);
+    padding: 6px 24px;
+}
+.nav-sub a {
+    color: var(--text-dim);
+    text-decoration: none;
+    margin-right: 16px;
+    font-size: 12px;
+    transition: color 0.15s;
+}
+.nav-sub a:hover { color: var(--text-primary); }
+.nav-sub a.active { color: var(--green); }
+
+/* ── Content ── */
+.content { padding: 24px; max-width: 1400px; margin: 0 auto; }
+
+/* ── Panels ── */
+.panel {
+    background: var(--bg-secondary);
+    border: 1px solid var(--border);
+    padding: 24px;
+    margin-bottom: 24px;
+}
+
+/* ── Forms ── */
+.search-box {
+    width: 100%;
+    padding: 10px 16px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border-light);
+    color: var(--text-primary);
+    font-family: inherit;
+    font-size: 14px;
+    margin-bottom: 16px;
+}
+.search-box:focus { outline: none; border-color: var(--green); }
+
+/* ── Tables ── */
+table { width: 100%; border-collapse: collapse; font-size: 13px; }
+th { background: var(--bg-secondary); color: var(--green); text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border-light); }
+td { padding: 6px 12px; border-bottom: 1px solid var(--bg-tertiary); }
+tr:hover { background: var(--bg-secondary); }
+
+/* ── Status badges ── */
+.status { padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
+.status-complete { color: var(--green); }
+.status-enriched { color: var(--blue); }
+.status-extracted { color: var(--orange); }
+.status-failed { color: var(--red); }
+.status-queued { color: var(--text-muted); }
+.status-duplicate { color: var(--text-muted); }
+
+/* ── Stat cards ── */
+.stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
+.stat-card { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; }
+.stat-card .label { color: var(--text-dim); font-size: 11px; text-transform: uppercase; }
+.stat-card .value { color: var(--green); font-size: 28px; margin-top: 4px; }
+.stat-card .sublabel { color: var(--text-faint); font-size: 10px; margin-top: 2px; }
+
+/* ── Search results ── */
+.result { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; margin-bottom: 12px; }
+.result .title { color: var(--green); font-size: 14px; margin-bottom: 4px; }
+.result .meta { color: var(--text-dim); font-size: 11px; margin-bottom: 8px; }
+.result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
+.result .score { color: var(--orange); font-size: 12px; float: right; }
+
+/* ── Buttons ── */
+.btn {
+    background: var(--bg-tertiary);
+    border: 1px solid var(--border-light);
+    color: var(--text-primary);
+    padding: 6px 14px;
+    cursor: pointer;
+    font-family: inherit;
+    font-size: 12px;
+}
+.btn:hover { border-color: var(--green); color: var(--green); }
+.btn:disabled { opacity: 0.5; cursor: not-allowed; }
+.btn.active { border-color: var(--green); color: var(--green); }
+.btn-danger { color: var(--red); }
+.btn-danger:hover { border-color: var(--red); }
+.btn-warn { color: #ff8800; }
+.btn-warn:hover { border-color: #ff8800; }
+
+/* ── Tags ── */
+.domain-tag {
+    display: inline-block;
+    background: var(--bg-tertiary);
+    border: 1px solid var(--border-light);
+    padding: 1px 6px;
+    margin: 1px;
+    font-size: 10px;
+    color: var(--text-muted);
+}
+.badge-web { background: #1e3a5f; color: #60a5fa; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
+.badge-pdf { background: #2d5a2d; color: #4ade80; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
+
+/* ── Trend indicators ── */
+.trend { font-size: 11px; margin-left: 6px; }
+.trend-up { color: var(--green); }
+.trend-down { color: var(--red); }
+.trend-flat { color: var(--text-faint); }
+
+/* ── Pipeline bar ── */
+.pipeline-bar {
+    height: 24px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border);
+    border-radius: var(--radius-md);
+    overflow: hidden;
+    display: flex;
+}
+.pipeline-bar .segment { height: 100%; transition: width 0.3s ease; }
+
+.pipeline-legend { display: flex; gap: 14px; margin-top: 6px; font-size: 10px; color: var(--text-muted); flex-wrap: wrap; }
+.legend-dot {
+    display: inline-block;
+    width: 10px; height: 10px;
+    border-radius: 2px;
+    margin-right: 4px;
+    vertical-align: middle;
+}
+
+/* ── Service status dots ── */
+.svc-dot {
+    display: inline-block;
+    width: 10px;
+    height: 10px;
+    border-radius: 50%;
+    margin-right: 6px;
+    vertical-align: middle;
+}
+.svc-dot.active { background: var(--green); }
+.svc-dot.inactive { background: var(--red); }
+.svc-dot.unknown { background: var(--text-faint); }
+
+/* ── Service status row ── */
+.svc-row {
+    display: flex;
+    gap: 24px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border);
+    padding: 12px 16px;
+    margin-bottom: 24px;
+    font-size: 12px;
+}
+.svc-row .svc-item { display: flex; align-items: center; }
+
+/* ── Pagination ── */
+.pagination {
+    display: flex;
+    gap: 4px;
+    margin-top: 16px;
+    justify-content: center;
+}
+.pagination a, .pagination span {
+    padding: 4px 10px;
+    border: 1px solid var(--border-light);
+    color: var(--text-muted);
+    text-decoration: none;
+    font-size: 12px;
+}
+.pagination a:hover { border-color: var(--green); color: var(--green); }
+.pagination .current {
+    border-color: var(--green);
+    color: var(--green);
+    background: var(--bg-tertiary);
+}
+
+/* ── Misc helpers ── */
+.section-title { color: var(--green); margin-bottom: 12px; }
+.mt-24 { margin-top: 24px; }
+.mb-16 { margin-bottom: 16px; }
+.mb-24 { margin-bottom: 24px; }
+.text-muted { color: var(--text-muted); }
+.text-dim { color: var(--text-dim); }
+.text-faint { color: var(--text-faint); }
+.text-green { color: var(--green); }
+.text-red { color: var(--red); }
+.text-orange { color: var(--orange); }
+.text-blue { color: var(--blue); }
+.text-small { font-size: 12px; }
+.text-xs { font-size: 11px; }
+.text-xxs { font-size: 10px; }
+.mono { font-family: monospace; }
+
+.flex { display: flex; }
+.flex-between { display: flex; justify-content: space-between; }
+.flex-center { display: flex; align-items: center; }
+.gap-8 { gap: 8px; }
+.gap-16 { gap: 16px; }
+
+.grid-2 { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
+.grid-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
+
+/* ── Collapsible errors panel ── */
+.errors-panel { display: none; }
+.errors-panel.has-errors { display: block; }
+.errors-panel summary { color: var(--red); cursor: pointer; font-size: 13px; margin-bottom: 8px; }
+.errors-panel .error-line { color: var(--text-muted); font-size: 11px; padding: 2px 0; border-bottom: 1px solid var(--border); }
--- a/static/js/channels.js
+++ b/static/js/channels.js
@ -0,0 +1,120 @@
+/* RECON PeerTube Channels page JS */
+(function() {
+    'use strict';
+
+    async function loadChannelStats() {
+        try {
+            var resp = await fetch('/api/peertube/channels/stats');
+            var data = await resp.json();
+            if (resp.ok) {
+                document.getElementById('pt-total-ch').textContent = data.total_channels;
+                document.getElementById('pt-total-vid').textContent = data.total_videos;
+                var dlEl = document.getElementById('pt-dl-status');
+                dlEl.textContent = data.downloader_active ? 'Active' : 'Stopped';
+                dlEl.style.color = data.downloader_active ? '#00ff41' : '#ff4444';
+            }
+        } catch(e) {
+            console.error('Stats error:', e);
+        }
+    }
+
+    // Escape untrusted API values before interpolating them into innerHTML.
+    function esc(value) {
+        return String(value).replace(/[&<>"']/g, function(c) {
+            return {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'}[c];
+        });
+    }
+
+    async function loadChannels() {
+        try {
+            var resp = await fetch('/api/peertube/channels');
+            var data = await resp.json();
+            if (!resp.ok) throw new Error(data.error || 'Failed');
+            var tbody = document.getElementById('pt-channel-tbody');
+            if (!data.length) {
+                tbody.innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">No channels configured</td></tr>';
+                return;
+            }
+            var cats = [];
+            var catSet = {};
+            data.forEach(function(c) { if (c.category && !catSet[c.category]) { catSet[c.category] = true; cats.push(c.category); } });
+            document.getElementById('pt-cat-list').innerHTML = cats.map(function(c) { return '<option value="' + esc(c) + '">'; }).join('');
+
+            var html = '';
+            data.forEach(function(ch) {
+                var vids = ch.videos_in_peertube || 0;
+                var statusColor = vids > 0 ? '#00ff41' : '#ffa500';
+                var statusText = vids > 0 ? 'syncing' : 'new';
+                var name = esc(ch.channel_name);
+                // Only link http(s) URLs so a javascript: URL cannot be injected.
+                var ytLink = /^https?:\/\//i.test(ch.youtube_url || '') ? '<a href="' + esc(ch.youtube_url) + '" target="_blank" rel="noopener" style="color:#00a0d0;text-decoration:none;">' + name + '</a>' : name;
+                html += '<tr style="border-bottom:1px solid #1a1a1a;">' +
+                    '<td style="padding:8px 10px;">' + ytLink + '</td>' +
+                    '<td style="padding:8px 10px;text-align:center;">' + esc(vids) + '</td>' +
+                    '<td style="padding:8px 10px;color:#888;">' + esc(ch.category || '') + '</td>' +
+                    '<td style="padding:8px 10px;text-align:center;">' + esc(ch.priority || 'M') + '</td>' +
+                    '<td style="padding:8px 10px;text-align:center;"><span style="color:' + statusColor + ';">' + statusText + '</span></td>' +
+                    // JSON.stringify yields a safe JS string literal; esc() makes it safe inside the HTML attribute.
+                    '<td style="padding:8px 10px;text-align:center;"><button onclick="removeChannel(' + esc(JSON.stringify(ch.actor_name)) + ')" style="background:none;border:1px solid #333;color:#ff4444;cursor:pointer;padding:2px 8px;font-size:11px;font-family:inherit;">x</button></td>' +
+                    '</tr>';
+            });
+            tbody.innerHTML = html;
+        } catch(e) {
+            document.getElementById('pt-channel-tbody').innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#ff4444;">Error: ' + esc(e.message) + '</td></tr>';
+        }
+    }
+
+    window.addChannel = async function() {
+        var fb = document.getElementById('pt-feedback');
+        var url = document.getElementById('pt-yt-url').value.trim();
+        if (!url) {
+            fb.style.color = '#ff4444';
+            fb.textContent = 'Enter a YouTube channel URL';
+            return;
+        }
+        var category = document.getElementById('pt-category').value.trim();
+        var priority = document.getElementById('pt-priority').value;
+        var btn = document.getElementById('pt-add-btn');
+        btn.disabled = true;
+        fb.style.color = '#ffa500';
+        fb.textContent = 'Resolving channel...';
+        try {
+            var resp = await fetch('/api/peertube/channels/add', {
+                method: 'POST',
+                headers: {'Content-Type': 'application/json'},
+                body: JSON.stringify({youtube_url: url, category: category, priority: priority})
+            });
+            var data = await resp.json();
+            if (resp.ok) {
+                fb.style.color = '#00ff41';
+                fb.textContent = 'Added: ' + (data.channel_name || 'channel');
+                document.getElementById('pt-yt-url').value = '';
+                loadChannels();
+                loadChannelStats();
+            } else {
+                fb.style.color = '#ff4444';
+                fb.textContent = data.error || 'Failed to add channel';
+            }
+        } catch(e) {
+            fb.style.color = '#ff4444';
+            fb.textContent = 'Error: ' + e.message;
+        }
+        btn.disabled = false;
+    };
+
+    window.removeChannel = async function(actorName) {
+        if (!confirm('Remove channel ' + actorName + '?')) return;
+        var fb = document.getElementById('pt-feedback');
+        fb.style.color = '#ffa500';
+        fb.textContent = 'Removing...';
+        try {
+            var resp = await fetch('/api/peertube/channels/' + encodeURIComponent(actorName), {method: 'DELETE'});
+            var data = await resp.json();
+            if (resp.ok) {
+                fb.style.color = '#00ff41';
+                fb.textContent = data.message || 'Removed';
+                loadChannels();
+                loadChannelStats();
+            } else {
+                fb.style.color = '#ff4444';
+                fb.textContent = data.error || 'Failed';
+            }
+        } catch(e) {
+            fb.style.color = '#ff4444';
+            fb.textContent = 'Error: ' + e.message;
+        }
+    };
+
+    loadChannelStats();
+    loadChannels();
+})();
--- a/static/js/charts.js
+++ b/static/js/charts.js
@ -0,0 +1,186 @@
+/* RECON Lightweight Canvas Line Chart
+ * No dependencies. drawLineChart(canvasId, datasets, opts)
+ * DPI-aware rendering for sharp lines on all displays.
+ */
+var ReconChart = (function() {
+    'use strict';
+
+    var COLORS = ['#00ff41', '#0ea5e9', '#ffa500', '#ff4444', '#7c3aed', '#fbbf24'];
+
+    function drawLineChart(canvasId, datasets, opts) {
+        opts = opts || {};
+        var canvas = document.getElementById(canvasId);
+        if (!canvas) return;
+
+        // DPI-aware sizing — match canvas bitmap to actual CSS pixels
+        var dpr = window.devicePixelRatio || 1;
+        var rect = canvas.getBoundingClientRect();
+        var cssW = rect.width || 800;
+        var cssH = rect.height || 200;
+        canvas.width = cssW * dpr;
+        canvas.height = cssH * dpr;
+
+        var ctx = canvas.getContext('2d');
+        ctx.scale(dpr, dpr);
+
+        var W = cssW;
+        var H = cssH;
+        var pad = {top: 20, right: 20, bottom: 30, left: 60};
+        var plotW = W - pad.left - pad.right;
+        var plotH = H - pad.top - pad.bottom;
+
+        // Clear
+        ctx.fillStyle = '#111';
+        ctx.fillRect(0, 0, W, H);
+
+        if (!datasets || datasets.length === 0) {
+            ctx.fillStyle = '#666';
+            ctx.font = '12px Courier New';
+            ctx.textAlign = 'center';
+            ctx.fillText('No data', W/2, H/2);
+            return;
+        }
+
+        // Find global min/max Y
+        var allY = [];
+        var allX = [];
+        datasets.forEach(function(ds) {
+            ds.points.forEach(function(p) {
+                allY.push(p.y);
+                allX.push(p.x);
+            });
+        });
+        if (allY.length === 0) return;
+
+        var minY = Math.min.apply(null, allY);
+        var maxY = Math.max.apply(null, allY);
+        var minX = Math.min.apply(null, allX);
+        var maxX = Math.max.apply(null, allX);
+
+        // Pad the Y range: 5% below (clamped at 0) and 10% above
+        var yRange = maxY - minY || 1;
+        minY = Math.max(0, minY - yRange * 0.05);
+        maxY = maxY + yRange * 0.1;
+        var xRange = maxX - minX || 1;
+
+        function xToCanvas(x) { return pad.left + ((x - minX) / xRange) * plotW; }
+        function yToCanvas(y) { return pad.top + plotH - ((y - minY) / (maxY - minY)) * plotH; }
+
+        // Grid lines
+        ctx.strokeStyle = '#222';
+        ctx.lineWidth = 1;
+        var ySteps = 5;
+        for (var i = 0; i <= ySteps; i++) {
+            var yVal = minY + (maxY - minY) * (i / ySteps);
+            var cy = yToCanvas(yVal);
+            ctx.beginPath();
+            ctx.moveTo(pad.left, cy);
+            ctx.lineTo(W - pad.right, cy);
+            ctx.stroke();
+
+            // Y labels
+            ctx.fillStyle = '#666';
+            ctx.font = '10px Courier New';
+            ctx.textAlign = 'right';
+            ctx.fillText(Math.round(yVal).toLocaleString(), pad.left - 6, cy + 3);
+        }
+
+        // X labels (time)
+        ctx.textAlign = 'center';
+        ctx.fillStyle = '#666';
+        var xSteps = Math.min(6, allX.length);
+        for (var j = 0; j < xSteps; j++) {
+            var xVal = minX + xRange * (j / (xSteps - 1 || 1));
+            var cx = xToCanvas(xVal);
+            var d = new Date(xVal);
+            var label = d.getHours().toString().padStart(2, '0') + ':' + d.getMinutes().toString().padStart(2, '0');
+            ctx.fillText(label, cx, H - 8);
+        }
+
+        // Draw lines + dots at each data point
+        datasets.forEach(function(ds, idx) {
+            var color = ds.color || COLORS[idx % COLORS.length];
+            ctx.strokeStyle = color;
+            ctx.lineWidth = 2;
+            ctx.beginPath();
+            var pts = ds.points.slice().sort(function(a, b) { return a.x - b.x; }); // sort a copy; leave the caller's array untouched
+            pts.forEach(function(p, i) {
+                var x = xToCanvas(p.x);
+                var y = yToCanvas(p.y);
+                if (i === 0) ctx.moveTo(x, y);
+                else ctx.lineTo(x, y);
+            });
+            ctx.stroke();
+
+            // Draw dots at each point for visibility with sparse data
+            ctx.fillStyle = color;
+            pts.forEach(function(p) {
+                var x = xToCanvas(p.x);
+                var y = yToCanvas(p.y);
+                ctx.beginPath();
+                ctx.arc(x, y, 3, 0, Math.PI * 2);
+                ctx.fill();
+            });
+
+            // Legend label
+            if (ds.label) {
+                ctx.fillStyle = color;
+                ctx.font = '10px Courier New';
+                ctx.textAlign = 'left';
+                ctx.fillText(ds.label, pad.left + idx * 100, 12);
+            }
+        });
+    }
+
+    function loadAndDraw(canvasId, metricType, keys, labels, hours) {
+        hours = hours || 24;
+        RECON.fetchJSON('/api/metrics/history?type=' + encodeURIComponent(metricType) + '&hours=' + encodeURIComponent(hours)).then(function(data) {
+            if (!data.points || data.points.length < 2) {
+                // Show "collecting data" message instead of hiding
+                var canvas = document.getElementById(canvasId);
+                if (!canvas) return;
+                var container = canvas.parentElement;
+                if (container) container.style.display = 'block';
+                var dpr = window.devicePixelRatio || 1;
+                var rect = canvas.getBoundingClientRect();
+                canvas.width = (rect.width || 800) * dpr;
+                canvas.height = (rect.height || 200) * dpr;
+                var ctx = canvas.getContext('2d');
+                ctx.scale(dpr, dpr);
+                ctx.fillStyle = '#111';
+                ctx.fillRect(0, 0, rect.width || 800, rect.height || 200);
+                ctx.fillStyle = '#555';
+                ctx.font = '12px Courier New';
+                ctx.textAlign = 'center';
+                var msg = data.points && data.points.length === 1
+                    ? 'Collecting data... (1 snapshot, need 2+)'
+                    : 'Collecting data... (snapshots every 2 min)';
+                ctx.fillText(msg, (rect.width || 800) / 2, (rect.height || 200) / 2);
+                return;
+            }
+
+            var container = document.getElementById(canvasId).parentElement;
+            if (container) container.style.display = 'block';
+
+            var datasets = keys.map(function(key, i) {
+                return {
+                    label: labels[i] || key,
+                    color: COLORS[i % COLORS.length],
+                    points: data.points.map(function(p) {
+                        return {
+                            x: new Date(p.timestamp).getTime(),
+                            y: p.data[key] || 0
+                        };
+                    })
+                };
+            });
+
+            drawLineChart(canvasId, datasets);
+        }).catch(function() {});
+    }
+
+    return {
+        drawLineChart: drawLineChart,
+        loadAndDraw: loadAndDraw
+    };
+})();
--- a/static/js/common.js
+++ b/static/js/common.js
@ -0,0 +1,163 @@
+/* RECON Common Utilities
+ * Shared fetch helpers, formatters, auto-refresh
+ */
+
+var RECON = (function() {
+    'use strict';
+
+    // Pipeline color/label maps
+    var pipeColors = {
+        queued: '#555', extracting: '#b45309', extracted: '#d97706',
+        enriching: '#0284c7', enriched: '#0ea5e9', embedding: '#7c3aed',
+        complete: '#16a34a', failed: '#dc2626'
+    };
+    var pipeLabels = {
+        queued: 'Queued', extracting: 'Extracting', extracted: 'Extracted',
+        enriching: 'Enriching', enriched: 'Enriched', embedding: 'Embedding',
+        complete: 'Complete', failed: 'Failed'
+    };
+
+    var _refreshTimers = [];
+    var _heartbeatEl = null;
+
+    function fetchJSON(url) {
+        return fetch(url).then(function(r) {
+            if (!r.ok) throw new Error('HTTP ' + r.status);
+            return r.json();
+        });
+    }
+
+    function postJSON(url, body) {
+        return fetch(url, {
+            method: 'POST',
+            headers: {'Content-Type': 'application/json'},
+            body: JSON.stringify(body || {})
+        }).then(function(r) { return r.json(); });
+    }
+
+    function set(id, text) {
+        var el = document.getElementById(id);
+        if (el) el.textContent = text;
+    }
+
+    function setHTML(id, html) {
+        var el = document.getElementById(id);
+        if (el) el.innerHTML = html;
+    }
+
+    function fmt(n) {
+        if (typeof n !== 'number' || isNaN(n)) return '—';
+        return n.toLocaleString();
+    }
+
+    function fmtBytes(bytes) {
+        if (!bytes || bytes === 0) return '0 B';
+        var units = ['B', 'KB', 'MB', 'GB', 'TB'];
+        var i = Math.min(units.length - 1, Math.max(0, Math.floor(Math.log(bytes) / Math.log(1024)))); // clamp to the units table
+        return (bytes / Math.pow(1024, i)).toFixed(1) + ' ' + units[i];
+    }
+
+    function pct(n, total) {
+        if (!total || total === 0) return '0';
+        return (n / total * 100).toFixed(1);
+    }
+
+    // Trend indicator: compare current to previous
+    function trend(current, previous) {
+        if (previous === undefined || previous === null) return '';
+        var diff = current - previous;
+        if (diff > 0) return '<span class="trend trend-up">+' + fmt(diff) + ' &#9650;</span>';
+        if (diff < 0) return '<span class="trend trend-down">' + fmt(diff) + ' &#9660;</span>';
+        return '<span class="trend trend-flat">&mdash; &#9654;</span>';
+    }
+
+    // Build a segmented pipeline progress bar
+    function progressBar(segments, total) {
+        var html = '';
+        segments.forEach(function(seg) {
+            var w = total > 0 ? (seg.count / total * 100) : 0;
+            if (w > 0) {
+                html += '<div class="segment" style="width:' + w + '%;background:' +
+                    (seg.color || pipeColors[seg.status] || '#555') + ';" title="' +
+                    (seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '"></div>';
+            }
+        });
+        return html;
+    }
+
+    // Build legend for pipeline bar
+    function progressLegend(segments) {
+        var html = '';
+        segments.forEach(function(seg) {
+            if (seg.count > 0) {
+                html += '<span><span class="legend-dot" style="background:' +
+                    (seg.color || pipeColors[seg.status] || '#555') + ';"></span>' +
+                    (seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '</span>';
+            }
+        });
+        return html;
+    }
+
+    // Auto-refresh with heartbeat
+    function startRefresh(callback, intervalMs) {
+        _heartbeatEl = document.getElementById('heartbeat');
+
+        function tick() {
+            try {
+                var result = callback();
+                if (result && typeof result.then === 'function') {
+                    result.then(function() {
+                        if (_heartbeatEl) {
+                            _heartbeatEl.classList.remove('dead');
+                        }
+                    }).catch(function() {
+                        if (_heartbeatEl) {
+                            _heartbeatEl.classList.add('dead');
+                        }
+                    });
+                } else {
+                    if (_heartbeatEl) _heartbeatEl.classList.remove('dead');
+                }
+            } catch(e) {
+                if (_heartbeatEl) _heartbeatEl.classList.add('dead');
+            }
+        }
+
+        // Initial load
+        tick();
+        var timer = setInterval(tick, intervalMs || 30000);
+        _refreshTimers.push(timer);
+        return timer;
+    }
+
+    function stopRefresh(timer) {
+        if (timer) clearInterval(timer);
+    }
+
+    // Quick-stats loader for header
+    function loadQuickStats() {
+        fetchJSON('/api/quick-stats').then(function(data) {
+            setHTML('qs-docs', fmt(data.catalogued));
+            setHTML('qs-vectors', fmt(data.vectors));
+            setHTML('qs-pipeline', fmt(data.in_pipeline));
+        }).catch(function() {});
+    }
+
+    return {
+        fetchJSON: fetchJSON,
+        postJSON: postJSON,
+        set: set,
+        setHTML: setHTML,
+        fmt: fmt,
+        fmtBytes: fmtBytes,
+        pct: pct,
+        trend: trend,
+        progressBar: progressBar,
+        progressLegend: progressLegend,
+        startRefresh: startRefresh,
+        stopRefresh: stopRefresh,
+        loadQuickStats: loadQuickStats,
+        pipeColors: pipeColors,
+        pipeLabels: pipeLabels
+    };
+})();
--- a/static/js/dashboard.js
+++ b/static/js/dashboard.js
@ -0,0 +1,232 @@
+/* RECON Knowledge Dashboard */
+(function() {
+    'use strict';
+
+    var pipeColors = RECON.pipeColors;
+    var pipeLabels = RECON.pipeLabels;
+
+    function loadDashboard() {
+        return RECON.fetchJSON('/api/knowledge-stats').then(function(data) {
+            var t = data.totals;
+
+            // Top cards
+            RECON.set('kv-catalogued', RECON.fmt(t.catalogued || 0));
+            RECON.set('kv-pipeline', RECON.fmt(t.in_pipeline || 0));
+            var pipeSub = document.getElementById('kv-pipeline-sub');
+            if (t.in_pipeline > 0) {
+                var active = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0; });
+                var activeText = active.map(function(p) { return p.count + ' ' + p.status; }).join(', ');
+                pipeSub.textContent = activeText || 'processing';
+            } else { pipeSub.textContent = 'idle'; }
+            RECON.set('kv-complete', RECON.fmt(t.complete || 0));
+            var failEl = document.getElementById('kv-failed');
+            failEl.textContent = RECON.fmt(t.failed || 0);
+            failEl.style.color = t.failed > 0 ? '#ff4444' : '#00ff41';
+            RECON.set('kv-concepts', RECON.fmt(t.concepts || 0));
+            RECON.set('kv-vectors', RECON.fmt(t.vectors || 0));
+            RECON.set('kv-pages', RECON.fmt(t.pages_processed || 0));
+
+            // Progress bar
+            var total = t.catalogued || 1;
+            var notYetQueued = total - (t.documents || 0);
+            var segments = [];
+            if (notYetQueued > 0) {
+                segments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
+            }
+            data.pipeline.forEach(function(p) {
+                if (p.count > 0) segments.push(p);
+            });
+            RECON.setHTML('progress-bar', RECON.progressBar(segments, total));
+            var completePct = ((t.complete || 0) / total * 100).toFixed(1); // total is always >= 1 here
+            RECON.set('progress-pct', completePct + '% complete (' + RECON.fmt(t.complete || 0) + ' / ' + RECON.fmt(total) + ')');
+
+            // Legend
+            var legendSegments = [];
+            if (notYetQueued > 0) legendSegments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
+            data.pipeline.forEach(function(p) { if (p.count > 0) legendSegments.push(p); });
+            RECON.setHTML('progress-legend', RECON.progressLegend(legendSegments));
+
+            // Pipeline activity
+            var activeStatuses = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0 && p.count > 0; });
+            var actDiv = document.getElementById('pipeline-activity');
+            if (activeStatuses.length > 0) {
+                actDiv.style.display = 'block';
+                var actHtml = '';
+                activeStatuses.forEach(function(p) {
+                    actHtml += '<div style="margin:4px 0;"><span style="color:' + (pipeColors[p.status]||'#ffa500') + ';">&#9679; ' + (pipeLabels[p.status]||p.status) + ':</span> ' + p.count + ' documents</div>';
+                });
+                if (data.active_titles) {
+                    Object.keys(data.active_titles).forEach(function(st) {
+                        var titles = data.active_titles[st];
+                        if (titles.length > 0) actHtml += '<div style="color:#666;font-size:11px;margin-left:16px;">' + titles.slice(0,3).join(', ') + (titles.length > 3 ? ', ...' : '') + '</div>';
+                    });
+                }
+                RECON.setHTML('activity-content', actHtml);
+            } else { actDiv.style.display = 'none'; }
+
+            // Qdrant health
+            var q = data.qdrant;
+            var qEl = document.getElementById('qdrant-status');
+            if (q.error) {
+                qEl.innerHTML = '<span style="color:#ff4444;">&#9679; Offline</span> &mdash; ' + q.error;
+            } else {
+                var idxType = q.index_type || (q.vectors >= 20000 ? 'HNSW' : 'brute-force');
+                var idxColor = idxType === 'HNSW' ? '#00ff41' : '#ffa500';
+                qEl.innerHTML = '<span style="color:#00ff41;">&#9679; Online</span> | ' +
+                    RECON.fmt(q.vectors) + ' vectors | ' +
+                    '<span style="color:' + idxColor + ';">' + idxType + '</span>' +
+                    (idxType === 'HNSW' ? ' (' + RECON.fmt(q.indexed||0) + ' indexed)' : ' (HNSW auto-builds at 20K)') +
+                    ' | <span style="color:#555;">recon_knowledge</span>';
+            }
+
+            // Sources table
+            var tbody = document.getElementById('sources-tbody');
+            var totalCat = 0, totalComp = 0, totalPipe = 0, totalConcepts = 0, totalVectors = 0;
+            tbody.innerHTML = data.sources.map(function(s) {
+                var catCount = s.catalogued || 0;
+                var compCount = s.complete || 0;
+                var pipeCount = s.in_pipeline || 0;
+                totalCat += catCount; totalComp += compCount; totalPipe += pipeCount;
+                totalConcepts += s.concepts || 0; totalVectors += s.vectors || 0;
+                var badge = s.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
+                var compPct = catCount > 0 ? (compCount / catCount * 100) : 0;
+                var pipePct = catCount > 0 ? (pipeCount / catCount * 100) : 0;
+                var compColor = compPct >= 100 ? '#00ff41' : compPct > 0 ? '#ffa500' : '#666';
+                var pipeColor = pipeCount > 0 ? '#0ea5e9' : '#555';
+                var barW = 80;
+                var compW = (compPct / 100 * barW).toFixed(1);
+                var pipeW = (pipePct / 100 * barW).toFixed(1);
+                var miniBar = '<div style="display:flex;align-items:center;gap:6px;">' +
+                    '<div style="width:' + barW + 'px;height:10px;background:#1a1a1a;border-radius:3px;overflow:hidden;display:flex;">' +
+                    '<div style="width:' + compW + 'px;background:#16a34a;height:100%;"></div>' +
+                    '<div style="width:' + pipeW + 'px;background:#0284c7;height:100%;"></div>' +
+                    '</div><span style="color:#888;font-size:10px;">' + compPct.toFixed(0) + '%</span></div>';
+                return '<tr><td>' + s.name + '</td><td>' + badge + '</td><td>' +
+                    RECON.fmt(catCount) + '</td><td><span style="color:' + compColor + ';">' +
+                    RECON.fmt(compCount) + '</span></td><td><span style="color:' + pipeColor + ';">' +
+                    RECON.fmt(pipeCount) + '</span></td><td>' + miniBar + '</td><td>' +
+                    RECON.fmt(s.concepts) + '</td><td>' + RECON.fmt(s.vectors) + '</td></tr>';
+            }).join('');
+            RECON.setHTML('sources-tfoot',
+                '<tr style="border-top:1px solid #333;font-weight:bold;"><td>TOTAL</td><td></td><td>' +
+                RECON.fmt(totalCat) + '</td><td>' + RECON.fmt(totalComp) + '</td><td>' +
+                RECON.fmt(totalPipe) + '</td><td></td><td>' +
+                RECON.fmt(totalConcepts) + '</td><td>' + RECON.fmt(totalVectors) + '</td></tr>');
+
+            // Domain bars
+            var dc = document.getElementById('domain-bars');
+            var domEntries = Object.entries(data.domains);
+            if (domEntries.length === 0) {
+                dc.innerHTML = '<span class="text-dim">No domain data</span>';
+            } else {
+                var maxD = Math.max.apply(null, domEntries.map(function(e) { return e[1]; }));
+                dc.innerHTML = domEntries.map(function(entry) {
+                    var name = entry[0], count = entry[1];
+                    var pct = (count / maxD * 100).toFixed(1);
+                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
+                        '<span style="width:160px;text-align:right;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;">' + name + '</span>' +
+                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
+                        '<div style="height:100%;background:#00cc66;border-radius:3px;width:' + pct + '%;"></div></div>' +
+                        '<span style="width:50px;color:#ccc;text-align:right;">' + RECON.fmt(count) + '</span></div>';
+                }).join('');
+            }
+
+            // Knowledge Type bars
+            var ktEl = document.getElementById('knowledge-type-bars');
+            var ktEntries = Object.entries(data.knowledge_types || {});
+            var totalKt = ktEntries.reduce(function(a, e) { return a + e[1]; }, 0);
+            if (ktEntries.length === 0) {
+                ktEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
+            } else {
+                var ktColors = {foundational: '#60a5fa', procedural: '#4ade80', operational: '#fbbf24'};
+                var maxKt = Math.max.apply(null, ktEntries.map(function(e) { return e[1]; }));
+                ktEl.innerHTML = ktEntries.map(function(entry) {
+                    var name = entry[0], count = entry[1];
+                    var pctVal = totalKt > 0 ? (count / totalKt * 100).toFixed(0) : 0;
+                    var barPct = (count / maxKt * 100).toFixed(1);
+                    var color = ktColors[name] || '#888';
+                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
+                        '<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
+                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
+                        '<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
+                        '<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
+                }).join('');
+            }
+            var ktMig = document.getElementById('knowledge-type-migration');
+            ktMig.textContent = RECON.fmt(totalKt) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';
+
+            // Complexity bars
+            var cxEl = document.getElementById('complexity-bars');
+            var cxEntries = Object.entries(data.complexities || {});
+            var totalCx = cxEntries.reduce(function(a, e) { return a + e[1]; }, 0);
+            if (cxEntries.length === 0) {
+                cxEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
+            } else {
+                var cxColors = {basic: '#4ade80', intermediate: '#fbbf24', advanced: '#f87171'};
+                var maxCx = Math.max.apply(null, cxEntries.map(function(e) { return e[1]; }));
+                cxEl.innerHTML = cxEntries.map(function(entry) {
+                    var name = entry[0], count = entry[1];
+                    var pctVal = totalCx > 0 ? (count / totalCx * 100).toFixed(0) : 0;
+                    var barPct = (count / maxCx * 100).toFixed(1);
+                    var color = cxColors[name] || '#888';
+                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
+                        '<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
+                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
+                        '<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
+                        '<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
+                }).join('');
+            }
+            var cxMig = document.getElementById('complexity-migration');
+            cxMig.textContent = RECON.fmt(totalCx) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';
+
+            // Recent completions
+            var rtb = document.getElementById('recent-tbody');
+            if (data.recent_complete.length === 0) {
+                rtb.innerHTML = '<tr><td colspan="4" class="text-dim">None yet</td></tr>';
+            } else {
+                rtb.innerHTML = data.recent_complete.map(function(r) {
+                    var badge = r.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
+                    return '<tr><td>' + r.title + '</td><td>' + badge + '</td><td>' +
+                        r.concepts + '</td><td>' + r.vectors + '</td></tr>';
+                }).join('');
+            }
+        });
+    }
+
+    function loadCharts() {
+        if (typeof ReconChart !== 'undefined') {
+            ReconChart.loadAndDraw('kb-chart', 'knowledge',
+                ['complete', 'concepts'], ['Completed', 'Concepts'], 24);
+        }
+    }
+
+    function initSourcesToggle() {
+        var toggle = document.getElementById('sources-toggle');
+        var arrow = document.getElementById('sources-arrow');
+        var thead = document.getElementById('sources-thead');
+        var tbody = document.getElementById('sources-tbody');
+        var expanded = localStorage.getItem('recon-sources-expanded') === 'true';
+
+        function apply() {
+            var show = expanded ? '' : 'none';
+            thead.style.display = show;
+            tbody.style.display = show;
+            arrow.innerHTML = expanded ? '&#9660;' : '&#9654;';
+        }
+
+        toggle.addEventListener('click', function() {
+            expanded = !expanded;
+            localStorage.setItem('recon-sources-expanded', expanded);
+            apply();
+        });
+
+        apply();
+    }
+
+    document.addEventListener('DOMContentLoaded', function() {
+        initSourcesToggle();
+        RECON.startRefresh(loadDashboard, 30000);
+        loadCharts();
+        setInterval(loadCharts, 300000); // refresh charts every 5 min
+    });
+})();
--- a/static/js/peertube.js
+++ b/static/js/peertube.js
@ -0,0 +1,106 @@
+/* RECON PeerTube Dashboard JS */
+(function() {
+    'use strict';
+
+    function loadPTDashboard() {
+        return RECON.fetchJSON('/api/peertube/dashboard').then(function(data) {
+            // Video states
+            var vs = data.video_states || {};
+            // PeerTube state codes: 1=published, 2=to_transcode, 3=to_import, 4=waiting_for_live, 5=live_ended, 6=to_move_to_external_storage, 7=transcoding_failed, 8=to_move_to_external_storage_failed, 9=to_edit
+            var published = vs['1'] || 0;
+            var inPipeline = (vs['2'] || 0) + (vs['3'] || 0) + (vs['6'] || 0) + (vs['9'] || 0);
+            var failed = (vs['7'] || 0) + (vs['8'] || 0);
+            RECON.set('pt-published', RECON.fmt(published));
+            RECON.set('pt-in-pipeline', RECON.fmt(inPipeline));
+            var failEl = document.getElementById('pt-failed');
+            failEl.textContent = RECON.fmt(failed);
+            failEl.style.color = failed > 0 ? '#ff4444' : '#00ff41';
+
+            // Import rate from downloader state
+            var ds = data.downloader_state || {};
+            var rate = ds.imports_last_hour || 0;
+            RECON.set('pt-import-rate', RECON.fmt(rate));
+
+            // GPU
+            var gpu = data.gpu || {};
+            if (gpu.name) {
+                RECON.set('pt-gpu-util', gpu.utilization_gpu || '—');
+                RECON.set('pt-gpu-temp', gpu.temperature_gpu || '—');
+                var gpuPanel = document.getElementById('pt-gpu-panel');
+                gpuPanel.style.display = 'block';
+                document.getElementById('pt-gpu-detail').innerHTML =
+                    '<strong>' + gpu.name + '</strong> | VRAM: ' +
+                    RECON.fmt(parseInt(gpu.memory_used || 0)) + ' / ' + RECON.fmt(parseInt(gpu.memory_total || 0)) + ' MiB | ' +
+                    'Util: ' + (gpu.utilization_gpu || '?') + '% | ' +
+                    'Temp: ' + (gpu.temperature_gpu || '?') + '&deg;C';
+            } else {
+                RECON.set('pt-gpu-util', '—');
+                RECON.set('pt-gpu-temp', '—');
+                document.getElementById('pt-gpu-panel').style.display = 'none';
+            }
+
+            // Services
+            var svcs = data.services || {};
+            ['downloader', 'importer', 'transcoder', 'runner'].forEach(function(s) {
+                var el = document.getElementById('svc-' + s);
+                el.className = 'svc-dot ' + (svcs[s] === 'active' ? 'active' : svcs[s] === 'inactive' ? 'inactive' : 'unknown');
+            });
+
+            // Pipeline dirs
+            var dirs = data.pipeline_dirs || {};
+            var storageHtml = '';
+            var dirOrder = ['staging', 'completed', 'transcoded', 'failed'];
+            var dirLabels = {staging: 'Downloaded', completed: 'Awaiting Transcode', transcoded: 'Ready to Import', failed: 'Failed'};
+            var dirColors = {staging: '#b45309', completed: '#0284c7', transcoded: '#7c3aed', failed: '#dc2626'};
+            var totalVideos = 0;
+            dirOrder.forEach(function(d) {
+                var info = dirs[d] || {};
+                var videos = info.videos || 0;
+                var bytes = info.bytes || 0;
+                totalVideos += videos;
+                storageHtml += '<div class="flex-between" style="margin:4px 0;">' +
+                    '<span><span class="legend-dot" style="background:' + (dirColors[d] || '#555') + ';"></span>' + (dirLabels[d] || d) + '</span>' +
+                    '<span>' + videos + ' videos / ' + RECON.fmtBytes(bytes) + '</span></div>';
+            });
+            RECON.setHTML('pt-storage-content', storageHtml);
+
+            // Pipeline bar (using video counts)
+            var segments = dirOrder.map(function(d) {
+                return {status: d, count: (dirs[d] || {}).videos || 0, color: dirColors[d], label: dirLabels[d] || d};
+            });
+            RECON.setHTML('pt-pipeline-bar', RECON.progressBar(segments, totalVideos || 1));
+            RECON.setHTML('pt-pipeline-legend', RECON.progressLegend(segments));
+            RECON.set('pt-pipeline-summary', totalVideos + ' videos in pipeline');
+
+            // Errors
+            var errors = data.recent_errors || [];
+            var errPanel = document.getElementById('pt-errors-panel');
+            RECON.set('pt-error-count', errors.length);
+            if (errors.length > 0) {
+                errPanel.classList.add('has-errors');
+                var errHtml = '';
+                errors.forEach(function(e) {
+                    errHtml += '<div class="error-line">' + String(e).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</div>';
+                });
+                RECON.setHTML('pt-errors-content', errHtml);
+            } else {
+                errPanel.classList.remove('has-errors');
+            }
+        }).catch(function(err) {
+            console.error('PT dashboard error:', err);
+        });
+    }
+
+    function loadCharts() {
+        if (typeof ReconChart !== 'undefined') {
+            ReconChart.loadAndDraw('pt-chart', 'peertube',
+                ['published', 'backlog'], ['Published', 'Backlog'], 24);
+        }
+    }
+
+    document.addEventListener('DOMContentLoaded', function() {
+        RECON.startRefresh(loadPTDashboard, 30000);
+        loadCharts();
+        setInterval(loadCharts, 300000);
+    });
+})();
--- a/static/js/web-ingest.js
+++ b/static/js/web-ingest.js
@ -0,0 +1,193 @@
+/* RECON Web Ingest page JS */
+(function() {
+    'use strict';
+
+    window.showSection = function(name) {
+        document.getElementById('section-single').style.display = name === 'single' ? '' : 'none';
+        document.getElementById('section-crawl').style.display = name === 'crawl' ? '' : 'none';
+        document.getElementById('tab-single').className = 'btn' + (name === 'single' ? ' active' : '');
+        document.getElementById('tab-crawl').className = 'btn' + (name === 'crawl' ? ' active' : '');
+    };
+
+    window.doWebIngest = async function() {
+        var btn = document.getElementById('wi-btn');
+        var status = document.getElementById('wi-status');
+        var resultsDiv = document.getElementById('wi-results');
+        var urlText = document.getElementById('wi-urls').value.trim();
+        var category = document.getElementById('wi-category').value.trim() || 'Web';
+
+        if (!urlText) {
+            status.style.color = '#ff4444';
+            status.textContent = 'Enter at least one URL';
+            return;
+        }
+
+        var urls = urlText.split('\n').map(function(u) { return u.trim(); }).filter(function(u) { return u && !u.startsWith('#'); });
+        if (urls.length === 0) {
+            status.style.color = '#ff4444';
+            status.textContent = 'No valid URLs';
+            return;
+        }
+
+        btn.disabled = true;
+        status.style.color = '#ffa500';
+        resultsDiv.style.display = 'none';
+
+        if (urls.length === 1) {
+            status.textContent = 'Fetching and extracting...';
+            try {
+                var resp = await fetch('/api/ingest-url', {
+                    method: 'POST',
+                    headers: {'Content-Type': 'application/json'},
+                    body: JSON.stringify({ url: urls[0], category: category, process: true })
+                });
+                var data = await resp.json();
+                if (resp.ok || resp.status === 409) {
+                    var color = data.status === 'duplicate' ? '#888' : '#00ff41';
+                    status.style.color = color;
+                    status.textContent = data.status.toUpperCase() + ': ' + (data.title || urls[0]);
+                    resultsDiv.style.display = 'block';
+                    resultsDiv.innerHTML = '<span style="color:' + color + ';">' + data.status.toUpperCase() + '</span><br>' +
+                        '<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
+                        (data.page_count ? '<span class="text-dim">Pages: ' + data.page_count + '</span><br>' : '') +
+                        (data.title ? '<span class="text-dim">Title: ' + String(data.title).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</span><br>' : '') +
+                        (data.pipeline ? '<span style="color:#00ff41;">Pipeline: enriched ' + (data.pipeline.enriched || 0) + ', embedded ' + (data.pipeline.embedded || 0) + '</span>' : '');
+                } else {
+                    status.style.color = '#ff4444';
+                    status.textContent = data.error || 'Ingestion failed';
+                }
+            } catch (err) {
+                status.style.color = '#ff4444';
+                status.textContent = 'Network error: ' + err.message;
+            }
+        } else {
+            status.textContent = 'Processing ' + urls.length + ' URLs...';
+            try {
+                var resp = await fetch('/api/ingest-urls', {
+                    method: 'POST',
+                    headers: {'Content-Type': 'application/json'},
+                    body: JSON.stringify({ urls: urls, category: category, process: true })
+                });
+                var data = await resp.json();
+                if (resp.ok) {
+                    var s = data.summary;
+                    status.style.color = '#00ff41';
+                    var batchPipe = data.pipeline && data.pipeline.enriched ? ' | enriched: ' + data.pipeline.enriched + ', embedded: ' + data.pipeline.embedded : '';
+                    status.textContent = s.succeeded + ' new, ' + s.duplicates + ' dupes, ' + s.failed + ' failed' + batchPipe;
+                    resultsDiv.style.display = 'block';
+                    var html = '';
+                    for (var i = 0; i < data.results.length; i++) {
+                        var r = data.results[i];
+                        var c = r.status === 'failed' ? '#ff4444' : r.status === 'duplicate' ? '#888' : '#00ff41';
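+                        // Titles come from fetched pages, so escape them before building HTML
+                        var label = String(r.title || r.url).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
+                        html += '<div style="margin-bottom:4px;"><span style="color:' + c + ';">' +
+                            r.status.toUpperCase() + '</span> ' + label + '</div>';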
+                    }
+                    resultsDiv.innerHTML = html;
+                } else {
+                    status.style.color = '#ff4444';
+                    status.textContent = data.error || 'Batch ingestion failed';
+                }
+            } catch (err) {
+                status.style.color = '#ff4444';
+                status.textContent = 'Network error: ' + err.message;
+            }
+        }
+        btn.disabled = false;
+    };
+
+    window.doCrawl = async function(dryRun) {
+        var status = document.getElementById('crawl-status');
+        var resultsDiv = document.getElementById('crawl-results');
+        var url = document.getElementById('crawl-url').value.trim();
+        var category = document.getElementById('crawl-category').value.trim() || 'Web';
+        var maxPages = parseInt(document.getElementById('crawl-max-pages').value, 10) || 500;
+        var includeRaw = document.getElementById('crawl-include').value.trim();
+        var excludeRaw = document.getElementById('crawl-exclude').value.trim();
+
+        if (!url) {
+            status.style.color = '#ff4444';
+            status.textContent = 'Enter a site URL';
+            return;
+        }
+
+        var include = includeRaw ? includeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
+        var exclude = excludeRaw ? excludeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
+
+        var btnP = document.getElementById('crawl-preview-btn');
+        var btnC = document.getElementById('crawl-btn');
+        btnP.disabled = true;
+        btnC.disabled = true;
+        status.style.color = '#ffa500';
+        status.textContent = dryRun ? 'Discovering URLs...' : 'Starting crawl...';
+        resultsDiv.style.display = 'none';
+
+        try {
+            var body = { url: url, category: category, max_pages: maxPages, dry_run: dryRun };
+            if (include) body.include = include;
+            if (exclude) body.exclude = exclude;
+
+            var resp = await fetch('/api/crawl', {
+                method: 'POST',
+                headers: {'Content-Type': 'application/json'},
+                body: JSON.stringify(body)
+            });
+            var data = await resp.json();
+
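+            if (!resp.ok) {
+                // Surface server errors instead of reporting "0 URLs found" on a failed preview
+                status.style.color = '#ff4444';
+                status.textContent = data.error || 'Crawl failed';
+            } else if (dryRun) {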
+                var urls = data.urls || [];
+                status.style.color = '#00ff41';
+                status.textContent = urls.length + ' URLs found (' + (data.discovery_method || 'unknown') + ')';
+                resultsDiv.style.display = 'block';
+                var html = '<div style="color:#00ff41;margin-bottom:8px;">Discovery: ' + (data.discovery_method || 'unknown') + ' — ' + urls.length + ' URLs</div>';
+                urls.forEach(function(u, i) {
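+                    // Discovered URLs are untrusted input, so escape them before building HTML
+                    var safeUrl = String(u).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
+                    html += '<div class="text-muted">' + (i+1) + '. ' + safeUrl + '</div>';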
+                });
+                resultsDiv.innerHTML = html;
+            } else if (data.crawl_id) {
+                status.style.color = '#00ff41';
+                status.textContent = 'Crawl started — ID: ' + data.crawl_id;
+                resultsDiv.style.display = 'block';
+                resultsDiv.innerHTML = '<div style="color:#ffa500;">Crawl running in background...</div>' +
+                    '<div class="text-dim" style="margin-top:4px;">ID: ' + data.crawl_id + '</div>';
+                pollCrawl(data.crawl_id, resultsDiv);
+            } else {
+                status.style.color = '#ff4444';
+                status.textContent = data.error || 'Crawl failed';
+            }
+        } catch (err) {
+            status.style.color = '#ff4444';
+            status.textContent = 'Network error: ' + err.message;
+        }
+        btnP.disabled = false;
+        btnC.disabled = false;
+    };
+
+    function pollCrawl(crawlId, resultsDiv) {
+        var check = async function() {
+            try {
+                var resp = await fetch('/api/crawl/' + crawlId + '/status');
+                var data = await resp.json();
+                if (data.status === 'running') {
+                    var stageText = data.stage ? ' (' + data.stage + ')' : '';
+                    resultsDiv.innerHTML = '<div style="color:#ffa500;">Pipeline running' + stageText + '...</div>' +
+                        '<div class="text-dim">Site: ' + (data.site || '') + '</div>';
+                    setTimeout(check, 5000);
+                } else if (data.summary) {
+                    var s = data.summary;
+                    var pipeInfo = data.pipeline ? ' | Enriched: ' + (data.pipeline.enriched || 0) + ' | Embedded: ' + (data.pipeline.embedded || 0) : '';
+                    resultsDiv.innerHTML = '<div style="color:#00ff41;">Pipeline complete!</div>' +
+                        '<div class="text-dim" style="margin-top:4px;">New: ' + s.succeeded + ' | Duplicates: ' + s.duplicates + ' | Failed: ' + s.failed + ' | Total: ' + s.total + pipeInfo + '</div>';
+                    document.getElementById('crawl-status').style.color = '#00ff41';
+                    document.getElementById('crawl-status').textContent = 'Complete: ' + s.succeeded + ' new' + pipeInfo;
+                } else if (data.error) {
+                    resultsDiv.innerHTML = '<div style="color:#ff4444;">Crawl failed: ' + data.error + '</div>';
+                }
+            } catch (err) {
+                resultsDiv.innerHTML += '<div style="color:#ff4444;">Poll error: ' + err.message + '</div>';
+            }
+        };
+        setTimeout(check, 5000);
+    }
+
+    showSection('single');
+})();
--- a/sweep_gated.sh
+++ b/sweep_gated.sh
@ -0,0 +1,115 @@
+#!/usr/bin/env bash
+# sweep_gated.sh — Qdrant-gated sweep wrapper for Stream B.2 Phase 4
+# Runs recon.py pipeline sweep in bounded chunks with Qdrant health checks
+# between each invocation. Aborts cleanly if Qdrant becomes unreachable.
+
+set -euo pipefail
+
+QDRANT_URL="${QDRANT_URL:-http://192.168.1.150:6333/collections/recon_knowledge_hybrid}"
+BATCH_SIZE="${BATCH_SIZE:-500}"
+MAX_ENTRIES="${MAX_ENTRIES:-500}"
+PLAN_FILE="${PLAN_FILE:-/opt/recon/data/sweep/sweep_plan.json}"
+RECON_DIR="/opt/recon"
+# Checkpoint co-locates with plan file: plan.json -> plan_checkpoint.json
+CHECKPOINT_FILE="${PLAN_FILE%.json}_checkpoint.json"
+
+log() { echo "[$(date +%Y-%m-%dT%H:%M:%S)] $*"; }
+
+probe_qdrant() {
+    local resp
+    resp=$(curl -sf -o /dev/null -w '%{http_code}' --connect-timeout 5 --max-time 10 "$QDRANT_URL" 2>/dev/null) || true
+    if [ "$resp" = "200" ]; then
+        return 0
+    else
+        return 1
+    fi
+}
+
+report_progress() {
+    if [ -f "$CHECKPOINT_FILE" ]; then
+        # Pass the path as an argument so quotes in the filename cannot break the Python source
+        python3 -c "
+import json, sys
+with open(sys.argv[1]) as fh:
+    cp = json.load(fh)
+s = cp['stats']
+idx = cp['last_completed_index']
+print(f'  last_completed_index={idx}')
+print(f'  relocated={s[\"relocated\"]} rescued={s[\"rescued\"]} unclassified={s[\"unclassified_moved\"]}')
+print(f'  noop={s[\"no_op_marked\"]} dup={s[\"duplicates\"]} skip={s[\"skipped\"]} fail={s[\"failed\"]}')
+print(f'  qdrant_updated={s[\"qdrant_updated\"]}')
+" "$CHECKPOINT_FILE" 2>/dev/null || log "  (could not read checkpoint)"
+    else
+        log "  no checkpoint file at $CHECKPOINT_FILE"
+    fi
+}
+
+parse_processed() {
+    # Parse the sweep output to count total entries processed this iteration
+    python3 -c "
+import sys, re
+lines = sys.stdin.read()
+total = 0
+for key in ['Relocated', 'Rescued', 'Unclassified moved', 'No-op .marked.', 'Duplicates', 'Skipped', 'Failed']:
+    m = re.search(key + r':\s+(\d+)', lines)
+    if m:
+        total += int(m.group(1))
+print(total)
+" 2>/dev/null || echo "-1"
+}
+
+log "Plan file: $PLAN_FILE"
+log "Batch size: $BATCH_SIZE, Max entries per chunk: $MAX_ENTRIES"
+
+iteration=0
+
+while true; do
+    iteration=$((iteration + 1))
+    log "=== Iteration $iteration ==="
+
+    # Pre-flight Qdrant probe
+    log "Probing Qdrant at $QDRANT_URL ..."
+    if ! probe_qdrant; then
+        log "ABORT: Qdrant unreachable before iteration $iteration"
+        report_progress
+        exit 1
+    fi
+    log "Qdrant OK"
+
+    # Run sweep chunk
+    log "Running: recon.py pipeline sweep --execute --resume --batch-size $BATCH_SIZE --max-entries $MAX_ENTRIES --plan-file $PLAN_FILE"
+    set +e
+    output=$(cd "$RECON_DIR" && python3 recon.py pipeline sweep --execute --resume \
+        --batch-size "$BATCH_SIZE" --max-entries "$MAX_ENTRIES" --plan-file "$PLAN_FILE" 2>&1)
+    rc=$?
+    set -e
+
+    echo "$output"
+
+    if [ $rc -ne 0 ]; then
+        log "ABORT: recon.py exited with code $rc"
+        report_progress
+        exit 2
+    fi
+
+    # Check if sweep is done (all counters zero = nothing left to process)
+    processed=$(echo "$output" | parse_processed)
+
+    if [ "$processed" = "0" ]; then
+        log "Sweep complete — nothing left to process"
+        report_progress
+        exit 0
+    fi
+
+    log "Chunk processed $processed entries"
+
+    # Post-flight Qdrant probe
+    log "Post-flight Qdrant probe..."
+    if ! probe_qdrant; then
+        log "ABORT: Qdrant unreachable after iteration $iteration"
+        log "Last chunk may have filesystem/Qdrant drift — verify with: recon.py pipeline sweep --verify"
+        report_progress
+        exit 3
+    fi
+    log "Qdrant still healthy, continuing..."
+    report_progress
+    echo
+done
--- a/templates/base.html
+++ b/templates/base.html
@ -0,0 +1,39 @@
+<!DOCTYPE html>
+<html>
+<head>
+<meta charset="utf-8">
+<title>RECON // Aurora Intelligence Pipeline{% if page_title %} — {{ page_title }}{% endif %}</title>
+<link rel="stylesheet" href="/static/css/recon.css">
+</head>
+<body>
+<div class="header">
+    <div class="header-left"><h1><span id="heartbeat" class="heartbeat"></span>RECON</h1><span class="header-subtitle">AURORA INTELLIGENCE PIPELINE</span></div>
+    <div class="flex gap-16">
+        <div class="quick-stats">
+            <span>Docs: <span id="qs-docs">—</span></span>
+            <span>Vectors: <span id="qs-vectors">—</span></span>
+            <span>Pipeline: <span id="qs-pipeline">—</span></span>
+        </div>
+    </div>
+</div>
+<div class="nav-domain">
+    <a href="/"{% if domain == 'knowledge' %} class="active"{% endif %}>Knowledge</a>
+    <a href="/peertube"{% if domain == 'peertube' %} class="active"{% endif %}>PeerTube</a>
+    <a href="/search"{% if domain == 'search' %} class="active"{% endif %}>Search</a>
+    <a href="/settings/keys"{% if domain == 'settings' %} class="active"{% endif %}>Settings</a>
+</div>
+{% if subnav %}
+<div class="nav-sub">
+    {% for item in subnav %}
+    <a href="{{ item.href }}"{% if item.href == active_page %} class="active"{% endif %}>{{ item.label }}</a>
+    {% endfor %}
+</div>
+{% endif %}
+<div class="content" id="main">
+    {% block content %}{% endblock %}
+</div>
+<script src="/static/js/common.js"></script>
+<script>document.addEventListener('DOMContentLoaded', function() { RECON.loadQuickStats(); });</script>
+{% block scripts %}{% endblock %}
+</body>
+</html>
--- a/templates/knowledge/catalogue.html
+++ b/templates/knowledge/catalogue.html
@ -0,0 +1,53 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">Document Catalogue</h3>
+
+{% if sources %}
+<div class="mb-16">
+    <a href="/catalogue" class="btn{% if not current_source %} active{% endif %}" style="margin-right:4px;">All</a>
+    {% for s in sources %}
+    <a href="/catalogue?source={{ s|urlencode }}" class="btn{% if current_source == s %} active{% endif %}" style="margin-right:4px;">{{ s }}</a>
+    {% endfor %}
+</div>
+{% endif %}
+
+<div class="text-dim text-xs mb-16">
+    Showing {{ docs|length }}{% if total_count %} of {{ total_count }}{% endif %} documents
+    {% if current_source %} in <strong>{{ current_source }}</strong>{% endif %}
+    (page {{ page }} of {{ total_pages }})
+</div>
+
+<table>
+    <tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>
+    {% for d in docs %}
+    <tr>
+        <td>{{ d.filename or '?' }}</td>
+        <td>{{ d.source or '' }}</td>
+        <td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
+        <td>{{ d.pages_extracted or 0 }}</td>
+        <td>{{ d.concepts_extracted or 0 }}</td>
+        <td>{{ d.vectors_inserted or 0 }}</td>
+    </tr>
+    {% endfor %}
+</table>
+
+{% if total_pages > 1 %}
+<div class="pagination">
+    {% if page > 1 %}
+    <a href="/catalogue?page={{ page - 1 }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">&laquo;</a>
+    {% endif %}
+    {% for p in range(1, total_pages + 1) %}
+        {% if p == page %}
+        <span class="current">{{ p }}</span>
+        {% elif p <= 3 or p > total_pages - 3 or (p >= page - 2 and p <= page + 2) %}
+        <a href="/catalogue?page={{ p }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">{{ p }}</a>
+        {% elif p == 4 or p == total_pages - 3 %}
+        <span class="text-dim">...</span>
+        {% endif %}
+    {% endfor %}
+    {% if page < total_pages %}
+    <a href="/catalogue?page={{ page + 1 }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">&raquo;</a>
+    {% endif %}
+</div>
+{% endif %}
+{% endblock %}
--- a/templates/knowledge/dashboard.html
+++ b/templates/knowledge/dashboard.html
@ -0,0 +1,72 @@
+{% extends "base.html" %}
+{% block content %}
+<div id="kb-dashboard">
+    <div class="stat-grid">
+        <div class="stat-card"><div class="label">Catalogued</div><div class="value" id="kv-catalogued">—</div><div class="sublabel">total known documents</div></div>
+        <div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="kv-pipeline">—</div><div class="sublabel" id="kv-pipeline-sub">processing</div></div>
+        <div class="stat-card"><div class="label">Complete</div><div class="value" id="kv-complete">—</div><div class="sublabel">in Qdrant</div></div>
+        <div class="stat-card"><div class="label">Failed</div><div class="value" id="kv-failed">—</div><div class="sublabel">&nbsp;</div></div>
+    </div>
+
+    <div class="mb-24">
+        <div class="flex-between mb-16" style="margin-bottom:4px;font-size:11px;color:#888;">
+            <span id="progress-label">Pipeline Progress</span>
+            <span id="progress-pct"></span>
+        </div>
+        <div id="progress-bar" class="pipeline-bar"></div>
+        <div id="progress-legend" class="pipeline-legend"></div>
+    </div>
+
+    <div class="stat-grid grid-3">
+        <div class="stat-card"><div class="label">Concepts</div><div class="value" id="kv-concepts">—</div><div class="sublabel">extracted</div></div>
+        <div class="stat-card"><div class="label">Vectors</div><div class="value" id="kv-vectors">—</div><div class="sublabel">in Qdrant</div></div>
+        <div class="stat-card"><div class="label">Pages</div><div class="value" id="kv-pages">—</div><div class="sublabel">processed</div></div>
+    </div>
+
+    <div id="pipeline-activity" class="panel" style="display:none;">
+        <h3 style="color:#ffa500;font-size:13px;margin-bottom:8px;">Pipeline Activity</h3>
+        <div id="activity-content" style="font-size:12px;color:#ccc;"></div>
+    </div>
+
+    <div id="qdrant-health" class="panel" style="padding:10px 16px;font-size:12px;color:#888;">
+        Qdrant: <span id="qdrant-status">checking...</span>
+    </div>
+
+    <div id="kb-chart-container" class="panel" style="display:none;">
+        <h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
+        <canvas id="kb-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
+    </div>
+
+    <h3 class="section-title" id="sources-toggle" style="cursor:pointer;user-select:none;"><span id="sources-arrow">&#9654;</span> Sources</h3>
+    <table>
+        <thead id="sources-thead" style="display:none;"><tr><th>Source</th><th>Type</th><th>Catalogued</th><th>Complete</th><th>In Pipeline</th><th>Progress</th><th>Concepts</th><th>Vectors</th></tr></thead>
+        <tbody id="sources-tbody" style="display:none;"><tr><td colspan="8" class="text-dim">Loading...</td></tr></tbody>
+        <tfoot id="sources-tfoot"></tfoot>
+    </table>
+
+    <div class="grid-2 mt-24">
+        <div>
+            <h3 class="section-title">Domain Distribution</h3>
+            <div id="domain-bars" class="text-small">Loading...</div>
+        </div>
+        <div>
+            <h3 class="section-title">Knowledge Type</h3>
+            <div id="knowledge-type-bars" class="text-small">Loading...</div>
+            <div id="knowledge-type-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
+            <h3 class="section-title" style="margin-top:16px;">Complexity</h3>
+            <div id="complexity-bars" class="text-small">Loading...</div>
+            <div id="complexity-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
+        </div>
+    </div>
+
+    <h3 class="section-title mt-24">Recently Completed</h3>
+    <table>
+        <thead><tr><th>Title</th><th>Type</th><th>Concepts</th><th>Vectors</th></tr></thead>
+        <tbody id="recent-tbody"><tr><td colspan="4" class="text-dim">Loading...</td></tr></tbody>
+    </table>
+</div>
+{% endblock %}
+{% block scripts %}
+<script src="/static/js/charts.js"></script>
+<script src="/static/js/dashboard.js"></script>
+{% endblock %}
--- a/templates/knowledge/failures.html
+++ b/templates/knowledge/failures.html
@ -0,0 +1,56 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>
+{% if not failures %}
+<p class="text-dim">No failures.</p>
+{% else %}
+<div style="margin-bottom:16px;">
+    <button class="btn" id="retry-all-btn" onclick="retryAll()">Retry All ({{ failures|length }})</button>
+    <span id="retry-all-status" style="margin-left:12px;font-size:12px;"></span>
+</div>
+<table>
+    <tr><th>Filename</th><th>Error</th><th>Discovered</th><th>Retries</th><th>Actions</th></tr>
+    {% for f in failures %}
+    <tr>
+        <td>{{ f.filename or '?' }}</td>
+        <td style="color:#ff4444;font-size:11px;">{{ (f.error_message or 'unknown')[:100] }}</td>
+        <td class="text-dim text-xs">{{ f.discovered_at or '' }}</td>
+        <td>{{ f.retry_count or 0 }}</td>
+        <td>
+            <form method="post" action="/api/retry/{{ f.hash }}" style="display:inline;">
+                <button class="btn" type="submit">Retry</button>
+            </form>
+        </td>
+    </tr>
+    {% endfor %}
+</table>
+{% endif %}
+{% endblock %}
+{% block scripts %}
+<script>
+async function retryAll() {
+    var btn = document.getElementById('retry-all-btn');
+    var status = document.getElementById('retry-all-status');
+    if (!confirm('Retry all {{ failures|length }} failed documents?')) return;
+    btn.disabled = true;
+    status.style.color = '#ffa500';
+    status.textContent = 'Retrying...';
+    try {
+        var resp = await fetch('/api/retry-all', {method: 'POST'});
+        var data = await resp.json();
+        if (resp.ok) {
+            status.style.color = '#00ff41';
+            status.textContent = 'Retried ' + data.count + ' documents';
+            setTimeout(function() { location.reload(); }, 2000);
+        } else {
+            status.style.color = '#ff4444';
+            status.textContent = data.error || 'Failed';
+        }
+    } catch(e) {
+        status.style.color = '#ff4444';
+        status.textContent = 'Error: ' + e.message;
+    }
+    btn.disabled = false;
+}
+</script>
+{% endblock %}
--- a/templates/knowledge/upload.html
+++ b/templates/knowledge/upload.html
@ -0,0 +1,83 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">Upload PDF</h3>
+<div class="panel">
+    <form id="upload-form" enctype="multipart/form-data">
+        <div class="mb-16">
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">PDF File</label>
+            <input type="file" name="file" accept=".pdf" id="upload-file"
+                style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
+        </div>
+        <div class="mb-16">
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
+            <input type="text" name="category" id="upload-category" list="cat-list" class="search-box"
+                placeholder="Select or type a category..." style="margin-bottom:0;">
+            <datalist id="cat-list">{{ options_html|safe }}</datalist>
+        </div>
+        <button type="submit" class="btn" id="upload-btn">Upload</button>
+        <span id="upload-status" style="margin-left:12px;font-size:12px;"></span>
+    </form>
+</div>
+<div id="upload-result" style="display:none;" class="panel"></div>
+
+<h3 class="section-title">Recent Documents</h3>
+<table>
+    <tr><th>Filename</th><th>Source</th><th>Status</th></tr>
+    {% for d in recent %}
+    <tr>
+        <td>{{ d.filename or '?' }}</td>
+        <td>{{ d.source or '' }}</td>
+        <td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
+    </tr>
+    {% endfor %}
+</table>
+{% endblock %}
+{% block scripts %}
+<script>
+document.getElementById('upload-form').addEventListener('submit', async function(e) {
+    e.preventDefault();
+    var btn = document.getElementById('upload-btn');
+    var status = document.getElementById('upload-status');
+    var result = document.getElementById('upload-result');
+    var fileInput = document.getElementById('upload-file');
+    var category = document.getElementById('upload-category').value;
+
+    if (!fileInput.files.length) {
+        status.style.color = '#ff4444';
+        status.textContent = 'No file selected';
+        return;
+    }
+
+    btn.disabled = true;
+    status.style.color = '#ffa500';
+    status.textContent = 'Uploading...';
+    result.style.display = 'none';
+
+    var formData = new FormData();
+    formData.append('file', fileInput.files[0]);
+    formData.append('category', category);
+
+    try {
+        var resp = await fetch('/api/upload', { method: 'POST', body: formData });
+        var data = await resp.json();
+        if (resp.ok) {
+            status.style.color = '#00ff41';
+            status.textContent = 'Upload successful';
+            result.style.display = 'block';
+            result.innerHTML = '<span style="color:#00ff41;">Queued for processing</span><br>' +
+                '<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
+                '<span class="text-dim">File: ' + data.filename + '</span><br>' +
+                '<span class="text-dim">Category: ' + data.source + '/' + data.category + '</span>';
+            fileInput.value = '';
+        } else {
+            status.style.color = '#ff4444';
+            status.textContent = data.error || 'Upload failed';
+        }
+    } catch (err) {
+        status.style.color = '#ff4444';
+        status.textContent = 'Network error: ' + err.message;
+    }
+    btn.disabled = false;
+});
+</script>
+{% endblock %}
--- a/templates/knowledge/web_ingest.html
+++ b/templates/knowledge/web_ingest.html
@ -0,0 +1,76 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">Web Ingest</h3>
+<div style="margin-bottom:8px;">
+    <a href="#single" class="btn active" onclick="showSection('single')" id="tab-single">Single/Batch URL</a>
+    <a href="#crawl" class="btn" onclick="showSection('crawl')" id="tab-crawl">Site Crawl</a>
+</div>
+
+<div id="section-single">
+<div class="panel">
+    <div class="mb-16">
+        <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">URL(s) — one per line for batch</label>
+        <textarea id="wi-urls" class="search-box" rows="4" placeholder="https://example.com/article" style="resize:vertical;margin-bottom:0;"></textarea>
+    </div>
+    <div class="mb-16">
+        <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
+        <input type="text" id="wi-category" list="wi-cat-list" class="search-box" value="Web"
+            placeholder="Category..." style="margin-bottom:0;">
+        <datalist id="wi-cat-list">{{ options_html|safe }}</datalist>
+    </div>
+    <button class="btn" id="wi-btn" onclick="doWebIngest()">Ingest</button>
+    <span id="wi-status" style="margin-left:12px;font-size:12px;"></span>
+</div>
+<div id="wi-results" class="panel" style="display:none;max-height:300px;overflow-y:auto;"></div>
+</div>
+
+<div id="section-crawl" style="display:none;">
+<div class="panel">
+    <div class="mb-16">
+        <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Site URL</label>
+        <input type="text" id="crawl-url" class="search-box" placeholder="https://example.com" style="margin-bottom:0;">
+    </div>
+    <div class="grid-2 mb-16">
+        <div>
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
+            <input type="text" id="crawl-category" list="wi-cat-list" class="search-box" value="Web" style="margin-bottom:0;">
+        </div>
+        <div>
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Max Pages</label>
+            <input type="number" id="crawl-max-pages" class="search-box" value="500" min="1" max="5000" style="margin-bottom:0;">
+        </div>
+    </div>
+    <div class="grid-2 mb-16">
+        <div>
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Include Paths (comma-separated)</label>
+            <input type="text" id="crawl-include" class="search-box" placeholder="/docs/, /blog/" style="margin-bottom:0;">
+        </div>
+        <div>
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Exclude Paths (comma-separated)</label>
+            <input type="text" id="crawl-exclude" class="search-box" placeholder="/search, /login" style="margin-bottom:0;">
+        </div>
+    </div>
+    <button class="btn" id="crawl-preview-btn" onclick="doCrawl(true)">Preview</button>
+    <button class="btn" id="crawl-btn" onclick="doCrawl(false)" style="margin-left:8px;">Crawl &amp; Ingest</button>
+    <span id="crawl-status" style="margin-left:12px;font-size:12px;"></span>
+</div>
+<div id="crawl-results" class="panel" style="display:none;max-height:400px;overflow-y:auto;font-size:12px;"></div>
+</div>
+
+<h3 class="section-title mt-24">Recent Web Ingestions</h3>
+<table>
+    <tr><th>Title</th><th>Source/Category</th><th>Status</th><th>Pages</th><th>Concepts</th></tr>
+    {% for d in web_docs %}
+    <tr>
+        <td title="{{ d.path or '' }}" style="max-width:400px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;">{{ d.book_title or d.filename or '?' }}</td>
+        <td>{{ d.source or '' }}/{{ d.category or '' }}</td>
+        <td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
+        <td>{{ d.pages_extracted or 0 }}</td>
+        <td>{{ d.concepts_extracted or 0 }}</td>
+    </tr>
+    {% endfor %}
+</table>
+{% endblock %}
+{% block scripts %}
+<script src="/static/js/web-ingest.js"></script>
+{% endblock %}
--- a/templates/peertube/channels.html
+++ b/templates/peertube/channels.html
@ -0,0 +1,53 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">PeerTube Channels</h3>
+
+<div class="stat-grid" id="pt-stats" style="margin-bottom:24px;">
+    <div class="stat-card"><div class="value" id="pt-total-ch">—</div><div class="label">Channels</div></div>
+    <div class="stat-card"><div class="value" id="pt-total-vid">—</div><div class="label">Videos</div></div>
+    <div class="stat-card"><div class="value" id="pt-dl-status">—</div><div class="label">Downloader</div></div>
+</div>
+
+<div class="panel">
+    <div class="flex gap-8" style="flex-wrap:wrap;align-items:flex-end;margin-bottom:12px;">
+        <div style="flex:1;min-width:250px;">
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">YouTube URL</label>
+            <input type="text" id="pt-yt-url" class="search-box" placeholder="https://www.youtube.com/@ChannelName" style="margin-bottom:0;width:100%;">
+        </div>
+        <div style="min-width:150px;">
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
+            <input type="text" id="pt-category" list="pt-cat-list" class="search-box" placeholder="e.g. OPSEC/Privacy" style="margin-bottom:0;width:100%;">
+            <datalist id="pt-cat-list"></datalist>
+        </div>
+        <div style="min-width:60px;">
+            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Priority</label>
+            <select id="pt-priority" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px 10px;font-family:inherit;font-size:12px;width:100%;">
+                <option value="M">M</option>
+                <option value="H">H</option>
+                <option value="L">L</option>
+            </select>
+        </div>
+        <button class="btn" id="pt-add-btn" onclick="addChannel()">Add Channel</button>
+    </div>
+    <div id="pt-feedback" style="font-size:12px;min-height:18px;"></div>
+</div>
+
+<div style="background:#111;border:1px solid #222;overflow-x:auto;">
+    <table style="width:100%;border-collapse:collapse;font-size:12px;" id="pt-channel-table">
+        <thead>
+            <tr style="border-bottom:1px solid #222;">
+                <th style="text-align:left;padding:10px;">Channel</th>
+                <th style="text-align:center;padding:10px;">Videos</th>
+                <th style="text-align:left;padding:10px;">Category</th>
+                <th style="text-align:center;padding:10px;">Pri</th>
+                <th style="text-align:center;padding:10px;">Status</th>
+                <th style="text-align:center;padding:10px;width:60px;"></th>
+            </tr>
+        </thead>
+        <tbody id="pt-channel-tbody"><tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">Loading...</td></tr></tbody>
+    </table>
+</div>
+{% endblock %}
+{% block scripts %}
+<script src="/static/js/channels.js"></script>
+{% endblock %}
--- a/templates/peertube/dashboard.html
+++ b/templates/peertube/dashboard.html
@ -0,0 +1,53 @@
+{% extends "base.html" %}
+{% block content %}
+<div id="pt-dashboard">
+    <div class="stat-grid" style="grid-template-columns:repeat(6, 1fr);">
+        <div class="stat-card"><div class="label">Published</div><div class="value" id="pt-published">—</div></div>
+        <div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="pt-in-pipeline">—</div></div>
+        <div class="stat-card"><div class="label">Failed</div><div class="value" id="pt-failed">—</div></div>
+        <div class="stat-card"><div class="label">Import Rate</div><div class="value" id="pt-import-rate">—</div><div class="sublabel">/hour</div></div>
+        <div class="stat-card"><div class="label">GPU Util</div><div class="value" id="pt-gpu-util">—</div><div class="sublabel">%</div></div>
+        <div class="stat-card"><div class="label">GPU Temp</div><div class="value" id="pt-gpu-temp">—</div><div class="sublabel">&deg;C</div></div>
+    </div>
+
+    <div class="mb-24">
+        <div class="flex-between" style="margin-bottom:4px;font-size:11px;color:#888;">
+            <span>Pipeline Flow</span>
+            <span id="pt-pipeline-summary"></span>
+        </div>
+        <div id="pt-pipeline-bar" class="pipeline-bar"></div>
+        <div id="pt-pipeline-legend" class="pipeline-legend"></div>
+    </div>
+
+    <div class="svc-row">
+        <div class="svc-item"><span class="svc-dot unknown" id="svc-downloader"></span>Downloader</div>
+        <div class="svc-item"><span class="svc-dot unknown" id="svc-importer"></span>Importer</div>
+        <div class="svc-item"><span class="svc-dot unknown" id="svc-transcoder"></span>Transcoder</div>
+        <div class="svc-item"><span class="svc-dot unknown" id="svc-runner"></span>Runner</div>
+    </div>
+
+    <div id="pt-gpu-panel" class="panel" style="display:none;">
+        <h3 class="section-title" style="margin-bottom:8px;">GPU Status</h3>
+        <div id="pt-gpu-detail" class="text-small text-muted"></div>
+    </div>
+
+    <div id="pt-chart-container" class="panel" style="display:none;">
+        <h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
+        <canvas id="pt-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
+    </div>
+
+    <div id="pt-storage" class="panel">
+        <h3 class="section-title" style="margin-bottom:12px;">Pipeline Storage</h3>
+        <div id="pt-storage-content" class="text-small text-muted">Loading...</div>
+    </div>
+
+    <details id="pt-errors-panel" class="errors-panel panel">
+        <summary>Recent Errors (<span id="pt-error-count">0</span>)</summary>
+        <div id="pt-errors-content" style="margin-top:8px;"></div>
+    </details>
+</div>
+{% endblock %}
+{% block scripts %}
+<script src="/static/js/charts.js"></script>
+<script src="/static/js/peertube.js"></script>
+{% endblock %}
--- a/templates/search.html
+++ b/templates/search.html
@ -0,0 +1,41 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">Semantic Search</h3>
+<form method="get" action="/search">
+    <input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." value="{{ query or '' }}" autofocus>
+</form>
+
+{% if not query %}
+<p class="text-dim text-small" style="margin-top:8px;">Enter a query to search across all embedded concepts.</p>
+{% elif results is defined %}
+<p class="text-dim text-small mb-16">{{ results|length }} results for: <strong class="text-green">{{ query }}</strong></p>
+
+{% for r in results %}
+<div class="result">
+    <span class="score">{{ '%.4f'|format(r.score) }}</span>
+    <div class="title">{{ r.title }}</div>
+    <div class="meta">
+        {{ r.citation }}
+        {% if r.download_url %}
+            {% if r.source_type == 'web' or (r.download_url.startswith('http') and 'files.echo6.co' not in r.download_url) %}
+            | <a href="{{ r.download_url }}" target="_blank" style="color:#00bfff;text-decoration:none;">Web</a>
+            {% else %}
+            | <a href="{{ r.download_url }}" style="color:#00bfff;text-decoration:none;">PDF</a>
+            {% endif %}
+        {% endif %}
+        {% if r.knowledge_type %}| {{ r.knowledge_type }}{% endif %}
+        {% if r.complexity %}/ {{ r.complexity }}{% endif %}
+    </div>
+    <div class="content-text">{{ r.summary }}</div>
+    <div style="margin-top:6px;">
+        {% for d in r.domains %}
+        <span class="domain-tag">{{ d }}</span>
+        {% endfor %}
+    </div>
+</div>
+{% endfor %}
+
+{% elif error %}
+<p style="color:#ff4444;">Search error: {{ error }}</p>
+{% endif %}
+{% endblock %}
--- a/templates/settings/cookies.html
+++ b/templates/settings/cookies.html
@ -0,0 +1,94 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">YouTube Cookies</h3>
+<div class="panel">
+    <div id="cookie-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading cookie status...</div>
+    <div class="mb-16">
+        <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Cookies.txt File (Netscape format)</label>
+        <input type="file" id="cookie-file" accept=".txt"
+            style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
+    </div>
+    <button class="btn" id="cookie-btn" onclick="uploadCookies()">Upload Cookies</button>
+    <span id="cookie-upload-status" style="margin-left:12px;font-size:12px;"></span>
+    <div id="cookie-result" style="display:none;background:#0a0a0a;border:1px solid #222;padding:12px;margin-top:16px;font-size:11px;white-space:pre-wrap;color:#888;max-height:200px;overflow-y:auto;"></div>
+</div>
+{% endblock %}
+{% block scripts %}
+<script>
+async function loadCookieStatus() {
+    try {
+        var resp = await fetch('/api/cookies/status');
+        var data = await resp.json();
+        if (resp.ok) {
+            var age = data.age_hours;
+            var ageStr, ageColor;
+            if (age < 24) {
+                ageStr = Math.round(age) + ' hours ago';
+                ageColor = '#00ff41';
+            } else {
+                var days = Math.round(age / 24);
+                ageStr = days + ' days ago';
+                ageColor = days > 14 ? '#ff4444' : days > 7 ? '#ffa500' : '#00ff41';
+            }
+            var html = '<span style="color:' + ageColor + ';">Last updated: ' + ageStr + '</span>';
+            if (data.is_stale) {
+                html += ' <span style="color:#ff4444;font-weight:bold;">[STALE - cookies likely expired]</span>';
+            }
+            if (data.recent_rate_limits > 0) {
+                html += '<br><span style="color:#ffa500;">YouTube rate limits in last 30min: ' + data.recent_rate_limits + '</span>';
+            }
+            html += '<br><span class="text-faint">Downloader: ' + (data.downloader_active ? 'active' : 'stopped') + '</span>';
+            document.getElementById('cookie-status').innerHTML = html;
+        } else {
+            document.getElementById('cookie-status').innerHTML = '<span class="text-red">Could not check cookie status</span>';
+        }
+    } catch(e) {
+        document.getElementById('cookie-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
+    }
+}
+
+async function uploadCookies() {
+    var fileInput = document.getElementById('cookie-file');
+    var btn = document.getElementById('cookie-btn');
+    var status = document.getElementById('cookie-upload-status');
+    var result = document.getElementById('cookie-result');
+    if (!fileInput.files.length) {
+        status.style.color = '#ff4444';
+        status.textContent = 'No file selected';
+        return;
+    }
+    btn.disabled = true;
+    status.style.color = '#ffa500';
+    status.textContent = 'Uploading and testing cookies...';
+    result.style.display = 'none';
+    var formData = new FormData();
+    formData.append('file', fileInput.files[0]);
+    try {
+        var resp = await fetch('/api/cookies/upload', { method: 'POST', body: formData });
+        var data = await resp.json();
+        if (data.ok) {
+            status.style.color = '#00ff41';
+            status.textContent = 'Cookies updated and verified';
+            result.style.display = 'block';
+            result.style.borderColor = '#00ff41';
+            result.innerHTML = '<span style="color:#00ff41;">SUCCESS</span><br>' + (data.test_output || '') + '<br>Data lines: ' + data.data_lines;
+            loadCookieStatus();
+        } else {
+            status.style.color = data.error ? '#ff4444' : '#ffa500';
+            status.textContent = data.error || data.message || 'Upload issue';
+            if (data.test_output) {
+                result.style.display = 'block';
+                result.style.borderColor = '#ff4444';
+                result.textContent = data.test_output;
+            }
+        }
+    } catch(e) {
+        status.style.color = '#ff4444';
+        status.textContent = 'Network error: ' + e.message;
+    }
+    btn.disabled = false;
+}
+
+loadCookieStatus();
+</script>
+{% endblock %}
--- a/templates/settings/health.html
+++ b/templates/settings/health.html
@ -0,0 +1,68 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">Service Health</h3>
+
+<div id="health-grid" class="stat-grid" style="grid-template-columns:repeat(auto-fit, minmax(250px, 1fr));">
+    <div class="stat-card">
+        <div class="label">Qdrant</div>
+        <div class="value text-small" id="h-qdrant"><span class="svc-dot unknown"></span>Checking...</div>
+    </div>
+    <div class="stat-card">
+        <div class="label">TEI Embeddings</div>
+        <div class="value text-small" id="h-tei"><span class="svc-dot unknown"></span>Checking...</div>
+    </div>
+    <div class="stat-card">
+        <div class="label">NFS Mount</div>
+        <div class="value text-small" id="h-nfs"><span class="svc-dot unknown"></span>Checking...</div>
+    </div>
+    <div class="stat-card">
+        <div class="label">Gemini API</div>
+        <div class="value text-small" id="h-gemini"><span class="svc-dot unknown"></span>Checking...</div>
+    </div>
+</div>
+
+<h3 class="section-title mt-24">Pipeline Status</h3>
+<div id="h-pipeline" class="panel text-small text-dim">Loading...</div>
+{% endblock %}
+{% block scripts %}
+<script>
+async function loadHealth() {
+    try {
+        var resp = await fetch('/api/health');
+        var data = await resp.json();
+        var c = data.components || {};
+
+        function dot(status) {
+            var cls = (status === 'up' || status === 'configured') ? 'active' : 'inactive';
+            return '<span class="svc-dot ' + cls + '"></span>';
+        }
+
+        var q = c.qdrant || {};
+        document.getElementById('h-qdrant').innerHTML = dot(q.status) + (q.status === 'up' ? 'Online — ' + RECON.fmt(q.vectors) + ' vectors' : 'Offline' + (q.error ? ' — ' + q.error : ''));
+
+        var t = c.tei || {};
+        document.getElementById('h-tei').innerHTML = dot(t.status) + (t.status === 'up' ? 'Online' : 'Offline' + (t.error ? ' — ' + t.error : ''));
+
+        var n = c.nfs || {};
+        document.getElementById('h-nfs').innerHTML = dot(n.status) + (n.status === 'up' ? 'Mounted' : 'Not mounted');
+
+        var g = c.gemini || {};
+        document.getElementById('h-gemini').innerHTML = dot(g.status === 'configured' ? 'up' : 'down') + (g.status === 'configured' ? g.keys + ' keys configured' : 'No keys');
+
+        // Pipeline
+        var p = data.pipeline || {};
+        var html = '';
+        Object.keys(p).forEach(function(k) {
+            html += '<div style="margin:4px 0;"><span class="status status-' + k + '">' + k + '</span>: ' + p[k] + '</div>';
+        });
+        document.getElementById('h-pipeline').innerHTML = html || '<span class="text-dim">No pipeline data</span>';
+    } catch(e) {
+        document.getElementById('h-qdrant').innerHTML = '<span class="svc-dot inactive"></span>Error: ' + e.message;
+    }
+}
+
+document.addEventListener('DOMContentLoaded', function() {
+    RECON.startRefresh(loadHealth, 30000);
+});
+</script>
+{% endblock %}
--- a/templates/settings/keys.html
+++ b/templates/settings/keys.html
@ -0,0 +1,137 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">API Keys</h3>
+<div style="margin-bottom:20px;">
+    <button class="btn" onclick="validateAll()" id="btn-validate">Validate All</button>
+    <button class="btn" onclick="reloadKeys()" style="margin-left:8px;">Reload from .env</button>
+    <button class="btn btn-warn" onclick="restartService()" style="margin-left:8px;">Restart Service</button>
+    <span id="validate-status" style="margin-left:12px;color:#666;font-size:12px;"></span>
+</div>
+<table id="keys-table">
+    <tr><th>#</th><th>Key</th><th>Status</th><th>Calls</th><th>Errors</th><th>Last Used</th><th>Actions</th></tr>
+    {% for k in keys_data %}
+    <tr id="key-row-{{ k.index }}">
+        <td>{{ k.index + 1 }}</td>
+        <td class="mono text-small">{{ k.masked }}</td>
+        <td>
+            {% if k.valid is true %}
+            <span class="text-green">Valid</span>
+            {% elif k.valid is false %}
+            <span class="text-red">Invalid</span>
+            {% else %}
+            <span class="text-dim">&mdash;</span>
+            {% endif %}
+        </td>
+        <td>{{ k.calls }}</td>
+        <td class="{% if k.errors %}text-red{% else %}text-muted{% endif %}">{{ k.errors }}</td>
+        <td class="text-dim text-xs">{% if k.last_used %}{{ k.last_used }}{% else %}&mdash;{% endif %}</td>
+        <td>
+            <button class="btn text-xs" onclick="validateKey({{ k.index }})">Test</button>
+            <button class="btn btn-danger text-xs" onclick="removeKey({{ k.index }})">Remove</button>
+        </td>
+    </tr>
+    {% endfor %}
+</table>
+
+<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
+    <h4 class="text-muted" style="margin-bottom:12px;">Add Key</h4>
+    <div class="flex gap-8" style="align-items:center;">
+        <input type="text" id="new-key" placeholder="Paste Gemini API key..."
+            style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
+        <button class="btn" onclick="addKey()">Add</button>
+    </div>
+    <div id="add-result" style="margin-top:8px;font-size:12px;"></div>
+</div>
+
+<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
+    <h4 class="text-muted" style="margin-bottom:12px;">Replace Key</h4>
+    <div class="flex gap-8" style="align-items:center;">
+        <input type="number" id="replace-index" placeholder="#" min="1" max="9"
+            style="width:50px;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px;border-radius:4px;text-align:center;">
+        <input type="text" id="replace-key" placeholder="New Gemini API key..."
+            style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
+        <button class="btn" onclick="replaceKey()">Replace</button>
+    </div>
+    <div id="replace-result" style="margin-top:8px;font-size:12px;"></div>
+</div>
+{% endblock %}
+{% block scripts %}
+<script>
+async function validateAll() {
+    document.getElementById('btn-validate').disabled = true;
+    document.getElementById('validate-status').textContent = 'Validating...';
+    try {
+        var r = await fetch('/api/keys/validate', {method:'POST'});
+        var data = await r.json();
+        document.getElementById('validate-status').textContent = 'Done — ' + data.results.filter(function(r){return r.valid;}).length + '/' + data.results.length + ' valid';
+        setTimeout(function() { location.reload(); }, 1000);
+    } catch(e) {
+        document.getElementById('validate-status').textContent = 'Error: ' + e;
+    }
+    document.getElementById('btn-validate').disabled = false;
+}
+
+async function validateKey(idx) {
+    try {
+        var r = await fetch('/api/keys/' + idx + '/validate', {method:'POST'});
+        var data = await r.json();
+        alert('Key ' + (idx+1) + ': ' + data.message);
+        location.reload();
+    } catch(e) { alert('Error: ' + e); }
+}
+
+async function removeKey(idx) {
+    if (!confirm('Remove key ' + (idx+1) + '? Pipeline needs at least 1 key.')) return;
+    try {
+        var r = await fetch('/api/keys/' + idx, {method:'DELETE'});
+        var data = await r.json();
+        if (data.error) { alert(data.error); return; }
+        location.reload();
+    } catch(e) { alert('Error: ' + e); }
+}
+
+async function addKey() {
+    var key = document.getElementById('new-key').value.trim();
+    if (!key) return;
+    try {
+        var r = await fetch('/api/keys', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
+        var data = await r.json();
+        if (data.error) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
+        document.getElementById('add-result').innerHTML = '<span class="text-green">Added at position ' + (data.index+1) + '</span>';
+        setTimeout(function() { location.reload(); }, 1000);
+    } catch(e) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
+}
+
+async function replaceKey() {
+    var idx = parseInt(document.getElementById('replace-index').value) - 1;
+    var key = document.getElementById('replace-key').value.trim();
+    if (isNaN(idx) || !key) return;
+    try {
+        var r = await fetch('/api/keys/' + idx, {method:'PUT', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
+        var data = await r.json();
+        if (data.error) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
+        document.getElementById('replace-result').innerHTML = '<span class="text-green">Replaced key ' + (idx+1) + '</span>';
+        setTimeout(function() { location.reload(); }, 1000);
+    } catch(e) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
+}
+
+async function restartService() {
+    if (!confirm('Restart RECON service? Pipeline will pause for ~10 seconds.')) return;
+    document.getElementById('validate-status').textContent = 'Restarting...';
+    try {
+        await fetch('/api/service/restart', {method:'POST'});
+    } catch(e) {}
+    document.getElementById('validate-status').innerHTML = '<span style="color:#ff8800;">Restarting... page will reload in 30s</span>';
+    setTimeout(function() { location.reload(); }, 30000);
+}
+
+async function reloadKeys() {
+    try {
+        var r = await fetch('/api/keys/reload', {method:'POST'});
+        var data = await r.json();
+        alert('Reloaded ' + data.count + ' key(s) from .env');
+        location.reload();
+    } catch(e) { alert('Error: ' + e); }
+}
+</script>
+{% endblock %}
--- a/templates/settings/vpn.html
+++ b/templates/settings/vpn.html
@ -0,0 +1,97 @@
+{% extends "base.html" %}
+{% block content %}
+<h3 class="section-title mb-16">NordVPN</h3>
+<div class="panel">
+    <div id="vpn-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading VPN status...</div>
+    <div class="flex gap-8" style="flex-wrap:wrap;margin-bottom:12px;">
+        <button class="btn" onclick="vpnRotate()" id="vpn-rotate-btn">Rotate</button>
+        <button class="btn" onclick="vpnDisconnect()" id="vpn-disconnect-btn">Disconnect</button>
+        <select id="vpn-country" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;font-family:inherit;font-size:12px;">
+            <option value="United_States">United States</option>
+            <option value="Canada">Canada</option>
+            <option value="United_Kingdom">United Kingdom</option>
+            <option value="Germany">Germany</option>
+            <option value="Netherlands">Netherlands</option>
+            <option value="Sweden">Sweden</option>
+        </select>
+        <button class="btn" onclick="vpnConnect()" id="vpn-connect-btn">Connect</button>
+    </div>
+    <span id="vpn-action-status" style="font-size:12px;"></span>
+    <details style="margin-top:16px;">
+        <summary class="text-faint" style="cursor:pointer;font-size:11px;">Setup (one-time)</summary>
+        <div style="margin-top:8px;">
+            <input type="password" id="vpn-token" placeholder="NordVPN token"
+                style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;width:300px;font-family:inherit;font-size:12px;">
+            <button class="btn" onclick="vpnLogin()">Login</button>
+            <span id="vpn-login-status" style="font-size:11px;margin-left:8px;"></span>
+        </div>
+    </details>
+</div>
+{% endblock %}
+{% block scripts %}
+<script>
+async function loadVpnStatus() {
+    try {
+        var resp = await fetch('/api/vpn/status');
+        var data = await resp.json();
+        if (resp.ok) {
+            var dot = data.connected ? '<span style="color:#00ff41;">&#9679;</span>' : '<span style="color:#ff4444;">&#9679;</span>';
+            var html = dot + ' ' + (data.connected ? 'Connected' : 'Disconnected');
+            if (data.connected) {
+                html += ' &mdash; <span style="color:#00ff41;">' + data.country + '</span>';
+                html += ' <span class="text-faint">(' + data.ip + ')</span>';
+            }
+            if (data.rotations_today > 0) {
+                html += '<br><span class="text-faint">Rotations today: ' + data.rotations_today + '</span>';
+            }
+            document.getElementById('vpn-status').innerHTML = html;
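+        } else { document.getElementById('vpn-status').innerHTML = '<span class="text-red">Could not check VPN status</span>'; }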
+    } catch(e) {
+        document.getElementById('vpn-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
+    }
+}
+
+async function vpnAction(url, opts, statusEl) {
+    var el = document.getElementById(statusEl || 'vpn-action-status');
+    el.style.color = '#ffa500';
+    el.textContent = 'Working...';
+    try {
+        var resp = await fetch(url, opts);
+        var data = await resp.json();
+        if (data.ok) {
+            el.style.color = '#00ff41';
+            el.textContent = data.country ? (data.country + ' (' + data.ip + ')') : (data.message || 'Done');
+        } else {
+            el.style.color = '#ff4444';
+            el.textContent = data.error || data.message || 'Failed';
+        }
+        loadVpnStatus();
+    } catch(e) {
+        el.style.color = '#ff4444';
+        el.textContent = 'Error: ' + e.message;
+    }
+}
+
+function vpnRotate() { vpnAction('/api/vpn/rotate', {method:'POST'}); }
+function vpnDisconnect() { vpnAction('/api/vpn/disconnect', {method:'POST'}); }
+function vpnConnect() {
+    var country = document.getElementById('vpn-country').value;
+    vpnAction('/api/vpn/connect', {
+        method: 'POST',
+        headers: {'Content-Type': 'application/json'},
+        body: JSON.stringify({country: country})
+    });
+}
+function vpnLogin() {
+    var token = document.getElementById('vpn-token').value;
+    if (!token) return;
+    vpnAction('/api/vpn/login', {
+        method: 'POST',
+        headers: {'Content-Type': 'application/json'},
+        body: JSON.stringify({token: token})
+    }, 'vpn-login-status');
+}
+
+loadVpnStatus();
+</script>
+{% endblock %}