Add PROJECT-BIBLE.md: canonical architectural reference for RECON

Consolidated orientation document for future sessions. Covers pipeline lifecycle (acquire → dispatch → process → enrich/embed → file), acquisition modules, dispatcher, per-type processors, filing, StatusDB schema, config, service threads, dashboard/API, filesystem layout, refactor history, runbook, known gotchas, and follow-ups.

Sourced from live code on CT 130 (/opt/recon/) including recon.py, dispatcher.py, filing.py, status.py, the three processors, acquisition/peertube.py, config.yaml, and api.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-20 06:34:34 +02:00 · 2026-04-16 04:41:03 +00:00 · 2026-04-16 04:41:03 +00:00 · c9a8f1ecb5
commit c9a8f1ecb5
parent 5b0d4eed90
1 changed files with 629 additions and 0 deletions
--- a/PROJECT-BIBLE.md
+++ b/PROJECT-BIBLE.md
@@ -0,0 +1,629 @@
+# RECON — Project Bible
+
+Canonical architectural reference for **RECON** (the knowledge extraction
+pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document
+is the orientation dossier for any future session. It is skim-and-find,
+not a tutorial.
+
+- **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design)
+- **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx)
+- **Service:** `systemctl status recon`
+- **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik)
+- **Files server:** `https://files.echo6.co` (Authentik forward auth)
+
+---
+
+## 1. Mission
+
+RECON ingests documents from multiple sources (manual PDF uploads,
+PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and
+produces a **searchable, domain-organized library** plus a hybrid
+dense/sparse vector index in Qdrant on cortex.
+
+Every piece of content ends up in two places:
+
+1. A file under `/mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext>`
+   (PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/<uuid>`
+   (PeerTube transcripts — no local copy after Phase 5a).
+2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid`
+   (dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense).
+
+Search returns page-grounded citations back to the file or stream URL.
+
+---
+
+## 2. System Topology
+
+```
+                        ┌─────────────────────────┐
+                        │  CT 130 (recon)         │
+  Library (NFS)         │  /opt/recon/            │     ┌──────────────┐
+  pi-nas:/export/library│   ├─ data/              │     │ Qdrant       │
+  → /mnt/library/       │   │   ├─ acquired/      │     │ cortex:6333  │
+  (read/write)          │   │   ├─ processing/    │ ←→  │ recon_knowledge_hybrid
+                        │   │   ├─ concepts/      │     │ (1024-d dense + sparse)
+                        │   │   └─ recon.db       │     └──────────────┘
+                        │   ├─ lib/               │
+                        │   ├─ recon.py           │     ┌──────────────┐
+                        │   └─ config.yaml        │ ←→  │ TEI          │
+                        │  recon.service          │     │ cortex:8090  │
+                        │  nginx :8888 (files)    │     │ bge-m3 dense │
+                        └─────────────────────────┘     └──────────────┘
+                                   ▲                    ┌──────────────┐
+                                   │                    │ Sparse svc   │
+           ┌───────────────────────┴─────┐         ←→   │ cortex:8091  │
+           │                             │              │ bge-m3 sparse│
+ PeerTube (CT 110 / stream.echo6.co)    Gemini API      └──────────────┘
+ api_base: http://192.168.1.170          (enrichment,
+                                          vision OCR)
+```
+
+A shared Caddy reverse proxy (CT 101) surfaces the dashboard (8420) and
+the nginx file server (8888) as `recon.echo6.co` and `files.echo6.co`.
+
+---
+
+## 3. Pipeline Lifecycle
+
+Every document follows the same five-stage arc regardless of source type.
+The filesystem location at any given moment tells you which stage the
+item is in — **state is a directory.**
+
+```
+  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
+  │ 1. ACQUIRE   │ → │ 2. DISPATCH  │ → │ 3. PROCESS   │ → │ 4. ENRICH /  │ → │ 5. FILE      │
+  │              │   │ (pre_flight) │   │              │   │    EMBED     │   │              │
+  │ data/acquired│   │ dispatcher.py│   │ per-type     │   │ shared       │   │ shared       │
+  │ /<type>/     │   │ watches      │   │ processor    │   │ stage loops  │   │ filing worker│
+  │ <hash>.{ext} │   │ subfolders,  │   │ moves file   │   │ bge-m3 →     │   │ moves file   │
+  │ <hash>.meta  │   │ hands to     │   │ to processing│   │ Qdrant       │   │ processing → │
+  │              │   │ processor    │   │ /{hash}/     │   │              │   │ library,     │
+  │              │   │              │   │              │   │              │   │ updates DB + │
+  │              │   │              │   │              │   │              │   │ Qdrant       │
+  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
+```
+
+**Status column on documents:**
+`catalogued → queued → extracting → extracted → enriching → enriched →
+embedding → complete` (plus terminal `error`, `content_failure`,
+`duplicate` states).
+
+`organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after
+filing. Transcripts are marked organized in-place during pre_flight
+(they have no filesystem target — the watch URL is their "home").
+
+---
+
+## 4. Acquisition Layer (`lib/acquisition/`)
+
+Acquisition modules fetch content from external sources and drop
+`{hash}.<ext>` + `{hash}.meta.json` **flat file pairs** into
+`data/acquired/<type>/`. They do **not** touch the database — that's the
+processor's job.
+
+### Atomic drop protocol
+1. Compute the content hash and write the content to `<hash>.<ext>.tmp`
+   (`.tmp` is not a content extension, so the dispatcher ignores it).
+2. Write `<hash>.meta.json.tmp`, then rename it to `<hash>.meta.json`.
+3. Rename `<hash>.<ext>.tmp` to the final `<hash>.<ext>`.
+4. **Meta goes final first, content goes final last.** The dispatcher
+   only picks up a pair once the content file exists and is stable, so
+   a half-visible pair without meta never gets dispatched.
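The drop protocol above can be sketched as follows. This is a minimal illustration, not the code in `lib/acquisition/`: `drop_pair` and its parameters are hypothetical names, and it assumes the content hash is a SHA-256 over the raw bytes.

```python
import hashlib
import json
import os
from pathlib import Path


def drop_pair(dest_dir: Path, content: bytes, ext: str, meta: dict) -> Path:
    """Drop <hash><ext> plus <hash>.meta.json so the content file appears last."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()  # assumption: hash of raw bytes

    content_final = dest_dir / f"{digest}{ext}"
    content_tmp = dest_dir / f"{digest}{ext}.tmp"
    meta_final = dest_dir / f"{digest}.meta.json"
    meta_tmp = dest_dir / f"{digest}.meta.json.tmp"

    # Stage both files under .tmp names, which the dispatcher ignores.
    content_tmp.write_bytes(content)
    meta_tmp.write_text(json.dumps(meta), encoding="utf-8")

    # Publish meta first, content last. os.replace is an atomic rename
    # when source and destination are on the same filesystem.
    os.replace(meta_tmp, meta_final)
    os.replace(content_tmp, content_final)
    return content_final
```

Staging in the destination directory keeps both renames on one filesystem, which is what makes each publish step atomic.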
+
+### PeerTube acquisition (`lib/acquisition/peertube.py`)
+- Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`.
+- Queries catalogue for `source='stream.echo6.co'` rows, builds sets of
+  known UUIDs (`/w/<uuid>` extracted from `path`) **and** known titles
+  (from `filename`). Both cohorts are checked so Phase 5a-rewritten
+  rows and pre-5a library-path rows dedupe correctly.
+- Lists PeerTube videos via `peertube_scraper.get_videos`, filters to
+  those with captions, prefers English caption.
+- For each new one: fetches VTT, converts to text with `vtt_to_text`,
+  atomically drops pair into `data/acquired/stream/`.
+- Rate limits at `peertube.rate_limit_delay` (default 0.5s) —
+  **PeerTube returns 429 if captions are fetched too fast.**
+
+### Manual uploads / URL ingest
+`api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`,
+`/api/ingest-peertube` — all end by dropping a pair into `acquired/<type>/`.
+
+---
+
+## 5. Dispatcher (`lib/dispatcher.py`)
+
+The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It
+watches each configured subfolder under `data/acquired/` and hands
+stable file pairs to the registered processor.
+
+### Config-driven dispatch table
+```yaml
+pipeline:
+  acquired_root: /opt/recon/data/acquired
+  processing_root: /opt/recon/data/processing
+  dispatch:
+    pdf: pdf_processor
+    stream: transcript_processor
+    html: html_processor        # not yet implemented
+    text: text_processor
+  mtime_stability_seconds: 10
+```
+
+### Extension constants
+- `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the
+  dispatcher considers a file "content" only if its extension is in
+  this set. **`.tmp` is not in the set**, so partial writes are safe.
+- `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}` —
+  these are normalized to PDF **before** dispatch.
+
+### Normalization step
+`_normalize_formats(subfolder_path)`:
+- `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI).
+- `.doc` / `.docx` → PDF via `libreoffice --headless`.
+- Sidecar `.meta.json` is renamed to match the new PDF hash so pairing
+  holds.
+
+### Pair finding
+`_find_pairs(subfolder_path)` returns tuples of (content_path,
+meta_path_or_None). Pairs where only content exists are still valid —
+meta is not required. A meta without its content is ignored.
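A rough sketch of the pairing rule described above, assuming the sidecar is named `<stem>.meta.json` next to the content file. `find_pairs` is a hypothetical stand-in for `_find_pairs`, not a copy of it.

```python
from pathlib import Path
from typing import Optional

CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}


def find_pairs(subfolder: Path) -> list[tuple[Path, Optional[Path]]]:
    """Return (content_path, meta_path_or_None) for each content file."""
    pairs = []
    for path in sorted(subfolder.iterdir()):
        # Sidecars (.json) and partial writes (.tmp) are not content.
        if not path.is_file() or path.suffix.lower() not in CONTENT_EXTENSIONS:
            continue
        meta = path.with_name(path.stem + ".meta.json")
        pairs.append((path, meta if meta.exists() else None))
    return pairs
```

Because the loop is driven by content files, an orphaned `.meta.json` is never returned, which matches the "meta without content is ignored" rule.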
+
+### Stability check
+`_is_stable(filepath, stability_seconds)` — mtime must be at least
+`mtime_stability_seconds` old (default 10s) before dispatch. Prevents
+racing active writers.
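The stability check can be illustrated with a few lines. This is a sketch under the assumption that only the file's mtime is consulted; `is_stable` here is illustrative and may differ from `_is_stable` in details such as error handling.

```python
import os
import time


def is_stable(filepath: str, stability_seconds: float = 10.0) -> bool:
    """True once the file has gone unmodified for at least stability_seconds."""
    try:
        mtime = os.path.getmtime(filepath)
    except OSError:
        # The file vanished or is unreadable: treat it as not ready.
        return False
    return (time.time() - mtime) >= stability_seconds
```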
+
+---
+
+## 6. Processors (`lib/processors/`)
+
+Each processor implements **one function**: `pre_flight(content_path,
+meta_path, db, config) → dict`. It owns all the type-specific logic and
+**all the database writes** for that item up to status=`extracted`.
+
+### Common pre_flight contract
+Every processor does, in order:
+1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`).
+2. Stale state cleanup: `rm -rf processing/{hash}/` and
+   `concepts/{hash}/` if they exist (guards against re-runs).
+3. Hash dedupe: if `hash` already exists in `catalogue`, delete the
+   pair, return action `duplicate`.
+4. Type-specific metadata extraction + level-4 dedupe check (PDF only).
+5. Move content + meta into `processing/{hash}/` with a type-specific
+   layout.
+6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir`
+   and `page_count`, `db.update_status(hash, 'extracted', ...)`.
+
+Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are
+one of: `extracted`, `duplicate`, `level4_duplicate` (PDF only),
+`content_failure`, `error`, or `skip_empty` (text only).
+
+### `pdf_processor.py`
+The heaviest processor. Layered metadata extraction:
+- **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to
+  `{title, author, edition, year}`.
+- **Source B — Filename:** regex parse the original filename.
+- **Source C — Gemini Vision OCR** on first 3 pages when A+B
+  disagree or are missing. Returns structured JSON via Gemini's
+  `response_mime_type: application/json`.
+- **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources;
+  2-of-3 wins; ties prefer Source A.
+- **Level-4 dedupe:** if all four fields (`title, edition, author,
+  year`) are present and match an existing catalogue row with a
+  different hash, the PDF is quarantined to `_duplicates/` for human
+  review.
+- **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize
+  PDFs move to `_rejected/`.
+- **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract
+  OCR → Gemini Vision on a per-page basis. Output:
+  `processing/{hash}/page_NNNN.txt`.
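The 2-of-3 voting rule from the list above could look roughly like this. It is a sketch only: `vote_field` and `vote_metadata` are hypothetical names, values are compared as exact strings (the real `_vote_metadata` may normalize case or whitespace), and the fallback to B then C when A is missing is an assumption.

```python
from collections import Counter
from typing import Optional

FIELDS = ("title", "author", "edition", "year")


def vote_field(a: Optional[str], b: Optional[str], c: Optional[str]) -> Optional[str]:
    """Pick the majority value among sources A, B, C; ties prefer A."""
    present = [v for v in (a, b, c) if v]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    if count >= 2:
        return value
    # No majority: prefer Source A, then fall back to B, then C.
    return present[0]


def vote_metadata(a: dict, b: dict, c: dict) -> dict:
    """Reconcile the three metadata sources field by field."""
    return {f: vote_field(a.get(f), b.get(f), c.get(f)) for f in FIELDS}
```

Voting per field rather than per record lets a good title from the filename coexist with a good author from the PDF dictionary.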
+
+### `transcript_processor.py`
+Lightweight. The VTT→text conversion already happened in acquisition,
+so pre_flight just:
+- Hashes the `<hash>.txt` file.
+- Reads meta.json sidecar.
+- `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into
+  `page_NNNN.txt` files.
+- Writes the transcript as `processing/{hash}/transcript.txt` plus page
+  chunks.
+- Registers with category `Transcript`, source `stream.echo6.co`.
+- Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP`
+  immediately** — transcripts are filed-in-place (their "location" is
+  the PeerTube watch URL, set later as the catalogue `path` via Phase
+  5a).
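A minimal sketch of word-count paging as described in the list above. It assumes whitespace tokenization and 1-based page numbers; neither is confirmed by the source, and `page_filename` is a hypothetical helper.

```python
WORDS_PER_PAGE = 2000


def chunk_text(raw_text: str, words_per_page: int = WORDS_PER_PAGE) -> list[str]:
    """Split text into pages of at most words_per_page whitespace-separated words."""
    words = raw_text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_filename(page_number: int) -> str:
    """Format a page number in the page_NNNN.txt pattern."""
    return f"page_{page_number:04d}.txt"
```

With these assumptions a 4,500-word transcript yields three pages of 2,000, 2,000, and 500 words.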
+
+### `text_processor.py`
+Raw `.txt` files dropped via manual upload. Two-source metadata vote
+(filename + meta.json). Similar flow to transcript processor but no
+fixed category or source.
+
+---
+
+## 7. Enrichment & Embedding
+
+Both are **source-agnostic stage loops** that just poll documents by
+status and do their work. They live in `lib/enricher.py` and
+`lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`.
+
+### Enrichment (`enrich_workers: 16` threads per batch)
+- Polls `status = 'extracted' AND retries < max_retries`.
+- Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`.
+- Windows pages (`enrich_window_size: 5` per window) and sends each
+  window to Gemini with a structured prompt.
+- Stores `concepts/{hash}/window_N.json` per window.
+- Backoff: `enrich_base_delay=5s`, doubling up to
+  `enrich_max_delay=120s`, max `enrich_max_retries=5`.
+- On success: `update_status(hash, 'enriched')`.
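The backoff schedule implied by the settings above can be worked out directly. This sketch assumes the delay doubles after every failed attempt starting from the base delay; `backoff_delays` is a hypothetical helper, not part of `lib/enricher.py`.

```python
def backoff_delays(base: float = 5.0, max_delay: float = 120.0,
                   max_retries: int = 5) -> list[float]:
    """Delay before each retry: base doubled per attempt, capped at max_delay."""
    return [min(base * (2 ** attempt), max_delay) for attempt in range(max_retries)]
```

With the documented defaults the five delays are 5, 10, 20, 40, and 80 seconds, so the 120-second cap only comes into play if `enrich_max_retries` is raised above five.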
+
+### Embedding (`embed_workers: 4`)
+- Polls `status = 'enriched'`.
+- Reads concept JSONs, builds page-level chunks.
+- Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of
+  128 per TEI request. Throughput ~1,711 emb/sec.
+- Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse
+  mode; `sparse_embedding.enabled: true`).
+- Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`,
+  batch size `embed_batch_size=500` vectors per upsert.
+- Payload carries: `hash`, `filename`, `original_filename`,
+  `download_url`, `page`, `text`, `title`, `domain`, `subdomain`,
+  `category`.
+- Ollama is a fallback backend (much slower, ~8 emb/sec) via
+  `embedding.backend: ollama`.
+- On success: `update_status(hash, 'complete')`.
+
+---
+
+## 8. Filing (`lib/filing.py`)
+
+One daemon thread, `filing_worker_loop(interval=30)`. It polls:
+
+```sql
+SELECT hash FROM documents
+ WHERE status = 'complete'
+   AND organized_at IS NULL
+   AND path LIKE '/opt/recon/data/processing/%'
+ LIMIT 50
+```
+
+The `path LIKE '/opt/recon/data/processing/%'` filter naturally
+**excludes transcripts**: their `documents.path` is the PeerTube watch
+URL, not a path under `processing/`.
+
+For each row, `file_processed_item(doc_hash, source_file_path, db,
+config)` does:
+1. `determine_dominant_domain(hash)` reads concept JSONs, returns the
+   top-voted `Domain/Subdomain`.
+2. `_build_target_path(...)` derives the canonical name starting at
+   level 1 (`Title`), escalating to level 2/3/4 only if a collision
+   exists in the target folder. **Preserves source file's actual
+   extension** (not hardcoded to `.pdf`).
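+3. `shutil.move(source, target)`. Target is
+   `/mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>`. Because
+   `processing/` is on local disk and the library is an NFS mount, the
+   move is a copy followed by a delete, not an atomic rename.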
+4. Updates:
+   - `catalogue.path` → new target
+   - `catalogue.filename` → new canonical name
+   - `documents.path` → new target
+   - Qdrant payload via `update_qdrant_payload(...)`:
+     `download_url = generate_download_url(new_path, ...)`,
+     `filename`, `original_filename` set on every point for that hash.
+5. `db.mark_organized(hash)` sets `organized_at` + cleans up
+   `processing/{hash}/`.
+
+### Download URL helper (`lib/utils.py:generate_download_url`)
+- If the path is already `http://` or `https://` (transcripts), return
+  it unchanged.
+- Otherwise strip `library_root` prefix and prepend
+  `book_server.base_url` (→ `https://files.echo6.co/<rel>`).
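The two rules above fit in a short function. This is a sketch, not the helper in `lib/utils.py`: the name `sketch_download_url` and its explicit `library_root` / `base_url` parameters are hypothetical, and percent-encoding the relative path is an added assumption.

```python
from urllib.parse import quote


def sketch_download_url(path: str, library_root: str, base_url: str) -> str:
    """Map a library path to a files-server URL; pass URLs through untouched."""
    if path.startswith(("http://", "https://")):
        return path  # transcripts already point at their watch URL
    root = library_root.rstrip("/")
    rel = path[len(root):] if path.startswith(root + "/") else path
    # quote() leaves "/" alone and escapes spaces and other unsafe characters.
    return base_url.rstrip("/") + "/" + quote(rel.lstrip("/"))
```

For example, `/mnt/library/Medicine/First Aid/Guide.pdf` with root `/mnt/library` and base `https://files.echo6.co` becomes `https://files.echo6.co/Medicine/First%20Aid/Guide.pdf`.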
+
+---
+
+## 9. StatusDB (`lib/status.py`)
+
+SQLite (`data/recon.db`) in WAL mode with thread-local connections
+(`_get_conn()` uses `threading.local`).
+
+### Tables
+| Table | Purpose |
+|---|---|
+| `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size |
+| `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps |
+| `intel` | ARGUS intel feed entries (separate pipeline) |
+| `metrics_snapshots` | Time-series rollups for the dashboard |
+| `file_operations` | Audit log of Phase-5-style file moves and renames |
+| `duplicate_review` | Level-4 dedupe quarantine queue |
+
+### Key methods
+- `add_to_catalogue(hash, title, url, size, source, category)`
+- `queue_document(hash)` — insert into `documents` with status=`queued`
+- `update_status(hash, status, **kwargs)` — single point of status truth
+- `mark_organized(hash)` — sets `organized_at`, final transition
+- `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` —
+  used by filing worker and Phase 5a un-file
+- `get_path_updates` / `clear_path_update` — small change queue for
+  backfills
+
+### Connection safety
+All writers use the calling thread's connection from `_get_conn()`. WAL
+mode lets readers run concurrently with a writer; writes are serialized
+at the SQLite level. No explicit `BEGIN`: the code relies on autocommit
+semantics, with an occasional `conn.commit()` after grouped updates.
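The thread-local connection pattern can be sketched as below. `StatusDBSketch` is a hypothetical, heavily reduced stand-in for the real `StatusDB`: the two-column schema, the 30-second busy timeout, and the method bodies are assumptions made for illustration.

```python
import sqlite3
import threading


class StatusDBSketch:
    """One SQLite connection per thread, opened lazily in WAL mode."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._local = threading.local()

    def _get_conn(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # timeout makes a writer wait for the lock instead of failing at once
            conn = sqlite3.connect(self.db_path, timeout=30)
            conn.execute("PRAGMA journal_mode=WAL")
            self._local.conn = conn
        return conn

    def init_schema(self) -> None:
        conn = self._get_conn()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(hash TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        conn.commit()

    def queue_document(self, doc_hash: str) -> None:
        conn = self._get_conn()
        conn.execute(
            "INSERT OR IGNORE INTO documents (hash, status) VALUES (?, 'queued')",
            (doc_hash,),
        )
        conn.commit()

    def update_status(self, doc_hash: str, status: str) -> None:
        conn = self._get_conn()
        conn.execute("UPDATE documents SET status = ? WHERE hash = ?", (status, doc_hash))
        conn.commit()
```

Keeping one connection per thread avoids sharing a `sqlite3.Connection` across threads, which the module rejects by default.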
+
+---
+
+## 10. Configuration (`config.yaml`)
+
+Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`,
+`PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in
+`config.yaml`, never in git.
+
+### Top-level keys
+| Key | Meaning |
+|---|---|
+| `library_root` | `/mnt/library` — NFS mount root |
+| `processing` | Worker counts, window sizes, timeouts, retry policy |
+| `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense |
+| `sparse_embedding` | Separate service on cortex:8091 |
+| `vector_db` | Qdrant host, port, collection name |
+| `gemini` | Model (`gemini-2.0-flash`), JSON response mode |
+| `web` | Dashboard bind host + port (8420) |
+| `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` |
+| `book_server` | `base_url`, `strip_prefix` for download URL generation |
+| `upload_paths` | Category → filesystem path for upload routing |
+| `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` |
+| `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` |
+| `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` |
+| `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture |
+| `new_pipeline` | Stream-B (old) pipeline, `enabled: false` |
+
+---
+
+## 11. Service & Threads (`recon.py cmd_service`)
+
+`systemctl start recon` → `python3 recon.py service`. The service runs
+seven daemon threads plus a metrics collector:
+
+| Thread | Function | Interval |
+|---|---|---|
+| `dispatcher` | `dispatcher.dispatch_loop` | 30s |
+| `enrich` | `stage_loop('enrich', ...)` | 30s idle |
+| `embed` | `stage_loop('embed', ...)` | 30s idle |
+| `filing` | `filing.filing_worker_loop` | 30s |
+| `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s |
+| `progress` | Log status rollup line | 60s |
+| `dashboard` | `api.run_server` (Flask) | bound |
+
+Plus `peertube_collector.start_collector` for metrics scrape.
+
+All threads receive a shared `stop_event` (`threading.Event`) and exit
+cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`.
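The shared `stop_event` pattern can be sketched as follows. The loop body, `start_workers`, and `install_sigterm_handler` are hypothetical names for illustration; the real loops in `recon.py` do actual work where this sketch only records a tick.

```python
import signal
import threading


def worker_loop(stop_event: threading.Event, interval: float, ticks: list) -> None:
    """Run until stop_event is set, idling `interval` seconds between passes."""
    while not stop_event.is_set():
        ticks.append(1)            # stand-in for one unit of real work
        stop_event.wait(interval)  # sleeps, but wakes immediately on stop


def start_workers(stop_event: threading.Event, intervals: dict, ticks: list) -> list:
    """Start one daemon thread per named loop, all sharing the same stop_event."""
    threads = []
    for name, interval in intervals.items():
        t = threading.Thread(target=worker_loop, name=name, daemon=True,
                             args=(stop_event, interval, ticks))
        t.start()
        threads.append(t)
    return threads


def install_sigterm_handler(stop_event: threading.Event) -> None:
    """Must be called from the main thread; SIGTERM then sets the shared event."""
    signal.signal(signal.SIGTERM, lambda *_: stop_event.set())
```

Using `stop_event.wait(interval)` instead of `time.sleep(interval)` is what lets a thread with an 1800-second interval exit promptly on shutdown.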
+
+### CLI commands (`recon.py` top-level)
+`scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`,
+`catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`,
+`ingest-peertube`, `validate`, `rebuild`, `serve`, `service`,
+`organize`, `pipeline`.
+
+Most commands are thin wrappers around library functions — useful for
+one-off maintenance from the CT 130 shell.
+
+---
+
+## 12. Dashboard & API (`lib/api.py`)
+
+Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja
+templates; data is pulled via AJAX from `/api/*` endpoints.
+
+### Page routes
+`/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`,
+`/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`.
+
+### API surface (grouped)
+| Group | Endpoints |
+|---|---|
+| Upload | `POST /api/upload`, `GET /api/upload/<hash>/status`, `GET /api/upload/categories` |
+| Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl/<id>/status`, `GET /api/ingest-peertube/<job>/status` |
+| Search | `POST /api/search` |
+| Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` |
+| Retry | `POST /api/retry/<hash>`, `/api/retry-all` |
+| Service | `POST /api/service/restart` |
+| Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` |
+| Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` |
+| VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` |
+| PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}`, `/api/peertube/stats` |
+| Metrics | `GET /api/metrics/history` |
+
+### Qdrant scroll
+`_qdrant_scroll(host, port, collection, req)` is the shared paged-read
+helper for rebuilding the knowledge-stats panel.
+
+### Cache warmer
+`start_cache_warmer(stop_event)` pre-computes the expensive quick-stats
+and knowledge-stats panels so the dashboard loads instantly.
+
+---
+
+## 13. Filesystem Layout
+
+```
+/opt/recon/
+├── recon.py                    # CLI + service entry point
+├── config.yaml
+├── .env                        # secrets (GEMINI_KEYS etc.)
+├── PROJECT-BIBLE.md            # this file (copy on CT 130)
+├── backups/                    # local DB backups
+├── data/
+│   ├── acquired/               # hopper — {hash}.ext + {hash}.meta.json
+│   │   ├── pdf/
+│   │   ├── stream/             # PeerTube transcripts
+│   │   ├── html/               # (future)
+│   │   └── text/
+│   ├── processing/{hash}/      # in-flight scratch
+│   │   ├── page_NNNN.txt
+│   │   ├── meta.json
+│   │   └── (original file or transcript.txt)
+│   ├── concepts/{hash}/
+│   │   └── window_N.json       # Gemini enrichment output
+│   ├── intel/                  # ARGUS intel feeds
+│   ├── _duplicates/            # level-4 name-match quarantine
+│   ├── _rejected/              # oversize / unreadable PDFs
+│   └── recon.db                # SQLite WAL mode
+├── lib/
+│   ├── acquisition/peertube.py
+│   ├── processors/{pdf,transcript,text}_processor.py
+│   ├── dispatcher.py
+│   ├── filing.py
+│   ├── enricher.py
+│   ├── embedder.py
+│   ├── status.py               # StatusDB class
+│   ├── api.py                  # Flask dashboard + API
+│   ├── new_pipeline.py         # update_qdrant_payload helper lives here
+│   ├── utils.py                # content_hash, generate_download_url, get_config, setup_logging
+│   ├── peertube_scraper.py     # PeerTube API client
+│   └── organizer.py            # determine_dominant_domain, level 1-4 naming
+└── logs/
+
+/mnt/library/                   # NFS from pi-nas, read-write
+├── <Domain>/<Subdomain>/<canonical_name>.<ext>
+└── _acquired/ _review/ _staging/ signal-archive/  # not touched by pipeline
+```
+
+---
+
+## 14. Refactor History (2026-04)
+
+The refactor is tracked as dated phases under `phases/`. The
+implementations are in the RECON repo; the design lives here.
+
+| Phase | Focus |
+|---|---|
+| 0 | Baseline capture — DB dumps, directory listings, config pin |
+| 1 | Scaffolding — create `acquired/`, `processing/`, config keys |
+| 2 | Shared filing function — extract organizer logic into `filing.py` |
+| 3 | Transcript processor — first end-to-end test of the new pattern |
+| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
+| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/<uuid>` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
+| 5b | Transcript unprocess — clean up stale rows and processing dirs |
+| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
+| 5c-2 | Service start & transcript drain — clear the hopper backlog |
+| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
+| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
+| 6c | Code cleanup — dead-code audit |
+| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
+| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
+
+### Baseline pre-refactor (per `current-state.md`)
+- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
+- Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`.
+- `scan_library()` polled the NFS mount for new PDFs — now deprecated.
+
+---
+
+## 15. Operational Runbook
+
+### Service control (on CT 130 as zvx)
+```bash
+sudo systemctl status recon    # likewise: start, stop, restart
+journalctl -u recon -f
+tail -f /opt/recon/logs/recon.log
+```
+
+### Backups
+```bash
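+# Local DB backup before risky operations. Use sqlite3's online backup
+# rather than cp: in WAL mode a plain copy can miss pages still in recon.db-wal.
+sqlite3 /opt/recon/data/recon.db ".backup '/tmp/recon.db.bak.$(date +%s)'"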
+# Contabo offsite (automatic): rsync every 6 hours, see recon-backup.timer
+```
+
+### Inspect pipeline state at a glance
+```bash
+ls /opt/recon/data/acquired/*/    # hopper contents
+ls /opt/recon/data/processing/ | wc -l   # in-flight count
+sqlite3 /opt/recon/data/recon.db \
+  "SELECT status, COUNT(*) FROM documents GROUP BY status;"
+```
+
+### Re-queue a failed document
+```bash
+sqlite3 /opt/recon/data/recon.db \
+  "UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
+# or via API:
+curl -X POST https://recon.echo6.co/api/retry/<hash>
+```
+
+### Manual ingest
+```bash
+# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
+sha=$(sha256sum foo.pdf | cut -d' ' -f1)
+cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
+```
+
+### Qdrant health
+```bash
+curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
+  | jq '.result | {status, points_count, optimizer_status}'
+# status "grey" with optimizer_status "ok" is healthy (optimizations pending, not an error).
+```
+
+---
+
+## 16. Known Gotchas
+
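+- **Logger setup.** RECON modules must use `setup_logging('recon.<name>')`
+  from `lib.utils`, never raw `logging.getLogger()`. The root logger
+  has no handlers, so records from a raw logger never reach
+  `logs/recon.log`; Python's last-resort handler prints only WARNING
+  and above to stderr, and everything below that is dropped.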
+- **Qdrant status "grey" is healthy** if `optimizer_status` is `"ok"`.
+  Only treat `red` together with a non-ok `optimizer_status` as a real
+  failure.
+- **Catalogue row count can grow during long-running jobs** because
+  parallel ingestion may add rows. Only a *decrease* is a real
+  integrity failure.
+- **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include
+  `.tmp`, so active acquisition writes are invisible to the dispatcher
+  until the atomic rename lands.
+- **Transcripts are filed in-place.** Their `documents.path` is a URL
+  and filing worker's `path LIKE '/opt/recon/data/processing/%'`
+  filter excludes them.
+- **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption
+  API calls or you'll get throttled.
+- **SSH heredocs with Python code break.** When editing remote files,
+  write to a temp file via `scp` or `cat > file` rather than bash
+  heredocs with parens/quotes.
+- **The crawler is off.** `crawler.sites: []`. Re-enabling requires a
+  re-architecture for the new pipeline.
+
+---
+
+## 17. Credentials & Hosts
+
+| Host | Role | Access |
+|---|---|---|
+| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) |
+| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` |
+| CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` |
+| pi-nas (192.168.1.245) | NFS server for `/mnt/library` | `ssh zvx@pi-nas` |
+| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` |
+
+Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine).
+RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
+
+---
+
+## 18. Open Follow-ups
+
+- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
+  library paths post Phase 5a (audit trail at
+  `/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
+  tombstone.
+- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
+  in config but not implemented. Next-up for Kiwix / web ingest.
+- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
+  is a valuable target list but the old crawler is off pending a new
+  acquisition-module-shaped implementation.
+- **ARGUS intel pipeline** shares the DB but its lifecycle is
+  documented separately — not covered here.
+- **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and
+  needs a redesign before reinstating.
+- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
+  yet; items pile up silently.
+
+---
+
+*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*