From c9a8f1ecb5963a93100c1da2a0b8e471a0ceca2a Mon Sep 17 00:00:00 2001
From: Matt
Date: Thu, 16 Apr 2026 04:41:03 +0000
Subject: [PATCH] Add PROJECT-BIBLE.md: canonical architectural reference for RECON
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Consolidated orientation document for future sessions. Covers pipeline
lifecycle (acquire → dispatch → process → enrich/embed → file),
acquisition modules, dispatcher, per-type processors, filing, StatusDB
schema, config, service threads, dashboard/API, filesystem layout,
refactor history, runbook, known gotchas, and follow-ups.

Sourced from live code on CT 130 (/opt/recon/) including recon.py,
dispatcher.py, filing.py, status.py, the three processors,
acquisition/peertube.py, config.yaml, and api.py.

Co-Authored-By: Claude Opus 4.6
---
 PROJECT-BIBLE.md | 629 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 629 insertions(+)
 create mode 100644 PROJECT-BIBLE.md

diff --git a/PROJECT-BIBLE.md b/PROJECT-BIBLE.md
new file mode 100644
index 0000000..c9edfc5
--- /dev/null
+++ b/PROJECT-BIBLE.md
@@ -0,0 +1,629 @@
+# RECON — Project Bible
+
+Canonical architectural reference for **RECON** (the knowledge extraction
+pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document
+is the orientation dossier for any future session. It is skim-and-find,
+not a tutorial.
+
+- **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design)
+- **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx)
+- **Service:** `systemctl status recon`
+- **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik)
+- **Files server:** `https://files.echo6.co` (Authentik forward auth)
+
+---
+
+## 1. Mission
+
+RECON ingests documents from multiple sources (manual PDF uploads,
+PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and
+produces a **searchable, domain-organized library** plus a hybrid
+dense/sparse vector index in Qdrant on cortex.
+
+Every piece of content ends up in two places:
+
+1. A file under `/mnt/library///.`
+   (PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/`
+   (PeerTube transcripts — no local copy after Phase 5a).
+2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid`
+   (dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense).
+
+Search returns page-grounded citations back to the file or stream URL.
+
+---
+
+## 2. System Topology
+
+```
+                        ┌─────────────────────────┐
+                        │ CT 130 (recon)          │
+  Library (NFS)         │ /opt/recon/             │      ┌──────────────┐
+  pi-nas:/export/library│ ├─ data/                │      │ Qdrant       │
+  → /mnt/library/       │ │  ├─ acquired/         │      │ cortex:6333  │
+  (read/write)          │ │  ├─ processing/       │  ←→  │ recon_knowledge_hybrid
+                        │ │  ├─ concepts/         │      │ (1024-d dense + sparse)
+                        │ │  └─ recon.db          │      └──────────────┘
+                        │ ├─ lib/                 │
+                        │ ├─ recon.py             │      ┌──────────────┐
+                        │ └─ config.yaml          │  ←→  │ TEI          │
+                        │ recon.service           │      │ cortex:8090  │
+                        │ nginx :8888 (files)     │      │ bge-m3 dense │
+                        └─────────────────────────┘      └──────────────┘
+                                     ▲                   ┌──────────────┐
+                                     │                   │ Sparse svc   │
+             ┌───────────────────────┴─────┐         ←→  │ cortex:8091  │
+             │                             │             │ bge-m3 sparse│
+PeerTube (CT 110 / stream.echo6.co)   Gemini API         └──────────────┘
+api_base: http://192.168.1.170        (enrichment,
+                                       vision OCR)
+```
+
+Shared caddy reverse proxy (CT 101) surfaces the dashboard (8420) and
+nginx file server (8888) as `recon.echo6.co` and `files.echo6.co`.
+
+---
+
+## 3. Pipeline Lifecycle
+
+Every document follows the same five-stage arc regardless of source type.
+The filesystem location at any given moment tells you which stage the
+item is in — **state is a directory.**
+
+```
+  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
+  │ 1. ACQUIRE   │ → │ 2. DISPATCH  │ → │ 3. PROCESS   │ → │ 4. ENRICH /  │ → │ 5. FILE      │
+  │              │   │ (pre_flight) │   │              │   │    EMBED     │   │              │
+  │ data/acquired│   │ dispatcher.py│   │ per-type     │   │ shared       │   │ shared       │
+  │ //           │   │ watches      │   │ processor    │   │ stage loops  │   │ filing worker│
+  │ .{ext}       │   │ subfolders,  │   │ moves file   │   │ bge-m3 →     │   │ moves file   │
+  │ .meta        │   │ hands to     │   │ to processing│   │ Qdrant       │   │ processing → │
+  │              │   │ processor    │   │ /{hash}/     │   │              │   │ library,     │
+  │              │   │              │   │              │   │              │   │ updates DB + │
+  │              │   │              │   │              │   │              │   │ Qdrant       │
+  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
+```
+
+**Status column on documents:**
+`catalogued → queued → extracting → extracted → enriching → enriched →
+embedding → complete` (plus terminal `error`, `content_failure`,
+`duplicate` states).
+
+`organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after
+filing. Transcripts are marked organized in-place during pre_flight
+(they have no filesystem target — the watch URL is their "home").
+
+---
+
+## 4. Acquisition Layer (`lib/acquisition/`)
+
+Acquisition modules fetch content from external sources and drop
+`{hash}.` + `{hash}.meta.json` **flat file pairs** into
+`data/acquired//`. They do **not** touch the database — that's the
+processor's job.
+
+### Atomic drop protocol
+1. Write content to `..tmp` (unknown extension, safe from dispatcher).
+2. Compute hash; rename tmp to final `.`.
+3. Write `.meta.json.tmp`, then rename to `.meta.json`.
+4. **Meta goes final first, content goes final last.** Dispatcher only
+   picks up when content file exists and is stable, so a half-visible
+   pair without meta never gets dispatched.
+
+### PeerTube acquisition (`lib/acquisition/peertube.py`)
+- Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`.
+- Queries catalogue for `source='stream.echo6.co'` rows, builds sets of
+  known UUIDs (`/w/` extracted from `path`) **and** known titles
+  (from `filename`) — both cohorts are checked so Phase 5b-rewritten
+  rows and pre-5b library-path rows dedupe correctly.
+- Lists PeerTube videos via `peertube_scraper.get_videos`, filters to
+  those with captions, prefers English caption.
+- For each new one: fetches VTT, converts to text with `vtt_to_text`,
+  atomically drops pair into `data/acquired/stream/`.
+- Rate limits at `peertube.rate_limit_delay` (default 0.5s) —
+  **PeerTube returns 429 if captions are fetched too fast.**
+
+### Manual uploads / URL ingest
+`api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`,
+`/api/ingest-peertube` — all end by dropping a pair into `acquired//`.
+
+---
+
+## 5. Dispatcher (`lib/dispatcher.py`)
+
+The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It
+watches each configured subfolder under `data/acquired/` and hands
+stable file pairs to the registered processor.
+
+### Config-driven dispatch table
+```yaml
+pipeline:
+  acquired_root: /opt/recon/data/acquired
+  processing_root: /opt/recon/data/processing
+  dispatch:
+    pdf: pdf_processor
+    stream: transcript_processor
+    html: html_processor       # not yet implemented
+    text: text_processor
+  mtime_stability_seconds: 10
+```
+
+### Extension constants
+- `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the
+  dispatcher considers a file "content" only if its extension is in
+  this set. **`.tmp` is not in the set**, so partial writes are safe.
+- `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}` —
+  these are normalized to PDF **before** dispatch.
+
+### Normalization step
+`_normalize_formats(subfolder_path)`:
+- `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI).
+- `.doc` / `.docx` → PDF via `libreoffice --headless`.
+- Sidecar `.meta.json` is renamed to match the new PDF hash so pairing
+  holds.
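
The extension sets and the normalization step above can be sketched as follows. This is a minimal illustration, not the code in `lib/dispatcher.py`: `classify` and `conversion_command` are hypothetical helper names, and the exact `ebook-convert` and `libreoffice` arguments RECON passes are assumptions based on those tools' standard command-line usage. Renaming the sidecar to the new PDF hash is left out.

```python
from pathlib import PurePosixPath

# Extension sets as listed in the section above.
CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}
CONVERTIBLE_EXTENSIONS = {".epub", ".mobi", ".doc", ".docx"}


def classify(path: PurePosixPath) -> str:
    """Return 'content', 'convertible', or 'ignored' for a hopper file.

    Hypothetical helper: only the final suffix is inspected, so
    'abc.pdf.tmp' has suffix '.tmp' and 'abc.meta.json' has suffix '.json';
    both fall through to 'ignored'.
    """
    ext = path.suffix.lower()
    if ext in CONTENT_EXTENSIONS:
        return "content"
    if ext in CONVERTIBLE_EXTENSIONS:
        return "convertible"
    return "ignored"


def conversion_command(src: PurePosixPath) -> list[str]:
    """Build the argv that would convert src to a PDF in the same folder.

    Assumed invocations: `ebook-convert IN OUT` (Calibre) and
    `libreoffice --headless --convert-to pdf --outdir DIR IN`.
    """
    ext = src.suffix.lower()
    if ext in {".epub", ".mobi"}:
        return ["ebook-convert", str(src), str(src.with_suffix(".pdf"))]
    if ext in {".doc", ".docx"}:
        return [
            "libreoffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(src.parent), str(src),
        ]
    raise ValueError(f"not a convertible extension: {ext}")
```

The returned list could be handed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the extension rules easy to test without Calibre or LibreOffice installed.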
+
+### Pair finding
+`_find_pairs(subfolder_path)` returns tuples of (content_path,
+meta_path_or_None). Pairs where only content exists are still valid —
+meta is not required. A meta without its content is ignored.
+
+### Stability check
+`_is_stable(filepath, stability_seconds)` — mtime must be at least
+`mtime_stability_seconds` old (default 10s) before dispatch. Prevents
+racing active writers.
+
+---
+
+## 6. Processors (`lib/processors/`)
+
+Each processor implements **one function**: `pre_flight(content_path,
+meta_path, db, config) → dict`. It owns all the type-specific logic and
+**all the database writes** for that item up to status=`extracted`.
+
+### Common pre_flight contract
+Every processor does, in order:
+1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`).
+2. Stale state cleanup: `rm -rf processing/{hash}/` and
+   `concepts/{hash}/` if they exist (guards against re-runs).
+3. Hash dedupe: if `hash` already exists in `catalogue`, delete the
+   pair, return action `duplicate`.
+4. Type-specific metadata extraction + level-4 dedupe check (PDF only).
+5. Move content + meta into `processing/{hash}/` with a type-specific
+   layout.
+6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir`
+   and `page_count`, `db.update_status(hash, 'extracted', ...)`.
+
+Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are
+one of: `extracted`, `duplicate` (all types, including transcripts),
+`level4_duplicate` (PDF only), `content_failure`, `error`, or
+`skip_empty` (text only).
+
+### `pdf_processor.py`
+The heaviest processor. Layered metadata extraction:
+- **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to
+  `{title, author, edition, year}`.
+- **Source B — Filename:** regex parse the original filename.
+- **Source C — Gemini Vision OCR** on first 3 pages when A+B
+  disagree or are missing. Returns structured JSON via Gemini's
+  `response_mime_type: application/json`.
+- **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources;
+  2-of-3 wins; ties prefer Source A.
+- **Level-4 dedupe:** if all four fields (`title, edition, author,
+  year`) are present and match an existing catalogue row with a
+  different hash, the PDF is quarantined to `_duplicates/` for human
+  review.
+- **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize
+  PDFs move to `_rejected/`.
+- **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract
+  OCR → Gemini Vision on a per-page basis. Output:
+  `processing/{hash}/page_NNNN.txt`.
+
+### `transcript_processor.py`
+Lightweight. The VTT→text conversion already happened in acquisition,
+so pre_flight just:
+- Hashes `.txt` file.
+- Reads meta.json sidecar.
+- `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into
+  `page_NNNN.txt` files.
+- Writes the transcript as `processing/{hash}/transcript.txt` plus page
+  chunks.
+- Registers with category `Transcript`, source `stream.echo6.co`.
+- Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP`
+  immediately** — transcripts are filed-in-place (their "location" is
+  the PeerTube watch URL, set later as the catalogue `path` via Phase
+  5a).
+
+### `text_processor.py`
+Raw `.txt` files dropped via manual upload. Two-source metadata vote
+(filename + meta.json). Similar flow to transcript processor but no
+fixed category or source.
+
+---
+
+## 7. Enrichment & Embedding
+
+Both are **source-agnostic stage loops** that just poll documents by
+status and do their work. They live in `lib/enricher.py` and
+`lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`.
+
+### Enrichment (`enrich_workers: 16` threads per batch)
+- Polls `status = 'extracted' AND retries < max_retries`.
+- Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`.
+- Windows pages (`enrich_window_size: 5` per window) and sends each
+  window to Gemini with a structured prompt.
+- Stores `concepts/{hash}/window_N.json` per window.
+- Backoff: `enrich_base_delay=5s`, doubling up to
+  `enrich_max_delay=120s`, max `enrich_max_retries=5`.
+- On success: `update_status(hash, 'enriched')`.
+
+### Embedding (`embed_workers: 4`)
+- Polls `status = 'enriched'`.
+- Reads concept JSONs, builds page-level chunks.
+- Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of
+  128 per TEI request. Throughput ~1,711 emb/sec.
+- Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse
+  mode; `sparse_embedding.enabled: true`).
+- Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`,
+  batch size `embed_batch_size=500` vectors per upsert.
+- Payload carries: `hash`, `filename`, `original_filename`,
+  `download_url`, `page`, `text`, `title`, `domain`, `subdomain`,
+  `category`.
+- Ollama is a fallback backend (much slower, ~8 emb/sec) via
+  `embedding.backend: ollama`.
+- On success: `update_status(hash, 'complete')`.
+
+---
+
+## 8. Filing (`lib/filing.py`)
+
+One daemon thread, `filing_worker_loop(interval=30)`. It polls:
+
+```sql
+SELECT hash FROM documents
+ WHERE status = 'complete'
+   AND organized_at IS NULL
+   AND path LIKE '/opt/recon/data/processing/%'
+ LIMIT 50
+```
+
+The `path LIKE '/opt/recon/data/processing/%'` filter naturally
+**excludes transcripts** — their `documents.path` was never a
+filesystem path but the PeerTube watch URL.
+
+For each row, `file_processed_item(doc_hash, source_file_path, db,
+config)` does:
+1. `determine_dominant_domain(hash)` reads concept JSONs, returns the
+   top-voted `Domain/Subdomain`.
+2. `_build_target_path(...)` derives the canonical name starting at
+   level 1 (`Title`), escalating to level 2/3/4 only if a collision
+   exists in the target folder. **Preserves source file's actual
+   extension** (not hardcoded to `.pdf`).
+3. `shutil.move(source, target)`. Target is
+   `/mnt/library///.`. Note that when source and target are on
+   different filesystems (local `processing/` versus the NFS library),
+   `shutil.move` copies and then deletes rather than renaming
+   atomically.
+4. Updates:
+   - `catalogue.path` → new target
+   - `catalogue.filename` → new canonical name
+   - `documents.path` → new target
+   - Qdrant payload via `update_qdrant_payload(...)`:
+     `download_url = generate_download_url(new_path, ...)`,
+     `filename`, `original_filename` set on every point for that hash.
+5. `db.mark_organized(hash)` sets `organized_at` + cleans up
+   `processing/{hash}/`.
+
+### Download URL helper (`lib/utils.py:generate_download_url`)
+- If the path is already `http://` or `https://` (transcripts), return
+  it unchanged.
+- Otherwise strip `library_root` prefix and prepend
+  `book_server.base_url` (→ `https://files.echo6.co/`).
+
+---
+
+## 9. StatusDB (`lib/status.py`)
+
+SQLite (`data/recon.db`) in WAL mode with thread-local connections
+(`_get_conn()` uses `threading.local`).
+
+### Tables
+| Table | Purpose |
+|---|---|
+| `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size |
+| `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps |
+| `intel` | ARGUS intel feed entries (separate pipeline) |
+| `metrics_snapshots` | Time-series rollups for the dashboard |
+| `file_operations` | Audit log of Phase-5-style file moves and renames |
+| `duplicate_review` | Level-4 dedupe quarantine queue |
+
+### Key methods
+- `add_to_catalogue(hash, title, url, size, source, category)`
+- `queue_document(hash)` — insert into `documents` with status=`queued`
+- `update_status(hash, status, **kwargs)` — single point of status truth
+- `mark_organized(hash)` — sets `organized_at`, final transition
+- `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` —
+  used by filing worker and Phase 5a un-file
+- `get_path_updates` / `clear_path_update` — small change queue for
+  backfills
+
+### Connection safety
+All writers take a short-lived connection via `_get_conn()`. WAL mode
+allows concurrent readers; writes are serialized at the SQLite level.
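
A minimal sketch of the thread-local connection pattern described in this section, assuming nothing about the real `StatusDB` beyond what is stated here. The class name `StatusDBSketch`, the 30-second busy timeout, and the explicit `commit()` after the single update are illustrative assumptions, not the contents of `lib/status.py`.

```python
import sqlite3
import threading


class StatusDBSketch:
    """Hypothetical stand-in for StatusDB: one SQLite connection per thread."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._local = threading.local()

    def _get_conn(self) -> sqlite3.Connection:
        # Each thread lazily opens its own connection; sqlite3 connections
        # must not be shared across threads by default.
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self.db_path, timeout=30)
            # WAL lets readers proceed while one writer holds the write lock.
            conn.execute("PRAGMA journal_mode=WAL")
            self._local.conn = conn
        return conn

    def update_status(self, doc_hash: str, status: str) -> None:
        conn = self._get_conn()
        conn.execute(
            "UPDATE documents SET status = ? WHERE hash = ?",
            (status, doc_hash),
        )
        # Python's sqlite3 opens an implicit transaction before an UPDATE,
        # so the change is only visible to other connections after commit().
        conn.commit()
```

A worker thread would simply call `db.update_status(doc_hash, 'extracted')`; the `timeout` makes a writer wait for the lock instead of failing immediately when another thread is mid-write.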
+No explicit `BEGIN` — rely on autocommit semantics with occasional
+`conn.commit()` after grouped updates.
+
+---
+
+## 10. Configuration (`config.yaml`)
+
+Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`,
+`PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in
+`config.yaml`, never in git.
+
+### Top-level keys
+| Key | Meaning |
+|---|---|
+| `library_root` | `/mnt/library` — NFS mount root |
+| `processing` | Worker counts, window sizes, timeouts, retry policy |
+| `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense |
+| `sparse_embedding` | Separate service on cortex:8091 |
+| `vector_db` | Qdrant host, port, collection name |
+| `gemini` | Model (`gemini-2.0-flash`), JSON response mode |
+| `web` | Dashboard bind host + port (8420) |
+| `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` |
+| `book_server` | `base_url`, `strip_prefix` for download URL generation |
+| `upload_paths` | Category → filesystem path for upload routing |
+| `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` |
+| `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` |
+| `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` |
+| `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture |
+| `new_pipeline` | Stream-B (old) pipeline, `enabled: false` |
+
+---
+
+## 11. Service & Threads (`recon.py cmd_service`)
+
+`systemctl start recon` → `python3 recon.py service`. The service runs
+seven daemon threads plus a metrics collector:
+
+| Thread | Function | Interval |
+|---|---|---|
+| `dispatcher` | `dispatcher.dispatch_loop` | 30s |
+| `enrich` | `stage_loop('enrich', ...)` | 30s idle |
+| `embed` | `stage_loop('embed', ...)` | 30s idle |
+| `filing` | `filing.filing_worker_loop` | 30s |
+| `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s |
+| `progress` | Log status rollup line | 60s |
+| `dashboard` | `api.run_server` (Flask) | bound |
+
+Plus `peertube_collector.start_collector` for metrics scrape.
+
+All threads receive a shared `stop_event` (`threading.Event`) and exit
+cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`.
+
+### CLI commands (`recon.py` top-level)
+`scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`,
+`catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`,
+`ingest-peertube`, `validate`, `rebuild`, `serve`, `service`,
+`organize`, `pipeline`.
+
+Most commands are thin wrappers around library functions — useful for
+one-off maintenance from the CT 130 shell.
+
+---
+
+## 12. Dashboard & API (`lib/api.py`)
+
+Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja
+templates; data is pulled via AJAX from `/api/*` endpoints.
+
+### Page routes
+`/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`,
+`/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`.
+
+### API surface (grouped)
+| Group | Endpoints |
+|---|---|
+| Upload | `POST /api/upload`, `GET /api/upload//status`, `GET /api/upload/categories` |
+| Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl//status`, `GET /api/ingest-peertube//status` |
+| Search | `POST /api/search` |
+| Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` |
+| Retry | `POST /api/retry/`, `/api/retry-all` |
+| Service | `POST /api/service/restart` |
+| Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` |
+| Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` |
+| VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` |
+| PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/}`, `/api/peertube/stats` |
+| Metrics | `GET /api/metrics/history` |
+
+### Qdrant scroll
+`_qdrant_scroll(host, port, collection, req)` is the shared paged-read
+helper for rebuilding the knowledge-stats panel.
+
+### Cache warmer
+`start_cache_warmer(stop_event)` pre-computes the expensive quick-stats
+and knowledge-stats panels so the dashboard loads instantly.
+
+---
+
+## 13. Filesystem Layout
+
+```
+/opt/recon/
+├── recon.py                    # CLI + service entry point
+├── config.yaml
+├── .env                        # secrets (GEMINI_KEYS etc.)
+├── PROJECT-BIBLE.md            # this file (copy on CT 130)
+├── backups/                    # local DB backups
+├── data/
+│   ├── acquired/               # hopper — {hash}.ext + {hash}.meta.json
+│   │   ├── pdf/
+│   │   ├── stream/             # PeerTube transcripts
+│   │   ├── html/               # (future)
+│   │   └── text/
+│   ├── processing/{hash}/      # in-flight scratch
+│   │   ├── page_NNNN.txt
+│   │   ├── meta.json
+│   │   └── (original file or transcript.txt)
+│   ├── concepts/{hash}/
+│   │   └── window_N.json       # Gemini enrichment output
+│   ├── intel/                  # ARGUS intel feeds
+│   ├── _duplicates/            # level-4 name-match quarantine
+│   ├── _rejected/              # oversize / unreadable PDFs
+│   └── recon.db                # SQLite WAL mode
+├── lib/
+│   ├── acquisition/peertube.py
+│   ├── processors/{pdf,transcript,text}_processor.py
+│   ├── dispatcher.py
+│   ├── filing.py
+│   ├── enricher.py
+│   ├── embedder.py
+│   ├── status.py               # StatusDB class
+│   ├── api.py                  # Flask dashboard + API
+│   ├── new_pipeline.py         # update_qdrant_payload helper lives here
+│   ├── utils.py                # content_hash, generate_download_url, get_config, setup_logging
+│   ├── peertube_scraper.py     # PeerTube API client
+│   └── organizer.py            # determine_dominant_domain, level 1-4 naming
+└── logs/
+
+/mnt/library/                   # NFS from pi-nas, read-write
+├── //.
+└── _acquired/ _review/ _staging/ signal-archive/   # not touched by pipeline
+```
+
+---
+
+## 14. Refactor History (2026-04)
+
+The refactor is tracked as dated phases under `phases/`. Status
+implementations are in the RECON repo; design lives here.
+
+| Phase | Focus |
+|---|---|
+| 0 | Baseline capture — DB dumps, directory listings, config pin |
+| 1 | Scaffolding — create `acquired/`, `processing/`, config keys |
+| 2 | Shared filing function — extract organizer logic into `filing.py` |
+| 3 | Transcript processor — first end-to-end test of the new pattern |
+| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
+| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
+| 5b | Transcript unprocess — clean up stale rows and processing dirs |
+| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
+| 5c-2 | Service start & transcript drain — clear the hopper backlog |
+| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
+| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
+| 6c | Code cleanup — dead-code audit |
+| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
+| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
+
+### Baseline pre-refactor (per `current-state.md`)
+- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
+- Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`.
+- `scan_library()` polled the NFS mount for new PDFs — now deprecated.
+
+---
+
+## 15. Operational Runbook
+
+### Service control (on CT 130 as zvx)
+```bash
+sudo systemctl {status,start,stop,restart} recon
+journalctl -u recon -f
+tail -f /opt/recon/logs/recon.log
+```
+
+### Backups
+```bash
+# Local DB backup before risky operations
+cp /opt/recon/data/recon.db /tmp/recon.db.bak.$(date +%s)
+# Contabo offsite (automatic): rsync every 6 hours, see recon-backup.timer
+```
+
+### Inspect pipeline state at a glance
+```bash
+ls /opt/recon/data/acquired/*/           # hopper contents
+ls /opt/recon/data/processing/ | wc -l   # in-flight count
+sqlite3 /opt/recon/data/recon.db \
+  "SELECT status, COUNT(*) FROM documents GROUP BY status;"
+```
+
+### Re-queue a failed document
+```bash
+sqlite3 /opt/recon/data/recon.db \
+  "UPDATE documents SET status='extracted', retries=0 WHERE hash='';"
+# or via API:
+curl -X POST https://recon.echo6.co/api/retry/
+```
+
+### Manual ingest
+```bash
+# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
+sha=$(sha256sum foo.pdf | cut -d' ' -f1)
+cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
+```
+
+### Qdrant health
+```bash
+curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
+  | jq '.result | {status, points_count, optimizer_status}'
+# status "grey" with optimizer_status.ok=true is healthy (background indexing).
+```
+
+---
+
+## 16. Known Gotchas
+
+- **Logger setup.** RECON modules must use `setup_logging('recon.')`
+  from `lib.utils`, never raw `logging.getLogger()`. The root logger
+  has no handlers; calls to a raw logger silently disappear.
+- **Qdrant status "grey" is healthy** if `optimizer_status.ok == true`.
+  Only treat red + not-ok as a real failure.
+- **Catalogue row count can grow during long-running jobs** because
+  parallel ingestion may add rows. Only a *decrease* is a real
+  integrity failure.
+- **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include
+  `.tmp`, so active acquisition writes are invisible to the dispatcher
+  until the atomic rename lands.
+- **Transcripts are filed in-place.** Their `documents.path` is a URL
+  and filing worker's `path LIKE '/opt/recon/data/processing/%'`
+  filter excludes them.
+- **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption
+  API calls or you'll get throttled.
+- **SSH heredocs with Python code break.** When editing remote files,
+  write to a temp file via `scp` or `cat > file` rather than bash
+  heredocs with parens/quotes.
+- **The crawler is off.** `crawler.sites: []`. Re-enabling requires a
+  re-architecture for the new pipeline.
+
+---
+
+## 17. Credentials & Hosts
+
+| Host | Role | Access |
+|---|---|---|
+| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) |
+| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` |
+| CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` |
+| pi-nas (192.168.1.245) | NFS server for `/mnt/library` | `ssh zvx@pi-nas` |
+| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` |
+
+Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine).
+RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
+
+---
+
+## 18. Open Follow-ups
+
+- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
+  library paths post Phase 5a (audit trail at
+  `/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
+  tombstone.
+- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
+  in config but not implemented. Next-up for Kiwix / web ingest.
+- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
+  is a valuable target list but the old crawler is off pending a new
+  acquisition-module-shaped implementation.
+- **ARGUS intel pipeline** shares the DB but its lifecycle is
+  documented separately — not covered here.
+- **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and
+  needs a redesign before reinstating.
+- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
+  yet; items pile up silently.
+
+---
+
+*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*