refactored-recon/PROJECT-BIBLE.md
Matt b1c05c4d02 PROJECT-BIBLE: fix storage topology — library is LXC bind-mount, not NFS
- Section 2 topology diagram: 'Library (LXC bind) / data /mnt/data/library
  → /mnt/library/ (read/write, local SSD)'
- Section 10 Config table: library_root described as bind-mount root
- Section 13 Filesystem layout: /mnt/library annotated as LXC bind-mount
- Section 14 Refactor history: storage migration note added (NFS history
  preserved as historical context)
- Section 15 Operational runbook: replaced recon-backup.timer reference
  with planned/TBD note
- Section 16 Known Gotchas: new bullet on bind-mount file ownership and
  the absence of NFS / root_squash in the path
- Section 17 Credentials & Hosts: added data host row; rewrote pi-nas
  role to backup target (planned, not yet configured) reflecting the
  2026-04-15 wipe of /export/library
- Section 18 Open Follow-ups: added backup architecture entry capturing
  the missing rsync job and the now-available ~300G pi-nas headroom
2026-04-16 06:50:36 +00:00


# RECON — Project Bible
Canonical architectural reference for **RECON** (the knowledge extraction
pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document
is the orientation dossier for any future session. It is skim-and-find,
not a tutorial.
- **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design)
- **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx)
- **Service:** `systemctl status recon`
- **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik)
- **Files server:** `https://files.echo6.co` (Authentik forward auth)
---
## 1. Mission
RECON ingests documents from multiple sources (manual PDF uploads,
PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and
produces a **searchable, domain-organized library** plus a hybrid
dense/sparse vector index in Qdrant on cortex.
Every piece of content ends up in two places:
1. A file under `/mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext>`
(PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/<uuid>`
(PeerTube transcripts — no local copy after Phase 5a).
2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid`
(dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense).
Search returns page-grounded citations back to the file or stream URL.
---
## 2. System Topology
```
                          ┌─────────────────────────┐
                          │ CT 130 (recon)          │
Library (LXC bind)        │ /opt/recon/             │     ┌─────────────────────────┐
data /mnt/data/library    │  ├─ data/               │     │ Qdrant                  │
→ /mnt/library/           │  │  ├─ acquired/        │     │ cortex:6333             │
(read/write, local SSD)   │  │  ├─ processing/      │ ←→  │ recon_knowledge_hybrid  │
                          │  │  ├─ concepts/        │     │ (1024-d dense + sparse) │
                          │  │  └─ recon.db         │     └─────────────────────────┘
                          │  ├─ lib/                │
                          │  ├─ recon.py            │     ┌─────────────────────────┐
                          │  └─ config.yaml         │ ←→  │ TEI                     │
                          │ recon.service           │     │ cortex:8090             │
                          │ nginx :8888 (files)     │     │ bge-m3 dense            │
                          └─────────────────────────┘     └─────────────────────────┘
                                       ▲                  ┌─────────────────────────┐
                                       │                  │ Sparse svc              │
               ┌───────────────────────┴─────┐        ←→  │ cortex:8091             │
               │                             │            │ bge-m3 sparse           │
PeerTube (CT 110 / stream.echo6.co)     Gemini API        └─────────────────────────┘
api_base: http://192.168.1.170          (enrichment,
                                         vision OCR)
```
The shared Caddy reverse proxy (CT 101) exposes the dashboard (port 8420)
and the nginx file server (port 8888) as `recon.echo6.co` and
`files.echo6.co` respectively.
---
## 3. Pipeline Lifecycle
Every document follows the same five-stage arc regardless of source type.
The filesystem location at any given moment tells you which stage the
item is in — **state is a directory.**
```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 1. ACQUIRE   │ → │ 2. DISPATCH  │ → │ 3. PROCESS   │ → │ 4. ENRICH /  │ → │ 5. FILE      │
│              │   │ (pre_flight) │   │              │   │    EMBED     │   │              │
│ data/acquired│   │ dispatcher.py│   │ per-type     │   │ shared       │   │ shared       │
│ /<type>/     │   │ watches      │   │ processor    │   │ stage loops  │   │ filing worker│
│ <hash>.{ext} │   │ subfolders,  │   │ moves file   │   │ bge-m3 →     │   │ moves file   │
│ <hash>.meta  │   │ hands to     │   │ to processing│   │ Qdrant       │   │ processing → │
│              │   │ processor    │   │ /{hash}/     │   │              │   │ library,     │
│              │   │              │   │              │   │              │   │ updates DB + │
│              │   │              │   │              │   │              │   │ Qdrant       │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```
**Status column on documents:**
`catalogued → queued → extracting → extracted → enriching → enriched →
embedding → complete` (plus terminal `error`, `content_failure`,
`duplicate` states).
`organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after
filing. Transcripts are marked organized in-place during pre_flight
(they have no filesystem target — the watch URL is their "home").
---
## 4. Acquisition Layer (`lib/acquisition/`)
Acquisition modules fetch content from external sources and drop
`{hash}.<ext>` + `{hash}.meta.json` **flat file pairs** into
`data/acquired/<type>/`. They do **not** touch the database — that's the
processor's job.
### Atomic drop protocol
1. Write content to `<hash>.<ext>.tmp` (`.tmp` is not a recognized content
extension, so the dispatcher ignores it).
2. Compute the content hash.
3. Write `<hash>.meta.json.tmp`, then rename it to `<hash>.meta.json`.
4. Rename the content temp file to its final `<hash>.<ext>`.
**Meta goes final first, content goes final last.** The dispatcher only
picks up a pair once the content file exists and is stable, so a
half-visible pair without its meta never gets dispatched.
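
The ordering rule (meta final first, content final last) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `lib/acquisition/`: `atomic_drop` is a hypothetical name, the content is assumed to fit in memory, and SHA-256 is assumed as the hash (it is what the processors use).

```python
import hashlib
import json
import os
from pathlib import Path


def atomic_drop(content: bytes, meta: dict, dest_dir: Path, ext: str = ".txt") -> Path:
    """Drop a content/meta pair so that the content file becomes visible last.

    Hypothetical helper for illustration only.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()

    final_content = dest_dir / f"{digest}{ext}"
    final_meta = dest_dir / f"{digest}.meta.json"
    tmp_content = dest_dir / f"{digest}{ext}.tmp"
    tmp_meta = dest_dir / f"{digest}.meta.json.tmp"

    # Both temp files end in .tmp, which is not a content extension,
    # so a watcher keyed on content extensions never sees them.
    tmp_content.write_bytes(content)
    tmp_meta.write_text(json.dumps(meta))

    # os.replace is an atomic rename within one directory.
    os.replace(tmp_meta, final_meta)        # meta goes final first
    os.replace(tmp_content, final_content)  # content goes final last
    return final_content
```

Because the content rename is the last step, any observer that sees `<hash>.<ext>` can rely on the sidecar already being in place.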
### PeerTube acquisition (`lib/acquisition/peertube.py`)
- Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`.
- Queries catalogue for `source='stream.echo6.co'` rows, builds sets of
known UUIDs (`/w/<uuid>` extracted from `path`) **and** known titles
(from `filename`) — both cohorts are checked so Phase 5b-rewritten
rows and pre-5b library-path rows dedupe correctly.
- Lists PeerTube videos via `peertube_scraper.get_videos`, filters to
those with captions, and prefers the English caption.
- For each new video: fetches the VTT, converts it to text with
`vtt_to_text`, and atomically drops the pair into `data/acquired/stream/`.
- Rate limits at `peertube.rate_limit_delay` (default 0.5s) —
**PeerTube returns 429 if captions are fetched too fast.**
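
The two-cohort dedupe check described above could look roughly like this. It is a minimal sketch, not the code in `lib/acquisition/peertube.py`: `build_known_sets` and `is_new` are hypothetical names, catalogue rows are assumed to be dict-like with `path` and `filename` keys, and each video is assumed to expose `uuid` and `name` fields.

```python
import re
from typing import Iterable, Mapping, Set, Tuple

# Matches a watch URL such as https://stream.echo6.co/w/<uuid>.
WATCH_URL_RE = re.compile(r"^https?://[^/]+/w/([\w-]+)")


def build_known_sets(rows: Iterable[Mapping[str, str]]) -> Tuple[Set[str], Set[str]]:
    """Collect known UUIDs (from watch-URL paths) and known titles (from filenames)."""
    uuids: Set[str] = set()
    titles: Set[str] = set()
    for row in rows:
        match = WATCH_URL_RE.match(row.get("path") or "")
        if match:
            uuids.add(match.group(1))
        if row.get("filename"):
            titles.add(row["filename"])
    return uuids, titles


def is_new(video: Mapping[str, str], uuids: Set[str], titles: Set[str]) -> bool:
    """A video is new only if neither its UUID nor its title is already catalogued."""
    return video["uuid"] not in uuids and video["name"] not in titles
```

Checking both sets means a row whose `path` is still a library filesystem path is caught by title even though it carries no UUID.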
### Manual uploads / URL ingest
`api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`,
`/api/ingest-peertube` — all end by dropping a pair into `acquired/<type>/`.
---
## 5. Dispatcher (`lib/dispatcher.py`)
The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It
watches each configured subfolder under `data/acquired/` and hands
stable file pairs to the registered processor.
### Config-driven dispatch table
```yaml
pipeline:
  acquired_root: /opt/recon/data/acquired
  processing_root: /opt/recon/data/processing
  dispatch:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor  # not yet implemented
    text: text_processor
  mtime_stability_seconds: 10
```
### Extension constants
- `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the
dispatcher considers a file "content" only if its extension is in
this set. **`.tmp` is not in the set**, so partial writes are safe.
- `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}`: files
with these extensions are normalized to PDF **before** dispatch.
### Normalization step
`_normalize_formats(subfolder_path)`:
- `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI).
- `.doc` / `.docx` → PDF via `libreoffice --headless`.
- Sidecar `.meta.json` is renamed to match the new PDF hash so pairing
holds.
### Pair finding
`_find_pairs(subfolder_path)` returns tuples of (content_path,
meta_path_or_None). Pairs where only content exists are still valid —
meta is not required. A meta without its content is ignored.
### Stability check
`_is_stable(filepath, stability_seconds)` — mtime must be at least
`mtime_stability_seconds` old (default 10s) before dispatch. Prevents
racing active writers.
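
Pair finding and the stability check can be illustrated together. The sketch below is an approximation under assumptions, not the dispatcher's actual code: the function names drop the leading underscore to mark them as stand-ins, and the sidecar is assumed to be named `<stem>.meta.json` next to the content file.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple

CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}


def is_stable(path: Path, stability_seconds: float, now: Optional[float] = None) -> bool:
    """True once the file's mtime is at least stability_seconds in the past."""
    now = time.time() if now is None else now
    return now - path.stat().st_mtime >= stability_seconds


def find_pairs(subfolder: Path) -> List[Tuple[Path, Optional[Path]]]:
    """Return (content_path, meta_path_or_None) for every content file.

    Files ending in .tmp or .json are skipped because their suffix is not a
    content extension, so an orphaned meta file is ignored automatically.
    """
    pairs: List[Tuple[Path, Optional[Path]]] = []
    for content in sorted(subfolder.iterdir()):
        if not content.is_file() or content.suffix.lower() not in CONTENT_EXTENSIONS:
            continue
        meta = content.with_name(content.stem + ".meta.json")
        pairs.append((content, meta if meta.exists() else None))
    return pairs
```

A dispatch cycle would then hand over only those pairs whose content file passes `is_stable(content, mtime_stability_seconds)`.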
---
## 6. Processors (`lib/processors/`)
Each processor implements **one function**: `pre_flight(content_path,
meta_path, db, config) → dict`. It owns all the type-specific logic and
**all the database writes** for that item up to status=`extracted`.
### Common pre_flight contract
Every processor does, in order:
1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`).
2. Stale state cleanup: `rm -rf processing/{hash}/` and
`concepts/{hash}/` if they exist (guards against re-runs).
3. Hash dedupe: if `hash` already exists in `catalogue`, delete the
pair, return action `duplicate`.
4. Type-specific metadata extraction + level-4 dedupe check (PDF only).
5. Move content + meta into `processing/{hash}/` with a type-specific
layout.
6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir`
and `page_count`, `db.update_status(hash, 'extracted', ...)`.
Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are
one of: `extracted`, `duplicate`, `level4_duplicate` (PDF only),
`content_failure`, `error`, or `skip_empty` (text only).
### `pdf_processor.py`
The heaviest processor. Layered metadata extraction:
- **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to
`{title, author, edition, year}`.
- **Source B — Filename:** regex parse the original filename.
- **Source C — Gemini Vision OCR** on first 3 pages when A+B
disagree or are missing. Returns structured JSON via Gemini's
`response_mime_type: application/json`.
- **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources;
2-of-3 wins; ties prefer Source A.
- **Level-4 dedupe:** if all four fields (`title, edition, author,
year`) are present and match an existing catalogue row with a
different hash, the PDF is quarantined to `_duplicates/` for human
review.
- **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize
PDFs move to `_rejected/`.
- **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract
OCR → Gemini Vision on a per-page basis. Output:
`processing/{hash}/page_NNNN.txt`.
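
The 2-of-3 vote with a Source A tie-break can be sketched per field. This is a hedged illustration rather than the real `_vote_metadata`: the names `vote_field` and `vote_metadata` are stand-ins, values are compared by exact equality (the real code may normalize case or whitespace first), and when A is empty the fallback order B then C is an assumption.

```python
from collections import Counter
from typing import Dict, Optional

FIELDS = ("title", "author", "edition", "year")


def vote_field(a: Optional[str], b: Optional[str], c: Optional[str]) -> Optional[str]:
    """Pick one value from sources A, B, C: majority wins, otherwise prefer A."""
    candidates = [v for v in (a, b, c) if v not in (None, "")]
    if not candidates:
        return None
    value, count = Counter(candidates).most_common(1)[0]
    if count >= 2:
        return value
    # No majority: candidates keeps A, B, C order, so A wins when present.
    return candidates[0]


def vote_metadata(a: Dict, b: Dict, c: Dict) -> Dict[str, Optional[str]]:
    """Reconcile three metadata dicts field by field."""
    return {f: vote_field(a.get(f), b.get(f), c.get(f)) for f in FIELDS}
```

For example, if the filename and OCR agree on a title that differs from the PDF dict, the agreed title wins; if all three years differ, the PDF dict's year is kept.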
### `transcript_processor.py`
Lightweight. The VTT→text conversion already happened in acquisition,
so pre_flight just:
- Hashes the `<hash>.txt` file.
- Reads the `meta.json` sidecar.
- `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into
`page_NNNN.txt` files.
- Writes the transcript as `processing/{hash}/transcript.txt` plus page
chunks.
- Registers with category `Transcript`, source `stream.echo6.co`.
- Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP`
immediately** — transcripts are filed-in-place (their "location" is
the PeerTube watch URL, set later as the catalogue `path` via Phase
5a).
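
Word-count paging of a transcript might look like the sketch below. It is an assumption-laden stand-in for `chunk_text`: splitting on whitespace, the default of 2000 words, and page numbering that starts at `page_0001.txt` are guesses consistent with the description, and `write_pages` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def chunk_text(raw_text: str, words_per_page: int = 2000) -> List[str]:
    """Split text into pages of at most words_per_page whitespace-separated words."""
    words = raw_text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def write_pages(pages: List[str], out_dir: Path) -> None:
    """Write pages as page_0001.txt, page_0002.txt, ... (1-based is an assumption)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, page in enumerate(pages, start=1):
        (out_dir / f"page_{number:04d}.txt").write_text(page)
```

A 4,500-word transcript would produce three pages of 2,000, 2,000, and 500 words.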
### `text_processor.py`
Raw `.txt` files dropped via manual upload. Two-source metadata vote
(filename + meta.json). Similar flow to transcript processor but no
fixed category or source.
---
## 7. Enrichment & Embedding
Both are **source-agnostic stage loops** that just poll documents by
status and do their work. They live in `lib/enricher.py` and
`lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`.
### Enrichment (`enrich_workers: 16` threads per batch)
- Polls `status = 'extracted' AND retries < max_retries`.
- Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`.
- Windows pages (`enrich_window_size: 5` per window) and sends each
window to Gemini with a structured prompt.
- Stores `concepts/{hash}/window_N.json` per window.
- Backoff: `enrich_base_delay=5s`, doubling up to
`enrich_max_delay=120s`, max `enrich_max_retries=5`.
- On success: `update_status(hash, 'enriched')`.
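
The windowing and backoff numbers above can be made concrete with a small sketch. Both helpers are hypothetical and not taken from `lib/enricher.py`; the windows are assumed to be non-overlapping, and the delay is assumed to double on each retry starting from the base.

```python
from typing import List, Sequence


def window_pages(pages: Sequence[str], window_size: int = 5) -> List[Sequence[str]]:
    """Group pages into consecutive, non-overlapping windows."""
    return [pages[i:i + window_size] for i in range(0, len(pages), window_size)]


def backoff_delays(base: float = 5.0, cap: float = 120.0, max_retries: int = 5) -> List[float]:
    """Delay before each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(base * (2 ** attempt), cap) for attempt in range(max_retries)]
```

With the documented defaults the retry delays are 5, 10, 20, 40, and 80 seconds, so the 120-second cap only comes into play if `enrich_max_retries` is raised above 5.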
### Embedding (`embed_workers: 4`)
- Polls `status = 'enriched'`.
- Reads concept JSONs, builds page-level chunks.
- Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of
128 per TEI request. Throughput ~1,711 emb/sec.
- Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse
mode; `sparse_embedding.enabled: true`).
- Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`,
batch size `embed_batch_size=500` vectors per upsert.
- Payload carries: `hash`, `filename`, `original_filename`,
`download_url`, `page`, `text`, `title`, `domain`, `subdomain`,
`category`.
- Ollama is a fallback backend (much slower, ~8 emb/sec) via
`embedding.backend: ollama`.
- On success: `update_status(hash, 'complete')`.
---
## 8. Filing (`lib/filing.py`)
One daemon thread, `filing_worker_loop(interval=30)`. It polls:
```sql
SELECT hash FROM documents
WHERE status = 'complete'
AND organized_at IS NULL
AND path LIKE '/opt/recon/data/processing/%'
LIMIT 50
```
The `path LIKE '/opt/recon/data/processing/%'` filter naturally
**excludes transcripts** — their `documents.path` was never a
filesystem path but the PeerTube watch URL.
For each row, `file_processed_item(doc_hash, source_file_path, db,
config)` does:
1. `determine_dominant_domain(hash)` reads concept JSONs, returns the
top-voted `Domain/Subdomain`.
2. `_build_target_path(...)` derives the canonical name starting at
level 1 (`Title`), escalating to level 2/3/4 only if a collision
exists in the target folder. **Preserves source file's actual
extension** (not hardcoded to `.pdf`).
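3. `shutil.move(source, target)`. The move is an atomic rename only when
source and target are on the same filesystem; otherwise `shutil.move`
copies and then deletes. Target is
`/mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>`.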
4. Updates:
- `catalogue.path` → new target
- `catalogue.filename` → new canonical name
- `documents.path` → new target
- Qdrant payload via `update_qdrant_payload(...)`:
`download_url = generate_download_url(new_path, ...)`,
`filename`, `original_filename` set on every point for that hash.
5. `db.mark_organized(hash)` sets `organized_at` + cleans up
`processing/{hash}/`.
### Download URL helper (`lib/utils.py:generate_download_url`)
- If the path is already `http://` or `https://` (transcripts), return
it unchanged.
- Otherwise strip `library_root` prefix and prepend
`book_server.base_url` (→ `https://files.echo6.co/<rel>`).
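
The two rules above translate into a few lines. This is a sketch of the behavior, not the helper in `lib/utils.py`: the explicit `library_root` and `base_url` parameters stand in for whatever the real function reads from config, and percent-encoding the relative path is an assumption.

```python
from urllib.parse import quote


def generate_download_url(
    path: str,
    library_root: str = "/mnt/library",
    base_url: str = "https://files.echo6.co",
) -> str:
    """Map a library path to a file-server URL; pass URLs through unchanged."""
    if path.startswith(("http://", "https://")):
        return path  # transcripts already point at their watch URL
    root = library_root.rstrip("/") + "/"
    rel = path[len(root):] if path.startswith(root) else path.lstrip("/")
    # quote() keeps "/" by default and escapes spaces and other unsafe characters.
    return f"{base_url.rstrip('/')}/{quote(rel)}"
```

For instance, a made-up path such as `/mnt/library/Medicine/Trauma/Field Guide.pdf` would map to `https://files.echo6.co/Medicine/Trauma/Field%20Guide.pdf`.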
---
## 9. StatusDB (`lib/status.py`)
SQLite (`data/recon.db`) in WAL mode with thread-local connections
(`_get_conn()` uses `threading.local`).
### Tables
| Table | Purpose |
|---|---|
| `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size |
| `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps |
| `intel` | ARGUS intel feed entries (separate pipeline) |
| `metrics_snapshots` | Time-series rollups for the dashboard |
| `file_operations` | Audit log of Phase-5-style file moves and renames |
| `duplicate_review` | Level-4 dedupe quarantine queue |
### Key methods
- `add_to_catalogue(hash, title, url, size, source, category)`
- `queue_document(hash)` — insert into `documents` with status=`queued`
- `update_status(hash, status, **kwargs)` — single point of status truth
- `mark_organized(hash)` — sets `organized_at`, final transition
- `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` —
used by filing worker and Phase 5a un-file
- `get_path_updates` / `clear_path_update` — small change queue for
backfills
### Connection safety
All writers take a short-lived connection via `_get_conn()`. WAL mode
allows concurrent readers; writes are serialized at the SQLite level.
No explicit `BEGIN` — rely on autocommit semantics with occasional
`conn.commit()` after grouped updates.
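
The thread-local connection pattern can be sketched as below. `StatusDBSketch` is a hypothetical, stripped-down stand-in for the real class in `lib/status.py`; the 30-second busy timeout and the `sqlite3.Row` row factory are assumptions.

```python
import sqlite3
import threading


class StatusDBSketch:
    """Minimal illustration: one SQLite connection per thread, WAL enabled."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        self._local = threading.local()

    def _get_conn(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # Each thread lazily opens its own connection on first use.
            conn = sqlite3.connect(self.db_path, timeout=30)
            conn.execute("PRAGMA journal_mode=WAL")
            conn.row_factory = sqlite3.Row
            self._local.conn = conn
        return conn

    def update_status(self, doc_hash: str, status: str) -> None:
        conn = self._get_conn()
        conn.execute(
            "UPDATE documents SET status = ? WHERE hash = ?", (status, doc_hash)
        )
        conn.commit()
```

Keeping connections thread-local avoids sharing a `sqlite3.Connection` across threads, while WAL lets readers proceed while one writer commits.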
---
## 10. Configuration (`config.yaml`)
Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`,
`PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in
`config.yaml`, never in git.
### Top-level keys
| Key | Meaning |
|---|---|
| `library_root` | `/mnt/library` — LXC bind-mount root (data host `/mnt/data/library`, local SSD) |
| `processing` | Worker counts, window sizes, timeouts, retry policy |
| `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense |
| `sparse_embedding` | Separate service on cortex:8091 |
| `vector_db` | Qdrant host, port, collection name |
| `gemini` | Model (`gemini-2.0-flash`), JSON response mode |
| `web` | Dashboard bind host + port (8420) |
| `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` |
| `book_server` | `base_url`, `strip_prefix` for download URL generation |
| `upload_paths` | Category → filesystem path for upload routing |
| `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` |
| `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` |
| `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` |
| `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture |
| `new_pipeline` | Stream-B (old) pipeline, `enabled: false` |
---
## 11. Service & Threads (`recon.py cmd_service`)
`systemctl start recon` → `python3 recon.py service`. The service runs
seven daemon threads plus a metrics collector:
| Thread | Function | Interval |
|---|---|---|
| `dispatcher` | `dispatcher.dispatch_loop` | 30s |
| `enrich` | `stage_loop('enrich', ...)` | 30s idle |
| `embed` | `stage_loop('embed', ...)` | 30s idle |
| `filing` | `filing.filing_worker_loop` | 30s |
| `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s |
| `progress` | Log status rollup line | 60s |
| `dashboard` | `api.run_server` (Flask) | bound |
Plus `peertube_collector.start_collector` for metrics scrape.
All threads receive a shared `stop_event` (`threading.Event`) and exit
cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`.
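
The shared `stop_event` pattern is roughly the following. This is a generic sketch, not the body of `cmd_service`: `run_loop` and `start_loop` are hypothetical names, and the real loops take more arguments (`db`, `config`, and so on).

```python
import threading
from typing import Callable


def run_loop(work: Callable[[], None], stop_event: threading.Event, interval: float) -> None:
    """Call work() repeatedly until stop_event is set."""
    while not stop_event.is_set():
        work()
        # Event.wait doubles as an interruptible sleep: it returns early
        # as soon as another thread or a signal handler sets the event.
        stop_event.wait(interval)


def start_loop(
    name: str, work: Callable[[], None], stop_event: threading.Event, interval: float
) -> threading.Thread:
    """Start run_loop in a named daemon thread and return the thread."""
    thread = threading.Thread(
        target=run_loop, args=(work, stop_event, interval), name=name, daemon=True
    )
    thread.start()
    return thread
```

Using `stop_event.wait(interval)` instead of `time.sleep(interval)` is what lets a thread with an 1800-second interval exit promptly after SIGTERM.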
### CLI commands (`recon.py` top-level)
`scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`,
`catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`,
`ingest-peertube`, `validate`, `rebuild`, `serve`, `service`,
`organize`, `pipeline`.
Most commands are thin wrappers around library functions — useful for
one-off maintenance from the CT 130 shell.
---
## 12. Dashboard & API (`lib/api.py`)
Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja
templates; data is pulled via AJAX from `/api/*` endpoints.
### Page routes
`/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`,
`/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`.
### API surface (grouped)
| Group | Endpoints |
|---|---|
| Upload | `POST /api/upload`, `GET /api/upload/<hash>/status`, `GET /api/upload/categories` |
| Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl/<id>/status`, `GET /api/ingest-peertube/<job>/status` |
| Search | `POST /api/search` |
| Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` |
| Retry | `POST /api/retry/<hash>`, `/api/retry-all` |
| Service | `POST /api/service/restart` |
| Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` |
| Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` |
| VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` |
| PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}`, `/api/peertube/stats` |
| Metrics | `GET /api/metrics/history` |
### Qdrant scroll
`_qdrant_scroll(host, port, collection, req)` is the shared paged-read
helper for rebuilding the knowledge-stats panel.
### Cache warmer
`start_cache_warmer(stop_event)` pre-computes the expensive quick-stats
and knowledge-stats panels so the dashboard loads instantly.
---
## 13. Filesystem Layout
```
/opt/recon/
├── recon.py # CLI + service entry point
├── config.yaml
├── .env # secrets (GEMINI_KEYS etc.)
├── PROJECT-BIBLE.md # this file (copy on CT 130)
├── backups/ # local DB backups
├── data/
│ ├── acquired/ # hopper — {hash}.ext + {hash}.meta.json
│ │ ├── pdf/
│ │ ├── stream/ # PeerTube transcripts
│ │ ├── html/ # (future)
│ │ └── text/
│ ├── processing/{hash}/ # in-flight scratch
│ │ ├── page_NNNN.txt
│ │ ├── meta.json
│ │ └── (original file or transcript.txt)
│ ├── concepts/{hash}/
│ │ └── window_N.json # Gemini enrichment output
│ ├── intel/ # ARGUS intel feeds
│ ├── _duplicates/ # level-4 name-match quarantine
│ ├── _rejected/ # oversize / unreadable PDFs
│ └── recon.db # SQLite WAL mode
├── lib/
│ ├── acquisition/peertube.py
│ ├── processors/{pdf,transcript,text}_processor.py
│ ├── dispatcher.py
│ ├── filing.py
│ ├── enricher.py
│ ├── embedder.py
│ ├── status.py # StatusDB class
│ ├── api.py # Flask dashboard + API
│ ├── new_pipeline.py # update_qdrant_payload helper lives here
│ ├── utils.py # content_hash, generate_download_url, get_config, setup_logging
│ ├── peertube_scraper.py # PeerTube API client
│ └── organizer.py # determine_dominant_domain, level 1-4 naming
└── logs/
/mnt/library/ # LXC bind-mount from data host /mnt/data/library (local SSD), read-write
├── <Domain>/<Subdomain>/<canonical_name>.<ext>
└── _acquired/ _review/ _staging/ signal-archive/ # not touched by pipeline
```
---
## 14. Refactor History (2026-04)
The refactor is tracked as dated phases under `phases/`. Implementations
are in the RECON repo; the design lives here.
| Phase | Focus |
|---|---|
| 0 | Baseline capture — DB dumps, directory listings, config pin |
| 1 | Scaffolding — create `acquired/`, `processing/`, config keys |
| 2 | Shared filing function — extract organizer logic into `filing.py` |
| 3 | Transcript processor — first end-to-end test of the new pattern |
| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
| 5c-2 | Service start & transcript drain — clear the hopper backlog |
| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
| 6c | Code cleanup — dead-code audit |
| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |
### Baseline pre-refactor (per `current-state.md`)
- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
- Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`.
- `scan_library()` polled the NFS mount for new PDFs — now deprecated.
- *Storage migration note:* `/mnt/library` was historically an NFS
mount from `pi-nas:/export/library`, which is what `current-state.md`
and `scan_library()` were written against. The library has since
been migrated to local SSD on the data Proxmox host
(`/mnt/data/library`) and surfaced into CT 130 via an LXC
bind-mount. The pi-nas copy was wiped on 2026-04-15. Path strings
inside the codebase didn't change; only the underlying storage did.
---
## 15. Operational Runbook
### Service control (on CT 130 as zvx)
```bash
sudo systemctl {status,start,stop,restart} recon
journalctl -u recon -f
tail -f /opt/recon/logs/recon.log
```
### Backups
```bash
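# Local DB backup before risky operations.
# recon.db runs in WAL mode, so use sqlite3's .backup rather than a plain cp,
# which can miss changes still sitting in the -wal file.
sqlite3 /opt/recon/data/recon.db ".backup /tmp/recon.db.bak.$(date +%s)"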
# Offsite backup: planned, not yet configured (TBD — likely rsync to
# pi-nas:/export/recon-backup once a backup target is provisioned).
```
### Inspect pipeline state at a glance
```bash
ls /opt/recon/data/acquired/*/ # hopper contents
ls /opt/recon/data/processing/ | wc -l # in-flight count
sqlite3 /opt/recon/data/recon.db \
"SELECT status, COUNT(*) FROM documents GROUP BY status;"
```
### Re-queue a failed document
```bash
sqlite3 /opt/recon/data/recon.db \
"UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
# or via API:
curl -X POST https://recon.echo6.co/api/retry/<hash>
```
### Manual ingest
```bash
# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
sha=$(sha256sum foo.pdf | cut -d' ' -f1)
cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
```
### Qdrant health
```bash
curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
| jq '.result | {status, points_count, optimizer_status}'
# status "grey" with optimizer_status "ok" is healthy (background indexing).
```
---
## 16. Known Gotchas
- **Logger setup.** RECON modules must use `setup_logging('recon.<name>')`
from `lib.utils`, never raw `logging.getLogger()`. The root logger
has no handlers; calls to a raw logger silently disappear.
- **Qdrant status "grey" is healthy** if `optimizer_status` is `"ok"`.
Only treat `red` combined with a non-ok optimizer status as a real failure.
- **Catalogue row count can grow during long-running jobs** because
parallel ingestion may add rows. Only a *decrease* is a real
integrity failure.
- **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include
`.tmp`, so active acquisition writes are invisible to the dispatcher
until the atomic rename lands.
- **Transcripts are filed in-place.** Their `documents.path` is a URL,
and the filing worker's `path LIKE '/opt/recon/data/processing/%'`
filter excludes them.
- **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption
API calls or you'll get throttled.
- **Library is an LXC bind-mount, not NFS.** `/mnt/library` on CT 130 is
bound from the data Proxmox host's `/mnt/data/library` (local ext4 on
/dev/sda1). File ownership/UID-GID is shared with the host — writes
from inside the container appear with the container UID on the host.
No NFS, no `root_squash`, no network in the path.
- **SSH heredocs with Python code break.** When editing remote files,
write to a temp file via `scp` or `cat > file` rather than bash
heredocs with parens/quotes.
- **The crawler is off.** `crawler.sites: []`. Re-enabling requires a
re-architecture for the new pipeline.
---
## 17. Credentials & Hosts
| Host | Role | Access |
|---|---|---|
| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) |
| data host (192.168.1.240) | Proxmox node hosting CT 130; `/mnt/data/library` source for the CT 130 bind-mount | `ssh root@192.168.1.240` |
| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` |
| CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` |
| pi-nas (192.168.1.245) | Backup target (planned; not yet configured). ~22T pool with ~300G free after library wipe. | `ssh zvx@pi-nas` |
| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` |
Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine).
RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
---
## 18. Open Follow-ups
- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
library paths post Phase 5a/6k (audit trail at
`/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
hand-resolve or tombstone.
- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
in config but not implemented. Next-up for Kiwix / web ingest.
- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
is a valuable target list but the old crawler is off pending a new
acquisition-module-shaped implementation.
- **ARGUS intel pipeline** shares the DB but its lifecycle is
documented separately — not covered here.
- **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and
needs a redesign before reinstating.
- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
yet; items pile up silently.
- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
extraction output from the pre-refactor pipeline, for documents
still in catalogue. Not touched by current pipeline. Can be cleaned
up once confirmed none are the sole text copy for any document.
- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
library management CLI tool, not the refactor's new pipeline.
It contains the `update_qdrant_payload` helper that the filing worker
depends on, and should be renamed (e.g., `library_ops.py`) when there's time.
- **SSH key for CT 130 forge access** — currently uses HTTPS with
embedded token in remote URL. Move to SSH key auth.
- **Backup policy for derived data** — `/opt/recon/data/concepts/` and
Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
lose their disks, these are the hardest to regenerate (Gemini calls
+ embedding compute).
- **Backup architecture** — no offsite backup is currently configured.
Section 15 references a planned rsync-to-pi-nas job, but neither the
script nor the systemd timer (`recon-backup.timer`) exist. Decide
what gets backed up (`recon.db`, `concepts/`, `text/`, Qdrant
snapshots, `/mnt/library`?), where, and on what cadence; pi-nas has
~300G free in `/export/` after the 2026-04-15 library wipe and could
be the target for a first pass.
- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
files, not library content. Matt intends these to "eventually
contribute" to the knowledge base but no ingestion path exists yet.
---
*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*