mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 14:44:39 +02:00

Matt b1c05c4d02 PROJECT-BIBLE: fix storage topology — library is LXC bind-mount, not NFS

- Section 2 topology diagram: 'Library (LXC bind) / data /mnt/data/library
  → /mnt/library/ (read/write, local SSD)'
- Section 10 Config table: library_root described as bind-mount root
- Section 13 Filesystem layout: /mnt/library annotated as LXC bind-mount
- Section 14 Refactor history: storage migration note added (NFS history
  preserved as historical context)
- Section 15 Operational runbook: replaced recon-backup.timer reference
  with planned/TBD note
- Section 16 Known Gotchas: new bullet on bind-mount file ownership and
  the absence of NFS / root_squash in the path
- Section 17 Credentials & Hosts: added data host row; rewrote pi-nas
  role to backup target (planned, not yet configured) reflecting the
  2026-04-15 wipe of /export/library
- Section 18 Open Follow-ups: added backup architecture entry capturing
  the missing rsync job and the now-available ~300G pi-nas headroom

2026-04-16 06:50:36 +00:00

33 KiB

Raw Blame History

RECON — Project Bible

Canonical architectural reference for RECON (the knowledge extraction pipeline running on CT 130 / data.echo6 / 100.64.0.24). This document is the orientation dossier for any future session. It is skim-and-find, not a tutorial.

Repo: ssh://git@forge.echo6.co:2222/matt/refactored-recon.git (design)
Code: /opt/recon/ on CT 130 (zvx owns the tree; service runs as zvx)
Service: systemctl status recon
Dashboard: https://recon.echo6.co (zvx-only via Authentik)
Files server: https://files.echo6.co (Authentik forward auth)

1. Mission

RECON ingests documents from multiple sources (manual PDF uploads, PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and produces a searchable, domain-organized library plus a hybrid dense/sparse vector index in Qdrant on cortex.

Every piece of content ends up in two places:

A file under /mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext> (PDFs, HTML) or at a source URL like https://stream.echo6.co/w/<uuid> (PeerTube transcripts — no local copy after Phase 5a).
Page-level embeddings in Qdrant collection recon_knowledge_hybrid (dense bge-m3 + sparse SPLADE-style vectors, 1024-dim dense).

Search returns page-grounded citations back to the file or stream URL.

2. System Topology

                        ┌─────────────────────────┐
                        │  CT 130 (recon)         │
  Library (LXC bind)    │  /opt/recon/            │     ┌──────────────┐
  data /mnt/data/library│   ├─ data/              │     │ Qdrant       │
  → /mnt/library/       │   │   ├─ acquired/      │     │ cortex:6333  │
  (read/write, local SSD)│  │   ├─ processing/    │ ←→  │ recon_knowledge_hybrid
                        │   │   ├─ concepts/      │     │ (1024-d dense + sparse)
                        │   │   └─ recon.db       │     └──────────────┘
                        │   ├─ lib/               │
                        │   ├─ recon.py           │     ┌──────────────┐
                        │   └─ config.yaml        │ ←→  │ TEI          │
                        │  recon.service          │     │ cortex:8090  │
                        │  nginx :8888 (files)    │     │ bge-m3 dense │
                        └─────────────────────────┘     └──────────────┘
                                   ▲                    ┌──────────────┐
                                   │                    │ Sparse svc   │
           ┌───────────────────────┴─────┐         ←→   │ cortex:8091  │
           │                             │              │ bge-m3 sparse│
 PeerTube (CT 110 / stream.echo6.co)    Gemini API      └──────────────┘
 api_base: http://192.168.1.170          (enrichment,
                                          vision OCR)

Shared caddy reverse proxy (CT 101) surfaces the dashboard (8420) and nginx file server (8888) as recon.echo6.co and files.echo6.co.

3. Pipeline Lifecycle

Every document follows the same five-stage arc regardless of source type. The filesystem location at any given moment tells you which stage the item is in — state is a directory.

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │ 1. ACQUIRE   │ → │ 2. DISPATCH  │ → │ 3. PROCESS   │ → │ 4. ENRICH /  │ → │ 5. FILE      │
  │              │   │ (pre_flight) │   │              │   │    EMBED     │   │              │
  │ data/acquired│   │ dispatcher.py│   │ per-type     │   │ shared       │   │ shared       │
  │ /<type>/     │   │ watches      │   │ processor    │   │ stage loops  │   │ filing worker│
  │ <hash>.{ext} │   │ subfolders,  │   │ moves file   │   │ bge-m3 →     │   │ moves file   │
  │ <hash>.meta  │   │ hands to     │   │ to processing│   │ Qdrant       │   │ processing → │
  │              │   │ processor    │   │ /{hash}/     │   │              │   │ library,     │
  │              │   │              │   │              │   │              │   │ updates DB + │
  │              │   │              │   │              │   │              │   │ Qdrant       │
  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘

Status column on documents: catalogued → queued → extracting → extracted → enriching → enriched → embedding → complete (plus terminal error, content_failure, duplicate states).

organized_at IS NULL while in flight, set to CURRENT_TIMESTAMP after filing. Transcripts are marked organized in-place during pre_flight (they have no filesystem target — the watch URL is their "home").

4. Acquisition Layer (`lib/acquisition/`)

Acquisition modules fetch content from external sources and drop {hash}.<ext> + {hash}.meta.json flat file pairs into data/acquired/<type>/. They do not touch the database — that's the processor's job.

Atomic drop protocol

Write content to <hash>.<ext>.tmp (unknown extension, safe from dispatcher).
Compute hash; rename tmp to final <hash>.<ext>.
Write <hash>.meta.json.tmp, then rename to <hash>.meta.json.
Meta goes final first, content goes final last. Dispatcher only picks up when content file exists and is stable, so a half-visible pair without meta never gets dispatched.

PeerTube acquisition (`lib/acquisition/peertube.py`)

Daemon loop acquisition_loop(stop_event, db, config, interval=1800).
Queries catalogue for source='stream.echo6.co' rows, builds sets of known UUIDs (/w/<uuid> extracted from path) and known titles (from filename) — both cohorts are checked so Phase 5b-rewritten rows and pre-5b library-path rows dedupe correctly.
Lists PeerTube videos via peertube_scraper.get_videos, filters to those with captions, prefers English caption.
For each new one: fetches VTT, converts to text with vtt_to_text, atomically drops pair into data/acquired/stream/.
Rate limits at peertube.rate_limit_delay (default 0.5s) — PeerTube returns 429 if captions are fetched too fast.

Manual uploads / URL ingest

api.py exposes /api/upload, /api/ingest-url, /api/ingest-urls, /api/ingest-peertube — all end by dropping a pair into acquired/<type>/.

5. Dispatcher (`lib/dispatcher.py`)

The dispatcher is one daemon thread (dispatch_loop, interval=30s). It watches each configured subfolder under data/acquired/ and hands stable file pairs to the registered processor.

Config-driven dispatch table

pipeline:
  acquired_root: /opt/recon/data/acquired
  processing_root: /opt/recon/data/processing
  dispatch:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor        # not yet implemented
    text: text_processor
  mtime_stability_seconds: 10

Extension constants

CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'} — the dispatcher considers a file "content" only if its extension is in this set. .tmp is not in the set, so partial writes are safe.
CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'} — these are normalized to PDF before dispatch.

Normalization step

_normalize_formats(subfolder_path):

.epub / .mobi → PDF via ebook-convert (Calibre CLI).
.doc / .docx → PDF via libreoffice --headless.
Sidecar .meta.json is renamed to match the new PDF hash so pairing holds.

Pair finding

_find_pairs(subfolder_path) returns tuples of (content_path, meta_path_or_None). Pairs where only content exists are still valid — meta is not required. A meta without its content is ignored.

Stability check

_is_stable(filepath, stability_seconds) — mtime must be at least mtime_stability_seconds old (default 10s) before dispatch. Prevents racing active writers.

6. Processors (`lib/processors/`)

Each processor implements one function: pre_flight(content_path, meta_path, db, config) → dict. It owns all the type-specific logic and all the database writes for that item up to status=extracted.

Common pre_flight contract

Every processor does, in order:

Hash content (SHA-256 via content_hash() in lib/utils.py).
Stale state cleanup: rm -rf processing/{hash}/ and concepts/{hash}/ if they exist (guards against re-runs).
Hash dedupe: if hash already exists in catalogue, delete the pair, return action duplicate.
Type-specific metadata extraction + level-4 dedupe check (PDF only).
Move content + meta into processing/{hash}/ with a type-specific layout.
db.add_to_catalogue, db.queue_document, set documents.text_dir and page_count, db.update_status(hash, 'extracted', ...).

Return dict keys: hash, action, source_path, error. Actions are one of: extracted, duplicate, level4_duplicate, content_failure, error, and duplicate (for transcripts) or skip_empty (for text).

`pdf_processor.py`

The heaviest processor. Layered metadata extraction:

Source A — PDF dict: PdfReader(...).metadata, mapped to {title, author, edition, year}.
Source B — Filename: regex parse the original filename.
Source C — Gemini Vision OCR on first 3 pages when A+B disagree or are missing. Returns structured JSON via Gemini's response_mime_type: application/json.
Voting: _vote_metadata(A, B, C) reconciles the three sources; 2-of-3 wins; ties prefer Source A.
Level-4 dedupe: if all four fields (title, edition, author, year) are present and match an existing catalogue row with a different hash, the PDF is quarantined to _duplicates/ for human review.
Size cap: processing.max_pdf_size_mb (default 2000MB). Oversize PDFs move to _rejected/.
Text extraction order: PyPDF2 → pdftotext (poppler) → Tesseract OCR → Gemini Vision on a per-page basis. Output: processing/{hash}/page_NNNN.txt.

`transcript_processor.py`

Lightweight. The VTT→text conversion already happened in acquisition, so pre_flight just:

Hashes <hash>.txt file.
Reads meta.json sidecar.
chunk_text(raw_text, WORDS_PER_PAGE=2000) splits into page_NNNN.txt files.
Writes the transcript as processing/{hash}/transcript.txt plus page chunks.
Registers with category Transcript, source stream.echo6.co.
Sets text_dir, page_count, and organized_at = CURRENT_TIMESTAMP immediately — transcripts are filed-in-place (their "location" is the PeerTube watch URL, set later as the catalogue path via Phase 5a).

`text_processor.py`

Raw .txt files dropped via manual upload. Two-source metadata vote (filename + meta.json). Similar flow to transcript processor but no fixed category or source.

7. Enrichment & Embedding

Both are source-agnostic stage loops that just poll documents by status and do their work. They live in lib/enricher.py and lib/embedder.py, wrapped by stage_loop(stage, ...) in recon.py.

Enrichment (`enrich_workers: 16` threads per batch)

Polls status = 'extracted' AND retries < max_retries.
Sets enriching, reads processing/{hash}/page_NNNN.txt.
Windows pages (enrich_window_size: 5 per window) and sends each window to Gemini with a structured prompt.
Stores concepts/{hash}/window_N.json per window.
Backoff: enrich_base_delay=5s, doubling up to enrich_max_delay=120s, max enrich_max_retries=5.
On success: update_status(hash, 'enriched').

Embedding (`embed_workers: 4`)

Polls status = 'enriched'.
Reads concept JSONs, builds page-level chunks.
Dense: POST to TEI at cortex:8090 (bge-m3, 1024-d). Batches of 128 per TEI request. Throughput ~1,711 emb/sec.
Sparse: POST to the sparse service at cortex:8091 (bge-m3 sparse mode; sparse_embedding.enabled: true).
Upserts into Qdrant cortex:6333, collection recon_knowledge_hybrid, batch size embed_batch_size=500 vectors per upsert.
Payload carries: hash, filename, original_filename, download_url, page, text, title, domain, subdomain, category.
Ollama is a fallback backend (much slower, ~8 emb/sec) via embedding.backend: ollama.
On success: update_status(hash, 'complete').

8. Filing (`lib/filing.py`)

One daemon thread, filing_worker_loop(interval=30). It polls:

SELECT hash FROM documents
 WHERE status = 'complete'
   AND organized_at IS NULL
   AND path LIKE '/opt/recon/data/processing/%'
 LIMIT 50

The path LIKE '/opt/recon/data/processing/%' filter naturally excludes transcripts — their documents.path was never a filesystem path but the PeerTube watch URL.

For each row, file_processed_item(doc_hash, source_file_path, db, config) does:

determine_dominant_domain(hash) reads concept JSONs, returns the top-voted Domain/Subdomain.
_build_target_path(...) derives the canonical name starting at level 1 (Title), escalating to level 2/3/4 only if a collision exists in the target folder. Preserves source file's actual extension (not hardcoded to .pdf).
shutil.move(source, target) atomically. Target is /mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>.
Updates:
- catalogue.path → new target
- catalogue.filename → new canonical name
- documents.path → new target
- Qdrant payload via update_qdrant_payload(...): download_url = generate_download_url(new_path, ...), filename, original_filename set on every point for that hash.
db.mark_organized(hash) sets organized_at + cleans up processing/{hash}/.

Download URL helper (`lib/utils.py:generate_download_url`)

If the path is already http:// or https:// (transcripts), return it unchanged.
Otherwise strip library_root prefix and prepend book_server.base_url (→ https://files.echo6.co/<rel>).

9. StatusDB (`lib/status.py`)

SQLite (data/recon.db) in WAL mode with thread-local connections (_get_conn() uses threading.local).

Tables

Table	Purpose
`catalogue`	Canonical record keyed by `hash` — title, filename, path, source, category, size
`documents`	Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps
`intel`	ARGUS intel feed entries (separate pipeline)
`metrics_snapshots`	Time-series rollups for the dashboard
`file_operations`	Audit log of Phase-5-style file moves and renames
`duplicate_review`	Level-4 dedupe quarantine queue

Key methods

add_to_catalogue(hash, title, url, size, source, category)
queue_document(hash) — insert into documents with status=queued
update_status(hash, status, **kwargs) — single point of status truth
mark_organized(hash) — sets organized_at, final transition
sync_document_path(hash, new_path) + update_catalogue_path(...) — used by filing worker and Phase 5a un-file
get_path_updates / clear_path_update — small change queue for backfills

Connection safety

All writers take a short-lived connection via _get_conn(). WAL mode allows concurrent readers; writes are serialized at the SQLite level. No explicit BEGIN — rely on autocommit semantics with occasional conn.commit() after grouped updates.

10. Configuration (`config.yaml`)

Lives at /opt/recon/config.yaml. Secrets (GEMINI_KEYS, PEERTUBE_TOKEN, etc.) live in /opt/recon/.env — never in config.yaml, never in git.

Top-level keys

Key	Meaning
`library_root`	`/mnt/library` — LXC bind-mount root (data host `/mnt/data/library`, local SSD)
`processing`	Worker counts, window sizes, timeouts, retry policy
`embedding`	TEI host/port, model (`bge-m3`), 1024-d dense
`sparse_embedding`	Separate service on cortex:8091
`vector_db`	Qdrant host, port, collection name
`gemini`	Model (`gemini-2.0-flash`), JSON response mode
`web`	Dashboard bind host + port (8420)
`paths`	`base`, `data`, `text`, `concepts`, `intel`, `logs`, `db`
`book_server`	`base_url`, `strip_prefix` for download URL generation
`upload_paths`	Category → filesystem path for upload routing
`service`	`scan_interval`, `stage_poll_interval`, `progress_interval`
`peertube`	`api_base`, `public_url`, `rate_limit_delay`, `poll_interval`
`pipeline`	`acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds`
`crawler` / `web_scraper`	Currently disabled (`sites: []`) pending re-architecture
`new_pipeline`	Stream-B (old) pipeline, `enabled: false`

11. Service & Threads (`recon.py cmd_service`)

systemctl start recon → python3 recon.py service. The service runs seven daemon threads plus a metrics collector:

Thread	Function	Interval
`dispatcher`	`dispatcher.dispatch_loop`	30s
`enrich`	`stage_loop('enrich', ...)`	30s idle
`embed`	`stage_loop('embed', ...)`	30s idle
`filing`	`filing.filing_worker_loop`	30s
`peertube-acq`	`acquisition.peertube.acquisition_loop`	1800s
`progress`	Log status rollup line	60s
`dashboard`	`api.run_server` (Flask)	bound

Plus peertube_collector.start_collector for metrics scrape.

All threads receive a shared stop_event (threading.Event) and exit cleanly on SIGTERM via signal.signal(SIGTERM, lambda *_: stop_event.set()).

CLI commands (`recon.py` top-level)

scan, queue, extract, enrich, embed, run, status, catalogue, failures, search, upload, ingest-url, ingest, ingest-peertube, validate, rebuild, serve, service, organize, pipeline.

Most commands are thin wrappers around library functions — useful for one-off maintenance from the CT 130 shell.

12. Dashboard & API (`lib/api.py`)

Flask app bound to 0.0.0.0:8420. Pages are server-rendered Jinja templates; data is pulled via AJAX from /api/* endpoints.

Page routes

/, /search, /catalogue, /upload, /web-ingest, /failures, /peertube, /peertube/channels, /settings/{keys,cookies,vpn,health}.

API surface (grouped)

Group	Endpoints
Upload	`POST /api/upload`, `GET /api/upload/<hash>/status`, `GET /api/upload/categories`
Ingest	`POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl/<id>/status`, `GET /api/ingest-peertube/<job>/status`
Search	`POST /api/search`
Status	`GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health`
Retry	`POST /api/retry/<hash>`, `/api/retry-all`
Service	`POST /api/service/restart`
Keys	Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload`
Cookies	`GET /api/cookies/status`, `POST /api/cookies/upload`
VPN	`GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}`
PeerTube	`/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}`, `/api/peertube/stats`
Metrics	`GET /api/metrics/history`

Qdrant scroll

_qdrant_scroll(host, port, collection, req) is the shared paged-read helper for rebuilding the knowledge-stats panel.

Cache warmer

start_cache_warmer(stop_event) pre-computes the expensive quick-stats and knowledge-stats panels so the dashboard loads instantly.

13. Filesystem Layout

/opt/recon/
├── recon.py                    # CLI + service entry point
├── config.yaml
├── .env                        # secrets (GEMINI_KEYS etc.)
├── PROJECT-BIBLE.md            # this file (copy on CT 130)
├── backups/                    # local DB backups
├── data/
│   ├── acquired/               # hopper — {hash}.ext + {hash}.meta.json
│   │   ├── pdf/
│   │   ├── stream/             # PeerTube transcripts
│   │   ├── html/               # (future)
│   │   └── text/
│   ├── processing/{hash}/      # in-flight scratch
│   │   ├── page_NNNN.txt
│   │   ├── meta.json
│   │   └── (original file or transcript.txt)
│   ├── concepts/{hash}/
│   │   └── window_N.json       # Gemini enrichment output
│   ├── intel/                  # ARGUS intel feeds
│   ├── _duplicates/            # level-4 name-match quarantine
│   ├── _rejected/              # oversize / unreadable PDFs
│   └── recon.db                # SQLite WAL mode
├── lib/
│   ├── acquisition/peertube.py
│   ├── processors/{pdf,transcript,text}_processor.py
│   ├── dispatcher.py
│   ├── filing.py
│   ├── enricher.py
│   ├── embedder.py
│   ├── status.py               # StatusDB class
│   ├── api.py                  # Flask dashboard + API
│   ├── new_pipeline.py         # update_qdrant_payload helper lives here
│   ├── utils.py                # content_hash, generate_download_url, get_config, setup_logging
│   ├── peertube_scraper.py     # PeerTube API client
│   └── organizer.py            # determine_dominant_domain, level 1-4 naming
└── logs/

/mnt/library/                   # LXC bind-mount from data host /mnt/data/library (local SSD), read-write
├── <Domain>/<Subdomain>/<canonical_name>.<ext>
└── _acquired/ _review/ _staging/ signal-archive/  # not touched by pipeline

14. Refactor History (2026-04)

The refactor is tracked as dated phases under phases/. Status implementations are in the RECON repo; design lives here.

Phase	Focus
0	Baseline capture — DB dumps, directory listings, config pin
1	Scaffolding — create `acquired/`, `processing/`, config keys
2	Shared filing function — extract organizer logic into `filing.py`
3	Transcript processor — first end-to-end test of the new pattern
4	PDF processor — layered A/B/C metadata vote, level-4 dedupe
5a	Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort)
5b	Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned
5c-1	Service loop rewire — retire old scan_library thread, wire dispatcher in
5c-2	Service start & transcript drain — clear the hopper backlog
6a	Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts
6b	Dashboard "Untitled / WEB" bug fix — recently-completed table query
6c	Code cleanup — dead-code audit
6d	PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py`
6e	ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted)
6f	Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs
6f-2	Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()`
6g	Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting
6h	STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate`
6i	Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap
6j	Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged
6k	Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review

Baseline pre-refactor (per `current-state.md`)

18,855 transcripts in /mnt/library/_sources/streamecho6/.
Old stream-B new_pipeline ran off /mnt/library/_acquired/.
scan_library() polled the NFS mount for new PDFs — now deprecated.
Storage migration note: /mnt/library was historically an NFS mount from pi-nas:/export/library, which is what current-state.md and scan_library() were written against. The library has since been migrated to local SSD on the data Proxmox host (/mnt/data/library) and surfaced into CT 130 via an LXC bind-mount. The pi-nas copy was wiped on 2026-04-15. Path strings inside the codebase didn't change; only the underlying storage did.

15. Operational Runbook

Service control (on CT 130 as zvx)

sudo systemctl {status,start,stop,restart} recon
journalctl -u recon -f
tail -f /opt/recon/logs/recon.log

Backups

# Local DB backup before risky operations
cp /opt/recon/data/recon.db /tmp/recon.db.bak.$(date +%s)
# Offsite backup: planned, not yet configured (TBD — likely rsync to
# pi-nas:/export/recon-backup once a backup target is provisioned).

Inspect pipeline state at a glance

ls /opt/recon/data/acquired/*/    # hopper contents
ls /opt/recon/data/processing/ | wc -l   # in-flight count
sqlite3 /opt/recon/data/recon.db \
  "SELECT status, COUNT(*) FROM documents GROUP BY status;"

Re-queue a failed document

sqlite3 /opt/recon/data/recon.db \
  "UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
# or via API:
curl -X POST https://recon.echo6.co/api/retry/<hash>

Manual ingest

# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
sha=$(sha256sum foo.pdf | cut -d' ' -f1)
cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf

Qdrant health

curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
  | jq '.result | {status, points_count, optimizer_status}'
# status "grey" with optimizer_status.ok=true is healthy (background indexing).

16. Known Gotchas

Logger setup. RECON modules must use setup_logging('recon.<name>') from lib.utils, never raw logging.getLogger(). The root logger has no handlers; calls to a raw logger silently disappear.
Qdrant status "grey" is healthy if optimizer_status.ok == true. Only treat red + not-ok as a real failure.
Catalogue row count can grow during long-running jobs because parallel ingestion may add rows. Only a decrease is a real integrity failure.
Dispatcher .tmp safety. CONTENT_EXTENSIONS does not include .tmp, so active acquisition writes are invisible to the dispatcher until the atomic rename lands.
Transcripts are filed in-place. Their documents.path is a URL and filing worker's path LIKE '/opt/recon/data/processing/%' filter excludes them.
PeerTube 429. Respect peertube.rate_limit_delay between caption API calls or you'll get throttled.
Library is an LXC bind-mount, not NFS. /mnt/library on CT 130 is bound from the data Proxmox host's /mnt/data/library (local ext4 on /dev/sda1). File ownership/UID-GID is shared with the host — writes from inside the container appear with the container UID on the host. No NFS, no root_squash, no network in the path.
SSH heredocs with Python code break. When editing remote files, write to a temp file via scp or cat > file rather than bash heredocs with parens/quotes.
The crawler is off. crawler.sites: []. Re-enabling requires a re-architecture for the new pipeline.

17. Credentials & Hosts

Host	Role	Access
CT 130 (192.168.1.130 / 100.64.0.24)	RECON service	`ssh zvx@192.168.1.130` (key auth)
data host (192.168.1.240)	Proxmox node hosting CT 130; `/mnt/data/library` source for the CT 130 bind-mount	`ssh root@192.168.1.240`
cortex VM (192.168.1.150)	Qdrant, TEI, sparse svc, Ollama	`ssh zvx@cortex`
CT 110 (192.168.1.170)	PeerTube `stream.echo6.co`	`ssh zvx@192.168.1.170`
pi-nas (192.168.1.245)	Backup target (planned; not yet configured). ~22T pool with ~300G free after library wipe.	`ssh zvx@pi-nas`
CT 101 (192.168.1.101)	Caddy reverse proxy (home)	`ssh root@192.168.1.241 'pct exec 101'`

Secrets: /home/zvx/projects/.ref/credentials on TOC (this machine). RECON Gemini/PeerTube keys: /opt/recon/.env on CT 130.

18. Open Follow-ups

82 MULTI_MATCH + 141 UNMATCHED transcript rows still carry library paths post Phase 5a/6k (audit trail at /tmp/phase5a_remaining.txt on CT 130 — file still present). Either hand-resolve or tombstone.
HTML processor (lib/processors/html_processor.py) is scaffolded in config but not implemented. Next-up for Kiwix / web ingest.
Crawler re-architecture. The tier-1 sites list in config.yaml is a valuable target list but the old crawler is off pending a new acquisition-module-shaped implementation.
ARGUS intel pipeline shares the DB but its lifecycle is documented separately — not covered here.
Phase 6e-2 (PeerTube channel sync endpoint) was reverted and needs a redesign before reinstating.
Level-4 dedupe review queue (duplicate_review table) has no UI yet; items pile up silently.
9,478 legacy dirs in /opt/recon/data/text/ — historical extraction output from the pre-refactor pipeline, for documents still in catalogue. Not touched by current pipeline. Can be cleaned up once confirmed none are the sole text copy for any document.
lib/new_pipeline.py is misleadingly named — it's actually a library management CLI tool, not the refactor's new pipeline. Contains update_qdrant_payload helper that filing worker depends on. Should be renamed (e.g., library_ops.py) when there's time.
SSH key for CT 130 forge access — currently uses HTTPS with embedded token in remote URL. Move to SSH key auth.
Backup policy for derived data — /opt/recon/data/concepts/ and Qdrant snapshots are not in any backup rotation. If CT 130 or cortex lose their disks, these are the hardest to regenerate (Gemini calls
- embedding compute).
Backup architecture — no offsite backup is currently configured. Section 15 references a planned rsync-to-pi-nas job, but neither the script nor the systemd timer (recon-backup.timer) exist. Decide what gets backed up (recon.db, concepts/, text/, Qdrant snapshots, /mnt/library?), where, and on what cadence; pi-nas has ~300G free in /export/ after the 2026-04-15 library wipe and could be the target for a first pass.
signal-archive/ in /mnt/library/ — 44 Signal/Matrix chat log files, not library content. Matt intends these to "eventually contribute" to the knowledge base but no ingestion path exists yet.

Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.

33 KiB Raw Blame History