RECON — Project Bible
Canonical architectural reference for RECON (the knowledge extraction
pipeline running on CT 130 / data.echo6 / 100.64.0.24). This document
is the orientation dossier for any future session. It is skim-and-find,
not a tutorial.
- Repo: ssh://git@forge.echo6.co:2222/matt/refactored-recon.git (design)
- Code: /opt/recon/ on CT 130 (zvx owns the tree; service runs as zvx)
- Service: systemctl status recon
- Dashboard: https://recon.echo6.co (zvx-only via Authentik)
- Files server: https://files.echo6.co (Authentik forward auth)
1. Mission
RECON ingests documents from multiple sources (manual PDF uploads, PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and produces a searchable, domain-organized library plus a hybrid dense/sparse vector index in Qdrant on cortex.
Every piece of content ends up in two places:
- A file under /mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext> (PDFs, HTML) or at a source URL like https://stream.echo6.co/w/<uuid> (PeerTube transcripts; no local copy after Phase 5a).
- Page-level embeddings in Qdrant collection recon_knowledge_hybrid (dense bge-m3 + sparse SPLADE-style vectors, 1024-dim dense).
Search returns page-grounded citations back to the file or stream URL.
2. System Topology
                          ┌─────────────────────────┐
                          │ CT 130 (recon)          │
 Library (LXC bind)       │ /opt/recon/             │      ┌─────────────────────────┐
 data /mnt/data/library   │ ├─ data/                │      │ Qdrant                  │
   → /mnt/library/        │ │  ├─ acquired/         │      │ cortex:6333             │
 (read/write, local SSD)  │ │  ├─ processing/       │  ←→  │ recon_knowledge_hybrid  │
                          │ │  ├─ concepts/         │      │ (1024-d dense + sparse) │
                          │ │  └─ recon.db          │      └─────────────────────────┘
                          │ ├─ lib/                 │
                          │ ├─ recon.py             │      ┌──────────────┐
                          │ └─ config.yaml          │  ←→  │ TEI          │
                          │ recon.service           │      │ cortex:8090  │
                          │ nginx :8888 (files)     │      │ bge-m3 dense │
                          └─────────────────────────┘      └──────────────┘
                                       ▲                   ┌──────────────┐
                                       │                   │ Sparse svc   │
               ┌───────────────────────┴─────┐         ←→  │ cortex:8091  │
               │                             │             │ bge-m3 sparse│
 PeerTube (CT 110 / stream.echo6.co)    Gemini API         └──────────────┘
 api_base: http://192.168.1.170         (enrichment,
                                         vision OCR)
Shared caddy reverse proxy (CT 101) surfaces the dashboard (8420) and
nginx file server (8888) as recon.echo6.co and files.echo6.co.
3. Pipeline Lifecycle
Every document follows the same five-stage arc regardless of source type. The filesystem location at any given moment tells you which stage the item is in — state is a directory.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. ACQUIRE │ → │ 2. DISPATCH │ → │ 3. PROCESS │ → │ 4. ENRICH / │ → │ 5. FILE │
│ │ │ (pre_flight) │ │ │ │ EMBED │ │ │
│ data/acquired│ │ dispatcher.py│ │ per-type │ │ shared │ │ shared │
│ /<type>/ │ │ watches │ │ processor │ │ stage loops │ │ filing worker│
│ <hash>.{ext} │ │ subfolders, │ │ moves file │ │ bge-m3 → │ │ moves file │
│ <hash>.meta │ │ hands to │ │ to processing│ │ Qdrant │ │ processing → │
│ │ │ processor │ │ /{hash}/ │ │ │ │ library, │
│ │ │ │ │ │ │ │ │ updates DB + │
│ │ │ │ │ │ │ │ │ Qdrant │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Status column on documents:
catalogued → queued → extracting → extracted → enriching → enriched → embedding → complete (plus terminal error, content_failure,
duplicate states).
organized_at IS NULL while in flight, set to CURRENT_TIMESTAMP after
filing. Transcripts are marked organized in-place during pre_flight
(they have no filesystem target — the watch URL is their "home").
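The status progression above can be sketched as a simple lookup. This is an illustration only: PIPELINE_ORDER, TERMINAL_STATES, and next_status are hypothetical names, not identifiers from lib/status.py, and the sketch assumes every in-flight status advances strictly in the listed order.

```python
# Happy-path order of documents.status, as listed in this section.
PIPELINE_ORDER = [
    "catalogued", "queued", "extracting", "extracted",
    "enriching", "enriched", "embedding", "complete",
]

# Terminal states that sit outside the happy path.
TERMINAL_STATES = {"error", "content_failure", "duplicate"}


def next_status(status):
    """Return the status that follows `status`, or None if it is final."""
    if status in TERMINAL_STATES or status == PIPELINE_ORDER[-1]:
        return None
    return PIPELINE_ORDER[PIPELINE_ORDER.index(status) + 1]
```

For example, next_status("extracted") gives "enriching", which is the transition the enrichment loop performs when it claims a document.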
4. Acquisition Layer (lib/acquisition/)
Acquisition modules fetch content from external sources and drop
{hash}.<ext> + {hash}.meta.json flat file pairs into
data/acquired/<type>/. They do not touch the database — that's the
processor's job.
Atomic drop protocol
- Write content to <hash>.<ext>.tmp (unknown extension, safe from dispatcher).
- Compute hash; rename tmp to final <hash>.<ext>.
- Write <hash>.meta.json.tmp, then rename to <hash>.meta.json.
- Meta goes final first, content goes final last. Dispatcher only picks up when content file exists and is stable, so a half-visible pair without meta never gets dispatched.
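A minimal sketch of the drop, assuming the ordering rule stated in the last bullet (meta is renamed to its final name before content). The function name atomic_drop and its parameters are hypothetical; the real acquisition modules may structure this differently.

```python
import hashlib
import json
import os
from pathlib import Path


def atomic_drop(dest_dir, content, ext, meta):
    """Drop a <hash><ext> + <hash>.meta.json pair into dest_dir.

    `content` is bytes, `ext` includes the dot (e.g. ".txt"),
    `meta` is a JSON-serialisable dict. Returns the SHA-256 hex digest.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()

    content_final = dest / f"{digest}{ext}"
    content_tmp = dest / f"{digest}{ext}.tmp"
    meta_final = dest / f"{digest}.meta.json"
    meta_tmp = dest / f"{digest}.meta.json.tmp"

    # Both files are written under .tmp names the dispatcher ignores.
    content_tmp.write_bytes(content)
    meta_tmp.write_text(json.dumps(meta), encoding="utf-8")

    # os.replace is an atomic rename within one filesystem.
    os.replace(meta_tmp, meta_final)        # meta goes final first
    os.replace(content_tmp, content_final)  # content goes final last
    return digest
```

Because the content file is the last thing to appear under a recognised extension, a reader that keys off the content file never sees it without its sidecar.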
PeerTube acquisition (lib/acquisition/peertube.py)
- Daemon loop acquisition_loop(stop_event, db, config, interval=1800).
- Queries catalogue for source='stream.echo6.co' rows, builds sets of known UUIDs (/w/<uuid> extracted from path) and known titles (from filename); both cohorts are checked so Phase 5b-rewritten rows and pre-5b library-path rows dedupe correctly.
- Lists PeerTube videos via peertube_scraper.get_videos, filters to those with captions, prefers English caption.
- For each new one: fetches VTT, converts to text with vtt_to_text, atomically drops pair into data/acquired/stream/.
- Rate limits at peertube.rate_limit_delay (default 0.5s); PeerTube returns 429 if captions are fetched too fast.
Manual uploads / URL ingest
api.py exposes /api/upload, /api/ingest-url, /api/ingest-urls,
/api/ingest-peertube — all end by dropping a pair into acquired/<type>/.
5. Dispatcher (lib/dispatcher.py)
The dispatcher is one daemon thread (dispatch_loop, interval=30s). It
watches each configured subfolder under data/acquired/ and hands
stable file pairs to the registered processor.
Config-driven dispatch table
pipeline:
  acquired_root: /opt/recon/data/acquired
  processing_root: /opt/recon/data/processing
  dispatch:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor   # not yet implemented
    text: text_processor
  mtime_stability_seconds: 10
Extension constants
- CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}: the dispatcher considers a file "content" only if its extension is in this set. .tmp is not in the set, so partial writes are safe.
- CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}: these are normalized to PDF before dispatch.
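A short sketch of how the two sets partition filenames in a hopper subfolder. The classify helper and its return labels are hypothetical, introduced only to illustrate the rule; the dispatcher's actual control flow is not shown in this document.

```python
from pathlib import Path

CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}
CONVERTIBLE_EXTENSIONS = {".epub", ".mobi", ".doc", ".docx"}


def classify(filename):
    """Label a hopper filename as meta, content, convertible, or ignored."""
    name = filename.lower()
    if name.endswith(".meta.json"):
        return "meta"
    ext = Path(name).suffix  # only the last suffix, so "x.pdf.tmp" -> ".tmp"
    if ext in CONTENT_EXTENSIONS:
        return "content"
    if ext in CONVERTIBLE_EXTENSIONS:
        return "convertible"
    return "ignored"
```

Since only the final suffix is inspected, an in-progress write such as abc.pdf.tmp is classified as ignored until it is renamed.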
Normalization step
_normalize_formats(subfolder_path):
- .epub/.mobi → PDF via ebook-convert (Calibre CLI).
- .doc/.docx → PDF via libreoffice --headless.
- Sidecar .meta.json is renamed to match the new PDF hash so pairing holds.
Pair finding
_find_pairs(subfolder_path) returns tuples of (content_path,
meta_path_or_None). Pairs where only content exists are still valid —
meta is not required. A meta without its content is ignored.
Stability check
_is_stable(filepath, stability_seconds) — mtime must be at least
mtime_stability_seconds old (default 10s) before dispatch. Prevents
racing active writers.
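Pair finding and the stability check can be sketched together as follows. This is an approximation under stated assumptions: find_pairs, is_stable, and ready_pairs are stand-ins for the private helpers named above, the `now` parameter is added only to make the check testable, and the sidecar is assumed to share the content file's stem.

```python
import os
import time
from pathlib import Path

CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}


def find_pairs(subfolder):
    """Return (content_path, meta_path_or_None) for each content file."""
    pairs = []
    for path in sorted(Path(subfolder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in CONTENT_EXTENSIONS:
            continue  # skips .tmp files, .meta.json sidecars, unknown types
        meta = path.with_name(path.stem + ".meta.json")
        pairs.append((path, meta if meta.is_file() else None))
    return pairs


def is_stable(path, stability_seconds=10.0, now=None):
    """True once the file's mtime is at least stability_seconds old."""
    now = time.time() if now is None else now
    return (now - os.stat(path).st_mtime) >= stability_seconds


def ready_pairs(subfolder, stability_seconds=10.0, now=None):
    """Pairs whose content file has stopped changing."""
    return [
        (content, meta)
        for content, meta in find_pairs(subfolder)
        if is_stable(content, stability_seconds, now)
    ]
```

A meta file with no matching content never appears in the result, and a content file with no meta is returned with None in the second slot, matching the rules described above.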
6. Processors (lib/processors/)
Each processor implements one function: pre_flight(content_path, meta_path, db, config) → dict. It owns all the type-specific logic and
all the database writes for that item up to status=extracted.
Common pre_flight contract
Every processor does, in order:
- Hash content (SHA-256 via content_hash() in lib/utils.py).
- Stale state cleanup: rm -rf processing/{hash}/ and concepts/{hash}/ if they exist (guards against re-runs).
- Hash dedupe: if hash already exists in catalogue, delete the pair, return action duplicate.
- Type-specific metadata extraction + level-4 dedupe check (PDF only).
- Move content + meta into processing/{hash}/ with a type-specific layout.
- db.add_to_catalogue, db.queue_document, set documents.text_dir and page_count, db.update_status(hash, 'extracted', ...).
Return dict keys: hash, action, source_path, error. Actions are
one of: extracted, duplicate, level4_duplicate (PDF only), content_failure,
error, or skip_empty (text only).
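The hashing step can be sketched as a streamed SHA-256 over the file. The signature below (a path plus an optional chunk size) is an assumption; the actual content_hash() in lib/utils.py may differ.

```python
import hashlib


def content_hash(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in 1 MiB chunks.

    Streaming keeps memory flat even for very large PDFs.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest doubles as the item's identity everywhere else in the pipeline: the hopper filename, the processing/{hash}/ directory, and the catalogue key.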
pdf_processor.py
The heaviest processor. Layered metadata extraction:
- Source A, PDF dict: PdfReader(...).metadata, mapped to {title, author, edition, year}.
- Source B, Filename: regex parse the original filename.
- Source C, Gemini Vision OCR: run on the first 3 pages when A+B disagree or are missing. Returns structured JSON via Gemini's response_mime_type: application/json.
- Voting: _vote_metadata(A, B, C) reconciles the three sources; 2-of-3 wins; ties prefer Source A.
- Level-4 dedupe: if all four fields (title, edition, author, year) are present and match an existing catalogue row with a different hash, the PDF is quarantined to _duplicates/ for human review.
- Size cap: processing.max_pdf_size_mb (default 2000MB). Oversize PDFs move to _rejected/.
- Text extraction order: PyPDF2 → pdftotext (poppler) → Tesseract OCR → Gemini Vision on a per-page basis. Output: processing/{hash}/page_NNNN.txt.
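The voting rule (2-of-3 wins, ties prefer Source A) can be illustrated per field. This sketch assumes voting is done field by field on exact string equality and that missing values are None or empty; vote_field and vote_metadata are illustrative names, and the real _vote_metadata may normalise values before comparing.

```python
from collections import Counter

FIELDS = ("title", "author", "edition", "year")


def vote_field(a, b, c):
    """Pick one value from sources A, B, C (in that priority order)."""
    candidates = [v for v in (a, b, c) if v not in (None, "")]
    if not candidates:
        return None
    counts = Counter(candidates)
    top = max(counts.values())
    # The first candidate in A, B, C order with the top count wins,
    # so a three-way disagreement resolves to Source A.
    return next(v for v in candidates if counts[v] == top)


def vote_metadata(a, b, c):
    """Reconcile three metadata dicts into one, field by field."""
    return {f: vote_field(a.get(f), b.get(f), c.get(f)) for f in FIELDS}
```

With this rule, a title that filename and OCR agree on overrides a different title in the PDF dict, while a field on which all three disagree falls back to the PDF dict value.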
transcript_processor.py
Lightweight. The VTT→text conversion already happened in acquisition, so pre_flight just:
- Hashes the <hash>.txt file.
- Reads meta.json sidecar.
- chunk_text(raw_text, WORDS_PER_PAGE=2000) splits into page_NNNN.txt files.
- Writes the transcript as processing/{hash}/transcript.txt plus page chunks.
- Registers with category Transcript, source stream.echo6.co.
- Sets text_dir, page_count, and organized_at = CURRENT_TIMESTAMP immediately; transcripts are filed-in-place (their "location" is the PeerTube watch URL, set later as the catalogue path via Phase 5a).
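A rough sketch of the page-chunking step, assuming whitespace tokenisation and 1-based page numbering. Both are assumptions: the real chunk_text may split differently, and page_filename is a hypothetical helper.

```python
WORDS_PER_PAGE = 2000


def chunk_text(raw_text, words_per_page=WORDS_PER_PAGE):
    """Split text into pseudo-pages of at most words_per_page words."""
    words = raw_text.split()
    return [
        " ".join(words[start:start + words_per_page])
        for start in range(0, len(words), words_per_page)
    ]


def page_filename(page_number):
    """Zero-padded page file name, e.g. page_0001.txt."""
    return f"page_{page_number:04d}.txt"
```

A 4,500-word transcript would produce three pages under these assumptions: two full pages of 2,000 words and a final page of 500.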
text_processor.py
Raw .txt files dropped via manual upload. Two-source metadata vote
(filename + meta.json). Similar flow to transcript processor but no
fixed category or source.
7. Enrichment & Embedding
Both are source-agnostic stage loops that just poll documents by
status and do their work. They live in lib/enricher.py and
lib/embedder.py, wrapped by stage_loop(stage, ...) in recon.py.
Enrichment (enrich_workers: 16 threads per batch)
- Polls status = 'extracted' AND retries < max_retries.
- Sets enriching, reads processing/{hash}/page_NNNN.txt.
- Windows pages (enrich_window_size: 5 per window) and sends each window to Gemini with a structured prompt.
- Stores concepts/{hash}/window_N.json per window.
- Backoff: enrich_base_delay=5s, doubling up to enrich_max_delay=120s, max enrich_max_retries=5.
- On success: update_status(hash, 'enriched').
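The windowing and backoff parameters above work out as follows. This is a sketch of the arithmetic only; window_pages and backoff_delays are hypothetical helpers, and it assumes non-overlapping windows and a delay that doubles on each retry until capped.

```python
def window_pages(pages, window_size=5):
    """Group pages into consecutive, non-overlapping windows."""
    return [
        pages[start:start + window_size]
        for start in range(0, len(pages), window_size)
    ]


def backoff_delays(base=5.0, max_delay=120.0, max_retries=5):
    """Delay in seconds before each retry: base doubled, capped at max_delay."""
    return [min(base * (2 ** attempt), max_delay) for attempt in range(max_retries)]
```

With the defaults, the five retries wait 5, 10, 20, 40, and 80 seconds, so the 120-second cap only comes into play if enrich_max_retries is raised above 5.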
Embedding (embed_workers: 4)
- Polls status = 'enriched'.
- Reads concept JSONs, builds page-level chunks.
- Dense: POST to TEI at cortex:8090 (bge-m3, 1024-d). Batches of 128 per TEI request. Throughput ~1,711 emb/sec.
- Sparse: POST to the sparse service at cortex:8091 (bge-m3 sparse mode; sparse_embedding.enabled: true).
- Upserts into Qdrant cortex:6333, collection recon_knowledge_hybrid, batch size embed_batch_size=500 vectors per upsert.
- Payload carries: hash, filename, original_filename, download_url, page, text, title, domain, subdomain, category.
- Ollama is a fallback backend (much slower, ~8 emb/sec) via embedding.backend: ollama.
- On success: update_status(hash, 'complete').
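The two batch sizes interact as sketched below. The helper names are hypothetical, and the sketch assumes one dense request per 128 chunks and one Qdrant upsert per 500 points for a single document; how lib/embedder.py actually interleaves them is not described here.

```python
import math


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_embedding_batches(num_chunks, tei_batch=128, upsert_batch=500):
    """Count the TEI requests and Qdrant upserts needed for num_chunks."""
    return {
        "tei_requests": math.ceil(num_chunks / tei_batch),
        "qdrant_upserts": math.ceil(num_chunks / upsert_batch),
    }
```

A document with 1,300 page-level chunks would need 11 dense requests and 3 upserts under these assumptions.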
8. Filing (lib/filing.py)
One daemon thread, filing_worker_loop(interval=30). It polls:
SELECT hash FROM documents
WHERE status = 'complete'
AND organized_at IS NULL
AND path LIKE '/opt/recon/data/processing/%'
LIMIT 50
The path LIKE '/opt/recon/data/processing/%' filter naturally
excludes transcripts — their documents.path was never a
filesystem path but the PeerTube watch URL.
For each row, file_processed_item(doc_hash, source_file_path, db, config) does:
- determine_dominant_domain(hash) reads concept JSONs, returns the top-voted Domain/Subdomain.
- _build_target_path(...) derives the canonical name starting at level 1 (Title), escalating to level 2/3/4 only if a collision exists in the target folder. Preserves source file's actual extension (not hardcoded to .pdf).
- shutil.move(source, target). Target is /mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>. The move is atomic only when source and target share a filesystem; across a mount boundary shutil.move falls back to copy-then-delete.
- Updates:
  - catalogue.path → new target
  - catalogue.filename → new canonical name
  - documents.path → new target
  - Qdrant payload via update_qdrant_payload(...): download_url = generate_download_url(new_path, ...), filename, original_filename set on every point for that hash.
- db.mark_organized(hash) sets organized_at + cleans up processing/{hash}/.
Download URL helper (lib/utils.py:generate_download_url)
- If the path is already http:// or https:// (transcripts), return it unchanged.
- Otherwise strip library_root prefix and prepend book_server.base_url (→ https://files.echo6.co/<rel>).
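The two rules above can be sketched as a small function. The name sketch_download_url, its default arguments, and the percent-encoding of the relative path are assumptions for illustration; the real generate_download_url takes additional arguments not shown in this document.

```python
from urllib.parse import quote


def sketch_download_url(path,
                        library_root="/mnt/library",
                        base_url="https://files.echo6.co"):
    """Map a catalogue path to a download URL."""
    # Transcripts already carry their PeerTube watch URL.
    if path.startswith(("http://", "https://")):
        return path
    root = library_root.rstrip("/") + "/"
    rel = path[len(root):] if path.startswith(root) else path.lstrip("/")
    # quote() leaves "/" intact and escapes spaces and other unsafe bytes.
    return base_url.rstrip("/") + "/" + quote(rel)
```

Because URLs pass through unchanged, the same helper can be applied to every row when refreshing Qdrant payloads, regardless of source type.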
9. StatusDB (lib/status.py)
SQLite (data/recon.db) in WAL mode with thread-local connections
(_get_conn() uses threading.local).
Tables
| Table | Purpose |
|---|---|
| catalogue | Canonical record keyed by hash: title, filename, path, source, category, size |
| documents | Pipeline state machine: status, path, text_dir, page_count, retries, organized_at, timestamps |
| intel | ARGUS intel feed entries (separate pipeline) |
| metrics_snapshots | Time-series rollups for the dashboard |
| file_operations | Audit log of Phase-5-style file moves and renames |
| duplicate_review | Level-4 dedupe quarantine queue |
Key methods
- add_to_catalogue(hash, title, url, size, source, category)
- queue_document(hash): insert into documents with status=queued
- update_status(hash, status, **kwargs): single point of status truth
- mark_organized(hash): sets organized_at, final transition
- sync_document_path(hash, new_path) + update_catalogue_path(...): used by filing worker and Phase 5a un-file
- get_path_updates / clear_path_update: small change queue for backfills
Connection safety
All writers take a short-lived connection via _get_conn(). WAL mode
allows concurrent readers; writes are serialized at the SQLite level.
No explicit BEGIN — rely on autocommit semantics with occasional
conn.commit() after grouped updates.
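A minimal sketch of the thread-local connection pattern in WAL mode. StatusDBSketch is a hypothetical stand-in for the real StatusDB class; only the _get_conn name and the use of threading.local come from the description above.

```python
import sqlite3
import threading


class StatusDBSketch:
    """One SQLite connection per thread, all pointing at the same file."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._local = threading.local()

    def _get_conn(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self.db_path)
            # WAL lets readers proceed while a single writer holds the lock.
            conn.execute("PRAGMA journal_mode=WAL")
            self._local.conn = conn
        return conn
```

Each daemon thread gets its own connection on first use, which avoids sharing a sqlite3 connection object across threads.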
10. Configuration (config.yaml)
Lives at /opt/recon/config.yaml. Secrets (GEMINI_KEYS,
PEERTUBE_TOKEN, etc.) live in /opt/recon/.env — never in
config.yaml, never in git.
Top-level keys
| Key | Meaning |
|---|---|
| library_root | /mnt/library, the LXC bind-mount root (data host /mnt/data/library, local SSD) |
| processing | Worker counts, window sizes, timeouts, retry policy |
| embedding | TEI host/port, model (bge-m3), 1024-d dense |
| sparse_embedding | Separate service on cortex:8091 |
| vector_db | Qdrant host, port, collection name |
| gemini | Model (gemini-2.0-flash), JSON response mode |
| web | Dashboard bind host + port (8420) |
| paths | base, data, text, concepts, intel, logs, db |
| book_server | base_url, strip_prefix for download URL generation |
| upload_paths | Category → filesystem path for upload routing |
| service | scan_interval, stage_poll_interval, progress_interval |
| peertube | api_base, public_url, rate_limit_delay, poll_interval |
| pipeline | acquired_root, processing_root, dispatch table, mtime_stability_seconds |
| crawler / web_scraper | Currently disabled (sites: []) pending re-architecture |
| new_pipeline | Stream-B (old) pipeline, enabled: false |
11. Service & Threads (recon.py cmd_service)
systemctl start recon → python3 recon.py service. The service runs
seven daemon threads plus a metrics collector:
| Thread | Function | Interval |
|---|---|---|
| dispatcher | dispatcher.dispatch_loop | 30s |
| enrich | stage_loop('enrich', ...) | 30s idle |
| embed | stage_loop('embed', ...) | 30s idle |
| filing | filing.filing_worker_loop | 30s |
| peertube-acq | acquisition.peertube.acquisition_loop | 1800s |
| progress | Log status rollup line | 60s |
| dashboard | api.run_server (Flask) | bound |
Plus peertube_collector.start_collector for metrics scrape.
All threads receive a shared stop_event (threading.Event) and exit
cleanly on SIGTERM via signal.signal(SIGTERM, lambda *_: stop_event.set()).
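The shared stop_event pattern can be sketched as below. run_loop and install_sigterm_handler are hypothetical names; the sketch assumes each worker does one unit of work and then sleeps on the event rather than on time.sleep, so a SIGTERM wakes it immediately.

```python
import signal
import threading


def run_loop(stop_event, work, interval):
    """Call work() repeatedly until stop_event is set."""
    while not stop_event.is_set():
        work()
        # wait() returns early as soon as the event is set,
        # so shutdown does not have to wait out a full interval.
        stop_event.wait(interval)


def install_sigterm_handler(stop_event):
    """Set stop_event on SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, lambda *_: stop_event.set())
```

Each row of the thread table would then be a daemon thread started with something like threading.Thread(target=run_loop, args=(stop_event, work, 30), daemon=True).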
CLI commands (recon.py top-level)
scan, queue, extract, enrich, embed, run, status,
catalogue, failures, search, upload, ingest-url, ingest,
ingest-peertube, validate, rebuild, serve, service,
organize, pipeline.
Most commands are thin wrappers around library functions — useful for one-off maintenance from the CT 130 shell.
12. Dashboard & API (lib/api.py)
Flask app bound to 0.0.0.0:8420. Pages are server-rendered Jinja
templates; data is pulled via AJAX from /api/* endpoints.
Page routes
/, /search, /catalogue, /upload, /web-ingest, /failures,
/peertube, /peertube/channels, /settings/{keys,cookies,vpn,health}.
API surface (grouped)
| Group | Endpoints |
|---|---|
| Upload | POST /api/upload, GET /api/upload/<hash>/status, GET /api/upload/categories |
| Ingest | POST /api/ingest-url, /api/ingest-urls, /api/ingest, /api/ingest-peertube, /api/crawl, GET /api/crawl/<id>/status, GET /api/ingest-peertube/<job>/status |
| Search | POST /api/search |
| Status | GET /api/status, /api/quick-stats, /api/knowledge-stats, /api/health |
| Retry | POST /api/retry/<hash>, /api/retry-all |
| Service | POST /api/service/restart |
| Keys | Full CRUD on /api/keys, /api/keys/validate, /api/keys/reload |
| Cookies | GET /api/cookies/status, POST /api/cookies/upload |
| VPN | GET /api/vpn/status, POST /api/vpn/{connect,disconnect,rotate,login} |
| PeerTube | /api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}, /api/peertube/stats |
| Metrics | GET /api/metrics/history |
Qdrant scroll
_qdrant_scroll(host, port, collection, req) is the shared paged-read
helper for rebuilding the knowledge-stats panel.
Cache warmer
start_cache_warmer(stop_event) pre-computes the expensive quick-stats
and knowledge-stats panels so the dashboard loads instantly.
13. Filesystem Layout
/opt/recon/
├── recon.py # CLI + service entry point
├── config.yaml
├── .env # secrets (GEMINI_KEYS etc.)
├── PROJECT-BIBLE.md # this file (copy on CT 130)
├── backups/ # local DB backups
├── data/
│ ├── acquired/ # hopper — {hash}.ext + {hash}.meta.json
│ │ ├── pdf/
│ │ ├── stream/ # PeerTube transcripts
│ │ ├── html/ # (future)
│ │ └── text/
│ ├── processing/{hash}/ # in-flight scratch
│ │ ├── page_NNNN.txt
│ │ ├── meta.json
│ │ └── (original file or transcript.txt)
│ ├── concepts/{hash}/
│ │ └── window_N.json # Gemini enrichment output
│ ├── intel/ # ARGUS intel feeds
│ ├── _duplicates/ # level-4 name-match quarantine
│ ├── _rejected/ # oversize / unreadable PDFs
│ └── recon.db # SQLite WAL mode
├── lib/
│ ├── acquisition/peertube.py
│ ├── processors/{pdf,transcript,text}_processor.py
│ ├── dispatcher.py
│ ├── filing.py
│ ├── enricher.py
│ ├── embedder.py
│ ├── status.py # StatusDB class
│ ├── api.py # Flask dashboard + API
│ ├── new_pipeline.py # update_qdrant_payload helper lives here
│ ├── utils.py # content_hash, generate_download_url, get_config, setup_logging
│ ├── peertube_scraper.py # PeerTube API client
│ └── organizer.py # determine_dominant_domain, level 1-4 naming
└── logs/
/mnt/library/ # LXC bind-mount from data host /mnt/data/library (local SSD), read-write
├── <Domain>/<Subdomain>/<canonical_name>.<ext>
└── _acquired/ _review/ _staging/ signal-archive/ # not touched by pipeline
14. Refactor History (2026-04)
The refactor is tracked as dated phases under phases/.
Implementations are in the RECON repo; design lives here.
| Phase | Focus |
|---|---|
| 0 | Baseline capture — DB dumps, directory listings, config pin |
| 1 | Scaffolding — create acquired/, processing/, config keys |
| 2 | Shared filing function — extract organizer logic into filing.py |
| 3 | Transcript processor — first end-to-end test of the new pattern |
| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
| 5a | Transcript resweep — 16,596 transcripts moved from /mnt/library/_sources/streamecho6/ into /mnt/library/<Domain>/<Subdomain>/ via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into data/acquired/stream/ as .txt+.meta.json pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
| 5c-2 | Service start & transcript drain — clear the hopper backlog |
| 6a | Transcript organized-in-place — set organized_at during pre_flight so filing worker ignores transcripts |
| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
| 6c | Code cleanup — dead-code audit |
| 6d | PeerTube acquisition module — replace ad-hoc ingester with acquisition/peertube.py |
| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
| 6f | Text processor — new lib/processors/text_processor.py handles .txt files with two-source metadata voting (filename + Gemini); new data/acquired/text/ hopper subfolder; files to library like PDFs |
| 6f-2 | Format normalizer in dispatcher — converts .epub/.mobi to PDF via Calibre's ebook-convert, .doc/.docx via libreoffice --headless, called per-subfolder before _find_pairs() |
| 6g | Gemini "null" string bug fix — both pdf_processor and text_processor now filter the literal string "null" out of Gemini's JSON responses before metadata voting |
| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in data/text/; triggered PeerTube transcription for 332 videos without captions via POST /api/v1/videos/{uuid}/captions/generate |
| 6i | Dashboard upload migration — POST /api/upload now routes by extension to the appropriate hopper (pdf/text) with .meta.json sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and add_to_catalogue/queue_document calls, added status endpoint fallback that checks acquired/ and processing/ dirs for the upload/dispatch gap |
| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 _unclassified/ PDFs refiled; _ingest/_duplicates/ cleared; 5 loose root PDFs staged |
| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their catalogue.path restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical .txt files deleted from library; Qdrant download_url payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |
Baseline pre-refactor (per current-state.md)
- 18,855 transcripts in /mnt/library/_sources/streamecho6/.
- Old stream-B new_pipeline ran off /mnt/library/_acquired/.
- scan_library() polled the NFS mount for new PDFs; now deprecated.
- Storage migration note: /mnt/library was historically an NFS mount from pi-nas:/export/library, which is what current-state.md and scan_library() were written against. The library has since been migrated to local SSD on the data Proxmox host (/mnt/data/library) and surfaced into CT 130 via an LXC bind-mount. The pi-nas copy was wiped on 2026-04-15. Path strings inside the codebase didn't change; only the underlying storage did.
15. Operational Runbook
Service control (on CT 130 as zvx)
sudo systemctl {status,start,stop,restart} recon
journalctl -u recon -f
tail -f /opt/recon/logs/recon.log
Backups
# Local DB backup before risky operations
cp /opt/recon/data/recon.db /tmp/recon.db.bak.$(date +%s)
# Offsite backup: planned, not yet configured (TBD — likely rsync to
# pi-nas:/export/recon-backup once a backup target is provisioned).
Inspect pipeline state at a glance
ls /opt/recon/data/acquired/*/ # hopper contents
ls /opt/recon/data/processing/ | wc -l # in-flight count
sqlite3 /opt/recon/data/recon.db \
"SELECT status, COUNT(*) FROM documents GROUP BY status;"
Re-queue a failed document
sqlite3 /opt/recon/data/recon.db \
"UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
# or via API:
curl -X POST https://recon.echo6.co/api/retry/<hash>
Manual ingest
# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
sha=$(sha256sum foo.pdf | cut -d' ' -f1)
cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
Qdrant health
curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
| jq '.result | {status, points_count, optimizer_status}'
# status "grey" with optimizer_status.ok=true is healthy (background indexing).
16. Known Gotchas
- Logger setup. RECON modules must use setup_logging('recon.<name>') from lib.utils, never raw logging.getLogger(). The root logger has no handlers; calls to a raw logger silently disappear.
- Qdrant status "grey" is healthy if optimizer_status.ok == true. Only treat red + not-ok as a real failure.
- Catalogue row count can grow during long-running jobs because parallel ingestion may add rows. Only a decrease is a real integrity failure.
- Dispatcher .tmp safety. CONTENT_EXTENSIONS does not include .tmp, so active acquisition writes are invisible to the dispatcher until the atomic rename lands.
- Transcripts are filed in-place. Their documents.path is a URL and the filing worker's path LIKE '/opt/recon/data/processing/%' filter excludes them.
- PeerTube 429. Respect peertube.rate_limit_delay between caption API calls or you'll get throttled.
- Library is an LXC bind-mount, not NFS. /mnt/library on CT 130 is bound from the data Proxmox host's /mnt/data/library (local ext4 on /dev/sda1). File ownership/UID-GID is shared with the host: writes from inside the container appear with the container UID on the host. No NFS, no root_squash, no network in the path.
- SSH heredocs with Python code break. When editing remote files, write to a temp file via scp or cat > file rather than bash heredocs with parens/quotes.
- The crawler is off. crawler.sites: []. Re-enabling requires a re-architecture for the new pipeline.
17. Credentials & Hosts
| Host | Role | Access |
|---|---|---|
| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | ssh zvx@192.168.1.130 (key auth) |
| data host (192.168.1.240) | Proxmox node hosting CT 130; /mnt/data/library source for the CT 130 bind-mount | ssh root@192.168.1.240 |
| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | ssh zvx@cortex |
| CT 110 (192.168.1.170) | PeerTube stream.echo6.co | ssh zvx@192.168.1.170 |
| pi-nas (192.168.1.245) | Backup target (planned; not yet configured). ~22T pool with ~300G free after library wipe. | ssh zvx@pi-nas |
| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | ssh root@192.168.1.241 'pct exec 101' |
Secrets: /home/zvx/projects/.ref/credentials on TOC (this machine).
RECON Gemini/PeerTube keys: /opt/recon/.env on CT 130.
18. Open Follow-ups
- 82 MULTI_MATCH + 141 UNMATCHED transcript rows still carry library paths post Phase 5a/6k (audit trail at /tmp/phase5a_remaining.txt on CT 130; file still present). Either hand-resolve or tombstone.
- HTML processor (lib/processors/html_processor.py) is scaffolded in config but not implemented. Next-up for Kiwix / web ingest.
- Crawler re-architecture. The tier-1 sites list in config.yaml is a valuable target list but the old crawler is off pending a new acquisition-module-shaped implementation.
- ARGUS intel pipeline shares the DB but its lifecycle is documented separately; not covered here.
- Phase 6e-2 (PeerTube channel sync endpoint) was reverted and needs a redesign before reinstating.
- Level-4 dedupe review queue (duplicate_review table) has no UI yet; items pile up silently.
- 9,478 legacy dirs in /opt/recon/data/text/: historical extraction output from the pre-refactor pipeline, for documents still in catalogue. Not touched by current pipeline. Can be cleaned up once confirmed none are the sole text copy for any document.
- lib/new_pipeline.py is misleadingly named: it's actually a library management CLI tool, not the refactor's new pipeline. It contains the update_qdrant_payload helper that the filing worker depends on. Should be renamed (e.g., library_ops.py) when there's time.
- SSH key for CT 130 forge access: currently uses HTTPS with embedded token in remote URL. Move to SSH key auth.
- Backup policy for derived data: /opt/recon/data/concepts/ and Qdrant snapshots are not in any backup rotation. If CT 130 or cortex lose their disks, these are the hardest to regenerate (Gemini calls + embedding compute).
- Backup architecture: no offsite backup is currently configured. Section 15 references a planned rsync-to-pi-nas job, but neither the script nor the systemd timer (recon-backup.timer) exists. Decide what gets backed up (recon.db, concepts/, text/, Qdrant snapshots, /mnt/library?), where, and on what cadence; pi-nas has ~300G free in /export/ after the 2026-04-15 library wipe and could be the target for a first pass.
- signal-archive/ in /mnt/library/: 44 Signal/Matrix chat log files, not library content. Matt intends these to "eventually contribute" to the knowledge base but no ingestion path exists yet.
Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.