# RECON — Project Bible Canonical architectural reference for **RECON** (the knowledge extraction pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document is the orientation dossier for any future session. It is skim-and-find, not a tutorial. - **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design) - **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx) - **Service:** `systemctl status recon` - **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik) - **Files server:** `https://files.echo6.co` (Authentik forward auth) --- ## 1. Mission RECON ingests documents from multiple sources (manual PDF uploads, PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and produces a **searchable, domain-organized library** plus a hybrid dense/sparse vector index in Qdrant on cortex. Every piece of content ends up in two places: 1. A file under `/mnt/library///.` (PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/` (PeerTube transcripts — no local copy after Phase 5a). 2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid` (dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense). Search returns page-grounded citations back to the file or stream URL. --- ## 2. System Topology ``` ┌─────────────────────────┐ │ CT 130 (recon) │ Library (NFS) │ /opt/recon/ │ ┌──────────────┐ pi-nas:/export/library│ ├─ data/ │ │ Qdrant │ → /mnt/library/ │ │ ├─ acquired/ │ │ cortex:6333 │ (read/write) │ │ ├─ processing/ │ ←→ │ recon_knowledge_hybrid │ │ ├─ concepts/ │ │ (1024-d dense + sparse) │ │ └─ recon.db │ └──────────────┘ │ ├─ lib/ │ │ ├─ recon.py │ ┌──────────────┐ │ └─ config.yaml │ ←→ │ TEI │ │ recon.service │ │ cortex:8090 │ │ nginx :8888 (files) │ │ bge-m3 dense │ └─────────────────────────┘ └──────────────┘ ▲ ┌──────────────┐ │ │ Sparse svc │ ┌───────────────────────┴─────┐ ←→ │ cortex:8091 │ │ │ │ bge-m3 sparse│ PeerTube (CT 110 / stream.echo6.co) Gemini API └──────────────┘ api_base: http://192.168.1.170 (enrichment, vision OCR) ``` Shared caddy reverse proxy (CT 101) surfaces the dashboard (8420) and nginx file server (8888) as `recon.echo6.co` and `files.echo6.co`. --- ## 3. Pipeline Lifecycle Every document follows the same five-stage arc regardless of source type. The filesystem location at any given moment tells you which stage the item is in — **state is a directory.** ``` ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ 1. ACQUIRE │ → │ 2. DISPATCH │ → │ 3. PROCESS │ → │ 4. ENRICH / │ → │ 5. FILE │ │ │ │ (pre_flight) │ │ │ │ EMBED │ │ │ │ data/acquired│ │ dispatcher.py│ │ per-type │ │ shared │ │ shared │ │ // │ │ watches │ │ processor │ │ stage loops │ │ filing worker│ │ .{ext} │ │ subfolders, │ │ moves file │ │ bge-m3 → │ │ moves file │ │ .meta │ │ hands to │ │ to processing│ │ Qdrant │ │ processing → │ │ │ │ processor │ │ /{hash}/ │ │ │ │ library, │ │ │ │ │ │ │ │ │ │ updates DB + │ │ │ │ │ │ │ │ │ │ Qdrant │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ ``` **Status column on documents:** `catalogued → queued → extracting → extracted → enriching → enriched → embedding → complete` (plus terminal `error`, `content_failure`, `duplicate` states). `organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after filing. Transcripts are marked organized in-place during pre_flight (they have no filesystem target — the watch URL is their "home"). --- ## 4. Acquisition Layer (`lib/acquisition/`) Acquisition modules fetch content from external sources and drop `{hash}.` + `{hash}.meta.json` **flat file pairs** into `data/acquired//`. They do **not** touch the database — that's the processor's job. ### Atomic drop protocol 1. Write content to `..tmp` (unknown extension, safe from dispatcher). 2. Compute hash; rename tmp to final `.`. 3. Write `.meta.json.tmp`, then rename to `.meta.json`. 4. **Meta goes final first, content goes final last.** Dispatcher only picks up when content file exists and is stable, so a half-visible pair without meta never gets dispatched. ### PeerTube acquisition (`lib/acquisition/peertube.py`) - Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`. - Queries catalogue for `source='stream.echo6.co'` rows, builds sets of known UUIDs (`/w/` extracted from `path`) **and** known titles (from `filename`) — both cohorts are checked so Phase 5b-rewritten rows and pre-5b library-path rows dedupe correctly. - Lists PeerTube videos via `peertube_scraper.get_videos`, filters to those with captions, prefers English caption. - For each new one: fetches VTT, converts to text with `vtt_to_text`, atomically drops pair into `data/acquired/stream/`. - Rate limits at `peertube.rate_limit_delay` (default 0.5s) — **PeerTube returns 429 if captions are fetched too fast.** ### Manual uploads / URL ingest `api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`, `/api/ingest-peertube` — all end by dropping a pair into `acquired//`. --- ## 5. Dispatcher (`lib/dispatcher.py`) The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It watches each configured subfolder under `data/acquired/` and hands stable file pairs to the registered processor. ### Config-driven dispatch table ```yaml pipeline: acquired_root: /opt/recon/data/acquired processing_root: /opt/recon/data/processing dispatch: pdf: pdf_processor stream: transcript_processor html: html_processor # not yet implemented text: text_processor mtime_stability_seconds: 10 ``` ### Extension constants - `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the dispatcher considers a file "content" only if its extension is in this set. **`.tmp` is not in the set**, so partial writes are safe. - `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}` — these are normalized to PDF **before** dispatch. ### Normalization step `_normalize_formats(subfolder_path)`: - `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI). - `.doc` / `.docx` → PDF via `libreoffice --headless`. - Sidecar `.meta.json` is renamed to match the new PDF hash so pairing holds. ### Pair finding `_find_pairs(subfolder_path)` returns tuples of (content_path, meta_path_or_None). Pairs where only content exists are still valid — meta is not required. A meta without its content is ignored. ### Stability check `_is_stable(filepath, stability_seconds)` — mtime must be at least `mtime_stability_seconds` old (default 10s) before dispatch. Prevents racing active writers. --- ## 6. Processors (`lib/processors/`) Each processor implements **one function**: `pre_flight(content_path, meta_path, db, config) → dict`. It owns all the type-specific logic and **all the database writes** for that item up to status=`extracted`. ### Common pre_flight contract Every processor does, in order: 1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`). 2. Stale state cleanup: `rm -rf processing/{hash}/` and `concepts/{hash}/` if they exist (guards against re-runs). 3. Hash dedupe: if `hash` already exists in `catalogue`, delete the pair, return action `duplicate`. 4. Type-specific metadata extraction + level-4 dedupe check (PDF only). 5. Move content + meta into `processing/{hash}/` with a type-specific layout. 6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir` and `page_count`, `db.update_status(hash, 'extracted', ...)`. Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are one of: `extracted`, `duplicate`, `level4_duplicate`, `content_failure`, `error`, and `duplicate` (for transcripts) or `skip_empty` (for text). ### `pdf_processor.py` The heaviest processor. Layered metadata extraction: - **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to `{title, author, edition, year}`. - **Source B — Filename:** regex parse the original filename. - **Source C — Gemini Vision OCR** on first 3 pages when A+B disagree or are missing. Returns structured JSON via Gemini's `response_mime_type: application/json`. - **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources; 2-of-3 wins; ties prefer Source A. - **Level-4 dedupe:** if all four fields (`title, edition, author, year`) are present and match an existing catalogue row with a different hash, the PDF is quarantined to `_duplicates/` for human review. - **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize PDFs move to `_rejected/`. - **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract OCR → Gemini Vision on a per-page basis. Output: `processing/{hash}/page_NNNN.txt`. ### `transcript_processor.py` Lightweight. The VTT→text conversion already happened in acquisition, so pre_flight just: - Hashes `.txt` file. - Reads meta.json sidecar. - `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into `page_NNNN.txt` files. - Writes the transcript as `processing/{hash}/transcript.txt` plus page chunks. - Registers with category `Transcript`, source `stream.echo6.co`. - Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP` immediately** — transcripts are filed-in-place (their "location" is the PeerTube watch URL, set later as the catalogue `path` via Phase 5a). ### `text_processor.py` Raw `.txt` files dropped via manual upload. Two-source metadata vote (filename + meta.json). Similar flow to transcript processor but no fixed category or source. --- ## 7. Enrichment & Embedding Both are **source-agnostic stage loops** that just poll documents by status and do their work. They live in `lib/enricher.py` and `lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`. ### Enrichment (`enrich_workers: 16` threads per batch) - Polls `status = 'extracted' AND retries < max_retries`. - Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`. - Windows pages (`enrich_window_size: 5` per window) and sends each window to Gemini with a structured prompt. - Stores `concepts/{hash}/window_N.json` per window. - Backoff: `enrich_base_delay=5s`, doubling up to `enrich_max_delay=120s`, max `enrich_max_retries=5`. - On success: `update_status(hash, 'enriched')`. ### Embedding (`embed_workers: 4`) - Polls `status = 'enriched'`. - Reads concept JSONs, builds page-level chunks. - Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of 128 per TEI request. Throughput ~1,711 emb/sec. - Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse mode; `sparse_embedding.enabled: true`). - Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`, batch size `embed_batch_size=500` vectors per upsert. - Payload carries: `hash`, `filename`, `original_filename`, `download_url`, `page`, `text`, `title`, `domain`, `subdomain`, `category`. - Ollama is a fallback backend (much slower, ~8 emb/sec) via `embedding.backend: ollama`. - On success: `update_status(hash, 'complete')`. --- ## 8. Filing (`lib/filing.py`) One daemon thread, `filing_worker_loop(interval=30)`. It polls: ```sql SELECT hash FROM documents WHERE status = 'complete' AND organized_at IS NULL AND path LIKE '/opt/recon/data/processing/%' LIMIT 50 ``` The `path LIKE '/opt/recon/data/processing/%'` filter naturally **excludes transcripts** — their `documents.path` was never a filesystem path but the PeerTube watch URL. For each row, `file_processed_item(doc_hash, source_file_path, db, config)` does: 1. `determine_dominant_domain(hash)` reads concept JSONs, returns the top-voted `Domain/Subdomain`. 2. `_build_target_path(...)` derives the canonical name starting at level 1 (`Title`), escalating to level 2/3/4 only if a collision exists in the target folder. **Preserves source file's actual extension** (not hardcoded to `.pdf`). 3. `shutil.move(source, target)` atomically. Target is `/mnt/library///.`. 4. Updates: - `catalogue.path` → new target - `catalogue.filename` → new canonical name - `documents.path` → new target - Qdrant payload via `update_qdrant_payload(...)`: `download_url = generate_download_url(new_path, ...)`, `filename`, `original_filename` set on every point for that hash. 5. `db.mark_organized(hash)` sets `organized_at` + cleans up `processing/{hash}/`. ### Download URL helper (`lib/utils.py:generate_download_url`) - If the path is already `http://` or `https://` (transcripts), return it unchanged. - Otherwise strip `library_root` prefix and prepend `book_server.base_url` (→ `https://files.echo6.co/`). --- ## 9. StatusDB (`lib/status.py`) SQLite (`data/recon.db`) in WAL mode with thread-local connections (`_get_conn()` uses `threading.local`). ### Tables | Table | Purpose | |---|---| | `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size | | `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps | | `intel` | ARGUS intel feed entries (separate pipeline) | | `metrics_snapshots` | Time-series rollups for the dashboard | | `file_operations` | Audit log of Phase-5-style file moves and renames | | `duplicate_review` | Level-4 dedupe quarantine queue | ### Key methods - `add_to_catalogue(hash, title, url, size, source, category)` - `queue_document(hash)` — insert into `documents` with status=`queued` - `update_status(hash, status, **kwargs)` — single point of status truth - `mark_organized(hash)` — sets `organized_at`, final transition - `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` — used by filing worker and Phase 5a un-file - `get_path_updates` / `clear_path_update` — small change queue for backfills ### Connection safety All writers take a short-lived connection via `_get_conn()`. WAL mode allows concurrent readers; writes are serialized at the SQLite level. No explicit `BEGIN` — rely on autocommit semantics with occasional `conn.commit()` after grouped updates. --- ## 10. Configuration (`config.yaml`) Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`, `PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in `config.yaml`, never in git. ### Top-level keys | Key | Meaning | |---|---| | `library_root` | `/mnt/library` — NFS mount root | | `processing` | Worker counts, window sizes, timeouts, retry policy | | `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense | | `sparse_embedding` | Separate service on cortex:8091 | | `vector_db` | Qdrant host, port, collection name | | `gemini` | Model (`gemini-2.0-flash`), JSON response mode | | `web` | Dashboard bind host + port (8420) | | `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` | | `book_server` | `base_url`, `strip_prefix` for download URL generation | | `upload_paths` | Category → filesystem path for upload routing | | `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` | | `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` | | `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` | | `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture | | `new_pipeline` | Stream-B (old) pipeline, `enabled: false` | --- ## 11. Service & Threads (`recon.py cmd_service`) `systemctl start recon` → `python3 recon.py service`. The service runs seven daemon threads plus a metrics collector: | Thread | Function | Interval | |---|---|---| | `dispatcher` | `dispatcher.dispatch_loop` | 30s | | `enrich` | `stage_loop('enrich', ...)` | 30s idle | | `embed` | `stage_loop('embed', ...)` | 30s idle | | `filing` | `filing.filing_worker_loop` | 30s | | `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s | | `progress` | Log status rollup line | 60s | | `dashboard` | `api.run_server` (Flask) | bound | Plus `peertube_collector.start_collector` for metrics scrape. All threads receive a shared `stop_event` (`threading.Event`) and exit cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`. ### CLI commands (`recon.py` top-level) `scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`, `catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`, `ingest-peertube`, `validate`, `rebuild`, `serve`, `service`, `organize`, `pipeline`. Most commands are thin wrappers around library functions — useful for one-off maintenance from the CT 130 shell. --- ## 12. Dashboard & API (`lib/api.py`) Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja templates; data is pulled via AJAX from `/api/*` endpoints. ### Page routes `/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`, `/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`. ### API surface (grouped) | Group | Endpoints | |---|---| | Upload | `POST /api/upload`, `GET /api/upload//status`, `GET /api/upload/categories` | | Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl//status`, `GET /api/ingest-peertube//status` | | Search | `POST /api/search` | | Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` | | Retry | `POST /api/retry/`, `/api/retry-all` | | Service | `POST /api/service/restart` | | Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` | | Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` | | VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` | | PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/}`, `/api/peertube/stats` | | Metrics | `GET /api/metrics/history` | ### Qdrant scroll `_qdrant_scroll(host, port, collection, req)` is the shared paged-read helper for rebuilding the knowledge-stats panel. ### Cache warmer `start_cache_warmer(stop_event)` pre-computes the expensive quick-stats and knowledge-stats panels so the dashboard loads instantly. --- ## 13. Filesystem Layout ``` /opt/recon/ ├── recon.py # CLI + service entry point ├── config.yaml ├── .env # secrets (GEMINI_KEYS etc.) ├── PROJECT-BIBLE.md # this file (copy on CT 130) ├── backups/ # local DB backups ├── data/ │ ├── acquired/ # hopper — {hash}.ext + {hash}.meta.json │ │ ├── pdf/ │ │ ├── stream/ # PeerTube transcripts │ │ ├── html/ # (future) │ │ └── text/ │ ├── processing/{hash}/ # in-flight scratch │ │ ├── page_NNNN.txt │ │ ├── meta.json │ │ └── (original file or transcript.txt) │ ├── concepts/{hash}/ │ │ └── window_N.json # Gemini enrichment output │ ├── intel/ # ARGUS intel feeds │ ├── _duplicates/ # level-4 name-match quarantine │ ├── _rejected/ # oversize / unreadable PDFs │ └── recon.db # SQLite WAL mode ├── lib/ │ ├── acquisition/peertube.py │ ├── processors/{pdf,transcript,text}_processor.py │ ├── dispatcher.py │ ├── filing.py │ ├── enricher.py │ ├── embedder.py │ ├── status.py # StatusDB class │ ├── api.py # Flask dashboard + API │ ├── new_pipeline.py # update_qdrant_payload helper lives here │ ├── utils.py # content_hash, generate_download_url, get_config, setup_logging │ ├── peertube_scraper.py # PeerTube API client │ └── organizer.py # determine_dominant_domain, level 1-4 naming └── logs/ /mnt/library/ # NFS from pi-nas, read-write ├── //. └── _acquired/ _review/ _staging/ signal-archive/ # not touched by pipeline ``` --- ## 14. Refactor History (2026-04) The refactor is tracked as dated phases under `phases/`. Status implementations are in the RECON repo; design lives here. | Phase | Focus | |---|---| | 0 | Baseline capture — DB dumps, directory listings, config pin | | 1 | Scaffolding — create `acquired/`, `processing/`, config keys | | 2 | Shared filing function — extract organizer logic into `filing.py` | | 3 | Transcript processor — first end-to-end test of the new pattern | | 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe | | 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library///` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) | | 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned | | 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in | | 5c-2 | Service start & transcript drain — clear the hopper backlog | | 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts | | 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query | | 6c | Code cleanup — dead-code audit | | 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` | | 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) | | 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs | | 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` | | 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting | | 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` | | 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap | | 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged | | 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review | ### Baseline pre-refactor (per `current-state.md`) - 18,855 transcripts in `/mnt/library/_sources/streamecho6/`. - Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`. - `scan_library()` polled the NFS mount for new PDFs — now deprecated. --- ## 15. Operational Runbook ### Service control (on CT 130 as zvx) ```bash sudo systemctl {status,start,stop,restart} recon journalctl -u recon -f tail -f /opt/recon/logs/recon.log ``` ### Backups ```bash # Local DB backup before risky operations cp /opt/recon/data/recon.db /tmp/recon.db.bak.$(date +%s) # Contabo offsite (automatic): rsync every 6 hours, see recon-backup.timer ``` ### Inspect pipeline state at a glance ```bash ls /opt/recon/data/acquired/*/ # hopper contents ls /opt/recon/data/processing/ | wc -l # in-flight count sqlite3 /opt/recon/data/recon.db \ "SELECT status, COUNT(*) FROM documents GROUP BY status;" ``` ### Re-queue a failed document ```bash sqlite3 /opt/recon/data/recon.db \ "UPDATE documents SET status='extracted', retries=0 WHERE hash='';" # or via API: curl -X POST https://recon.echo6.co/api/retry/ ``` ### Manual ingest ```bash # Drop a PDF into the hopper (dispatcher will pick it up on next cycle) sha=$(sha256sum foo.pdf | cut -d' ' -f1) cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf ``` ### Qdrant health ```bash curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \ | jq '.result | {status, points_count, optimizer_status}' # status "grey" with optimizer_status.ok=true is healthy (background indexing). ``` --- ## 16. Known Gotchas - **Logger setup.** RECON modules must use `setup_logging('recon.')` from `lib.utils`, never raw `logging.getLogger()`. The root logger has no handlers; calls to a raw logger silently disappear. - **Qdrant status "grey" is healthy** if `optimizer_status.ok == true`. Only treat red + not-ok as a real failure. - **Catalogue row count can grow during long-running jobs** because parallel ingestion may add rows. Only a *decrease* is a real integrity failure. - **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include `.tmp`, so active acquisition writes are invisible to the dispatcher until the atomic rename lands. - **Transcripts are filed in-place.** Their `documents.path` is a URL and filing worker's `path LIKE '/opt/recon/data/processing/%'` filter excludes them. - **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption API calls or you'll get throttled. - **SSH heredocs with Python code break.** When editing remote files, write to a temp file via `scp` or `cat > file` rather than bash heredocs with parens/quotes. - **The crawler is off.** `crawler.sites: []`. Re-enabling requires a re-architecture for the new pipeline. --- ## 17. Credentials & Hosts | Host | Role | Access | |---|---|---| | CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) | | cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` | | CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` | | pi-nas (192.168.1.245) | NFS server for `/mnt/library` | `ssh zvx@pi-nas` | | CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` | Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine). RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130. --- ## 18. Open Follow-ups - **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry library paths post Phase 5a/6k (audit trail at `/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either hand-resolve or tombstone. - **HTML processor** (`lib/processors/html_processor.py`) is scaffolded in config but not implemented. Next-up for Kiwix / web ingest. - **Crawler re-architecture.** The tier-1 sites list in `config.yaml` is a valuable target list but the old crawler is off pending a new acquisition-module-shaped implementation. - **ARGUS intel pipeline** shares the DB but its lifecycle is documented separately — not covered here. - **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and needs a redesign before reinstating. - **Level-4 dedupe review queue** (`duplicate_review` table) has no UI yet; items pile up silently. - **9,478 legacy dirs in `/opt/recon/data/text/`** — historical extraction output from the pre-refactor pipeline, for documents still in catalogue. Not touched by current pipeline. Can be cleaned up once confirmed none are the sole text copy for any document. - **`lib/new_pipeline.py` is misleadingly named** — it's actually a library management CLI tool, not the refactor's new pipeline. Contains `update_qdrant_payload` helper that filing worker depends on. Should be renamed (e.g., `library_ops.py`) when there's time. - **SSH key for CT 130 forge access** — currently uses HTTPS with embedded token in remote URL. Move to SSH key auth. - **Backup policy for derived data** — `/opt/recon/data/concepts/` and Qdrant snapshots are not in any backup rotation. If CT 130 or cortex lose their disks, these are the hardest to regenerate (Gemini calls + embedding compute). - **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log files, not library content. Matt intends these to "eventually contribute" to the knowledge base but no ingestion path exists yet. --- *Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*