mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 06:34:34 +02:00
Add PROJECT-BIBLE.md: canonical architectural reference for RECON
Consolidated orientation document for future sessions. Covers pipeline lifecycle (acquire → dispatch → process → enrich/embed → file), acquisition modules, dispatcher, per-type processors, filing, StatusDB schema, config, service threads, dashboard/API, filesystem layout, refactor history, runbook, known gotchas, and follow-ups. Sourced from live code on CT 130 (/opt/recon/) including recon.py, dispatcher.py, filing.py, status.py, the three processors, acquisition/peertube.py, config.yaml, and api.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
5b0d4eed90
commit
c9a8f1ecb5
1 changed files with 629 additions and 0 deletions
629
PROJECT-BIBLE.md
Normal file
629
PROJECT-BIBLE.md
Normal file
|
|
@ -0,0 +1,629 @@
|
|||
# RECON — Project Bible
|
||||
|
||||
Canonical architectural reference for **RECON** (the knowledge extraction
|
||||
pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document
|
||||
is the orientation dossier for any future session. It is skim-and-find,
|
||||
not a tutorial.
|
||||
|
||||
- **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design)
|
||||
- **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx)
|
||||
- **Service:** `systemctl status recon`
|
||||
- **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik)
|
||||
- **Files server:** `https://files.echo6.co` (Authentik forward auth)
|
||||
|
||||
---
|
||||
|
||||
## 1. Mission
|
||||
|
||||
RECON ingests documents from multiple sources (manual PDF uploads,
|
||||
PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and
|
||||
produces a **searchable, domain-organized library** plus a hybrid
|
||||
dense/sparse vector index in Qdrant on cortex.
|
||||
|
||||
Every piece of content ends up in two places:
|
||||
|
||||
1. A file under `/mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext>`
|
||||
(PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/<uuid>`
|
||||
(PeerTube transcripts — no local copy after Phase 5a).
|
||||
2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid`
|
||||
(dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense).
|
||||
|
||||
Search returns page-grounded citations back to the file or stream URL.
|
||||
|
||||
---
|
||||
|
||||
## 2. System Topology
|
||||
|
||||
```
|
||||
┌─────────────────────────┐
|
||||
│ CT 130 (recon) │
|
||||
Library (NFS) │ /opt/recon/ │ ┌──────────────┐
|
||||
pi-nas:/export/library│ ├─ data/ │ │ Qdrant │
|
||||
→ /mnt/library/ │ │ ├─ acquired/ │ │ cortex:6333 │
|
||||
(read/write) │ │ ├─ processing/ │ ←→ │ recon_knowledge_hybrid
|
||||
│ │ ├─ concepts/ │ │ (1024-d dense + sparse)
|
||||
│ │ └─ recon.db │ └──────────────┘
|
||||
│ ├─ lib/ │
|
||||
│ ├─ recon.py │ ┌──────────────┐
|
||||
│ └─ config.yaml │ ←→ │ TEI │
|
||||
│ recon.service │ │ cortex:8090 │
|
||||
│ nginx :8888 (files) │ │ bge-m3 dense │
|
||||
└─────────────────────────┘ └──────────────┘
|
||||
▲ ┌──────────────┐
|
||||
│ │ Sparse svc │
|
||||
┌───────────────────────┴─────┐ ←→ │ cortex:8091 │
|
||||
│ │ │ bge-m3 sparse│
|
||||
PeerTube (CT 110 / stream.echo6.co) Gemini API └──────────────┘
|
||||
api_base: http://192.168.1.170 (enrichment,
|
||||
vision OCR)
|
||||
```
|
||||
|
||||
Shared caddy reverse proxy (CT 101) surfaces the dashboard (8420) and
|
||||
nginx file server (8888) as `recon.echo6.co` and `files.echo6.co`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Pipeline Lifecycle
|
||||
|
||||
Every document follows the same five-stage arc regardless of source type.
|
||||
The filesystem location at any given moment tells you which stage the
|
||||
item is in — **state is a directory.**
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ 1. ACQUIRE │ → │ 2. DISPATCH │ → │ 3. PROCESS │ → │ 4. ENRICH / │ → │ 5. FILE │
|
||||
│ │ │ (pre_flight) │ │ │ │ EMBED │ │ │
|
||||
│ data/acquired│ │ dispatcher.py│ │ per-type │ │ shared │ │ shared │
|
||||
│ /<type>/ │ │ watches │ │ processor │ │ stage loops │ │ filing worker│
|
||||
│ <hash>.{ext} │ │ subfolders, │ │ moves file │ │ bge-m3 → │ │ moves file │
|
||||
│ <hash>.meta │ │ hands to │ │ to processing│ │ Qdrant │ │ processing → │
|
||||
│ │ │ processor │ │ /{hash}/ │ │ │ │ library, │
|
||||
│ │ │ │ │ │ │ │ │ updates DB + │
|
||||
│ │ │ │ │ │ │ │ │ Qdrant │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
|
||||
```
|
||||
|
||||
**Status column on documents:**
|
||||
`catalogued → queued → extracting → extracted → enriching → enriched →
|
||||
embedding → complete` (plus terminal `error`, `content_failure`,
|
||||
`duplicate` states).
|
||||
|
||||
`organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after
|
||||
filing. Transcripts are marked organized in-place during pre_flight
|
||||
(they have no filesystem target — the watch URL is their "home").
|
||||
|
||||
---
|
||||
|
||||
## 4. Acquisition Layer (`lib/acquisition/`)
|
||||
|
||||
Acquisition modules fetch content from external sources and drop
|
||||
`{hash}.<ext>` + `{hash}.meta.json` **flat file pairs** into
|
||||
`data/acquired/<type>/`. They do **not** touch the database — that's the
|
||||
processor's job.
|
||||
|
||||
### Atomic drop protocol
|
||||
1. Write content to `<hash>.<ext>.tmp` (unknown extension, safe from dispatcher).
|
||||
2. Compute hash; rename tmp to final `<hash>.<ext>`.
|
||||
3. Write `<hash>.meta.json.tmp`, then rename to `<hash>.meta.json`.
|
||||
4. **Meta goes final first, content goes final last.** Dispatcher only
|
||||
picks up when content file exists and is stable, so a half-visible
|
||||
pair without meta never gets dispatched.
|
||||
|
||||
### PeerTube acquisition (`lib/acquisition/peertube.py`)
|
||||
- Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`.
|
||||
- Queries catalogue for `source='stream.echo6.co'` rows, builds sets of
|
||||
known UUIDs (`/w/<uuid>` extracted from `path`) **and** known titles
|
||||
(from `filename`) — both cohorts are checked so Phase 5b-rewritten
|
||||
rows and pre-5b library-path rows dedupe correctly.
|
||||
- Lists PeerTube videos via `peertube_scraper.get_videos`, filters to
|
||||
those with captions, prefers English caption.
|
||||
- For each new one: fetches VTT, converts to text with `vtt_to_text`,
|
||||
atomically drops pair into `data/acquired/stream/`.
|
||||
- Rate limits at `peertube.rate_limit_delay` (default 0.5s) —
|
||||
**PeerTube returns 429 if captions are fetched too fast.**
|
||||
|
||||
### Manual uploads / URL ingest
|
||||
`api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`,
|
||||
`/api/ingest-peertube` — all end by dropping a pair into `acquired/<type>/`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Dispatcher (`lib/dispatcher.py`)
|
||||
|
||||
The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It
|
||||
watches each configured subfolder under `data/acquired/` and hands
|
||||
stable file pairs to the registered processor.
|
||||
|
||||
### Config-driven dispatch table
|
||||
```yaml
|
||||
pipeline:
|
||||
acquired_root: /opt/recon/data/acquired
|
||||
processing_root: /opt/recon/data/processing
|
||||
dispatch:
|
||||
pdf: pdf_processor
|
||||
stream: transcript_processor
|
||||
html: html_processor # not yet implemented
|
||||
text: text_processor
|
||||
mtime_stability_seconds: 10
|
||||
```
|
||||
|
||||
### Extension constants
|
||||
- `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the
|
||||
dispatcher considers a file "content" only if its extension is in
|
||||
this set. **`.tmp` is not in the set**, so partial writes are safe.
|
||||
- `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}` —
|
||||
these are normalized to PDF **before** dispatch.
|
||||
|
||||
### Normalization step
|
||||
`_normalize_formats(subfolder_path)`:
|
||||
- `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI).
|
||||
- `.doc` / `.docx` → PDF via `libreoffice --headless`.
|
||||
- Sidecar `.meta.json` is renamed to match the new PDF hash so pairing
|
||||
holds.
|
||||
|
||||
### Pair finding
|
||||
`_find_pairs(subfolder_path)` returns tuples of (content_path,
|
||||
meta_path_or_None). Pairs where only content exists are still valid —
|
||||
meta is not required. A meta without its content is ignored.
|
||||
|
||||
### Stability check
|
||||
`_is_stable(filepath, stability_seconds)` — mtime must be at least
|
||||
`mtime_stability_seconds` old (default 10s) before dispatch. Prevents
|
||||
racing active writers.
|
||||
|
||||
---
|
||||
|
||||
## 6. Processors (`lib/processors/`)
|
||||
|
||||
Each processor implements **one function**: `pre_flight(content_path,
|
||||
meta_path, db, config) → dict`. It owns all the type-specific logic and
|
||||
**all the database writes** for that item up to status=`extracted`.
|
||||
|
||||
### Common pre_flight contract
|
||||
Every processor does, in order:
|
||||
1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`).
|
||||
2. Stale state cleanup: `rm -rf processing/{hash}/` and
|
||||
`concepts/{hash}/` if they exist (guards against re-runs).
|
||||
3. Hash dedupe: if `hash` already exists in `catalogue`, delete the
|
||||
pair, return action `duplicate`.
|
||||
4. Type-specific metadata extraction + level-4 dedupe check (PDF only).
|
||||
5. Move content + meta into `processing/{hash}/` with a type-specific
|
||||
layout.
|
||||
6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir`
|
||||
and `page_count`, `db.update_status(hash, 'extracted', ...)`.
|
||||
|
||||
Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are
|
||||
one of: `extracted`, `duplicate`, `level4_duplicate`, `content_failure`,
|
||||
`error`, and `duplicate` (for transcripts) or `skip_empty` (for text).
|
||||
|
||||
### `pdf_processor.py`
|
||||
The heaviest processor. Layered metadata extraction:
|
||||
- **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to
|
||||
`{title, author, edition, year}`.
|
||||
- **Source B — Filename:** regex parse the original filename.
|
||||
- **Source C — Gemini Vision OCR** on first 3 pages when A+B
|
||||
disagree or are missing. Returns structured JSON via Gemini's
|
||||
`response_mime_type: application/json`.
|
||||
- **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources;
|
||||
2-of-3 wins; ties prefer Source A.
|
||||
- **Level-4 dedupe:** if all four fields (`title, edition, author,
|
||||
year`) are present and match an existing catalogue row with a
|
||||
different hash, the PDF is quarantined to `_duplicates/` for human
|
||||
review.
|
||||
- **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize
|
||||
PDFs move to `_rejected/`.
|
||||
- **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract
|
||||
OCR → Gemini Vision on a per-page basis. Output:
|
||||
`processing/{hash}/page_NNNN.txt`.
|
||||
|
||||
### `transcript_processor.py`
|
||||
Lightweight. The VTT→text conversion already happened in acquisition,
|
||||
so pre_flight just:
|
||||
- Hashes `<hash>.txt` file.
|
||||
- Reads meta.json sidecar.
|
||||
- `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into
|
||||
`page_NNNN.txt` files.
|
||||
- Writes the transcript as `processing/{hash}/transcript.txt` plus page
|
||||
chunks.
|
||||
- Registers with category `Transcript`, source `stream.echo6.co`.
|
||||
- Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP`
|
||||
immediately** — transcripts are filed-in-place (their "location" is
|
||||
the PeerTube watch URL, set later as the catalogue `path` via Phase
|
||||
5a).
|
||||
|
||||
### `text_processor.py`
|
||||
Raw `.txt` files dropped via manual upload. Two-source metadata vote
|
||||
(filename + meta.json). Similar flow to transcript processor but no
|
||||
fixed category or source.
|
||||
|
||||
---
|
||||
|
||||
## 7. Enrichment & Embedding
|
||||
|
||||
Both are **source-agnostic stage loops** that just poll documents by
|
||||
status and do their work. They live in `lib/enricher.py` and
|
||||
`lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`.
|
||||
|
||||
### Enrichment (`enrich_workers: 16` threads per batch)
|
||||
- Polls `status = 'extracted' AND retries < max_retries`.
|
||||
- Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`.
|
||||
- Windows pages (`enrich_window_size: 5` per window) and sends each
|
||||
window to Gemini with a structured prompt.
|
||||
- Stores `concepts/{hash}/window_N.json` per window.
|
||||
- Backoff: `enrich_base_delay=5s`, doubling up to
|
||||
`enrich_max_delay=120s`, max `enrich_max_retries=5`.
|
||||
- On success: `update_status(hash, 'enriched')`.
|
||||
|
||||
### Embedding (`embed_workers: 4`)
|
||||
- Polls `status = 'enriched'`.
|
||||
- Reads concept JSONs, builds page-level chunks.
|
||||
- Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of
|
||||
128 per TEI request. Throughput ~1,711 emb/sec.
|
||||
- Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse
|
||||
mode; `sparse_embedding.enabled: true`).
|
||||
- Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`,
|
||||
batch size `embed_batch_size=500` vectors per upsert.
|
||||
- Payload carries: `hash`, `filename`, `original_filename`,
|
||||
`download_url`, `page`, `text`, `title`, `domain`, `subdomain`,
|
||||
`category`.
|
||||
- Ollama is a fallback backend (much slower, ~8 emb/sec) via
|
||||
`embedding.backend: ollama`.
|
||||
- On success: `update_status(hash, 'complete')`.
|
||||
|
||||
---
|
||||
|
||||
## 8. Filing (`lib/filing.py`)
|
||||
|
||||
One daemon thread, `filing_worker_loop(interval=30)`. It polls:
|
||||
|
||||
```sql
|
||||
SELECT hash FROM documents
|
||||
WHERE status = 'complete'
|
||||
AND organized_at IS NULL
|
||||
AND path LIKE '/opt/recon/data/processing/%'
|
||||
LIMIT 50
|
||||
```
|
||||
|
||||
The `path LIKE '/opt/recon/data/processing/%'` filter naturally
|
||||
**excludes transcripts** — their `documents.path` was never a
|
||||
filesystem path but the PeerTube watch URL.
|
||||
|
||||
For each row, `file_processed_item(doc_hash, source_file_path, db,
|
||||
config)` does:
|
||||
1. `determine_dominant_domain(hash)` reads concept JSONs, returns the
|
||||
top-voted `Domain/Subdomain`.
|
||||
2. `_build_target_path(...)` derives the canonical name starting at
|
||||
level 1 (`Title`), escalating to level 2/3/4 only if a collision
|
||||
exists in the target folder. **Preserves source file's actual
|
||||
extension** (not hardcoded to `.pdf`).
|
||||
3. `shutil.move(source, target)` atomically. Target is
|
||||
`/mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>`.
|
||||
4. Updates:
|
||||
- `catalogue.path` → new target
|
||||
- `catalogue.filename` → new canonical name
|
||||
- `documents.path` → new target
|
||||
- Qdrant payload via `update_qdrant_payload(...)`:
|
||||
`download_url = generate_download_url(new_path, ...)`,
|
||||
`filename`, `original_filename` set on every point for that hash.
|
||||
5. `db.mark_organized(hash)` sets `organized_at` + cleans up
|
||||
`processing/{hash}/`.
|
||||
|
||||
### Download URL helper (`lib/utils.py:generate_download_url`)
|
||||
- If the path is already `http://` or `https://` (transcripts), return
|
||||
it unchanged.
|
||||
- Otherwise strip `library_root` prefix and prepend
|
||||
`book_server.base_url` (→ `https://files.echo6.co/<rel>`).
|
||||
|
||||
---
|
||||
|
||||
## 9. StatusDB (`lib/status.py`)
|
||||
|
||||
SQLite (`data/recon.db`) in WAL mode with thread-local connections
|
||||
(`_get_conn()` uses `threading.local`).
|
||||
|
||||
### Tables
|
||||
| Table | Purpose |
|
||||
|---|---|
|
||||
| `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size |
|
||||
| `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps |
|
||||
| `intel` | ARGUS intel feed entries (separate pipeline) |
|
||||
| `metrics_snapshots` | Time-series rollups for the dashboard |
|
||||
| `file_operations` | Audit log of Phase-5-style file moves and renames |
|
||||
| `duplicate_review` | Level-4 dedupe quarantine queue |
|
||||
|
||||
### Key methods
|
||||
- `add_to_catalogue(hash, title, url, size, source, category)`
|
||||
- `queue_document(hash)` — insert into `documents` with status=`queued`
|
||||
- `update_status(hash, status, **kwargs)` — single point of status truth
|
||||
- `mark_organized(hash)` — sets `organized_at`, final transition
|
||||
- `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` —
|
||||
used by filing worker and Phase 5a un-file
|
||||
- `get_path_updates` / `clear_path_update` — small change queue for
|
||||
backfills
|
||||
|
||||
### Connection safety
|
||||
All writers take a short-lived connection via `_get_conn()`. WAL mode
|
||||
allows concurrent readers; writes are serialized at the SQLite level.
|
||||
No explicit `BEGIN` — rely on autocommit semantics with occasional
|
||||
`conn.commit()` after grouped updates.
|
||||
|
||||
---
|
||||
|
||||
## 10. Configuration (`config.yaml`)
|
||||
|
||||
Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`,
|
||||
`PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in
|
||||
`config.yaml`, never in git.
|
||||
|
||||
### Top-level keys
|
||||
| Key | Meaning |
|
||||
|---|---|
|
||||
| `library_root` | `/mnt/library` — NFS mount root |
|
||||
| `processing` | Worker counts, window sizes, timeouts, retry policy |
|
||||
| `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense |
|
||||
| `sparse_embedding` | Separate service on cortex:8091 |
|
||||
| `vector_db` | Qdrant host, port, collection name |
|
||||
| `gemini` | Model (`gemini-2.0-flash`), JSON response mode |
|
||||
| `web` | Dashboard bind host + port (8420) |
|
||||
| `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` |
|
||||
| `book_server` | `base_url`, `strip_prefix` for download URL generation |
|
||||
| `upload_paths` | Category → filesystem path for upload routing |
|
||||
| `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` |
|
||||
| `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` |
|
||||
| `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` |
|
||||
| `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture |
|
||||
| `new_pipeline` | Stream-B (old) pipeline, `enabled: false` |
|
||||
|
||||
---
|
||||
|
||||
## 11. Service & Threads (`recon.py cmd_service`)
|
||||
|
||||
`systemctl start recon` → `python3 recon.py service`. The service runs
|
||||
seven daemon threads plus a metrics collector:
|
||||
|
||||
| Thread | Function | Interval |
|
||||
|---|---|---|
|
||||
| `dispatcher` | `dispatcher.dispatch_loop` | 30s |
|
||||
| `enrich` | `stage_loop('enrich', ...)` | 30s idle |
|
||||
| `embed` | `stage_loop('embed', ...)` | 30s idle |
|
||||
| `filing` | `filing.filing_worker_loop` | 30s |
|
||||
| `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s |
|
||||
| `progress` | Log status rollup line | 60s |
|
||||
| `dashboard` | `api.run_server` (Flask) | bound |
|
||||
|
||||
Plus `peertube_collector.start_collector` for metrics scrape.
|
||||
|
||||
All threads receive a shared `stop_event` (`threading.Event`) and exit
|
||||
cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`.
|
||||
|
||||
### CLI commands (`recon.py` top-level)
|
||||
`scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`,
|
||||
`catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`,
|
||||
`ingest-peertube`, `validate`, `rebuild`, `serve`, `service`,
|
||||
`organize`, `pipeline`.
|
||||
|
||||
Most commands are thin wrappers around library functions — useful for
|
||||
one-off maintenance from the CT 130 shell.
|
||||
|
||||
---
|
||||
|
||||
## 12. Dashboard & API (`lib/api.py`)
|
||||
|
||||
Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja
|
||||
templates; data is pulled via AJAX from `/api/*` endpoints.
|
||||
|
||||
### Page routes
|
||||
`/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`,
|
||||
`/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`.
|
||||
|
||||
### API surface (grouped)
|
||||
| Group | Endpoints |
|
||||
|---|---|
|
||||
| Upload | `POST /api/upload`, `GET /api/upload/<hash>/status`, `GET /api/upload/categories` |
|
||||
| Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl/<id>/status`, `GET /api/ingest-peertube/<job>/status` |
|
||||
| Search | `POST /api/search` |
|
||||
| Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` |
|
||||
| Retry | `POST /api/retry/<hash>`, `/api/retry-all` |
|
||||
| Service | `POST /api/service/restart` |
|
||||
| Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` |
|
||||
| Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` |
|
||||
| VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` |
|
||||
| PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}`, `/api/peertube/stats` |
|
||||
| Metrics | `GET /api/metrics/history` |
|
||||
|
||||
### Qdrant scroll
|
||||
`_qdrant_scroll(host, port, collection, req)` is the shared paged-read
|
||||
helper for rebuilding the knowledge-stats panel.
|
||||
|
||||
### Cache warmer
|
||||
`start_cache_warmer(stop_event)` pre-computes the expensive quick-stats
|
||||
and knowledge-stats panels so the dashboard loads instantly.
|
||||
|
||||
---
|
||||
|
||||
## 13. Filesystem Layout
|
||||
|
||||
```
|
||||
/opt/recon/
|
||||
├── recon.py # CLI + service entry point
|
||||
├── config.yaml
|
||||
├── .env # secrets (GEMINI_KEYS etc.)
|
||||
├── PROJECT-BIBLE.md # this file (copy on CT 130)
|
||||
├── backups/ # local DB backups
|
||||
├── data/
|
||||
│ ├── acquired/ # hopper — {hash}.ext + {hash}.meta.json
|
||||
│ │ ├── pdf/
|
||||
│ │ ├── stream/ # PeerTube transcripts
|
||||
│ │ ├── html/ # (future)
|
||||
│ │ └── text/
|
||||
│ ├── processing/{hash}/ # in-flight scratch
|
||||
│ │ ├── page_NNNN.txt
|
||||
│ │ ├── meta.json
|
||||
│ │ └── (original file or transcript.txt)
|
||||
│ ├── concepts/{hash}/
|
||||
│ │ └── window_N.json # Gemini enrichment output
|
||||
│ ├── intel/ # ARGUS intel feeds
|
||||
│ ├── _duplicates/ # level-4 name-match quarantine
|
||||
│ ├── _rejected/ # oversize / unreadable PDFs
|
||||
│ └── recon.db # SQLite WAL mode
|
||||
├── lib/
|
||||
│ ├── acquisition/peertube.py
|
||||
│ ├── processors/{pdf,transcript,text}_processor.py
|
||||
│ ├── dispatcher.py
|
||||
│ ├── filing.py
|
||||
│ ├── enricher.py
|
||||
│ ├── embedder.py
|
||||
│ ├── status.py # StatusDB class
|
||||
│ ├── api.py # Flask dashboard + API
|
||||
│ ├── new_pipeline.py # update_qdrant_payload helper lives here
|
||||
│ ├── utils.py # content_hash, generate_download_url, get_config, setup_logging
|
||||
│ ├── peertube_scraper.py # PeerTube API client
|
||||
│ └── organizer.py # determine_dominant_domain, level 1-4 naming
|
||||
└── logs/
|
||||
|
||||
/mnt/library/ # NFS from pi-nas, read-write
|
||||
├── <Domain>/<Subdomain>/<canonical_name>.<ext>
|
||||
└── _acquired/ _review/ _staging/ signal-archive/ # not touched by pipeline
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Refactor History (2026-04)
|
||||
|
||||
The refactor is tracked as dated phases under `phases/`. Status
|
||||
implementations are in the RECON repo; design lives here.
|
||||
|
||||
| Phase | Focus |
|
||||
|---|---|
|
||||
| 0 | Baseline capture — DB dumps, directory listings, config pin |
|
||||
| 1 | Scaffolding — create `acquired/`, `processing/`, config keys |
|
||||
| 2 | Shared filing function — extract organizer logic into `filing.py` |
|
||||
| 3 | Transcript processor — first end-to-end test of the new pattern |
|
||||
| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
|
||||
| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/<uuid>` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
|
||||
| 5b | Transcript unprocess — clean up stale rows and processing dirs |
|
||||
| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
|
||||
| 5c-2 | Service start & transcript drain — clear the hopper backlog |
|
||||
| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
|
||||
| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
|
||||
| 6c | Code cleanup — dead-code audit |
|
||||
| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
|
||||
| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
|
||||
|
||||
### Baseline pre-refactor (per `current-state.md`)
|
||||
- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
|
||||
- Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`.
|
||||
- `scan_library()` polled the NFS mount for new PDFs — now deprecated.
|
||||
|
||||
---
|
||||
|
||||
## 15. Operational Runbook
|
||||
|
||||
### Service control (on CT 130 as zvx)
|
||||
```bash
|
||||
sudo systemctl {status,start,stop,restart} recon
|
||||
journalctl -u recon -f
|
||||
tail -f /opt/recon/logs/recon.log
|
||||
```
|
||||
|
||||
### Backups
|
||||
```bash
|
||||
# Local DB backup before risky operations
|
||||
cp /opt/recon/data/recon.db /tmp/recon.db.bak.$(date +%s)
|
||||
# Contabo offsite (automatic): rsync every 6 hours, see recon-backup.timer
|
||||
```
|
||||
|
||||
### Inspect pipeline state at a glance
|
||||
```bash
|
||||
ls /opt/recon/data/acquired/*/ # hopper contents
|
||||
ls /opt/recon/data/processing/ | wc -l # in-flight count
|
||||
sqlite3 /opt/recon/data/recon.db \
|
||||
"SELECT status, COUNT(*) FROM documents GROUP BY status;"
|
||||
```
|
||||
|
||||
### Re-queue a failed document
|
||||
```bash
|
||||
sqlite3 /opt/recon/data/recon.db \
|
||||
"UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
|
||||
# or via API:
|
||||
curl -X POST https://recon.echo6.co/api/retry/<hash>
|
||||
```
|
||||
|
||||
### Manual ingest
|
||||
```bash
|
||||
# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
|
||||
sha=$(sha256sum foo.pdf | cut -d' ' -f1)
|
||||
cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
|
||||
```
|
||||
|
||||
### Qdrant health
|
||||
```bash
|
||||
curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
|
||||
| jq '.result | {status, points_count, optimizer_status}'
|
||||
# status "grey" with optimizer_status.ok=true is healthy (background indexing).
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 16. Known Gotchas
|
||||
|
||||
- **Logger setup.** RECON modules must use `setup_logging('recon.<name>')`
|
||||
from `lib.utils`, never raw `logging.getLogger()`. The root logger
|
||||
has no handlers; calls to a raw logger silently disappear.
|
||||
- **Qdrant status "grey" is healthy** if `optimizer_status.ok == true`.
|
||||
Only treat red + not-ok as a real failure.
|
||||
- **Catalogue row count can grow during long-running jobs** because
|
||||
parallel ingestion may add rows. Only a *decrease* is a real
|
||||
integrity failure.
|
||||
- **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include
|
||||
`.tmp`, so active acquisition writes are invisible to the dispatcher
|
||||
until the atomic rename lands.
|
||||
- **Transcripts are filed in-place.** Their `documents.path` is a URL
|
||||
and filing worker's `path LIKE '/opt/recon/data/processing/%'`
|
||||
filter excludes them.
|
||||
- **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption
|
||||
API calls or you'll get throttled.
|
||||
- **SSH heredocs with Python code break.** When editing remote files,
|
||||
write to a temp file via `scp` or `cat > file` rather than bash
|
||||
heredocs with parens/quotes.
|
||||
- **The crawler is off.** `crawler.sites: []`. Re-enabling requires a
|
||||
re-architecture for the new pipeline.
|
||||
|
||||
---
|
||||
|
||||
## 17. Credentials & Hosts
|
||||
|
||||
| Host | Role | Access |
|
||||
|---|---|---|
|
||||
| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) |
|
||||
| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` |
|
||||
| CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` |
|
||||
| pi-nas (192.168.1.245) | NFS server for `/mnt/library` | `ssh zvx@pi-nas` |
|
||||
| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` |
|
||||
|
||||
Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine).
|
||||
RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
|
||||
|
||||
---
|
||||
|
||||
## 18. Open Follow-ups
|
||||
|
||||
- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
|
||||
library paths post Phase 5a (audit trail at
|
||||
`/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
|
||||
tombstone.
|
||||
- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
|
||||
in config but not implemented. Next-up for Kiwix / web ingest.
|
||||
- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
|
||||
is a valuable target list but the old crawler is off pending a new
|
||||
acquisition-module-shaped implementation.
|
||||
- **ARGUS intel pipeline** shares the DB but its lifecycle is
|
||||
documented separately — not covered here.
|
||||
- **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and
|
||||
needs a redesign before reinstating.
|
||||
- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
|
||||
yet; items pile up silently.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*
|
||||
Loading…
Add table
Add a link
Reference in a new issue