refactored-recon/PROJECT-BIBLE.md

# RECON — Project Bible

Canonical architectural reference for **RECON** (the knowledge extraction
pipeline running on CT 130 / `data.echo6` / `100.64.0.24`). This document
is the orientation dossier for any future session. It is skim-and-find,
not a tutorial.

- **Repo:** `ssh://git@forge.echo6.co:2222/matt/refactored-recon.git` (design)
- **Code:** `/opt/recon/` on CT 130 (zvx owns the tree; service runs as zvx)
- **Service:** `systemctl status recon`
- **Dashboard:** `https://recon.echo6.co` (zvx-only via Authentik)
- **Files server:** `https://files.echo6.co` (Authentik forward auth)

---

## 1. Mission

RECON ingests documents from multiple sources (manual PDF uploads,
PeerTube auto-captioned transcripts, future Kiwix/HTML/RSS feeds) and
produces a **searchable, domain-organized library** plus a hybrid
dense/sparse vector index in Qdrant on cortex.

Every piece of content ends up in two places:

1. A file under `/mnt/library/<Domain>/<Subdomain>/<canonical_name>.<ext>`
   (PDFs, HTML) **or** at a source URL like `https://stream.echo6.co/w/<uuid>`
   (PeerTube transcripts; no local copy since the Phase 5a un-file in Phase 6k).
2. Page-level embeddings in Qdrant collection `recon_knowledge_hybrid`
   (dense `bge-m3` + sparse SPLADE-style vectors, 1024-dim dense).

Search returns page-grounded citations back to the file or stream URL.

---

## 2. System Topology

```
                        ┌─────────────────────────┐
                        │  CT 130 (recon)         │
  Library (LXC bind)    │  /opt/recon/            │     ┌──────────────┐
  data /mnt/data/library│   ├─ data/              │     │ Qdrant       │
  → /mnt/library/       │   │   ├─ acquired/      │     │ cortex:6333  │
  (r/w, local SSD)      │   │   ├─ processing/    │ ←→  │ recon_knowledge_hybrid
                        │   │   ├─ concepts/      │     │ (1024-d dense + sparse)
                        │   │   └─ recon.db       │     └──────────────┘
                        │   ├─ lib/               │
                        │   ├─ recon.py           │     ┌──────────────┐
                        │   └─ config.yaml        │ ←→  │ TEI          │
                        │  recon.service          │     │ cortex:8090  │
                        │  nginx :8888 (files)    │     │ bge-m3 dense │
                        └─────────────────────────┘     └──────────────┘
                                   ▲                    ┌──────────────┐
                                   │                    │ Sparse svc   │
           ┌───────────────────────┴─────┐         ←→   │ cortex:8091  │
           │                             │              │ bge-m3 sparse│
 PeerTube (CT 110 / stream.echo6.co)    Gemini API      └──────────────┘
 api_base: http://192.168.1.170          (enrichment,
                                          vision OCR)
```

A shared Caddy reverse proxy (CT 101) exposes the dashboard (port 8420)
and the nginx file server (port 8888) as `recon.echo6.co` and
`files.echo6.co`, respectively.

---

## 3. Pipeline Lifecycle

Every document follows the same five-stage arc regardless of source type.
The filesystem location at any given moment tells you which stage the
item is in — **state is a directory.**

```
  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │ 1. ACQUIRE   │ → │ 2. DISPATCH  │ → │ 3. PROCESS   │ → │ 4. ENRICH /  │ → │ 5. FILE      │
  │              │   │ (pre_flight) │   │              │   │    EMBED     │   │              │
  │ data/acquired│   │ dispatcher.py│   │ per-type     │   │ shared       │   │ shared       │
  │ /<type>/     │   │ watches      │   │ processor    │   │ stage loops  │   │ filing worker│
  │ <hash>.{ext} │   │ subfolders,  │   │ moves file   │   │ bge-m3 →     │   │ moves file   │
  │ <hash>.meta  │   │ hands to     │   │ to processing│   │ Qdrant       │   │ processing → │
  │              │   │ processor    │   │ /{hash}/     │   │              │   │ library,     │
  │              │   │              │   │              │   │              │   │ updates DB + │
  │              │   │              │   │              │   │              │   │ Qdrant       │
  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```

**Status column on documents:**
`catalogued → queued → extracting → extracted → enriching → enriched →
embedding → complete` (plus terminal `error`, `content_failure`,
`duplicate` states).

`organized_at IS NULL` while in flight, set to `CURRENT_TIMESTAMP` after
filing. Transcripts are marked organized in-place during pre_flight
(they have no filesystem target — the watch URL is their "home").

---

## 4. Acquisition Layer (`lib/acquisition/`)

Acquisition modules fetch content from external sources and drop
`{hash}.<ext>` + `{hash}.meta.json` **flat file pairs** into
`data/acquired/<type>/`. They do **not** touch the database — that's the
processor's job.

### Atomic drop protocol
1. Compute the content hash and write the content to `<hash>.<ext>.tmp`
   (`.tmp` is not a content extension, so the dispatcher ignores it).
2. Write `<hash>.meta.json.tmp`, then rename it to `<hash>.meta.json`.
3. Rename the content file from `<hash>.<ext>.tmp` to its final `<hash>.<ext>`.
4. **Meta goes final first, content goes final last.** The dispatcher
   treats a content file without meta as a valid pair, so the content
   must not become visible until its meta is in place; otherwise the
   item could be dispatched without its sidecar.
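The meta-first, content-last ordering can be sketched as follows. This is a minimal illustration, assuming the hash is the SHA-256 of the content bytes and that renames are done with `os.replace`; `atomic_drop` and its parameters are hypothetical names, not the acquisition module's actual API.

```python
import hashlib
import json
import os
from pathlib import Path


def atomic_drop(dest_dir: Path, content: bytes, ext: str, meta: dict) -> Path:
    """Drop a content/meta pair so the dispatcher never sees a partial file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()

    final_content = dest_dir / f"{digest}{ext}"
    final_meta = dest_dir / f"{digest}.meta.json"
    tmp_content = dest_dir / f"{digest}{ext}.tmp"
    tmp_meta = dest_dir / f"{digest}.meta.json.tmp"

    # Stage both files under .tmp names, which the dispatcher ignores.
    tmp_content.write_bytes(content)
    tmp_meta.write_text(json.dumps(meta), encoding="utf-8")

    # Meta becomes visible first, content last: by the time the content
    # file appears, its sidecar is already in place.
    os.replace(tmp_meta, final_meta)
    os.replace(tmp_content, final_content)
    return final_content
```

`os.replace` is atomic when the temporary and final names live in the same directory, which is why both `.tmp` files are staged inside the destination folder rather than in `/tmp`.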

### PeerTube acquisition (`lib/acquisition/peertube.py`)
- Daemon loop `acquisition_loop(stop_event, db, config, interval=1800)`.
- Queries catalogue for `source='stream.echo6.co'` rows, builds sets of
  known UUIDs (`/w/<uuid>` extracted from `path`) **and** known titles
  (from `filename`) — both cohorts are checked so Phase 5b-rewritten
  rows and pre-5b library-path rows dedupe correctly.
- Lists PeerTube videos via `peertube_scraper.get_videos`, filters to
  those with captions, prefers English caption.
- For each new one: fetches VTT, converts to text with `vtt_to_text`,
  atomically drops pair into `data/acquired/stream/`.
- Rate limits at `peertube.rate_limit_delay` (default 0.5s) —
  **PeerTube returns 429 if captions are fetched too fast.**

### Manual uploads / URL ingest
`api.py` exposes `/api/upload`, `/api/ingest-url`, `/api/ingest-urls`,
`/api/ingest-peertube` — all end by dropping a pair into `acquired/<type>/`.

---

## 5. Dispatcher (`lib/dispatcher.py`)

The dispatcher is one daemon thread (`dispatch_loop`, interval=30s). It
watches each configured subfolder under `data/acquired/` and hands
stable file pairs to the registered processor.

### Config-driven dispatch table
```yaml
pipeline:
  acquired_root: /opt/recon/data/acquired
  processing_root: /opt/recon/data/processing
  dispatch:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor        # not yet implemented
    text: text_processor
  mtime_stability_seconds: 10
```
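As a rough sketch of how such a table can be resolved at runtime, the snippet below maps a hopper subfolder to a `pre_flight` callable. The `PROCESSORS` registry, the `register` decorator, and `resolve_processor` are assumptions made for illustration; the real dispatcher may wire processors differently.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical registry: processor name -> pre_flight callable.
PROCESSORS: Dict[str, Callable[..., dict]] = {}


def register(name: str):
    """Decorator that records a pre_flight function under a processor name."""
    def wrap(fn):
        PROCESSORS[name] = fn
        return fn
    return wrap


@register("pdf_processor")
def pdf_pre_flight(content_path, meta_path, db, config) -> dict:
    # Placeholder body; a real processor does hashing, dedupe, and DB writes.
    return {
        "hash": Path(content_path).stem,
        "action": "extracted",
        "source_path": str(content_path),
        "error": None,
    }


def resolve_processor(config: dict, subfolder: str) -> Optional[Callable[..., dict]]:
    """Return the callable for a hopper subfolder, or None if unavailable."""
    name = config["pipeline"]["dispatch"].get(subfolder)
    return PROCESSORS.get(name) if name else None
```

With this shape, a subfolder whose processor is configured but not yet registered (such as `html` above) resolves to `None` and can simply be skipped on each cycle.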

### Extension constants
- `CONTENT_EXTENSIONS = {'.txt', '.vtt', '.html', '.pdf'}` — the
  dispatcher considers a file "content" only if its extension is in
  this set. **`.tmp` is not in the set**, so partial writes are safe.
- `CONVERTIBLE_EXTENSIONS = {'.epub', '.mobi', '.doc', '.docx'}` —
  these are normalized to PDF **before** dispatch.

### Normalization step
`_normalize_formats(subfolder_path)`:
- `.epub` / `.mobi` → PDF via `ebook-convert` (Calibre CLI).
- `.doc` / `.docx` → PDF via `libreoffice --headless`.
- Sidecar `.meta.json` is renamed to match the new PDF hash so pairing
  holds.

### Pair finding
`_find_pairs(subfolder_path)` returns tuples of (content_path,
meta_path_or_None). Pairs where only content exists are still valid —
meta is not required. A meta without its content is ignored.
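The pairing rules above can be sketched like this. It is an illustrative reimplementation, assuming the sidecar is named `<stem>.meta.json` next to the content file; `find_pairs` here is a stand-in and may differ from the real `_find_pairs`.

```python
from pathlib import Path
from typing import List, Optional, Tuple

CONTENT_EXTENSIONS = {".txt", ".vtt", ".html", ".pdf"}


def find_pairs(subfolder: Path) -> List[Tuple[Path, Optional[Path]]]:
    """Return (content_path, meta_path_or_None) for each content file."""
    pairs = []
    for path in sorted(subfolder.iterdir()):
        # Skips .tmp staging files and .meta.json sidecars, whose final
        # suffixes (.tmp, .json) are not content extensions.
        if not path.is_file() or path.suffix.lower() not in CONTENT_EXTENSIONS:
            continue
        meta = path.with_name(path.stem + ".meta.json")
        pairs.append((path, meta if meta.exists() else None))
    return pairs
```

Because the loop is driven by content files, an orphaned `.meta.json` never produces a pair, which matches the "meta without content is ignored" rule.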

### Stability check
`_is_stable(filepath, stability_seconds)` — mtime must be at least
`mtime_stability_seconds` old (default 10s) before dispatch. Prevents
racing active writers.
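A minimal sketch of the mtime check, assuming it compares the file's modification time against the wall clock. The `now` parameter is an addition for testability and is not part of the documented signature.

```python
import os
import time
from typing import Optional


def is_stable(filepath: str, stability_seconds: float = 10.0,
              now: Optional[float] = None) -> bool:
    """True if the file has not been modified for at least stability_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(filepath)
    except OSError:
        # File vanished between listing and stat; treat as not ready.
        return False
    return (now - mtime) >= stability_seconds
```

A writer that is still appending keeps refreshing the mtime, so the file only becomes eligible once it has been quiet for the full window.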

---

## 6. Processors (`lib/processors/`)

Each processor implements **one function**: `pre_flight(content_path,
meta_path, db, config) → dict`. It owns all the type-specific logic and
**all the database writes** for that item up to status=`extracted`.

### Common pre_flight contract
Every processor does, in order:
1. Hash content (SHA-256 via `content_hash()` in `lib/utils.py`).
2. Stale state cleanup: `rm -rf processing/{hash}/` and
   `concepts/{hash}/` if they exist (guards against re-runs).
3. Hash dedupe: if `hash` already exists in `catalogue`, delete the
   pair, return action `duplicate`.
4. Type-specific metadata extraction + level-4 dedupe check (PDF only).
5. Move content + meta into `processing/{hash}/` with a type-specific
   layout.
6. `db.add_to_catalogue`, `db.queue_document`, set `documents.text_dir`
   and `page_count`, `db.update_status(hash, 'extracted', ...)`.

Return dict keys: `hash`, `action`, `source_path`, `error`. Actions are
one of `extracted`, `duplicate`, `level4_duplicate` (PDF only),
`content_failure`, `error`, or `skip_empty` (text only).

### `pdf_processor.py`
The heaviest processor. Layered metadata extraction:
- **Source A — PDF dict:** `PdfReader(...).metadata`, mapped to
  `{title, author, edition, year}`.
- **Source B — Filename:** regex parse the original filename.
- **Source C — Gemini Vision OCR** on first 3 pages when A+B
  disagree or are missing. Returns structured JSON via Gemini's
  `response_mime_type: application/json`.
- **Voting:** `_vote_metadata(A, B, C)` reconciles the three sources;
  2-of-3 wins; ties prefer Source A.
- **Level-4 dedupe:** if all four fields (`title, edition, author,
  year`) are present and match an existing catalogue row with a
  different hash, the PDF is quarantined to `_duplicates/` for human
  review.
- **Size cap:** `processing.max_pdf_size_mb` (default 2000MB). Oversize
  PDFs move to `_rejected/`.
- **Text extraction order:** PyPDF2 → `pdftotext` (poppler) → Tesseract
  OCR → Gemini Vision on a per-page basis. Output:
  `processing/{hash}/page_NNNN.txt`.
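The 2-of-3 vote with ties preferring Source A could look roughly like the sketch below. It assumes values are compared case-insensitively and that empty strings and the literal `"null"` count as missing; `vote_metadata` and `_clean` are illustrative stand-ins, not the actual `_vote_metadata` implementation.

```python
from collections import Counter
from typing import Dict, Optional

FIELDS = ("title", "author", "edition", "year")


def _clean(value) -> Optional[str]:
    """Normalize a raw value; treat None, '' and the string 'null' as missing."""
    if value is None:
        return None
    text = str(value).strip()
    if not text or text.lower() == "null":
        return None
    return text


def vote_metadata(a: dict, b: dict, c: dict) -> Dict[str, Optional[str]]:
    """Reconcile three metadata sources field by field."""
    result: Dict[str, Optional[str]] = {}
    for field in FIELDS:
        present = [v for v in (_clean(s.get(field)) for s in (a, b, c)) if v]
        if not present:
            result[field] = None
            continue
        counts = Counter(v.casefold() for v in present)
        key, votes = counts.most_common(1)[0]
        if votes >= 2:
            # Majority: keep the spelling from the earliest agreeing source.
            result[field] = next(v for v in present if v.casefold() == key)
        else:
            # No majority: fall back in source order (A, then B, then C).
            result[field] = present[0]
    return result
```

When Source A is missing a field and B and C disagree, this sketch falls back to B; whether the real code does the same is an open assumption.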

### `transcript_processor.py`
Lightweight. The VTT→text conversion already happened in acquisition,
so pre_flight just:
- Hashes the `<hash>.txt` file.
- Reads the `meta.json` sidecar.
- `chunk_text(raw_text, WORDS_PER_PAGE=2000)` splits into
  `page_NNNN.txt` files.
- Writes the transcript as `processing/{hash}/transcript.txt` plus page
  chunks.
- Registers with category `Transcript`, source `stream.echo6.co`.
- Sets `text_dir`, `page_count`, and **`organized_at = CURRENT_TIMESTAMP`
  immediately** — transcripts are filed-in-place (their "location" is
  the PeerTube watch URL, set later as the catalogue `path` via Phase
  5a).
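For orientation, word-count paging of a transcript can be sketched as below. This assumes a plain whitespace split and 1-based page numbering; the real `chunk_text` may treat sentence boundaries or numbering differently, and `page_filename` is a hypothetical helper.

```python
from typing import List

WORDS_PER_PAGE = 2000


def chunk_text(raw_text: str, words_per_page: int = WORDS_PER_PAGE) -> List[str]:
    """Split text into pseudo-pages of at most words_per_page words each."""
    words = raw_text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_filename(index: int) -> str:
    """Zero-padded page file name, e.g. index 1 gives page_0001.txt."""
    return f"page_{index:04d}.txt"
```

The resulting page count is what ends up in `documents.page_count`, so an empty transcript yields zero pages under this sketch.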

### `text_processor.py`
Raw `.txt` files dropped via manual upload. Two-source metadata vote
(filename + meta.json). Similar flow to transcript processor but no
fixed category or source.

---

## 7. Enrichment & Embedding

Both are **source-agnostic stage loops** that just poll documents by
status and do their work. They live in `lib/enricher.py` and
`lib/embedder.py`, wrapped by `stage_loop(stage, ...)` in `recon.py`.

### Enrichment (`enrich_workers: 16` threads per batch)
- Polls `status = 'extracted' AND retries < max_retries`.
- Sets `enriching`, reads `processing/{hash}/page_NNNN.txt`.
- Windows pages (`enrich_window_size: 5` per window) and sends each
  window to Gemini with a structured prompt.
- Stores `concepts/{hash}/window_N.json` per window.
- Backoff: `enrich_base_delay=5s`, doubling up to
  `enrich_max_delay=120s`, max `enrich_max_retries=5`.
- On success: `update_status(hash, 'enriched')`.
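The windowing and backoff settings above imply a schedule like the one sketched here. It assumes non-overlapping windows and a pure doubling delay with no jitter; `windows` and `backoff_delay` are hypothetical helper names.

```python
from typing import List, Sequence


def windows(pages: Sequence, size: int = 5) -> List[Sequence]:
    """Group pages into consecutive, non-overlapping windows of `size`."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based), doubling up to cap."""
    return min(base * (2 ** attempt), cap)
```

Under these assumptions the delays run 5, 10, 20, 40, 80 seconds and then hold at the 120-second cap.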

### Embedding (`embed_workers: 4`)
- Polls `status = 'enriched'`.
- Reads concept JSONs, builds page-level chunks.
- Dense: POST to TEI at `cortex:8090` (`bge-m3`, 1024-d). Batches of
  128 per TEI request. Throughput ~1,711 emb/sec.
- Sparse: POST to the sparse service at `cortex:8091` (bge-m3 sparse
  mode; `sparse_embedding.enabled: true`).
- Upserts into Qdrant `cortex:6333`, collection `recon_knowledge_hybrid`,
  batch size `embed_batch_size=500` vectors per upsert.
- Payload carries: `hash`, `filename`, `original_filename`,
  `download_url`, `page`, `text`, `title`, `domain`, `subdomain`,
  `category`.
- Ollama is a fallback backend (much slower, ~8 emb/sec) via
  `embedding.backend: ollama`.
- On success: `update_status(hash, 'complete')`.
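The two batch sizes above (128 inputs per TEI request, 500 vectors per Qdrant upsert) amount to chunking the same stream at different granularities. A small sketch of that chunking, with `chunks` as a hypothetical helper and no network calls:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunks(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Illustrative numbers only: a document with 1,200 page-level chunks.
page_texts = [f"page {n}" for n in range(1200)]
tei_batches = list(chunks(page_texts, 128))      # one TEI request each
upsert_batches = list(chunks(page_texts, 500))   # one Qdrant upsert each
```

Keeping the TEI batch smaller than the upsert batch means several embedding responses are accumulated before each write to Qdrant.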

---

## 8. Filing (`lib/filing.py`)

One daemon thread, `filing_worker_loop(interval=30)`. It polls:

```sql
SELECT hash FROM documents
 WHERE status = 'complete'
   AND organized_at IS NULL
   AND path LIKE '/opt/recon/data/processing/%'
 LIMIT 50
```

The `path LIKE '/opt/recon/data/processing/%'` filter naturally
**excludes transcripts** — their `documents.path` was never a
filesystem path but the PeerTube watch URL.

For each row, `file_processed_item(doc_hash, source_file_path, db,
config)` does:
1. `determine_dominant_domain(hash)` reads concept JSONs, returns the
   top-voted `Domain/Subdomain`.
2. `_build_target_path(...)` derives the canonical name starting at
   level 1 (`Title`), escalating to level 2/3/4 only if a collision
   exists in the target folder. **Preserves source file's actual
   extension** (not hardcoded to `.pdf`).
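3. `shutil.move(source, target)`. Target is
   `/mnt/library/<Domain>/<Subdomain>/<canonical>.<ext>`. The move is
   atomic only when source and target share a filesystem; across
   filesystems `shutil.move` copies the file and then deletes the source.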
4. Updates:
   - `catalogue.path` → new target
   - `catalogue.filename` → new canonical name
   - `documents.path` → new target
   - Qdrant payload via `update_qdrant_payload(...)`:
     `download_url = generate_download_url(new_path, ...)`,
     `filename`, `original_filename` set on every point for that hash.
5. `db.mark_organized(hash)` sets `organized_at` + cleans up
   `processing/{hash}/`.

### Download URL helper (`lib/utils.py:generate_download_url`)
- If the path is already `http://` or `https://` (transcripts), return
  it unchanged.
- Otherwise strip `library_root` prefix and prepend
  `book_server.base_url` (→ `https://files.echo6.co/<rel>`).
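The two branches of the helper can be sketched as follows. This is an illustration only: `download_url_for` is a hypothetical name, the default arguments stand in for the config lookups, the percent-encoding of the relative path is an assumption, and the example domain folder is invented.

```python
from urllib.parse import quote


def download_url_for(path: str,
                     library_root: str = "/mnt/library",
                     base_url: str = "https://files.echo6.co") -> str:
    """Map a catalogue path to a download URL."""
    # Transcripts already carry their watch URL as the path.
    if path.startswith(("http://", "https://")):
        return path
    root = library_root.rstrip("/") + "/"
    rel = path[len(root):] if path.startswith(root) else path.lstrip("/")
    # Percent-encode spaces and other unsafe characters, keeping slashes.
    return base_url.rstrip("/") + "/" + quote(rel)


print(download_url_for("/mnt/library/Medicine/First-Aid/Field Guide.pdf"))
```

Matching on `library_root` plus a trailing slash avoids accidentally stripping a sibling directory such as `/mnt/library2/`.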

---

## 9. StatusDB (`lib/status.py`)

SQLite (`data/recon.db`) in WAL mode with thread-local connections
(`_get_conn()` uses `threading.local`).

### Tables
| Table | Purpose |
|---|---|
| `catalogue` | Canonical record keyed by `hash` — title, filename, path, source, category, size |
| `documents` | Pipeline state machine — status, path, text_dir, page_count, retries, organized_at, timestamps |
| `intel` | ARGUS intel feed entries (separate pipeline) |
| `metrics_snapshots` | Time-series rollups for the dashboard |
| `file_operations` | Audit log of Phase-5-style file moves and renames |
| `duplicate_review` | Level-4 dedupe quarantine queue |

### Key methods
- `add_to_catalogue(hash, title, url, size, source, category)`
- `queue_document(hash)` — insert into `documents` with status=`queued`
- `update_status(hash, status, **kwargs)` — single point of status truth
- `mark_organized(hash)` — sets `organized_at`, final transition
- `sync_document_path(hash, new_path)` + `update_catalogue_path(...)` —
  used by filing worker and Phase 5a un-file
- `get_path_updates` / `clear_path_update` — small change queue for
  backfills

### Connection safety
All writers use their thread's connection from `_get_conn()`. WAL mode
allows concurrent readers; writes are serialized at the SQLite level.
No explicit `BEGIN` — rely on autocommit semantics with occasional
`conn.commit()` after grouped updates.
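A thread-local connection in WAL mode can be sketched as below. The class name, the 30-second busy timeout, and the `Row` factory are assumptions for illustration; only `_get_conn()` and the use of `threading.local` come from the description above.

```python
import sqlite3
import threading


class StatusDBSketch:
    """Illustrative stand-in for the connection handling in StatusDB."""

    def __init__(self, db_path: str):
        self._db_path = db_path
        self._local = threading.local()

    def _get_conn(self) -> sqlite3.Connection:
        """Return this thread's connection, creating it on first use."""
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # timeout lets a writer wait on a busy database instead of failing.
            conn = sqlite3.connect(self._db_path, timeout=30)
            conn.execute("PRAGMA journal_mode=WAL")
            conn.row_factory = sqlite3.Row
            self._local.conn = conn
        return conn
```

Each worker thread therefore reuses one connection for its lifetime, and no connection object is ever shared between threads.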

---

## 10. Configuration (`config.yaml`)

Lives at `/opt/recon/config.yaml`. Secrets (`GEMINI_KEYS`,
`PEERTUBE_TOKEN`, etc.) live in `/opt/recon/.env` — never in
`config.yaml`, never in git.

### Top-level keys
| Key | Meaning |
|---|---|
| `library_root` | `/mnt/library` — LXC bind-mount root (data host `/mnt/data/library`, local SSD) |
| `processing` | Worker counts, window sizes, timeouts, retry policy |
| `embedding` | TEI host/port, model (`bge-m3`), 1024-d dense |
| `sparse_embedding` | Separate service on cortex:8091 |
| `vector_db` | Qdrant host, port, collection name |
| `gemini` | Model (`gemini-2.0-flash`), JSON response mode |
| `web` | Dashboard bind host + port (8420) |
| `paths` | `base`, `data`, `text`, `concepts`, `intel`, `logs`, `db` |
| `book_server` | `base_url`, `strip_prefix` for download URL generation |
| `upload_paths` | Category → filesystem path for upload routing |
| `service` | `scan_interval`, `stage_poll_interval`, `progress_interval` |
| `peertube` | `api_base`, `public_url`, `rate_limit_delay`, `poll_interval` |
| `pipeline` | `acquired_root`, `processing_root`, `dispatch` table, `mtime_stability_seconds` |
| `crawler` / `web_scraper` | Currently disabled (`sites: []`) pending re-architecture |
| `new_pipeline` | Stream-B (old) pipeline, `enabled: false` |

---

## 11. Service & Threads (`recon.py cmd_service`)

`systemctl start recon` → `python3 recon.py service`. The service runs
seven daemon threads plus a metrics collector:

| Thread | Function | Interval |
|---|---|---|
| `dispatcher` | `dispatcher.dispatch_loop` | 30s |
| `enrich` | `stage_loop('enrich', ...)` | 30s idle |
| `embed` | `stage_loop('embed', ...)` | 30s idle |
| `filing` | `filing.filing_worker_loop` | 30s |
| `peertube-acq` | `acquisition.peertube.acquisition_loop` | 1800s |
| `progress` | Log status rollup line | 60s |
| `dashboard` | `api.run_server` (Flask) | bound |

Plus `peertube_collector.start_collector` for metrics scrape.

All threads receive a shared `stop_event` (`threading.Event`) and exit
cleanly on SIGTERM via `signal.signal(SIGTERM, lambda *_: stop_event.set())`.
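The shared `stop_event` pattern can be sketched as follows. `run_loop` and `start_daemon` are hypothetical helpers; the point is that each loop sleeps with `stop_event.wait(interval)` rather than `time.sleep`, so a shutdown request interrupts the idle period immediately.

```python
import threading
from typing import Callable


def run_loop(stop_event: threading.Event, work: Callable[[], None],
             interval: float) -> None:
    """Call work() repeatedly until stop_event is set."""
    while not stop_event.is_set():
        work()
        # Returns early as soon as stop_event is set.
        stop_event.wait(interval)


def start_daemon(name: str, stop_event: threading.Event,
                 work: Callable[[], None], interval: float) -> threading.Thread:
    """Start a named daemon thread running run_loop."""
    thread = threading.Thread(
        target=run_loop, args=(stop_event, work, interval),
        name=name, daemon=True,
    )
    thread.start()
    return thread
```

With a 1800-second acquisition interval, this is what keeps `systemctl stop recon` from waiting up to half an hour for the PeerTube loop to wake up.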

### CLI commands (`recon.py` top-level)
`scan`, `queue`, `extract`, `enrich`, `embed`, `run`, `status`,
`catalogue`, `failures`, `search`, `upload`, `ingest-url`, `ingest`,
`ingest-peertube`, `validate`, `rebuild`, `serve`, `service`,
`organize`, `pipeline`.

Most commands are thin wrappers around library functions — useful for
one-off maintenance from the CT 130 shell.

---

## 12. Dashboard & API (`lib/api.py`)

Flask app bound to `0.0.0.0:8420`. Pages are server-rendered Jinja
templates; data is pulled via AJAX from `/api/*` endpoints.

### Page routes
`/`, `/search`, `/catalogue`, `/upload`, `/web-ingest`, `/failures`,
`/peertube`, `/peertube/channels`, `/settings/{keys,cookies,vpn,health}`.

### API surface (grouped)
| Group | Endpoints |
|---|---|
| Upload | `POST /api/upload`, `GET /api/upload/<hash>/status`, `GET /api/upload/categories` |
| Ingest | `POST /api/ingest-url`, `/api/ingest-urls`, `/api/ingest`, `/api/ingest-peertube`, `/api/crawl`, `GET /api/crawl/<id>/status`, `GET /api/ingest-peertube/<job>/status` |
| Search | `POST /api/search` |
| Status | `GET /api/status`, `/api/quick-stats`, `/api/knowledge-stats`, `/api/health` |
| Retry | `POST /api/retry/<hash>`, `/api/retry-all` |
| Service | `POST /api/service/restart` |
| Keys | Full CRUD on `/api/keys`, `/api/keys/validate`, `/api/keys/reload` |
| Cookies | `GET /api/cookies/status`, `POST /api/cookies/upload` |
| VPN | `GET /api/vpn/status`, `POST /api/vpn/{connect,disconnect,rotate,login}` |
| PeerTube | `/api/peertube/{dashboard,channels,channels/stats,channels/add,channels/<actor>}`, `/api/peertube/stats` |
| Metrics | `GET /api/metrics/history` |

### Qdrant scroll
`_qdrant_scroll(host, port, collection, req)` is the shared paged-read
helper for rebuilding the knowledge-stats panel.

### Cache warmer
`start_cache_warmer(stop_event)` pre-computes the expensive quick-stats
and knowledge-stats panels so the dashboard loads instantly.

---

## 13. Filesystem Layout

```
/opt/recon/
├── recon.py                    # CLI + service entry point
├── config.yaml
├── .env                        # secrets (GEMINI_KEYS etc.)
├── PROJECT-BIBLE.md            # this file (copy on CT 130)
├── backups/                    # local DB backups
├── data/
│   ├── acquired/               # hopper — {hash}.ext + {hash}.meta.json
│   │   ├── pdf/
│   │   ├── stream/             # PeerTube transcripts
│   │   ├── html/               # (future)
│   │   └── text/
│   ├── processing/{hash}/      # in-flight scratch
│   │   ├── page_NNNN.txt
│   │   ├── meta.json
│   │   └── (original file or transcript.txt)
│   ├── concepts/{hash}/
│   │   └── window_N.json       # Gemini enrichment output
│   ├── intel/                  # ARGUS intel feeds
│   ├── _duplicates/            # level-4 name-match quarantine
│   ├── _rejected/              # oversize / unreadable PDFs
│   └── recon.db                # SQLite WAL mode
├── lib/
│   ├── acquisition/peertube.py
│   ├── processors/{pdf,transcript,text}_processor.py
│   ├── dispatcher.py
│   ├── filing.py
│   ├── enricher.py
│   ├── embedder.py
│   ├── status.py               # StatusDB class
│   ├── api.py                  # Flask dashboard + API
│   ├── new_pipeline.py         # update_qdrant_payload helper lives here
│   ├── utils.py                # content_hash, generate_download_url, get_config, setup_logging
│   ├── peertube_scraper.py     # PeerTube API client
│   └── organizer.py            # determine_dominant_domain, level 1-4 naming
└── logs/

/mnt/library/                   # LXC bind-mount from data host /mnt/data/library (local SSD), read-write
├── <Domain>/<Subdomain>/<canonical_name>.<ext>
└── _acquired/ _review/ _staging/ signal-archive/  # not touched by pipeline
```

---

## 14. Refactor History (2026-04)

The refactor is tracked as dated phases under `phases/`. Implementations
live in the RECON code tree; the design lives in this repo.

| Phase | Focus |
|---|---|
| 0 | Baseline capture — DB dumps, directory listings, config pin |
| 1 | Scaffolding — create `acquired/`, `processing/`, config keys |
| 2 | Shared filing function — extract organizer logic into `filing.py` |
| 3 | Transcript processor — first end-to-end test of the new pattern |
| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
| 5c-2 | Service start & transcript drain — clear the hopper backlog |
| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
| 6b | Dashboard "Untitled / WEB" bug fix — recently-completed table query |
| 6c | Code cleanup — dead-code audit |
| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |

### Baseline pre-refactor (per `current-state.md`)
- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
- Old stream-B `new_pipeline` ran off `/mnt/library/_acquired/`.
- `scan_library()` polled the NFS mount for new PDFs — now deprecated.
- *Storage migration note:* `/mnt/library` was historically an NFS
  mount from `pi-nas:/export/library`, which is what `current-state.md`
  and `scan_library()` were written against. The library has since
  been migrated to local SSD on the data Proxmox host
  (`/mnt/data/library`) and surfaced into CT 130 via an LXC
  bind-mount. The pi-nas copy was wiped on 2026-04-15. Path strings
  inside the codebase didn't change; only the underlying storage did.

---

## 15. Operational Runbook

### Service control (on CT 130 as zvx)
```bash
sudo systemctl {status,start,stop,restart} recon
journalctl -u recon -f
tail -f /opt/recon/logs/recon.log
```

### Backups
```bash
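# Local DB backup before risky operations. Use SQLite's online backup
# rather than cp: in WAL mode a plain copy of recon.db can miss changes
# that are still sitting in recon.db-wal.
sqlite3 /opt/recon/data/recon.db ".backup '/tmp/recon.db.bak.$(date +%s)'"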
# Offsite backup: planned, not yet configured (TBD — likely rsync to
# pi-nas:/export/recon-backup once a backup target is provisioned).
```

### Inspect pipeline state at a glance
```bash
ls /opt/recon/data/acquired/*/    # hopper contents
ls /opt/recon/data/processing/ | wc -l   # in-flight count
sqlite3 /opt/recon/data/recon.db \
  "SELECT status, COUNT(*) FROM documents GROUP BY status;"
```

### Re-queue a failed document
```bash
sqlite3 /opt/recon/data/recon.db \
  "UPDATE documents SET status='extracted', retries=0 WHERE hash='<hash>';"
# or via API:
curl -X POST https://recon.echo6.co/api/retry/<hash>
```

### Manual ingest
```bash
# Drop a PDF into the hopper (dispatcher will pick it up on next cycle)
sha=$(sha256sum foo.pdf | cut -d' ' -f1)
cp foo.pdf /opt/recon/data/acquired/pdf/${sha}.pdf
```

### Qdrant health
```bash
curl -s http://100.64.0.14:6333/collections/recon_knowledge_hybrid \
  | jq '.result | {status, points_count, optimizer_status}'
# status "grey" with optimizer_status "ok" is healthy (background indexing).
```

---

## 16. Known Gotchas

- **Logger setup.** RECON modules must use `setup_logging('recon.<name>')`
  from `lib.utils`, never raw `logging.getLogger()`. The root logger
  has no handlers; calls to a raw logger silently disappear.
- **Qdrant status "grey" is healthy** if `optimizer_status` is `"ok"`.
  Only treat red + not-ok as a real failure.
- **Catalogue row count can grow during long-running jobs** because
  parallel ingestion may add rows. Only a *decrease* is a real
  integrity failure.
- **Dispatcher `.tmp` safety.** `CONTENT_EXTENSIONS` does not include
  `.tmp`, so active acquisition writes are invisible to the dispatcher
  until the atomic rename lands.
- **Transcripts are filed in-place.** Their `documents.path` is a URL,
  and the filing worker's `path LIKE '/opt/recon/data/processing/%'`
  filter excludes them.
- **PeerTube 429.** Respect `peertube.rate_limit_delay` between caption
  API calls or you'll get throttled.
- **Library is an LXC bind-mount, not NFS.** `/mnt/library` on CT 130 is
  bound from the data Proxmox host's `/mnt/data/library` (local ext4 on
  /dev/sda1). File ownership/UID-GID is shared with the host — writes
  from inside the container appear with the container UID on the host.
  No NFS, no `root_squash`, no network in the path.
- **SSH heredocs with Python code break.** When editing remote files,
  write to a temp file via `scp` or `cat > file` rather than bash
  heredocs with parens/quotes.
- **The crawler is off.** `crawler.sites: []`. Re-enabling requires a
  re-architecture for the new pipeline.

---

## 17. Credentials & Hosts

| Host | Role | Access |
|---|---|---|
| CT 130 (192.168.1.130 / 100.64.0.24) | RECON service | `ssh zvx@192.168.1.130` (key auth) |
| data host (192.168.1.240) | Proxmox node hosting CT 130; `/mnt/data/library` source for the CT 130 bind-mount | `ssh root@192.168.1.240` |
| cortex VM (192.168.1.150) | Qdrant, TEI, sparse svc, Ollama | `ssh zvx@cortex` |
| CT 110 (192.168.1.170) | PeerTube `stream.echo6.co` | `ssh zvx@192.168.1.170` |
| pi-nas (192.168.1.245) | Backup target (planned; not yet configured). ~22T pool with ~300G free after library wipe. | `ssh zvx@pi-nas` |
| CT 101 (192.168.1.101) | Caddy reverse proxy (home) | `ssh root@192.168.1.241 'pct exec 101'` |

Secrets: `/home/zvx/projects/.ref/credentials` on TOC (this machine).
RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.

---

## 18. Open Follow-ups

- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
  library paths post Phase 5a/6k (audit trail at
  `/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
  hand-resolve or tombstone.
- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
  in config but not implemented. Next-up for Kiwix / web ingest.
- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
  is a valuable target list but the old crawler is off pending a new
  acquisition-module-shaped implementation.
- **ARGUS intel pipeline** shares the DB but its lifecycle is
  documented separately — not covered here.
- **Phase 6e-2** (PeerTube channel sync endpoint) was reverted and
  needs a redesign before reinstating.
- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
  yet; items pile up silently.
- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
  extraction output from the pre-refactor pipeline, for documents
  still in catalogue. Not touched by current pipeline. Can be cleaned
  up once confirmed none are the sole text copy for any document.
- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
  library management CLI tool, not the refactor's new pipeline.
  Contains `update_qdrant_payload` helper that filing worker depends
  on. Should be renamed (e.g., `library_ops.py`) when there's time.
- **SSH key for CT 130 forge access** — currently uses HTTPS with
  embedded token in remote URL. Move to SSH key auth.
- **Backup policy for derived data** — `/opt/recon/data/concepts/` and
  Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
  lose their disks, these are the hardest to regenerate (Gemini calls
  + embedding compute).
- **Backup architecture** — no offsite backup is currently configured.
  Section 15 references a planned rsync-to-pi-nas job, but neither the
  script nor the systemd timer (`recon-backup.timer`) exists. Decide
  what gets backed up (`recon.db`, `concepts/`, `text/`, Qdrant
  snapshots, `/mnt/library`?), where, and on what cadence; pi-nas has
  ~300G free in `/export/` after the 2026-04-15 library wipe and could
  be the target for a first pass.
- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
  files, not library content. Matt intends these to "eventually
  contribute" to the knowledge base but no ingestion path exists yet.

---

*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*