refactored-recon/architecture.md

# Architecture

The target architecture for RECON. This is what we are building toward. Current state and migration plan are separate documents.

## Overview

RECON is a content ingestion pipeline. Files arrive from various sources (manual PDF uploads, PeerTube transcript scraping, future Kiwix imports, future web rebuilds), get processed through text extraction, enrichment, and embedding, and end up filed in a searchable library organized by domain and subdomain.

The architecture splits this lifecycle into three stages connected by well-defined handoffs, with the type-specific work isolated into small modules and the expensive shared infrastructure (enrichment, embedding) used as a library by whichever module needs it.

## Physical layout

```
/opt/recon/data/
  acquired/
    pdf/              ← PDF acquisition modules drop here
    stream/           ← transcript acquisition modules drop here
    html/             ← kiwix / html acquisition modules drop here
    (anything in the acquired/ root is ignored)
  processing/
    {hash}/           ← scratch for in-flight work, flat and hash-indexed
  concepts/           ← enrichment output (existing)
  recon.db

/mnt/library/
  Domain1/Subdomain/file.pdf
  Domain1/Subdomain/transcript.txt
  Domain2/Subdomain/article.html
  ...
```

Three locations, each with a clear meaning:

- **`_acquired/<type>/`** — waiting room. A file here has been fetched from its source and is waiting to be picked up by the dispatcher.
- **`_processing/{hash}/`** — work zone. A file here is being actively processed. It holds the source file, extracted text pages, `meta.json`, and whatever scratch the processor needs.
- **`library/Domain/Subdomain/`** — permanent home. Finished, filed, renamed, human-browsable.

`_acquired/` and `_processing/` live under RECON (`/opt/recon/data/`), not under `/mnt/library/`. The library is kept clean — only finished content touches it. This gives two clean backup targets (RECON state and library content) and prevents half-processed files from ever being visible to the file server or to anyone browsing the library.

## Lifecycle

Every item follows the same five steps regardless of content type.

**1. Acquisition.** Some module fetches content from a source (PeerTube API, a download script, a manual upload handler) and drops it in the appropriate `_acquired/<type>/` subfolder. The acquisition module knows only two things: how to get the content, and which subfolder to drop it in. It does not care what happens next.

**2. Dispatch.** A dispatcher watches each `_acquired/<type>/` subfolder. When it sees a new item whose mtime has not changed for a configured threshold (i.e., it is no longer being written), it hands the item to the processor registered for that subfolder. The dispatcher is dumb: it does no inspection, no type sniffing, no content analysis. Folder determines type. Type determines processor.
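A minimal sketch of the stability check, assuming it is a plain mtime comparison. The helper name `is_stable` and the 30-second default are hypothetical; the document does not fix a threshold.

```python
import os
import time
from typing import Optional


def is_stable(path: str, threshold_s: float = 30.0, now: Optional[float] = None) -> bool:
    """Return True if nothing at `path` was modified in the last `threshold_s` seconds.

    Hypothetical helper: the name and the default threshold are illustrative only.
    For a directory, the newest mtime of the directory and its contents is used.
    """
    now = time.time() if now is None else now
    newest = os.lstat(path).st_mtime
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for name in dirs + files:
                newest = max(newest, os.lstat(os.path.join(root, name)).st_mtime)
    return (now - newest) >= threshold_s
```

Passing `now` explicitly keeps the check deterministic in tests; in the dispatcher loop it would normally be left as `None`.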

**3. Pre-flight (processor-specific).** The processor takes ownership of the item and runs the pre-flight check before doing any expensive work:

   - Compute the content hash and look up in the catalogue. If found, it is a byte-identical duplicate. Delete the file. Done.
   - Extract cheap metadata from the file (title, edition/volume, author, year) using whatever type-appropriate method the processor chooses.
   - Derive the level-4 canonical name (`Title_Edition_Author_Year`). If any of the four fields could not be extracted, the strict match cannot run: treat the file as a non-duplicate and proceed.
   - If all four fields are present, look up the catalogue for an existing entry that matches all four fields but has a different hash. If one is found, move the file to a `_duplicates/` quarantine for human review. Flag it, log it, done.
   - If neither duplicate check fired, move the file from `_acquired/<type>/` to `_processing/{hash}/` and standardize it into `page_NNNN.txt` + `meta.json` form. Update the catalogue and documents table. Set status to `extracted`.
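The two inputs to the pre-flight checks can be sketched as below. This is an illustration under assumptions: the document does not name the hash algorithm (SHA-256 is used here as a stand-in), and `content_hash` and `level4_key` are hypothetical helper names. No normalization of field values is specified in the document, so the sketch only strips whitespace.

```python
import hashlib
from typing import Optional


def content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks. SHA-256 is an assumption, not a documented choice."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def level4_key(title, edition, author, year) -> Optional[str]:
    """Build Title_Edition_Author_Year, or return None if any field is missing.

    None means the strict level-4 check cannot run and the file
    proceeds as a non-duplicate.
    """
    fields = [title, edition, author, year]
    if any(f is None or not str(f).strip() for f in fields):
        return None
    return "_".join(str(f).strip() for f in fields)
```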

**4. Enrichment and embedding (shared).** The existing enrich and embed stage loops pick up items by status. Enrichment reads text pages, calls Gemini, writes concept JSONs, sets status to `enriched`. Embedding reads concepts, pushes vectors into Qdrant, sets status to `complete`. These stages are source-agnostic — they don't know or care what kind of content produced the text pages. They are shared infrastructure.

**5. Filing (shared).** A single organizer watches for items with `status='complete' AND organized_at IS NULL`. For each:

   - Read dominant domain from the concept JSONs (existing logic).
   - Derive the canonical name starting at level 1 (`Title`) and escalating through levels 2, 3, 4 only if needed to resolve collisions at the target path.
   - Move the source file from `_processing/{hash}/` to `library/Domain/Subdomain/{canonical_name}.{ext}`.
   - Update catalogue, documents, and Qdrant payloads atomically to reflect the final name and path.
   - Clean up the `_processing/{hash}/` scratch directory.
   - Set `organized_at`.
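The organizer's selection rule can be sketched against SQLite. This assumes `recon.db` is a SQLite database and that the `documents` table has `hash`, `status`, and `organized_at` columns; the schema and both function names are hypothetical.

```python
import sqlite3
from typing import List


def pending_for_filing(conn: sqlite3.Connection) -> List[str]:
    """Return hashes of items that are complete but not yet filed."""
    rows = conn.execute(
        "SELECT hash FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL "
        "ORDER BY hash"
    ).fetchall()
    return [row[0] for row in rows]


def mark_organized(conn: sqlite3.Connection, content_hash: str, organized_at: str) -> None:
    """Record the filing time; the `with` block commits, or rolls back on error."""
    with conn:
        conn.execute(
            "UPDATE documents SET organized_at = ? WHERE hash = ?",
            (organized_at, content_hash),
        )
```

The sketch covers only the database side. The file move and the Qdrant payload update are out of its scope.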

Every content type goes through the same five steps. What varies by type is bounded and well-defined.

## The processor contract

A processor is a module with a small, well-defined interface. It owns pre-flight for its content type. Everything else is shared.

```python
# Minimum interface for a processor module

def pre_flight(item_path: str, db, config) -> dict:
    """Handle the item from _acquired/ to _processing/.

    Returns a dict describing the outcome:
      {'action': 'extracted', 'hash': ...}        — moved to _processing/, ready for enrich
      {'action': 'duplicate_hash', 'hash': ...}   — byte-identical, deleted
      {'action': 'duplicate_name', 'hash': ...}   — quarantined for review
      {'action': 'error', 'error': '...'}         — something went wrong, item stays in _acquired/
    """
```

What the processor is responsible for:

- Knowing where its input is (`_acquired/<type>/` subfolder — configured via dispatch registry)
- Hash-based duplicate detection against the catalogue
- Cheap metadata extraction for the pre-enrichment name-based duplicate check
- Name-based duplicate detection at level 4
- Moving the item from `_acquired/` to `_processing/`
- Converting the source into standardized `page_NNNN.txt` + `meta.json` pages
- Updating catalogue + documents + setting status to `extracted`

What the processor is NOT responsible for:

- Enrichment (shared)
- Embedding (shared)
- Canonical naming derivation at filing time (shared — the organizer handles level 1 → 2 → 3 → 4 escalation)
- Filing to the library (shared — the organizer moves items and updates DB+Qdrant atomically)
- Domain classification (shared — the organizer reads concepts)

This split is deliberate. Pre-flight is type-specific because metadata extraction depends heavily on the source format. Filing is type-agnostic because by that point everything has been reduced to "a source file, its hash, and its concept JSONs" — and that's enough to classify and file.
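Because the contract is a plain dict, a caller may want to check the shape of what a processor returns. The sketch below validates the four outcomes listed in the `pre_flight` docstring; `REQUIRED_KEY` and `validate_outcome` are hypothetical names, not part of the existing code.

```python
# Required payload key for each action named in the pre_flight docstring.
REQUIRED_KEY = {
    "extracted": "hash",
    "duplicate_hash": "hash",
    "duplicate_name": "hash",
    "error": "error",
}


def validate_outcome(outcome: dict) -> dict:
    """Return `outcome` unchanged, or raise ValueError if it has an undocumented shape."""
    action = outcome.get("action")
    if action not in REQUIRED_KEY:
        raise ValueError(f"unknown action: {action!r}")
    key = REQUIRED_KEY[action]
    if key not in outcome:
        raise ValueError(f"action {action!r} requires key {key!r}")
    return outcome
```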

## The dispatcher

The dispatcher is a small component that watches `_acquired/<type>/` subfolders and hands items to processors. Its config is a flat dict:

```yaml
dispatch:
  pdf: pdf_processor
  stream: transcript_processor
  html: html_processor
```

Key is the subfolder name. Value is the processor module name. Adding a new content type is one line in this config plus a new processor module file.
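One way to turn that mapping into callable processors is a standard-library import by name. This is a sketch, assuming the `dispatch` mapping has already been parsed into a Python dict and that processor modules are importable by their bare names; `load_processors` is a hypothetical helper.

```python
import importlib
from types import ModuleType
from typing import Dict


def load_processors(dispatch: Dict[str, str]) -> Dict[str, ModuleType]:
    """Import each configured processor module and check that it exposes pre_flight."""
    processors = {}
    for subfolder, module_name in dispatch.items():
        module = importlib.import_module(module_name)
        if not callable(getattr(module, "pre_flight", None)):
            raise TypeError(f"{module_name} does not define pre_flight()")
        processors[subfolder] = module
    return processors
```

Failing at load time when a module lacks `pre_flight` surfaces a misconfigured registry at startup rather than when the first item arrives.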

The dispatcher's logic is:

1. For each configured subfolder, list contents.
2. For each file/directory that has been stable on disk longer than the mtime threshold, import the processor module and call `pre_flight(item_path, db, config)`.
3. Record the outcome. Retry on transient errors (with backoff). Leave the item in place on persistent errors.
4. Sleep. Repeat.

Items in `_acquired/` root (not in a subfolder) are ignored. No error, no movement, no warning. The filesystem itself is the alert — `ls _acquired/` will show them.
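A single pass of the loop above might look like the sketch below. It makes several assumptions: each processor is passed in as a callable that takes only the item path (with `db` and `config` already bound, for example via `functools.partial`), retry with backoff is left out, and stability is judged from the item's own mtime. `dispatch_once` is a hypothetical name. Only configured subfolders are listed, so files in the root are never touched.

```python
import os
import time
from typing import Callable, Dict, List, Optional, Tuple


def dispatch_once(
    acquired_root: str,
    processors: Dict[str, Callable[[str], dict]],
    threshold_s: float = 30.0,
    now: Optional[float] = None,
) -> List[Tuple[str, dict]]:
    """Run one dispatch pass and return (item_path, outcome) pairs."""
    now = time.time() if now is None else now
    outcomes = []
    for subfolder, pre_flight in sorted(processors.items()):
        folder = os.path.join(acquired_root, subfolder)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            item = os.path.join(folder, name)
            if now - os.stat(item).st_mtime < threshold_s:
                continue  # possibly still being written; try again next pass
            try:
                outcome = pre_flight(item)
            except Exception as exc:
                # The item stays in place; the error is recorded as an outcome.
                outcome = {"action": "error", "error": str(exc)}
            outcomes.append((item, outcome))
    return outcomes
```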

## Naming

Canonical names follow a four-level hierarchy. At filing time, the levels escalate only when needed to resolve collisions at the target library path.

```
Level 1: Title
Level 2: Title_Author
Level 3: Title_Edition_Author
Level 4: Title_Edition_Author_Year
```

At pre-flight time, the processor derives level 4 and checks the catalogue for existing matches. Strict match: all four fields must be present and equal. Missing fields mean the check cannot run and the file proceeds as non-duplicate.

At filing time, the organizer starts at level 1 and escalates only to resolve physical collisions in the target Domain/Subdomain folder. Most files file at level 1.
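The escalation can be sketched as a loop over the four levels. Two behaviors here are assumptions, because the document does not specify them: a level is skipped when one of its fields is missing, and `None` is returned when every level collides. `canonical_name` and the `exists` callback are hypothetical.

```python
from typing import Callable, Optional

# Field sets for levels 1 through 4, in escalation order.
LEVELS = (
    ("title",),
    ("title", "author"),
    ("title", "edition", "author"),
    ("title", "edition", "author", "year"),
)


def canonical_name(meta: dict, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the lowest-level name that does not collide in the target folder.

    `exists(name)` reports whether `name` is already taken at the target path.
    """
    for fields in LEVELS:
        if any(not meta.get(f) for f in fields):
            continue  # assumption: skip levels that need a missing field
        name = "_".join(str(meta[f]) for f in fields)
        if not exists(name):
            return name
    return None  # assumption: caller decides what to do when all levels collide
```

For example, with `Botany` already present in the target folder, a second book titled `Botany` by `Smith` would file as `Botany_Smith`.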

Duplicate detection semantics:

- **Byte-identical (same hash):** delete immediately, no review, no cost.
- **Level-4 name match with different hash:** quarantine for human review. Could be a better scan, a re-scan with corrections, an edition the metadata missed. Human decides.
- **Everything else:** proceed through the pipeline normally.
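These three cases reduce to a small decision function. In this sketch the catalogue is modeled as an in-memory dict from content hash to level-4 key, which is a simplification for illustration; `classify` is a hypothetical name, and its return values reuse the action names from the processor contract.

```python
from typing import Dict, Optional


def classify(file_hash: str, key: Optional[str], catalogue: Dict[str, Optional[str]]) -> str:
    """Return 'duplicate_hash', 'duplicate_name', or 'proceed'.

    `catalogue` maps known content hashes to their level-4 key (or None).
    `key` is None when any of the four name fields could not be extracted.
    """
    if file_hash in catalogue:
        return "duplicate_hash"  # byte-identical: delete
    if key is not None and key in catalogue.values():
        return "duplicate_name"  # same level-4 name, different bytes: quarantine
    return "proceed"
```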

Different editions of the same work are kept as separate documents because edition and year are part of the level-4 key. Concept-level deduplication of near-identical content can happen later as a separate cleanup activity if needed.

## State transitions

The filesystem is the primary state indicator. The database tracks detail but the high-level "where is this item in its lifecycle" is always visible as a directory listing.

| Filesystem location | Meaning | DB status |
|---|---|---|
| `_acquired/<type>/` | Waiting for dispatcher | Not in DB yet |
| `_processing/{hash}/` | In-flight | `queued`, `extracting`, `extracted`, `enriching`, `enriched`, `embedding`, or `complete` with `organized_at` still NULL |
| `library/Domain/Subdomain/` | Finished | `complete` with `organized_at` set |
| `_duplicates/` | Quarantined | DB entry with duplicate flag |

Crashes and partial failures leave files in `_acquired/` or `_processing/` where they can be inspected. Nothing ever silently disappears.

## Enrichment and embedding (unchanged)

The refactor does not touch how enrichment or embedding work internally. The existing `lib/enricher.py` and `lib/embedder.py` keep doing what they do today. What changes is WHERE they read from: instead of `/opt/recon/data/text/{hash}/`, they read from `/opt/recon/data/processing/{hash}/`. Both are processor-agnostic — they just read `page_NNNN.txt` and `meta.json` from a directory.

The path resolution change is small. It can be either a helper function (`resolve_text_dir(hash)` that returns `_processing/{hash}/` for in-flight items) or a direct change to the path constant. Either way, it is a minimal diff in the shared code.
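A sketch of the helper option, using the `/opt/recon/data/processing/{hash}/` path given above. Only the in-flight case is covered, since the document does not say what the helper should return for items that have already been filed. `page_paths` is a hypothetical companion showing how a shared stage might list the standardized pages.

```python
import os
from typing import List

PROCESSING_ROOT = "/opt/recon/data/processing"


def resolve_text_dir(content_hash: str, processing_root: str = PROCESSING_ROOT) -> str:
    """Return the scratch directory that holds an in-flight item's pages."""
    return os.path.join(processing_root, content_hash)


def page_paths(text_dir: str) -> List[str]:
    """List page_NNNN.txt files in page order (zero padding makes name order page order)."""
    names = sorted(
        n for n in os.listdir(text_dir)
        if n.startswith("page_") and n.endswith(".txt")
    )
    return [os.path.join(text_dir, n) for n in names]
```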

## Things that do not exist in this architecture

A few things that exist today and do not have a home in the target:

- **Library-root scanning.** The current `scan_library()` walks `/mnt/library/` looking for new PDFs. Under the new architecture, nothing should arrive in the library except via the pipeline (acquired → processing → filed). Manual drops into the library are not a supported input path. If you have files to ingest, they go in `_acquired/pdf/`, not in the library tree.

- **The ambiguous `catalogue` vs `documents` split.** Both tables survive the refactor; what goes away is the ambiguity about their roles. `catalogue` is the canonical "what content do we have, keyed by hash" record; `documents` is the pipeline state machine for in-flight items. The refactor does not merge them.

- **The crawler as a background service.** Web scraping via the crawler is not part of the refactor. If web ingestion returns later, it will be as its own acquisition module that drops into `_acquired/html/` (or a new subfolder), same as any other source. The existing crawler code can stay in the codebase as a dead-but-preserved module, or be deleted. That decision is out of scope for the refactor.

- **The `_sources/streamecho6/` layout for transcripts.** The 18,855 transcripts currently filed there are a transitional artifact from a previous session's migration. Under the new architecture, transcripts are filed by domain like everything else. A resweep will move them to the Domain/Subdomain tree during the migration. This is tracked in `migration-plan.md`.

## Things that are deliberately simple

A few places where we chose simplicity over flexibility:

- **Subfolder dispatch, not pattern matching.** Filename conventions inside `_acquired/` are not enforced. Type is determined by which subfolder the file is in. This means a processor could receive any filename within its subfolder and has to handle it. The alternative — pattern matching on filename — was considered and rejected as too clever.

- **Unknown-type files are ignored.** A file dropped at the root of `_acquired/` (not in any subfolder) sits there forever until a human moves it. There is no error handling, no catch-all bucket, no warning log. This is deliberate: any automated handling of unknown files creates a second inbox that rots silently. Human attention is the recovery mechanism.

- **No processor plugin system.** Processors are registered in a static config file. No dynamic discovery, no drop-in plugins, no hot reload. The number of processors is small and known, and adding one is a three-step change: new module file, one config line, one restart.

## What this architecture optimizes for

**Expandability.** Adding a new content source should be obvious and local. Write an acquisition module, write a processor, add one config line. Done. No changes to shared infrastructure, no changes to any other processor, no changes to the enricher or embedder or organizer.

**Clarity of state.** At any moment, you should be able to see what the system is doing by looking at the filesystem. No hidden state in background threads, no need to query the database to know what's in flight. `ls /opt/recon/data/acquired/` shows what is waiting and `ls /opt/recon/data/processing/` shows what is in flight.

**Recoverability from failure.** Every stage transition is atomic. Every item's state is visible. Crashes leave diagnosable residue rather than silent data loss. Duplicates are handled by explicit policy (delete on hash match, quarantine on level-4 name match) rather than ad hoc.

**Minimal surface for bugs.** The shared infrastructure (enrich, embed, organize) is written once, tested once, and reused by every processor. The type-specific code (pre-flight) is small and self-contained per processor. A bug in the PDF processor cannot break the transcript processor.