# refactored-recon

Design documents for the RECON pipeline refactor. The goal is to restructure RECON's ingestion pipeline into a hopper-based, type-dispatched architecture where new content sources can be added by writing a small acquisition module and a small processor module without touching shared infrastructure.

This repo is design-only. Implementation happens in the RECON repo; this repo tracks the thinking, the decisions, and the phased migration plan with git history so the architecture can evolve visibly over time.

## Status

- Design drafted: 2026-04-14
- Implementation status: not started
- Current system: `recon.service` stopped pending refactor

## Documents

- [architecture.md](architecture.md) — target architecture. The hopper model, processor pattern, lifecycle, contracts.
- [current-state.md](current-state.md) — where the system is today, what works, what's broken, what's technical debt.
- [migration-plan.md](migration-plan.md) — phased plan to get from current to target without losing data or incurring extended downtime.
- [decisions.md](decisions.md) — architectural decision record. The forks we considered and why we chose what we chose.
- [phases/](phases/) — detailed per-phase execution plans (to be filled in as each phase is scoped).

## Read order

If you're new to this design, read in this order:

1. `current-state.md` — understand what exists
2. `architecture.md` — understand the target
3. `decisions.md` — understand why the target looks the way it does
4. `migration-plan.md` — understand how we get there

## Principles

Three principles shaped every decision in this design. When in doubt on a detail, fall back to these:

**Modularity on the edges, uniformity in the middle.** Each content source (PDFs, transcripts, HTML, future types) is its own acquisition module and its own processor. They share nothing except the enrich/embed infrastructure and the filesystem contract. Adding a new type touches only the two new modules and one line of config.
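One way to picture the "one line of config" claim is a registry keyed by content type. The sketch below is illustrative only: `PROCESSORS`, `register`, `dispatch`, and the two toy processors are hypothetical names, not part of RECON, and the real processor contract is whatever `architecture.md` specifies.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a content-type name to its processor function.
PROCESSORS: Dict[str, Callable[[Path], str]] = {}


def register(content_type: str):
    """Decorator that adds a processor to the registry under content_type."""
    def wrap(fn: Callable[[Path], str]) -> Callable[[Path], str]:
        PROCESSORS[content_type] = fn
        return fn
    return wrap


@register("pdf")
def process_pdf(path: Path) -> str:
    # Placeholder body; a real processor would extract text from the file.
    return f"pdf text from {path.name}"


@register("transcript")
def process_transcript(path: Path) -> str:
    # Placeholder body, same shape as the PDF processor.
    return f"transcript text from {path.name}"


def dispatch(content_type: str, path: Path) -> str:
    """Shared middle: look up the processor by type and run it."""
    try:
        processor = PROCESSORS[content_type]
    except KeyError:
        raise ValueError(f"no processor registered for {content_type!r}") from None
    return processor(path)
```

Under these assumptions, adding an HTML source would mean writing one more decorated function; `dispatch` and the registry stay untouched.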

**State is a directory.** A file's location on disk tells you what stage of the pipeline it's in. Acquired but unprocessed → sitting in `_acquired/`. Being worked on → sitting in `_processing/`. Done → sitting in the library under its final name. No status is tracked anywhere that isn't reflected in where the file actually lives.
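As a rough illustration of this principle, the stage of a file can be computed from its path alone, with no lookup in a status table. The sketch assumes `_acquired/` and `_processing/` sit directly under a single pipeline root and that everything else under that root is the library; the function name `stage_of` and that layout are assumptions, not taken from the design documents.

```python
from pathlib import Path


def stage_of(path: Path, root: Path) -> str:
    """Derive the pipeline stage from where the file lives under root.

    Assumed layout: root/_acquired/..., root/_processing/..., and anything
    else under root counts as the finished library.
    """
    rel = path.relative_to(root)  # raises ValueError if path is outside root
    if not rel.parts:
        raise ValueError("path is the root itself, not a file inside it")
    top = rel.parts[0]
    if top == "_acquired":
        return "acquired"
    if top == "_processing":
        return "processing"
    return "done"
```

Because the function only inspects the path, it works on files that do not exist yet, which makes it easy to use in tests or dry runs.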

**Small atomic transitions.** Files move between stages as complete units with all their metadata updated together — filesystem, catalogue, documents table, and Qdrant payloads in one transition. Partial state is the enemy. If any part of a transition fails, the file stays where it was.
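A minimal sketch of the "stays where it was" guarantee, assuming each metadata update can be expressed as an apply/undo pair. The names `transition` and `Step` are hypothetical, and the undo callbacks stand in for whatever the catalogue, documents table, and Qdrant actually need. This is best-effort compensation rather than a true cross-store transaction, and it assumes each `apply` either fully succeeds or changes nothing.

```python
import os
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical shape for one metadata update: (apply, undo).
Step = Tuple[Callable[[], None], Callable[[], None]]


def transition(src: Path, dst: Path, steps: List[Step]) -> None:
    """Move src to dst and apply each step; undo everything if any step fails."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    # os.replace is an atomic rename when src and dst are on the same filesystem.
    os.replace(src, dst)
    undo_stack: List[Callable[[], None]] = [lambda: os.replace(dst, src)]
    try:
        for apply, undo in steps:
            apply()
            undo_stack.append(undo)  # only completed steps get undone
    except Exception:
        # Roll back in reverse order so the file ends up back at src.
        for undo in reversed(undo_stack):
            undo()
        raise
```

If the initial move fails, nothing has changed and the exception simply propagates. If a later step fails, the completed steps are undone and the file is moved back before the error is re-raised.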