refactored-recon/README.md
Matt aa195825e3 Initial design docs for RECON pipeline refactor
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 06:08:06 +00:00


# refactored-recon
Design documents for the RECON pipeline refactor. The goal is to restructure RECON's ingestion pipeline into a hopper-based, type-dispatched architecture where new content sources can be added by writing a small acquisition module and a small processor module without touching shared infrastructure.

This repo is design-only. Implementation happens in the RECON repo; this repo records the reasoning, the decisions, and the phased migration plan under version control, so the architecture's evolution stays visible in the git history.
## Status
- Design drafted: 2026-04-14
- Implementation status: not started
- Current system: recon.service stopped pending refactor
## Documents
- [architecture.md](architecture.md): the target architecture, covering the hopper model, processor pattern, lifecycle, and contracts.
- [current-state.md](current-state.md): where the system is today, what works, what's broken, and what's technical debt.
- [migration-plan.md](migration-plan.md): phased plan to get from the current system to the target without losing data or incurring extended downtime.
- [decisions.md](decisions.md): architectural decision record. The forks we considered and why we chose what we chose.
- [phases/](phases/): detailed per-phase execution plans (to be filled in as each phase is scoped).
## Read order
If you're new to this design, read in this order:
1. `current-state.md` — understand what exists
2. `architecture.md` — understand the target
3. `decisions.md` — understand why the target looks the way it does
4. `migration-plan.md` — understand how we get there
## Principles
Three principles shaped every decision in this design. When in doubt about a detail, fall back to these:

**Modularity on the edges, uniformity in the middle.** Each content source (PDFs, transcripts, HTML, future types) has its own acquisition module and its own processor. These modules share nothing except the enrich/embed infrastructure and the filesystem contract. Adding a new type touches only the two new modules and one line of config.
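
As a rough illustration of type dispatch, the sketch below registers one processor per content type and routes files through a shared `dispatch` function. Every name here (`PROCESSORS`, `processor`, `dispatch`, the handler functions) is hypothetical and not taken from RECON; the decorator stands in for the "one line of config", and the real contracts are defined in `architecture.md`.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: content-type key -> processor function.
PROCESSORS: Dict[str, Callable[[Path], str]] = {}


def processor(content_type: str):
    """Register a processor for one content type (illustrative decorator)."""
    def register(fn: Callable[[Path], str]) -> Callable[[Path], str]:
        PROCESSORS[content_type] = fn
        return fn
    return register


@processor("pdf")
def process_pdf(path: Path) -> str:
    # Placeholder body; a real processor would extract text, enrich, embed.
    return f"pdf:{path.name}"


@processor("transcript")
def process_transcript(path: Path) -> str:
    # Placeholder body for a second, independent content type.
    return f"transcript:{path.name}"


def dispatch(content_type: str, path: Path) -> str:
    """Route a file to its type's processor; unknown types fail loudly."""
    try:
        handler = PROCESSORS[content_type]
    except KeyError:
        raise ValueError(f"no processor registered for {content_type!r}") from None
    return handler(path)
```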

**State is a directory.** A file's location on disk tells you what stage of the pipeline it's in. Acquired but unprocessed → sitting in `_acquired/`. Being worked on → sitting in `_processing/`. Done → sitting in the library under its final name. No status tracking that isn't reflected in where the file actually lives.
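
A minimal sketch of reading stage from location, assuming `_acquired/` and `_processing/` are directories somewhere under a single library root and that anything outside them counts as done. The function name `stage_of`, the stage labels, and that layout are illustrative assumptions, not the documented contract.

```python
from pathlib import Path

# Assumed mapping from stage directory name to a stage label.
STAGE_DIRS = {"_acquired": "acquired", "_processing": "processing"}


def stage_of(path: Path, root: Path) -> str:
    """Infer the pipeline stage purely from where the file sits under root."""
    # relative_to raises ValueError if path is not under root.
    parts = path.relative_to(root).parts
    for part in parts[:-1]:  # inspect directories only, not the file name
        if part in STAGE_DIRS:
            return STAGE_DIRS[part]
    return "done"  # anywhere else under root is treated as the library
```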

**Small atomic transitions.** Files move between stages as complete units, with all their metadata updated together: filesystem, catalogue, documents table, and Qdrant payloads change in one transition. Partial state is the enemy. If any part of a transition fails, the file stays where it was.
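
One way to approximate this is a compensating rollback: apply each metadata update with a paired undo, move the file last, and unwind everything if any step fails. This is a sketch under assumptions, not the design's actual mechanism; `transition` and the `(apply, undo)` step pairs are hypothetical stand-ins for the catalogue, documents table, and Qdrant updates, and a failing undo is not handled here.

```python
import os
from pathlib import Path
from typing import Callable, List, Tuple

# Each step is an (apply, undo) pair of zero-argument callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def transition(src: Path, dst: Path, steps: List[Step]) -> None:
    """Apply every metadata step and move src to dst, or leave things as they were."""
    done: List[Callable[[], None]] = []
    try:
        for apply, undo in steps:
            apply()
            done.append(undo)  # remember how to reverse this step
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Move last: os.replace is an atomic rename when src and dst
        # live on the same filesystem.
        os.replace(src, dst)
    except Exception:
        # Reverse completed steps in the opposite order, then re-raise.
        for undo in reversed(done):
            undo()
        raise
```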