Initial design docs for RECON pipeline refactor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matt 2026-04-14 06:08:06 +00:00
commit aa195825e3
7 changed files with 814 additions and 0 deletions

current-state.md Normal file

@ -0,0 +1,146 @@
# Current state
Where RECON is today, as of the start of the refactor. Honest assessment of what works, what's broken, what's technical debt, and what's load-bearing.
This document is a snapshot. It will go stale as the refactor proceeds. The migration plan references this state as its starting point.
## What works well
The core content-processing logic is solid and trusted.
- **PDF text extraction** (`lib/extractor.py`) works reliably across the library's many PDF formats.
- **Gemini enrichment** (`lib/enricher.py`) produces high-quality concept JSONs and is the most expensive part of the pipeline by dollar value.
- **TEI embedding** (`lib/embedder.py`) pushes 2.3M+ vectors into Qdrant without issue.
- **Domain classification** (`lib/organizer.py` `determine_dominant_domain()`) correctly classifies documents from their concept JSONs with well-tuned ambiguity handling.
- **PDF filename sanitization** (`lib/utils.py` `sanitize_filename()`) handles the messy real-world filenames from Anna's Archive, PDFDrive, z-lib, and military publications. Six-phase pipeline, well-tested against the library.
- **PeerTube scraping** (`lib/peertube_scraper.py`) reliably fetches captions from the local stream.echo6.co instance.
- **Dashboard and API** (`lib/api.py`) are functional and used for day-to-day monitoring.
These are assets. The refactor preserves all of them as library code called by processors and shared stages. None of this logic gets rewritten.
## What does not work well
### Two coexisting pipeline models
The codebase has two overlapping ingestion paths that do not cleanly separate. This is evidence of an earlier, incomplete refactor.
**Path A — library-root scanning (legacy):**
- `scan_library()` walks `/mnt/library/` for PDFs
- `queue_all()` moves catalogued items into the documents table
- Stage loops (`extract`, `enrich`, `embed`) pick up by status
- `organize_document()` moves finished items from library root to `Domain/Subdomain/`
**Path B — acquired-staging watchdog (newer):**
- `_acquired/` staging directory
- `ingest_scan()` → `ingest_acquire()` watchdog loop
- `_ingest/` intermediate staging
- `ingest_place()` for final filing
Both paths converge at the `documents` table with `status='queued'`, and from that point forward they go through the same stage loops. The front half, however, is duplicated, and the code has accumulated conditionals and helpers that exist because of this split.
The refactor replaces both paths with a single, uniform flow.
### Extraction scratch is separate from the library
The current pipeline writes extracted text and metadata to `/opt/recon/data/text/{hash}/` while the source file stays in `/mnt/library/`. The scratch is keyed by hash and lives on a different filesystem from the source. This decoupling has caused problems in the past — the Phase B transcript migration had to explicitly correlate scratch locations to source locations for 18,855 transcripts.
Under the refactor, `/opt/recon/data/processing/{hash}/` holds the source file AND its extracted text AND its metadata AND its concepts, all in one place, for the duration of processing.
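As an illustration of the single-directory layout described above, here is a minimal sketch of a path helper. The root path comes from this document; the function name `processing_paths` and the file names inside the directory (`text.txt`, `metadata.json`, `concepts.json`) are assumptions made for the example, not names taken from the codebase.

```python
from pathlib import Path

# Root taken from the text above; everything below it is illustrative.
PROCESSING_ROOT = Path("/opt/recon/data/processing")


def processing_paths(content_hash, source_name, root=PROCESSING_ROOT):
    """Return every per-document path under one hash-keyed directory.

    The file names are hypothetical. The point is only that the source,
    text, metadata, and concepts share a single parent directory.
    """
    base = root / content_hash
    return {
        "dir": base,
        "source": base / source_name,
        "text": base / "text.txt",            # hypothetical name
        "metadata": base / "metadata.json",   # hypothetical name
        "concepts": base / "concepts.json",   # hypothetical name
    }
```

Because every artifact hangs off one directory, a later migration would not need to correlate a scratch location with a source location: moving or deleting `paths["dir"]` covers all of them.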
### Scanner race condition
`scan_library()`'s `add_to_catalogue()` uses `ON CONFLICT DO UPDATE` to overwrite the `path` column for any hash it re-encounters. During the earlier library sweep, this caused 1,015 hashes' paths to be silently overwritten because the scanner ran during the sweep. The issue was reconciled from the `file_operations` audit trail, but the underlying bug is still present: any future bulk reorganization will hit the same race if the scanner is running.
Under the refactor, the scanner no longer walks the library. Files enter the pipeline only via `_acquired/`, so there is no library-walking scanner for the race to occur in.
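The overwrite behaviour behind this race can be reproduced in isolation. The sketch below uses an in-memory SQLite database with a two-column `catalogue` table; that schema, the example paths, and the `current_path` helper are assumptions for illustration, and the real `add_to_catalogue()` may differ. The `DO NOTHING` variant is shown only as a contrast, not as the fix the refactor adopts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: the real catalogue table has more columns.
conn.execute("CREATE TABLE catalogue (hash TEXT PRIMARY KEY, path TEXT NOT NULL)")

# Upsert that rewrites the path whenever a known hash is seen again.
UPSERT_OVERWRITE = (
    "INSERT INTO catalogue (hash, path) VALUES (?, ?) "
    "ON CONFLICT(hash) DO UPDATE SET path = excluded.path"
)
# Contrast: keep the existing row untouched when the hash is already known.
INSERT_IF_NEW = (
    "INSERT INTO catalogue (hash, path) VALUES (?, ?) "
    "ON CONFLICT(hash) DO NOTHING"
)


def current_path(doc_hash):
    row = conn.execute(
        "SELECT path FROM catalogue WHERE hash = ?", (doc_hash,)
    ).fetchone()
    return row[0]


# The sweep records the new, organized location first...
conn.execute(UPSERT_OVERWRITE, ("abc", "/mnt/library/Domain/Subdomain/doc.pdf"))
# ...then a scanner holding a stale directory listing reports the old location.
conn.execute(UPSERT_OVERWRITE, ("abc", "/mnt/library/doc.pdf"))
overwritten = current_path("abc")   # the stale path has silently won

# Same sequence with DO NOTHING: the first recorded path is kept.
conn.execute(INSERT_IF_NEW, ("def", "/mnt/library/Domain/Subdomain/other.pdf"))
conn.execute(INSERT_IF_NEW, ("def", "/mnt/library/other.pdf"))
preserved = current_path("def")
```

With the overwriting upsert, whichever writer runs last decides the stored path, which is why the ordering of scanner and sweep mattered.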
### Organizer error-spam on service restart
The organizer loop queries `get_unorganized()` which returns any document with `status='complete' AND organized_at IS NULL`. All 19,133 existing PeerTube transcripts match this query because transcripts have never been run through the organizer — their "path" is a watch URL, and `os.path.exists(watch_url)` returns False, so `organize_document()` returns an error without setting `organized_at`. On the next poll the same transcripts are returned again. Forever.
The moment `recon.service` restarts under current code, the organizer will start error-spamming on all 19,133 transcripts every 30 seconds.
**This bug does not exist under the refactor** because transcripts enter the same filing path as PDFs and get `organized_at` set naturally.
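The retry loop described in this section can be sketched with a toy version of the two functions. The table schema, the function bodies, and the example watch URL are assumptions made for the illustration; only the query condition and the `os.path.exists()` check come from the text above.

```python
import os
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real documents table has more columns.
conn.execute(
    "CREATE TABLE documents "
    "(hash TEXT PRIMARY KEY, path TEXT, status TEXT, organized_at TEXT)"
)
# A transcript whose "path" is a watch URL (hypothetical URL for illustration).
conn.execute(
    "INSERT INTO documents VALUES (?, ?, 'complete', NULL)",
    ("t1", "https://stream.echo6.co/w/example"),
)


def get_unorganized(db):
    """Toy stand-in for the query described above."""
    return db.execute(
        "SELECT hash, path FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL"
    ).fetchall()


def organize_document(db, doc_hash, path):
    """Toy stand-in: fails without touching organized_at when the path is missing."""
    if not os.path.exists(path):
        return False  # error path: the row stays eligible for the next poll
    db.execute(
        "UPDATE documents SET organized_at = datetime('now') WHERE hash = ?",
        (doc_hash,),
    )
    return True


# Simulate three polls: the same row fails every time.
errors_per_poll = []
for _ in range(3):
    batch = get_unorganized(conn)
    errors_per_poll.append(
        sum(not organize_document(conn, h, p) for h, p in batch)
    )
```

Each poll reports the same failure because nothing on the error path changes the row, so the query keeps returning it.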
### Crawler will re-fire on service restart
`config.yaml` has 42 crawl targets configured across 4 tiers. The `crawler_scheduler_loop` waits 60 seconds after startup, then begins crawling. Web content that was ingested previously was purged in the cleanup earlier today (8,464 rows deleted from the DB, 47,262 Qdrant points deleted, 8,454 scratch directories removed). If the crawler fires again on the next restart, it will begin re-ingesting the same content.
Under the refactor, web ingestion is not part of the core pipeline. The crawler stays in the codebase as dead code or is deleted outright, and `config.yaml` has no `crawler.sites` to iterate.
### Deferred transcripts still at the old location
278 PeerTube transcripts were classified as STATE 2 in the Phase B migration. These have DB entries but no Qdrant vectors, meaning they never completed embedding. They currently live at `/opt/recon/data/text/{hash}/` and were explicitly left in place to be reprocessed after the refactor lands.
Under the refactor, these need to be re-queued through the new pipeline, either manually by copying them into `_acquired/stream/` or programmatically via a migration script.
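A migration script for the programmatic option might look roughly like the sketch below. It assumes that the staging area accepts one directory per hash and that copying (not moving) the scratch directory is acceptable. Neither assumption is confirmed by this document, and `requeue_deferred` is a hypothetical name.

```python
import shutil
from pathlib import Path


def requeue_deferred(hashes, text_root, acquired_stream, dry_run=True):
    """Copy each deferred transcript's scratch directory into staging.

    Returns the list of (source, destination) pairs that were planned.
    With dry_run=True nothing is copied, so the plan can be reviewed first.
    """
    planned = []
    for doc_hash in hashes:
        src = Path(text_root) / doc_hash
        dst = Path(acquired_stream) / doc_hash
        # Skip hashes with no scratch directory or an existing staged copy.
        if not src.is_dir() or dst.exists():
            continue
        planned.append((src, dst))
        if not dry_run:
            shutil.copytree(src, dst)
    return planned
```

Copying leaves the originals under `/opt/recon/data/text/` in place until the new pipeline has demonstrably processed them, which fits the requirement that this data survive the refactor.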
### Transcripts filed by source, not by domain
The 18,855 successfully migrated transcripts are currently at `library/_sources/streamecho6/{channel}/{sanitized_title}__{hash8}/`. Under the target architecture, all content is filed by domain (`library/Domain/Subdomain/`). This requires a post-refactor resweep that moves every transcript to its domain-appropriate location based on its existing concept classification.
The resweep is expected to be fast — the PDF sweep scaffolding is proven and reusable — and there is no enrichment cost because the concepts are already computed.
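The path mapping at the core of such a resweep is small. The sketch below assumes the leaf directory name (`{sanitized_title}__{hash8}`) is kept unchanged under the domain tree, which this document does not state; `resweep_target` is a hypothetical helper, and the domain and subdomain values would come from the existing concept classifications.

```python
from pathlib import Path


def resweep_target(current, library_root, domain, subdomain):
    """Map a source-filed transcript directory to its domain-filed location.

    Assumption: the leaf directory name is preserved as-is.
    """
    return Path(library_root) / domain / subdomain / Path(current).name
```

For example, a transcript directory ending in `some_title__1a2b3c4d` and classified as `Domain`/`Subdomain` would map to `library/Domain/Subdomain/some_title__1a2b3c4d`.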
### Stale documentation
`PROJECT-BIBLE.md` in the RECON repo is the canonical architecture doc. It is out of date in several places:
- References `recon_knowledge` as the Qdrant collection name; the actual collection is `recon_knowledge_hybrid`
- References ~13,000 PDFs in the library; actual count is 10,679
- Backup sizes for `data/text/` and `data/concepts/` are off by an order of magnitude
- Does not mention Stream B or `_acquired/` staging
- Describes the pipeline in its legacy form
The refactor will produce a new project bible as part of the final cleanup phase.
## What's in flight but paused
- **PeerTube acquisition:** paused at the source. No new transcripts are being fetched. The existing 19,133 are either in the library (18,855) or deferred (278).
- **`recon.service`:** stopped. Has been stopped since the cleanup operations earlier today. Will not restart until the refactor is complete.
- **Acquisition of new PDFs:** not currently in flight. No active scraping or downloading.
The system is quiescent. This is the ideal time to refactor — no moving parts, no new content arriving, no risk of data loss from concurrent modifications.
## Data that must survive the refactor
Nothing in this list can be lost or corrupted during the refactor. All of it represents either irreplaceable source content or expensive API call results that would cost real money to regenerate.
- **Qdrant collection `recon_knowledge_hybrid`** — approximately 2.32M vector points across ~29,812 distinct documents. Represents the output of Gemini enrichment and TEI embedding. Regeneration cost: high (Gemini API calls) plus time.
- **`recon.db`** — SQLite database with catalogue, documents, file_operations, and other tables. Source of truth for pipeline state and document metadata.
- **`/mnt/library/`** — source files. 10,679 PDFs filed by domain, 18,855 transcripts at `_sources/streamecho6/`. Total approximately 700+ GB.
- **`/opt/recon/data/concepts/`** — concept JSONs from Gemini enrichment. Not strictly irreplaceable (Qdrant has the embedded form) but still represents API cost if lost.
- **`/opt/recon/data/text/{hash}/` for the 278 STATE 2 transcripts** — extracted text waiting for reprocessing. Regeneration would require re-fetching captions from PeerTube, which is possible but wasteful.
Backups of `recon.db` exist on both CT 130 and cortex at `/tmp/recon.db.prepeertubemigration.20260414_033857.bak`, along with several earlier pre-operation snapshots. A fresh backup should be taken immediately before the refactor begins.
## Code that the refactor preserves as library functions
These modules stay in place and are called by the new processors and shared stages:
- `lib/extractor.py` — PDF text extraction. Becomes a library function called by the PDF processor.
- `lib/enricher.py` — Gemini enrichment. Stays as a stage-loop worker, unchanged except for path resolution.
- `lib/embedder.py` — TEI embedding. Stays as a stage-loop worker, unchanged except for path resolution.
- `lib/organizer.py` `determine_dominant_domain()` — domain classification. Stays as a library function called by the shared filing stage.
- `lib/utils.py` `sanitize_filename()` and helpers — PDF filename sanitization. Called by the PDF processor.
- `lib/peertube_scraper.py` API and caption fetching logic — becomes the transcript acquisition module. The file-writing parts get rewritten; the API client parts stay.
- `lib/api.py` — dashboard and API routes. Unchanged.
- `lib/status.py` — database access. Gets new methods for processor-friendly queries but existing methods stay.
## Code that the refactor removes or replaces
- `scan_library()` and `queue_all()` — no longer needed; files enter the pipeline via `_acquired/`, not via library walks.
- `lib/new_pipeline.py` `ingest_scan()`, `ingest_acquire()`, `ingest_place()` — replaced by the dispatcher and processor architecture. The logic for PDF acquisition, collision resolution, and filing is reused but rewired.
- `lib/organizer.py` `organize_document()` as it exists today — replaced by a shared filing function that handles any content type. The domain classification piece inside it is preserved.
- `lib/web_scraper.py` and `lib/crawler.py` — either deleted or left as dead code with `crawler.sites` empty. Not part of the target architecture.
- `peertube_scanner_loop` in `recon.py` — replaced by a transcript acquisition module that drops into `_acquired/stream/`.
## Metrics to track through the refactor
Things whose values should stay the same (or change predictably) as the refactor proceeds. If any of these drift unexpectedly, something is wrong.
- Qdrant point count: baseline 2,320,695 (post-cleanup)
- Catalogue row count: baseline 29,812
- Documents row count: baseline 29,812
- PDF file count in library: baseline 10,679
- Transcript file count in library: baseline 18,855 (will migrate to domain tree during the refactor)
- Disk usage of `/opt/recon/data/text/`: baseline around 261 MB (only the 278 STATE 2 transcripts remain)
- Disk usage of `/opt/recon/data/concepts/`: baseline around 4.2 GB
These numbers are the ground truth. The refactor must not lose rows, vectors, or files. Any discrepancy is a bug to investigate before proceeding.
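A check against the exact-count baselines above could be sketched as follows. The metric key names and the `find_drift` function are assumptions for illustration; how the observed values are collected (Qdrant, SQLite, filesystem) is left out. The two disk-usage figures are approximate in this document, so they are not part of the exact comparison.

```python
# Exact-count baselines copied from the list above; key names are illustrative.
BASELINE = {
    "qdrant_points": 2_320_695,
    "catalogue_rows": 29_812,
    "documents_rows": 29_812,
    "library_pdfs": 10_679,
    "library_transcripts": 18_855,
}


def find_drift(observed, baseline=BASELINE):
    """Return {metric: (expected, observed)} for each metric that differs.

    A metric missing from `observed` is reported with None as its value.
    """
    return {
        name: (expected, observed.get(name))
        for name, expected in baseline.items()
        if observed.get(name) != expected
    }
```

An empty result means every tracked count still matches its baseline. Any non-empty result names the metric to investigate before proceeding.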