mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 06:34:34 +02:00
Initial design docs for RECON pipeline refactor
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
commit
aa195825e3
7 changed files with 814 additions and 0 deletions
5
.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
.DS_Store
*.swp
*~
.vscode/
.idea/
38
README.md
Normal file
@@ -0,0 +1,38 @@
# refactored-recon

Design documents for the RECON pipeline refactor. The goal is to restructure RECON's ingestion pipeline into a hopper-based, type-dispatched architecture where new content sources can be added by writing a small acquisition module and a small processor module without touching shared infrastructure.

This repo is design-only. Implementation happens in the RECON repo; this repo tracks the thinking, the decisions, and the phased migration plan with git history so the architecture can evolve visibly over time.

## Status

- Design drafted: 2026-04-14
- Implementation status: not started
- Current system: recon.service stopped pending refactor

## Documents

- [architecture.md](architecture.md): target architecture. The hopper model, processor pattern, lifecycle, contracts.
- [current-state.md](current-state.md): where the system is today, what works, what's broken, what's technical debt.
- [migration-plan.md](migration-plan.md): phased plan to get from current to target without data loss or extended downtime.
- [decisions.md](decisions.md): architectural decision record. The forks we considered and why we chose what we chose.
- [phases/](phases/): detailed per-phase execution plans (to be filled in as each phase is scoped).

## Read order

If you're new to this design, read in this order:

1. `current-state.md`: understand what exists
2. `architecture.md`: understand the target
3. `decisions.md`: understand why the target looks the way it does
4. `migration-plan.md`: understand how we get there

## Principles

Three principles shaped every decision in this design. When in doubt about a detail, fall back to these:

**Modularity on the edges, uniformity in the middle.** Each content source (PDFs, transcripts, HTML, future types) is its own acquisition module and its own processor. They share nothing except the enrich/embed infrastructure and the filesystem contract. Adding a new type touches only the two new modules and one line of config.

**State is a directory.** A file's location on disk tells you what stage of the pipeline it's in. Acquired but unprocessed → sitting in `_acquired/`. Being worked on → sitting in `_processing/`. Done → sitting in the library under its final name. No status tracking that isn't reflected in where the file actually lives.

**Small atomic transitions.** Files move between stages as complete units with all their metadata updated together: filesystem, catalogue, documents table, and Qdrant payloads in one transition. Partial state is the enemy. If any part of a transition fails, the file stays where it was.
202
architecture.md
Normal file
@@ -0,0 +1,202 @@
# Architecture

The target architecture for RECON. This is what we are building toward. Current state and migration plan are separate documents.

## Overview

RECON is a content ingestion pipeline. Files arrive from various sources (manual PDF uploads, PeerTube transcript scraping, future Kiwix imports, future web rebuilds), get processed through text extraction, enrichment, and embedding, and end up filed in a searchable library organized by domain and subdomain.

The architecture splits this lifecycle into three stages connected by well-defined handoffs, with the type-specific work isolated into small modules and the expensive shared infrastructure (enrichment, embedding) used as a library by whichever module needs it.

## Physical layout

```
/opt/recon/data/
  acquired/
    pdf/       ← PDF acquisition modules drop here
    stream/    ← transcript acquisition modules drop here
    html/      ← kiwix / html acquisition modules drop here
    (anything in the acquired/ root is ignored)
  processing/
    {hash}/    ← scratch for in-flight work, flat and hash-indexed
  concepts/    ← enrichment output (existing)
  recon.db

/mnt/library/
  Domain1/Subdomain/file.pdf
  Domain1/Subdomain/transcript.txt
  Domain2/Subdomain/article.html
  ...
```

Three locations, each with a clear meaning:

- **`_acquired/<type>/`**: waiting room. A file here has been fetched from its source and is waiting to be picked up by the dispatcher.
- **`_processing/{hash}/`**: work zone. A file here is being actively processed. It holds the source file, extracted text pages, `meta.json`, and whatever scratch the processor needs.
- **`library/Domain/Subdomain/`**: permanent home. Finished, filed, renamed, human-browsable.

`_acquired/` and `_processing/` live under RECON (`/opt/recon/data/`), not under `/mnt/library/`. The library is kept clean: only finished content touches it. This gives two clean backup targets (RECON state and library content) and prevents half-processed files from ever being visible to the file server or to anyone browsing the library.

## Lifecycle

Every item follows the same five steps regardless of content type.

**1. Acquisition.** Some module fetches content from a source (PeerTube API, a download script, a manual upload handler) and drops it in the appropriate `_acquired/<type>/` subfolder. The acquisition module knows only two things: how to get the content, and which subfolder to drop it in. It does not care what happens next.

**2. Dispatch.** A dispatcher watches each `_acquired/<type>/` subfolder. When it sees a new item that's been stable on disk for some mtime threshold (i.e., not still being written), it hands the item to the processor registered for that subfolder. The dispatcher is dumb: it does no inspection, no type sniffing, no content analysis. Folder determines type. Type determines processor.

**3. Pre-flight (processor-specific).** The processor takes ownership of the item and runs the pre-flight check before doing any expensive work:

- Compute the content hash and look it up in the catalogue. If found, it is a byte-identical duplicate. Delete the file. Done.
- Extract cheap metadata from the file (title, edition/volume, author, year) using whatever type-appropriate method the processor chooses.
- Derive the level-4 canonical name (`Title_Edition_Author_Year`). Search the catalogue for any existing entry that matches all four fields but has a different hash.
- If a level-4 match is found, move the file to a `_duplicates/` quarantine for human review. Flag it, log it, done.
- If any of the four fields could not be extracted, the strict match fails; treat the file as a non-duplicate and proceed.
- Otherwise, move the file from `_acquired/<type>/` to `_processing/{hash}/` and standardize it into `page_NNNN.txt` + `meta.json` form. Update the catalogue and documents table. Set status to `extracted`.

**4. Enrichment and embedding (shared).** The existing enrich and embed stage loops pick up items by status. Enrichment reads text pages, calls Gemini, writes concept JSONs, sets status to `enriched`. Embedding reads concepts, pushes vectors into Qdrant, sets status to `complete`. These stages are source-agnostic: they don't know or care what kind of content produced the text pages. They are shared infrastructure.

**5. Filing (shared).** A single organizer watches for items with `status='complete' AND organized_at IS NULL`. For each:

- Read dominant domain from the concept JSONs (existing logic).
- Derive the canonical name starting at level 1 (`Title`) and escalating through levels 2, 3, 4 only if needed to resolve collisions at the target path.
- Move the source file from `_processing/{hash}/` to `library/Domain/Subdomain/{canonical_name}.{ext}`.
- Update catalogue, documents, and Qdrant payloads atomically to reflect the final name and path.
- Clean up the `_processing/{hash}/` scratch directory.
- Set `organized_at`.

Every content type goes through the same five steps. What varies by type is bounded and well-defined.
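The filing step (step 5) can be sketched as a move followed by a database update that is undone if the update fails. This is a minimal sketch under stated assumptions, not RECON's organizer: `file_item` is a hypothetical name, the `documents` columns (`hash`, `path`, `organized_at`) are an assumed schema, and the catalogue and Qdrant payload updates are left out for brevity.

```python
import os
import shutil
import sqlite3


def file_item(conn: sqlite3.Connection, content_hash: str, src: str, dest: str) -> None:
    """Move src to dest and record the new path; put the file back on failure.

    Hypothetical helper. The documents(hash, path, organized_at) schema is an
    assumption made for this sketch.
    """
    if os.path.exists(dest):
        raise FileExistsError(dest)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(src, dest)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE documents SET path = ?, organized_at = datetime('now') "
                "WHERE hash = ?",
                (dest, content_hash),
            )
            if cur.rowcount != 1:
                raise LookupError(f"no documents row for {content_hash}")
    except Exception:
        shutil.move(dest, src)  # the file stays where it was
        raise
```

When source and destination are on different filesystems, `shutil.move` copies and then deletes, so the move itself is not atomic; the rollback here covers only a failed database update.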

## The processor contract

A processor is a module with a small, well-defined interface. It owns pre-flight for its content type. Everything else is shared.

```python
# Minimum interface for a processor module

def pre_flight(item_path: str, db, config) -> dict:
    """Handle the item from _acquired/ to _processing/.

    Returns a dict describing the outcome:
      {'action': 'extracted', 'hash': ...}       -> moved to _processing/, ready for enrich
      {'action': 'duplicate_hash', 'hash': ...}  -> byte-identical, deleted
      {'action': 'duplicate_name', 'hash': ...}  -> quarantined for review
      {'action': 'error', 'error': '...'}        -> something went wrong, item stays in _acquired/
    """
```

What the processor is responsible for:

- Knowing where its input is (its `_acquired/<type>/` subfolder, configured via the dispatch registry)
- Hash-based duplicate detection against the catalogue
- Cheap metadata extraction for the pre-enrichment name-based duplicate check
- Name-based duplicate detection at level 4
- Moving the item from `_acquired/` to `_processing/`
- Converting the source into standardized `page_NNNN.txt` + `meta.json` pages
- Updating catalogue + documents + setting status to `extracted`

What the processor is NOT responsible for:

- Enrichment (shared)
- Embedding (shared)
- Canonical naming derivation at filing time (shared: the organizer handles level 1 → 2 → 3 → 4 escalation)
- Filing to the library (shared: the organizer moves items and updates DB+Qdrant atomically)
- Domain classification (shared: the organizer reads concepts)

This split is deliberate. Pre-flight is type-specific because metadata extraction depends heavily on the source format. Filing is type-agnostic because by that point everything has been reduced to "a source file, its hash, and its concept JSONs", and that is enough to classify and file.
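The two duplicate checks a processor owns can be sketched as a small decision function. This is an illustrative sketch, not RECON's code: `sha256_of`, `level4_key`, and `decide` are hypothetical names, SHA-256 is an assumed hash (the design does not name one), and the catalogue is modelled as two in-memory lookups rather than database queries.

```python
import hashlib
from typing import Optional

FIELDS = ("title", "edition", "author", "year")


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large sources are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def level4_key(meta: dict) -> Optional[str]:
    """Return Title_Edition_Author_Year, or None if any field is missing."""
    values = [str(meta.get(field) or "").strip() for field in FIELDS]
    if not all(values):
        return None  # strict match cannot run
    return "_".join(values)


def decide(content_hash: str, meta: dict, known_hashes: set, known_names: dict) -> str:
    """Pick the pre-flight action.

    known_names maps a level-4 key to the hash already catalogued under it
    (an assumed shape for this sketch).
    """
    if content_hash in known_hashes:
        return "duplicate_hash"  # byte-identical: delete
    key = level4_key(meta)
    if key is not None and known_names.get(key, content_hash) != content_hash:
        return "duplicate_name"  # same name, different bytes: quarantine
    return "extracted"  # proceed to extraction
```

The return values reuse the `action` strings from the contract so a real `pre_flight` could wrap the result directly.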

## The dispatcher

The dispatcher is a small component that watches `_acquired/<type>/` subfolders and hands items to processors. Its config is a flat dict:

```yaml
dispatch:
  pdf: pdf_processor
  stream: transcript_processor
  html: html_processor
```

Key is the subfolder name. Value is the processor module name. Adding a new content type is one line in this config plus a new processor module file.

The dispatcher's logic is:

1. For each configured subfolder, list contents.
2. For each file/directory that has been stable on disk longer than the mtime threshold, import the processor module and call `pre_flight(item_path, db, config)`.
3. Record the outcome. Retry on transient errors (with backoff). Leave the item in place on persistent errors.
4. Sleep. Repeat.

Items in `_acquired/` root (not in a subfolder) are ignored. No error, no movement, no warning. The filesystem itself is the alert: `ls _acquired/` will show them.
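One pass of the dispatcher described above can be sketched as follows. This is a sketch under assumptions, not the real dispatcher: `load_processors`, `is_stable`, and `dispatch_once` are hypothetical names, the 30-second threshold is an invented default, and retry with backoff and the sleep loop are omitted.

```python
import importlib
import os
import time
from typing import Callable, Dict, List, Optional, Tuple


def load_processors(dispatch: Dict[str, str]) -> Dict[str, Callable]:
    """Resolve each configured module name to its pre_flight function."""
    processors = {}
    for subfolder, module_name in dispatch.items():
        module = importlib.import_module(module_name)
        pre_flight = getattr(module, "pre_flight", None)
        if not callable(pre_flight):
            raise TypeError(f"{module_name} does not define pre_flight()")
        processors[subfolder] = pre_flight
    return processors


def is_stable(path: str, threshold_s: float, now: Optional[float] = None) -> bool:
    """True if the item's mtime is at least threshold_s seconds old."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) >= threshold_s


def dispatch_once(acquired_root: str, processors: Dict[str, Callable], db, config,
                  threshold_s: float = 30.0,  # assumed value, not from the design
                  now: Optional[float] = None) -> List[Tuple[str, dict]]:
    """Run one pass over the configured subfolders only; the root is never listed."""
    outcomes = []
    for subfolder, pre_flight in processors.items():
        folder = os.path.join(acquired_root, subfolder)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            item = os.path.join(folder, name)
            if not is_stable(item, threshold_s, now):
                continue  # possibly still being written; look again next pass
            try:
                outcome = pre_flight(item, db, config)
            except Exception as exc:  # the item stays in place on failure
                outcome = {"action": "error", "error": str(exc)}
            outcomes.append((item, outcome))
    return outcomes
```

Because only configured subfolders are listed, a file at the root is skipped without any special-case code. Note that a directory's own mtime changes only when entries are added or removed, so a real check for directory items may need to look at the files inside.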

## Naming

Canonical names follow a four-level hierarchy. The processor derives the full level-4 name for the pre-flight duplicate check; at filing time the organizer escalates through the levels only when needed to resolve collisions at the target library path.

```
Level 1: Title
Level 2: Title_Author
Level 3: Title_Edition_Author
Level 4: Title_Edition_Author_Year
```

At pre-flight time, the processor derives level 4 and checks the catalogue for existing matches. Strict match: all four fields must be present and equal. Missing fields mean the check cannot run and the file proceeds as non-duplicate.

At filing time, the organizer starts at level 1 and escalates only to resolve physical collisions in the target Domain/Subdomain folder. Most files file at level 1.

Duplicate detection semantics:

- **Byte-identical (same hash):** delete immediately, no review, no cost.
- **Level-4 name match with different hash:** quarantine for human review. Could be a better scan, a re-scan with corrections, an edition the metadata missed. Human decides.
- **Everything else:** proceed through the pipeline normally.

Different editions of the same work are kept as separate documents because edition and year are part of the level-4 key. Concept-level deduplication of near-identical content can happen later as a separate cleanup activity if needed.
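The filing-time escalation in this section can be sketched as a loop over the four levels. This is a sketch only: `canonical_name` and `resolve_name` are hypothetical names, skipping missing fields and raising when all four levels collide are assumptions (the design does not say what happens in either case), and the `exists` parameter is injected so the logic can be exercised without touching the library.

```python
import os
from typing import Callable, Tuple

# Field order for each level, taken from the hierarchy above.
LEVELS = (
    ("title",),
    ("title", "author"),
    ("title", "edition", "author"),
    ("title", "edition", "author", "year"),
)


def canonical_name(meta: dict, level: int) -> str:
    """Join the fields for a level with underscores, skipping missing ones."""
    return "_".join(str(meta[f]) for f in LEVELS[level - 1] if meta.get(f))


def resolve_name(meta: dict, ext: str, target_dir: str,
                 exists: Callable[[str], bool] = os.path.exists) -> Tuple[str, int]:
    """Return the first (name, level) whose target path is free."""
    for level in range(1, len(LEVELS) + 1):
        name = canonical_name(meta, level)
        if not exists(os.path.join(target_dir, f"{name}.{ext}")):
            return name, level
    # Assumption: the design does not define behaviour past level 4.
    raise FileExistsError(f"all four levels collide in {target_dir}")
```

For example, if `Title.pdf` already exists in the target folder, a second work with the same title would be filed as `Title_Author.pdf` at level 2.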

## State transitions

The filesystem is the primary state indicator. The database tracks detail but the high-level "where is this item in its lifecycle" is always visible as a directory listing.

| Filesystem location | Meaning | DB status |
|---|---|---|
| `_acquired/<type>/` | Waiting for dispatcher | Not in DB yet |
| `_processing/{hash}/` | In-flight | `queued`, `extracting`, `extracted`, `enriching`, `enriched`, `embedding` |
| `library/Domain/Subdomain/` | Finished | `complete` with `organized_at` set |
| `_duplicates/` | Quarantined | DB entry with duplicate flag |

Crashes and partial failures leave files in `_acquired/` or `_processing/` where they can be inspected. Nothing ever silently disappears.
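The table above maps directly onto a lookup from path to stage. The sketch below is illustrative: `stage_of` is a hypothetical helper, the directory names (`acquired`, `processing`) follow the physical layout tree, and because the design does not say where `_duplicates/` lives, any path containing that component is treated as quarantined.

```python
from pathlib import PurePosixPath
from typing import Optional


def _relative(path: PurePosixPath, root: PurePosixPath) -> Optional[PurePosixPath]:
    """Return path relative to root, or None if path is not under root."""
    try:
        return path.relative_to(root)
    except ValueError:
        return None


def stage_of(path: str,
             data_root: str = "/opt/recon/data",
             library_root: str = "/mnt/library") -> str:
    """Derive the lifecycle stage of an item purely from where it sits on disk."""
    p = PurePosixPath(path)
    if "_duplicates" in p.parts:  # location assumed; not specified in the design
        return "quarantined"
    rel = _relative(p, PurePosixPath(data_root) / "acquired")
    if rel is not None:
        # Items need a type subfolder; anything directly in the root is ignored.
        return "waiting" if len(rel.parts) >= 2 else "ignored"
    if _relative(p, PurePosixPath(data_root) / "processing") is not None:
        return "in-flight"
    if _relative(p, PurePosixPath(library_root)) is not None:
        return "finished"
    return "unknown"
```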

## Enrichment and embedding (unchanged)

The refactor does not touch how enrichment or embedding work internally. The existing `lib/enricher.py` and `lib/embedder.py` keep doing what they do today. What changes is WHERE they read from: instead of `/opt/recon/data/text/{hash}/`, they read from `/opt/recon/data/processing/{hash}/`. Both are processor-agnostic: they just read `page_NNNN.txt` and `meta.json` from a directory.

The path resolution change is small: either a helper function (`resolve_text_dir(hash)`, returning `_processing/{hash}/` for in-flight items) or a direct change to the constant. Either way, it is a minimal diff in the shared code.
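The helper option might look like the sketch below. Only the name `resolve_text_dir` and the two directory paths come from the design; the root parameters and the fallback to the legacy `text/` directory are assumptions added for illustration (for example, to keep not-yet-migrated items readable).

```python
import os

PROCESSING_ROOT = "/opt/recon/data/processing"
LEGACY_TEXT_ROOT = "/opt/recon/data/text"


def resolve_text_dir(content_hash: str,
                     processing_root: str = PROCESSING_ROOT,
                     legacy_root: str = LEGACY_TEXT_ROOT) -> str:
    """Return the directory holding page_NNNN.txt and meta.json for a hash."""
    in_flight = os.path.join(processing_root, content_hash)
    if os.path.isdir(in_flight):
        return in_flight
    # Assumed fallback: items still sitting at the pre-refactor location.
    legacy = os.path.join(legacy_root, content_hash)
    if os.path.isdir(legacy):
        return legacy
    return in_flight  # default to the new layout
```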

## Things that do not exist in this architecture

A few things that exist today and do not have a home in the target:

- **Library-root scanning.** The current `scan_library()` walks `/mnt/library/` looking for new PDFs. Under the new architecture, nothing should arrive in the library except via the pipeline (acquired → processing → filed). Manual drops into the library are not a supported input path. If you have files to ingest, they go in `_acquired/pdf/`, not in the library tree.

- **The `catalogue` vs `documents` split.** Both tables survive the refactor but their roles become clearer: `catalogue` is the canonical "what content do we have, keyed by hash" record; `documents` is the pipeline state machine for in-flight items. The refactor does not merge them but clarifies what each is for.

- **The crawler as a background service.** Web scraping via the crawler is not part of the refactor. If web ingestion returns later, it will be as its own acquisition module that drops into `_acquired/html/` (or a new subfolder), same as any other source. The existing crawler code can stay in the codebase as a dead-but-preserved module, or be deleted. That decision is out of scope for the refactor.

- **The `_sources/streamecho6/` layout for transcripts.** The 18,855 transcripts currently filed there are a transitional artifact from a previous session's migration. Under the new architecture, transcripts are filed by domain like everything else. A resweep will move them to the Domain/Subdomain tree during the migration. This is tracked in `migration-plan.md`.

## Things that are deliberately simple

A few places where we chose simplicity over flexibility:

- **Subfolder dispatch, not pattern matching.** Filename conventions inside `_acquired/` are not enforced. Type is determined by which subfolder the file is in. This means a processor could receive any filename within its subfolder and has to handle it. The alternative, pattern matching on filename, was considered and rejected as too clever.

- **Unknown-type files are ignored.** A file dropped at the root of `_acquired/` (not in any subfolder) sits there forever until a human moves it. There is no error handling, no catch-all bucket, no warning log. This is deliberate: any automated handling of unknown files creates a second inbox that rots silently. Human attention is the recovery mechanism.

- **No processor plugin system.** Processors are registered in a static config file. No dynamic discovery, no drop-in plugins, no hot reload. The number of processors is small and known, and adding one is a three-step change: new module file, one config line, one restart.

## What this architecture optimizes for

**Expandability.** Adding a new content source should be obvious and local. Write an acquisition module, write a processor, add one config line. Done. No changes to shared infrastructure, no changes to any other processor, no changes to the enricher or embedder or organizer.

**Clarity of state.** At any moment, you should be able to see what the system is doing by looking at the filesystem. No hidden state in background threads, no need to query the database to know what's in flight. `ls /opt/recon/data/acquired/` and `ls /opt/recon/data/processing/` tell you everything.

**Recoverability from failure.** Every stage transition is atomic. Every item's state is visible. Crashes leave diagnosable residue rather than silent data loss. Duplicates are handled with explicit policy (delete on a hash match, quarantine on a name match) rather than ad hoc.

**Minimal surface for bugs.** The shared infrastructure (enrich, embed, organize) is written once, tested once, and reused by every processor. The type-specific code (pre-flight) is small and self-contained per processor. A bug in the PDF processor cannot break the transcript processor.
146
current-state.md
Normal file
@@ -0,0 +1,146 @@
# Current state

Where RECON is today, as of the start of the refactor. Honest assessment of what works, what's broken, what's technical debt, and what's load-bearing.

This document is a snapshot. It will go stale as the refactor proceeds. The migration plan references this state as its starting point.

## What works well

The core content-processing logic is solid and trusted.

- **PDF text extraction** (`lib/extractor.py`) works reliably across the library's many PDF formats.
- **Gemini enrichment** (`lib/enricher.py`) produces high-quality concept JSONs and is the most expensive part of the pipeline by dollar value.
- **TEI embedding** (`lib/embedder.py`) pushes 2.3M+ vectors into Qdrant without issue.
- **Domain classification** (`lib/organizer.py` `determine_dominant_domain()`) correctly classifies documents from their concept JSONs with well-tuned ambiguity handling.
- **PDF filename sanitization** (`lib/utils.py` `sanitize_filename()`) handles the messy real-world filenames from Anna's Archive, PDFDrive, z-lib, and military publications. Six-phase pipeline, well-tested against the library.
- **PeerTube scraping** (`lib/peertube_scraper.py`) reliably fetches captions from the local stream.echo6.co instance.
- **Dashboard and API** (`lib/api.py`) is functional and used for day-to-day monitoring.

These are assets. The refactor preserves all of them as library code called by processors and shared stages. None of this logic gets rewritten.

## What does not work well

### Two coexisting pipeline models

The codebase has two overlapping ingestion paths that do not cleanly separate. This is evidence of an earlier, incomplete refactor.

**Path A: library-root scanning (legacy)**
- `scan_library()` walks `/mnt/library/` for PDFs
- `queue_all()` moves catalogued items into the documents table
- Stage loops (`extract`, `enrich`, `embed`) pick up by status
- `organize_document()` moves finished items from library root to `Domain/Subdomain/`

**Path B: acquired-staging watchdog (newer)**
- `_acquired/` staging directory
- `ingest_scan()` → `ingest_acquire()` watchdog loop
- `_ingest/` intermediate staging
- `ingest_place()` for final filing

Both paths converge at the `documents` table with `status='queued'` and from that point forward go through the same stage loops. But the front half is duplicated, and the code has accumulated conditionals and helpers that exist only because of this split.

The refactor replaces both paths with a single, uniform flow.

### Extraction scratch is separate from the library

The current pipeline writes extracted text and metadata to `/opt/recon/data/text/{hash}/` while the source file stays in `/mnt/library/`. The scratch is keyed by hash and lives on a different filesystem from the source. This decoupling has caused problems in the past: the Phase B transcript migration had to explicitly correlate scratch locations to source locations for 18,855 transcripts.

Under the refactor, `/opt/recon/data/processing/{hash}/` holds the source file AND its extracted text AND its metadata AND its concepts, all in one place, for the duration of processing.

### Scanner race condition

`scan_library()`'s `add_to_catalogue()` uses `ON CONFLICT DO UPDATE` to overwrite the `path` column for any hash it re-encounters. During the earlier library sweep, this caused 1,015 hashes' paths to be silently overwritten because the scanner ran during the sweep. The issue was reconciled from the `file_operations` audit trail, but the underlying bug is still present: any future bulk reorganization will hit the same race if the scanner is running.

Under the refactor, the scanner no longer walks the library. Files enter the pipeline only via `_acquired/`, so there is no library-walking scanner for the race to occur in.
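The overwrite can be reproduced in a few lines of SQLite. This is a simplified stand-in, not the real `add_to_catalogue()`: the two-column `catalogue` table and both function names are assumptions, and the `DO NOTHING` variant is shown only for contrast (the refactor removes the scanner rather than changing the upsert). Upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def make_catalogue() -> sqlite3.Connection:
    """In-memory catalogue with an assumed (hash, path) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE catalogue (hash TEXT PRIMARY KEY, path TEXT)")
    return conn


def upsert_overwriting(conn: sqlite3.Connection, content_hash: str, path: str) -> None:
    """Mimics the behaviour described above: a re-seen hash gets its path replaced."""
    conn.execute(
        "INSERT INTO catalogue (hash, path) VALUES (?, ?) "
        "ON CONFLICT(hash) DO UPDATE SET path = excluded.path",
        (content_hash, path),
    )


def insert_preserving(conn: sqlite3.Connection, content_hash: str, path: str) -> None:
    """Contrast case: a re-seen hash keeps the path already recorded."""
    conn.execute(
        "INSERT INTO catalogue (hash, path) VALUES (?, ?) "
        "ON CONFLICT(hash) DO NOTHING",
        (content_hash, path),
    )


def path_of(conn: sqlite3.Connection, content_hash: str) -> str:
    row = conn.execute(
        "SELECT path FROM catalogue WHERE hash = ?", (content_hash,)
    ).fetchone()
    return row[0]
```

If a sweep records a file's new `Domain/Subdomain/` path and a concurrent scan then re-encounters the same hash at a stale location, the overwriting variant silently replaces the correct path with the stale one.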

### Organizer error-spam on service restart

The organizer loop queries `get_unorganized()` which returns any document with `status='complete' AND organized_at IS NULL`. All 19,133 existing PeerTube transcripts match this query because transcripts have never been run through the organizer: their "path" is a watch URL, and `os.path.exists(watch_url)` returns False, so `organize_document()` returns an error without setting `organized_at`. On the next poll the same transcripts are returned again. Forever.

The moment `recon.service` restarts under current code, the organizer will start error-spamming on all 19,133 transcripts every 30 seconds.

**This bug does not exist under the refactor** because transcripts enter the same filing path as PDFs and get `organized_at` set naturally.
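The loop can be illustrated with a toy model of one organizer poll. This is not the real `get_unorganized()` or `organize_document()`: `organizer_poll` and the dict-based document records are hypothetical, and the URL is a placeholder rather than a real watch URL.

```python
import os


def organizer_poll(documents: list) -> list:
    """One poll: try to file every complete document that has no organized_at.

    Returns the hashes that errored. Documents are plain dicts with assumed
    keys: hash, status, path, organized_at.
    """
    errors = []
    for doc in documents:
        if doc["status"] != "complete" or doc["organized_at"] is not None:
            continue
        if not os.path.exists(doc["path"]):
            errors.append(doc["hash"])  # error returned; organized_at stays unset
            continue
        doc["organized_at"] = "filed"
    return errors
```

Because a failed document never gets `organized_at`, every later poll selects it again and produces the same error.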

### Crawler will re-fire on service restart

`config.yaml` has 42 crawl targets configured across 4 tiers. The `crawler_scheduler_loop` waits 60 seconds after startup, then begins crawling. Web content that was ingested previously was purged in the cleanup earlier today (8,464 rows deleted from the DB, 47,262 Qdrant points deleted, 8,454 scratch directories removed). If the crawler fires again on the next restart, it will begin re-ingesting the same content.

Under the refactor, web ingestion is not part of the core pipeline. The crawler stays in the codebase as dead code or is deleted outright, and `config.yaml` has no `crawler.sites` to iterate.

### Deferred transcripts still at the old location

278 PeerTube transcripts were classified as STATE 2 in the Phase B migration. These have DB entries but no Qdrant vectors, meaning they never completed embedding. They currently live at `/opt/recon/data/text/{hash}/` and were explicitly left in place to be reprocessed after the refactor lands.

Under the refactor, these need to be re-queued through the new pipeline, either manually by copying them into `_acquired/stream/` or programmatically via a migration script.
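A migration script for the programmatic route could be as small as the sketch below. It is hypothetical throughout: `requeue` is an invented name, and whether the transcript processor accepts a directory of already-extracted pages in `_acquired/stream/` is an assumption the design does not confirm. The script copies rather than moves, so the originals stay in place until the re-run has been verified.

```python
import os
import shutil
from typing import Iterable, List


def requeue(hashes: Iterable[str], text_root: str, stream_dir: str) -> List[str]:
    """Copy each deferred {hash}/ directory into the stream hopper.

    Skips hashes with no source directory and hashes already present in the
    hopper. Returns the hashes that were copied.
    """
    os.makedirs(stream_dir, exist_ok=True)
    copied = []
    for content_hash in hashes:
        src = os.path.join(text_root, content_hash)
        dest = os.path.join(stream_dir, content_hash)
        if not os.path.isdir(src) or os.path.exists(dest):
            continue
        shutil.copytree(src, dest)
        copied.append(content_hash)
    return copied
```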

### Transcripts filed by source, not by domain

The 18,855 successfully-migrated transcripts are currently at `library/_sources/streamecho6/{channel}/{sanitized_title}__{hash8}/`. Under the target architecture, all content files by domain (`library/Domain/Subdomain/`). This requires a post-refactor resweep that moves every transcript to its domain-appropriate location based on existing concept classifications.

The resweep is expected to be fast, because the PDF sweep scaffolding is proven and reusable, and there is no enrichment cost because the concepts are already computed.

### Stale documentation

`PROJECT-BIBLE.md` in the RECON repo is the canonical architecture doc. It is out of date in several places:

- References `recon_knowledge` as the Qdrant collection name; the actual collection is `recon_knowledge_hybrid`
- References ~13,000 PDFs in the library; actual count is 10,679
- Backup sizes for `data/text/` and `data/concepts/` are off by an order of magnitude
- Does not mention Stream B or `_acquired/` staging
- Describes the pipeline in its legacy form

The refactor will produce a new project bible as part of the final cleanup phase.

## What's in flight but paused

- **PeerTube acquisition:** paused at the source. No new transcripts are being fetched. The existing 19,133 are either in the library (18,855) or deferred (278).
- **`recon.service`:** stopped. Has been stopped since the cleanup operations earlier today. Will not restart until the refactor is complete.
- **Acquisition of new PDFs:** not currently in flight. No active scraping or downloading.

The system is quiescent. This is the ideal time to refactor: no moving parts, no new content arriving, no risk of data loss from concurrent modifications.

## Data that must survive the refactor

Nothing in this list can be lost or corrupted during the refactor. All of it represents either irreplaceable source content or expensive API call results that would cost real money to regenerate.

- **Qdrant collection `recon_knowledge_hybrid`**: approximately 2.32M vector points across ~29,812 distinct documents. Represents the output of Gemini enrichment and TEI embedding. Regeneration cost: high (Gemini API calls) plus time.
- **`recon.db`**: SQLite database with catalogue, documents, file_operations, and other tables. Source of truth for pipeline state and document metadata.
- **`/mnt/library/`**: source files. 10,679 PDFs filed by domain, 18,855 transcripts at `_sources/streamecho6/`. Total: 700+ GB.
- **`/opt/recon/data/concepts/`**: concept JSONs from Gemini enrichment. Not strictly irreplaceable (Qdrant has the embedded form) but still represents API cost if lost.
- **`/opt/recon/data/text/{hash}/` for the 278 STATE 2 transcripts**: extracted text waiting for reprocessing. Regeneration would require re-fetching captions from PeerTube, which is possible but wasteful.

Backups of `recon.db` exist on both CT 130 and cortex at `/tmp/recon.db.prepeertubemigration.20260414_033857.bak`, along with several earlier pre-operation snapshots. A fresh backup should be taken immediately before the refactor begins.

## Code that the refactor preserves as library functions

These modules stay in place and are called by the new processors and shared stages:

- `lib/extractor.py`: PDF text extraction. Becomes a library function called by the PDF processor.
- `lib/enricher.py`: Gemini enrichment. Stays as a stage-loop worker, unchanged except for path resolution.
- `lib/embedder.py`: TEI embedding. Stays as a stage-loop worker, unchanged except for path resolution.
- `lib/organizer.py` `determine_dominant_domain()`: domain classification. Stays as a library function called by the shared filing stage.
- `lib/utils.py` `sanitize_filename()` and helpers: PDF filename sanitization. Called by the PDF processor.
- `lib/peertube_scraper.py` API and caption fetching logic: becomes the transcript acquisition module. The file-writing parts get rewritten; the API client parts stay.
- `lib/api.py`: dashboard and API routes. Unchanged.
- `lib/status.py`: database access. Gets new methods for processor-friendly queries but existing methods stay.

## Code that the refactor removes or replaces

- `scan_library()` and `queue_all()`: no longer needed; files enter the pipeline via `_acquired/`, not via library walks.
- `lib/new_pipeline.py` `ingest_scan()`, `ingest_acquire()`, `ingest_place()`: replaced by the dispatcher and processor architecture. The logic for PDF acquisition, collision resolution, and filing is reused but rewired.
- `lib/organizer.py` `organize_document()` as it exists today: replaced by a shared filing function that handles any content type. The domain classification piece inside it is preserved.
- `lib/web_scraper.py` and `lib/crawler.py`: either deleted or left as dead code with `crawler.sites` empty. Not part of the target architecture.
- `peertube_scanner_loop` in `recon.py`: replaced by a transcript acquisition module that drops into `_acquired/stream/`.

## Metrics to track through the refactor

Things whose values should stay the same (or change predictably) as the refactor proceeds. If any of these drift unexpectedly, something is wrong.

- Qdrant point count: baseline 2,320,695 (post-cleanup)
- Catalogue row count: baseline 29,812
- Documents row count: baseline 29,812
- PDF file count in library: baseline 10,679
- Transcript file count in library: baseline 18,855 (will migrate to domain tree during the refactor)
- Disk usage of `/opt/recon/data/text/`: baseline around 261 MB (only the 278 STATE 2 transcripts remain)
- Disk usage of `/opt/recon/data/concepts/`: baseline around 4.2 GB

These numbers are the ground truth. The refactor must not lose rows, vectors, or files. Any discrepancy is a bug to investigate before proceeding.
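A drift check against the exact-count baselines could be sketched like this. The metric keys and the `find_drift` helper are hypothetical names; the numbers are the baselines listed above. How the current values are collected (Qdrant count, SQL `COUNT(*)`, file walk) is left out, and the two approximate disk-usage figures are omitted because they are not exact values.

```python
BASELINE = {
    "qdrant_points": 2_320_695,
    "catalogue_rows": 29_812,
    "documents_rows": 29_812,
    "library_pdfs": 10_679,
    "library_transcripts": 18_855,
}


def find_drift(current: dict, expected: dict = BASELINE) -> dict:
    """Return {metric: (expected, actual)} for every metric that differs or is missing."""
    return {
        name: (value, current.get(name))
        for name, value in expected.items()
        if current.get(name) != value
    }
```

An empty result means every tracked count still matches its baseline; any entry in the result names a metric to investigate before proceeding.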
197
decisions.md
Normal file
@@ -0,0 +1,197 @@
# Decisions

Architectural decisions made during the design of the RECON refactor. Each entry captures a choice, the alternatives considered, the rationale, and the status. New decisions get appended; existing decisions get edited in place when the thinking changes (commit history shows the evolution).

The format loosely follows ADR (Architecture Decision Record) conventions but is kept informal.
|
||||
|
||||
---
|
||||
|
||||
## ADR-001: Hopper-based ingestion with type-subfolder dispatch
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** Current RECON has two overlapping ingestion paths (library-root scanning and `_acquired/` watchdog) that duplicate front-half logic. Adding new content types today requires touching shared code in multiple places. The user wants a modular approach where new content types can be added without refactoring existing ones.
|
||||
|
||||
**Decision:** Adopt a hopper model where all acquisition modules drop items into type-specific subfolders of `_acquired/`. A dispatcher watches each subfolder and routes items to type-specific processors. The subfolder determines the type — no filename conventions, no content sniffing.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Central rule table mapping filename patterns to processors.** Rejected: requires filename convention enforcement, adds complexity for ambiguous cases, more code than a subfolder listing.
|
||||
2. **Self-registering processors via module-level attributes.** Rejected: implicit, harder to inspect, overkill for a small number of processors.
|
||||
3. **Unified single hopper with content sniffing.** Rejected: requires every processor to read every item to decide if it owns it, or requires a central type-identifier that duplicates what a subfolder already provides for free.
|
||||
|
||||
**Consequences:**
|
||||
- Adding a new content type is one new subfolder, one new processor module, one config line.
|
||||
- Acquisition modules must know their target subfolder at write time (trivial — they already know what they produce).
|
||||
- Files dropped at `_acquired/` root are orphaned. Accepted as a feature, not a bug: human attention is the recovery mechanism.
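
The routing rule in this decision can be sketched in a few lines. This is an illustration under assumptions, not the dispatcher itself: `resolve_processor`, the `DISPATCH` table, and the name `transcript_processor` are hypothetical, and in the design the mapping would come from config rather than being hard-coded.

```python
from pathlib import Path

# Hypothetical mapping for illustration; the design reads this from config.
DISPATCH = {"pdf": "pdf_processor", "stream": "transcript_processor"}


def resolve_processor(item_path, acquired_root, dispatch=DISPATCH):
    """Return the processor name for an item, or None if it is orphaned."""
    relative = Path(item_path).relative_to(acquired_root)
    if len(relative.parts) < 2:
        return None  # dropped at the _acquired/ root: no type, left for a human
    # The first path component is the type; an unknown subfolder yields None.
    return dispatch.get(relative.parts[0])
```

Nothing here opens the file or inspects its name, which is the point of the decision: the subfolder alone carries the type.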
|
||||
|
||||
---
|
||||
|
||||
## ADR-002: `_acquired/` and `_processing/` live under RECON, not under the library
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** The hopper and processing scratch could live either under `/mnt/library/` (where the current `_acquired/` exists) or under `/opt/recon/data/` (co-located with RECON's other state). The user preferred keeping the library clean.
|
||||
|
||||
**Decision:** Both `_acquired/` and `_processing/` live under `/opt/recon/data/`. The library contains only finished, filed content.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Keep both in `/mnt/library/`.** Rejected: pollutes the library with in-flight content, complicates backups (library backup now needs to exclude `_acquired/` and `_processing/`), and exposes half-processed files to the file server.
|
||||
2. **`_acquired/` under library, `_processing/` under RECON.** Rejected: splits the pipeline state across two filesystems. It is a hybrid approach with no clear benefit.
|
||||
|
||||
**Consequences:**
|
||||
- Two clean backup targets: library (finished content only) and RECON (all pipeline state including hopper and scratch).
|
||||
- The library never contains in-flight content. Nothing half-processed is ever visible to anyone browsing.
|
||||
- Acquisition modules must write over NFS if they run on a host other than CT 130. For modules running on CT 130, writes are local.
|
||||
- The final move from `_processing/` to `library/Domain/Subdomain/` is a cross-filesystem operation (copy + delete). Acceptable for the volumes involved.
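
Because the final move crosses filesystems, it cannot be a single atomic rename. A hedged sketch of a copy, verify, then delete sequence follows; `move_verified` and `sha256_of` are hypothetical helper names, and verifying by SHA-256 is an assumption of this sketch rather than something the design specifies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_verified(src, dst):
    """Copy src to dst, confirm the copy matches, and only then delete src."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        dst.unlink()  # discard the bad copy, keep the original
        raise IOError(f"copy verification failed for {src}")
    src.unlink()
    return dst
```

The original is removed only after the copy is confirmed, so an interrupted move leaves at worst an extra file, never a missing one.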
|
||||
|
||||
---
|
||||
|
||||
## ADR-003: Everything files to `library/Domain/Subdomain/` regardless of type
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** Transcripts are currently filed at `library/_sources/streamecho6/channel/title__hash8/`, which is source-oriented rather than domain-oriented. PDFs file by domain. The choice was between keeping these as two separate conventions and unifying them.
|
||||
|
||||
**Decision:** All content types file by domain. Transcripts, PDFs, HTML articles, and any future type all end up in `library/Domain/Subdomain/` with a canonical name. The filing logic is shared across all types.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Keep transcripts at `_sources/streamecho6/` and PDFs at `Domain/Subdomain/`.** Rejected: creates two different filing conventions and forces the organizer to dispatch by type. User preference was for uniformity.
|
||||
2. **File by source for everything.** Rejected: fragments PDFs by acquisition source, which is not meaningful — a PDF from Anna's Archive and the same PDF from a manual upload should file to the same domain location.
|
||||
|
||||
**Consequences:**
|
||||
- The existing 18,855 transcripts at `_sources/streamecho6/` need to be re-filed during the migration. This is a resweep that reuses the PDF sweep scaffolding.
|
||||
- Human browsability of "all transcripts from channel X" goes away. Replaced by search — the Qdrant `channel_name` field can be filtered on.
|
||||
- The filing logic is fully shared; no per-type dispatch.
|
||||
|
||||
---
|
||||
|
||||
## ADR-004: Four-level naming hierarchy with strict level-4 dedupe check
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** Name-based duplicate detection needs a clear rule for what counts as "the same work." Too aggressive (match on title alone) produces false positives; too conservative (match only on byte-identical content) misses non-identical duplicates.
|
||||
|
||||
**Decision:** Canonical names follow a four-level hierarchy:
|
||||
1. `Title`
|
||||
2. `Title_Author`
|
||||
3. `Title_Edition_Author`
|
||||
4. `Title_Edition_Author_Year`
|
||||
|
||||
Pre-enrichment duplicate detection uses a strict level-4 match — all four fields must be present and equal to count as a duplicate. Missing fields fail the match and the item proceeds as non-duplicate.
|
||||
|
||||
At filing time, the organizer starts at level 1 and escalates only if needed to resolve physical collisions at the target library path.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Level-2 check (Title + Author).** Rejected: too aggressive. Different editions of the same book by the same author would be flagged as duplicates and quarantined, creating false-positive review load.
|
||||
2. **Fuzzy match on missing fields.** Rejected: produces false positives when metadata extraction is incomplete. Strict is safer.
|
||||
3. **Level-1 filing only (always use `Title` unless collision).** Accepted for filing — the escalation only happens on collision, and most files file at level 1.
|
||||
|
||||
**Consequences:**
|
||||
- Different editions of the same work are kept as separate documents. Each gets its own concepts, its own vectors, its own library entry.
|
||||
- Concept-level dedupe becomes a possible future cleanup activity — if two editions have concepts that are substantively identical, they can be merged at the vector level, separate from file-level deduplication.
|
||||
- Pre-enrichment metadata extraction must reliably get title, edition, author, and year. This is hard; it likely requires reading the first few pages and using a cheap LLM call or a structured parser. The cost of that extraction is the price of the dedupe savings.
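
The two uses of the hierarchy (strict level-4 matching before enrichment, level-1-first escalation at filing time) can be sketched as follows. The function names, the dict-based metadata shape, and the choice to skip a level whose fields are missing are assumptions for illustration; normalization of case and whitespace is omitted.

```python
# Field order per level, following the hierarchy listed in the decision.
LEVELS = [
    ("title",),
    ("title", "author"),
    ("title", "edition", "author"),
    ("title", "edition", "author", "year"),
]
LEVEL_4 = LEVELS[-1]


def is_level4_duplicate(candidate, existing):
    """Strict match: all four fields present on both sides and equal."""
    for field in LEVEL_4:
        if not candidate.get(field) or not existing.get(field):
            return False  # a missing field fails the match
        if candidate[field] != existing[field]:
            return False
    return True


def filing_name(meta, taken):
    """Start at level 1 and escalate only while the name collides."""
    for fields in LEVELS:
        if not all(meta.get(f) for f in fields):
            continue  # assumption: skip levels the metadata cannot build
        name = "_".join(str(meta[f]) for f in fields)
        if name not in taken:
            return name
    return None  # still colliding at level 4: the caller must decide
```

For example, a work whose level-1 name is already taken files as `Title_Author`, and two editions by the same author never match at level 4 because their `edition` fields differ.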
|
||||
|
||||
---
|
||||
|
||||
## ADR-005: Duplicate handling — hash match deletes, name match quarantines
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** Duplicate detection has two trigger conditions (byte-identical content and matching canonical name with different hash). The response to each should match the confidence level.
|
||||
|
||||
**Decision:**
|
||||
- **Byte-identical (hash match):** delete the file immediately. No review, no logging beyond debug. Zero risk of data loss because the existing copy is still in the catalogue.
|
||||
- **Name match (level 4, different hash):** move to a `_duplicates/` quarantine folder and flag for human review. Preserve the file. Do not process.
|
||||
- **No match:** proceed through the pipeline normally.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Quarantine both.** Rejected: hash matches are certain duplicates and don't warrant human attention.
|
||||
2. **Delete both.** Rejected: level-4 name matches can be legitimate re-ingests (better scans, corrected OCR, different metadata) and should not be silently destroyed.
|
||||
3. **Process everything and deduplicate at concept level later.** Rejected: spends enrichment API calls on exactly the files that the dedupe check exists to skip.
|
||||
|
||||
**Consequences:**
|
||||
- The `_duplicates/` folder needs a human review process. Items sit there indefinitely until someone decides.
|
||||
- Hash match deletions are silent. There is no audit trail beyond a debug log and the presence of the original in the catalogue.
|
||||
- The quarantine gives a natural place to inspect near-duplicates without losing data.
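
The three-way decision above reduces to a small pure function. This is a sketch under assumptions: `duplicate_action`, its parameters, and the use of in-memory sets for known hashes and names are hypothetical, and the real lookups would presumably go through the database.

```python
def duplicate_action(file_hash, name_key, known_hashes, known_names):
    """Return 'delete', 'quarantine', or 'proceed' for an incoming item.

    name_key is the level-4 canonical name, or None when any of the four
    fields could not be extracted (in which case it can never match).
    """
    if file_hash in known_hashes:
        return "delete"  # byte-identical: the existing copy is catalogued
    if name_key is not None and name_key in known_names:
        return "quarantine"  # same work, different bytes: human review
    return "proceed"
```

The hash check runs first, so a byte-identical file is deleted even when its name also matches, which keeps certain duplicates out of the review queue.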
|
||||
|
||||
---
|
||||
|
||||
## ADR-006: Shared enrich/embed infrastructure, processor-specific pre-flight
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** Some pipeline work is type-specific (extracting text from a PDF vs parsing a VTT vs scraping HTML) and some is type-agnostic (calling Gemini on text pages, embedding concepts into Qdrant). The question is where to draw the line.
|
||||
|
||||
**Decision:** Each processor owns its type-specific pre-flight: dedupe check, metadata extraction, source-to-text-pages conversion, move to `_processing/`. After that, shared stage loops handle enrichment and embedding. The organizer handles filing. None of those shared stages know or care what type produced the item.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Processors own end-to-end flow.** Rejected: duplicates enrichment and embedding logic across every processor, and loses batching benefits for API-rate-limited stages.
|
||||
2. **Single universal pipeline with no processors.** Rejected: cannot handle type-specific pre-flight (PDF metadata extraction vs VTT parsing are genuinely different operations).
|
||||
|
||||
**Consequences:**
|
||||
- Enrichment and embedding keep their current stage-loop architecture. Batching across items continues to work for throughput and rate-limit management.
|
||||
- The processor interface is small: one function, `pre_flight(item_path, db, config)`.
|
||||
- Filing is shared and handles any type uniformly.
|
||||
- The convergence point is "standardized `page_NNNN.txt` + `meta.json` in `_processing/{hash}/`." From there, everything is uniform.
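
To make the convergence point concrete, here is a sketch of what a processor might write at the end of pre-flight. The helper name `write_processing_dir` is hypothetical, and reading `page_NNNN` as a four-digit, zero-padded, one-based page number is an assumption of this sketch.

```python
import json
from pathlib import Path


def write_processing_dir(processing_root, file_hash, pages, meta):
    """Write page_NNNN.txt files plus meta.json under _processing/{hash}/."""
    target = Path(processing_root) / file_hash
    target.mkdir(parents=True, exist_ok=True)
    for number, text in enumerate(pages, start=1):
        # Assumed convention: page_0001.txt, page_0002.txt, ...
        (target / f"page_{number:04d}.txt").write_text(text, encoding="utf-8")
    (target / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return target
```

Whether the pages came from a PDF extractor or a VTT parser is invisible in the output, which is what lets the shared enrich and embed stages stay type-agnostic.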
|
||||
|
||||
---
|
||||
|
||||
## ADR-007: Refactor, not rebuild
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** The design described in `architecture.md` is substantially different from the current implementation. We considered rebuilding RECON from scratch against the new design vs refactoring the existing codebase incrementally.
|
||||
|
||||
**Decision:** Refactor, not rebuild. Preserve all existing core logic (extraction, enrichment, embedding, classification, sanitization) as library functions. Rewire the orchestration layer. Migrate data in place.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Rebuild from scratch.** Rejected on three grounds: (a) the user trusts the existing code and wants to preserve it, (b) the "cannot lose data" constraint makes parallel-build-and-cutover risky, (c) downtime is available but not unlimited.
|
||||
2. **Minimal patch to unbreak current state without architectural change.** Rejected: does not solve the underlying design issues and postpones the inevitable.
|
||||
|
||||
**Consequences:**
|
||||
- The refactor proceeds in phases with each phase independently verifiable and rollback-able.
|
||||
- Code that works stays. Code that wires things together changes.
|
||||
- Data (Qdrant, DB, library files) is migrated in place via explicit migration steps, not via a fresh rebuild.
|
||||
- Each phase has a clear go/no-go gate before moving on.
|
||||
|
||||
---
|
||||
|
||||
## ADR-008: Transcripts will be re-filed by domain during the migration
|
||||
|
||||
**Status:** accepted
|
||||
|
||||
**Context:** 18,855 transcripts are currently filed at `library/_sources/streamecho6/channel/title__hash8/`. Under the target architecture, everything files by domain. The existing transcripts need to move to `library/Domain/Subdomain/` with canonical names derived from their titles.
|
||||
|
||||
**Decision:** Execute a transcript resweep during the migration, using the same sweep scaffolding built for the earlier PDF sweep. Read domain classifications from existing concept JSONs (no new enrichment needed), derive canonical names from existing meta.json titles, move files, update catalogue + documents + Qdrant atomically.
|
||||
|
||||
**Alternatives considered:**
|
||||
|
||||
1. **Leave transcripts where they are as a legacy exception.** Rejected: creates a permanent exception to an otherwise-uniform architecture.
|
||||
2. **Reprocess transcripts through the full new pipeline.** Rejected: wastes Gemini API calls on content that has already been enriched.
|
||||
|
||||
**Consequences:**
|
||||
- One additional migration step in the plan, reusing proven sweep tooling.
|
||||
- No new enrichment cost — classifications already exist.
|
||||
- The `_sources/streamecho6/` directory tree goes away after the resweep.
|
||||
- Transcripts in the library become searchable alongside PDFs by domain rather than by source.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
These are things we know we need to decide but haven't yet.
|
||||
|
||||
- **Pre-enrichment metadata extraction method for PDFs.** What's the cheapest reliable way to get title, edition, author, and year from a PDF before running full enrichment? Options: parse PDF metadata fields (fast, unreliable), parse filename (fast, unreliable for junk filenames), read first-page text with heuristics (medium cost, medium reliability), small LLM call on first-page text (medium cost, high reliability). Needs experimentation.
|
||||
|
||||
- **Cleanup of `_processing/` scratch after filing.** Keep or delete the extracted text and concept JSONs after an item is filed? Keep means re-embedding is possible without re-extracting, useful if embedding models change. Delete means scratch doesn't grow unbounded. Open.
|
||||
|
||||
- **Fate of `lib/web_scraper.py` and `lib/crawler.py`.** Leave them in the codebase as dead code, delete them outright, or preserve them in case web ingestion returns later? Open.
|
||||
|
||||
- **Metrics and observability for the new pipeline.** The current dashboard queries DB status counts. Under the refactor, pipeline state is partly in the DB and partly on disk (`_acquired/` and `_processing/` contents). Should the dashboard be extended to show filesystem state? Open.
|
||||
205
migration-plan.md
Normal file
|
|
@ -0,0 +1,205 @@
|
|||
# Migration plan
|
||||
|
||||
How we get from the current state to the target architecture without losing data and with the ability to stop, verify, or roll back at each step.
|
||||
|
||||
This document is the high-level plan. Per-phase execution details will live in the `phases/` directory as each phase is scoped. Phase-level docs will be drafted just-in-time, not all at once — the details of phase 3 will be informed by what we learn in phases 1 and 2.
|
||||
|
||||
## Guiding principles for the migration
|
||||
|
||||
**One phase at a time.** Each phase has a clear objective, a clear go/no-go gate, and a rollback plan. No phase starts until the previous one is verified complete.
|
||||
|
||||
**Preserve data at every step.** Backups before destructive operations. Dry runs before real runs. Verification after every operation. The `cannot lose data` constraint is absolute.
|
||||
|
||||
**Minimize downtime risk by minimizing time pressure.** `recon.service` is already stopped and can stay stopped. We are not trying to get the system back online quickly. We are trying to get it back online correctly.
|
||||
|
||||
**Reuse proven tooling.** The PDF sweep scaffolding from the earlier session is proven machinery. The cleanup-with-preflight-backup-and-go pattern is a known-good workflow. Don't reinvent either.
|
||||
|
||||
**Write before cutting.** Build new code paths, verify them in isolation, then cut over. The old code paths stay alive until the new ones are proven.
|
||||
|
||||
## Phase overview
|
||||
|
||||
Seven phases (0 through 6), executed in order. Each phase's rough objective is described here. Detailed execution plans live in `phases/phase-N-*.md` and are written just before each phase runs.
|
||||
|
||||
---
|
||||
|
||||
### Phase 0: Baseline capture and backup
|
||||
|
||||
**Objective:** establish a verified baseline of the current state so that the refactor starts from known-good ground truth, and create backups of everything that cannot be regenerated.
|
||||
|
||||
**Activities:**
|
||||
- Full backup of `recon.db` to a new dated snapshot on CT 130 and cortex, MD5 verified.
|
||||
- Full backup of `config.yaml` to a new dated snapshot.
|
||||
- Capture baseline metrics: Qdrant point count, catalogue row count, documents row count, library PDF count, library transcript count, disk usage by directory.
|
||||
- Write the baseline numbers to a file in this repo under `phases/phase-0-baseline.md` for future verification.
|
||||
- Verify `recon.service` is stopped and will not restart automatically (check systemd unit state).
|
||||
- Verify PeerTube acquisition is paused at the source.
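
The MD5 verification in the backup steps can be sketched like this. The function names are hypothetical, and the existing backup scripts may already do the equivalent; this only illustrates the check the gate depends on.

```python
import hashlib


def md5_of(path):
    """Compute the MD5 of a file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_verified(original, copies):
    """True only if there is at least one copy and every copy matches."""
    expected = md5_of(original)
    return bool(copies) and all(md5_of(copy) == expected for copy in copies)
```

An empty list of copies counts as a failure on purpose: "no backup" should never pass a gate that requires verified backups.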
|
||||
|
||||
**Go/no-go gate:** all backups verified by MD5, all baseline numbers recorded, service is confirmed stopped. If any backup fails or any metric cannot be captured, do not proceed.
|
||||
|
||||
**Rollback:** none needed. Read-only phase.
|
||||
|
||||
**Estimated effort:** small. Mostly running existing scripts and writing down numbers.
|
||||
|
||||
---
|
||||
|
||||
### Phase 1: Scaffolding — new directories and config, no behavior change
|
||||
|
||||
**Objective:** put the new physical layout in place and wire up configuration without changing any pipeline behavior. The system at the end of this phase is indistinguishable from the system at the start, except that new directories exist and config has new keys.
|
||||
|
||||
**Activities:**
|
||||
- Create `/opt/recon/data/acquired/` with `pdf/`, `stream/`, `html/` subdirectories.
|
||||
- Create `/opt/recon/data/processing/`.
|
||||
- Update `config.yaml` to register the new paths under a new `pipeline` section. Keep existing config untouched so legacy code still works.
|
||||
- Update `config.yaml` to disable the crawler (`crawler.sites = []` with a dated comment explaining why).
|
||||
- Add a `dispatch` section to `config.yaml` registering the subfolder-to-processor mapping (even though no processors exist yet).
|
||||
- Add a `text_dir` column to the `documents` table via a schema migration. All existing rows get NULL. Existing code falls back to legacy paths when `text_dir` is NULL.
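
A sketch of the `text_dir` migration, assuming `recon.db` is a SQLite database (the plan does not say so explicitly). The function names are hypothetical; only the `documents` table and the `text_dir` column come from the plan.

```python
import sqlite3


def add_text_dir_column(conn):
    """Add documents.text_dir if it is missing; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
    if "text_dir" in columns:
        return False  # already migrated, so re-running is harmless
    conn.execute("ALTER TABLE documents ADD COLUMN text_dir TEXT")
    conn.commit()
    return True


def resolve_text_dir(row_text_dir, legacy_path):
    """Existing rows have NULL text_dir, so fall back to the legacy path."""
    return row_text_dir if row_text_dir is not None else legacy_path
```

Making the migration idempotent means an interrupted phase 1 can simply be re-run, and the NULL fallback is what keeps legacy code paths working unchanged.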
|
||||
|
||||
**Go/no-go gate:** directories exist, config parses cleanly, `text_dir` column exists, existing code paths work unchanged (verify by running `recon.py status` or equivalent read-only operation).
|
||||
|
||||
**Rollback:** drop the new directories, revert `config.yaml` from backup, drop the `text_dir` column.
|
||||
|
||||
**Estimated effort:** small. Filesystem and config changes only.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2: Shared filing refactor — prepare the organizer for any content type
|
||||
|
||||
**Objective:** refactor the existing organizer logic so it can file any content type, not just PDFs. This is preparation work for the processors — when a processor eventually moves items to `_processing/` and marks them complete, the organizer needs to be able to file them by domain regardless of type.
|
||||
|
||||
**Activities:**
|
||||
- Extract the domain classification logic from `organize_document()` into a reusable library function.
|
||||
- Write a new shared filing function that takes a hash, reads its concepts, derives canonical name and domain, moves from `_processing/{hash}/` to `library/Domain/Subdomain/`, and updates catalogue + documents + Qdrant atomically.
|
||||
- The new filing function is written but NOT wired into the service loop yet. It's library code that the existing organizer can start calling, and that future processors can also call.
|
||||
- Unit-test the new filing function in isolation using a synthetic hash and a synthetic `_processing/` directory.
|
||||
- The existing `organize_document()` continues to work for the moment. We do not remove it.
|
||||
|
||||
**Go/no-go gate:** new filing function passes isolation test, existing organizer still works, no regressions in status command.
|
||||
|
||||
**Rollback:** remove the new filing function, restore the original `organizer.py` from backup.
|
||||
|
||||
**Estimated effort:** medium. Real code change, but scoped to one file and one function.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3: Transcript processor — first end-to-end new-pipeline path
|
||||
|
||||
**Objective:** build the transcript processor as the first proof of the new architecture. End-to-end: acquisition module writes to `_acquired/stream/`, dispatcher routes to transcript processor, processor handles pre-flight and moves to `_processing/`, shared stages enrich and embed, shared filing function files to `library/Domain/Subdomain/`.
|
||||
|
||||
**Why transcripts first and not PDFs:** transcripts are simpler to pre-flight (no PDF metadata extraction required, title comes straight from the meta.json the scraper already builds), and they are the content type with existing architectural debt that the refactor is most motivated to fix. Proving the architecture on the simpler type first reduces risk.
|
||||
|
||||
**Activities:**
|
||||
- Write `lib/acquisition/peertube.py` — a thin acquisition module that wraps the existing PeerTube API logic from `lib/peertube_scraper.py` and drops items into `_acquired/stream/` in a standardized form.
|
||||
- Write `lib/processors/transcript.py` — the transcript processor. Implements `pre_flight(item_path, db, config)`: hash dedupe, extract metadata from meta.json, level-4 name dedupe, move to `_processing/{hash}/`, update DB, set status to `extracted`.
|
||||
- Write the dispatcher as a new module `lib/dispatcher.py`. Small: watches configured subfolders, calls `pre_flight` on stable items.
|
||||
- Test end-to-end with a single real transcript. Use a test transcript that is known not to be in the system yet (or temporarily remove one from the catalogue for testing).
|
||||
- Verify: the transcript lands in `_acquired/stream/`, gets picked up by the dispatcher, routed to the transcript processor, pre-flight runs cleanly, item moves to `_processing/`, enrichment runs, embedding runs, filing moves it to `library/Domain/Subdomain/`, catalogue and documents and Qdrant all reflect the final state.
|
||||
|
||||
**Go/no-go gate:** a single transcript successfully flows through the entire new pipeline end-to-end. Catalogue, documents, Qdrant, and filesystem all consistent at the end.
|
||||
|
||||
**Rollback:** revert the acquisition module, processor, and dispatcher. Restore the test transcript to its original state. The old PeerTube ingestion path still exists and can be reactivated.
|
||||
|
||||
**Estimated effort:** large. First real end-to-end implementation work. Expect surprises.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4: PDF processor — second processor proves modularity
|
||||
|
||||
**Objective:** build the PDF processor following the same pattern as the transcript processor. This phase validates that adding a new content type really is as simple as writing one acquisition module and one processor module, and it replaces the PDF-specific parts of the old pipeline.
|
||||
|
||||
**Activities:**
|
||||
- Write `lib/acquisition/manual_pdf.py` or similar — a simple acquisition module for manual PDF uploads that drops into `_acquired/pdf/`. (The dashboard upload endpoint can be rewired to use this.)
|
||||
- Write `lib/processors/pdf.py` — the PDF processor. Implements `pre_flight`: hash dedupe, pre-enrichment metadata extraction (the cheap path — see open question in decisions.md), level-4 name dedupe, move to `_processing/{hash}/`, extract text via existing `lib/extractor.py` as a library call, update DB, set status to `extracted`.
|
||||
- Register `pdf: pdf_processor` in the dispatch config.
|
||||
- Test end-to-end with a single real PDF. Same verification criteria as phase 3.
|
||||
- Deprecate but do not yet remove the old scanner-based PDF ingestion path.
|
||||
|
||||
**Go/no-go gate:** a single PDF successfully flows through the entire new pipeline end-to-end. Both the PDF and transcript processors can run concurrently without interference.
|
||||
|
||||
**Rollback:** revert the PDF processor and its acquisition module. The old scanner path is still alive and usable.
|
||||
|
||||
**Estimated effort:** large. Same scope as phase 3 but with the added complexity of PDF metadata extraction.
|
||||
|
||||
---
|
||||
|
||||
### Phase 5: Cutover — new pipeline becomes the only pipeline
|
||||
|
||||
**Objective:** switch the service loop from running the old stage loops to running the new dispatcher + shared stages. Retire the old `scan_library`, old `organize_document`, and old `new_pipeline` watchdog. Bring `recon.service` back online.
|
||||
|
||||
**Activities:**
|
||||
- Rewrite `cmd_service()` in `recon.py` to run: dispatcher thread + existing enrich/embed stage loops + new shared filing loop + dashboard + progress reporter. The old `scanner_loop`, `peertube_scanner_loop`, `crawler_scheduler_loop`, and `organizer_loop` are removed from the service.
|
||||
- Migrate the 278 STATE 2 transcripts into `_acquired/stream/` so they get reprocessed through the new pipeline.
|
||||
- Mark all existing transcripts and PDFs as `organized_at=CURRENT_TIMESTAMP` if they're already in their final library location. This prevents the new shared filing loop from trying to re-file already-filed content.
|
||||
- Resweep the 18,855 transcripts currently at `library/_sources/streamecho6/` into `library/Domain/Subdomain/` using the existing sweep scaffolding. This is a significant sub-phase but reuses proven tooling.
|
||||
- Start `recon.service`.
|
||||
- Monitor logs for a defined observation period (30 minutes minimum). Verify: dispatcher picks up any pending `_acquired/` items, enrichment and embedding run on any queued work, filing moves completed items, no error spam, no unexpected crashes.
|
||||
|
||||
**Go/no-go gate:** service runs cleanly for the observation period with no errors. All baseline metrics still match (Qdrant count, catalogue count, etc.) except where deliberately changed (transcripts relocated to Domain tree). Dashboard shows a healthy system.
|
||||
|
||||
**Rollback:** stop the service. Restore `recon.py` from backup. Restore the old service unit. Restart with old code. The library state is preserved even if the service code is reverted.
|
||||
|
||||
**Estimated effort:** very large. This is the big phase. Includes the transcript resweep as a sub-phase.
|
||||
|
||||
---
|
||||
|
||||
### Phase 6: Cleanup and documentation
|
||||
|
||||
**Objective:** remove dead code, update documentation, close out the refactor.
|
||||
|
||||
**Activities:**
|
||||
- Delete old code paths that are no longer called: `scan_library()`, old `organize_document()`, `new_pipeline.ingest_scan/ingest_acquire/ingest_place`, legacy stage helpers.
|
||||
- Decide fate of `lib/web_scraper.py` and `lib/crawler.py` (delete or preserve, per open question in decisions.md).
|
||||
- Write a new `PROJECT-BIBLE.md` reflecting the refactored architecture. Use the documents in this repo as source material.
|
||||
- Update backup scripts if they reference paths that moved.
|
||||
- Close out backlog items that the refactor addressed (organizer error-spam, scanner race, crawler re-fire, deferred transcripts).
|
||||
- Tag a release in the RECON repo marking the refactor completion.
|
||||
- Mark this refactored-recon repo as complete. Future architecture evolution gets its own design cycle.
|
||||
|
||||
**Go/no-go gate:** no orphaned code, no broken references, documentation reflects reality, backlog is updated.
|
||||
|
||||
**Rollback:** cleanup phase rollbacks are less interesting — if something was deleted that shouldn't have been, restore from git history.
|
||||
|
||||
**Estimated effort:** medium. Mostly cleanup, documentation, and bookkeeping.
|
||||
|
||||
---
|
||||
|
||||
## Cumulative effort and timeline
|
||||
|
||||
The total effort is significant. In rough order of magnitude:
|
||||
|
||||
- Phase 0: hours
|
||||
- Phase 1: hours
|
||||
- Phase 2: one session
|
||||
- Phase 3: one to two sessions
|
||||
- Phase 4: one to two sessions
|
||||
- Phase 5: one to three sessions (the resweep alone could be a session)
|
||||
- Phase 6: one session
|
||||
|
||||
Total: likely five to ten sessions of focused work, with verification and backup time between phases. This is not a single-session refactor.
|
||||
|
||||
There is no time pressure — `recon.service` can stay stopped indefinitely — so the plan optimizes for correctness and verifiability, not speed.
|
||||
|
||||
## What could go wrong
|
||||
|
||||
Risks worth flagging now:
|
||||
|
||||
**Pre-enrichment metadata extraction for PDFs may be unreliable.** The level-4 dedupe check depends on getting title, edition, author, and year reliably from a PDF without spending enrichment dollars. If the cheap extraction produces garbage for a meaningful fraction of PDFs, the dedupe check fails silently and we either produce false positives (quarantining non-duplicates) or miss real duplicates. This is the single biggest design risk and it won't be known until phase 4.
|
||||
|
||||
**The transcript resweep in phase 5 could surface issues with the existing sweep tooling.** The PDF sweep worked well for 15,595 entries; the transcript resweep will be comparable in volume. Different failure modes are possible (different file types, different library tree). Risk is manageable because the backup-and-gate pattern is already proven.
|
||||
|
||||
**Qdrant payload updates during mass moves are the historical risk.** The scanner race earlier today was an example of the DB and Qdrant getting out of sync during bulk operations. The refactor must handle this cleanly, or it will reproduce the race in a new form. The solution is atomic transitions: either everything moves together or nothing moves.
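
Since a file move, a DB update, and a Qdrant payload update cannot share one real transaction, "atomic" here most plausibly means compensation: undo the completed steps when a later one fails. A minimal sketch of that pattern follows; `run_atomically` and the step shape are assumptions, not the sweep tooling's actual interface.

```python
def run_atomically(steps):
    """Run (do, undo) pairs in order; on failure undo the completed ones.

    Each step is a pair of zero-argument callables. This is compensation,
    not a true distributed transaction: an undo can itself fail, so the
    caller should still log and verify afterwards.
    """
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):
            undo()  # roll back in reverse order of completion
        raise
```

In use, the steps for one item would be something like "move the file", "update the catalogue and documents rows", and "update the Qdrant payload", each paired with its inverse.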
|
||||
|
||||
**Unknown interactions with the dashboard.** The dashboard reads DB status and Qdrant. It should keep working through the refactor, but there may be edge cases where a transitional state looks wrong in the UI. Acceptable if it's transient and resolves when the phase completes.
|
||||
|
||||
## What is explicitly out of scope
|
||||
|
||||
- Any changes to Qdrant schema (the collection, fields, or indexing)
|
||||
- Any changes to the enrichment model or prompts
|
||||
- Any changes to the embedding model or TEI configuration
|
||||
- Any changes to `lib/api.py` routes or the dashboard UI
|
||||
- Any work on new content types beyond the two (PDFs and transcripts) that already exist
|
||||
- Any work on the PeerTube infrastructure itself (CT 110, peertube_prod, etc.)
|
||||
- The pi-nas 283 GB orphaned NFS cleanup (separate backlog item)
|
||||
- The 2,775-hash physical duplicate cleanup (separate backlog item)
|
||||
- Any change to the three-month backup schedule
|
||||
|
||||
Stay in scope. If something feels like it should be in scope, flag it as an open question in `decisions.md` and decide separately.
|
||||
21
phases/README.md
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
# Phases
|
||||
|
||||
Per-phase execution documents live here. Each phase gets its own markdown file with detailed steps, commands, verification criteria, and rollback procedures.
|
||||
|
||||
Phase documents are written just-in-time: the details of phase 3 are informed by what we learn in phases 1 and 2. Writing all seven up front would be speculative and would churn as reality hits.
|
||||
|
||||
## Naming convention
|
||||
|
||||
- `phase-0-baseline.md` — baseline capture
|
||||
- `phase-1-scaffolding.md` — new directories and config
|
||||
- `phase-2-shared-filing.md` — shared filing function
|
||||
- `phase-3-transcript-processor.md` — first processor end-to-end
|
||||
- `phase-4-pdf-processor.md` — second processor
|
||||
- `phase-5-cutover.md` — service cutover and transcript resweep
|
||||
- `phase-6-cleanup.md` — dead code removal and docs
|
||||
|
||||
## Status
|
||||
|
||||
None of the per-phase docs exist yet. They will be added as each phase is scoped.
|
||||
|
||||
The high-level plan lives in `../migration-plan.md`.
|
||||