refactored-recon/migration-plan.md

# Migration plan

How we get from the current state to the target architecture without losing data and with the ability to stop, verify, or roll back at each step.

This document is the high-level plan. Per-phase execution details will live in the `phases/` directory as each phase is scoped. Phase-level docs will be drafted just-in-time, not all at once — the details of phase 3 will be informed by what we learn in phases 1 and 2.

## Guiding principles for the migration

**One phase at a time.** Each phase has a clear objective, a clear go/no-go gate, and a rollback plan. No phase starts until the previous one is verified complete.

**Preserve data at every step.** Backups before destructive operations. Dry runs before real runs. Verification after every operation. The `cannot lose data` constraint is absolute.

**Minimize downtime risk by minimizing time pressure.** `recon.service` is already stopped and can stay stopped. We are not trying to get the system back online quickly. We are trying to get it back online correctly.

**Reuse proven tooling.** The PDF sweep scaffolding from the earlier session is proven machinery. The cleanup-with-preflight-backup-and-go pattern is a known-good workflow. Don't reinvent either.

**Write before cutting.** Build new code paths, verify them in isolation, then cut over. The old code paths stay alive until the new ones are proven.

## Phase overview

Seven phases, numbered 0 through 6, executed in order. Each phase's rough objective is described here. Detailed execution plans live in `phases/phase-N-*.md` and are written just before each phase runs.

---

### Phase 0: Baseline capture and backup

**Objective:** establish a verified baseline of the current state so that the refactor starts from known-good ground truth, and create backups of everything that cannot be regenerated.

**Activities:**
- Full backup of `recon.db` to a new dated snapshot on CT 130 and cortex, MD5 verified.
- Full backup of `config.yaml` to a new dated snapshot.
- Capture baseline metrics: Qdrant point count, catalogue row count, documents row count, library PDF count, library transcript count, disk usage by directory.
- Write the baseline numbers to a file in this repo under `phases/phase-0-baseline.md` for future verification.
- Verify `recon.service` is stopped and will not restart automatically (check systemd unit state).
- Verify PeerTube acquisition is paused at the source.
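
The backup-and-verify step can be sketched as below. This is a minimal illustration, not the existing backup script: the helper names `md5_of` and `snapshot` and the dated filename pattern are assumptions. A plain file copy of `recon.db` is only safe here because `recon.service` is stopped.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(source: Path, backup_dir: Path, today: Optional[date] = None) -> Path:
    """Copy source to a dated snapshot and verify the copy by MD5.

    Raises RuntimeError on a digest mismatch so the go/no-go gate fails loudly.
    The filename pattern (stem-YYYY-MM-DD.suffix) is a hypothetical convention.
    """
    stamp = (today or date.today()).isoformat()
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)
    if md5_of(source) != md5_of(target):
        raise RuntimeError(f"MD5 mismatch for {target}")
    return target
```

The same function would cover `config.yaml`. Copies to a remote host would still need the digest recomputed on the remote side.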

**Go/no-go gate:** all backups verified by MD5, all baseline numbers recorded, service is confirmed stopped. If any backup fails or any metric cannot be captured, do not proceed.

**Rollback:** none needed. This phase only reads from the live system; the backups and the baseline file it writes are additive.

**Estimated effort:** small. Mostly running existing scripts and writing down numbers.

---

### Phase 1: Scaffolding — new directories and config, no behavior change

**Objective:** put the new physical layout in place and wire up configuration without changing any pipeline behavior. The system at the end of this phase is indistinguishable from the system at the start, except that new directories exist and config has new keys.

**Activities:**
- Create `/opt/recon/data/acquired/` with `pdf/`, `stream/`, `html/` subdirectories.
- Create `/opt/recon/data/processing/`.
- Update `config.yaml` to register the new paths under a new `pipeline` section. Keep existing config untouched so legacy code still works.
- Update `config.yaml` to disable the crawler (`crawler.sites = []` with a dated comment explaining why).
- Add a `dispatch` section to `config.yaml` registering the subfolder-to-processor mapping (even though no processors exist yet).
- Add a `text_dir` column to the `documents` table via a schema migration. All existing rows get NULL. Existing code falls back to legacy paths when `text_dir` is NULL.
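
A minimal sketch of the `text_dir` migration and the NULL fallback follows. It assumes `recon.db` is a SQLite database, which this plan does not state, and the function names are hypothetical.

```python
import sqlite3


def add_text_dir_column(conn: sqlite3.Connection) -> bool:
    """Add documents.text_dir if it is missing. Returns True if it was added.

    Checking first makes the migration safe to run twice.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
    if "text_dir" in columns:
        return False
    # No DEFAULT clause, so every existing row gets NULL.
    conn.execute("ALTER TABLE documents ADD COLUMN text_dir TEXT")
    conn.commit()
    return True


def resolve_text_dir(text_dir, legacy_path):
    """Prefer the new column; fall back to the legacy path when it is NULL."""
    return text_dir if text_dir is not None else legacy_path
```

If the database is SQLite, note that `ALTER TABLE ... DROP COLUMN` (needed for the rollback) is only available from SQLite 3.35.0 onward.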

**Go/no-go gate:** directories exist, config parses cleanly, `text_dir` column exists, existing code paths work unchanged (verify by running `recon.py status` or equivalent read-only operation).

**Rollback:** drop the new directories, revert `config.yaml` from backup, drop the `text_dir` column.

**Estimated effort:** small. Filesystem and config changes only.

---

### Phase 2: Shared filing refactor — prepare the organizer for any content type

**Objective:** refactor the existing organizer logic so it can file any content type, not just PDFs. This is preparation work for the processors — when a processor eventually moves items to `_processing/` and marks them complete, the organizer needs to be able to file them by domain regardless of type.

**Activities:**
- Extract the domain classification logic from `organize_document()` into a reusable library function.
- Write a new shared filing function that takes a hash, reads its concepts, derives canonical name and domain, moves from `_processing/{hash}/` to `library/Domain/Subdomain/`, and updates catalogue + documents + Qdrant atomically.
- The new filing function is written but NOT wired into the service loop yet. It's library code that the existing organizer can start calling, and that future processors can also call.
- Unit-test the new filing function in isolation using a synthetic hash and a synthetic `_processing/` directory.
- The existing `organize_document()` continues to work for the moment. We do not remove it.
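
The "move and update together" behavior of the shared filing function might look roughly like the sketch below. Everything here is an assumption made for illustration: the `file_item` name, the `documents.hash` and `documents.path` columns, SQLite as the database, and `update_vector_payload` as a stand-in callable for the Qdrant payload update. The catalogue update and canonical-name derivation are left out to keep it short.

```python
import shutil
import sqlite3
from pathlib import Path
from typing import Callable


def file_item(
    conn: sqlite3.Connection,
    item_hash: str,
    processing_root: Path,
    library_root: Path,
    domain: str,
    subdomain: str,
    update_vector_payload: Callable[[str, str], None],
) -> Path:
    """Move one item into the library and record its new path, or change nothing."""
    source = processing_root / item_hash
    target = library_root / domain / subdomain / item_hash
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "UPDATE documents SET path = ?, organized_at = CURRENT_TIMESTAMP "
                "WHERE hash = ?",
                (str(target), item_hash),
            )
            # Called inside the transaction so a vector-store failure
            # also rolls back the database update.
            update_vector_payload(item_hash, str(target))
    except Exception:
        shutil.move(str(target), str(source))  # undo the filesystem move
        raise
    return target
```

This is best-effort rather than truly atomic: if the vector-store update succeeds and the database commit then fails, the payload would need a compensating update. The isolation test in this phase is a good place to exercise that failure path.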

**Go/no-go gate:** new filing function passes isolation test, existing organizer still works, no regressions in status command.

**Rollback:** remove the new filing function, restore the original `organizer.py` from backup.

**Estimated effort:** medium. Real code change, but scoped to one file and one function.

---

### Phase 3: Transcript processor — first end-to-end new-pipeline path

**Objective:** build the transcript processor as the first proof of the new architecture. End-to-end: acquisition module writes to `_acquired/stream/`, dispatcher routes to transcript processor, processor handles pre-flight and moves to `_processing/`, shared stages enrich and embed, shared filing function files to `library/Domain/Subdomain/`.

**Why transcripts first and not PDFs:** transcripts are simpler to pre-flight (no PDF metadata extraction required, title comes straight from the meta.json the scraper already builds), and they are the content type with existing architectural debt that the refactor is most motivated to fix. Proving the architecture on the simpler type first reduces risk.

**Activities:**
- Write `lib/acquisition/peertube.py` — a thin acquisition module that wraps the existing PeerTube API logic from `lib/peertube_scraper.py` and drops items into `_acquired/stream/` in a standardized form.
- Write `lib/processors/transcript.py` — the transcript processor. Implements `pre_flight(item_path, db, config)`: hash dedupe, extract metadata from meta.json, level-4 name dedupe, move to `_processing/{hash}/`, update DB, set status to `extracted`.
- Write the dispatcher as a new module `lib/dispatcher.py`. Small: watches configured subfolders, calls pre_flight on stable items.
- Test end-to-end with a single real transcript. Use a test transcript that is known not to be in the system yet (or temporarily remove one from the catalogue for testing).
- Verify: the transcript lands in `_acquired/stream/`, gets picked up by the dispatcher, routed to the transcript processor, pre-flight runs cleanly, item moves to `_processing/`, enrichment runs, embedding runs, filing moves it to `library/Domain/Subdomain/`, catalogue and documents and Qdrant all reflect the final state.
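
One pass of the dispatcher could be sketched as follows. This assumes "stable" means the item's modification time is at least `min_age` seconds old, which is a guess; the real stability check may differ. `dispatch_once`, `stable_items`, and the `routes` mapping are hypothetical names. The processors are called with the `pre_flight(item_path, db, config)` signature described above.

```python
import time
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

PreFlight = Callable[[Path, Any, Any], None]


def stable_items(folder: Path, min_age: float, now: Optional[float] = None) -> List[Path]:
    """Return entries in folder whose mtime is at least min_age seconds old."""
    now = time.time() if now is None else now
    if not folder.is_dir():
        return []  # a configured subfolder may not exist yet
    return sorted(p for p in folder.iterdir() if now - p.stat().st_mtime >= min_age)


def dispatch_once(
    acquired_root: Path,
    routes: Dict[str, PreFlight],
    db: Any,
    config: Any,
    min_age: float = 30.0,
    now: Optional[float] = None,
) -> int:
    """Run one pass over every configured subfolder; return the number of items handed off."""
    handled = 0
    for subfolder, pre_flight in routes.items():
        for item in stable_items(acquired_root / subfolder, min_age, now):
            pre_flight(item, db, config)
            handled += 1
    return handled
```

A service loop would call `dispatch_once` on a timer with `routes` built from the `dispatch` section of `config.yaml`, for example `{"stream": transcript_pre_flight}`. Passing `now` explicitly keeps the function testable without sleeping.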

**Go/no-go gate:** a single transcript successfully flows through the entire new pipeline end-to-end. Catalogue, documents, Qdrant, and filesystem all consistent at the end.

**Rollback:** revert the acquisition module, processor, and dispatcher. Restore the test transcript to its original state. The old PeerTube ingestion path still exists and can be reactivated.

**Estimated effort:** large. First real end-to-end implementation work. Expect surprises.

---

### Phase 4: PDF processor — second processor proves modularity

**Objective:** build the PDF processor following the same pattern as the transcript processor. This phase validates that adding a new content type really is as simple as writing one acquisition module and one processor module, and it replaces the PDF-specific parts of the old pipeline.

**Activities:**
- Write `lib/acquisition/manual_pdf.py` or similar — a simple acquisition module for manual PDF uploads that drops into `_acquired/pdf/`. (The dashboard upload endpoint can be rewired to use this.)
- Write `lib/processors/pdf.py` — the PDF processor. Implements `pre_flight`: hash dedupe, pre-enrichment metadata extraction (the cheap path — see open question in decisions.md), level-4 name dedupe, move to `_processing/{hash}/`, extract text via existing `lib/extractor.py` as a library call, update DB, set status to `extracted`.
- Register `pdf: pdf_processor` in the dispatch config.
- Test end-to-end with a single real PDF. Same verification criteria as phase 3.
- Deprecate but do not yet remove the old scanner-based PDF ingestion path.
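
The hash-dedupe and move-to-`_processing/{hash}/` steps shared by both processors might be sketched like this. SHA-256 is an assumption (this plan does not say which hash RECON uses), `known_hashes` stands in for a database lookup, and leaving a duplicate in place for the caller to handle is a placeholder policy.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional, Set


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_if_new(item: Path, known_hashes: Set[str], processing_root: Path) -> Optional[Path]:
    """Move item to processing_root/{hash}/ unless its hash is already known.

    Returns the staged path, or None for a duplicate (which is left in place).
    """
    item_hash = content_hash(item)
    if item_hash in known_hashes:
        return None
    dest_dir = processing_root / item_hash
    # exist_ok=False: a leftover directory for an unknown hash signals an
    # inconsistent state and should stop the run rather than be overwritten.
    dest_dir.mkdir(parents=True, exist_ok=False)
    staged = Path(shutil.move(str(item), str(dest_dir / item.name)))
    known_hashes.add(item_hash)
    return staged
```

Metadata extraction, level-4 name dedupe, and the status update to `extracted` would follow this step inside `pre_flight`.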

**Go/no-go gate:** a single PDF successfully flows through the entire new pipeline end-to-end. Both the PDF and transcript processors can run concurrently without interference.

**Rollback:** revert the PDF processor and its acquisition module. The old scanner path is still alive and usable.

**Estimated effort:** large. Same scope as phase 3 but with the added complexity of PDF metadata extraction.

---

### Phase 5: Cutover — new pipeline becomes the only pipeline

**Objective:** switch the service loop from running the old stage loops to running the new dispatcher + shared stages. Retire the old `scan_library`, the old `organize_document`, and the old `new_pipeline` watchdog. Bring `recon.service` back online.

**Activities:**
- Rewrite `cmd_service()` in `recon.py` to run: dispatcher thread + existing enrich/embed stage loops + new shared filing loop + dashboard + progress reporter. The old `scanner_loop`, `peertube_scanner_loop`, `crawler_scheduler_loop`, and `organizer_loop` are removed from the service.
- Migrate the 278 STATE 2 transcripts into `_acquired/stream/` so they get reprocessed through the new pipeline.
- Set `organized_at = CURRENT_TIMESTAMP` on every existing transcript and PDF that is already in its final library location. This prevents the new shared filing loop from trying to re-file already-filed content.
- Resweep the 18,855 transcripts currently at `library/_sources/streamecho6/` into `library/Domain/Subdomain/` using the existing sweep scaffolding. This is a significant sub-phase but reuses proven tooling.
- Start `recon.service`.
- Monitor logs for a defined observation period (30 minutes minimum). Verify: dispatcher picks up any pending `_acquired/` items, enrichment and embedding run on any queued work, filing moves completed items, no error spam, no unexpected crashes.

**Go/no-go gate:** service runs cleanly for the observation period with no errors. All baseline metrics still match (Qdrant count, catalogue count, etc.) except where deliberately changed (transcripts relocated to Domain tree). Dashboard shows a healthy system.
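
The "baseline metrics still match, except where deliberately changed" check can be made mechanical. The sketch below assumes the Phase 0 baseline has been loaded into a plain dict; the metric names in the usage example are placeholders, not the real keys.

```python
from typing import Dict, Iterable, List


def metric_drift(
    baseline: Dict[str, int],
    current: Dict[str, int],
    expected_to_change: Iterable[str] = (),
) -> List[str]:
    """List every baseline metric that changed without being expected to.

    A metric missing from current is reported as drift too.
    """
    allowed = set(expected_to_change)
    problems = []
    for name, before in sorted(baseline.items()):
        if name in allowed:
            continue
        after = current.get(name)
        if after != before:
            problems.append(f"{name}: baseline {before}, now {after}")
    return problems
```

For example, `metric_drift(baseline, current, expected_to_change=["sources_transcript_count"])` returns an empty list when the gate passes. Any non-empty result is a no-go.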

**Rollback:** stop the service. Restore `recon.py` from backup. Restore the old service unit. Restart with old code. The library state is preserved even if the service code is reverted.

**Estimated effort:** very large. This is the big phase. Includes the transcript resweep as a sub-phase.

---

### Phase 6: Cleanup and documentation

**Objective:** remove dead code, update documentation, close out the refactor.

**Activities:**
- Delete old code paths that are no longer called: `scan_library()`, old `organize_document()`, `new_pipeline.ingest_scan/ingest_acquire/ingest_place`, legacy stage helpers.
- Decide fate of `lib/web_scraper.py` and `lib/crawler.py` (delete or preserve, per open question in decisions.md).
- Write a new `PROJECT-BIBLE.md` reflecting the refactored architecture. Use the documents in this repo as source material.
- Update backup scripts if they reference paths that moved.
- Close out backlog items that the refactor addressed (organizer error-spam, scanner race, crawler re-fire, deferred transcripts).
- Tag a release in the RECON repo marking the refactor completion.
- Mark this refactored-recon repo as complete. Future architecture evolution gets its own design cycle.

**Go/no-go gate:** no orphaned code, no broken references, documentation reflects reality, backlog is updated.

**Rollback:** no staged rollback. If something was deleted that shouldn't have been, restore it from git history.

**Estimated effort:** medium. Mostly cleanup, documentation, and bookkeeping.

---

## Cumulative effort and timeline

The total effort is significant. In rough order of magnitude:

- Phase 0: hours
- Phase 1: hours
- Phase 2: one session
- Phase 3: one to two sessions
- Phase 4: one to two sessions
- Phase 5: one to three sessions (the resweep alone could be a session)
- Phase 6: one session

Total: likely five to ten sessions of focused work, with verification and backup time between phases. This is not a single-session refactor.

There is no time pressure — `recon.service` can stay stopped indefinitely — so the plan optimizes for correctness and verifiability, not speed.

## What could go wrong

Risks worth flagging now:

**Pre-enrichment metadata extraction for PDFs may be unreliable.** The level-4 dedupe check depends on getting title, edition, author, and year reliably from a PDF without spending enrichment dollars. If the cheap extraction produces garbage for a meaningful fraction of PDFs, the dedupe check fails silently and we either produce false positives (quarantining non-duplicates) or miss real duplicates. This is the single biggest design risk, and its severity won't be known until phase 4.

**The transcript resweep in phase 5 could surface issues with the existing sweep tooling.** The PDF sweep worked well for 15,595 entries; the transcript resweep will be comparable in volume. Different failure modes are possible (different file types, different library tree). The risk is manageable because the backup-and-gate pattern is already proven.

**Qdrant payload updates during mass moves are the historical risk.** The scanner race earlier today was an example of the DB and Qdrant getting out of sync during bulk operations. The refactor must handle this cleanly, or it will reproduce the race in a new form. The solution is atomic transitions: either everything moves together or nothing moves.

**Unknown interactions with the dashboard.** The dashboard reads DB status and Qdrant. It should keep working through the refactor, but there may be edge cases where a transitional state looks wrong in the UI. Acceptable if it's transient and resolves when the phase completes.

## What is explicitly out of scope

- Any changes to Qdrant schema (the collection, fields, or indexing)
- Any changes to the enrichment model or prompts
- Any changes to the embedding model or TEI configuration
- Any changes to `lib/api.py` routes or the dashboard UI
- Any work on new content types beyond the two (PDFs and transcripts) that already exist
- Any work on the PeerTube infrastructure itself (CT 110, peertube_prod, etc.)
- The pi-nas 283 GB orphaned NFS cleanup (separate backlog item)
- The 2,775-hash physical duplicate cleanup (separate backlog item)
- Any change to the three-month backup schedule

Stay in scope. If something feels like it should be in scope, flag it as an open question in `decisions.md` and decide separately.