# Decisions
Architectural decisions made during the design of the RECON refactor. Each entry captures a choice, the alternatives considered, the rationale, and the status. New decisions get appended; existing decisions get edited in place when the thinking changes (commit history shows the evolution).
The format loosely follows ADR (Architecture Decision Record) conventions but is kept informal.
---
## ADR-001: Hopper-based ingestion with type-subfolder dispatch
**Status:** accepted
**Context:** Current RECON has two overlapping ingestion paths (library-root scanning and `_acquired/` watchdog) that duplicate front-half logic. Adding new content types today requires touching shared code in multiple places. The user wants a modular approach where new content types can be added without refactoring existing ones.
**Decision:** Adopt a hopper model where all acquisition modules drop items into type-specific subfolders of `_acquired/`. A dispatcher watches each subfolder and routes items to type-specific processors. The subfolder determines the type — no filename conventions, no content sniffing.
**Alternatives considered:**
1. **Central rule table mapping filename patterns to processors.** Rejected: requires filename convention enforcement, adds complexity for ambiguous cases, more code than a subfolder listing.
2. **Self-registering processors via module-level attributes.** Rejected: implicit, harder to inspect, overkill for a small number of processors.
3. **Unified single hopper with content sniffing.** Rejected: requires every processor to read every item to decide if it owns it, or requires a central type-identifier that duplicates what a subfolder already provides for free.
**Consequences:**
- Adding a new content type is one new subfolder, one new processor module, one config line.
- Acquisition modules must know their target subfolder at write time (trivial — they already know what they produce).
- Files dropped at `_acquired/` root are orphaned. Accepted as a feature, not a bug: human attention is the recovery mechanism.
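
The dispatch rule above can be sketched in a few lines of Python. This is a minimal illustration, not RECON's actual dispatcher: `scan_hopper`, the `Processor` alias, and the `processors` mapping are hypothetical names, and the sketch does a single directory scan where the real dispatcher is described as watching each subfolder.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical processor signature: receives the path of one hopper item.
Processor = Callable[[Path], None]


def scan_hopper(
    acquired_root: Path, processors: Dict[str, Processor]
) -> Iterator[Tuple[Path, Processor]]:
    """Yield (item, processor) pairs; the subfolder name alone decides the type."""
    for subfolder in sorted(acquired_root.iterdir()):
        if not subfolder.is_dir():
            # Files dropped at the hopper root are orphaned by design
            # and left for a human to notice.
            continue
        processor = processors.get(subfolder.name)
        if processor is None:
            # No processor is configured for this subfolder, so skip it.
            continue
        for item in sorted(subfolder.iterdir()):
            yield item, processor
```

With a mapping such as `{"pdf": handle_pdf}` (names assumed for illustration), an item at `_acquired/pdf/a.pdf` is paired with `handle_pdf`, while a file at `_acquired/stray.txt` is never yielded. Adding a content type means adding one key to the mapping and one subfolder.
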
---
## ADR-002: `_acquired/` and `_processing/` live under RECON, not under the library
**Status:** accepted
**Context:** The hopper and processing scratch could live either under `/mnt/library/` (where the current `_acquired/` exists) or under `/opt/recon/data/` (co-located with RECON's other state). The user preferred keeping the library clean.
**Decision:** Both `_acquired/` and `_processing/` live under `/opt/recon/data/`. The library contains only finished, filed content.
**Alternatives considered:**
1. **Keep both in `/mnt/library/`.** Rejected: pollutes the library with in-flight content, complicates backups (library backup now needs to exclude `_acquired/` and `_processing/`), and exposes half-processed files to the file server.
2. **`_acquired/` under library, `_processing/` under RECON.** Rejected: splits pipeline state across two filesystems; it is a hybrid approach with no clear benefit.
**Consequences:**
- Two clean backup targets: library (finished content only) and RECON (all pipeline state including hopper and scratch).
- The library never contains in-flight content. Nothing half-processed is ever visible to anyone browsing.
- Acquisition modules must write over NFS if they run on a host other than CT 130. For modules running on CT 130, writes are local.
- The final move from `_processing/` to `library/Domain/Subdomain/` is a cross-filesystem operation (copy + delete). Acceptable for the volumes involved.
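
A cross-filesystem move cannot be a single atomic rename, so one way to keep half-copied files out of the library is to copy under a temporary name, verify, rename, and only then delete the source. The sketch below assumes that approach; `file_into_library`, `sha256_of`, and the dot-prefixed `.partial` naming are hypothetical and not taken from the RECON codebase.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_into_library(src: Path, dest: Path) -> Path:
    """Copy src to dest across filesystems, verify the copy, then delete src."""
    if dest.exists():
        raise FileExistsError(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a dot-prefixed temporary name keeps the in-progress copy
    # out of casual directory listings while it is being written.
    tmp = dest.with_name("." + dest.name + ".partial")
    shutil.copy2(src, tmp)
    if sha256_of(tmp) != sha256_of(src):
        tmp.unlink()
        raise OSError(f"copy verification failed for {src}")
    # tmp and dest share a directory, so this rename stays on one filesystem.
    tmp.rename(dest)
    # The source is removed only after the verified copy has its final name.
    src.unlink()
    return dest
```

If the process dies mid-copy, the source in `_processing/` is still intact and the only leftover is the `.partial` file, which can be deleted and the move retried.
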
---
## ADR-003: Everything files to `library/Domain/Subdomain/` regardless of type
**Status:** accepted
**Context:** Transcripts are currently filed at `library/_sources/streamecho6/channel/title__hash8/`, which is source-oriented rather than domain-oriented. PDFs file by domain. The user could have kept these as two separate conventions or unified them.
**Decision:** All content types file by domain. Transcripts, PDFs, HTML articles, and any future type all end up in `library/Domain/Subdomain/` with a canonical name. The filing logic is shared across all types.
**Alternatives considered:**
1. **Keep transcripts at `_sources/streamecho6/` and PDFs at `Domain/Subdomain/`.** Rejected: creates two different filing conventions and forces the organizer to dispatch by type. User preference was for uniformity.
2. **File by source for everything.** Rejected: fragments PDFs by acquisition source, which is not meaningful — a PDF from Anna's Archive and the same PDF from a manual upload should file to the same domain location.
**Consequences:**
- The existing 18,855 transcripts at `_sources/streamecho6/` need to be re-filed during the migration, via a resweep that reuses the PDF sweep scaffolding (see ADR-008).
- Human browsability of "all transcripts from channel X" goes away. Replaced by search — the Qdrant `channel_name` field can be filtered on.
- The filing logic is fully shared; no per-type dispatch.
---
## ADR-004: Four-level naming hierarchy with strict level-4 dedupe check
**Status:** accepted
**Context:** Name-based duplicate detection needs a clear rule for what counts as "the same work." Too aggressive (match on title alone) produces false positives; too permissive (match only on byte-identical content) misses non-identical duplicates.
**Decision:** Canonical names follow a four-level hierarchy:
1. `Title`
2. `Title_Author`
3. `Title_Edition_Author`
4. `Title_Edition_Author_Year`
Pre-enrichment duplicate detection uses a strict level-4 match — all four fields must be present and equal to count as a duplicate. Missing fields fail the match and the item proceeds as non-duplicate.
At filing time, the organizer starts at level 1 and escalates only if needed to resolve physical collisions at the target library path.
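
The two rules (strict level-4 matching for dedupe, level-1-first escalation for filing) can be illustrated as follows. This is a sketch under assumptions: `WorkMeta`, `level4_key`, `is_name_duplicate`, `candidate_names`, and `filing_name` are hypothetical names, fields are compared by exact string equality with no normalization, and a level whose fields are incomplete is simply skipped during escalation, which the ADR does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class WorkMeta:
    title: Optional[str] = None
    edition: Optional[str] = None
    author: Optional[str] = None
    year: Optional[str] = None


# Field order for each level, following the hierarchy listed above.
LEVELS = [
    ("title",),
    ("title", "author"),
    ("title", "edition", "author"),
    ("title", "edition", "author", "year"),
]


def level4_key(meta: WorkMeta) -> Optional[Tuple[str, ...]]:
    """Return the level-4 key, or None if any field is missing or empty."""
    fields = (meta.title, meta.edition, meta.author, meta.year)
    return fields if all(fields) else None


def is_name_duplicate(a: WorkMeta, b: WorkMeta) -> bool:
    """Strict match: all four fields present on both sides and equal."""
    key_a, key_b = level4_key(a), level4_key(b)
    return key_a is not None and key_a == key_b


def candidate_names(meta: WorkMeta) -> List[str]:
    """Canonical names in escalation order, skipping incomplete levels."""
    names = []
    for fields in LEVELS:
        values = [getattr(meta, field) for field in fields]
        if all(values):
            names.append("_".join(values))
    return names


def filing_name(meta: WorkMeta, is_taken: Callable[[str], bool]) -> Optional[str]:
    """Pick the lowest level whose name does not collide at the target path."""
    for name in candidate_names(meta):
        if not is_taken(name):
            return name
    return None  # every level collides; the ADR leaves this case undefined
```

Note that a record with a missing edition never matches anything, including an identical record, so incomplete metadata always lets the item through as a non-duplicate.
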
**Alternatives considered:**
1. **Level-2 check (Title + Author).** Rejected: too aggressive. Different editions of the same book by the same author would be flagged as duplicates and quarantined, creating false-positive review load.
2. **Fuzzy match on missing fields.** Rejected: produces false positives when metadata extraction is incomplete. Strict is safer.
3. **Level-1 filing only (always use `Title` unless collision).** Accepted for filing — the escalation only happens on collision, and most files file at level 1.
**Consequences:**
- Different editions of the same work are kept as separate documents. Each gets its own concepts, its own vectors, its own library entry.
- Concept-level dedupe becomes a possible future cleanup activity — if two editions have concepts that are substantively identical, they can be merged at the vector level, separate from file-level deduplication.
- Pre-enrichment metadata extraction must reliably get title, edition, author, and year. This is hard; it likely requires reading the first few pages and using a cheap LLM call or a structured parser. The cost of that extraction is the price of the dedupe savings.
---
## ADR-005: Duplicate handling — hash match deletes, name match quarantines
**Status:** accepted
**Context:** Duplicate detection has two trigger conditions (byte-identical content and matching canonical name with different hash). The response to each should match the confidence level.
**Decision:**
- **Byte-identical (hash match):** delete the file immediately. No review, no logging beyond debug. Zero risk of data loss because the existing copy is still in the catalogue.
- **Name match (level 4, different hash):** move to a `_duplicates/` quarantine folder and flag for human review. Preserve the file. Do not process.
- **No match:** proceed through the pipeline normally.
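
The three outcomes reduce to a small decision function in which the hash check always runs first. The sketch below is illustrative only: `Action`, `decide`, and the in-memory `known_hashes` / `known_names` sets are hypothetical stand-ins for whatever catalogue lookups RECON actually performs.

```python
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    DELETE = "delete"          # byte-identical: remove silently
    QUARANTINE = "quarantine"  # level-4 name match: move to _duplicates/
    PROCESS = "process"        # no match: continue through the pipeline


def decide(
    file_hash: str,
    level4_name: Optional[str],
    known_hashes: Set[str],
    known_names: Set[str],
) -> Action:
    """Map one incoming item to an action; hash certainty outranks name match."""
    if file_hash in known_hashes:
        return Action.DELETE
    # level4_name is None when any of the four fields could not be extracted,
    # in which case the item can never be a name duplicate.
    if level4_name is not None and level4_name in known_names:
        return Action.QUARANTINE
    return Action.PROCESS
```

An item whose hash is already known is deleted even if its name also matches, since the hash match is the more certain signal.
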
**Alternatives considered:**
1. **Quarantine both.** Rejected: hash matches are certain duplicates and don't warrant human attention.
2. **Delete both.** Rejected: level-4 name matches can be legitimate re-ingests (better scans, corrected OCR, different metadata) and should not be silently destroyed.
3. **Process everything and deduplicate at concept level later.** Rejected: wastes enrichment API calls on the specific files we want to avoid wasting money on.
**Consequences:**
- The `_duplicates/` folder needs a human review process. Items sit there indefinitely until someone decides.
- Hash match deletions are silent. There is no audit trail beyond a debug log and the presence of the original in the catalogue.
- The quarantine gives a natural place to inspect near-duplicates without losing data.
---
## ADR-006: Shared enrich/embed infrastructure, processor-specific pre-flight
**Status:** accepted
**Context:** Some pipeline work is type-specific (extracting text from a PDF vs parsing a VTT vs scraping HTML) and some is type-agnostic (calling Gemini on text pages, embedding concepts into Qdrant). The question is where to draw the line.
**Decision:** Each processor owns its type-specific pre-flight: dedupe check, metadata extraction, source-to-text-pages conversion, move to `_processing/`. After that, shared stage loops handle enrichment and embedding. The organizer handles filing. None of those shared stages know or care what type produced the item.
**Alternatives considered:**
1. **Processors own end-to-end flow.** Rejected: duplicates enrichment and embedding logic across every processor, and loses batching benefits for API-rate-limited stages.
2. **Single universal pipeline with no processors.** Rejected: cannot handle type-specific pre-flight (PDF metadata extraction vs VTT parsing are genuinely different operations).
**Consequences:**
- Enrichment and embedding keep their current stage-loop architecture. Batching across items continues to work for throughput and rate-limit management.
- The processor interface is small: one function, `pre_flight(item_path, db, config)`.
- Filing is shared and handles any type uniformly.
- The convergence point is "standardized `page_NNNN.txt` + `meta.json` in `_processing/{hash}/`." From there, everything is uniform.
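
To make the interface and the convergence layout concrete, here is a toy processor for plain-text items. Only the `pre_flight(item_path, db, config)` signature and the `page_NNNN.txt` + `meta.json` layout come from this ADR; everything else is an assumption for illustration: the `Processor` protocol, `write_converged_item`, the `processing_root` config key, SHA-256 as the hash, 1-based page numbering, the return value, and the `meta.json` fields. The dedupe check is omitted for brevity.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, List, Mapping, Protocol


class Processor(Protocol):
    """Hypothetical typing of the one-function processor interface."""

    def pre_flight(self, item_path: Path, db: Any, config: Mapping[str, Any]) -> Path:
        ...


def write_converged_item(
    processing_root: Path, item_hash: str, pages: List[str], meta: Mapping[str, Any]
) -> Path:
    """Write page_NNNN.txt files and meta.json under processing_root/{hash}/."""
    out_dir = processing_root / item_hash
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, text in enumerate(pages, start=1):  # assumed 1-based numbering
        (out_dir / f"page_{number:04d}.txt").write_text(text, encoding="utf-8")
    (out_dir / "meta.json").write_text(json.dumps(dict(meta), indent=2), encoding="utf-8")
    return out_dir


class PlainTextProcessor:
    """Toy processor: treats form feeds as page breaks in a UTF-8 text file."""

    def pre_flight(self, item_path: Path, db: Any, config: Mapping[str, Any]) -> Path:
        raw = item_path.read_bytes()
        item_hash = hashlib.sha256(raw).hexdigest()
        pages = raw.decode("utf-8").split("\f")
        meta = {"title": item_path.stem, "source_type": "text"}
        return write_converged_item(Path(config["processing_root"]), item_hash, pages, meta)
```

Once the directory is written, nothing downstream needs to know that the item started life as a text file rather than a PDF or a VTT transcript.
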
---
## ADR-007: Refactor, not rebuild
**Status:** accepted
**Context:** The design described in `architecture.md` is substantially different from the current implementation. We considered rebuilding RECON from scratch against the new design vs refactoring the existing codebase incrementally.
**Decision:** Refactor, not rebuild. Preserve all existing core logic (extraction, enrichment, embedding, classification, sanitization) as library functions. Rewire the orchestration layer. Migrate data in place.
**Alternatives considered:**
1. **Rebuild from scratch.** Rejected on three grounds: (a) the user trusts the existing code and wants to preserve it, (b) the "cannot lose data" constraint makes parallel-build-and-cutover risky, (c) downtime is available but not unlimited.
2. **Minimal patch to unbreak current state without architectural change.** Rejected: does not solve the underlying design issues and postpones the inevitable.
**Consequences:**
- The refactor proceeds in phases with each phase independently verifiable and rollback-able.
- Code that works stays. Code that wires things together changes.
- Data (Qdrant, DB, library files) is migrated in place via explicit migration steps, not via a fresh rebuild.
- Each phase has a clear go/no-go gate before moving on.
---
## ADR-008: Transcripts will be re-filed by domain during the migration
**Status:** accepted
**Context:** 18,855 transcripts are currently filed at `library/_sources/streamecho6/channel/title__hash8/`. Under the target architecture, everything files by domain. The existing transcripts need to move to `library/Domain/Subdomain/` with canonical names derived from their titles.
**Decision:** Execute a transcript resweep during the migration, using the same sweep scaffolding built for the earlier PDF sweep. Read domain classifications from existing concept JSONs (no new enrichment needed), derive canonical names from existing `meta.json` titles, move files, and update catalogue + documents + Qdrant atomically.
**Alternatives considered:**
1. **Leave transcripts where they are as a legacy exception.** Rejected: creates a permanent exception to an otherwise-uniform architecture.
2. **Reprocess transcripts through the full new pipeline.** Rejected: wastes Gemini API calls on content that has already been enriched.
**Consequences:**
- One additional migration step in the plan, reusing proven sweep tooling.
- No new enrichment cost — classifications already exist.
- The `_sources/streamecho6/` directory tree goes away after the resweep.
- Transcripts in the library become searchable alongside PDFs by domain rather than by source.
---
## Open questions
These are things we know we need to decide but haven't yet.
- **Pre-enrichment metadata extraction method for PDFs.** What's the cheapest reliable way to get title, edition, author, and year from a PDF before running full enrichment? Options: parse PDF metadata fields (fast, unreliable), parse filename (fast, unreliable for junk filenames), read first-page text with heuristics (medium cost, medium reliability), small LLM call on first-page text (medium cost, high reliability). Needs experimentation.
- **Cleanup of `_processing/` scratch after filing.** Keep or delete the extracted text and concept JSONs after an item is filed? Keep means re-embedding is possible without re-extracting, useful if embedding models change. Delete means scratch doesn't grow unbounded. Open.
- **Fate of `lib/web_scraper.py` and `lib/crawler.py`.** Both are currently dead code in the codebase. Delete outright, or preserve in case web ingestion returns later? Open.
- **Metrics and observability for the new pipeline.** The current dashboard queries DB status counts. Under the refactor, pipeline state is partly in the DB and partly on disk (`_acquired/` and `_processing/` contents). Should the dashboard be extended to show filesystem state? Open.