Set page_count on the documents row during pre_flight. Without this,
the enricher's comparison `page_count >= 3` raises a TypeError when
page_count is NULL.
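
A minimal reproduction of the failure mode, for reference. This is a sketch only: it assumes a SQLite-style store and a two-column `documents` table, and `is_multi_page()` is a hypothetical stand-in for the enricher's check, not the real code.

```python
import sqlite3

# Hypothetical minimal schema; the real documents table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, page_count INTEGER)")
# Insert a row without page_count, so the column is NULL.
conn.execute("INSERT INTO documents (hash) VALUES ('abc123')")


def is_multi_page(conn, doc_hash):
    """Hypothetical stand-in for the enricher's `page_count >= 3` check."""
    (page_count,) = conn.execute(
        "SELECT page_count FROM documents WHERE hash = ?", (doc_hash,)
    ).fetchone()
    # NULL comes back as None, and `None >= 3` raises TypeError in Python 3.
    return page_count >= 3


try:
    is_multi_page(conn, "abc123")
    null_failed = False
except TypeError:
    null_failed = True

# In spirit, what pre_flight now does: record the page count up front.
conn.execute("UPDATE documents SET page_count = ? WHERE hash = ?", (5, "abc123"))
after_fix = is_multi_page(conn, "abc123")
```

With page_count populated during pre_flight, the comparison always sees an integer.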
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/
for content+sidecar pairs and routes to registered processors
- lib/processors/transcript_processor.py: pre_flight() for transcripts
(hash, dedupe, split into pages, register in DB, set text_dir)
- lib/utils.py: resolve_text_dir() helper for text_dir column fallback
- lib/enricher.py: use resolve_text_dir() instead of hardcoded path
- lib/embedder.py: use resolve_text_dir() instead of hardcoded path
- lib/processors/__init__.py, lib/acquisition/__init__.py: package inits
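
A rough sketch of the one-shot scan described above, to illustrate the pairing and routing idea. Everything named here is an assumption for illustration: the `PROCESSORS` registry, the `.meta.json` sidecar suffix, and the function names are hypothetical and may not match lib/dispatcher.py.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: maps a type directory name to a processor callable.
PROCESSORS: Dict[str, Callable[[Path, Path], None]] = {}

# Assumption: the sidecar naming scheme is not stated in the commit message.
SIDECAR_SUFFIX = ".meta.json"


def find_pairs(type_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (content, sidecar) pairs for which both files exist."""
    pairs = []
    for content in sorted(type_dir.iterdir()):
        # Skip directories and the sidecar files themselves.
        if not content.is_file() or content.name.endswith(SIDECAR_SUFFIX):
            continue
        sidecar = content.with_name(content.name + SIDECAR_SUFFIX)
        if sidecar.is_file():
            pairs.append((content, sidecar))
    return pairs


def dispatch_once(acquired_root: Path) -> int:
    """Make one pass over acquired/<type>/ and return the number of items routed."""
    routed = 0
    for type_dir in sorted(p for p in acquired_root.iterdir() if p.is_dir()):
        processor = PROCESSORS.get(type_dir.name)
        if processor is None:
            continue  # no processor registered for this type
        for content, sidecar in find_pairs(type_dir):
            processor(content, sidecar)
            routed += 1
    return routed
```

Content files without a sidecar are left in place, so a later pass can pick them up once the sidecar arrives.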
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a reusable file_processed_item() that future processors will call to file
completed items from /opt/recon/data/processing/{hash}/ into the library.
It reuses the existing organizer logic for domain classification and collision handling.
Not yet wired into the service loop; it exists as library code for Phase 3+ to call.
Phase 2 of the refactor. See https://forge.echo6.co/matt/refactored-recon
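
For orientation, the filing step could look roughly like the sketch below. It is not the actual file_processed_item() implementation: `file_item_sketch()`, `unique_destination()`, the `classify` callable, and the numeric-suffix collision policy are all assumptions made for illustration, and the real organizer logic may differ.

```python
import shutil
from pathlib import Path
from typing import Callable


def unique_destination(dest: Path) -> Path:
    """Return dest, or dest with a numeric suffix if that name is already taken."""
    if not dest.exists():
        return dest
    counter = 1
    while True:
        candidate = dest.with_name(f"{dest.stem}_{counter}{dest.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


def file_item_sketch(
    item: Path, library_root: Path, classify: Callable[[Path], str]
) -> Path:
    """Move item into library_root/<domain>/, avoiding name collisions.

    `classify` is a hypothetical stand-in for the organizer's domain
    classification; it returns a directory name for the item.
    """
    domain_dir = library_root / classify(item)
    domain_dir.mkdir(parents=True, exist_ok=True)
    dest = unique_destination(domain_dir / item.name)
    shutil.move(str(item), str(dest))
    return dest
```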
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.
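
As an illustration of how code might read these two settings, here is a small sketch. The in-memory dict layout and the helper names are assumptions; the actual config file format and loader are not described in this commit.

```python
from typing import Any, Dict, List

# Assumed in-memory shape of the two settings named above.
config: Dict[str, Any] = {
    "new_pipeline": {"enabled": False},
    "crawler": {"sites": []},
}


def new_pipeline_enabled(cfg: Dict[str, Any]) -> bool:
    """Treat a missing section or key as disabled."""
    return bool(cfg.get("new_pipeline", {}).get("enabled", False))


def crawler_sites(cfg: Dict[str, Any]) -> List[str]:
    """Return the configured crawl targets, or an empty list if none are set."""
    return list(cfg.get("crawler", {}).get("sites", []))
```

Defaulting to disabled means a config that lacks the new keys behaves like the current one.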
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>