Phase 5c-2 failed because shutil.rmtree(ignore_errors=True) silently
skipped root-owned legacy files in processing/{hash}/ that it could not
delete, so the processor proceeded into a half-cleaned directory and
then crashed on subsequent file writes.
Changes: removed ignore_errors=True and wrapped the rmtree call in a
try/except that logs and re-raises, so the processor fails early and
visibly if stale cleanup fails.
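
A minimal sketch of the cleanup pattern described above. The function
name clean_stale_dir and the logger setup are illustrative assumptions,
not the processor's actual code.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def clean_stale_dir(stale_dir: Path) -> None:
    """Remove a stale processing directory, failing loudly on error.

    Hypothetical helper: the real processor may structure this differently.
    """
    if not stale_dir.exists():
        return
    try:
        # No ignore_errors=True: a file we cannot delete must surface here.
        shutil.rmtree(stale_dir)
    except OSError:
        # PermissionError (e.g. root-owned files) is a subclass of OSError.
        logger.exception("Stale cleanup failed for %s", stale_dir)
        raise
```

With this shape, a root-owned leftover file stops the run at cleanup
time instead of at a later, unrelated file write.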
Recovery from Phase 5c-2 failure.
- Add lib/processors/pdf_processor.py with full pre_flight pipeline
- Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini)
- Field-by-field voting with provenance tracking (metadata_provenance column)
- Level-4 strict dedupe (title+author+edition+year)
- Content failures route to _review/rejected_pdfs/
- Level-4 duplicates route to _review/duplicate_quarantine/
- Full text extraction using existing extract_text_from_page fallback chain
- Schema: added metadata_provenance TEXT to documents table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
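
The field-by-field voting above could look roughly like the sketch
below. The commit does not state the voting rule, so this assumes a
simple majority per field with ties broken in A, B, C order. The
function name vote_fields, the priority order, and the sample values are
all hypothetical.

```python
from collections import Counter

# Assumed tie-break order: A = PDF dict, B = filename, C = Gemini.
SOURCE_PRIORITY = ("A", "B", "C")


def vote_fields(candidates: dict) -> tuple:
    """Merge per-source metadata into one record plus per-field provenance.

    candidates maps a source label to {field: value}; empty values are
    treated as "no vote".
    """
    fields = {f for src in candidates.values() for f in src}
    merged, provenance = {}, {}
    for field in sorted(fields):
        votes = {}
        for source in SOURCE_PRIORITY:
            value = candidates.get(source, {}).get(field)
            if value:
                votes[source] = value
        if not votes:
            continue
        counts = Counter(votes.values())
        top = max(counts.values())
        # First source (in priority order) holding a top-voted value wins.
        winner = next(
            votes[s] for s in SOURCE_PRIORITY
            if s in votes and counts[votes[s]] == top
        )
        merged[field] = winner
        # Record every source that agreed with the winning value.
        provenance[field] = [s for s in SOURCE_PRIORITY if votes.get(s) == winner]
    return merged, provenance


# Invented sample data, for illustration only.
merged, provenance = vote_fields({
    "A": {"title": "Soldering Basics", "year": "2019"},
    "B": {"title": "soldering_basics", "year": "2019"},
    "C": {"title": "Soldering Basics", "author": "J. Doe"},
})
```

The provenance mapping could then be serialized (for example as JSON)
into the metadata_provenance TEXT column; the storage format is also an
assumption here.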
Fix 1.1: filing preserves source file extension instead of defaulting to .pdf
Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78)
Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR
Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry
Also: dispatcher now handles solo content files without .meta.json sidecar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
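
A rough sketch of how a scan can accept solo content files as well as
content+sidecar pairs. It assumes the sidecar is named
<content filename>.meta.json; that naming and the function name
find_work_items are assumptions, not taken from lib/dispatcher.py.

```python
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # assumed sidecar naming convention


def find_work_items(type_dir: Path) -> list:
    """Return (content_path, sidecar_path_or_None) for each content file."""
    items = []
    for path in sorted(type_dir.iterdir()):
        # Skip directories and the sidecars themselves.
        if not path.is_file() or path.name.endswith(SIDECAR_SUFFIX):
            continue
        sidecar = path.with_name(path.name + SIDECAR_SUFFIX)
        # A missing sidecar is allowed: the file is dispatched solo.
        items.append((path, sidecar if sidecar.exists() else None))
    return items
```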
Set page_count on the documents row during pre_flight. Without this,
the enricher's comparison `page_count >= 3` raises TypeError when the
column is NULL.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
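
The failure mode above can be reproduced in isolation. This sketch uses
an in-memory SQLite table as a stand-in for the real documents table
(the actual database and schema are assumptions); the helper names are
hypothetical.

```python
import sqlite3

# In-memory stand-in for the documents table (assumption: SQLite-like NULLs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, page_count INTEGER)")
conn.execute("INSERT INTO documents (hash) VALUES ('example')")  # page_count is NULL


def page_count_for(doc_hash):
    row = conn.execute(
        "SELECT page_count FROM documents WHERE hash = ?", (doc_hash,)
    ).fetchone()
    return row[0]  # NULL comes back as None


def is_long_enough(page_count):
    return page_count >= 3  # same comparison as in the commit message


unset = page_count_for("example")
try:
    is_long_enough(unset)
    failed = False
except TypeError:
    # None >= 3 is not a valid comparison in Python 3.
    failed = True

# Setting page_count up front, as pre_flight now does, avoids the error.
conn.execute("UPDATE documents SET page_count = ? WHERE hash = ?", (5, "example"))
```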
- lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/
  for content+sidecar pairs and routes to registered processors
- lib/processors/transcript_processor.py: pre_flight() for transcripts
  (hash, dedupe, split into pages, register in DB, set text_dir)
- lib/utils.py: resolve_text_dir() helper for text_dir column fallback
- lib/enricher.py: use resolve_text_dir() instead of hardcoded path
- lib/embedder.py: use resolve_text_dir() instead of hardcoded path
- lib/processors/__init__.py, lib/acquisition/__init__.py: package inits
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
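
One plausible shape for the text_dir fallback, shown only as a sketch.
The real signature of resolve_text_dir() is not given in this log, so
the function below uses a different name, and its parameters (doc_hash,
default_root) and the legacy layout default_root/<hash> are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_text_dir_sketch(
    text_dir: Optional[str], doc_hash: str, default_root: Path
) -> Path:
    """Prefer the stored text_dir column; otherwise use a legacy default."""
    if text_dir:
        return Path(text_dir)
    # Rows registered before the text_dir column existed have no value,
    # so fall back to the previously hardcoded location (assumed layout).
    return default_root / doc_hash
```

Routing both the enricher and the embedder through one helper keeps the
fallback rule in a single place instead of two hardcoded paths.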