mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-05-20 06:34:40 +02:00
- Add lib/processors/pdf_processor.py with full pre_flight pipeline
- Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini)
- Field-by-field voting with provenance tracking (metadata_provenance column)
- Level-4 strict dedupe (title+author+edition+year)
- Content failures route to _review/rejected_pdfs/
- Level-4 duplicates route to _review/duplicate_quarantine/
- Full text extraction using existing extract_text_from_page fallback chain
- Schema: added metadata_provenance TEXT to documents table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

| File |
|---|
| `__init__.py` |
| `pdf_processor.py` |
| `transcript_processor.py` |
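
The commit message mentions field-by-field voting across three metadata sources, a provenance record, and a strict title+author+edition+year dedupe key. The following is a minimal sketch of how such a scheme might look, not the code in `pdf_processor.py`. The source names, the priority order, the majority-then-priority tie-break, the JSON encoding of the provenance value, and the helper names `vote_metadata` and `dedupe_key` are all assumptions made for illustration.

```python
import json
import re

# Hypothetical source names and priority order (earlier wins ties).
# The commit message does not state the real order.
SOURCE_PRIORITY = ("pdf_dict", "filename", "gemini")
FIELDS = ("title", "author", "edition", "year")


def _norm(value):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return re.sub(r"\s+", " ", str(value)).strip().lower()


def vote_metadata(sources):
    """Pick one value per field and record which sources supported it.

    Returns (merged_metadata, provenance_json). The JSON string is the kind
    of value that could be stored in a TEXT column such as metadata_provenance.
    """
    merged, provenance = {}, {}
    for field in FIELDS:
        # Group sources by the normalized value they propose for this field.
        ballots = {}
        for name in SOURCE_PRIORITY:
            raw = sources.get(name, {}).get(field)
            if raw is None or _norm(raw) == "":
                continue
            ballots.setdefault(_norm(raw), []).append((name, raw))
        if not ballots:
            continue
        # Most votes wins; ties go to the value backed by the
        # highest-priority source (assumed rule).
        winner = max(
            ballots.values(),
            key=lambda votes: (len(votes), -SOURCE_PRIORITY.index(votes[0][0])),
        )
        merged[field] = winner[0][1]
        provenance[field] = [name for name, _ in winner]
    return merged, json.dumps(provenance, sort_keys=True)


def dedupe_key(metadata):
    """Strict key over title, author, edition, and year; missing fields become ''."""
    return tuple(_norm(metadata.get(field, "")) for field in FIELDS)
```

Two documents would count as strict duplicates under this sketch only when all four normalized fields match, which keeps different editions or printings of the same title apart.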