recon/lib
Matt 96e1e642c4 Phase 4: PDF processor with layered metadata extraction
- Add lib/processors/pdf_processor.py with full pre_flight pipeline
- Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini)
- Field-by-field voting with provenance tracking (metadata_provenance column)
- Level-4 strict dedupe (title+author+edition+year)
- Content failures route to _review/rejected_pdfs/
- Level-4 duplicates route to _review/duplicate_quarantine/
- Full text extraction using existing extract_text_from_page fallback chain
- Schema: add metadata_provenance TEXT column to documents table
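
The field-by-field voting and Level-4 dedupe items above could be sketched roughly as follows. This is a minimal illustration, not the code in lib/processors/pdf_processor.py: the names `vote_metadata`, `dedupe_key`, `SOURCE_PRIORITY`, the majority-then-priority tie-break, the normalization rule, and the JSON encoding of the provenance are all assumptions made for the example. Only the three sources, the four dedupe fields, and the metadata_provenance TEXT column come from the commit message.

```python
import json
from collections import Counter

# Assumed tie-break order (Source A, B, C); the real rule is not stated in the commit.
SOURCE_PRIORITY = ("pdf_dict", "filename", "gemini")
FIELDS = ("title", "author", "edition", "year")


def _norm(value):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(str(value).split()).casefold()


def vote_metadata(candidates):
    """Pick one value per field and record which sources supported it.

    candidates maps source name -> {field: value}; missing or empty values abstain.
    Returns (metadata, provenance).
    """
    metadata, provenance = {}, {}
    for field in FIELDS:
        votes = {}
        for source in SOURCE_PRIORITY:
            value = candidates.get(source, {}).get(field)
            if value not in (None, ""):
                votes[source] = value
        if not votes:
            continue  # no source offered this field
        counts = Counter(_norm(v) for v in votes.values())
        top = max(counts.values())
        # Winner: highest-priority source whose value is among the most-voted.
        winner = next(
            s for s in SOURCE_PRIORITY
            if s in votes and counts[_norm(votes[s])] == top
        )
        metadata[field] = votes[winner]
        provenance[field] = [
            s for s in SOURCE_PRIORITY
            if s in votes and _norm(votes[s]) == _norm(votes[winner])
        ]
    return metadata, provenance


def dedupe_key(metadata):
    """Strict key over title+author+edition+year; all four must match."""
    return tuple(_norm(metadata.get(f, "")) for f in FIELDS)


candidates = {
    "pdf_dict": {"title": "Field Manual", "author": "", "year": "1999"},
    "filename": {"title": "field  manual", "author": "J. Doe", "edition": "2", "year": "2001"},
    "gemini": {"title": "Field Manual", "author": "J. Doe", "edition": "2", "year": "2001"},
}
metadata, provenance = vote_metadata(candidates)
# A TEXT column can hold the provenance as a JSON string (assumed encoding).
provenance_json = json.dumps(provenance)
```

Under these assumptions, two documents are treated as duplicates only when `dedupe_key` matches on all four fields, which is why a differing edition or year keeps them apart.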

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:57:44 +00:00
acquisition Phase 3: dispatcher, transcript processor, text_dir resolution 2026-04-14 15:39:42 +00:00
processors Phase 4: PDF processor with layered metadata extraction 2026-04-14 16:57:44 +00:00
__init__.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
api.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
crawler.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
dispatcher.py Phase 4: Phase 3 cleanup fixes 2026-04-14 16:39:57 +00:00
embedder.py Phase 3: dispatcher, transcript processor, text_dir resolution 2026-04-14 15:39:42 +00:00
enricher.py Phase 3: dispatcher, transcript processor, text_dir resolution 2026-04-14 15:39:42 +00:00
extractor.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
filing.py Phase 4: Phase 3 cleanup fixes 2026-04-14 16:39:57 +00:00
ingester.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
key_manager.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
new_pipeline.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
organizer.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
peertube_collector.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
peertube_scraper.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
status.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00
utils.py Phase 3: dispatcher, transcript processor, text_dir resolution 2026-04-14 15:39:42 +00:00
web_scraper.py Initial commit: RECON codebase baseline 2026-04-14 14:57:23 +00:00