Phase 4: PDF Processor with Layered Metadata Extraction
Executed: 2026-04-14T16:40Z
Backup
| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (pre-Phase 4) | CT 130: /tmp/recon.db.phase4.20260414.bak | 1d76f8ba0f169f9a77666af56707f71d |
| Test row SQL backup | CT 130: /tmp/recon_phase4_test_93aad72f.sql | — |
Schema Change
Added metadata_provenance TEXT column to documents table. Stores JSON with voted metadata fields, per-field provenance (which source won), and raw source data from all three extraction sources.
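The schema change and the shape of the stored JSON can be sketched as follows. This is a minimal illustration, assuming recon.db is a SQLite database and that documents rows are keyed by a `hash` column; the JSON key names (`voted`, `provenance`, `sources`) and both helper functions are hypothetical, not taken from the codebase.

```python
import json
import sqlite3


def add_provenance_column(conn: sqlite3.Connection) -> None:
    """Add metadata_provenance to documents unless it already exists."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    if "metadata_provenance" not in columns:
        conn.execute("ALTER TABLE documents ADD COLUMN metadata_provenance TEXT")


def store_provenance(conn, doc_hash, voted, provenance, sources) -> None:
    """Serialize the three parts of the record into the TEXT column.

    The key names below are assumptions made for this sketch.
    """
    record = {
        "voted": voted,            # winning value per field
        "provenance": provenance,  # which source won per field
        "sources": sources,        # raw data from all three sources
    }
    conn.execute(
        "UPDATE documents SET metadata_provenance = ? WHERE hash = ?",
        (json.dumps(record), doc_hash),
    )
```

Checking for the column before running `ALTER TABLE` keeps the migration safe to re-run.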
What Was Created
lib/processors/pdf_processor.py — pre_flight()
Handles PDF content from acquired/pdf/. Implements a 17-step pipeline:
- Hash — MD5 of PDF content via content_hash()
- Stale cleanup — removes pre-existing processing/{hash}/ and concepts/{hash}/ directories
- Hash dedupe — exact content match against catalogue; removes pair if duplicate
- Size check — rejects PDFs exceeding processing.max_pdf_size_mb (default 200MB)
- Open PDF — PyPDF2 PdfReader with pdfinfo fallback for page count
- Source A — PDF info dictionary metadata (title, author, edition, year)
- Source B — Filename parsing via clean_filename_to_title() + regex patterns
- Extract first 3 pages — for Source C input, using existing extract_text_from_page() fallback chain
- Source C — Gemini LLM metadata extraction from first 3 pages (retries 3x with 30s backoff)
- Vote — per-field voting across sources; 2+ agreement wins, else priority C > A > B
- Level-4 dedupe — strict check requiring ALL FOUR fields (title, author, edition, year) present and matching an existing document
- Move to processing — PDF → processing/{hash}/source.pdf, sidecar → sidecar.meta.json
- Full text extraction — all pages via extract_text_from_page() (PyPDF2 → pdftotext → Tesseract → Gemini Vision)
- Write meta.json — extraction stats, voted metadata, provenance record
- Register in DB — add_to_catalogue() + queue_document()
- Update documents row — sets text_dir, page_count, book_title, book_author, metadata_provenance
- Status = extracted — advances to next pipeline stage
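The Level-4 dedupe step in the list above is the strictest check in the pipeline, so a small sketch may help show why it rarely fires. The function name and the normalization (case folding and whitespace collapsing) are assumptions for illustration; the real comparison rules are not documented here.

```python
from typing import Mapping, Optional

LEVEL4_FIELDS = ("title", "author", "edition", "year")


def _normalize(value) -> Optional[str]:
    """Collapse whitespace and fold case; empty values become None."""
    if value is None:
        return None
    text = " ".join(str(value).split()).casefold()
    return text or None


def is_level4_duplicate(candidate: Mapping, existing: Mapping) -> bool:
    """True only if all four fields are present on both sides and match."""
    for field in LEVEL4_FIELDS:
        left = _normalize(candidate.get(field))
        right = _normalize(existing.get(field))
        if left is None or right is None or left != right:
            return False
    return True
```

Under this reading, a document whose voted edition is null can never be flagged as a Level-4 duplicate, because a missing field fails the "all four present" requirement.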
Failure Modes
| Type | Behavior |
|---|---|
| Hash duplicate | Removes pair from acquired/, returns action='duplicate' |
| Content failure (unreadable PDF) | Moves to /mnt/library/_review/rejected_pdfs/, returns action='content_failure' |
| Level-4 duplicate | Moves to /mnt/library/_review/duplicate_quarantine/, queues for human review, returns action='level4_duplicate' |
| Gemini API transient | Retries 3x with 30s backoff; continues without Source C if exhausted |
| Oversized PDF | Moves to /mnt/library/_review/rejected_pdfs/, returns action='content_failure' |
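The "Gemini API transient" row describes a retry loop that degrades gracefully rather than failing the document. A minimal sketch follows, assuming "3x" means three attempts in total with a fixed 30-second pause between them; `TransientError` and `call_with_retries` are hypothetical names, and the real processor may classify errors differently.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for whatever the API client raises on a retryable failure."""


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn up to `attempts` times; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            # Pause only between attempts, not after the last one.
            if attempt < attempts:
                sleep(delay_s)
    return None
```

Returning `None` instead of raising lets the caller continue to the vote with only Sources A and B. The `sleep` parameter is injectable so the loop can be tested without waiting.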
Metadata Voting Example
From the end-to-end test (93aad72f — hydro-electric installation):
| Field | Source A (PDF dict) | Source B (Filename) | Source C (Gemini) | Winner |
|---|---|---|---|---|
| Title | ew-FinishHydro70g.PDF | Finalizing A Hydro Electric Installation Hackleman | Finalizing a hydro-electric installation | gemini |
| Author | Dave | — | Michael Hackleman | gemini |
| Edition | — | — | — | null |
| Year | 2001 | — | 2001 | agreed(pdf_dict,gemini) |
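The rows above follow from the voting rule (2+ agreement wins, else priority C > A > B). The sketch below reproduces them under stated assumptions: values are compared after case folding and whitespace collapsing, the source label for Source B is `filename` (only `pdf_dict` and `gemini` appear in the recorded winners), and `vote_field` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

SOURCE_ORDER = ("pdf_dict", "filename", "gemini")  # A, B, C
PRIORITY = ("gemini", "pdf_dict", "filename")      # C > A > B


def _normalize(value) -> Optional[str]:
    """Collapse whitespace and fold case; empty values become None."""
    if value is None:
        return None
    text = " ".join(str(value).split()).casefold()
    return text or None


def vote_field(values: Dict[str, Optional[str]]) -> Tuple[Optional[str], Optional[str]]:
    """Return (winning value, provenance label) for one metadata field."""
    present = {s: values[s] for s in SOURCE_ORDER if _normalize(values.get(s))}

    # Group sources by normalized value to detect agreement.
    groups: Dict[str, list] = {}
    for source in SOURCE_ORDER:
        if source in present:
            groups.setdefault(_normalize(present[source]), []).append(source)

    for sources in groups.values():
        if len(sources) >= 2:
            # Report the spelling from the highest-priority agreeing source.
            best = min(sources, key=PRIORITY.index)
            return present[best], "agreed(" + ",".join(sources) + ")"

    # No agreement: fall back to the priority order.
    for source in PRIORITY:
        if source in present:
            return present[source], source
    return None, None
```

With three sources at most one group can reach two members, so the first agreeing group found is the only one.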
Phase 3 Cleanup Fixes (also committed in this phase)
Fix 1.1: Extension preservation in filing.py
_build_target_path() calls sanitize_filename(), which defaults to a .pdf extension. For transcripts (.txt files), this produced incorrect extensions. Fix: after _build_target_path() returns, replace the target extension with the source file's actual extension.
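The extension fix can be sketched with pathlib. The helper name and the example paths are hypothetical; leaving the target unchanged when the source has no extension is an assumption of this sketch, not documented behaviour.

```python
from pathlib import Path


def preserve_source_extension(source: Path, target: Path) -> Path:
    """Return target with its suffix replaced by the source file's suffix."""
    if not source.suffix:
        # Assumption: with nothing to copy, keep the computed target as-is.
        return target
    return target.with_suffix(source.suffix)
```

For example, a target computed as `soldering-basics.pdf` for a source named `transcript.txt` becomes `soldering-basics.txt`.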
Fix 1.2: Back-fix soldering transcript
A one-off script renamed the filed soldering transcript from .pdf to .txt on the filesystem and in the catalogue, the documents table, and Qdrant (5 points).
Fix 1.3: Dispatcher log noise
_load_processor() now catches ModuleNotFoundError at DEBUG level (not ERROR). Only actual ImportError from broken modules logs as ERROR.
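A sketch of the logging split described above, using only the standard library. The function and logger names are hypothetical. Because `ModuleNotFoundError` is a subclass of `ImportError`, its handler has to come first.

```python
import importlib
import logging

log = logging.getLogger("dispatcher")


def load_processor(module_name: str):
    """Import a processor module, or return None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # No processor exists for this content type: expected, so keep it quiet.
        log.debug("no processor module %s", module_name)
        return None
    except ImportError:
        # The module exists but failed to import: a genuine defect.
        log.error("processor module %s failed to import", module_name, exc_info=True)
        return None
```

One limitation of this simple split is that a processor with a missing dependency also raises `ModuleNotFoundError` and would be logged at DEBUG; comparing the exception's `name` attribute with `module_name` is one way to tell the two cases apart.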
Fix 1.4: Stale state cleanup in transcript processor
pre_flight() now removes pre-existing processing/{hash}/ and concepts/{hash}/ directories before processing, preventing stale concept JSONs from interfering with re-enrichment.
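The cleanup amounts to deleting two per-hash directories if they exist. A minimal sketch, assuming both live under a common data root; `remove_stale_state` and `data_root` are hypothetical names.

```python
import shutil
from pathlib import Path
from typing import List


def remove_stale_state(data_root: Path, content_hash: str) -> List[Path]:
    """Delete leftover processing/ and concepts/ directories for one hash."""
    removed = []
    for subdir in ("processing", "concepts"):
        stale = data_root / subdir / content_hash
        if stale.is_dir():
            shutil.rmtree(stale)
            removed.append(stale)
    return removed
```

Returning the removed paths gives the caller something to log, which makes a re-run distinguishable from a first run.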
Fix 1.5: Solo content files in dispatcher
_find_pairs() now has a second pass that picks up content files without a .meta.json sidecar, passing meta_path=None to the processor.
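The two-pass pairing can be sketched as below. It assumes a sidecar is named after its content file's stem (for example `book.pdf` with `book.meta.json`); that naming convention and the `find_pairs` name are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Optional, Tuple

SIDECAR_SUFFIX = ".meta.json"


def find_pairs(intake: Path) -> List[Tuple[Path, Optional[Path]]]:
    """Return (content_path, meta_path) tuples; meta_path is None for solo files."""
    files = sorted(p for p in intake.iterdir() if p.is_file())
    sidecars = [p for p in files if p.name.endswith(SIDECAR_SUFFIX)]
    content = [p for p in files if not p.name.endswith(SIDECAR_SUFFIX)]

    pairs: List[Tuple[Path, Optional[Path]]] = []
    paired = set()

    # Pass 1: match each sidecar to the content file with the same stem.
    for meta in sidecars:
        base = meta.name[: -len(SIDECAR_SUFFIX)]
        match = next((c for c in content if c.stem == base), None)
        if match is not None:
            pairs.append((match, meta))
            paired.add(match)

    # Pass 2: pick up content files that have no sidecar.
    for item in content:
        if item not in paired:
            pairs.append((item, None))
    return pairs
```

Orphan sidecars with no matching content file are ignored in this sketch.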
Directories Created
| Path | Purpose |
|---|---|
| /mnt/library/_review/rejected_pdfs/ | Unreadable PDFs (0 pages, corrupt) |
| /mnt/library/_review/duplicate_quarantine/ | Level-4 metadata-duplicate PDFs for human review |
| /opt/recon/data/acquired/pdf/ | Intake directory for PDF dispatcher |
End-to-End Test
Test document: 93aad72f49207f72af77b90aa7e62016 — "Finalizing a hydro-electric installation" by Michael Hackleman (12 pages, 468KB)
Pipeline Execution
| Stage | Result |
|---|---|
| Dispatch + pre_flight | action='extracted', 12/12 pages, metadata voted |
| Enrich | 26 concepts from 3 windows |
| Embed | 26 vectors inserted into Qdrant |
| File | Filed to /mnt/library/Power-Systems/Hydroelectric-Systems/, 35 Qdrant points updated |
Comparison to Baseline
| Metric | Baseline | Phase 4 |
|---|---|---|
| Status | complete | complete |
| Pages extracted | 12 | 12 |
| Concepts | 20 | 26 |
| Vectors | 20 | 26 |
| Title | Finalizing a hydro-electric installation | Finalizing a hydro-electric installation |
| Author | Michael Hackleman | Michael Hackleman |
| DB totals | 29812 | 29812 |
The concept count difference (20 → 26) is expected because enrichment is non-deterministic. Domain classification changed from "Off-grid Systems" to "Power Systems" as a result of the fresh concept extraction.
Commits
| Hash | Message |
|---|---|
| 9fe6a0a | Phase 4: Phase 3 cleanup fixes |
| 96e1e64 | Phase 4: PDF processor with layered metadata extraction |
Branch: refactor on forge.echo6.co/matt/recon