refactored-recon/phases/phase-4-pdf-processor.md
Ubuntu 1d9727f26f Phase 4: PDF processor with layered metadata extraction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:59:59 +00:00

Phase 4: PDF Processor with Layered Metadata Extraction

Executed: 2026-04-14T16:40Z


Backup

| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (pre-Phase 4) | CT 130: /tmp/recon.db.phase4.20260414.bak | 1d76f8ba0f169f9a77666af56707f71d |
| Test row SQL backup | CT 130: /tmp/recon_phase4_test_93aad72f.sql | |

Schema Change

Added metadata_provenance TEXT column to documents table. Stores JSON with voted metadata fields, per-field provenance (which source won), and raw source data from all three extraction sources.
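
As a rough illustration of the schema change, the sketch below adds the column and round-trips one JSON record. It assumes recon.db is a SQLite database and uses an in-memory stand-in; the `hash` key column and the JSON key names (`voted`, `provenance`, `sources`) are hypothetical, chosen only to mirror the three parts described above. The stored value types are also an assumption.

```python
import json
import sqlite3

# Assumption: recon.db is SQLite. An in-memory database stands in for it here.
conn = sqlite3.connect(":memory:")
# Hypothetical minimal documents table; the real table has more columns.
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, book_title TEXT)")

# The schema change itself: one nullable TEXT column holding JSON.
conn.execute("ALTER TABLE documents ADD COLUMN metadata_provenance TEXT")

# Hypothetical JSON layout: voted values, per-field winner, raw source data.
record = {
    "voted": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    "provenance": {"title": "gemini", "year": "agreed(pdf_dict,gemini)"},
    "sources": {
        "pdf_dict": {"title": "ew-FinishHydro70g.PDF", "year": "2001"},
        "filename": {"title": "Finalizing A Hydro Electric Installation Hackleman"},
        "gemini": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    },
}
conn.execute(
    "INSERT INTO documents (hash, metadata_provenance) VALUES (?, ?)",
    ("93aad72f", json.dumps(record)),
)

# Read the column back and decode it.
row = conn.execute(
    "SELECT metadata_provenance FROM documents WHERE hash = ?", ("93aad72f",)
).fetchone()
stored = json.loads(row[0])
```

Keeping the raw source data next to the voted result means a later change to the voting rules could be replayed without re-running extraction.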


What Was Created

pre_flight() in lib/processors/pdf_processor.py

Handles PDF content from acquired/pdf/. Implements a 17-step pipeline:

  1. Hash — MD5 of PDF content via content_hash()
  2. Stale cleanup — removes pre-existing processing/{hash}/ and concepts/{hash}/ directories
  3. Hash dedupe — exact content match against catalogue; removes pair if duplicate
  4. Size check — rejects PDFs exceeding processing.max_pdf_size_mb (default 200MB)
  5. Open PDF — PyPDF2 PdfReader with pdfinfo fallback for page count
  6. Source A — PDF info dictionary metadata (title, author, edition, year)
  7. Source B — Filename parsing via clean_filename_to_title() + regex patterns
  8. Extract first 3 pages — for Source C input, using existing extract_text_from_page() fallback chain
  9. Source C — Gemini LLM metadata extraction from first 3 pages (retries 3x with 30s backoff)
  10. Vote — per-field voting across sources; 2+ agreement wins, else priority C > A > B
  11. Level-4 dedupe — strict check requiring ALL FOUR fields (title, author, edition, year) present and matching an existing document
  12. Move to processing — PDF → processing/{hash}/source.pdf, sidecar → sidecar.meta.json
  13. Full text extraction — all pages via extract_text_from_page() (PyPDF2 → pdftotext → Tesseract → Gemini Vision)
  14. Write meta.json — extraction stats, voted metadata, provenance record
  15. Register in DB — add_to_catalogue() + queue_document()
  16. Update documents row — sets text_dir, page_count, book_title, book_author, metadata_provenance
  17. Status = extracted — advances to next pipeline stage
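
Step 11 is the strictest check in the list, so a small sketch may help. It assumes voted metadata arrives as a plain dict and that comparison is case- and whitespace-insensitive; both the normalisation and the function name `is_level4_duplicate` are hypothetical, and the sample values are made up for illustration.

```python
FIELDS = ("title", "author", "edition", "year")


def _norm(value) -> str:
    """Lowercase and collapse whitespace (an assumed normalisation)."""
    return " ".join(str(value).lower().split())


def is_level4_duplicate(candidate: dict, existing: dict) -> bool:
    """True only if all four fields are present on both sides and match."""
    for field in FIELDS:
        a, b = candidate.get(field), existing.get(field)
        # A missing field on either side disqualifies the match outright.
        if a in (None, "") or b in (None, ""):
            return False
        if _norm(a) != _norm(b):
            return False
    return True


# Made-up catalogue entry used only to exercise the function.
existing = {
    "title": "Example Handbook",
    "author": "A. Author",
    "edition": "2",
    "year": "1999",
}
same = is_level4_duplicate(dict(existing), existing)
no_edition = is_level4_duplicate({**existing, "edition": None}, existing)
```

Under this reading, a document whose edition is null can never be flagged as a level-4 duplicate, because the check requires all four fields to be present.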

Failure Modes

| Type | Behavior |
|---|---|
| Hash duplicate | Removes pair from acquired/, returns action='duplicate' |
| Content failure (unreadable PDF) | Moves to /mnt/library/_review/rejected_pdfs/, returns action='content_failure' |
| Level-4 duplicate | Moves to /mnt/library/_review/duplicate_quarantine/, queues for human review, returns action='level4_duplicate' |
| Gemini API transient | Retries 3x with 30s backoff; continues without Source C if exhausted |
| Oversized PDF | Moves to rejected_pdfs, returns action='content_failure' |
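
The transient Gemini failure mode can be sketched as a small retry wrapper. This is a sketch under assumptions, not the processor's code: `TransientError` and `call_with_retries` are hypothetical names, "3x" is read as three attempts in total, and the sleep function is injectable so the 30-second wait can be skipped in tests.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error."""


def call_with_retries(
    fn: Callable[[], dict],
    attempts: int = 3,
    backoff_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fn up to `attempts` times; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                sleep(backoff_s)
    # Exhausted: the caller carries on without Source C.
    return None
```

Returning `None` rather than raising lets the caller treat Source C as absent and vote over the remaining sources.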

Metadata Voting Example

From the end-to-end test (93aad72f — hydro-electric installation):

| Field | Source A (PDF dict) | Source B (Filename) | Source C (Gemini) | Winner |
|---|---|---|---|---|
| Title | ew-FinishHydro70g.PDF | Finalizing A Hydro Electric Installation Hackleman | Finalizing a hydro-electric installation | gemini |
| Author | Dave | | Michael Hackleman | gemini |
| Edition | | | | null |
| Year | 2001 | | 2001 | agreed(pdf_dict,gemini) |
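
The voting rule (2+ agreement wins, else priority C > A > B) can be sketched as follows and checked against the example above. This is an illustration rather than the processor's code. The source label `filename`, the whitespace and case normalisation, and the choice to return the highest-priority source's spelling on agreement are all assumptions. Placing "Dave" under Source A is also a reading of the table.

```python
from collections import defaultdict

PRIORITY = ("gemini", "pdf_dict", "filename")  # C > A > B
ORDER = ("pdf_dict", "filename", "gemini")     # A, B, C (used for labels)


def _norm(value) -> str:
    """Lowercase and collapse whitespace before comparing (assumed)."""
    return " ".join(str(value).lower().split())


def vote_field(values: dict) -> tuple:
    """Return (winning_value, provenance) for one metadata field."""
    groups = defaultdict(list)
    for source in ORDER:
        value = values.get(source)
        if value not in (None, ""):
            groups[_norm(value)].append(source)
    # Rule 1: any value shared by two or more sources wins.
    for sources in groups.values():
        if len(sources) >= 2:
            best = min(sources, key=PRIORITY.index)
            return values[best], "agreed(" + ",".join(sources) + ")"
    # Rule 2: otherwise fall back to source priority.
    for source in PRIORITY:
        if values.get(source) not in (None, ""):
            return values[source], source
    return None, None


example = {
    "title": {
        "pdf_dict": "ew-FinishHydro70g.PDF",
        "filename": "Finalizing A Hydro Electric Installation Hackleman",
        "gemini": "Finalizing a hydro-electric installation",
    },
    "author": {"pdf_dict": "Dave", "gemini": "Michael Hackleman"},
    "edition": {},
    "year": {"pdf_dict": "2001", "gemini": "2001"},
}
result = {field: vote_field(values) for field, values in example.items()}
```

With these inputs, `result["year"]` carries the provenance `agreed(pdf_dict,gemini)` and the title and author fall through to `gemini`, matching the Winner column.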

Phase 3 Cleanup Fixes (also committed in this phase)

Fix 1.1: Extension preservation in filing.py

_build_target_path() calls sanitize_filename() which defaults to .pdf. For transcripts (.txt files), this caused incorrect extensions. Fix: after _build_target_path(), replace the target extension with the source file's actual extension.
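
A minimal sketch of the extension fix, assuming paths are handled with `pathlib`. The helper name `preserve_source_extension` and the example paths are hypothetical; the real change lives in filing.py after the `_build_target_path()` call.

```python
from pathlib import Path


def preserve_source_extension(source: Path, target: Path) -> Path:
    """Give the target path the same extension as the source file."""
    if source.suffix and target.suffix.lower() != source.suffix.lower():
        # with_suffix swaps the final extension, e.g. ".pdf" -> ".txt".
        return target.with_suffix(source.suffix)
    return target


# Hypothetical paths: a transcript whose target defaulted to .pdf.
fixed = preserve_source_extension(
    Path("acquired/transcript/soldering.txt"),
    Path("/mnt/library/Electronics/Soldering.pdf"),
)
unchanged = preserve_source_extension(
    Path("acquired/pdf/manual.pdf"),
    Path("/mnt/library/Electronics/Manual.pdf"),
)
```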

Fix 1.2: Back-fix soldering transcript

A one-off script renamed the filed soldering transcript from .pdf to .txt in the filesystem, the catalogue, the documents table, and Qdrant (5 points).

Fix 1.3: Dispatcher log noise

_load_processor() now logs ModuleNotFoundError at DEBUG level instead of ERROR. Only a genuine ImportError raised by a broken module is logged as ERROR.
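
The distinction depends on the order of the `except` clauses, because `ModuleNotFoundError` is a subclass of `ImportError`. The sketch below shows one way to split the two cases. The function name `load_processor`, the logger name, and the log messages are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger("dispatcher")  # hypothetical logger name


def load_processor(module_name: str):
    """Import a processor module, or return None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Must come first: ModuleNotFoundError subclasses ImportError.
        logger.debug("no processor module %s (%s)", module_name, exc)
        return None
    except ImportError as exc:
        # The module exists but failed to import, which is a real problem.
        logger.error("processor module %s is broken: %s", module_name, exc)
        return None
```

If the clauses were reversed, the `ImportError` handler would swallow the missing-module case and the noise would return.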

Fix 1.4: Stale state cleanup in transcript processor

pre_flight() now removes pre-existing processing/{hash}/ and concepts/{hash}/ directories before processing, preventing stale concept JSONs from interfering with re-enrichment.
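
As a sketch of the stale-state cleanup, the helper below removes both per-hash directories if they exist. The name `clear_stale_state` and the `data_root` parameter are assumptions; the demo uses a temporary directory and a made-up file name rather than the real data tree.

```python
import shutil
import tempfile
from pathlib import Path


def clear_stale_state(data_root: Path, content_hash: str) -> list:
    """Remove processing/{hash}/ and concepts/{hash}/ left by earlier runs."""
    removed = []
    for subdir in ("processing", "concepts"):
        stale = data_root / subdir / content_hash
        if stale.is_dir():
            shutil.rmtree(stale)
            removed.append(stale)
    return removed


# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stale_dir = root / "concepts" / "93aad72f"
    stale_dir.mkdir(parents=True)
    (stale_dir / "window_0.json").write_text("{}")
    first = [p.parent.name for p in clear_stale_state(root, "93aad72f")]
    second = clear_stale_state(root, "93aad72f")  # nothing left to remove
    still_there = stale_dir.exists()
```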

Fix 1.5: Solo content files in dispatcher

_find_pairs() now has a second pass that picks up content files without a .meta.json sidecar, passing meta_path=None to the processor.
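
The two-pass behaviour can be sketched like this. It assumes a sidecar is named `{stem}.meta.json` next to `{stem}.pdf`, which is a guess at the naming scheme; the function name `find_pairs` and the demo file names are hypothetical as well.

```python
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # assumed sidecar naming


def find_pairs(intake: Path, content_suffix: str = ".pdf") -> list:
    """Return (content_path, meta_path_or_None) tuples for an intake dir."""
    pairs = []
    paired_names = set()
    # Pass 1: content files that have a sidecar.
    for meta in sorted(intake.glob("*" + SIDECAR_SUFFIX)):
        stem = meta.name[: -len(SIDECAR_SUFFIX)]
        content = intake / (stem + content_suffix)
        if content.is_file():
            pairs.append((content, meta))
            paired_names.add(content.name)
    # Pass 2: solo content files, handed over with meta_path=None.
    for content in sorted(intake.glob("*" + content_suffix)):
        if content.name not in paired_names:
            pairs.append((content, None))
    return pairs


# Demo: one paired file and one solo file.
with tempfile.TemporaryDirectory() as tmp:
    intake = Path(tmp)
    (intake / "a.pdf").write_bytes(b"%PDF-")
    (intake / "a.meta.json").write_text("{}")
    (intake / "b.pdf").write_bytes(b"%PDF-")
    found = [(c.name, m.name if m else None) for c, m in find_pairs(intake)]
```

A processor receiving `meta_path=None` then has to cope without sidecar metadata.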


Directories Created

| Path | Purpose |
|---|---|
| /mnt/library/_review/rejected_pdfs/ | Unreadable PDFs (0 pages, corrupt) |
| /mnt/library/_review/duplicate_quarantine/ | Level-4 metadata-duplicate PDFs for human review |
| /opt/recon/data/acquired/pdf/ | Intake directory for PDF dispatcher |

End-to-End Test

Test document: 93aad72f49207f72af77b90aa7e62016 — "Finalizing a hydro-electric installation" by Michael Hackleman (12 pages, 468KB)

Pipeline Execution

| Stage | Result |
|---|---|
| Dispatch + pre_flight | action='extracted', 12/12 pages, metadata voted |
| Enrich | 26 concepts from 3 windows |
| Embed | 26 vectors inserted into Qdrant |
| File | Filed to /mnt/library/Power-Systems/Hydroelectric-Systems/, 35 Qdrant points updated |

Comparison to Baseline

| Metric | Baseline | Phase 4 |
|---|---|---|
| Status | complete | complete |
| Pages extracted | 12 | 12 |
| Concepts | 20 | 26 |
| Vectors | 20 | 26 |
| Title | Finalizing a hydro-electric installation | Finalizing a hydro-electric installation |
| Author | Michael Hackleman | Michael Hackleman |
| DB totals | 29812 | 29812 |

The concept count difference (20 → 26) is expected because enrichment is non-deterministic. The domain classification changed from "Off-grid Systems" to "Power Systems" as a result of the fresh concept extraction.


Commits

| Hash | Message |
|---|---|
| 9fe6a0a | Phase 4: Phase 3 cleanup fixes |
| 96e1e64 | Phase 4: PDF processor with layered metadata extraction |

Branch: refactor on forge.echo6.co/matt/recon