mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 14:44:39 +02:00

Ubuntu 1d9727f26f Phase 4: PDF processor with layered metadata extraction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 16:59:59 +00:00

6.1 KiB


Phase 4: PDF Processor with Layered Metadata Extraction

Executed: 2026-04-14T16:40Z

Backup

| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (pre-Phase 4) | CT 130: `/tmp/recon.db.phase4.20260414.bak` | `1d76f8ba0f169f9a77666af56707f71d` |
| Test row SQL backup | CT 130: `/tmp/recon_phase4_test_93aad72f.sql` | n/a |

Schema Change

Added a metadata_provenance TEXT column to the documents table. It stores JSON containing the voted metadata fields, the per-field provenance (which source won), and the raw data from all three extraction sources.
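
As an illustration only, the migration and the stored JSON might look like the sketch below. It assumes recon.db is SQLite; the minimal `documents` schema, the `add_column_if_missing` helper, and the JSON key names (`voted`, `provenance`, `sources`) are hypothetical and not taken from the repository.

```python
import json
import sqlite3

# Assumption: recon.db is SQLite. An in-memory stand-in with a hypothetical
# minimal schema is used here instead of the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, book_title TEXT)")


def add_column_if_missing(conn, table, column, decl):
    """Add a column only if it is absent, so the migration can be re-run."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


added_first = add_column_if_missing(conn, "documents", "metadata_provenance", "TEXT")
added_again = add_column_if_missing(conn, "documents", "metadata_provenance", "TEXT")

# Hypothetical JSON layout: voted values, winning source, raw per-source data.
record = {
    "voted": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    "provenance": {"title": "gemini", "year": "agreed(pdf_dict,gemini)"},
    "sources": {
        "pdf_dict": {"title": "ew-FinishHydro70g.PDF", "year": "2001"},
        "filename": {"title": "Finalizing A Hydro Electric Installation Hackleman"},
        "gemini": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    },
}
conn.execute(
    "INSERT INTO documents (hash, book_title, metadata_provenance) VALUES (?, ?, ?)",
    ("93aad72f49207f72af77b90aa7e62016", record["voted"]["title"], json.dumps(record)),
)
stored = json.loads(
    conn.execute("SELECT metadata_provenance FROM documents").fetchone()[0]
)
```

Storing the raw source data alongside the voted result keeps the vote reproducible if the voting rules change later.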

What Was Created

`lib/processors/pdf_processor.py` — `pre_flight()`

Handles PDF content from acquired/pdf/. Implements a 17-step pipeline:

1. Hash: MD5 of PDF content via content_hash()
2. Stale cleanup: removes pre-existing processing/{hash}/ and concepts/{hash}/ directories
3. Hash dedupe: exact content match against catalogue; removes pair if duplicate
4. Size check: rejects PDFs exceeding processing.max_pdf_size_mb (default 200 MB)
5. Open PDF: PyPDF2 PdfReader with pdfinfo fallback for page count
6. Source A: PDF info dictionary metadata (title, author, edition, year)
7. Source B: filename parsing via clean_filename_to_title() + regex patterns
8. Extract first 3 pages: for Source C input, using existing extract_text_from_page() fallback chain
9. Source C: Gemini LLM metadata extraction from first 3 pages (retries 3x with 30s backoff)
10. Vote: per-field voting across sources; 2+ agreement wins, else priority C > A > B
11. Level-4 dedupe: strict check requiring ALL FOUR fields (title, author, edition, year) present and matching an existing document
12. Move to processing: PDF → processing/{hash}/source.pdf, sidecar → sidecar.meta.json
13. Full text extraction: all pages via extract_text_from_page() (PyPDF2 → pdftotext → Tesseract → Gemini Vision)
14. Write meta.json: extraction stats, voted metadata, provenance record
15. Register in DB: add_to_catalogue() + queue_document()
16. Update documents row: sets text_dir, page_count, book_title, book_author, metadata_provenance
17. Status = extracted: advances to next pipeline stage
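
The level-4 dedupe rule in the pipeline above can be sketched as follows. This is a minimal illustration, not the code in `pdf_processor.py`: the function name, the dict-based catalogue, the normalization (case and whitespace folding), and the catalogue entry with edition "2nd" are all assumptions.

```python
FIELDS = ("title", "author", "edition", "year")


def _norm(value):
    """Comparison key ignoring case and extra whitespace (assumed normalization)."""
    if value is None:
        return None
    text = " ".join(str(value).split()).casefold()
    return text or None


def is_level4_duplicate(candidate, existing_docs):
    """True only if all four fields are present and equal to one existing doc."""
    key = tuple(_norm(candidate.get(f)) for f in FIELDS)
    if any(part is None for part in key):
        return False  # a missing field can never produce a level-4 match
    return any(
        key == tuple(_norm(doc.get(f)) for f in FIELDS) for doc in existing_docs
    )


# Hypothetical catalogue entry used only for this demonstration.
catalogue = [
    {"title": "Finalizing a hydro-electric installation",
     "author": "Michael Hackleman", "edition": "2nd", "year": "2001"},
]
full_match = is_level4_duplicate(
    {"title": "finalizing a hydro-electric  installation",
     "author": "Michael Hackleman", "edition": "2nd", "year": "2001"},
    catalogue,
)
missing_edition = is_level4_duplicate(
    {"title": "Finalizing a hydro-electric installation",
     "author": "Michael Hackleman", "edition": None, "year": "2001"},
    catalogue,
)
```

Under this reading, a document whose edition is unknown is never quarantined as a level-4 duplicate, which keeps the check strict.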

Failure Modes

| Type | Behavior |
|---|---|
| Hash duplicate | Removes pair from acquired/, returns `action='duplicate'` |
| Content failure (unreadable PDF) | Moves to `/mnt/library/_review/rejected_pdfs/`, returns `action='content_failure'` |
| Level-4 duplicate | Moves to `/mnt/library/_review/duplicate_quarantine/`, queues for human review, returns `action='level4_duplicate'` |
| Gemini API transient | Retries 3x with 30s backoff; continues without Source C if exhausted |
| Oversized PDF | Moves to rejected_pdfs, returns `action='content_failure'` |
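
The "Gemini API transient" row can be illustrated with a generic retry helper. This sketch assumes that "3x" means three attempts in total with a fixed 30-second wait between them; `TransientError` and `call_with_retries` are hypothetical names, and no real Gemini client call is shown.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable API failure (hypothetical name)."""


def call_with_retries(fn, attempts=3, delay_s=30.0, sleep=time.sleep):
    """Call fn up to `attempts` times; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt < attempts:
                sleep(delay_s)  # fixed wait before the next attempt
    return None  # caller continues without Source C


# Demonstration with injected sleep functions so nothing actually waits.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return {"title": "Finalizing a hydro-electric installation"}


def always_fails():
    raise TransientError("still failing")


waits_ok, waits_bad = [], []
result = call_with_retries(flaky, sleep=waits_ok.append)
exhausted = call_with_retries(always_fails, sleep=waits_bad.append)
```

Returning `None` instead of raising lets voting proceed with Sources A and B alone.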

Metadata Voting Example

From the end-to-end test (93aad72f — hydro-electric installation):

| Field | Source A (PDF dict) | Source B (Filename) | Source C (Gemini) | Winner |
|---|---|---|---|---|
| Title | `ew-FinishHydro70g.PDF` | `Finalizing A Hydro Electric Installation Hackleman` | `Finalizing a hydro-electric installation` | gemini |
| Author | `Dave` | (none) | `Michael Hackleman` | gemini |
| Edition | (none) | (none) | (none) | null |
| Year | `2001` | (none) | `2001` | agreed(pdf_dict,gemini) |
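
The rule that produces these winners (2+ agreement wins, else priority C > A > B) can be sketched as below. The `vote` function, the source keys, and the comparison normalization are assumptions for illustration; only the input values and the winner labels come from the table above.

```python
PRIORITY = ("gemini", "pdf_dict", "filename")  # tie-break order: C > A > B


def _key(value):
    """Comparison key: collapse whitespace and ignore case (assumed)."""
    return " ".join(str(value).split()).casefold()


def vote(field_values):
    """field_values: {source: value or None}. Returns (value, provenance)."""
    present = {s: v for s, v in field_values.items() if v not in (None, "")}
    if not present:
        return None, "null"
    groups = {}
    for source, value in present.items():
        groups.setdefault(_key(value), []).append(source)
    agreeing = max(groups.values(), key=len)
    if len(agreeing) >= 2:
        # Two or more sources agree: report the highest-priority spelling.
        value = present[next(s for s in PRIORITY if s in agreeing)]
        return value, "agreed(" + ",".join(agreeing) + ")"
    for source in PRIORITY:  # no agreement: fall back to priority order
        if source in present:
            return present[source], source
    return None, "null"


sources = {
    "title": {"pdf_dict": "ew-FinishHydro70g.PDF",
              "filename": "Finalizing A Hydro Electric Installation Hackleman",
              "gemini": "Finalizing a hydro-electric installation"},
    "author": {"pdf_dict": "Dave", "filename": None, "gemini": "Michael Hackleman"},
    "edition": {"pdf_dict": None, "filename": None, "gemini": None},
    "year": {"pdf_dict": "2001", "filename": None, "gemini": "2001"},
}
results = {field: vote(values) for field, values in sources.items()}
```

With these inputs the sketch reproduces the Winner column: `gemini` for title and author, `null` for edition, and `agreed(pdf_dict,gemini)` for year.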

Phase 3 Cleanup Fixes (also committed in this phase)

Fix 1.1: Extension preservation in `filing.py`

_build_target_path() calls sanitize_filename() which defaults to .pdf. For transcripts (.txt files), this caused incorrect extensions. Fix: after _build_target_path(), replace the target extension with the source file's actual extension.
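
A minimal sketch of the extension fix, assuming the target path is already built; `preserve_source_extension` and the example paths are hypothetical and stand in for the logic added after `_build_target_path()`.

```python
from pathlib import PurePosixPath


def preserve_source_extension(target, source):
    """Return target with its suffix replaced by the source file's suffix."""
    if not source.suffix:
        return target  # nothing to copy over; leave the target unchanged
    return target.with_suffix(source.suffix)


# Hypothetical paths: a transcript whose target was defaulted to .pdf.
source = PurePosixPath("/opt/recon/data/processing/abc123/transcript.txt")
target = PurePosixPath("/mnt/library/Electronics/Soldering-Basics.pdf")
fixed = preserve_source_extension(target, source)
```

Here `fixed` ends in `.txt`, while a source that is already a `.pdf` leaves the target untouched.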

Fix 1.2: Back-fix soldering transcript

One-off script renamed the filed soldering transcript from .pdf to .txt in filesystem, catalogue, documents, and Qdrant (5 points).

Fix 1.3: Dispatcher log noise

_load_processor() now logs ModuleNotFoundError at DEBUG level instead of ERROR. Only a genuine ImportError raised by a broken module is logged as ERROR.
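
One way to express this distinction is sketched below. `load_processor` is a hypothetical stand-in for `_load_processor()`. Because `ModuleNotFoundError` is a subclass of `ImportError`, it has to be caught first. The `exc.name` check, which separates "the processor itself is absent" from "the processor exists but one of its dependencies is missing", is an extra refinement not described in the source.

```python
import importlib
import logging

logger = logging.getLogger("dispatcher")


def load_processor(module_name):
    """Import a processor module, or return None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Must precede ImportError: ModuleNotFoundError is its subclass.
        missing = exc.name or ""
        if missing == module_name or module_name.startswith(missing + "."):
            # The processor (or its parent package) does not exist: expected.
            logger.debug("No processor module %s", module_name)
        else:
            # Assumed refinement: the module exists but a dependency is missing.
            logger.error("Processor %s needs missing module %s", module_name, missing)
        return None
    except ImportError:
        logger.exception("Processor module %s failed to import", module_name)
        return None


absent = load_processor("recon_hypothetical_missing_processor")
present = load_processor("json")  # any importable module works for the demo
```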

Fix 1.4: Stale state cleanup in transcript processor

pre_flight() now removes pre-existing processing/{hash}/ and concepts/{hash}/ directories before processing, preventing stale concept JSONs from interfering with re-enrichment.
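
The cleanup step might look like the following sketch. The `remove_stale_state` helper and the temporary directory layout are assumptions; only the `processing/{hash}/` and `concepts/{hash}/` directory names come from the description above.

```python
import shutil
import tempfile
from pathlib import Path


def remove_stale_state(data_root, content_hash):
    """Delete leftover per-document directories; return the paths removed."""
    removed = []
    for sub in ("processing", "concepts"):
        stale = data_root / sub / content_hash
        if stale.is_dir():
            shutil.rmtree(stale)
            removed.append(stale)
    return removed


# Demonstration in a throwaway directory with one stale concepts folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stale_dir = root / "concepts" / "93aad72f"
    stale_dir.mkdir(parents=True)
    (stale_dir / "window_1.json").write_text("{}")
    removed = remove_stale_state(root, "93aad72f")
    still_there = stale_dir.exists()
```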

Fix 1.5: Solo content files in dispatcher

_find_pairs() now has a second pass that picks up content files without a .meta.json sidecar, passing meta_path=None to the processor.
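
A sketch of the two-pass pairing follows. It assumes a sidecar shares its stem with the content file (`name.pdf` pairs with `name.meta.json`); that naming convention and the `find_pairs` function are hypothetical, not confirmed by the source.

```python
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"


def find_pairs(intake_dir):
    """Return (content_path, meta_path_or_None) tuples, paired files first."""
    files = sorted(p for p in intake_dir.iterdir() if p.is_file())
    sidecars = {
        p.name[: -len(SIDECAR_SUFFIX)]: p
        for p in files
        if p.name.endswith(SIDECAR_SUFFIX)
    }
    content = [p for p in files if not p.name.endswith(SIDECAR_SUFFIX)]
    # First pass: content files that have a matching sidecar.
    pairs = [(p, sidecars[p.stem]) for p in content if p.stem in sidecars]
    # Second pass: solo content files, passed on with meta_path=None.
    paired = {p for p, _ in pairs}
    pairs.extend((p, None) for p in content if p not in paired)
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.pdf", "a.meta.json", "b.pdf"):
        (root / name).write_text("")
    result = [(c.name, m.name if m else None) for c, m in find_pairs(root)]
```

A processor receiving `meta_path=None` then has to derive its metadata from the content alone.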

Directories Created

| Path | Purpose |
|---|---|
| `/mnt/library/_review/rejected_pdfs/` | Unreadable PDFs (0 pages, corrupt) |
| `/mnt/library/_review/duplicate_quarantine/` | Level-4 metadata-duplicate PDFs for human review |
| `/opt/recon/data/acquired/pdf/` | Intake directory for PDF dispatcher |

End-to-End Test

Test document: 93aad72f49207f72af77b90aa7e62016 — "Finalizing a hydro-electric installation" by Michael Hackleman (12 pages, 468KB)

Pipeline Execution

| Stage | Result |
|---|---|
| Dispatch + pre_flight | `action='extracted'`, 12/12 pages, metadata voted |
| Enrich | 26 concepts from 3 windows |
| Embed | 26 vectors inserted into Qdrant |
| File | Filed to `/mnt/library/Power-Systems/Hydroelectric-Systems/`, 35 Qdrant points updated |

Comparison to Baseline

| Metric | Baseline | Phase 4 |
|---|---|---|
| Status | complete | complete |
| Pages extracted | 12 | 12 |
| Concepts | 20 | 26 |
| Vectors | 20 | 26 |
| Title | Finalizing a hydro-electric installation | Finalizing a hydro-electric installation |
| Author | Michael Hackleman | Michael Hackleman |
| DB totals | 29812 | 29812 |

The concept count difference (20 → 26) is expected because enrichment is non-deterministic. The domain classification changed from "Off-grid Systems" to "Power Systems" as a result of the fresh concept extraction.

Commits

| Hash | Message |
|---|---|
| `9fe6a0a` | Phase 4: Phase 3 cleanup fixes |
| `96e1e64` | Phase 4: PDF processor with layered metadata extraction |

Branch: refactor on forge.echo6.co/matt/recon
