mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 06:34:34 +02:00
Phase 4: PDF processor with layered metadata extraction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 0747cb761f
commit 1d9727f26f
1 changed file with 139 additions and 0 deletions
phases/phase-4-pdf-processor.md
@@ -0,0 +1,139 @@
# Phase 4: PDF Processor with Layered Metadata Extraction

**Executed:** 2026-04-14T16:40Z

---
## Backup

| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (pre-Phase 4) | CT 130: `/tmp/recon.db.phase4.20260414.bak` | `1d76f8ba0f169f9a77666af56707f71d` |
| Test row SQL backup | CT 130: `/tmp/recon_phase4_test_93aad72f.sql` | n/a |

---
## Schema Change

Added a `metadata_provenance TEXT` column to the `documents` table. It stores JSON containing the voted metadata fields, per-field provenance (which source won), and the raw data from all three extraction sources.
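
The schema change above can be illustrated with a minimal sketch. It assumes SQLite (suggested by `recon.db`), a simplified `documents` table, and a hypothetical JSON layout with `voted`, `provenance`, and `sources` keys; the real key names and the `hash` column are not given in this report.

```python
import json
import sqlite3

# In-memory stand-in for recon.db; the real documents table has more columns,
# and the `hash` column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, book_title TEXT)")

# The schema change itself: one nullable TEXT column that holds JSON.
conn.execute("ALTER TABLE documents ADD COLUMN metadata_provenance TEXT")

# Hypothetical JSON layout: voted values, which source won, and raw inputs.
record = {
    "voted": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    "provenance": {"title": "gemini", "year": "agreed(pdf_dict,gemini)"},
    "sources": {
        "pdf_dict": {"title": "ew-FinishHydro70g.PDF", "year": "2001"},
        "filename": {"title": "Finalizing A Hydro Electric Installation Hackleman"},
        "gemini": {"title": "Finalizing a hydro-electric installation", "year": "2001"},
    },
}
conn.execute(
    "INSERT INTO documents (hash, book_title, metadata_provenance) VALUES (?, ?, ?)",
    ("93aad72f", record["voted"]["title"], json.dumps(record)),
)

# Reading the column back yields the same structure after json.loads().
stored = conn.execute(
    "SELECT metadata_provenance FROM documents WHERE hash = ?", ("93aad72f",)
).fetchone()[0]
loaded = json.loads(stored)
```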

---

## What Was Created
### `lib/processors/pdf_processor.py`: `pre_flight()`

Handles PDF content from `acquired/pdf/`. Implements a 17-step pipeline:

1. **Hash**: MD5 of the PDF content via `content_hash()`
2. **Stale cleanup**: removes pre-existing `processing/{hash}/` and `concepts/{hash}/` directories
3. **Hash dedupe**: exact content match against the catalogue; removes the pair if it is a duplicate
4. **Size check**: rejects PDFs exceeding `processing.max_pdf_size_mb` (default 200 MB)
5. **Open PDF**: PyPDF2 `PdfReader`, with a `pdfinfo` fallback for the page count
6. **Source A**: PDF info dictionary metadata (title, author, edition, year)
7. **Source B**: filename parsing via `clean_filename_to_title()` plus regex patterns
8. **Extract first 3 pages**: input for Source C, using the existing `extract_text_from_page()` fallback chain
9. **Source C**: Gemini LLM metadata extraction from the first 3 pages (retries 3x with 30 s backoff)
10. **Vote**: per-field voting across sources; agreement between 2+ sources wins, otherwise priority C > A > B
11. **Level-4 dedupe**: strict check requiring ALL FOUR fields (title, author, edition, year) to be present and to match an existing document
12. **Move to processing**: PDF → `processing/{hash}/source.pdf`, sidecar → `sidecar.meta.json`
13. **Full text extraction**: all pages via `extract_text_from_page()` (PyPDF2 → pdftotext → Tesseract → Gemini Vision)
14. **Write meta.json**: extraction stats, voted metadata, provenance record
15. **Register in DB**: `add_to_catalogue()` + `queue_document()`
16. **Update documents row**: sets `text_dir`, `page_count`, `book_title`, `book_author`, `metadata_provenance`
17. **Status = extracted**: advances to the next pipeline stage
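
The fallback chain used in steps 8 and 13 can be sketched generically. This is not the project's `extract_text_from_page()`; `extract_with_fallback`, the `min_chars` threshold, and the stub layers are hypothetical names that only show the "try the next layer when the current one yields nothing" pattern.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes (pdf_bytes, page_no) and returns text, or None/"" on failure.
Extractor = Callable[[bytes, int], Optional[str]]


def extract_with_fallback(
    pdf_bytes: bytes,
    page_no: int,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 20,  # assumed threshold for "usable" text
) -> Tuple[Optional[str], str]:
    """Return (layer_name, text) from the first layer that yields enough text."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_bytes, page_no)
        except Exception:
            # A crashing layer (missing tool, corrupt page) falls through.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Stub layers standing in for PyPDF2 -> pdftotext -> Tesseract.
def empty_layer(pdf_bytes: bytes, page_no: int) -> str:
    return ""


def crashing_layer(pdf_bytes: bytes, page_no: int) -> str:
    raise RuntimeError("tool missing")


def ocr_layer(pdf_bytes: bytes, page_no: int) -> str:
    return "Finalizing a hydro-electric installation"


chain = [("pypdf2", empty_layer), ("pdftotext", crashing_layer), ("tesseract", ocr_layer)]
source, text = extract_with_fallback(b"%PDF-", 0, chain)
```

Recording which layer produced the text (`source` here) is one way to feed the extraction stats written in step 14.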
### Failure Modes

| Type | Behavior |
|------|----------|
| **Hash duplicate** | Removes pair from `acquired/`, returns `action='duplicate'` |
| **Content failure** (unreadable PDF) | Moves to `/mnt/library/_review/rejected_pdfs/`, returns `action='content_failure'` |
| **Level-4 duplicate** | Moves to `/mnt/library/_review/duplicate_quarantine/`, queues for human review, returns `action='level4_duplicate'` |
| **Gemini API transient** | Retries 3x with 30 s backoff; continues without Source C if exhausted |
| **Oversized PDF** | Moves to `/mnt/library/_review/rejected_pdfs/`, returns `action='content_failure'` |
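
The "Gemini API transient" row can be sketched as a small retry helper. It assumes "retries 3x" means one initial attempt plus up to three retries and that the backoff is a fixed 30 s; `TransientAPIError` and `call_with_retries` are hypothetical names, not the project's API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable API failure (timeout, 5xx, rate limit)."""


def call_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    backoff_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not wait
) -> Optional[T]:
    """Call fn(); on a transient error wait backoff_s and retry.

    Returns None when all attempts fail, so the caller can carry on
    without Source C instead of aborting the whole pre-flight.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientAPIError:
            if attempt < retries:
                sleep(backoff_s)
    return None
```

Returning `None` rather than raising mirrors the "continues without Source C if exhausted" behavior: voting then runs on Sources A and B alone.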
### Metadata Voting Example

From the end-to-end test (`93aad72f`, hydro-electric installation):

| Field | Source A (PDF dict) | Source B (Filename) | Source C (Gemini) | Winner |
|-------|---------------------|---------------------|-------------------|--------|
| Title | `ew-FinishHydro70g.PDF` | `Finalizing A Hydro Electric Installation Hackleman` | `Finalizing a hydro-electric installation` | gemini |
| Author | `Dave` | none | `Michael Hackleman` | gemini |
| Edition | none | none | none | null |
| Year | `2001` | none | `2001` | agreed(pdf_dict,gemini) |
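
The voting rule (2+ agreement wins, else priority C > A > B) can be sketched as follows. This is an illustration, not the project's implementation: `vote_field`, the source labels as dictionary keys, and the case/whitespace normalization used to detect agreement are assumptions.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

PRIORITY = ("gemini", "pdf_dict", "filename")  # C > A > B
REPORT_ORDER = ("pdf_dict", "filename", "gemini")  # order used in "agreed(...)"


def _norm(value: str) -> str:
    # Assumed normalization: ignore case and whitespace differences only.
    return " ".join(value.casefold().split())


def vote_field(candidates: Dict[str, Optional[str]]) -> Tuple[Optional[str], str]:
    """Return (winning_value, provenance_label) for one metadata field."""
    groups = defaultdict(list)
    for source in REPORT_ORDER:
        value = candidates.get(source)
        if value:
            groups[_norm(value)].append(source)

    # Rule 1: any value backed by two or more sources wins.
    for sources in groups.values():
        if len(sources) >= 2:
            best = next(s for s in PRIORITY if s in sources)
            return candidates[best], "agreed(" + ",".join(sources) + ")"

    # Rule 2: otherwise fall back to source priority C > A > B.
    for source in PRIORITY:
        if candidates.get(source):
            return candidates[source], source

    return None, "null"


# Inputs taken from the table above.
title = vote_field({
    "pdf_dict": "ew-FinishHydro70g.PDF",
    "filename": "Finalizing A Hydro Electric Installation Hackleman",
    "gemini": "Finalizing a hydro-electric installation",
})
author = vote_field({"pdf_dict": "Dave", "filename": None, "gemini": "Michael Hackleman"})
edition = vote_field({"pdf_dict": None, "filename": None, "gemini": None})
year = vote_field({"pdf_dict": "2001", "filename": None, "gemini": "2001"})
```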

---

## Phase 3 Cleanup Fixes (also committed in this phase)

### Fix 1.1: Extension preservation in `filing.py`

`_build_target_path()` calls `sanitize_filename()`, which defaults to `.pdf`. For transcripts (`.txt` files), this produced incorrect extensions. Fix: after `_build_target_path()`, replace the target extension with the source file's actual extension.
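
A minimal sketch of that post-processing step, using `pathlib`. The helper name `preserve_source_extension` and the example paths are hypothetical; leaving the target unchanged when the source has no extension is an assumption.

```python
from pathlib import Path


def preserve_source_extension(target: Path, source: Path) -> Path:
    """Swap the target's extension for the source file's actual extension."""
    if not source.suffix:
        # Assumption: with no source extension, keep the sanitized default.
        return target
    return target.with_suffix(source.suffix)


# A transcript filed under a sanitized name that defaulted to ".pdf".
fixed = preserve_source_extension(
    Path("/mnt/library/Electronics/Soldering-Basics.pdf"),
    Path("processing/abc123/source.txt"),
)
```

`Path.with_suffix()` replaces only the final suffix, so a name such as `Guide-v1.2.pdf` becomes `Guide-v1.2.txt`.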
### Fix 1.2: Back-fix soldering transcript

A one-off script renamed the filed soldering transcript from `.pdf` to `.txt` in the filesystem, the catalogue, the `documents` table, and Qdrant (5 points).

### Fix 1.3: Dispatcher log noise

`_load_processor()` now logs `ModuleNotFoundError` at DEBUG level instead of ERROR. Only a genuine `ImportError` raised by a broken module is logged at ERROR.
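
The distinction can be sketched as below. `load_processor` is a hypothetical stand-in for the dispatcher's `_load_processor()`. The order of the `except` clauses matters because `ModuleNotFoundError` is a subclass of `ImportError`.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger("dispatcher")


def load_processor(module_name: str) -> Optional[ModuleType]:
    """Import a processor module, treating 'not written yet' as a quiet case."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # No processor exists for this content type yet: expected, so DEBUG.
        logger.debug("no processor module %s", module_name)
        return None
    except ImportError:
        # The module exists but fails to import: a real defect, so ERROR.
        logger.error("processor module %s is broken", module_name, exc_info=True)
        return None
```

One caveat of this simple form: a processor that imports a missing third-party package also raises `ModuleNotFoundError`. Comparing the exception's `name` attribute with `module_name` is one way to tell the two cases apart.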
### Fix 1.4: Stale state cleanup in transcript processor

`pre_flight()` now removes pre-existing `processing/{hash}/` and `concepts/{hash}/` directories before processing, which prevents stale concept JSONs from interfering with re-enrichment.
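
A minimal sketch of that cleanup, assuming both directories live under a common data root. The helper name `remove_stale_state` and its return value are illustrative only.

```python
import shutil
from pathlib import Path
from typing import List


def remove_stale_state(data_root: Path, doc_hash: str) -> List[Path]:
    """Delete leftover processing/{hash}/ and concepts/{hash}/ directories."""
    removed = []
    for subdir in ("processing", "concepts"):
        stale = data_root / subdir / doc_hash
        if stale.is_dir():
            shutil.rmtree(stale)  # removes the directory and everything in it
            removed.append(stale)
    return removed
```

Returning the removed paths makes it easy to log what was cleaned up. A first-time document returns an empty list.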
### Fix 1.5: Solo content files in dispatcher

`_find_pairs()` now has a second pass that picks up content files without a `.meta.json` sidecar, passing `meta_path=None` to the processor.
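
The two-pass idea can be sketched as follows. The sidecar naming convention (`name.meta.json` next to `name.pdf`) and the function name `find_pairs` are assumptions for illustration; the dispatcher's actual matching rules are not shown in this report.

```python
from pathlib import Path
from typing import List, Optional, Tuple

SIDECAR_SUFFIX = ".meta.json"


def find_pairs(intake_dir: Path) -> List[Tuple[Path, Optional[Path]]]:
    """Return (content_path, meta_path) pairs; meta_path is None for solo files."""
    files = sorted(p for p in intake_dir.iterdir() if p.is_file())
    # Index sidecars by base name, e.g. "a.meta.json" -> "a".
    sidecars = {
        p.name[: -len(SIDECAR_SUFFIX)]: p
        for p in files
        if p.name.endswith(SIDECAR_SUFFIX)
    }
    content = [p for p in files if not p.name.endswith(SIDECAR_SUFFIX)]

    pairs: List[Tuple[Path, Optional[Path]]] = []
    # First pass: content files that have a matching sidecar.
    for path in content:
        if path.stem in sidecars:
            pairs.append((path, sidecars[path.stem]))
    # Second pass: solo content files, handed to the processor with no sidecar.
    for path in content:
        if path.stem not in sidecars:
            pairs.append((path, None))
    return pairs
```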

---

## Directories Created

| Path | Purpose |
|------|---------|
| `/mnt/library/_review/rejected_pdfs/` | Unreadable (0 pages, corrupt) or oversized PDFs |
| `/mnt/library/_review/duplicate_quarantine/` | Level-4 metadata-duplicate PDFs for human review |
| `/opt/recon/data/acquired/pdf/` | Intake directory for PDF dispatcher |

---
## End-to-End Test

**Test document:** `93aad72f49207f72af77b90aa7e62016`: "Finalizing a hydro-electric installation" by Michael Hackleman (12 pages, 468 KB)

### Pipeline Execution

| Stage | Result |
|-------|--------|
| Dispatch + pre_flight | `action='extracted'`, 12/12 pages, metadata voted |
| Enrich | 26 concepts from 3 windows |
| Embed | 26 vectors inserted into Qdrant |
| File | Filed to `/mnt/library/Power-Systems/Hydroelectric-Systems/`, 35 Qdrant points updated |
### Comparison to Baseline

| Metric | Baseline | Phase 4 |
|--------|----------|---------|
| Status | complete | complete |
| Pages extracted | 12 | 12 |
| Concepts | 20 | 26 |
| Vectors | 20 | 26 |
| Title | Finalizing a hydro-electric installation | Finalizing a hydro-electric installation |
| Author | Michael Hackleman | Michael Hackleman |
| DB totals | 29812 | 29812 |

The concept count difference (20 → 26) is expected because enrichment is non-deterministic. The domain classification changed from "Off-grid Systems" to "Power Systems" as a result of the fresh concept extraction.

---
## Commits

| Hash | Message |
|------|---------|
| `9fe6a0a` | Phase 4: Phase 3 cleanup fixes |
| `96e1e64` | Phase 4: PDF processor with layered metadata extraction |

Branch: `refactor` on `forge.echo6.co/matt/recon`