matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 14:44:54 +02:00

Author	SHA1	Message	Date
Matt	df29d598d3	Phase 6a: transcripts mark organized in-place, skip filing. Transcripts are derived text from PeerTube videos, not primary source files. They do not belong in library/Domain/Subdomain/ like PDFs. Change: transcript_processor.pre_flight() now sets organized_at = CURRENT_TIMESTAMP at the end of successful processing, marking the transcript as organized in place. The watch URL remains in catalogue.path and Qdrant download_url so users clicking search results go to the PeerTube video. The filing worker's path LIKE filter naturally excludes transcripts since their documents.path is the watch URL, not a filesystem path. No filing worker changes needed. Back-fills 2,260 drain items from Phase 5c-2 via one-time SQL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 22:49:21 +00:00
Matt	9fa60f9c86	Fix: stale cleanup in processors must fail loudly on permission errors. Phase 5c-2 failed because shutil.rmtree(ignore_errors=True) silently failed to clean up root-owned legacy files in processing/{hash}/, letting the processor proceed into a half-cleaned directory and then crash on subsequent file writes. Changes: removed ignore_errors=True and wrapped the call in a try/except that logs and re-raises, so the processor fails early and visibly if stale cleanup fails. Recovery from Phase 5c-2 failure.	2026-04-14 20:15:48 +00:00
Matt	96e1e642c4	Phase 4: PDF processor with layered metadata extraction - Add lib/processors/pdf_processor.py with full pre_flight pipeline - Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini) - Field-by-field voting with provenance tracking (metadata_provenance column) - Level-4 strict dedupe (title+author+edition+year) - Content failures route to _review/rejected_pdfs/ - Level-4 duplicates route to _review/duplicate_quarantine/ - Full text extraction using existing extract_text_from_page fallback chain - Schema: added metadata_provenance TEXT to documents table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:57:44 +00:00
Matt	9fe6a0a782	Phase 4: Phase 3 cleanup fixes. Fix 1.1: filing preserves source file extension instead of defaulting to .pdf. Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78). Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR. Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry. Also: dispatcher now handles solo content files without .meta.json sidecar. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:39:57 +00:00
Matt	f69c04a0e3	Phase 3: fix page_count in transcript processor. Set page_count on the documents row during pre_flight. Without this, the enricher comparison `page_count >= 3` fails with TypeError on NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:43:21 +00:00
Matt	66fadb7487	Phase 3: dispatcher, transcript processor, text_dir resolution - lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/ for content+sidecar pairs and routes to registered processors - lib/processors/transcript_processor.py: pre_flight() for transcripts (hash, dedupe, split into pages, register in DB, set text_dir) - lib/utils.py: resolve_text_dir() helper for text_dir column fallback - lib/enricher.py: use resolve_text_dir() instead of hardcoded path - lib/embedder.py: use resolve_text_dir() instead of hardcoded path - lib/processors/__init__.py, lib/acquisition/__init__.py: package inits Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:39:42 +00:00
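
The "fail loudly" cleanup described in commit 9fa60f9c86 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `clean_stale_dir` and the logger setup are assumptions; only `shutil.rmtree` without `ignore_errors=True`, wrapped in a try/except that logs and re-raises, comes from the commit message.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def clean_stale_dir(path):
    """Remove a leftover processing directory; raise if removal fails.

    Hypothetical helper. Returns True if a directory was removed and
    False if there was nothing to clean up.
    """
    target = Path(path)
    if not target.exists():
        return False
    try:
        # No ignore_errors=True: a PermissionError (e.g. root-owned
        # files) must surface instead of leaving a half-cleaned tree.
        shutil.rmtree(target)
    except OSError as exc:
        logger.error("stale cleanup failed for %s: %s", target, exc)
        raise  # stop the processor before it writes into a dirty dir
    return True
```

Re-raising rather than returning a status code means a caller cannot accidentally continue into the half-cleaned directory, which is the failure mode the commit attributes to Phase 5c-2.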

6 commits
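
The NULL comparison failure fixed in commit f69c04a0e3 can be reproduced with a small sketch. The table layout here is a reduced, assumed version of the documents table, and `is_long_enough` and `set_page_count` are hypothetical names. The one fact taken as given is that SQLite returns NULL as Python `None`, and `None >= 3` raises `TypeError` in Python 3.

```python
import sqlite3

# Assumed, reduced schema: only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, page_count INTEGER)")
conn.execute("INSERT INTO documents (hash) VALUES ('abc')")  # page_count is NULL


def is_long_enough(conn, doc_hash, minimum=3):
    """Mimic an enricher-style check such as `page_count >= 3`."""
    row = conn.execute(
        "SELECT page_count FROM documents WHERE hash = ?", (doc_hash,)
    ).fetchone()
    return row[0] >= minimum  # raises TypeError while page_count is NULL


def set_page_count(conn, doc_hash, page_count):
    """What pre_flight is described as doing: record the page count."""
    conn.execute(
        "UPDATE documents SET page_count = ? WHERE hash = ?",
        (page_count, doc_hash),
    )
```

Calling `is_long_enough(conn, "abc")` before `set_page_count` raises `TypeError`; once a count is stored, the comparison behaves as an ordinary integer test.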