recon

matt/recon

Mirror of https://github.com/zvx-echo6/recon.git, synced 2026-05-20 14:44:54 +02:00

Author	SHA1	Message	Date
Matt	d9aed35fd7	Phase 5c-1: dispatcher loop, filing worker loop, service rewire. Adds dispatch_loop() alongside dispatch_once() for service-thread use. Adds filing_worker_loop() that watches for status=complete items in /opt/recon/data/processing/ and files them to library/Domain/Subdomain/. Rewires cmd_service() to run the new architecture: Removed: scanner_loop, peertube_scanner_loop, crawler_scheduler_loop, organizer_loop (all replaced by dispatcher + new filing worker); Kept: enrich and embed stage workers, progress, dashboard; Kept (vestigial): extract stage worker, to be removed in Phase 6 cleanup; Added: dispatcher loop thread, filing worker thread. Phase 5c-1 of the refactor. Service not yet started; Phase 5c-2 will do that. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 18:30:58 +00:00
Matt	96e1e642c4	Phase 4: PDF processor with layered metadata extraction. Add lib/processors/pdf_processor.py with full pre_flight pipeline; Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini); Field-by-field voting with provenance tracking (metadata_provenance column); Level-4 strict dedupe (title+author+edition+year); Content failures route to _review/rejected_pdfs/; Level-4 duplicates route to _review/duplicate_quarantine/; Full text extraction using existing extract_text_from_page fallback chain; Schema: added metadata_provenance TEXT to documents table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:57:44 +00:00
Matt	9fe6a0a782	Phase 4: Phase 3 cleanup fixes. Fix 1.1: filing preserves source file extension instead of defaulting to .pdf; Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78); Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR; Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry. Also: dispatcher now handles solo content files without .meta.json sidecar. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:39:57 +00:00
Matt	f69c04a0e3	Phase 3: fix page_count in transcript processor. Set page_count on documents row during pre_flight. Without this, enricher comparison `page_count >= 3` fails with TypeError on NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:43:21 +00:00
Matt	66fadb7487	Phase 3: dispatcher, transcript processor, text_dir resolution. lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/ for content+sidecar pairs and routes to registered processors; lib/processors/transcript_processor.py: pre_flight() for transcripts (hash, dedupe, split into pages, register in DB, set text_dir); lib/utils.py: resolve_text_dir() helper for text_dir column fallback; lib/enricher.py: use resolve_text_dir() instead of hardcoded path; lib/embedder.py: use resolve_text_dir() instead of hardcoded path; lib/processors/__init__.py, lib/acquisition/__init__.py: package inits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:39:42 +00:00
Matt	de2c59a501	Phase 2: add shared filing function (lib/filing.py). New reusable file_processed_item() that future processors will call to file completed items from /opt/recon/data/processing/{hash}/ into the library. Reuses existing organizer logic for domain classification and collision handling. Not yet wired into the service loop; exists as library code for Phase 3+ to call. Phase 2 of the refactor. See https://forge.echo6.co/matt/refactored-recon Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:03:36 +00:00
Matt	563c16bb71	Initial commit: RECON codebase baseline. Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete). Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 14:57:23 +00:00
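
Commit f69c04a0e3 above notes that the enricher comparison `page_count >= 3` fails with TypeError when the column is NULL. A minimal sketch of why this happens and of one possible guard follows. It assumes the NULL reaches Python as `None`, which is how Python database drivers commonly return it. The helper name `has_enough_pages` and its `minimum` parameter are hypothetical and not taken from the repository, where the fix was to set page_count during pre_flight.

```python
def has_enough_pages(page_count, minimum=3):
    """Hypothetical guard: treat a missing (None) page_count as 'not enough pages'."""
    # In Python 3, `None >= 3` raises TypeError, so check for None first.
    if page_count is None:
        return False
    return page_count >= minimum
```

A guard like this only hides the missing value. Populating page_count when the row is registered, as the commit does, keeps the comparison valid at the source.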

7 commits
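
Commit d9aed35fd7 adds dispatch_loop() alongside dispatch_once() for service-thread use. A minimal sketch of how a loop wrapper around a one-shot function might look is given below. The signature, the `stop_event` and `interval` parameters, the injectable `once` argument, and the placeholder dispatch_once() body are all assumptions for illustration, not the repository's actual code.

```python
import threading


def dispatch_once():
    """Placeholder for the real one-shot dispatcher (assumed to do one scan pass)."""
    return 0


def dispatch_loop(stop_event, interval=30.0, once=dispatch_once):
    """Run the one-shot dispatcher repeatedly until stop_event is set.

    Returns the number of passes attempted (illustrative only).
    """
    passes = 0
    while not stop_event.is_set():
        try:
            once()
        except Exception as exc:
            # Keep the service thread alive if a single pass fails.
            print(f"dispatch pass failed: {exc}")
        passes += 1
        # Sleeps up to `interval` seconds, but wakes immediately on stop.
        stop_event.wait(interval)
    return passes
```

A service could start this with `threading.Thread(target=dispatch_loop, args=(stop_event,), daemon=True)` and call `stop_event.set()` on shutdown. Waiting on the event rather than calling `time.sleep()` lets the thread exit without finishing a full interval.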