
matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 06:34:40 +02:00


Matt c60aa5e80d	2026-04-17 02:03:12 +00:00

Phase 2: ZIM processor — batch article ingestion pipeline

Adds lib/processors/zim_processor.py which opens a ZIM file via
python-libzim, iterates HTML articles, strips to clean text (lxml),
and feeds each article into the existing RECON enrichment pipeline.

Key features:
- HTML to text via lxml (strips nav/footer/script/style)
- Filters redirects, non-HTML entries, stubs (<200 chars)
- Content hash dedup against existing catalogue
- Creates processing dirs with page files and meta.json
- Registers articles as "extracted" for automatic enrichment
- Checkpointing via zim_sources.last_checkpoint for resume
- Configurable batch size and delay for rate control
- Standalone CLI: python3 -m lib.processors.zim_processor

Tested: 100 Appropedia articles processed in 3s, enricher picks
them up automatically via the existing pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
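
The filtering, dedup, and checkpoint steps named in the commit message could look roughly like the sketch below. It is not the code in zim_processor.py: the function names, the entry dict keys (path, is_redirect, mimetype, text), the SHA-256 choice, and the whitespace normalisation before hashing are all assumptions made for illustration. Only the 200-character stub threshold comes from the commit message. Entries are plain dicts here so the sketch runs without python-libzim or lxml.

```python
import hashlib

MIN_CHARS = 200  # stub threshold quoted in the commit message


def content_hash(text: str) -> str:
    """Hash whitespace-normalised text (normalisation is an assumption)."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def select_articles(entries, seen_hashes, start_index=0, batch_size=100):
    """Pick one batch of ingestible articles and report where to resume.

    entries      -- sequence of dicts with hypothetical keys
                    'path', 'is_redirect', 'mimetype', 'text'
    seen_hashes  -- set of hashes already in the catalogue (updated in place)
    start_index  -- checkpoint to resume from
    Returns (accepted, next_checkpoint).
    """
    accepted = []
    index = start_index
    while index < len(entries) and len(accepted) < batch_size:
        entry = entries[index]
        index += 1
        if entry["is_redirect"]:
            continue  # redirects carry no content of their own
        if not entry["mimetype"].startswith("text/html"):
            continue  # images, CSS, and other non-HTML entries
        text = entry["text"].strip()
        if len(text) < MIN_CHARS:
            continue  # stub article
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # duplicate of something already catalogued
        seen_hashes.add(digest)
        accepted.append({"path": entry["path"], "hash": digest, "text": text})
    return accepted, index
```

Returning the index of the next unread entry, rather than the count of accepted articles, means a resumed run does not re-scan entries that were already rejected. A caller would persist that value (the commit stores its checkpoint in zim_sources.last_checkpoint) after each batch.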
__init__.py
pdf_processor.py	Fix: Gemini "null" string bug in pdf_processor metadata voting	2026-04-15 23:30:59 +00:00
text_processor.py	Phase 6f: text processor for .txt file ingestion	2026-04-15 22:39:31 +00:00
transcript_processor.py	Phase 6a: transcripts mark organized in-place, skip filing	2026-04-14 22:49:21 +00:00
zim_processor.py	Phase 2: ZIM processor — batch article ingestion pipeline	2026-04-17 02:03:12 +00:00