mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-05-20 06:34:40 +02:00
Adds lib/processors/zim_processor.py, which opens a ZIM file via python-libzim, iterates HTML articles, strips to clean text (lxml), and feeds each article into the existing RECON enrichment pipeline.

Key features:
- HTML to text via lxml (strips nav/footer/script/style)
- Filters redirects, non-HTML entries, stubs (<200 chars)
- Content hash dedup against existing catalogue
- Creates processing dirs with page files and meta.json
- Registers articles as "extracted" for automatic enrichment
- Checkpointing via zim_sources.last_checkpoint for resume
- Configurable batch size and delay for rate control
- Standalone CLI: python3 -m lib.processors.zim_processor

Tested: 100 Appropedia articles processed in 3s, enricher picks them up automatically via the existing pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
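The filtering steps named in the commit message (strip nav/footer/script/style, drop redirects, non-HTML entries and stubs under 200 characters, then dedup by content hash) could look roughly like the sketch below. It is not the code from zim_processor.py. To stay dependency-free it swaps lxml for the standard-library `html.parser`, and it assumes SHA-256 for the content hash; the names `html_to_text`, `content_hash`, `select_articles` and the `(path, is_redirect, mimetype, html)` tuple shape are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Tags whose contents are dropped, as listed in the commit message.
SKIP_TAGS = {"nav", "footer", "script", "style"}
# Stub threshold from the commit message: articles under 200 chars are skipped.
MIN_CHARS = 200


class _TextExtractor(HTMLParser):
    """Collects text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (stdlib stand-in for lxml)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def content_hash(text):
    """Hash of the cleaned text; SHA-256 is an assumption, not from the source."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_articles(entries, seen_hashes):
    """Yield (path, text, digest) for entries that pass every filter.

    `entries` is a hypothetical iterable of (path, is_redirect, mimetype, html)
    tuples; `seen_hashes` is a set standing in for the existing catalogue.
    """
    for path, is_redirect, mimetype, html in entries:
        if is_redirect or not mimetype.startswith("text/html"):
            continue
        text = html_to_text(html)
        if len(text) < MIN_CHARS:
            continue
        digest = content_hash(text)
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield path, text, digest
```

Hashing the cleaned text rather than the raw HTML means two entries that differ only in navigation or script markup count as duplicates, which is one plausible reading of "content hash dedup".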
||
|---|---|---|
| acquisition | ||
| processors | ||
| __init__.py | ||
| api.py | ||
| dispatcher.py | ||
| embedder.py | ||
| enricher.py | ||
| extractor.py | ||
| filing.py | ||
| ingester.py | ||
| key_manager.py | ||
| new_pipeline.py | ||
| organizer.py | ||
| peertube_collector.py | ||
| peertube_scraper.py | ||
| status.py | ||
| utils.py | ||
| web_scraper.py | ||
| zim_monitor.py | ||