Mirror of https://github.com/zvx-echo6/recon.git, synced 2026-05-20 14:44:54 +02:00
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.) before text extraction, so that Gemini enrichment is not spent on translations. Matching uses a path-based regex against ISO 639 language codes. About 5,276 non-English articles were already ingested from Appropedia (most common: es=837, zh=765, ru=475, fr=433, ko=407). The decision on purging them is deferred.

| Name |
|---|
| `__init__.py` |
| `pdf_processor.py` |
| `text_processor.py` |
| `transcript_processor.py` |
| `zim_processor.py` |
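
The translation-suffix skip described in the commit message could look roughly like the sketch below. It is not the code from this repository: the names `LANG_CODES`, `TRANSLATION_SUFFIX`, `is_translation`, and `filter_english` are hypothetical, and the language list is a small illustrative subset of ISO 639-1 codes rather than the list the project actually uses.

```python
import re

# Illustrative subset of ISO 639-1 codes (assumption: the real list used
# by the project is not shown in the source and is likely longer).
LANG_CODES = ("de", "es", "fr", "it", "ja", "ko", "pl", "pt", "ru", "zh")

# Match only when the final path segment is exactly one language code,
# e.g. "Solar_cooker/es". The leading "/" keeps titles that merely end
# in the same letters (such as "Recipes") from matching.
TRANSLATION_SUFFIX = re.compile(r"/(?:%s)/?$" % "|".join(LANG_CODES))


def is_translation(path: str) -> bool:
    """Return True if the article path ends in a translation suffix."""
    return TRANSLATION_SUFFIX.search(path) is not None


def filter_english(paths):
    """Drop translated articles before any text extraction happens."""
    return [p for p in paths if not is_translation(p)]


if __name__ == "__main__":
    sample = ["Solar_cooker", "Solar_cooker/es", "Rainwater_harvesting/zh", "Recipes"]
    print(filter_english(sample))  # ['Solar_cooker', 'Recipes']
```

Filtering on the path alone means a translated article is rejected before its body is parsed or sent for enrichment. The trade-off is that an English article whose last path segment happens to be a two-letter code would also be skipped.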