recon/lib/processors
Matt 501004ecf1 Filter non-English articles from ZIM ingestion
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.)
before text extraction to avoid wasting Gemini enrichment on translations.
Uses path-based regex matching against ISO 639 language codes.

~5,276 non-English articles already ingested from Appropedia (top: es=837,
zh=765, ru=475, fr=433, ko=407). Purge decision deferred.
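
A minimal sketch of the path-based check described above, not the actual code in zim_processor.py. The names `LANG_CODES`, `_TRANSLATION_SUFFIX`, and `is_translation` are hypothetical, and the code list is a small illustrative subset rather than the full ISO 639 set.

```python
import re

# Hypothetical subset of two-letter language codes; the real filter
# presumably covers a much larger ISO 639 list.
LANG_CODES = ("es", "fr", "pl", "zh", "ru", "ko", "de", "it", "pt", "ja")

# Match a final path segment that is exactly one language code,
# e.g. "Solar_cooker/es" but not "Solar_cooker/essay".
_TRANSLATION_SUFFIX = re.compile(r"/(?:%s)$" % "|".join(LANG_CODES))


def is_translation(path: str) -> bool:
    """Return True if the article path ends in a translation suffix."""
    # Drop trailing slashes so "Title/zh/" is treated like "Title/zh".
    return _TRANSLATION_SUFFIX.search(path.rstrip("/")) is not None
```

Checking the path before text extraction means a skipped article costs one regex match and no enrichment call. One limitation of this sketch: an English subpage whose last segment happens to equal a language code would also be skipped.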
2026-04-17 07:30:30 +00:00
__init__.py
pdf_processor.py Fix: Gemini "null" string bug in pdf_processor metadata voting 2026-04-15 23:30:59 +00:00
text_processor.py
transcript_processor.py
zim_processor.py Filter non-English articles from ZIM ingestion 2026-04-17 07:30:30 +00:00