recon/lib/processors
Matt 501004ecf1 Filter non-English articles from ZIM ingestion
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.)
before text extraction to avoid wasting Gemini enrichment on translations.
Uses path-based regex matching against ISO 639 language codes.
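
A minimal sketch of the path-based check described above, not the actual code in zim_processor.py. The names `LANG_CODES` and `is_translation_path` are hypothetical, the code list is an illustrative subset of ISO 639-1, and the `A/...` paths are made-up examples:

```python
import re

# Hypothetical, illustrative subset of ISO 639-1 codes; the list used by the
# real processor is not shown in this listing. "en" is left out on purpose so
# that English pages are never filtered.
LANG_CODES = ("es", "fr", "pl", "zh", "ru", "ko", "de", "it", "pt", "ja")

# Match a final path segment that is exactly one of the language codes,
# e.g. ".../Solar_cooker/es". The match is case-sensitive, on the assumption
# that translation suffixes are lowercase.
_TRANSLATION_SUFFIX = re.compile(r"/(?:%s)$" % "|".join(LANG_CODES))


def is_translation_path(path: str) -> bool:
    """Return True if the article path ends in a '/<language code>' segment."""
    return _TRANSLATION_SUFFIX.search(path.rstrip("/")) is not None


def english_only(paths):
    """Yield only the paths that do not look like translation subpages."""
    for path in paths:
        if not is_translation_path(path):
            yield path
```

Because the check looks only at the path, it can run before any text extraction or enrichment call. Anchoring on the leading slash keeps a title such as `A/Trees` from matching just because it ends in "es".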

~5,276 non-English articles already ingested from Appropedia (top: es=837,
zh=765, ru=475, fr=433, ko=407). Purge decision deferred.
2026-04-17 07:30:30 +00:00
__init__.py
pdf_processor.py Fix: Gemini "null" string bug in pdf_processor metadata voting 2026-04-15 23:30:59 +00:00
text_processor.py Phase 6f: text processor for .txt file ingestion 2026-04-15 22:39:31 +00:00
transcript_processor.py Phase 6a: transcripts mark organized in-place, skip filing 2026-04-14 22:49:21 +00:00
zim_processor.py Filter non-English articles from ZIM ingestion 2026-04-17 07:30:30 +00:00