Wave 3 pipeline for processing 253K+ place types with no wikipedia or
wikidata tags (US+CA only). Uses Gemini to resolve Wikipedia titles.
Key feature: resolve_wikipedia_titles() now uses ThreadPoolExecutor
with 5 parallel workers, improving throughput from ~14/min to ~75/min.
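
A minimal sketch of the parallel pattern described above, assuming each
lookup is an independent blocking call. Only ThreadPoolExecutor and the
worker count of 5 come from this commit; resolve_title() and
resolve_titles_parallel() are hypothetical stand-ins, and the real
resolve_wikipedia_titles() may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 5  # worker count stated in the commit message


def resolve_title(place):
    """Hypothetical stand-in for one Gemini lookup; returns a title or None."""
    name = place.get("name")
    return name.replace(" ", "_") if name else None


def resolve_titles_parallel(places, resolve_one=resolve_title,
                            max_workers=MAX_WORKERS):
    """Resolve titles concurrently; returns {place id: title or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its place id so results can be keyed
        # correctly even though they complete out of order.
        futures = {pool.submit(resolve_one, p): p["id"] for p in places}
        for future in as_completed(futures):
            place_id = futures[future]
            try:
                results[place_id] = future.result()
            except Exception:
                # One failed lookup should not abort the whole batch.
                results[place_id] = None
    return results
```

Threads suit this workload because each call spends most of its time
waiting on the network rather than holding the interpreter.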
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Processes places with wikidata but no wikipedia tag:
- Batch resolve Q-IDs via Wikidata API (50/request)
- Validate resolved titles against local ZIM
- Generate summaries with Gemini API (3-4 sentences)
- Circuit breaker: 50 consecutive HTTP 429 responses trigger a 5-minute pause
- Revalidate any remaining unvalidated entries
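
The batch step in the list above could be sketched as follows. This is
an illustration under assumptions, not the pipeline's code: it assumes
the Wikidata wbgetentities action with English Wikipedia (enwiki)
sitelinks, and the helper names and the injected fetch callable are
hypothetical. Only the batch size of 50 comes from this commit.

```python
BATCH_SIZE = 50  # per-request batch size stated in the commit message


def chunked(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(qids):
    """Query parameters for one wbgetentities request (enwiki is assumed)."""
    return {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "sitelinks",
        "sitefilter": "enwiki",
        "format": "json",
    }


def extract_titles(response):
    """Map each Q-ID in a response to its enwiki title, or None if absent."""
    titles = {}
    for qid, entity in response.get("entities", {}).items():
        link = entity.get("sitelinks", {}).get("enwiki")
        titles[qid] = link["title"] if link else None
    return titles


def resolve_qids(qids, fetch):
    """Resolve Q-IDs in batches; `fetch` takes params and returns parsed JSON."""
    titles = {}
    for batch in chunked(qids):
        titles.update(extract_titles(fetch(build_params(batch))))
    return titles
```

Passing fetch in as a callable keeps the batching logic testable
without network access.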
Filters for US+CA places and skips existing wave 1 entries.
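
The circuit breaker listed above can be illustrated with a small
counter. This is a sketch under assumptions: the class name, method
names, and the injectable sleep function are hypothetical, and only the
threshold (50 consecutive 429s) and pause (5 minutes) come from this
commit.

```python
import time


class RateLimitBreaker:
    """Pause after a run of consecutive HTTP 429 responses."""

    def __init__(self, threshold=50, pause_seconds=300, sleep=time.sleep):
        self.threshold = threshold          # consecutive 429s before pausing
        self.pause_seconds = pause_seconds  # 5 minutes by default
        self.sleep = sleep                  # injectable so tests need not wait
        self.consecutive_429s = 0
        self.trips = 0

    def record(self, status_code):
        """Record one response status; return True if a pause was taken."""
        if status_code != 429:
            # Any other response breaks the streak.
            self.consecutive_429s = 0
            return False
        self.consecutive_429s += 1
        if self.consecutive_429s < self.threshold:
            return False
        self.trips += 1
        self.sleep(self.pause_seconds)
        self.consecutive_429s = 0  # start counting afresh after the pause
        return True
```

Calling record() after every API response is enough to drive it; the
counter resets on any non-429 status, so only an unbroken run of rate
limit errors triggers the pause.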
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config sets new_pipeline.enabled=false and crawler.sites=[] per the refactor plan.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>