Commit graph

4 commits

Author SHA1 Message Date
Ubuntu
50c54d2a72 Switch wiki wave3 to gemini-2.5-flash-lite, 10 workers
- Model: gemini-2.5-flash -> gemini-2.5-flash-lite (6x cheaper output)
- Workers: 5 -> 10 for better throughput

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-05-01 14:44:01 +00:00
Ubuntu
5d618da2a4 Add wiki_index_wave3.py with parallel resolve
Wave 3 pipeline for processing 253K+ place types with NO wiki/wikidata
tags (US+CA only). Uses Gemini to resolve Wikipedia titles.

Key feature: resolve_wikipedia_titles() now uses ThreadPoolExecutor
with 5 parallel workers, improving throughput from ~14/min to ~75/min.
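
The parallel resolve described above could look roughly like the sketch below. It is a minimal illustration, not the code in wiki_index_wave3.py: the names resolve_titles_parallel and fake_resolver are hypothetical, and the resolver is a local stand-in for the Gemini call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def resolve_titles_parallel(
    places: Iterable[str],
    resolve_one: Callable[[str], Optional[str]],
    max_workers: int = 5,
) -> Dict[str, Optional[str]]:
    """Resolve each place name concurrently; a failed lookup maps to None."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its input so results can be keyed by place.
        futures = {pool.submit(resolve_one, place): place for place in places}
        for future in as_completed(futures):
            place = futures[future]
            try:
                results[place] = future.result()
            except Exception:
                # One bad lookup should not abort the whole batch.
                results[place] = None
    return results


def fake_resolver(name: str) -> Optional[str]:
    """Hypothetical stand-in for the model call: builds a title-like string."""
    if not name:
        raise ValueError("empty place name")
    return name.replace(" ", "_")


if __name__ == "__main__":
    print(resolve_titles_parallel(["Central Park", "Mount Hood"], fake_resolver))
```

Threads suit this workload because each lookup spends most of its time waiting on the network, so raising max_workers (5 here, 10 in the later commit) mainly trades API rate-limit headroom for throughput.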

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-30 21:37:51 +00:00
6be1e4cfa6 feat(wiki-index): add wave 2 pipeline for wikidata-only places
Processes places with wikidata but no wikipedia tag:
- Batch resolve Q-IDs via Wikidata API (50/request)
- Validate resolved titles against local ZIM
- Generate summaries with Gemini API (3-4 sentences)
- Circuit breaker: 50 consecutive HTTP 429 responses trigger a 5-minute pause
- Revalidate any remaining unvalidated entries
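
The circuit breaker in the list above can be sketched as a small counter object. This is an assumed shape, not the pipeline's actual implementation: the class name RateLimitBreaker, its record method, and the injectable sleep function are all hypothetical, with the threshold (50) and pause (300 seconds) taken from the commit message.

```python
import time
from typing import Callable


class RateLimitBreaker:
    """Pause after too many consecutive HTTP 429 responses."""

    def __init__(
        self,
        threshold: int = 50,
        pause_seconds: float = 300.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self._sleep = sleep  # injectable so tests do not really wait
        self.consecutive_429s = 0

    def record(self, status_code: int) -> bool:
        """Record one response; return True if a pause was triggered."""
        if status_code != 429:
            # Any non-429 response breaks the streak.
            self.consecutive_429s = 0
            return False
        self.consecutive_429s += 1
        if self.consecutive_429s >= self.threshold:
            self._sleep(self.pause_seconds)
            self.consecutive_429s = 0  # start counting afresh after the pause
            return True
        return False
```

A caller would invoke record(response_status) after every API request. Counting only consecutive 429s means an occasional rate-limit response is left to ordinary retry logic, while a sustained run halts all traffic for the pause window.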

Filters for US+CA places, skips existing wave 1 entries.
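
As a rough illustration of the filtering here and the 50-per-request batching from the first bullet, the sketch below selects candidate places and splits their Q-IDs into request-sized groups. The record fields (id, country, wikidata, wikipedia) and both function names are assumptions made for the example; the real pipeline's data layout is not shown in this log.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, List, Set


def select_candidates(
    places: Iterable[Dict[str, str]],
    wave1_ids: Set[str],
    countries: FrozenSet[str] = frozenset({"US", "CA"}),
) -> List[Dict[str, str]]:
    """Keep US/CA places that have a wikidata tag, no wikipedia tag,
    and were not already handled in wave 1."""
    return [
        p for p in places
        if p.get("country") in countries
        and p.get("wikidata")
        and not p.get("wikipedia")
        and p["id"] not in wave1_ids
    ]


def batched_qids(qids: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` Q-IDs, one per request."""
    for start in range(0, len(qids), size):
        yield qids[start:start + size]


if __name__ == "__main__":
    qids = [f"Q{i}" for i in range(1, 121)]
    for batch in batched_qids(qids):
        # The Wikidata API takes multiple ids as a pipe-separated string.
        print(len(batch), "|".join(batch[:3]), "...")
```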

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-29 19:50:43 +00:00
563c16bb71 Initial commit: RECON codebase baseline
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 14:57:23 +00:00