matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 14:44:54 +02:00

Author	SHA1	Message	Date
Matt	9692044790	Fix progress parsing for Browsertrix JSON log format. Parse "crawled":N from Browsertrix crawlStatus JSON logs instead of looking for an "N pages" pattern. Also check stdout (not just stderr). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:33:50 +00:00
Matt	b035ba3f20	Fix Zimit: add required --name flag for warc2zim. warc2zim (called internally by zimit) requires --name for ZIM metadata. Without it, argument validation fails with exit code 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:30:42 +00:00
Matt	76076fc4ab	Fix Zimit CLI: add subcommand, correct flag names, fix container cleanup. Must pass `zimit` as the command after the image name (the entrypoint execs its args); --url → --seeds, --name removed, --lang → --zim-lang, --workers → -w; remove --rm so docker logs works after exit, then manually rm the container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:13:34 +00:00
Matt	8945c82e3f	Replace wget/SingleFile/Playwright backends with Zimit. Zimit Docker container handles all site types (static, SPA, JS redirects). Removed: _detect_crawl_mode, _crawl_wget, _crawl_singlefile, preflight logic. Added: _crawl_zimit() with Docker lifecycle management. Simplified pipeline: submit → Zimit crawl → kiwix-manage register → done. No more zimwriterfs step: Zimit produces ZIM directly. Dashboard UI simplified: removed crawl mode dropdown. Config simplified: removed reject patterns, preflight, singlefile sections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:06:23 +00:00
Matt	45b954fccc	Fix ZIM filename collisions by appending job ID. Format: {domain}_{lang}_{YYYY-MM}_{job_id}.zim. Prevents zimwriterfs failures when the same domain is scraped multiple times in the same month. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 20:17:53 +00:00
Matt	125602fa13	Fix SingleFile CLI: remove invalid --crawl-delay flag. SingleFile CLI has no --crawl-delay option. The invalid flag caused the process to print help and exit with no output. Added --crawl-no-parent and --crawl-replace-URLs instead. Removed unused crawl_delay config key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 19:28:03 +00:00
Matt	da50e5f0b8	Add scraper Phase 2: smart crawl mode detection + browser fallback. Pre-flight detection: wget + Playwright probe to auto-detect if site needs browser rendering (JS apps, parking page redirects). SingleFile CLI crawl backend for JS-rendered sites. crawl_mode column in scrape_jobs (static/browser/redirect/auto). API: optional crawl_mode param on submit, cleared on retry. Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight + singlefile config sections. Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 18:26:43 +00:00
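
The progress-parsing fix in 9692044790 extracts the `"crawled":N` value from JSON log lines rather than matching an "N pages" string. The following is a minimal sketch of that idea, not the repository's code: the function names `parse_crawled_count` and `latest_progress` are hypothetical, and the sample lines in the usage note are made-up illustrations that contain only the `"crawled":N` fragment the commit message mentions.

```python
import re
from typing import Iterable, Optional

# Matches the "crawled":N fragment named in the commit message.
# Whitespace around the colon is tolerated in case the log is pretty-printed.
_CRAWLED_RE = re.compile(r'"crawled"\s*:\s*(\d+)')


def parse_crawled_count(line: str) -> Optional[int]:
    """Return the crawled-page count found in one log line, or None."""
    match = _CRAWLED_RE.search(line)
    return int(match.group(1)) if match else None


def latest_progress(lines: Iterable[str]) -> Optional[int]:
    """Return the most recent crawled count seen across the given lines.

    The caller is assumed to pass lines from both stdout and stderr,
    since the commit notes that both streams are checked.
    """
    latest = None
    for line in lines:
        count = parse_crawled_count(line)
        if count is not None:
            latest = count
    return latest
```

For example, `latest_progress(['{"details":{"crawled":3}}', 'noise', '{"details":{"crawled":7}}'])` returns `7`, while a line such as `"5 pages"` yields no match.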

7 commits
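
The filename format from 45b954fccc, `{domain}_{lang}_{YYYY-MM}_{job_id}.zim`, can be illustrated with a short sketch. The helper name `zim_filename`, its parameter types, and the example values are assumptions made for illustration; only the format string itself comes from the commit message.

```python
from datetime import date


def zim_filename(domain: str, lang: str, job_id: int, when: date) -> str:
    """Build a ZIM filename as {domain}_{lang}_{YYYY-MM}_{job_id}.zim.

    Appending job_id keeps names unique when the same domain is
    scraped more than once in the same month.
    """
    return f"{domain}_{lang}_{when:%Y-%m}_{job_id}.zim"


# Two jobs for the same domain in the same month no longer collide.
first = zim_filename("example.com", "eng", 41, date(2026, 4, 18))
second = zim_filename("example.com", "eng", 42, date(2026, 4, 18))
print(first)   # example.com_eng_2026-04_41.zim
print(second)  # example.com_eng_2026-04_42.zim
```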