Commit graph

3 commits

Author SHA1 Message Date
5f36c52fb1 Phase 2: documents table schema for domain assignment
Adds four columns to documents table via idempotent ALTER TABLE
migrations: recon_domain, recon_domain_status, recon_domain_assigned_at,
peertube_category_pushed_at. Adds index on recon_domain_status.
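
A minimal sketch of what an idempotent ALTER TABLE migration like the one described above could look like. It is not the commit's actual code. It assumes a SQLite backend accessed through Python's sqlite3 module; the TEXT column types, the index name, and the migrate_documents function name are hypothetical. Only the four column names and the indexed column come from the commit message.

```python
import sqlite3

# Column names come from the commit message; the TEXT types are assumptions.
NEW_COLUMNS = {
    "recon_domain": "TEXT",
    "recon_domain_status": "TEXT",
    "recon_domain_assigned_at": "TEXT",
    "peertube_category_pushed_at": "TEXT",
}


def migrate_documents(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns and the status index; return the columns added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {sql_type}")
            added.append(name)
    # Hypothetical index name; IF NOT EXISTS makes this statement safe to repeat.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_documents_recon_domain_status "
        "ON documents (recon_domain_status)"
    )
    conn.commit()
    return added


# Demo on an in-memory database with a stand-in documents table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
first_run = migrate_documents(conn)   # adds all four columns
second_run = migrate_documents(conn)  # finds nothing left to add
```

Checking the existing columns before each ALTER TABLE is what makes the migration safe to run on every startup: a second run finds every column present and changes nothing.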

Includes StatusDB helper methods: get/set_domain_assignment,
set_peertube_pushed, get_unpushed_assignments, get_items_by_domain_status,
get_domain_status_counts, get_domain_distribution.
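
The helpers listed above could plausibly be thin query wrappers. The sketch below illustrates two of them as free functions over a sqlite3 connection. Only the function and column names come from the commit message; the signatures, the SQL, the status values ("assigned", "pending"), and the sample rows are assumptions made for illustration.

```python
import sqlite3


def get_domain_status_counts(conn):
    """Return {status: document count}, grouped by recon_domain_status."""
    rows = conn.execute(
        "SELECT recon_domain_status, COUNT(*) FROM documents "
        "GROUP BY recon_domain_status"
    )
    return {status: count for status, count in rows}


def get_unpushed_assignments(conn):
    """Return (id, recon_domain) for assigned documents not yet pushed."""
    rows = conn.execute(
        "SELECT id, recon_domain FROM documents "
        "WHERE recon_domain IS NOT NULL "
        "AND peertube_category_pushed_at IS NULL "
        "ORDER BY id"
    )
    return rows.fetchall()


# Demo data: hypothetical domains and status values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, recon_domain TEXT, recon_domain_status TEXT, "
    "recon_domain_assigned_at TEXT, peertube_category_pushed_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
    [
        (1, "water", "assigned", "2026-04-27T12:00:00", None),
        (2, "medical", "assigned", "2026-04-27T12:05:00", "2026-04-28T00:00:00"),
        (3, None, "pending", None, None),
    ],
)
counts = get_domain_status_counts(conn)
unpushed = get_unpushed_assignments(conn)
```

Under these assumptions, the index on recon_domain_status is what would keep the grouped count cheap as the table grows.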

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 00:04:29 +00:00
da50e5f0b8 Add scraper Phase 2: smart crawl mode detection + browser fallback
- Pre-flight detection: wget + Playwright probe to auto-detect if site
  needs browser rendering (JS apps, parking page redirects)
- SingleFile CLI crawl backend for JS-rendered sites
- crawl_mode column in scrape_jobs (static/browser/redirect/auto)
- API: optional crawl_mode param on submit, cleared on retry
- Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight
  + singlefile config sections
- Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 18:26:43 +00:00
563c16bb71 Initial commit: RECON codebase baseline
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 14:57:23 +00:00