matt/recon

Mirror of https://github.com/zvx-echo6/recon.git, synced 2026-05-20 06:34:40 +02:00

Author	SHA1	Message	Date
Matt	5f5bcedab9	Fix progress regex and SIGHUP/scan_zims race condition - Parse Browsertrix "crawled":N JSON format instead of "N pages" - Add 3s delay between SIGHUP to kiwix-serve and scan_zims() call so the OPDS catalog is reloaded before we query it for linking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:35:42 +00:00
Matt	9692044790	Fix progress parsing for Browsertrix JSON log format Parse "crawled":N from Browsertrix crawlStatus JSON logs instead of looking for "N pages" pattern. Also check stdout (not just stderr). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:33:50 +00:00
Matt	b035ba3f20	Fix Zimit: add required --name flag for warc2zim warc2zim (called internally by zimit) requires --name for ZIM metadata. Without it, argument validation fails with exit code 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:30:42 +00:00
Matt	76076fc4ab	Fix Zimit CLI: add subcommand, correct flag names, fix container cleanup - Must pass `zimit` as command after image name (entrypoint execs args) - --url → --seeds, --name removed, --lang → --zim-lang, --workers → -w - Remove --rm so docker logs work after exit, manually rm container Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:13:34 +00:00
Matt	8945c82e3f	Replace wget/SingleFile/Playwright backends with Zimit - Zimit Docker container handles all site types (static, SPA, JS redirects) - Removed: _detect_crawl_mode, _crawl_wget, _crawl_singlefile, preflight logic - Added: _crawl_zimit() with Docker lifecycle management - Simplified pipeline: submit → Zimit crawl → kiwix-manage register → done - No more zimwriterfs step — Zimit produces ZIM directly - Dashboard UI simplified: removed crawl mode dropdown - Config simplified: removed reject patterns, preflight, singlefile sections Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:06:23 +00:00
Matt	f0b160ef7c	Extract _full_zim_cleanup helper, add SIGHUP + scrape_jobs cleanup - Extract shared _full_zim_cleanup(source_id) from api_kiwix_remove - Add SIGHUP to kiwix-serve after kiwix-manage remove - Delete linked scrape_jobs rows during ZIM removal - Update api_scraper_delete to do full ZIM cleanup when applicable - Set chromium_path for single-file browser crawl support - Add status.db to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 02:28:49 +00:00
Matt	45c3bb8d56	Add scraper job queue management (delete, clear failed) New API endpoints: DELETE single job, clear all failed/cancelled. Dashboard now shows Delete buttons on completed/failed jobs, Retry+Delete on failed jobs, and a Clear Failed bulk action. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 21:03:39 +00:00
Matt	1ce9a3731f	Add scraper dashboard UI under Kiwix tab New /kiwix/scraper page with submit form (URL, title, language, crawl mode), stats cards, and auto-refreshing jobs table with cancel/retry actions. Kiwix section now has Library/Scraper subnav. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 20:47:17 +00:00
Matt	45b954fccc	Fix ZIM filename collisions by appending job ID Format: {domain}_{lang}_{YYYY-MM}_{job_id}.zim Prevents zimwriterfs failures when the same domain is scraped multiple times in the same month. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 20:17:53 +00:00
Matt	125602fa13	Fix SingleFile CLI: remove invalid --crawl-delay flag SingleFile CLI has no --crawl-delay option. The invalid flag caused the process to print help and exit with no output. Added --crawl-no-parent and --crawl-replace-URLs instead. Removed unused crawl_delay config key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 19:28:03 +00:00
Matt	da50e5f0b8	Add scraper Phase 2: smart crawl mode detection + browser fallback - Pre-flight detection: wget + Playwright probe to auto-detect if site needs browser rendering (JS apps, parking page redirects) - SingleFile CLI crawl backend for JS-rendered sites - crawl_mode column in scrape_jobs (static/browser/redirect/auto) - API: optional crawl_mode param on submit, cleared on retry - Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight + singlefile config sections - Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 18:26:43 +00:00
Matt	b250d0c257	Fix Kiwix download URL generation in embedder - Add /content/ prefix to wiki.echo6.co URLs (required by kiwix-serve) - Stop stripping ZIM flavor/date suffix (e.g. _maxi_2025-11) from filename - Use str.removesuffix instead of regex to strip only .zim extension Before: https://wiki.echo6.co/appropedia_en_all/Article After: https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 00:06:52 +00:00
Matt	fed02186fa	Fix Kiwix status badges to reflect full pipeline state Status was showing COMPLETE after ZIM extraction finished, even when documents were still queued for enrichment/embedding. Now computes effective_status by checking actual pipeline state per-source: - DETECTED: ingest not enabled (gray) - EXTRACTING: ZIM processor running (blue) - PROCESSING: extracted but docs still in enricher/embedder queue (amber) - COMPLETE: all docs fully enriched and embedded in Qdrant (green) Also fixed _build_kiwix_sources pipeline query to filter by category per-source instead of returning global kiwix stats for every source. Progress column now shows "X / Y in Qdrant" when processing, or "X / Y extracted" otherwise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 15:22:44 +00:00
Matt	6f2a1d206e	Add langdetect language filter to enricher + purge non-English ZIM articles - Install langdetect package for content-level language detection - Add _check_language() to enricher.py: reads first 1500 chars of first page, detects language via langdetect, skips if not in allowed list - Configurable via config.yaml pipeline.language_filter and pipeline.allowed_languages (default: en only) - Catches non-English content from ANY source (PDF, web, ZIM, PeerTube) before burning Gemini API quota on enrichment - Add scan_zims retry logic (3 attempts, 2s delay) for upload handler - Purged 6,483 stale non-English zim_articles rows from DB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 14:37:13 +00:00
Matt	501004ecf1	Filter non-English articles from ZIM ingestion Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.) before text extraction to avoid wasting Gemini enrichment on translations. Uses path-based regex matching against ISO 639 language codes. ~5,276 non-English articles already ingested from Appropedia (top: es=837, zh=765, ru=475, fr=433, ko=407). Purge decision deferred.	2026-04-17 07:30:30 +00:00
Matt	2635160887	Kiwix integration: ZIM processor, dashboard tab, wiki.echo6.co citations - ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline - Dashboard: Kiwix tab with library table, ingest toggle, upload, remove - kiwix-serve on port 8430, wiki.echo6.co behind Authentik - Citation URLs point to wiki.echo6.co/{zimname}/{article_path} - Dashboard shows WIKI type badge for ZIM-sourced content - Appropedia EN (19,445 articles) fully ingested as proof of concept	2026-04-17 07:00:24 +00:00
Matt	c60aa5e80d	Phase 2: ZIM processor — batch article ingestion pipeline Adds lib/processors/zim_processor.py which opens a ZIM file via python-libzim, iterates HTML articles, strips to clean text (lxml), and feeds each article into the existing RECON enrichment pipeline. Key features: - HTML to text via lxml (strips nav/footer/script/style) - Filters redirects, non-HTML entries, stubs (<200 chars) - Content hash dedup against existing catalogue - Creates processing dirs with page files and meta.json - Registers articles as "extracted" for automatic enrichment - Checkpointing via zim_sources.last_checkpoint for resume - Configurable batch size and delay for rate control - Standalone CLI: python3 -m lib.processors.zim_processor Tested: 100 Appropedia articles processed in 3s, enricher picks them up automatically via the existing pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 02:03:12 +00:00
Matt	7c1af0f063	Phase 1: Kiwix foundation — ZIM monitor and kiwix-serve setup - Add lib/zim_monitor.py: polls kiwix-serve OPDS v2 catalog, detects new ZIMs, reads accurate article count from python-libzim Counter metadata (not inflated OPDS count), inserts into zim_sources table. Idempotent on re-run, marks removed ZIMs. - DB schema: zim_sources, zim_samples, zim_articles tables (created via sqlite3, not in migrations — matches existing RECON pattern) - kiwix-tools 3.7.0 installed from binary tarball at /opt/recon/bin/ (Ubuntu 24.04 apt ships 3.5.0 which lacks OPDS v2) - kiwix.service systemd unit on port 8430 - python-libzim 3.9.0 installed - Test ZIM: Appropedia EN maxi (496 MB, 19,445 articles) - Add bin/ to .gitignore (binary tarball, not source)	2026-04-16 23:39:34 +00:00
Matt	e6224cb279	Migrate dashboard upload to pipeline with multi-format support Upload handler now writes files to the appropriate hopper subfolder instead of copying directly to /mnt/library/: - .pdf -> acquired/pdf/ - .txt -> acquired/text/ - .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format normalizer converts to PDF before processing) The dispatcher picks up files and routes through the appropriate processor (pdf_processor or text_processor) for full metadata voting, domain classification, and canonical filing. Changes to api_upload() / _process_upload(): - Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI - Routes to correct hopper subfolder by extension - Writes meta.json sidecar with original filename and category hint - Removed: direct library copy, add_to_catalogue, queue_document - Added: hopper-level dedup check (catches rapid re-uploads) - Kept: catalogue dedup check for immediate user feedback Changes to api_upload_status(): - Added fallback: checks acquired/ and processing/ dirs if hash not yet in documents table (covers gap between upload and dispatcher pickup) Template updated: accept attribute and help text now reflect multi-format support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 02:18:45 +00:00
Matt	999cf37626	Fix: Gemini "null" string bug in pdf_processor metadata voting Same fix as text_processor — Gemini sometimes returns the literal string "null" instead of JSON null for empty metadata fields. The voting logic and Gemini extraction now both treat "null" strings as None. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:30:59 +00:00
Matt	f4659d155f	Phase 6f-2: format normalizer in dispatcher Adds _normalize_formats() to the dispatcher that converts non-standard document formats to PDF before dispatch. Supports: - .epub, .mobi -> PDF via ebook-convert (Calibre) - .doc, .docx -> PDF via LibreOffice headless Called per-subfolder before _find_pairs() so _find_pairs() only ever sees standard content files. Conversion failures are logged and skipped -- the original file stays in acquired/ for manual review. Also converts 3 staged epub files and cleans up _staging/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:08:19 +00:00
Matt	62539861f2	Phase 6f: text processor for .txt file ingestion New processor: lib/processors/text_processor.py Handles plain text files (.txt) as primary source documents. Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight() -> enrich -> embed -> filing worker -> library/Domain/Subdomain/ Metadata extraction via two-source vote: - Source A: filename parsing (title from filename) - Source B: Gemini LLM extraction (title/author/edition/year from first 3 pages of text) Page splitting reuses chunk_text() from lib/web_scraper.py. Filing behavior matches PDFs (files to library, not organized in-place like transcripts). Config: adds text: text_processor to pipeline.dispatch map. New hopper subfolder: data/acquired/text/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 22:39:31 +00:00
Matt	7fe7d03583	Revert "Phase 6e: rewire dashboard PeerTube endpoint to acquisition module" This reverts commit `7e42528d2f`.	2026-04-15 03:20:46 +00:00
Matt	7e42528d2f	Phase 6e: rewire dashboard PeerTube endpoint to acquisition module Replace legacy ingest_channel/ingest_all imports with acquire_batch from lib.acquisition.peertube. The endpoint now writes flat file pairs to the hopper and lets the dispatcher handle processing, matching the Phase 6d architecture. Removes channel/since/process parameters that were tied to the old direct-ingest path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 03:15:41 +00:00
Matt	277110d999	Phase 6d: PeerTube acquisition module + service thread New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop. Polls PeerTube API every 30min, dedupes against catalogue (UUID + title), writes flat file pairs to data/acquired/stream/ for the dispatcher. - acquire_batch(): one-shot find-and-acquire with rate limiting - acquisition_loop(): service thread wrapper (interval from config) - list_new_videos(): dedup via _build_known_sets() against catalogue - acquire_one(): fetch VTT, convert, write .tmp then rename atomically cmd_service(): added peertube-acq daemon thread cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/ --since/--enrich/--process (dispatcher handles full pipeline) config.yaml: added peertube.poll_interval: 1800 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 03:08:51 +00:00
Matt	efae4023f6	Phase 6c: remove vestigial extract worker, dead crawler, .bak files recon.py: - Remove extract stage_loop thread from cmd_service(). Confirmed vestigial: 0 queued items, silent logs over 24+ hour run. The new processors do extraction inline in pre_flight(). - Remove cmd_crawl CLI subcommand and its argparse registration. - Clean up associated imports and variables. Deleted: - lib/crawler.py (432 lines) -- old web crawler subsystem, only referenced by the removed CLI subcommand. - 24 .bak files (untracked pre-edit safety backups, originals preserved in git history). Investigation found the four old loop function definitions (scanner_loop, peertube_scanner_loop, crawler_scheduler_loop, organizer_loop) were already deleted in Phase 5c-1. Modules investigated and KEPT: - lib/web_scraper.py -- exports chunk_text() used by transcript_processor - lib/new_pipeline.py -- active Stream B library management CLI tool - lib/peertube_scraper.py -- only mechanism for transcript ingestion - lib/extractor.py -- would activate for new PDFs via cmd_run CLI Service restart verified: 6 threads (dispatcher, enrich, embed, filing, progress, dashboard), no extract worker, zero errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 23:46:00 +00:00
Matt	70b80cb312	Phase 6b: fix dashboard Untitled/WEB bug for transcripts Two bugs in the Recently Completed table: 1. Title showed "Untitled" for all transcripts because the dashboard read documents.book_title (populated by PDF metadata voting) which is NULL for transcripts. Fixed by COALESCE(book_title, filename) in the SQL query -- falls back to catalogue.filename which holds the real video title. 2. Type showed "WEB" for all transcripts because the type CASE expression only had web and pdf branches, with web matching any http% path -- and transcript paths are PeerTube watch URLs. Fixed by adding a transcript branch keyed on catalogue.source = stream.echo6.co, evaluated before the web branch. Also adds badge-transcript CSS (purple) and JS rendering case. Applied consistently to both the Recently Completed and Sources table queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 23:05:29 +00:00
Matt	df29d598d3	Phase 6a: transcripts mark organized in-place, skip filing Transcripts are derived text from PeerTube videos, not primary source files. They do not belong in library/Domain/Subdomain/ like PDFs. Change: transcript_processor.pre_flight() now sets organized_at = CURRENT_TIMESTAMP at the end of successful processing, marking the transcript as organized in place. The watch URL remains in catalogue.path and Qdrant download_url so users clicking search results go to the PeerTube video. The filing worker's path LIKE filter naturally excludes transcripts since their documents.path is the watch URL, not a filesystem path. No filing worker changes needed. Back-fills 2,260 drain items from Phase 5c-2 via one-time SQL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 22:49:21 +00:00
Matt	9fa60f9c86	Fix: stale cleanup in processors must fail loudly on permission errors Phase 5c-2 failed because shutil.rmtree(ignore_errors=True) silently failed to clean up root-owned legacy files in processing/{hash}/, letting the processor proceed into a half-cleaned directory and then crash on subsequent file writes. Changes: removed ignore_errors=True, wrapped in try/except that logs and re-raises, so the processor fails early and visibly if stale cleanup fails. Recovery from Phase 5c-2 failure.	2026-04-14 20:15:48 +00:00
Matt	d9aed35fd7	Phase 5c-1: dispatcher loop, filing worker loop, service rewire Adds dispatch_loop() alongside dispatch_once() for service-thread use. Adds filing_worker_loop() that watches for status=complete items in /opt/recon/data/processing/ and files them to library/Domain/Subdomain/. Rewires cmd_service() to run the new architecture: - Removed: scanner_loop, peertube_scanner_loop, crawler_scheduler_loop, organizer_loop (all replaced by dispatcher + new filing worker) - Kept: enrich and embed stage workers, progress, dashboard - Kept (vestigial): extract stage worker — will be removed in Phase 6 cleanup - Added: dispatcher loop thread, filing worker thread Phase 5c-1 of the refactor. Service not yet started — Phase 5c-2 will do that. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 18:30:58 +00:00
Matt	96e1e642c4	Phase 4: PDF processor with layered metadata extraction - Add lib/processors/pdf_processor.py with full pre_flight pipeline - Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini) - Field-by-field voting with provenance tracking (metadata_provenance column) - Level-4 strict dedupe (title+author+edition+year) - Content failures route to _review/rejected_pdfs/ - Level-4 duplicates route to _review/duplicate_quarantine/ - Full text extraction using existing extract_text_from_page fallback chain - Schema: added metadata_provenance TEXT to documents table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:57:44 +00:00
Matt	9fe6a0a782	Phase 4: Phase 3 cleanup fixes Fix 1.1: filing preserves source file extension instead of defaulting to .pdf Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78) Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry Also: dispatcher now handles solo content files without .meta.json sidecar Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 16:39:57 +00:00
Matt	f69c04a0e3	Phase 3: fix page_count in transcript processor Set page_count on documents row during pre_flight. Without this, enricher comparison `page_count >= 3` fails with TypeError on NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:43:21 +00:00
Matt	66fadb7487	Phase 3: dispatcher, transcript processor, text_dir resolution - lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/ for content+sidecar pairs and routes to registered processors - lib/processors/transcript_processor.py: pre_flight() for transcripts (hash, dedupe, split into pages, register in DB, set text_dir) - lib/utils.py: resolve_text_dir() helper for text_dir column fallback - lib/enricher.py: use resolve_text_dir() instead of hardcoded path - lib/embedder.py: use resolve_text_dir() instead of hardcoded path - lib/processors/__init__.py, lib/acquisition/__init__.py: package inits Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:39:42 +00:00
Matt	de2c59a501	Phase 2: add shared filing function (lib/filing.py) New reusable file_processed_item() that future processors will call to file completed items from /opt/recon/data/processing/{hash}/ into the library. Reuses existing organizer logic for domain classification and collision handling. Not yet wired into the service loop — exists as library code for Phase 3+ to call. Phase 2 of the refactor. See https://forge.echo6.co/matt/refactored-recon Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 15:03:36 +00:00
Matt	563c16bb71	Initial commit: RECON codebase baseline Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete). Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 14:57:23 +00:00

36 commits
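
Two of the fixes in the log above are small enough to sketch in isolation: parsing the Browsertrix "crawled":N progress field (5f5bcedab9, 9692044790) and building the kiwix-serve article URL with the /content/ prefix (b250d0c257). The snippet below is a minimal illustration under stated assumptions, not the repository's code. The function names parse_crawled_count and kiwix_content_url, their parameters, and the sample log line are hypothetical; only the "crawled":N pattern, the /content/ prefix, the use of str.removesuffix, and the example URL come from the commit messages.

```python
import re
from typing import Optional

# Matches the "crawled":N field described in the commit message.
# The surrounding JSON layout is not known here, so only this field is assumed.
CRAWLED_RE = re.compile(r'"crawled"\s*:\s*(\d+)')


def parse_crawled_count(log_line: str) -> Optional[int]:
    """Return the crawled-page count from one log line, or None if absent."""
    match = CRAWLED_RE.search(log_line)
    return int(match.group(1)) if match else None


def kiwix_content_url(base_url: str, zim_filename: str, article_path: str) -> str:
    """Build an article URL, keeping the ZIM flavor/date suffix intact.

    Only the .zim extension is stripped (str.removesuffix, Python 3.9+),
    and the /content/ prefix is inserted before the ZIM name.
    """
    zim_name = zim_filename.removesuffix(".zim")
    return f"{base_url.rstrip('/')}/content/{zim_name}/{article_path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical log line; the real Browsertrix output may carry more fields.
    sample = '{"context":"crawlStatus","details":{"crawled":12,"total":40}}'
    print(parse_crawled_count(sample))
    print(kiwix_content_url(
        "https://wiki.echo6.co",
        "appropedia_en_all_maxi_2025-11.zim",
        "Article",
    ))
```

Searching for the field with a regex rather than decoding each line as JSON tolerates non-JSON lines mixed into stdout or stderr, which is one plausible reason a regex was used; the commit messages do not say.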