Commit graph

5 commits

Author SHA1 Message Date
71e3dc12ed Phase 1: PeerTube plugin and recon_domains module
Single source of truth for the 18 RECON knowledge domains mapped to
PeerTube category IDs 100-117. Replaces duplicate VALID_DOMAINS sets
in enricher.py and embedder.py with imports from lib/recon_domains.py.

Includes PeerTube plugin (peertube-plugin-recon-domains) that registers
custom categories via videoCategoryManager.addConstant(), and a parity
test to verify constants match between RECON and the PeerTube API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 00:04:13 +00:00
b250d0c257 Fix Kiwix download URL generation in embedder
- Add /content/ prefix to wiki.echo6.co URLs (required by kiwix-serve)
- Stop stripping ZIM flavor/date suffix (e.g. _maxi_2025-11) from filename
- Use str.removesuffix instead of regex to strip only .zim extension

Before: https://wiki.echo6.co/appropedia_en_all/Article
After:  https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 00:06:52 +00:00
2635160887 Kiwix integration: ZIM processor, dashboard tab, wiki.echo6.co citations
- ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline
- Dashboard: Kiwix tab with library table, ingest toggle, upload, remove
- kiwix-serve on port 8430, wiki.echo6.co behind Authentik
- Citation URLs point to wiki.echo6.co/{zimname}/{article_path}
- Dashboard shows WIKI type badge for ZIM-sourced content
- Appropedia EN (19,445 articles) fully ingested as proof of concept
2026-04-17 07:00:24 +00:00
66fadb7487 Phase 3: dispatcher, transcript processor, text_dir resolution
- lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/
  for content+sidecar pairs and routes to registered processors
- lib/processors/transcript_processor.py: pre_flight() for transcripts
  (hash, dedupe, split into pages, register in DB, set text_dir)
- lib/utils.py: resolve_text_dir() helper for text_dir column fallback
- lib/enricher.py: use resolve_text_dir() instead of hardcoded path
- lib/embedder.py: use resolve_text_dir() instead of hardcoded path
- lib/processors/__init__.py, lib/acquisition/__init__.py: package inits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:39:42 +00:00
563c16bb71 Initial commit: RECON codebase baseline
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 14:57:23 +00:00