- YAML-backed saved locations (config/address_book.yaml)
- Exact/partial alias matching with case-insensitive lookup
- Flask blueprint: /api/address_book/lookup, /api/address_book/list
- Geocoder short-circuits Photon when address book has exact match
- Test suite for lookup behavior
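
Illustrative sketch of the lookup order described above, not the actual module. The in-memory ADDRESS_BOOK dict stands in for config/address_book.yaml, and the names lookup(), geocode(), and remote_geocoder are hypothetical. The rule that a partial match is only accepted when it is unambiguous is an assumption, not something this commit states.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the contents of config/address_book.yaml.
ADDRESS_BOOK = {
    "Home": {"lat": 43.615, "lon": -116.202},
    "Mom's House": {"lat": 47.606, "lon": -122.332},
}


def lookup(query: str, book: dict = ADDRESS_BOOK) -> Optional[dict]:
    """Return the saved location for an alias, or None if nothing matches."""
    needle = query.strip().casefold()
    if not needle:
        return None

    # 1. Exact alias match, compared case-insensitively.
    for alias, location in book.items():
        if alias.casefold() == needle:
            return {"alias": alias, "match": "exact", **location}

    # 2. Partial match: the query is a substring of exactly one alias.
    #    (Assumption: ambiguous partial matches are rejected.)
    hits = [alias for alias in book if needle in alias.casefold()]
    if len(hits) == 1:
        return {"alias": hits[0], "match": "partial", **book[hits[0]]}
    return None


def geocode(query: str, remote_geocoder: Callable[[str], dict]) -> dict:
    """Skip the remote geocoder only when the address book has an exact match."""
    hit = lookup(query)
    if hit is not None and hit["match"] == "exact":
        return hit
    return remote_geocoder(query)
```

Here remote_geocoder would be whatever callable wraps the Photon request; passing it in keeps the short-circuit easy to test without network access.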
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract shared _full_zim_cleanup(source_id) from api_kiwix_remove
- Add SIGHUP to kiwix-serve after kiwix-manage remove
- Delete linked scrape_jobs rows during ZIM removal
- Update api_scraper_delete to do full ZIM cleanup when applicable
- Set chromium_path for single-file browser crawl support
- Add status.db to .gitignore
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New API endpoints: DELETE single job, clear all failed/cancelled.
Dashboard now shows Delete buttons on completed/failed jobs,
Retry+Delete on failed jobs, and a Clear Failed bulk action.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New /kiwix/scraper page with submit form (URL, title, language,
crawl mode), stats cards, and auto-refreshing jobs table with
cancel/retry actions. Kiwix section now has Library/Scraper subnav.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Status was showing COMPLETE after ZIM extraction finished, even when
documents were still queued for enrichment/embedding. Now computes
effective_status by checking actual pipeline state per-source:
- DETECTED: ingest not enabled (gray)
- EXTRACTING: ZIM processor running (blue)
- PROCESSING: extracted but docs still in enricher/embedder queue (amber)
- COMPLETE: all docs fully enriched and embedded in Qdrant (green)
Also fixed _build_kiwix_sources pipeline query to filter by category
per-source instead of returning global kiwix stats for every source.
Progress column now shows "X / Y in Qdrant" when processing, or
"X / Y extracted" otherwise.
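
Rough sketch of the status derivation described above. The function names and the four inputs (ingest_enabled, extraction_done, extracted, in_qdrant) are hypothetical; the real code derives them from per-source pipeline queries. Which counts feed X and Y in the progress label is also an assumption.

```python
# Badge colours as listed in this commit message.
STATUS_COLOURS = {
    "DETECTED": "gray",
    "EXTRACTING": "blue",
    "PROCESSING": "amber",
    "COMPLETE": "green",
}


def effective_status(ingest_enabled: bool, extraction_done: bool,
                     extracted: int, in_qdrant: int) -> str:
    """Derive the displayed status from per-source pipeline state."""
    if not ingest_enabled:
        return "DETECTED"
    if not extraction_done:
        return "EXTRACTING"
    # Extraction finished, but some docs are still in the enricher/embedder queue.
    if in_qdrant < extracted:
        return "PROCESSING"
    return "COMPLETE"


def progress_label(status: str, extracted: int, total: int, in_qdrant: int) -> str:
    """Format the Progress column (assumed meaning of X and Y)."""
    if status == "PROCESSING":
        return f"{in_qdrant} / {extracted} in Qdrant"
    return f"{extracted} / {total} extracted"
```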
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Install langdetect package for content-level language detection
- Add _check_language() to enricher.py: reads the first 1500 chars of the
first page, detects the language via langdetect, and skips the document if
the detected language is not in the allowed list
- Configurable via config.yaml pipeline.language_filter and
pipeline.allowed_languages (default: en only)
- Catches non-English content from ANY source (PDF, web, ZIM, PeerTube)
before burning Gemini API quota on enrichment
- Add scan_zims retry logic (3 attempts, 2s delay) for upload handler
- Purged 6,483 stale non-English zim_articles rows from DB
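
Minimal sketch of the language gate, assuming the detector is passed in as a callable so the example runs without third-party packages. check_language() and its parameters are illustrative names, not the enricher.py signature. Letting a document through when detection fails or the sample is empty is an assumption.

```python
from typing import Callable, Iterable

SAMPLE_CHARS = 1500  # sample size named in this commit


def check_language(first_page_text: str,
                   detect: Callable[[str], str],
                   allowed: Iterable[str] = ("en",),
                   enabled: bool = True) -> bool:
    """Return True if the document should continue to enrichment."""
    if not enabled:
        return True  # filter switched off (pipeline.language_filter)

    sample = first_page_text[:SAMPLE_CHARS].strip()
    if not sample:
        return True  # assumption: nothing to detect, so do not block

    try:
        language = detect(sample)
    except Exception:
        return True  # assumption: fail open when the detector cannot decide

    return language in set(allowed)
```

With langdetect installed, its detect function (from langdetect import detect) could be passed as the detect argument; it takes a string and returns a language code such as "en".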
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline
- Dashboard: Kiwix tab with library table, ingest toggle, upload, remove
- kiwix-serve on port 8430, wiki.echo6.co behind Authentik
- Citation URLs point to wiki.echo6.co/{zimname}/{article_path}
- Dashboard shows WIKI type badge for ZIM-sourced content
- Appropedia EN (19,445 articles) fully ingested as proof of concept
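
Sketch of how a citation URL in the shape above could be assembled. The helper name citation_url(), the https scheme, and the percent-encoding of the article path are assumptions; only the host and the {zimname}/{article_path} layout come from this commit.

```python
from urllib.parse import quote

WIKI_HOST = "wiki.echo6.co"


def citation_url(zim_name: str, article_path: str, scheme: str = "https") -> str:
    """Build a link to an article served by kiwix-serve behind the wiki host."""
    # Keep "/" in the article path, but escape spaces and other unsafe characters.
    path = quote(article_path.lstrip("/"), safe="/")
    return f"{scheme}://{WIKI_HOST}/{quote(zim_name, safe='')}/{path}"
```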
Upload handler now writes files to the appropriate hopper subfolder
instead of copying directly to /mnt/library/:
- .pdf -> acquired/pdf/
- .txt -> acquired/text/
- .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format
normalizer converts to PDF before processing)
The dispatcher picks up files and routes through the appropriate
processor (pdf_processor or text_processor) for full metadata
voting, domain classification, and canonical filing.
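
Illustrative sketch of the extension-to-subfolder routing listed above. hopper_target() and the hopper_root argument are hypothetical names; the mapping itself is taken from this commit message. Stripping directory components from the uploaded filename is an added assumption.

```python
from pathlib import Path

# Extension -> hopper subfolder, as listed in this commit.
HOPPER_SUBDIR = {
    ".pdf": "acquired/pdf",
    ".txt": "acquired/text",
    # Non-PDF book formats also land in acquired/pdf/; the dispatcher's
    # format normalizer converts them to PDF before processing.
    ".epub": "acquired/pdf",
    ".doc": "acquired/pdf",
    ".docx": "acquired/pdf",
    ".mobi": "acquired/pdf",
}


def hopper_target(filename: str, hopper_root: Path) -> Path:
    """Return the path an uploaded file should be written to."""
    name = Path(filename).name  # drop any directory components
    ext = Path(name).suffix.lower()
    if ext not in HOPPER_SUBDIR:
        raise ValueError(f"unsupported upload type: {ext or '(none)'}")
    return hopper_root / HOPPER_SUBDIR[ext] / name
```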
Changes to api_upload() / _process_upload():
- Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI
- Routes to correct hopper subfolder by extension
- Writes meta.json sidecar with original filename and category hint
- Removed: direct library copy, add_to_catalogue, queue_document
- Added: hopper-level dedup check (catches rapid re-uploads)
- Kept: catalogue dedup check for immediate user feedback
Changes to api_upload_status():
- Added fallback: checks acquired/ and processing/ dirs if hash
not yet in documents table (covers gap between upload and
dispatcher pickup)
Template updated: accept attribute and help text now reflect
multi-format support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace legacy ingest_channel/ingest_all imports with acquire_batch
from lib.acquisition.peertube. The endpoint now writes flat file pairs
to the hopper and lets the dispatcher handle processing, matching the
Phase 6d architecture. Removes channel/since/process parameters that
were tied to the old direct-ingest path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs in the Recently Completed table:
1. Title showed "Untitled" for all transcripts because the dashboard
read documents.book_title (populated by PDF metadata voting) which
is NULL for transcripts. Fixed by COALESCE(book_title, filename)
in the SQL query -- falls back to catalogue.filename which holds
the real video title.
2. Type showed "WEB" for all transcripts because the type CASE
expression only had web and pdf branches, with web matching any
http% path -- and transcript paths are PeerTube watch URLs.
Fixed by adding a transcript branch keyed on catalogue.source =
'stream.echo6.co', evaluated before the web branch.
Also adds badge-transcript CSS (purple) and JS rendering case.
Applied consistently to both the Recently Completed and Sources
table queries.
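
Self-contained sketch of the two fixes using an in-memory SQLite database. The table layout (a hash join key, documents.path) and the sample rows are hypothetical; only COALESCE(book_title, filename), the catalogue.source check, and the branch order come from this commit.

```python
import sqlite3

QUERY = """
SELECT COALESCE(d.book_title, c.filename) AS title,
       CASE
           -- Transcript branch must come first: PeerTube watch URLs
           -- would otherwise match the http% web branch.
           WHEN c.source = 'stream.echo6.co' THEN 'transcript'
           WHEN d.path LIKE 'http%'          THEN 'web'
           ELSE 'pdf'
       END AS type
FROM documents d
JOIN catalogue c ON c.hash = d.hash
ORDER BY d.hash
"""


def recently_completed(conn: sqlite3.Connection) -> list:
    """Return (title, type) rows using the corrected expressions."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny hypothetical dataset with one row per document type."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (hash TEXT, book_title TEXT, path TEXT);
        CREATE TABLE catalogue (hash TEXT, filename TEXT, source TEXT);
    """)
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", [
        ("a", "Field Manual", "/mnt/library/fm.pdf"),
        ("b", None, "https://stream.echo6.co/w/abc"),  # transcript: no book_title
        ("c", "Some Article", "https://example.org/post"),
    ])
    conn.executemany("INSERT INTO catalogue VALUES (?, ?, ?)", [
        ("a", "fm.pdf", "upload"),
        ("b", "How to Purify Water", "stream.echo6.co"),
        ("c", "post.html", "crawler"),
    ])
    return conn
```

Swapping the first two WHEN branches reproduces the original bug: row "b" would be typed as web.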
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>