Phase 6f: text processor for .txt file ingestion

New processor: lib/processors/text_processor.py

Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight() -> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from first 3 pages of text)

Page splitting reuses chunk_text() from lib/web_scraper.py.

Filing behavior matches PDFs (files to library, not organized in-place like transcripts).

Config: adds text: text_processor to pipeline.dispatch map.
New hopper subfolder: data/acquired/text/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
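The two-source vote on the title could look roughly like the sketch below. It is not the code in text_processor.py: the function names title_from_filename and vote_title are hypothetical, the LLM result is passed in as a plain string rather than fetched from Gemini, and the resolution rule (prefer the LLM title when present, fall back to the filename, report whether the two agree) is an assumption, since the commit message does not say how disagreements are settled.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def title_from_filename(path: str) -> str:
    """Source A (hypothetical helper): derive a candidate title from the file name."""
    stem = Path(path).stem
    # Treat underscores and hyphens as word separators, then collapse whitespace.
    spaced = re.sub(r"[_\-]+", " ", stem)
    return re.sub(r"\s+", " ", spaced).strip()


def _normalize(value: str) -> str:
    """Lowercase and drop punctuation so the two sources can be compared loosely."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def vote_title(filename_title: str, llm_title: Optional[str]) -> Tuple[str, bool]:
    """Return (chosen_title, sources_agree).

    Assumed rule, not taken from the commit: prefer the LLM title when one
    was extracted, otherwise fall back to the filename-derived title.
    """
    if llm_title is None or not llm_title.strip():
        return filename_title, False
    agree = _normalize(filename_title) == _normalize(llm_title)
    return llm_title.strip(), agree
```

For example, vote_title(title_from_filename("data/acquired/text/the_art-of_war.txt"), "The Art of War") picks the LLM spelling and reports that both sources agree. A caller could use the agreement flag to decide whether the metadata needs review.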
2026-05-20 06:34:40 +02:00 · 2026-04-15 22:39:31 +00:00 · 2026-04-15 22:39:31 +00:00 · 62539861f2
commit 62539861f2
parent 7fe7d03583
2 changed files with 321 additions and 0 deletions
--- a/config.yaml
+++ b/config.yaml
@@ -437,5 +437,6 @@ pipeline:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor
+    text: text_processor
  # mtime stability threshold for picking up files from acquired/
  mtime_stability_seconds: 10
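As a rough illustration of how a dispatcher might use this map, the sketch below resolves a processor name from the hopper subfolder of an acquired file. It assumes the dispatch keys correspond to subfolders of acquired/ (suggested by the new data/acquired/text/ folder, but not confirmed by this diff). The name processor_for is hypothetical, and the dictionary mirrors only the entries visible in the hunk above.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Mirrors only the pipeline.dispatch entries visible in the hunk; the real
# config may contain more entries above line 437.
DISPATCH: Dict[str, str] = {
    "pdf": "pdf_processor",
    "stream": "transcript_processor",
    "html": "html_processor",
    "text": "text_processor",
}


def processor_for(acquired_path: str,
                  dispatch: Dict[str, str] = DISPATCH) -> Optional[str]:
    """Hypothetical lookup: map acquired/<subfolder>/<file> to a processor name.

    Returns None when the path is not inside a known hopper subfolder.
    """
    parts = PurePosixPath(acquired_path).parts
    if "acquired" not in parts:
        return None
    idx = parts.index("acquired")
    # Need at least acquired/<subfolder>/<file> for a subfolder to exist.
    if idx + 2 >= len(parts):
        return None
    return dispatch.get(parts[idx + 1])
```

With the new entry in place, processor_for("data/acquired/text/notes.txt") resolves to "text_processor". Without it, the same call would return None and the file would stay unclaimed.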