Phase 6f: text processor for .txt file ingestion

New processor: lib/processors/text_processor.py
Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight()
-> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from
  first 3 pages of text)
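
A minimal sketch of how the two sources might be combined, assuming the
filename supplies only a title and the LLM result arrives as a plain dict.
The names title_from_filename and vote_metadata, and the tie-break rule
(prefer the LLM title when present, otherwise fall back to the filename),
are illustrative assumptions and not the code in text_processor.py.

```python
import re
from pathlib import Path


def title_from_filename(path: str) -> str:
    """Source A: derive a title from the filename stem."""
    stem = Path(path).stem
    # Treat underscores and hyphens as word separators, collapse whitespace.
    return re.sub(r"[_\-\s]+", " ", stem).strip()


def _normalize(value: str) -> str:
    """Lowercase and drop non-alphanumerics so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", "", value.lower())


def vote_metadata(path: str, llm_fields: dict) -> dict:
    """Merge Source A (filename) with Source B (LLM output, passed in as a dict)."""
    file_title = title_from_filename(path)
    llm_title = (llm_fields.get("title") or "").strip()

    agreed = bool(llm_title) and _normalize(llm_title) == _normalize(file_title)
    # Assumed tie-break: prefer the LLM title when present, else the filename.
    title = llm_title or file_title

    return {
        "title": title,
        "title_sources_agree": agreed,
        # Only Source B can supply these fields, so they pass through unchanged.
        "author": llm_fields.get("author"),
        "edition": llm_fields.get("edition"),
        "year": llm_fields.get("year"),
    }
```

Recording whether the two titles agree, rather than only the winner, would
let a later stage flag low-confidence metadata for review.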

Page splitting reuses chunk_text() from lib/web_scraper.py.
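
For illustration only, the sketch below shows one way plain text could be
split into pages and the first 3 pages collected for metadata extraction.
split_pages and first_pages are hypothetical stand-ins; the signature and
behavior of chunk_text() are not shown in this commit, and the 3000
character page size is an assumed value.

```python
def split_pages(text: str, page_chars: int = 3000) -> list[str]:
    """Greedily pack paragraphs into pages of roughly page_chars characters.

    A single paragraph longer than page_chars is kept whole as its own page.
    """
    pages: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new page when adding this paragraph would overflow the current one.
        if current and size + len(para) > page_chars:
            pages.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        pages.append("\n\n".join(current))
    return pages


def first_pages(text: str, count: int = 3, page_chars: int = 3000) -> str:
    """Return the first `count` pages joined back into one string."""
    return "\n\n".join(split_pages(text, page_chars)[:count])
```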
Filing behavior matches PDFs: documents are filed into the library
rather than organized in place as transcripts are.

Config: adds "text: text_processor" to the pipeline.dispatch map.
New hopper subfolder: data/acquired/text/
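
As a rough sketch of how a hopper subfolder could select a processor, the
example below looks up the first directory under the acquired root in a
dispatch table. The DISPATCH dict mirrors the four entries visible in the
config diff; processor_for and its acquired_root default are hypothetical
names, and the real dispatcher may resolve processors differently.

```python
from pathlib import Path
from typing import Optional

# Mirrors the pipeline.dispatch entries visible in the config diff.
DISPATCH = {
    "pdf": "pdf_processor",
    "stream": "transcript_processor",
    "html": "html_processor",
    "text": "text_processor",
}


def processor_for(path: str, acquired_root: str = "data/acquired") -> Optional[str]:
    """Return the processor name for a file, keyed by its hopper subfolder."""
    try:
        relative = Path(path).relative_to(acquired_root)
    except ValueError:
        # The file is not under the acquired root at all.
        return None
    # A file directly in the root has no hopper subfolder to dispatch on.
    if len(relative.parts) < 2:
        return None
    return DISPATCH.get(relative.parts[0])
```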

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matt 2026-04-15 22:39:31 +00:00
commit 62539861f2
2 changed files with 321 additions and 0 deletions

@@ -437,5 +437,6 @@ pipeline:
pdf: pdf_processor
stream: transcript_processor
html: html_processor
+text: text_processor
# mtime stability threshold for picking up files from acquired/
mtime_stability_seconds: 10
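
A minimal sketch of what an mtime stability check could look like, assuming
a file counts as ready once its modification time is at least the threshold
in the past. The function name is_stable and the injectable now parameter
are illustrative assumptions, not the pipeline's actual code.

```python
import os
import time
from typing import Optional


def is_stable(path: str, threshold_seconds: float = 10.0,
              now: Optional[float] = None) -> bool:
    """Return True if the file has not been modified for threshold_seconds."""
    # Allow the caller to inject the current time, which simplifies testing.
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) >= threshold_seconds
```

Waiting for the mtime to settle is a common way to avoid picking up a file
that is still being copied into the hopper.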