mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-05-20 06:34:40 +02:00
New processor: lib/processors/text_processor.py Handles plain text files (.txt) as primary source documents. Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight() -> enrich -> embed -> filing worker -> library/Domain/Subdomain/ Metadata extraction via two-source vote: - Source A: filename parsing (title from filename) - Source B: Gemini LLM extraction (title/author/edition/year from first 3 pages of text) Page splitting reuses chunk_text() from lib/web_scraper.py. Filing behavior matches PDFs (files to library, not organized in-place like transcripts). Config: adds text: text_processor to pipeline.dispatch map. New hopper subfolder: data/acquired/text/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| acquisition | ||
| processors | ||
| __init__.py | ||
| api.py | ||
| dispatcher.py | ||
| embedder.py | ||
| enricher.py | ||
| extractor.py | ||
| filing.py | ||
| ingester.py | ||
| key_manager.py | ||
| new_pipeline.py | ||
| organizer.py | ||
| peertube_collector.py | ||
| peertube_scraper.py | ||
| status.py | ||
| utils.py | ||
| web_scraper.py | ||