18 commits

Author SHA1 Message Date
e6224cb279 Migrate dashboard upload to pipeline with multi-format support
Upload handler now writes files to the appropriate hopper subfolder
instead of copying directly to /mnt/library/:
- .pdf -> acquired/pdf/
- .txt -> acquired/text/
- .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format
  normalizer converts to PDF before processing)
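
The routing rules above could be sketched as a lookup table. This is an illustration only: `HOPPER_SUBFOLDER` and `hopper_subfolder_for()` are hypothetical names, not taken from api_upload() or _process_upload().

```python
from pathlib import Path

# Hypothetical mapping that mirrors the extensions listed in this commit.
# Non-PDF book formats land in acquired/pdf/ because the dispatcher's
# format normalizer converts them to PDF before processing.
HOPPER_SUBFOLDER = {
    ".pdf": "acquired/pdf",
    ".txt": "acquired/text",
    ".epub": "acquired/pdf",
    ".doc": "acquired/pdf",
    ".docx": "acquired/pdf",
    ".mobi": "acquired/pdf",
}


def hopper_subfolder_for(filename: str) -> str:
    """Return the hopper subfolder for an uploaded file, keyed on its extension."""
    ext = Path(filename).suffix.lower()
    try:
        return HOPPER_SUBFOLDER[ext]
    except KeyError:
        raise ValueError(f"unsupported upload extension: {ext!r}") from None
```

Lower-casing the suffix is an assumption made here so that `Book.EPUB` and `book.epub` route the same way.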

The dispatcher picks up files and routes through the appropriate
processor (pdf_processor or text_processor) for full metadata
voting, domain classification, and canonical filing.

Changes to api_upload() / _process_upload():
- Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI
- Routes to correct hopper subfolder by extension
- Writes meta.json sidecar with original filename and category hint
- Removed: direct library copy, add_to_catalogue, queue_document
- Added: hopper-level dedup check (catches rapid re-uploads)
- Kept: catalogue dedup check for immediate user feedback

Changes to api_upload_status():
- Added fallback: checks acquired/ and processing/ dirs if hash
  not yet in documents table (covers gap between upload and
  dispatcher pickup)

Template updated: accept attribute and help text now reflect
multi-format support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +00:00
999cf37626 Fix: Gemini "null" string bug in pdf_processor metadata voting
Same fix as text_processor — Gemini sometimes returns the literal
string "null" instead of JSON null for empty metadata fields. The
voting logic and Gemini extraction now both treat "null" strings
as None.
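
A minimal sketch of the normalization described above, assuming the extracted metadata arrives as a flat dict. `normalize_null()` and `clean_metadata()` are hypothetical helper names, and the whitespace and case tolerance is an assumption beyond what the commit states.

```python
def normalize_null(value):
    """Map the literal string "null" to None; pass everything else through."""
    # Assumption: tolerate surrounding whitespace and case variants ("NULL").
    if isinstance(value, str) and value.strip().lower() == "null":
        return None
    return value


def clean_metadata(fields: dict) -> dict:
    """Apply normalize_null() to every value of a metadata dict."""
    return {key: normalize_null(value) for key, value in fields.items()}
```

Applying the same helper in both the extraction step and the voting step keeps a stray "null" from winning a vote as if it were a real title or author.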

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:30:59 +00:00
f4659d155f Phase 6f-2: format normalizer in dispatcher
Adds _normalize_formats() to the dispatcher that converts non-standard
document formats to PDF before dispatch. Supports:
- .epub, .mobi -> PDF via ebook-convert (Calibre)
- .doc, .docx -> PDF via LibreOffice headless

Called per-subfolder before _find_pairs() so _find_pairs() only ever
sees standard content files. Conversion failures are logged and
skipped -- the original file stays in acquired/ for manual review.
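
The conversion step could look roughly like the sketch below. It is not the body of _normalize_formats(): the function names, the `soffice` binary name, and the timeout are assumptions. The `ebook-convert INPUT OUTPUT` form and LibreOffice's `--headless --convert-to pdf --outdir` options are the standard command-line usage of those tools.

```python
import logging
import subprocess
from pathlib import Path

log = logging.getLogger("format_normalizer")  # hypothetical logger name


def build_convert_command(src: Path, out_dir: Path) -> list[str]:
    """Build the external command that converts src to a PDF in out_dir."""
    ext = src.suffix.lower()
    if ext in {".epub", ".mobi"}:
        # Calibre infers the output format from the target file's extension.
        return ["ebook-convert", str(src), str(out_dir / (src.stem + ".pdf"))]
    if ext in {".doc", ".docx"}:
        # Assumption: the LibreOffice binary is on PATH as "soffice".
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(src)]
    raise ValueError(f"no converter for {ext!r}")


def convert_to_pdf(src: Path, out_dir: Path) -> bool:
    """Run the converter; on failure, log and leave the original file alone."""
    try:
        cmd = build_convert_command(src, out_dir)
        subprocess.run(cmd, check=True, capture_output=True, timeout=600)
    except (ValueError, OSError, subprocess.SubprocessError) as exc:
        log.warning("conversion failed for %s: %s", src, exc)
        return False
    return True
```

Returning False instead of raising matches the log-and-skip behavior described above: the source file is never moved or deleted on failure.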

Also converts 3 staged epub files and cleans up _staging/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:08:19 +00:00
62539861f2 Phase 6f: text processor for .txt file ingestion
New processor: lib/processors/text_processor.py
Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight()
-> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from
  first 3 pages of text)
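
The commit does not spell out the voting rule, so the following is only one plausible sketch: per field, the most common non-empty value wins, ties go to the source listed first, and the agreeing sources are recorded as provenance. `vote()` and its argument shapes are hypothetical.

```python
from collections import Counter


def vote(sources: dict, fields: tuple) -> tuple:
    """Merge metadata field by field; return (merged, provenance)."""
    merged, provenance = {}, {}
    for field in fields:
        # Keep only sources that actually supplied a value for this field.
        candidates = [(name, meta.get(field)) for name, meta in sources.items()
                      if meta.get(field) is not None]
        if not candidates:
            merged[field] = None
            provenance[field] = []
            continue
        counts = Counter(value for _, value in candidates)
        best = max(counts.values())
        # Tie-break (assumption): first source in insertion order wins.
        winner = next(value for _, value in candidates if counts[value] == best)
        merged[field] = winner
        provenance[field] = [name for name, value in candidates if value == winner]
    return merged, provenance
```

With a filename source that supplies only a title and an LLM source that supplies title and author, the title is backed by both sources and the author by the LLM alone.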

Page splitting reuses chunk_text() from lib/web_scraper.py.
Filing behavior matches PDFs (files to library, not organized
in-place like transcripts).

Config: adds text: text_processor to pipeline.dispatch map.
New hopper subfolder: data/acquired/text/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 22:39:31 +00:00
7fe7d03583 Revert "Phase 6e: rewire dashboard PeerTube endpoint to acquisition module"
This reverts commit 7e42528d2f.
2026-04-15 03:20:46 +00:00
7e42528d2f Phase 6e: rewire dashboard PeerTube endpoint to acquisition module
Replace legacy ingest_channel/ingest_all imports with acquire_batch
from lib.acquisition.peertube. The endpoint now writes flat file pairs
to the hopper and lets the dispatcher handle processing, matching the
Phase 6d architecture. Removes channel/since/process parameters that
were tied to the old direct-ingest path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:15:41 +00:00
277110d999 Phase 6d: PeerTube acquisition module + service thread
New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop.
Polls PeerTube API every 30min, dedupes against catalogue (UUID + title),
writes flat file pairs to data/acquired/stream/ for the dispatcher.

- acquire_batch(): one-shot find-and-acquire with rate limiting
- acquisition_loop(): service thread wrapper (interval from config)
- list_new_videos(): dedup via _build_known_sets() against catalogue
- acquire_one(): fetch VTT, convert, write .tmp then rename atomically
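
The write-then-rename step in acquire_one() can be illustrated as follows. This is a generic sketch of the pattern, with `write_atomically()` as a hypothetical helper name; the real function may differ.

```python
import os
from pathlib import Path


def write_atomically(dest: Path, data: str) -> None:
    """Write data to dest so readers never observe a partially written file."""
    tmp = dest.with_name(dest.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    # os.replace() is atomic when tmp and dest are on the same filesystem.
    os.replace(tmp, dest)
```

Because the dispatcher scans the hopper concurrently, the rename ensures it only ever sees complete files, provided it ignores names ending in `.tmp`.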

cmd_service(): added peertube-acq daemon thread
cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/
  --since/--enrich/--process (dispatcher handles full pipeline)
config.yaml: added peertube.poll_interval: 1800

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:08:51 +00:00
efae4023f6 Phase 6c: remove vestigial extract worker, dead crawler, .bak files
recon.py:
- Remove extract stage_loop thread from cmd_service(). Confirmed
  vestigial: 0 queued items, silent logs over 24+ hour run. The new
  processors do extraction inline in pre_flight().
- Remove cmd_crawl CLI subcommand and its argparse registration.
- Clean up associated imports and variables.

Deleted:
- lib/crawler.py (432 lines) -- old web crawler subsystem, only
  referenced by the removed CLI subcommand.
- 24 .bak files (untracked pre-edit safety backups, originals
  preserved in git history).

Investigation found the four old loop function definitions
(scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop) were already deleted in Phase 5c-1.

Modules investigated and KEPT:
- lib/web_scraper.py -- exports chunk_text() used by transcript_processor
- lib/new_pipeline.py -- active Stream B library management CLI tool
- lib/peertube_scraper.py -- only mechanism for transcript ingestion
- lib/extractor.py -- would activate for new PDFs via cmd_run CLI

Service restart verified: 6 threads (dispatcher, enrich, embed,
filing, progress, dashboard), no extract worker, zero errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:46:00 +00:00
70b80cb312 Phase 6b: fix dashboard Untitled/WEB bug for transcripts
Two bugs in the Recently Completed table:

1. Title showed "Untitled" for all transcripts because the dashboard
   read documents.book_title (populated by PDF metadata voting) which
   is NULL for transcripts. Fixed by COALESCE(book_title, filename)
   in the SQL query -- falls back to catalogue.filename which holds
   the real video title.

2. Type showed "WEB" for all transcripts because the type CASE
   expression only had web and pdf branches, with web matching any
   http% path -- and transcript paths are PeerTube watch URLs.
   Fixed by adding a transcript branch keyed on catalogue.source =
   'stream.echo6.co', evaluated before the web branch.

Also adds badge-transcript CSS (purple) and JS rendering case.
Applied consistently to both the Recently Completed and Sources
table queries.
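
Both fixes can be reproduced against an in-memory SQLite database. The schema below is a reduced, hypothetical stand-in for the real documents and catalogue tables (only the columns named in this commit, joined on an assumed `hash` key), and the query is a sketch rather than the dashboard's actual SQL.

```python
import sqlite3

QUERY = """
SELECT COALESCE(d.book_title, c.filename) AS title,
       CASE
           -- The transcript branch must come first: transcript paths are
           -- also http URLs and would otherwise match the web branch.
           WHEN c.source = 'stream.echo6.co' THEN 'transcript'
           WHEN c.path LIKE 'http%' THEN 'web'
           ELSE 'pdf'
       END AS type
FROM documents d
JOIN catalogue c ON c.hash = d.hash
ORDER BY d.hash
"""


def demo_rows() -> list:
    """Build a tiny example database and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (hash TEXT PRIMARY KEY, book_title TEXT);
        CREATE TABLE catalogue (hash TEXT PRIMARY KEY, filename TEXT,
                                source TEXT, path TEXT);
    """)
    conn.executemany("INSERT INTO documents VALUES (?, ?)",
                     [("a", "Field Manual"), ("b", None), ("c", None)])
    conn.executemany("INSERT INTO catalogue VALUES (?, ?, ?, ?)", [
        ("a", "fm.pdf", "upload", "/library/fm.pdf"),
        ("b", "Soldering Basics", "stream.echo6.co",
         "https://stream.echo6.co/w/abc"),
        ("c", "Example Page", "web", "https://example.com/page"),
    ])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

In a CASE expression the first matching WHEN wins, which is why the order of the branches matters here.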

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:05:29 +00:00
df29d598d3 Phase 6a: transcripts mark organized in-place, skip filing
Transcripts are derived text from PeerTube videos, not primary source
files. They do not belong in library/Domain/Subdomain/ like PDFs.

Change: transcript_processor.pre_flight() now sets organized_at =
CURRENT_TIMESTAMP at the end of successful processing, marking the
transcript as organized in place. The watch URL remains in
catalogue.path and Qdrant download_url so users clicking search
results go to the PeerTube video.

The filing worker's path LIKE filter naturally excludes transcripts
since their documents.path is the watch URL, not a filesystem path.
No filing worker changes needed.

Back-fills 2,260 drain items from Phase 5c-2 via one-time SQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 22:49:21 +00:00
9fa60f9c86 Fix: stale cleanup in processors must fail loudly on permission errors
Phase 5c-2 failed because shutil.rmtree(ignore_errors=True) silently
failed to clean up root-owned legacy files in processing/{hash}/,
letting the processor proceed into a half-cleaned directory and then
crash on subsequent file writes.

Changes: removed ignore_errors=True, wrapped in try/except that logs
and re-raises, so the processor fails early and visibly if stale
cleanup fails.
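
A minimal sketch of the fail-loudly cleanup described above. `clean_stale_dir()` and the logger name are hypothetical; the processors' actual code may be structured differently.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("stale_cleanup")  # hypothetical logger name


def clean_stale_dir(path: Path) -> None:
    """Remove a stale processing directory, or fail visibly if that is impossible."""
    if not path.exists():
        return
    try:
        # No ignore_errors=True: a PermissionError must surface here rather
        # than leave a half-cleaned directory behind.
        shutil.rmtree(path)
    except OSError:
        log.exception("stale cleanup failed for %s", path)
        raise
```

PermissionError is a subclass of OSError, so root-owned leftovers are logged with a traceback and then stop the processor before it writes anything.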

Recovery from Phase 5c-2 failure.
2026-04-14 20:15:48 +00:00
d9aed35fd7 Phase 5c-1: dispatcher loop, filing worker loop, service rewire
Adds dispatch_loop() alongside dispatch_once() for service-thread use.
Adds filing_worker_loop() that watches for status=complete items in
/opt/recon/data/processing/ and files them to library/Domain/Subdomain/.

Rewires cmd_service() to run the new architecture:
- Removed: scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
  organizer_loop (all replaced by dispatcher + new filing worker)
- Kept: enrich and embed stage workers, progress, dashboard
- Kept (vestigial): extract stage worker — will be removed in Phase 6 cleanup
- Added: dispatcher loop thread, filing worker thread

Phase 5c-1 of the refactor. Service not yet started — Phase 5c-2 will do that.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 18:30:58 +00:00
96e1e642c4 Phase 4: PDF processor with layered metadata extraction
- Add lib/processors/pdf_processor.py with full pre_flight pipeline
- Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini)
- Field-by-field voting with provenance tracking (metadata_provenance column)
- Level-4 strict dedupe (title+author+edition+year)
- Content failures route to _review/rejected_pdfs/
- Level-4 duplicates route to _review/duplicate_quarantine/
- Full text extraction using existing extract_text_from_page fallback chain
- Schema: added metadata_provenance TEXT to documents table
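
The Level-4 dedupe compares title, author, edition, and year together. One way to build such a key is sketched below; the normalization (lower-casing, collapsing whitespace, treating missing values as empty strings) is an assumption, and `dedupe_key()` is a hypothetical name.

```python
DEDUPE_FIELDS = ("title", "author", "edition", "year")


def dedupe_key(meta: dict) -> tuple:
    """Return a normalized (title, author, edition, year) tuple for comparison."""
    def norm(value) -> str:
        if value is None:
            return ""
        # Lower-case and collapse runs of whitespace so trivial formatting
        # differences do not defeat the comparison.
        return " ".join(str(value).lower().split())

    return tuple(norm(meta.get(field)) for field in DEDUPE_FIELDS)
```

Two documents whose keys are equal would be candidates for _review/duplicate_quarantine/. A different edition or year yields a different key, so the match stays strict.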

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:57:44 +00:00
9fe6a0a782 Phase 4: Phase 3 cleanup fixes
Fix 1.1: filing preserves source file extension instead of defaulting to .pdf
Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78)
Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR
Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry
Also: dispatcher now handles solo content files without .meta.json sidecar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:39:57 +00:00
f69c04a0e3 Phase 3: fix page_count in transcript processor
Set page_count on documents row during pre_flight. Without this,
enricher comparison `page_count >= 3` fails with TypeError on NULL.
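
The failure is easy to reproduce in isolation: a NULL column arrives in Python as None, and ordering comparisons between None and an int raise TypeError. `needs_full_enrichment()` is a hypothetical stand-in for the enricher check.

```python
def needs_full_enrichment(page_count, minimum=3):
    """Stand-in for the enricher comparison quoted in the commit message."""
    return page_count >= minimum


try:
    needs_full_enrichment(None)  # page_count was NULL before this fix
except TypeError as exc:
    print(f"NULL page_count fails: {exc}")

# With page_count set during pre_flight, the comparison works as intended.
print(needs_full_enrichment(5))
```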

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:43:21 +00:00
66fadb7487 Phase 3: dispatcher, transcript processor, text_dir resolution
- lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/
  for content+sidecar pairs and routes to registered processors
- lib/processors/transcript_processor.py: pre_flight() for transcripts
  (hash, dedupe, split into pages, register in DB, set text_dir)
- lib/utils.py: resolve_text_dir() helper for text_dir column fallback
- lib/enricher.py: use resolve_text_dir() instead of hardcoded path
- lib/embedder.py: use resolve_text_dir() instead of hardcoded path
- lib/processors/__init__.py, lib/acquisition/__init__.py: package inits
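
The pair scan might be sketched as follows. The sidecar naming (`<stem>.meta.json` next to the content file) and the `find_pairs()` signature are assumptions made for illustration. Returning None for a missing sidecar reflects the solo-file handling added in a later commit.

```python
from pathlib import Path
from typing import Optional


def find_pairs(folder: Path) -> list:
    """Return (content_file, sidecar_or_None) tuples for one hopper subfolder."""
    pairs = []
    for item in sorted(folder.iterdir()):
        # Skip sidecars themselves and in-progress .tmp files.
        if not item.is_file() or item.name.endswith((".meta.json", ".tmp")):
            continue
        sidecar = item.with_name(item.stem + ".meta.json")
        found: Optional[Path] = sidecar if sidecar.exists() else None
        pairs.append((item, found))
    return pairs
```

Sorting the directory listing keeps dispatch order deterministic between runs.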

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:39:42 +00:00
de2c59a501 Phase 2: add shared filing function (lib/filing.py)
New reusable file_processed_item() that future processors will call to file
completed items from /opt/recon/data/processing/{hash}/ into the library.

Reuses existing organizer logic for domain classification and collision handling.
Not yet wired into the service loop — exists as library code for Phase 3+ to call.

Phase 2 of the refactor. See https://forge.echo6.co/matt/refactored-recon

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:03:36 +00:00
563c16bb71 Initial commit: RECON codebase baseline
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 14:57:23 +00:00