refactored-recon/phases/phase-3-transcript-processor.md
Matt 0747cb761f Phase 3: transcript processor end-to-end test doc
Documents dispatcher, transcript processor, text_dir resolution,
and full pipeline test results (172f39ae → skip_unclassified).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:44:25 +00:00

Phase 3: First End-to-End Test (Transcript Processor)

Executed: 2026-04-14T15:40Z


Backup

| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (pre-Phase 3) | CT 130: /tmp/recon.db.phase2.20260414.bak | 20ec1fec2247a999e7d42f6a716481b0 |
| Test row SQL backup | CT 130: /tmp/recon_phase3_testrow_172f39ae.sql | |

What Was Created

1. lib/dispatcher.py: dispatch_once() one-shot dispatcher

Scans each configured acquired/<subfolder>/ for stable content+sidecar pairs (.txt + .meta.json sharing a basename). When a pair has been stable for pipeline.mtime_stability_seconds, hands it to the registered processor's pre_flight().

  • Dynamically imports processor modules via importlib
  • Unknown/missing processors log an error and skip (don't crash)
  • Returns list of result dicts from all processor calls
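
The scan and the importlib loading described above can be sketched roughly as follows. This is an illustration, not the code in lib/dispatcher.py: the function names find_stable_pairs and load_processor are hypothetical, and passing the processor as a module name is an assumption. Only the pairing rule (.txt + .meta.json sharing a basename), the mtime stability window, and the log-and-skip behaviour come from the description.

```python
import importlib
import logging
import os
import time

log = logging.getLogger("dispatcher_sketch")


def find_stable_pairs(folder, stability_seconds, now=None):
    """Return (content, sidecar) path pairs that have been unmodified
    for at least `stability_seconds`. Hypothetical helper."""
    now = time.time() if now is None else now
    pairs = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        content = os.path.join(folder, name)
        sidecar = os.path.join(folder, name[:-len(".txt")] + ".meta.json")
        if not os.path.isfile(sidecar):
            continue  # content without a sidecar is not a pair yet
        # A pair is only as stable as its most recently written file.
        newest = max(os.path.getmtime(content), os.path.getmtime(sidecar))
        if now - newest >= stability_seconds:
            pairs.append((content, sidecar))
    return pairs


def load_processor(module_name):
    """Import a processor module by name; log and return None if it is
    missing, so the caller can skip instead of crashing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        log.error("processor module %r not found, skipping", module_name)
        return None
```

Taking the newest mtime of the two files means a pair is held back while either half is still being written.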

2. lib/processors/transcript_processor.py: pre_flight()

Handles transcript content from acquired/stream/:

  1. Reads content file, computes MD5 hash
  2. Deduplicates against catalogue by hash
  3. Reads .meta.json sidecar
  4. Moves pair to processing/{hash}/ (renames content to transcript.txt)
  5. Splits raw text into page_NNNN.txt files using chunk_text() (2000 words/page)
  6. Registers in catalogue + documents tables
  7. Sets documents.text_dir to the processing directory
  8. Sets documents.page_count from page split
  9. Advances status to extracted
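
Steps 1 and 5 (hashing and page splitting) can be sketched as below. This is a stand-in, not the project's code: the real chunk_text() signature is not shown in this document, so the version here, along with md5_of_file and write_pages, is hypothetical, and splitting on whitespace is an assumption. The 2000 words/page limit and the page_NNNN.txt naming come from the list above.

```python
import hashlib
import os


def md5_of_file(path, block_size=1 << 20):
    """Hash a file in blocks so large transcripts are not read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def chunk_text(text, words_per_page=2000):
    """Split text into pages of at most `words_per_page` words.
    Assumed behaviour: words are whitespace-separated tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def write_pages(text, out_dir, words_per_page=2000):
    """Write page_0001.txt, page_0002.txt, ... and return the page count."""
    os.makedirs(out_dir, exist_ok=True)
    pages = chunk_text(text, words_per_page)
    for number, page in enumerate(pages, start=1):
        path = os.path.join(out_dir, f"page_{number:04d}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(page)
    return len(pages)
```

The value returned by write_pages is the kind of count that step 8 stores in documents.page_count.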

TODO (flagged for later phases): Level-4 name deduplication not implemented for transcripts.

3. lib/utils.py: resolve_text_dir()

Helper function that resolves the text directory for a document:

  • If documents.text_dir is set, use that
  • Otherwise fall back to legacy config['paths']['text']/{hash}/
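
A minimal sketch of that fallback, using the resolve_text_dir(file_hash, config, db) argument order seen in the enricher and embedder call sites. The database interface is an assumption: here db is a plain sqlite3 connection and the documents table is keyed by a column named hash, neither of which is confirmed by this document.

```python
import os
import sqlite3


def resolve_text_dir(file_hash, config, db):
    """Return documents.text_dir if set, else the legacy text path."""
    # Assumed schema: documents(hash, text_dir); db is a sqlite3 connection.
    row = db.execute(
        "SELECT text_dir FROM documents WHERE hash = ?", (file_hash,)
    ).fetchone()
    if row is not None and row[0]:
        return row[0]
    # NULL, empty, or no row at all: fall back to the legacy layout.
    return os.path.join(config["paths"]["text"], file_hash)
```

Because a NULL text_dir and a missing row both fall through to the legacy path, documents created before this phase resolve exactly as they did before.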

4. Package scaffolding

  • lib/processors/__init__.py — empty
  • lib/acquisition/__init__.py — empty

What Changed in Existing Code

lib/enricher.py (lines 28 and 349)

  • Added: from .utils import resolve_text_dir
  • Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)

lib/embedder.py (lines 24 and 278)

  • Added: from .utils import resolve_text_dir
  • Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)

Both changes are minimal and additive — the resolve_text_dir() function falls back to the legacy path when text_dir is NULL, so existing documents are unaffected.


End-to-End Test

Test transcript

| Field | Value |
|-------|-------|
| Hash | 172f39ae7fc6f5b02e0fabcea450c0e4 |
| Title | Welcome to YouTube Memberships! |
| Channel | stefan-sobkowiak-miracle-farms |
| Source | stream.echo6.co |
| Duration | 331 seconds |
| Pages | 1 (3,747 bytes) |

Test workflow

Copy-and-unprocess approach: Concatenated the existing page_0001.txt back to raw text, staged in acquired/stream/ with meta.json sidecar, deleted existing catalogue+documents rows (backup in /tmp/).

Pipeline execution (all manual one-shot calls):

| Step | Command | Result |
|------|---------|--------|
| 1. Dispatch | dispatch_once() | action: extracted (pair found, stable, routed to transcript_processor) |
| 2. Enrich | enrich_single(hash, db, config, rotator) | True (text_dir resolved correctly, Gemini returned 0 concepts) |
| 3. Embed | embed_single(hash, db, config) | True (0 concepts → status=complete, 0 vectors) |
| 4. File | file_processed_item(hash, source_path, db, config) | action: skip_unclassified (no domain from empty concepts) |

Path traversal

acquired/stream/172f39ae...txt + .meta.json
  → processing/172f39ae.../transcript.txt + meta.json + page_0001.txt
  → (enriched, 0 concepts → skip_unclassified, not filed to library)

What skip_unclassified means

The transcript is a YouTube Memberships announcement with no extractable knowledge concepts. Gemini correctly returned 0 concepts, and the filing function correctly refused to classify it. This is the expected behavior for low-content transcripts.


Bug Found and Fixed During Testing

page_count NULL in documents table: The queue_document() method copies filename/path/size from catalogue but doesn't set page_count. The enricher then fails at line 451 on the comparison doc.get('page_count', 0) >= 3, because .get() returns None, not the default 0, when the key exists with a NULL value, and comparing None with an integer raises a TypeError in Python 3.

Fix: Transcript processor now sets page_count alongside text_dir in the documents UPDATE. Commit f69c04a.
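
The failure mode can be reproduced in isolation. The snippet below is an illustration only: the doc dict stands in for a documents row, and the defensive read at the end is an alternative guard, not the fix that was committed.

```python
doc = {"page_count": None}  # key present, value NULL in the database

# The default is only used when the key is missing, so this is None, not 0.
assert doc.get("page_count", 0) is None

# Comparing None with an int raises TypeError in Python 3.
error = None
try:
    doc.get("page_count", 0) >= 3
except TypeError as exc:
    error = str(exc)

# A defensive read that treats NULL like a missing value.
page_count = doc.get("page_count") or 0
is_long = page_count >= 3
```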


Verification

| Check | Result |
|-------|--------|
| catalogue rows | 29,812 (removed 1 old, added 1 new) |
| documents rows | 29,812 (same) |
| recon.service | inactive |
| recon-watchdog.service | inactive |
| processing/ dir | 172f39ae... with transcript.txt, meta.json, page_0001.txt |
| acquired/stream/ | empty (pair consumed) |
| text_dir set | /opt/recon/data/processing/172f39ae7fc6f5b02e0fabcea450c0e4 |
| Import tests | all 6 modules pass (dispatcher, transcript_processor, filing, enricher, embedder, resolve_text_dir) |

Commits

matt/recon (refactor branch):

| Hash | Message |
|------|---------|
| 66fadb7 | Phase 3: dispatcher, transcript processor, text_dir resolution |
| f69c04a | Phase 3: fix page_count in transcript processor |

Original text directory preserved

The original /opt/recon/data/text/172f39ae7fc6f5b02e0fabcea450c0e4/ directory was NOT deleted (copy-not-destroy approach). The original concept directory was removed during testing to allow re-enrichment, and the new concept output is in data/concepts/172f39ae.../window_0001.json (empty concepts, as expected).