Documents dispatcher, transcript processor, text_dir resolution, and full pipeline test results (172f39ae → skip_unclassified). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3: First End-to-End Test (Transcript Processor)
Executed: 2026-04-14T15:40Z
Backup
| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (pre-Phase 3) | CT 130: /tmp/recon.db.phase2.20260414.bak | 20ec1fec2247a999e7d42f6a716481b0 |
| Test row SQL backup | CT 130: /tmp/recon_phase3_testrow_172f39ae.sql | n/a |
What Was Created
1. lib/dispatcher.py — dispatch_once() one-shot dispatcher
Scans each configured acquired/<subfolder>/ for stable content+sidecar pairs (.txt + .meta.json sharing a basename). When a pair has been stable for pipeline.mtime_stability_seconds, hands it to the registered processor's pre_flight().
- Dynamically imports processor modules via importlib
- Unknown/missing processors log an error and skip (don't crash)
- Returns list of result dicts from all processor calls
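The scan-and-route behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/dispatcher.py: the helper names find_stable_pairs and load_pre_flight are hypothetical, and the sketch assumes that "stable" means both files in a pair have an mtime at least pipeline.mtime_stability_seconds in the past.

```python
import importlib
import logging
import os
import time

log = logging.getLogger("dispatcher_sketch")


def find_stable_pairs(folder, stability_seconds, now=None):
    """Return (content, sidecar) paths whose files are both older than the window.

    Hypothetical helper: a pair is <basename>.txt plus <basename>.meta.json.
    """
    now = time.time() if now is None else now
    pairs = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        content = os.path.join(folder, name)
        sidecar = os.path.join(folder, name[: -len(".txt")] + ".meta.json")
        if not os.path.isfile(sidecar):
            continue  # content without a sidecar is not a pair yet
        newest = max(os.path.getmtime(content), os.path.getmtime(sidecar))
        if now - newest >= stability_seconds:
            pairs.append((content, sidecar))
    return pairs


def load_pre_flight(module_name):
    """Import a processor module by name; return its pre_flight, or None on failure."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Unknown or missing processor: log and skip instead of crashing.
        log.error("processor module %r could not be imported; skipping", module_name)
        return None
    pre_flight = getattr(module, "pre_flight", None)
    if not callable(pre_flight):
        log.error("processor module %r has no pre_flight(); skipping", module_name)
        return None
    return pre_flight
```

A dispatcher built on these two helpers would call the returned pre_flight for each stable pair and collect the result dicts; how the real dispatch_once() passes the pair to the processor is not shown in this report.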
2. lib/processors/transcript_processor.py — pre_flight()
Handles transcript content from acquired/stream/:
- Reads content file, computes MD5 hash
- Deduplicates against catalogue by hash
- Reads .meta.json sidecar
- Moves pair to processing/{hash}/ (renames content to transcript.txt)
- Splits raw text into page_NNNN.txt files using chunk_text() (2000 words/page)
- Registers in catalogue + documents tables
- Sets documents.text_dir to the processing directory
- Sets documents.page_count from page split
- Advances status to extracted
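The hashing and page-split steps in the list above might look like the sketch below. The function names are hypothetical stand-ins, and split_into_pages is only an approximation of chunk_text(): it assumes pages are cut on whitespace at a fixed word count, which may not match how the real helper treats line breaks.

```python
import hashlib
import os


def md5_of_file(path):
    """Hex MD5 of the file's bytes, read in blocks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def split_into_pages(text, words_per_page=2000):
    """Split text on whitespace into pages of at most words_per_page words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def write_pages(pages, out_dir):
    """Write pages as page_0001.txt, page_0002.txt, ... and return the page count."""
    os.makedirs(out_dir, exist_ok=True)
    for number, page in enumerate(pages, start=1):
        page_path = os.path.join(out_dir, f"page_{number:04d}.txt")
        with open(page_path, "w", encoding="utf-8") as fh:
            fh.write(page)
    return len(pages)
```

The value returned by write_pages is the kind of number that would be stored in documents.page_count.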
TODO (flagged for later phases): Level-4 name deduplication not implemented for transcripts.
3. lib/utils.py — resolve_text_dir()
Helper function that resolves the text directory for a document:
- If documents.text_dir is set, use that
- Otherwise fall back to legacy config['paths']['text']/{hash}/
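The fallback rule can be illustrated with a short sketch. It is not the body of resolve_text_dir() in lib/utils.py: it assumes the db argument is a plain sqlite3 connection and that the documents table is keyed by a column named hash, neither of which this report confirms.

```python
import os
import sqlite3


def resolve_text_dir_sketch(file_hash: str, config: dict, conn: sqlite3.Connection) -> str:
    """Prefer documents.text_dir; fall back to the legacy text path when it is NULL or absent."""
    # Assumed schema: documents(hash, text_dir, ...).
    row = conn.execute(
        "SELECT text_dir FROM documents WHERE hash = ?", (file_hash,)
    ).fetchone()
    if row is not None and row[0]:
        return row[0]
    # Legacy layout: one directory per hash under config['paths']['text'].
    return os.path.join(config["paths"]["text"], file_hash)
```

Because a NULL text_dir and a missing row both take the legacy branch, documents ingested before this phase resolve to the same directory as before.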
4. Package scaffolding
- lib/processors/__init__.py: empty
- lib/acquisition/__init__.py: empty
What Changed in Existing Code
lib/enricher.py (lines 28 and 349)
- Added: from .utils import resolve_text_dir
- Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)
lib/embedder.py (lines 24 and 278)
- Added: from .utils import resolve_text_dir
- Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)
Both changes are minimal and additive — the resolve_text_dir() function falls back to the legacy path when text_dir is NULL, so existing documents are unaffected.
End-to-End Test
Test transcript
| Field | Value |
|---|---|
| Hash | 172f39ae7fc6f5b02e0fabcea450c0e4 |
| Title | Welcome to YouTube Memberships! |
| Channel | stefan-sobkowiak-miracle-farms |
| Source | stream.echo6.co |
| Duration | 331 seconds |
| Pages | 1 (3,747 bytes) |
Test workflow
Copy-and-unprocess approach: Concatenated the existing page_0001.txt back to raw text, staged in acquired/stream/ with meta.json sidecar, deleted existing catalogue+documents rows (backup in /tmp/).
Pipeline execution (all manual one-shot calls):
| Step | Command | Result |
|---|---|---|
| 1. Dispatch | dispatch_once() | action: extracted (pair found, stable, routed to transcript_processor) |
| 2. Enrich | enrich_single(hash, db, config, rotator) | True (text_dir resolved correctly, Gemini returned 0 concepts) |
| 3. Embed | embed_single(hash, db, config) | True (0 concepts → status=complete, 0 vectors) |
| 4. File | file_processed_item(hash, source_path, db, config) | action: skip_unclassified (no domain from empty concepts) |
Path traversal
acquired/stream/172f39ae...txt + .meta.json
→ processing/172f39ae.../transcript.txt + meta.json + page_0001.txt
→ (enriched, 0 concepts → skip_unclassified, not filed to library)
What skip_unclassified means
The transcript is a YouTube Memberships announcement with no extractable knowledge concepts. Gemini correctly returned 0 concepts, and the filing function correctly refused to classify it. This is the expected behavior for low-content transcripts.
Bug Found and Fixed During Testing
page_count NULL in documents table: The queue_document() method copies filename/path/size from catalogue but doesn't set page_count. The enricher then fails at line 451 on doc.get('page_count', 0) >= 3, because .get() applies its default only when the key is missing. Here the key exists with a NULL value, so .get() returns None rather than 0.
Fix: Transcript processor now sets page_count alongside text_dir in the documents UPDATE. Commit f69c04a.
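The pitfall is easy to reproduce in isolation. The snippet below is a standalone illustration using a plain dict in place of the documents row. The `or 0` guard at the end is a hypothetical defensive alternative, not what commit f69c04a does (that commit sets page_count in the transcript processor).

```python
doc = {"page_count": None}  # key exists, value is NULL in the database

# The default only applies when the key is missing, so this is None, not 0.
value = doc.get("page_count", 0)

try:
    value >= 3
except TypeError as exc:
    error = str(exc)  # Python 3 refuses to order None against an int

# Hypothetical guard: coerce a NULL to 0 before comparing.
safe_count = doc.get("page_count") or 0
is_multi_page = safe_count >= 3
```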
Verification
| Check | Result |
|---|---|
| catalogue rows | 29,812 (removed 1 old, added 1 new) |
| documents rows | 29,812 (same) |
| recon.service | inactive |
| recon-watchdog.service | inactive |
| processing/ dir | 172f39ae... with transcript.txt, meta.json, page_0001.txt |
| acquired/stream/ | empty (pair consumed) |
| text_dir set | /opt/recon/data/processing/172f39ae7fc6f5b02e0fabcea450c0e4 |
| Import tests | all 6 modules pass (dispatcher, transcript_processor, filing, enricher, embedder, resolve_text_dir) |
Commits
matt/recon (refactor branch):
| Hash | Message |
|---|---|
| 66fadb7 | Phase 3: dispatcher, transcript processor, text_dir resolution |
| f69c04a | Phase 3: fix page_count in transcript processor |
Original text directory preserved
The original /opt/recon/data/text/172f39ae7fc6f5b02e0fabcea450c0e4/ directory was NOT deleted (copy-not-destroy approach). The original concept directory was removed during testing to allow re-enrichment, and the new concept output is in data/concepts/172f39ae.../window_0001.json (empty concepts, as expected).