mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 14:44:39 +02:00

Matt 0747cb761f Phase 3: transcript processor end-to-end test doc

Documents dispatcher, transcript processor, text_dir resolution,
and full pipeline test results (172f39ae → skip_unclassified).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 15:44:25 +00:00


Phase 3: First End-to-End Test (Transcript Processor)

Executed: 2026-04-14T15:40Z

Backup

Item	Location	MD5 Hash
recon.db (pre-Phase 3)	CT 130: `/tmp/recon.db.phase2.20260414.bak`	`20ec1fec2247a999e7d42f6a716481b0`
Test row SQL backup	CT 130: `/tmp/recon_phase3_testrow_172f39ae.sql`	—

What Was Created

1. `lib/dispatcher.py` — `dispatch_once()` one-shot dispatcher

Scans each configured acquired/<subfolder>/ for stable content+sidecar pairs (.txt + .meta.json sharing a basename). When a pair has been stable for pipeline.mtime_stability_seconds, hands it to the registered processor's pre_flight().

Dynamically imports processor modules via importlib
Unknown/missing processors log an error and skip (don't crash)
Returns list of result dicts from all processor calls
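
The scan-and-route behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `lib/dispatcher.py`: the names `find_stable_pairs` and `load_processor` are hypothetical, and the sketch assumes that "stable" means neither file in the pair has been modified for at least `stability_seconds`.

```python
import importlib
import logging
import time
from pathlib import Path

log = logging.getLogger("dispatcher_sketch")


def find_stable_pairs(folder, stability_seconds, now=None):
    """Return (content, sidecar) pairs where both files have been unmodified long enough."""
    now = time.time() if now is None else now
    pairs = []
    for content in sorted(Path(folder).glob("*.txt")):
        # The sidecar shares the basename: foo.txt -> foo.meta.json
        sidecar = content.with_name(content.stem + ".meta.json")
        if not sidecar.exists():
            continue  # content without a sidecar is not a pair yet
        newest_mtime = max(content.stat().st_mtime, sidecar.stat().st_mtime)
        if now - newest_mtime >= stability_seconds:
            pairs.append((content, sidecar))
    return pairs


def load_processor(module_name):
    """Import a processor module by name; log and return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        log.error("processor module %r could not be imported, skipping", module_name)
        return None
```

Taking the newer of the two mtimes means a pair is held back while either file is still being written, which is one plausible reading of the stability rule.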

2. `lib/processors/transcript_processor.py` — `pre_flight()`

Handles transcript content from acquired/stream/:

Reads content file, computes MD5 hash
Deduplicates against catalogue by hash
Reads .meta.json sidecar
Moves pair to processing/{hash}/ (renames content to transcript.txt)
Splits raw text into page_NNNN.txt files using chunk_text() (2000 words/page)
Registers in catalogue + documents tables
Sets documents.text_dir to the processing directory
Sets documents.page_count from page split
Advances status to extracted
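
As a rough illustration of the hashing and page-split steps listed above, the sketch below hashes a content file and writes fixed-size word pages. It is not the project's `chunk_text()`: the helper names `md5_of_file`, `chunk_words`, and `write_pages` are hypothetical, and splitting on whitespace is an assumption about how the 2000-word pages are formed.

```python
import hashlib
from pathlib import Path

WORDS_PER_PAGE = 2000  # page size stated in this document


def md5_of_file(path):
    """Hex MD5 of the file's raw bytes, usable as a dedup key."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def chunk_words(text, words_per_page=WORDS_PER_PAGE):
    """Split text into pages of at most words_per_page whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def write_pages(text, out_dir, words_per_page=WORDS_PER_PAGE):
    """Write page_0001.txt, page_0002.txt, ... and return the page count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = chunk_words(text, words_per_page)
    for number, page in enumerate(pages, start=1):
        (out / f"page_{number:04d}.txt").write_text(page, encoding="utf-8")
    return len(pages)
```

The return value of `write_pages` is the kind of number that would be stored in `documents.page_count`.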

TODO (flagged for later phases): Level-4 name deduplication not implemented for transcripts.

3. `lib/utils.py` — `resolve_text_dir()`

Helper function that resolves the text directory for a document:

If documents.text_dir is set, use that
Otherwise fall back to legacy config['paths']['text']/{hash}/
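
A minimal sketch of that lookup order, assuming `db` behaves like a `sqlite3` connection and that the `documents` table is keyed by a column named `hash` (both are assumptions; only the `text_dir` column and the `config['paths']['text']` fallback come from this document). The function is named `resolve_text_dir_sketch` to make clear it is not the real helper.

```python
import os
import sqlite3


def resolve_text_dir_sketch(file_hash, config, conn):
    """Prefer documents.text_dir; otherwise use the legacy <text>/<hash> path."""
    row = conn.execute(
        "SELECT text_dir FROM documents WHERE hash = ?",  # 'hash' column name is assumed
        (file_hash,),
    ).fetchone()
    if row and row[0]:
        return row[0]
    # No row, or text_dir is NULL/empty: fall back to the legacy location
    return os.path.join(config["paths"]["text"], file_hash)
```

Treating a missing row the same as a NULL `text_dir` keeps callers working for documents registered before the column was populated.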

4. Package scaffolding

lib/processors/__init__.py — empty
lib/acquisition/__init__.py — empty

What Changed in Existing Code

`lib/enricher.py` (lines 28 and 349)

Added: from .utils import resolve_text_dir
Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)

`lib/embedder.py` (lines 24 and 278)

Added: from .utils import resolve_text_dir
Changed: text_dir = os.path.join(config['paths']['text'], file_hash) → text_dir = resolve_text_dir(file_hash, config, db)

Both changes are minimal and additive: resolve_text_dir() falls back to the legacy path when text_dir is NULL, so existing documents are unaffected.

End-to-End Test

Test transcript

Field	Value
Hash	`172f39ae7fc6f5b02e0fabcea450c0e4`
Title	Welcome to YouTube Memberships!
Channel	stefan-sobkowiak-miracle-farms
Source	stream.echo6.co
Duration	331 seconds
Pages	1 (3,747 bytes)

Test workflow

Copy-and-unprocess approach: restored the raw text from the existing page_0001.txt, staged it in acquired/stream/ with its meta.json sidecar, and deleted the existing catalogue and documents rows (SQL backup in /tmp/).

Pipeline execution (all manual one-shot calls):

Step	Command	Result
1. Dispatch	`dispatch_once()`	`action: extracted` — pair found, stable, routed to `transcript_processor`
2. Enrich	`enrich_single(hash, db, config, rotator)`	`True` — text_dir resolved correctly, Gemini returned 0 concepts
3. Embed	`embed_single(hash, db, config)`	`True` — 0 concepts → status=complete, 0 vectors
4. File	`file_processed_item(hash, source_path, db, config)`	`action: skip_unclassified` — no domain from empty concepts

Path traversal

acquired/stream/172f39ae...txt + .meta.json
  → processing/172f39ae.../transcript.txt + meta.json + page_0001.txt
  → (enriched, 0 concepts → skip_unclassified, not filed to library)

What `skip_unclassified` means

The transcript is a YouTube Memberships announcement with no extractable knowledge concepts. Gemini correctly returned 0 concepts, and the filing function correctly refused to classify it. This is the expected behavior for low-content transcripts.

Bug Found and Fixed During Testing

page_count NULL in documents table: The queue_document() method copies filename/path/size from catalogue but doesn't set page_count, so the column stays NULL. The enricher then fails at line 451 on doc.get('page_count', 0) >= 3: .get() only falls back to the default when the key is missing, so a key that exists with a NULL value yields None, and comparing None with an integer raises TypeError in Python 3.

Fix: Transcript processor now sets page_count alongside text_dir in the documents UPDATE. Commit f69c04a.
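
The failure mode is easy to reproduce in isolation. The snippet below is only an illustration with a hand-built dict standing in for a database row; the `or 0` read is a possible defensive pattern on the consumer side, not what commit f69c04a did (that commit fixes the producer by writing page_count).

```python
# A row dict where the column exists but holds NULL (None in Python)
doc = {"page_count": None}

# The key is present, so the default of 0 is ignored and None comes back
value = doc.get("page_count", 0)

# Defensive read: falls back to 0 for a missing key and for a None value
safe = doc.get("page_count") or 0

try:
    value >= 3  # None >= 3 is not a valid comparison in Python 3
except TypeError:
    comparison_failed = True
else:
    comparison_failed = False
```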

Verification

Check	Result
catalogue rows	29,812 (removed 1 old, added 1 new)
documents rows	29,812 (same)
recon.service	inactive
recon-watchdog.service	inactive
processing/ dir	`172f39ae...` with transcript.txt, meta.json, page_0001.txt
acquired/stream/	empty (pair consumed)
text_dir set	`/opt/recon/data/processing/172f39ae7fc6f5b02e0fabcea450c0e4`
Import tests	all 6 modules pass (dispatcher, transcript_processor, filing, enricher, embedder, resolve_text_dir)

Commits

matt/recon (refactor branch):

Hash	Message
`66fadb7`	Phase 3: dispatcher, transcript processor, text_dir resolution
`f69c04a`	Phase 3: fix page_count in transcript processor

Original text directory preserved

The original /opt/recon/data/text/172f39ae7fc6f5b02e0fabcea450c0e4/ directory was NOT deleted (copy-not-destroy approach). The original concept directory was removed during testing to allow re-enrichment, and the new concept output is in data/concepts/172f39ae.../window_0001.json (empty concepts, as expected).
