mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 14:44:39 +02:00
Documents dispatcher, transcript processor, text_dir resolution, and full pipeline test results (172f39ae → skip_unclassified). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
147 lines
5.6 KiB
Markdown
# Phase 3: First End-to-End Test (Transcript Processor)

**Executed:** 2026-04-14T15:40Z UTC

---
## Backup

| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (pre-Phase 3) | CT 130: `/tmp/recon.db.phase2.20260414.bak` | `20ec1fec2247a999e7d42f6a716481b0` |
| Test row SQL backup | CT 130: `/tmp/recon_phase3_testrow_172f39ae.sql` | - |

---
## What Was Created

### 1. `lib/dispatcher.py`: `dispatch_once()` one-shot dispatcher

Scans each configured `acquired/<subfolder>/` for stable content+sidecar pairs (`.txt` + `.meta.json` sharing a basename). Once a pair has been stable for `pipeline.mtime_stability_seconds`, the dispatcher hands it to the registered processor's `pre_flight()`.

- Dynamically imports processor modules via `importlib`
- Unknown or missing processors are logged as errors and skipped (they don't crash the dispatcher)
- Returns a list of result dicts from all processor calls

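The pair scan and the tolerant import described above can be sketched roughly as follows. This is a minimal illustration, not the code in `lib/dispatcher.py`: the helper names `find_stable_pairs` and `load_processor` are hypothetical, and it assumes "stable" means that both files' modification times are at least the configured number of seconds old.

```python
import importlib
import logging
import time
from pathlib import Path

log = logging.getLogger("dispatcher_sketch")


def find_stable_pairs(subfolder, stability_seconds, now=None):
    """Return (content, sidecar) pairs whose mtimes are both old enough.

    Hypothetical helper: assumes stability is judged by file mtime only.
    """
    now = time.time() if now is None else now
    pairs = []
    for content in sorted(Path(subfolder).glob("*.txt")):
        # Sidecar shares the basename: foo.txt <-> foo.meta.json
        sidecar = content.with_name(content.stem + ".meta.json")
        if not sidecar.exists():
            continue  # incomplete pair, leave it for a later scan
        newest = max(content.stat().st_mtime, sidecar.stat().st_mtime)
        if now - newest >= stability_seconds:
            pairs.append((content, sidecar))
    return pairs


def load_processor(module_name):
    """Import a processor module, or log an error and return None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        log.error("processor %r not found, skipping", module_name)
        return None
```

A caller would loop over the returned pairs and skip any whose processor resolves to `None`, which matches the "log and skip" behaviour listed above.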
### 2. `lib/processors/transcript_processor.py`: `pre_flight()`

Handles transcript content from `acquired/stream/`:

1. Reads the content file and computes its MD5 hash
2. Deduplicates against the catalogue by hash
3. Reads the `.meta.json` sidecar
4. Moves the pair to `processing/{hash}/` (renames the content file to `transcript.txt`)
5. Splits the raw text into `page_NNNN.txt` files using `chunk_text()` (2000 words/page)
6. Registers the item in the catalogue and documents tables
7. Sets `documents.text_dir` to the processing directory
8. Sets `documents.page_count` from the page split
9. Advances status to `extracted`

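Steps 1 and 5 above (hashing and page splitting) might look something like the sketch below. The function names `md5_of_file`, `chunk_words`, and `write_pages` are illustrative stand-ins, not the project's `chunk_text()`. The sketch assumes pages are cut on whitespace-separated words, which may differ from the real implementation.

```python
import hashlib
from pathlib import Path


def md5_of_file(path):
    """Hex MD5 digest of a file's bytes, usable as a dedup key."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def chunk_words(text, words_per_page=2000):
    """Split text into pages of at most words_per_page words.

    Assumption: words are whitespace-separated; original line breaks
    and spacing are not preserved.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def write_pages(text, out_dir, words_per_page=2000):
    """Write page_0001.txt, page_0002.txt, ... and return the page count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = chunk_words(text, words_per_page)
    for number, page in enumerate(pages, start=1):
        (out / f"page_{number:04d}.txt").write_text(page, encoding="utf-8")
    return len(pages)
```

Returning the page count from the writer makes it straightforward to store `page_count` in the same update that sets `text_dir`.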
**TODO (flagged for later phases):** Level-4 name deduplication is not implemented for transcripts.

### 3. `lib/utils.py`: `resolve_text_dir()`

Helper function that resolves the text directory for a document:

- If `documents.text_dir` is set, use that
- Otherwise fall back to legacy `config['paths']['text']/{hash}/`

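The two-branch lookup above could be sketched as follows. This is an assumption-laden illustration rather than the code in `lib/utils.py`: it takes a plain `sqlite3` connection and assumes the documents table is keyed by a column named `hash`, neither of which is confirmed by this report. The function is named `resolve_text_dir_sketch` to keep it distinct from the real helper.

```python
import os
import sqlite3


def resolve_text_dir_sketch(file_hash, config, conn):
    """Prefer documents.text_dir; otherwise use the legacy text path.

    Assumptions: conn is a sqlite3 connection and the documents table
    has `hash` and `text_dir` columns.
    """
    row = conn.execute(
        "SELECT text_dir FROM documents WHERE hash = ?", (file_hash,)
    ).fetchone()
    if row is not None and row[0]:
        return row[0]  # explicit per-document directory
    # NULL text_dir or unknown document: legacy layout
    return os.path.join(config["paths"]["text"], file_hash)
```

Because a NULL `text_dir` and a missing row both take the fallback branch, documents ingested before this phase keep resolving to their existing directories.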
### 4. Package scaffolding

- `lib/processors/__init__.py`: empty
- `lib/acquisition/__init__.py`: empty

---
## What Changed in Existing Code

### `lib/enricher.py` (lines 28, 349)

- Added: `from .utils import resolve_text_dir`
- Changed: `text_dir = os.path.join(config['paths']['text'], file_hash)` → `text_dir = resolve_text_dir(file_hash, config, db)`

### `lib/embedder.py` (lines 24, 278)

- Added: `from .utils import resolve_text_dir`
- Changed: `text_dir = os.path.join(config['paths']['text'], file_hash)` → `text_dir = resolve_text_dir(file_hash, config, db)`

Both changes are minimal and additive: `resolve_text_dir()` falls back to the legacy path when `text_dir` is NULL, so existing documents are unaffected.

---
## End-to-End Test

### Test transcript

| Field | Value |
|-------|-------|
| Hash | `172f39ae7fc6f5b02e0fabcea450c0e4` |
| Title | Welcome to YouTube Memberships! |
| Channel | stefan-sobkowiak-miracle-farms |
| Source | stream.echo6.co |
| Duration | 331 seconds |
| Pages | 1 (3,747 bytes) |

### Test workflow

**Copy-and-unprocess approach:** Concatenated the existing `page_0001.txt` back into raw text, staged it in `acquired/stream/` with a `.meta.json` sidecar, and deleted the existing catalogue and documents rows (backed up in `/tmp/`).

**Pipeline execution (all manual one-shot calls):**

| Step | Command | Result |
|------|---------|--------|
| 1. Dispatch | `dispatch_once()` | `action: extracted` (pair found, stable, routed to `transcript_processor`) |
| 2. Enrich | `enrich_single(hash, db, config, rotator)` | `True` (text_dir resolved correctly, Gemini returned 0 concepts) |
| 3. Embed | `embed_single(hash, db, config)` | `True` (0 concepts → status=complete, 0 vectors) |
| 4. File | `file_processed_item(hash, source_path, db, config)` | `action: skip_unclassified` (no domain from empty concepts) |

### Path traversal

```
acquired/stream/172f39ae...txt + .meta.json
→ processing/172f39ae.../transcript.txt + meta.json + page_0001.txt
→ (enriched, 0 concepts → skip_unclassified, not filed to library)
```

### What `skip_unclassified` means

The transcript is a YouTube Memberships announcement with no extractable knowledge concepts. Gemini correctly returned 0 concepts, and the filing function correctly refused to classify it. This is the expected behavior for low-content transcripts.

---
## Bug Found and Fixed During Testing

**`page_count` NULL in documents table:** The `queue_document()` method copies filename/path/size from catalogue but doesn't set `page_count`. The enricher then fails at line 451 on `doc.get('page_count', 0) >= 3`, because `.get()` returns `None`, not the default `0`, when the key exists with a NULL value.

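The `.get()` behaviour behind this bug is easy to reproduce in isolation. The snippet below is a small illustration, with `has_enough_pages` as a hypothetical defensive helper. It is not the change that was committed; it only shows why the default was never applied and one way a comparison can be guarded. In Python 3, comparing `None` with an integer using `>=` raises `TypeError`.

```python
def has_enough_pages(doc, minimum=3):
    """Return True if doc has at least `minimum` pages, treating NULL as 0."""
    page_count = doc.get("page_count") or 0  # None and a missing key both become 0
    return page_count >= minimum


row = {"page_count": None}  # key present, value NULL in the database

# The default passed to .get() is only used when the key is absent.
assert row.get("page_count", 0) is None

# The guarded helper does not raise on the same row.
assert has_enough_pages(row) is False
```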
**Fix:** Transcript processor now sets `page_count` alongside `text_dir` in the documents UPDATE. Commit `f69c04a`.

---
## Verification

| Check | Result |
|-------|--------|
| catalogue rows | 29,812 (removed 1 old, added 1 new) |
| documents rows | 29,812 (same) |
| recon.service | inactive |
| recon-watchdog.service | inactive |
| processing/ dir | `172f39ae...` with transcript.txt, meta.json, page_0001.txt |
| acquired/stream/ | empty (pair consumed) |
| text_dir set | `/opt/recon/data/processing/172f39ae7fc6f5b02e0fabcea450c0e4` |
| Import tests | all 6 modules pass (dispatcher, transcript_processor, filing, enricher, embedder, resolve_text_dir) |

---
## Commits

**matt/recon (refactor branch):**

| Hash | Message |
|------|---------|
| `66fadb7` | Phase 3: dispatcher, transcript processor, text_dir resolution |
| `f69c04a` | Phase 3: fix page_count in transcript processor |

---
## Original text directory preserved

The original `/opt/recon/data/text/172f39ae7fc6f5b02e0fabcea450c0e4/` directory was NOT deleted (copy-not-destroy approach). The original concept directory was removed during testing to allow re-enrichment, and the new concept output is in `data/concepts/172f39ae.../window_0001.json` (empty concepts, as expected).