diff --git a/phases/phase-6a-transcript-organized-in-place.md b/phases/phase-6a-transcript-organized-in-place.md
new file mode 100644
index 0000000..978078f
--- /dev/null
+++ b/phases/phase-6a-transcript-organized-in-place.md
@@ -0,0 +1,145 @@
+# Phase 6a: Transcript Organized-In-Place
+
+## Objective
+
+Fix the transcript filing gap: transcripts completing the pipeline had
+`organized_at IS NULL`, which would normally make them candidates for the
+filing worker's query. Because their `documents.path` is a PeerTube watch URL
+(not a filesystem path), the `path LIKE '/opt/recon/data/processing/%'` filter
+excluded them anyway, so they remained permanently "unfiled" in the DB.
+
+Resolution: transcripts are not primary source files. They don't belong in
+`library/Domain/Subdomain/` like PDFs. Instead, they're marked organized
+in-place at the end of `pre_flight()`, preserving the PeerTube watch URL
+in `catalogue.path`.
+
+## Timestamp
+
+**Started:** 2026-04-14 ~22:40 UTC
+**Completed:** 2026-04-14 ~22:50 UTC
+
+## Backup
+
+- **File:** `/tmp/recon.db.phase6a.20260414.bak`
+- **MD5:** `5eb8fcd3edd73eae864bc38e3fac560f`
+
+## Architectural Rationale
+
+Transcripts are **derived text** from PeerTube videos. The video is the
+primary source; the transcript is a text representation extracted by Whisper
+via peertube-runner on cortex.
+
+Filing transcripts to `/mnt/library/Domain/Subdomain/` would:
+1. Create redundant text files (the content is always recoverable from PeerTube)
+2. Overwrite `catalogue.path` with a library path, destroying the watch URL
+3. Overwrite `download_url` in Qdrant, breaking search→video linkage
+4. Serve no purpose: users clicking search results should land on PeerTube,
+   not a raw text file
+
+Instead, transcripts are marked `organized_at = CURRENT_TIMESTAMP` at the
+end of successful processing. The filing worker never sees them.
+
+## Code Change
+
+**File:** `lib/processors/transcript_processor.py`
+**Commit:** `df29d59` on `refactor` branch
+
+Merged `organized_at = CURRENT_TIMESTAMP` into the existing UPDATE that sets
+`text_dir` and `page_count` at the end of `pre_flight()`:
+
+```python
+# Before (Phase 3-5):
+conn.execute(
+    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
+    (proc_dir, len(pages), file_hash)
+)
+
+# After (Phase 6a):
+conn.execute(
+    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
+    (proc_dir, len(pages), file_hash)
+)
+```
+
+Plus a 6-line comment block explaining the rationale.
+
+Total diff: +8 lines, -2 lines.
+
+## Back-fill
+
+One-time SQL update for the 2,260 drain items from Phase 5c-2:
+
+```sql
+UPDATE documents
+SET organized_at = CURRENT_TIMESTAMP
+WHERE hash IN (
+    SELECT d.hash
+    FROM documents d
+    JOIN catalogue c ON d.hash = c.hash
+    WHERE c.source = 'stream.echo6.co'
+      AND d.status = 'complete'
+      AND d.organized_at IS NULL
+      AND d.text_dir LIKE '/opt/recon/data/processing/%'
+);
+```
+
+**Result:** 2,260 rows updated. Post-count of matching unfiled drain items: 0.
+
+The `text_dir LIKE '/opt/recon/data/processing/%'` filter scoped the update
+to drain items only, excluding the 276 STATE 2 pre-existing transcripts
+(which have NULL `text_dir`).
+
+## Verification
+
+| Check | Expected | Actual |
+|-------|----------|--------|
+| catalogue rows | 29,812 | 29,812 |
+| documents rows | 29,812 | 29,812 |
+| Unfiled drain items | 0 | 0 |
+| Unfiled STATE 2 (untouched) | 276 | 276 |
+| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
+| Filed library paths (Phase 5a) | 16,596 | 16,596 |
+| Qdrant points | unchanged | 2,322,853 |
+| Processing dirs | ~2,262 | 2,263 |
+| Hopper | empty | 0 |
+| Filing worker pending | 0 | 0 |
+| Service errors (last 10 min) | 0 | 0 |
+
+(+1 on processing dirs and watch URLs from a new transcript processed by the
+live service during this phase, which is expected behavior.)
+
+## Service Restart
+
+Service restarted at 22:47:50 UTC to pick up the code change. All threads
+came up cleanly: extract, enrich, embed, dispatcher, filing, progress,
+dashboard. Zero errors in 60-second stability window.
+
+## Commit
+
+- **Commit:** `df29d59` on `refactor` branch
+- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)
+
+## Known Inconsistencies (Backlog Items)
+
+### Phase 5a's 16,596 filed transcripts
+
+These were filed into `/mnt/library/Domain/Subdomain/` by Phase 5a's filing
+worker before this fix existed. Their `catalogue.path` was overwritten with
+a library path, so the PeerTube watch URL is permanently lost from the DB.
+Their Qdrant `download_url` points to `files.echo6.co` (transcript text file),
+not the PeerTube video.
+
+Recovery would require matching video titles against PeerTube's API. This is
+a separate backlog item.
+
+### 276 STATE 2 pre-existing transcripts
+
+These live at `/opt/recon/data/text/{hash}/` (old scraper format). They have
+`status='complete'`, `organized_at IS NULL`, and NULL `text_dir`. They were
+explicitly excluded from Phase 6a's scope and will be handled separately.
+
+## Files Modified
+
+| File | Change |
+|------|--------|
+| `lib/processors/transcript_processor.py` | Set `organized_at` at end of `pre_flight()` |
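
Reviewer sketch (not part of the committed file): the back-fill scoping described in the phase doc can be reproduced against a throwaway in-memory database. This assumes the store is SQLite, which the `recon.db` backup name and the `?` placeholders suggest but the doc does not state. The two-table schema, the sample hashes, and the `other.example` source are hypothetical and reduced to the columns the query touches.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the back-fill query reads or writes.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        hash TEXT PRIMARY KEY, status TEXT, text_dir TEXT, organized_at TEXT
    );
    CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
    """
)

# Illustrative rows (hash, status, text_dir, source); none are real data.
rows = [
    ("drain-1", "complete", "/opt/recon/data/processing/drain-1", "stream.echo6.co"),
    ("drain-2", "complete", "/opt/recon/data/processing/drain-2", "stream.echo6.co"),
    ("state2-1", "complete", None, "stream.echo6.co"),  # STATE 2: NULL text_dir
    ("pdf-1", "complete", "/opt/recon/data/processing/pdf-1", "other.example"),
]
for doc_hash, status, text_dir, source in rows:
    conn.execute(
        "INSERT INTO documents (hash, status, text_dir) VALUES (?, ?, ?)",
        (doc_hash, status, text_dir),
    )
    conn.execute(
        "INSERT INTO catalogue (hash, source) VALUES (?, ?)", (doc_hash, source)
    )

# Same statement as the Back-fill section of the phase doc.
BACKFILL = """
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
)
"""
updated = conn.execute(BACKFILL).rowcount

# Rows the back-fill left alone, ordered for a stable comparison.
unfiled = [
    r[0]
    for r in conn.execute(
        "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"
    )
]
print(updated, unfiled)
```

In this toy data set only the two drain rows are updated. The STATE 2 row is skipped because `NULL LIKE '...'` evaluates to NULL rather than true, which is the scoping behavior the phase doc relies on, and the non-transcript row is skipped by the `source` filter.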