Phase 6a: transcript organized-in-place documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-20 14:44:39 +02:00 · 2026-04-14 22:50:18 +00:00 · 263a81c1e2
commit 263a81c1e2
parent 1d4106643c
1 changed files with 145 additions and 0 deletions
--- a/phases/phase-6a-transcript-organized-in-place.md
+++ b/phases/phase-6a-transcript-organized-in-place.md
@@ -0,0 +1,145 @@
+# Phase 6a: Transcript Organized-In-Place
+
+## Objective
+
+Fix the transcript filing gap: transcripts that completed the pipeline still had
+`organized_at IS NULL`, so they matched that condition of the filing worker's
+query. Because their `documents.path` is a PeerTube watch URL (not a filesystem
+path), the worker's `path LIKE '/opt/recon/data/processing/%'` filter excluded
+them anyway: they were never filed, yet stayed permanently "unfiled" in the DB.
+
+Resolution: transcripts are not primary source files. They don't belong in
+`library/Domain/Subdomain/` like PDFs. Instead, they're marked organized
+in-place at the end of `pre_flight()`, preserving the PeerTube watch URL
+in `catalogue.path`.
+
+## Timestamp
+
+**Started:** 2026-04-14 ~22:40 UTC
+**Completed:** 2026-04-14 ~22:50 UTC
+
+## Backup
+
+- **File:** `/tmp/recon.db.phase6a.20260414.bak`
+- **MD5:** `5eb8fcd3edd73eae864bc38e3fac560f`
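A recorded checksum like the one above can be re-verified before relying on the backup. The following is a minimal sketch using only the standard library; the helper name `md5_of_file` and the throwaway demo file are illustrative assumptions, not part of the recon codebase.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file (a stand-in for the real backup path).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name

demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

Against the real backup, the value returned for `/tmp/recon.db.phase6a.20260414.bak` would be compared with the MD5 recorded in this section.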
+
+## Architectural Rationale
+
+Transcripts are **derived text** from PeerTube videos. The video is the
+primary source; the transcript is a text representation extracted by Whisper
+via peertube-runner on cortex.
+
+Filing transcripts to `/mnt/library/Domain/Subdomain/` would:
+1. Create redundant text files (the content is always recoverable from PeerTube)
+2. Overwrite `catalogue.path` with a library path, destroying the watch URL
+3. Overwrite `download_url` in Qdrant, breaking search→video linkage
+4. Serve no purpose — users clicking search results should land on PeerTube,
+   not a raw text file
+
+Instead, transcripts are marked `organized_at = CURRENT_TIMESTAMP` at the
+end of successful processing. The filing worker never sees them.
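The difference between "pending for the filing worker" and "unorganized" can be illustrated with an in-memory SQLite database. This is a sketch under assumptions: the one-table schema, the sample rows, and the example watch URL are hypothetical, and the worker's real query may carry more conditions than the two named in this document.

```python
import sqlite3

# Hypothetical one-table schema; the real documents table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (hash TEXT PRIMARY KEY, path TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents (hash, path) VALUES (?, ?)",
    [
        ("pdf1", "/opt/recon/data/processing/pdf1/source.pdf"),  # awaiting filing
        ("vid1", "https://stream.echo6.co/w/example"),  # made-up watch URL
    ],
)

# Only the two conditions named in this document; assumed, not the full query.
FILING_PENDING = (
    "SELECT hash FROM documents "
    "WHERE organized_at IS NULL AND path LIKE '/opt/recon/data/processing/%' "
    "ORDER BY hash"
)
UNORGANIZED = "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"


def hashes(sql):
    """Return the hash column of a query as a list."""
    return [row[0] for row in conn.execute(sql)]


pending_before = hashes(FILING_PENDING)
unorganized_before = hashes(UNORGANIZED)

# Mark the transcript organized in place; its path is left untouched.
conn.execute(
    "UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?", ("vid1",)
)

pending_after = hashes(FILING_PENDING)
unorganized_after = hashes(UNORGANIZED)
vid_path = conn.execute("SELECT path FROM documents WHERE hash = 'vid1'").fetchone()[0]
```

In this toy setup the transcript never appears in the pending list, but it only leaves the unorganized list once `organized_at` is set, and its watch URL is unchanged throughout.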
+
+## Code Change
+
+**File:** `lib/processors/transcript_processor.py`
+**Commit:** `df29d59` on `refactor` branch
+
+Merged `organized_at = CURRENT_TIMESTAMP` into the existing UPDATE that sets
+`text_dir` and `page_count` at the end of `pre_flight()`:
+
+```python
+# Before (Phase 3-5):
+conn.execute(
+    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
+    (proc_dir, len(pages), file_hash)
+)
+
+# After (Phase 6a):
+conn.execute(
+    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
+    (proc_dir, len(pages), file_hash)
+)
+```
+
+Plus a 6-line comment block explaining the rationale.
+
+Total diff: +8 lines, -2 lines.
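Because the timestamp is folded into the existing UPDATE, `text_dir`, `page_count`, and `organized_at` are written by a single statement. A self-contained sketch of that statement against an in-memory SQLite database follows; the `finish_transcript` helper, the schema, and the sample values are assumptions for illustration and do not reproduce `pre_flight()`.

```python
import sqlite3


def finish_transcript(conn, file_hash, proc_dir, page_count):
    """Record processing output and mark the row organized in one UPDATE."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE documents SET text_dir = ?, page_count = ?, "
            "organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
            (proc_dir, page_count, file_hash),
        )
    if cur.rowcount != 1:
        raise LookupError(f"no documents row for hash {file_hash!r}")


# Hypothetical minimal schema with a single sample row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "hash TEXT PRIMARY KEY, text_dir TEXT, page_count INTEGER, organized_at TEXT)"
)
conn.execute("INSERT INTO documents (hash) VALUES ('abc123')")
conn.commit()

finish_transcript(conn, "abc123", "/opt/recon/data/processing/abc123", 12)
row = conn.execute(
    "SELECT text_dir, page_count, organized_at FROM documents WHERE hash = 'abc123'"
).fetchone()
```

The row-count check is an addition of this sketch: it surfaces a hash that matches no row instead of letting the UPDATE silently do nothing.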
+
+## Back-fill
+
+One-time SQL update for the 2,260 drain items from Phase 5c-2:
+
+```sql
+UPDATE documents
+SET organized_at = CURRENT_TIMESTAMP
+WHERE hash IN (
+    SELECT d.hash
+    FROM documents d
+    JOIN catalogue c ON d.hash = c.hash
+    WHERE c.source = 'stream.echo6.co'
+      AND d.status = 'complete'
+      AND d.organized_at IS NULL
+      AND d.text_dir LIKE '/opt/recon/data/processing/%'
+);
+```
+
+**Result:** 2,260 rows updated. Post-count of matching unfiled drain items: 0.
+
+The `text_dir LIKE '/opt/recon/data/processing/%'` filter scoped the update to
+drain items only. It excluded the 276 pre-existing STATE 2 transcripts, whose
+`text_dir` is NULL: in SQL, `NULL LIKE ...` yields NULL, so those rows never match.
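One way to guard a one-time back-fill like this is to run it in a transaction and roll back if the affected row count differs from the expected figure. The sketch below reuses the SQL from this section against a small in-memory database; the `run_backfill` helper, the reduced schema, and the three sample rows are assumptions, not the procedure actually used in this phase.

```python
import sqlite3

BACKFILL_SQL = """
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
)
"""


def run_backfill(conn, expected_rows):
    """Apply the back-fill; roll back if the row count is not the expected one."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(BACKFILL_SQL)
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, updated {cur.rowcount}; rolled back"
            )
    return cur.rowcount


# Hypothetical reduced schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT)")
conn.execute(
    "CREATE TABLE documents ("
    "hash TEXT PRIMARY KEY, status TEXT, organized_at TEXT, text_dir TEXT)"
)
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?)",
    [("drain1", "stream.echo6.co"), ("state2", "stream.echo6.co"), ("pdf1", "other")],
)
conn.executemany(
    "INSERT INTO documents (hash, status, text_dir) VALUES (?, ?, ?)",
    [
        ("drain1", "complete", "/opt/recon/data/processing/drain1"),  # drain item
        ("state2", "complete", None),  # STATE 2: NULL text_dir
        ("pdf1", "complete", "/opt/recon/data/processing/pdf1"),  # other source
    ],
)
conn.commit()

updated = run_backfill(conn, expected_rows=1)
still_null = [
    row[0]
    for row in conn.execute(
        "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"
    )
]
```

In this toy data only the drain item is updated; the STATE 2 row and the row from another source keep `organized_at IS NULL`.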
+
+## Verification
+
+| Check | Expected | Actual |
+|-------|----------|--------|
+| catalogue rows | 29,812 | 29,812 |
+| documents rows | 29,812 | 29,812 |
+| Unfiled drain items | 0 | 0 |
+| Unfiled STATE 2 (untouched) | 276 | 276 |
+| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
+| Filed library paths (Phase 5a) | 16,596 | 16,596 |
+| Qdrant points | unchanged | 2,322,853 |
+| Processing dirs | ~2,262 | 2,263 |
+| Hopper | empty | 0 |
+| Filing worker pending | 0 | 0 |
+| Service errors (last 10 min) | 0 | 0 |
+
+(The +1 on processing dirs and watch URLs comes from a new transcript that the
+live service processed during this phase; this is expected behavior.)
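Expected-versus-actual checks of this kind lend themselves to a small table-driven runner. The following is only a sketch: the `run_checks` helper, the queries, the schema, and the sample rows are all assumptions chosen for illustration, and the queries behind the real figures in the table are not shown in this document.

```python
import sqlite3


def run_checks(conn, checks):
    """Run (name, sql, expected) checks; each query must return one integer."""
    results = []
    for name, sql, expected in checks:
        actual = conn.execute(sql).fetchone()[0]
        results.append((name, expected, actual, actual == expected))
    return results


# Hypothetical two-row database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE catalogue (hash TEXT PRIMARY KEY, path TEXT);
    CREATE TABLE documents (hash TEXT PRIMARY KEY, organized_at TEXT);
    INSERT INTO catalogue VALUES ('a', 'https://stream.echo6.co/w/example-a');
    INSERT INTO catalogue VALUES ('b', '/mnt/library/Domain/Subdomain/b.pdf');
    INSERT INTO documents VALUES ('a', '2026-04-14 22:50:00');
    INSERT INTO documents VALUES ('b', '2026-04-14 22:50:00');
    """
)

checks = [
    ("catalogue rows", "SELECT COUNT(*) FROM catalogue", 2),
    ("documents rows", "SELECT COUNT(*) FROM documents", 2),
    ("unorganized rows", "SELECT COUNT(*) FROM documents WHERE organized_at IS NULL", 0),
    ("watch URLs intact", "SELECT COUNT(*) FROM catalogue WHERE path LIKE 'https://%'", 1),
]

results = run_checks(conn, checks)
for name, expected, actual, ok in results:
    print(f"{name}: expected={expected} actual={actual} {'OK' if ok else 'MISMATCH'}")
```

Keeping the checks as data makes it easy to rerun the same list after each phase and compare the outcome with the previous report.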
+
+## Service Restart
+
+Service restarted at 22:47:50 UTC to pick up the code change. All threads
+came up cleanly: extract, enrich, embed, dispatcher, filing, progress, and
+dashboard. Zero errors in the 60-second stability window.
+
+## Commit
+
+- **Commit:** `df29d59` on `refactor` branch
+- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)
+
+## Known Inconsistencies (Backlog Items)
+
+### Phase 5a's 16,596 filed transcripts
+
+These were filed into `/mnt/library/Domain/Subdomain/` by Phase 5a's filing
+worker before this fix existed. Their `catalogue.path` was overwritten with
+a library path, so the PeerTube watch URL can no longer be recovered from the
+DB alone. Their Qdrant `download_url` points to `files.echo6.co` (the transcript
+text file), not the PeerTube video.
+
+Recovery would require matching video titles against PeerTube's API. This is
+a separate backlog item.
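The matching step of such a recovery could look roughly like the sketch below. It makes no API calls: it assumes the video list has already been fetched into dictionaries with `name` and `url` keys, and that a title is available for each affected document. Both are assumptions, as are the function names and the sample data. Titles that match no video, or more than one, are left unresolved instead of guessed.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_titles(catalogue_titles, videos):
    """Map each hash to a watch URL when its title matches exactly one video."""
    by_title = defaultdict(list)
    for video in videos:
        by_title[normalize_title(video["name"])].append(video["url"])

    matched, unresolved = {}, []
    for doc_hash, title in catalogue_titles.items():
        urls = by_title.get(normalize_title(title), [])
        if len(urls) == 1:
            matched[doc_hash] = urls[0]
        else:
            unresolved.append(doc_hash)  # no match, or an ambiguous title
    return matched, unresolved


# Made-up sample data; hashes, titles, and URLs are placeholders.
catalogue_titles = {
    "h1": "Intro to Radio: Part 1",
    "h2": "Weekly Update",
    "h3": "Lost Episode",
}
videos = [
    {"name": "Intro to Radio - Part 1", "url": "https://stream.echo6.co/w/example-1"},
    {"name": "Weekly Update", "url": "https://stream.echo6.co/w/example-2"},
    {"name": "Weekly update", "url": "https://stream.echo6.co/w/example-3"},
]

matched, unresolved = match_titles(catalogue_titles, videos)
```

Exact matching on normalized titles is deliberately conservative; anything it cannot resolve would need manual review or a fuzzier comparison.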
+
+### 276 STATE 2 pre-existing transcripts
+
+These live at `/opt/recon/data/text/{hash}/` (old scraper format). They have
+`status='complete'`, `organized_at IS NULL`, and NULL `text_dir`. They were
+explicitly excluded from Phase 6a's scope and will be handled separately.
+
+## Files Modified
+
+| File | Change |
+|------|--------|
+| `lib/processors/transcript_processor.py` | Set `organized_at` at end of `pre_flight()` |