Phase 6a: Transcript Organized-In-Place
Objective
Fix the transcript filing gap: transcripts completing the pipeline ended
with organized_at IS NULL, so the database counted them as awaiting filing.
Because their documents.path is a PeerTube watch URL (not a filesystem path),
the filing worker's path LIKE '/opt/recon/data/processing/%' filter
excluded them, so they were never filed and remained permanently "unfiled"
in the DB.
Resolution: transcripts are not primary source files. They don't belong in
library/Domain/Subdomain/ like PDFs. Instead, they're marked organized
in-place at the end of pre_flight(), preserving the PeerTube watch URL
in catalogue.path.
Timestamp
- Started: 2026-04-14 ~22:40 UTC
- Completed: 2026-04-14 ~22:50 UTC
Backup
- File: /tmp/recon.db.phase6a.20260414.bak
- MD5: 5eb8fcd3edd73eae864bc38e3fac560f
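Before relying on the backup for a rollback, its checksum can be re-verified. The following is a minimal sketch using only the standard library; the helper names file_md5 and backup_matches are hypothetical and not part of the recon codebase.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a large database file is never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True if the file's MD5 equals the recorded checksum (case-insensitive)."""
    return file_md5(path) == expected_md5.lower()
```

Usage would look like backup_matches("/tmp/recon.db.phase6a.20260414.bak", "5eb8fcd3edd73eae864bc38e3fac560f"), run on the host that holds the backup.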
Architectural Rationale
Transcripts are derived text from PeerTube videos. The video is the primary source; the transcript is a text representation extracted by Whisper via peertube-runner on cortex.
Filing transcripts to /mnt/library/Domain/Subdomain/ would:
- Create redundant text files (the content is always recoverable from PeerTube)
- Overwrite catalogue.path with a library path, destroying the watch URL
- Overwrite download_url in Qdrant, breaking search→video linkage
- Serve no purpose: users clicking search results should land on PeerTube, not a raw text file
Instead, transcripts are marked organized_at = CURRENT_TIMESTAMP at the
end of successful processing. The filing worker never sees them.
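The effect on the filing worker can be sketched with an in-memory SQLite database. The table layout and the exact shape of both queries are assumptions for illustration (only the organized_at IS NULL condition and the path LIKE filter come from this report), and the hashes and URL are made up.

```python
import sqlite3

# Assumed shape of the filing worker's query; the real one may differ.
FILING_QUERY = """
    SELECT hash FROM documents
    WHERE status = 'complete'
      AND organized_at IS NULL
      AND path LIKE '/opt/recon/data/processing/%'
"""
# Rows the DB reports as "unfiled", regardless of where their path points.
UNFILED_QUERY = """
    SELECT hash FROM documents
    WHERE status = 'complete' AND organized_at IS NULL
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "hash TEXT PRIMARY KEY, path TEXT, status TEXT, organized_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (hash, path, status) VALUES (?, ?, 'complete')",
        [
            ("pdf1", "/opt/recon/data/processing/pdf1/source.pdf"),  # hypothetical PDF
            ("vid1", "https://peertube.example/w/abc123"),  # hypothetical watch URL
        ],
    )

    def hashes(sql):
        return sorted(row[0] for row in conn.execute(sql))

    before = {"pending": hashes(FILING_QUERY), "unfiled": hashes(UNFILED_QUERY)}
    # Phase 6a behaviour: mark the transcript organized in place, path untouched.
    conn.execute(
        "UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = 'vid1'"
    )
    after = {"pending": hashes(FILING_QUERY), "unfiled": hashes(UNFILED_QUERY)}
    conn.close()
    return before, after
```

In this sketch the transcript is never pending for the worker, but it stops being counted as unfiled only after organized_at is set.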
Code Change
File: lib/processors/transcript_processor.py
Commit: df29d59 on refactor branch
Merged organized_at = CURRENT_TIMESTAMP into the existing UPDATE that sets
text_dir and page_count at the end of pre_flight():
# Before (Phase 3-5):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)
# After (Phase 6a):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)
Plus a 6-line comment block explaining the rationale.
Total diff: +8 lines, -2 lines.
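A self-contained sketch of the merged UPDATE, run against an in-memory SQLite database, is shown here. The mark_processed helper, the simplified table layout, and the sample values are hypothetical; only the UPDATE statement is taken from the change above. In SQLite, CURRENT_TIMESTAMP yields a UTC string of the form YYYY-MM-DD HH:MM:SS.

```python
import sqlite3


def mark_processed(conn, file_hash, proc_dir, page_count):
    """Hypothetical wrapper around the Phase 6a UPDATE; returns rows changed."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE documents SET text_dir = ?, page_count = ?, "
            "organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
            (proc_dir, page_count, file_hash),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "hash TEXT PRIMARY KEY, text_dir TEXT, page_count INTEGER, organized_at TEXT)"
)
conn.execute("INSERT INTO documents (hash) VALUES ('abc123')")  # made-up hash

changed = mark_processed(conn, "abc123", "/opt/recon/data/processing/abc123", 12)
row = conn.execute(
    "SELECT text_dir, page_count, organized_at FROM documents WHERE hash = 'abc123'"
).fetchone()
```

Because all three columns are written by one statement, a transcript cannot end up with text_dir set but organized_at still NULL.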
Back-fill
One-time SQL update for the 2,260 drain items from Phase 5c-2:
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
);
Result: 2,260 rows updated. Post-count of matching unfiled drain items: 0.
The text_dir LIKE '/opt/recon/data/processing/%' filter scoped the update
to drain items only, excluding the 276 STATE 2 pre-existing transcripts
(which have NULL text_dir).
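The scoping relies on SQL's NULL semantics: NULL LIKE 'pattern' evaluates to NULL rather than true, so rows with a NULL text_dir never satisfy the WHERE clause. The sketch below reproduces this with scaled-down, made-up data (3 drain items and 2 STATE 2 items). The simplified schema is an assumption, as is the choice to give the STATE 2 rows the same source value as the drain items.

```python
import sqlite3

BACKFILL = """
    UPDATE documents
    SET organized_at = CURRENT_TIMESTAMP
    WHERE hash IN (
        SELECT d.hash
        FROM documents d
        JOIN catalogue c ON d.hash = c.hash
        WHERE c.source = 'stream.echo6.co'
          AND d.status = 'complete'
          AND d.organized_at IS NULL
          AND d.text_dir LIKE '/opt/recon/data/processing/%'
    )
"""


def run_backfill_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
        CREATE TABLE documents (
            hash TEXT PRIMARY KEY, status TEXT, text_dir TEXT, organized_at TEXT
        );
        """
    )
    drain = [f"drain{i}" for i in range(3)]  # stand-ins for the drain items
    state2 = [f"old{i}" for i in range(2)]  # stand-ins for STATE 2 transcripts
    for h in drain + state2:
        conn.execute("INSERT INTO catalogue VALUES (?, 'stream.echo6.co')", (h,))
    for h in drain:
        conn.execute(
            "INSERT INTO documents VALUES (?, 'complete', ?, NULL)",
            (h, f"/opt/recon/data/processing/{h}"),
        )
    for h in state2:
        # NULL text_dir: the LIKE predicate is NULL, so these rows are skipped.
        conn.execute("INSERT INTO documents VALUES (?, 'complete', NULL, NULL)", (h,))

    updated = conn.execute(BACKFILL).rowcount
    still_unfiled = sorted(
        r[0] for r in conn.execute("SELECT hash FROM documents WHERE organized_at IS NULL")
    )
    conn.close()
    return updated, still_unfiled
```

Here only the three rows with a processing-directory text_dir are updated, and the two NULL text_dir rows stay unfiled.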
Verification
| Check | Expected | Actual |
|---|---|---|
| catalogue rows | 29,812 | 29,812 |
| documents rows | 29,812 | 29,812 |
| Unfiled drain items | 0 | 0 |
| Unfiled STATE 2 (untouched) | 276 | 276 |
| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
| Filed library paths (Phase 5a) | 16,596 | 16,596 |
| Qdrant points | unchanged | 2,322,853 |
| Processing dirs | ~2,262 | 2,263 |
| Hopper | empty | 0 |
| Filing worker pending | 0 | 0 |
| Service errors (last 10 min) | 0 | 0 |
(+1 on processing dirs and watch URLs from a new transcript processed by the live service during this phase — expected behavior.)
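The numeric rows of the table can be re-checked mechanically. In this sketch the approximate expectations ("~") are given a tolerance of 1, which is an assumption chosen to absorb the single transcript processed during the phase; the within helper is hypothetical. Note that 2,260 + 276 = 2,536, which suggests the watch URL expectation is the drain items plus the STATE 2 transcripts; that reading is an inference, not something the report states.

```python
def within(expected, actual, tolerance=0):
    """True if `actual` is within `tolerance` of `expected`."""
    return abs(actual - expected) <= tolerance


# (check name, expected, actual, tolerance); values copied from the table above.
CHECKS = [
    ("catalogue rows", 29812, 29812, 0),
    ("documents rows", 29812, 29812, 0),
    ("Unfiled drain items", 0, 0, 0),
    ("Unfiled STATE 2 (untouched)", 276, 276, 0),
    ("Watch URLs intact (catalogue.path)", 2536, 2537, 1),
    ("Filed library paths (Phase 5a)", 16596, 16596, 0),
    ("Processing dirs", 2262, 2263, 1),
]

failures = [name for name, exp, act, tol in CHECKS if not within(exp, act, tol)]
```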
Service Restart
Service restarted at 22:47:50 UTC to pick up the code change. All threads came up cleanly: extract, enrich, embed, dispatcher, filing, progress, dashboard. Zero errors in 60-second stability window.
Commit
- Commit: df29d59 on refactor branch
- Pushed to: forge.echo6.co/matt/recon (origin/refactor)
Known Inconsistencies (Backlog Items)
Phase 5a's 16,596 filed transcripts
These were filed into /mnt/library/Domain/Subdomain/ by Phase 5a's filing
worker before this fix existed. Their catalogue.path was overwritten with
a library path — the PeerTube watch URL is permanently lost from the DB.
Their Qdrant download_url points to files.echo6.co (transcript text file),
not the PeerTube video.
Recovery would require matching video titles against PeerTube's API. This is a separate backlog item.
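One possible shape for that title matching is sketched below. It is offline only: fetching the video list from PeerTube is left out, and the inputs are assumed to be a mapping of hash to stored title and a list of (title, watch URL) pairs. Whether the catalogue keeps a usable title per hash is itself an assumption, and the function names are hypothetical. Titles that match zero or several videos are left unresolved rather than guessed.

```python
import re
from collections import defaultdict


def normalize(title):
    """Lowercase, turn runs of non-alphanumerics into spaces, trim the ends."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def match_watch_urls(catalogue_titles, peertube_videos):
    """Map hash -> watch URL where the normalized title matches exactly one video.

    catalogue_titles: dict of hash -> title (assumed to be available in the DB)
    peertube_videos:  iterable of (title, watch_url) pairs (assumed pre-fetched)
    """
    by_title = defaultdict(set)
    for title, url in peertube_videos:
        by_title[normalize(title)].add(url)

    matched, unresolved = {}, []
    for file_hash, title in catalogue_titles.items():
        urls = by_title.get(normalize(title), set())
        if len(urls) == 1:
            matched[file_hash] = next(iter(urls))
        else:
            # No match, or an ambiguous title shared by several videos.
            unresolved.append(file_hash)
    return matched, sorted(unresolved)
```

Keeping ambiguous titles in an unresolved list makes the manual review set explicit instead of silently linking a transcript to the wrong video.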
276 STATE 2 pre-existing transcripts
These live at /opt/recon/data/text/{hash}/ (old scraper format). They have
status='complete', organized_at IS NULL, and NULL text_dir. They were
explicitly excluded from Phase 6a's scope and will be handled separately.
Files Modified
| File | Change |
|---|---|
| lib/processors/transcript_processor.py | Set organized_at at end of pre_flight() |