refactored-recon/phases/phase-6a-transcript-organized-in-place.md
Matt 263a81c1e2 Phase 6a: transcript organized-in-place documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 22:50:18 +00:00

Phase 6a: Transcript Organized-In-Place

Objective

Fix the transcript filing gap. Transcripts that completed the pipeline were left with organized_at IS NULL, which made them candidates for the filing worker's query. Because their documents.path is a PeerTube watch URL (not a filesystem path), the filing worker's path LIKE '/opt/recon/data/processing/%' filter excluded them, so they were never filed and remained permanently "unfiled" in the DB.

Resolution: transcripts are not primary source files, so they do not belong under /mnt/library/Domain/Subdomain/ the way PDFs do. Instead, they are marked organized in place at the end of pre_flight(), which preserves the PeerTube watch URL in catalogue.path.
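The gap and its resolution can be reproduced with a minimal sketch against an in-memory SQLite database. The reduced schema, the sample rows, the watch URL path, and the shape of the worker query are assumptions for illustration (the real filing worker query is not shown in this document); only the organized_at IS NULL condition and the path LIKE filter come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema: only the columns needed to show the gap.
conn.execute(
    "CREATE TABLE documents (hash TEXT PRIMARY KEY, path TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents (hash, path) VALUES (?, ?)",
    [
        ("pdf1", "/opt/recon/data/processing/pdf1/source.pdf"),
        ("vid1", "https://stream.echo6.co/w/example"),  # hypothetical watch URL
    ],
)

# Assumed shape of the filing worker's query.
PENDING = (
    "SELECT hash FROM documents "
    "WHERE organized_at IS NULL AND path LIKE '/opt/recon/data/processing/%' "
    "ORDER BY hash"
)
# Everything the DB still reports as unfiled.
UNFILED = "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"

pending = [row[0] for row in conn.execute(PENDING)]
unfiled = [row[0] for row in conn.execute(UNFILED)]
print(pending)  # the transcript is excluded by the path filter
print(unfiled)  # ...but it still counts as unfiled

# Marking the transcript organized in place closes the gap without touching path.
conn.execute(
    "UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    ("vid1",),
)
unfiled_after = [row[0] for row in conn.execute(UNFILED)]
vid_path = conn.execute(
    "SELECT path FROM documents WHERE hash = 'vid1'"
).fetchone()[0]
print(unfiled_after, vid_path)
```

The transcript row never appears in the pending list, yet it is reported as unfiled until organized_at is set; its path is unchanged afterwards.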

Timestamp

  • Started: 2026-04-14 ~22:40 UTC
  • Completed: 2026-04-14 ~22:50 UTC

Backup

  • File: /tmp/recon.db.phase6a.20260414.bak
  • MD5: 5eb8fcd3edd73eae864bc38e3fac560f
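Before relying on the backup for a rollback, its checksum can be compared against the recorded MD5. The following is a minimal sketch; the helper names md5_of_file and backup_matches are hypothetical and not part of the recon codebase.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True if the file at path hashes to the recorded checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage against the values recorded above (only meaningful where the file exists):
# backup_matches("/tmp/recon.db.phase6a.20260414.bak",
#                "5eb8fcd3edd73eae864bc38e3fac560f")
```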

Architectural Rationale

Transcripts are derived text from PeerTube videos. The video is the primary source; the transcript is a text representation extracted by Whisper via peertube-runner on cortex.

Filing transcripts to /mnt/library/Domain/Subdomain/ would:

  1. Create redundant text files (the content is always recoverable from PeerTube)
  2. Overwrite catalogue.path with a library path, destroying the watch URL
  3. Overwrite download_url in Qdrant, breaking search→video linkage
  4. Serve no purpose — users clicking search results should land on PeerTube, not a raw text file

Instead, transcripts are marked organized_at = CURRENT_TIMESTAMP at the end of successful processing. The filing worker never sees them.

Code Change

  • File: lib/processors/transcript_processor.py
  • Commit: df29d59 on refactor branch

Merged organized_at = CURRENT_TIMESTAMP into the existing UPDATE that sets text_dir and page_count at the end of pre_flight():

# Before (Phase 3-5):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

# After (Phase 6a):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

Plus a 6-line comment block explaining the rationale.

Total diff: +8 lines, -2 lines.

Back-fill

One-time SQL update for the 2,260 drain items from Phase 5c-2:

UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
);

Result: 2,260 rows updated. Post-count of matching unfiled drain items: 0.

The text_dir LIKE '/opt/recon/data/processing/%' filter scoped the update to drain items only, excluding the 276 STATE 2 pre-existing transcripts (which have NULL text_dir).
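The scoping works because a NULL text_dir never satisfies LIKE in SQLite: the comparison evaluates to NULL, which the WHERE clause treats as not matching. A small in-memory sketch illustrates this; the reduced schema and the three sample rows are hypothetical (including the assumption that STATE 2 rows share the same source), while the UPDATE statement is the back-fill shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema with only the columns the back-fill touches.
conn.executescript(
    """
    CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
    CREATE TABLE documents (
        hash TEXT PRIMARY KEY, status TEXT, text_dir TEXT, organized_at TEXT
    );
    """
)
sample = [
    # (hash, text_dir): two drain items and one STATE 2 row with NULL text_dir
    ("drain1", "/opt/recon/data/processing/drain1"),
    ("drain2", "/opt/recon/data/processing/drain2"),
    ("state2", None),
]
for doc_hash, text_dir in sample:
    conn.execute(
        "INSERT INTO catalogue VALUES (?, ?)", (doc_hash, "stream.echo6.co")
    )
    conn.execute(
        "INSERT INTO documents (hash, status, text_dir) VALUES (?, 'complete', ?)",
        (doc_hash, text_dir),
    )

cur = conn.execute(
    """
    UPDATE documents
    SET organized_at = CURRENT_TIMESTAMP
    WHERE hash IN (
        SELECT d.hash
        FROM documents d
        JOIN catalogue c ON d.hash = c.hash
        WHERE c.source = 'stream.echo6.co'
          AND d.status = 'complete'
          AND d.organized_at IS NULL
          AND d.text_dir LIKE '/opt/recon/data/processing/%'
    )
    """
)
updated = cur.rowcount
still_unfiled = [
    row[0]
    for row in conn.execute(
        "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"
    )
]
print(updated, still_unfiled)
```

Only the two rows with a processing text_dir are updated; the row with NULL text_dir is left untouched.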

Verification

| Check | Expected | Actual |
|---|---|---|
| catalogue rows | 29,812 | 29,812 |
| documents rows | 29,812 | 29,812 |
| Unfiled drain items | 0 | 0 |
| Unfiled STATE 2 (untouched) | 276 | 276 |
| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
| Filed library paths (Phase 5a) | 16,596 | 16,596 |
| Qdrant points | unchanged | 2,322,853 |
| Processing dirs | ~2,262 | 2,263 |
| Hopper | empty | 0 |
| Filing worker pending | 0 | 0 |
| Service errors (last 10 min) | 0 | 0 |

(+1 on processing dirs and watch URLs from a new transcript processed by the live service during this phase — expected behavior.)
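The numeric checks can be expressed as a small comparison that tolerates the drift caused by the live service. This is only a sketch: the counts are copied from the table above, while the per-check drift allowance and the failed_checks helper are assumptions introduced for illustration.

```python
# (check, expected, actual, allowed drift). Drift of 1 is assumed only for the
# two approximate ("~") expectations, which the live service can increase.
CHECKS = [
    ("catalogue rows", 29812, 29812, 0),
    ("documents rows", 29812, 29812, 0),
    ("Unfiled drain items", 0, 0, 0),
    ("Unfiled STATE 2 (untouched)", 276, 276, 0),
    ("Watch URLs intact (catalogue.path)", 2536, 2537, 1),
    ("Filed library paths (Phase 5a)", 16596, 16596, 0),
    ("Processing dirs", 2262, 2263, 1),
    ("Filing worker pending", 0, 0, 0),
    ("Service errors (last 10 min)", 0, 0, 0),
]


def failed_checks(checks):
    """Return the names of checks whose actual value is outside the allowed drift."""
    return [
        name
        for name, expected, actual, drift in checks
        if abs(actual - expected) > drift
    ]


print(failed_checks(CHECKS))
```

With a drift allowance of zero on every row, the two approximate checks would be flagged, which is why the +1 note matters when reading the table.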

Service Restart

The service was restarted at 22:47:50 UTC to pick up the code change. All threads came up cleanly: extract, enrich, embed, dispatcher, filing, progress, dashboard. Zero errors were logged during a 60-second stability window.

Commit

  • Commit: df29d59 on refactor branch
  • Pushed to: forge.echo6.co/matt/recon (origin/refactor)

Known Inconsistencies (Backlog Items)

Phase 5a's 16,596 filed transcripts

These were filed into /mnt/library/Domain/Subdomain/ by Phase 5a's filing worker before this fix existed. Their catalogue.path was overwritten with a library path — the PeerTube watch URL is permanently lost from the DB. Their Qdrant download_url points to files.echo6.co (transcript text file), not the PeerTube video.

Recovery would require matching video titles against PeerTube's API. This is a separate backlog item.
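One possible shape for the matching step is sketched below, assuming the video titles and watch URLs have already been fetched from PeerTube into a list of (title, watch_url) pairs. The function name, the similarity cutoff, and the rule of skipping ambiguous titles are all assumptions for illustration, not a decided design.

```python
import difflib


def match_watch_urls(lost_titles, videos, cutoff=0.9):
    """Map each lost title to a watch URL when exactly one video title matches.

    videos is a list of (title, watch_url) pairs. Titles are compared
    case-insensitively; titles shared by several videos are skipped rather
    than guessed, so they can be resolved by hand.
    """
    by_title = {}
    for title, url in videos:
        by_title.setdefault(title.casefold().strip(), []).append(url)

    matches = {}
    for title in lost_titles:
        key = title.casefold().strip()
        close = difflib.get_close_matches(key, list(by_title), n=1, cutoff=cutoff)
        if close and len(by_title[close[0]]) == 1:
            matches[title] = by_title[close[0]][0]
    return matches
```

Anything left unmatched, including titles that map to more than one video, would still need manual review.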

276 STATE 2 pre-existing transcripts

These live at /opt/recon/data/text/{hash}/ (old scraper format). They have status='complete', organized_at IS NULL, and NULL text_dir. They were explicitly excluded from Phase 6a's scope and will be handled separately.

Files Modified

| File | Change |
|---|---|
| lib/processors/transcript_processor.py | Set organized_at at end of pre_flight() |