mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 14:44:39 +02:00

Matt 263a81c1e2 Phase 6a: transcript organized-in-place documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 22:50:18 +00:00

4.8 KiB


Phase 6a: Transcript Organized-In-Place

Objective

Fix the transcript filing gap: transcripts that completed the pipeline were left with organized_at IS NULL, so they matched the "unfiled" condition in the filing worker's query. Because their documents.path is a PeerTube watch URL (not a filesystem path), the worker's path LIKE '/opt/recon/data/processing/%' filter excluded them, so they were never filed and remained permanently "unfiled" in the DB.

Resolution: transcripts are not primary source files. They don't belong in library/Domain/Subdomain/ like PDFs. Instead, they're marked organized in-place at the end of pre_flight(), preserving the PeerTube watch URL in catalogue.path.
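
The gap can be reproduced in a throwaway in-memory database. This is a minimal sketch, assuming SQLite (suggested by recon.db and the ? placeholders) and a reduced, hypothetical documents schema. The worker's real query is not shown in this document, so FILING_QUERY only approximates it from the two conditions described above, and the watch URL is a made-up example.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns named in this document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "hash TEXT PRIMARY KEY, path TEXT, status TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, 'complete', NULL)",
    [
        ("pdf1", "/opt/recon/data/processing/pdf1/source.pdf"),  # ordinary file
        ("vid1", "https://stream.echo6.co/w/example"),  # made-up watch URL
    ],
)

# Approximation of the filing worker's selection (an assumption).
FILING_QUERY = """
    SELECT hash FROM documents
    WHERE organized_at IS NULL
      AND path LIKE '/opt/recon/data/processing/%'
"""
# Everything the DB still considers unfiled.
UNFILED_QUERY = (
    "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"
)

picked_up = [row[0] for row in conn.execute(FILING_QUERY)]
unfiled = [row[0] for row in conn.execute(UNFILED_QUERY)]
```

Under these assumptions the transcript row never appears in picked_up but always appears in unfiled, which is the state Phase 6a removes.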

Timestamp

Started: 2026-04-14 ~22:40 UTC
Completed: 2026-04-14 ~22:50 UTC

Backup

File: /tmp/recon.db.phase6a.20260414.bak
MD5: 5eb8fcd3edd73eae864bc38e3fac560f
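
Before relying on the backup, its checksum can be re-verified. The helper below is a sketch using only the standard library. The function names md5_of_file and backup_matches are illustrative and not part of the recon codebase.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True if the file's MD5 equals the recorded value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For this phase the call would be backup_matches("/tmp/recon.db.phase6a.20260414.bak", "5eb8fcd3edd73eae864bc38e3fac560f").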

Architectural Rationale

Transcripts are derived text from PeerTube videos. The video is the primary source; the transcript is a text representation extracted by Whisper via peertube-runner on cortex.

Filing transcripts to /mnt/library/Domain/Subdomain/ would:

Create redundant text files (the content is always recoverable from PeerTube)
Overwrite catalogue.path with a library path, destroying the watch URL
Overwrite download_url in Qdrant, breaking search→video linkage
Serve no purpose — users clicking search results should land on PeerTube, not a raw text file

Instead, transcripts are marked organized_at = CURRENT_TIMESTAMP at the end of successful processing. The filing worker never sees them.

Code Change

File: lib/processors/transcript_processor.py
Commit: df29d59 on refactor branch

Merged organized_at = CURRENT_TIMESTAMP into the existing UPDATE that sets text_dir and page_count at the end of pre_flight():

# Before (Phase 3-5):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

# After (Phase 6a):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

Plus a 6-line comment block explaining the rationale.

Total diff: +8 lines, -2 lines.
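
The effect of the merged statement can be checked in isolation. The sketch below assumes SQLite and a reduced, hypothetical documents table. finish_pre_flight is an illustrative wrapper, not the actual function in transcript_processor.py.

```python
import sqlite3

# Hypothetical, reduced schema for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "hash TEXT PRIMARY KEY, text_dir TEXT, page_count INTEGER, organized_at TEXT)"
)
conn.execute("INSERT INTO documents (hash) VALUES (?)", ("abc123",))


def finish_pre_flight(conn, file_hash, proc_dir, pages):
    """Record text_dir, page_count, and organized_at in a single statement."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE documents SET text_dir = ?, page_count = ?, "
            "organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
            (proc_dir, len(pages), file_hash),
        )
    return cur.rowcount


updated = finish_pre_flight(
    conn, "abc123", "/opt/recon/data/processing/abc123", ["page 1", "page 2"]
)
row = conn.execute(
    "SELECT text_dir, page_count, organized_at FROM documents WHERE hash = ?",
    ("abc123",),
).fetchone()
```

Because all three columns are set by one UPDATE, a row cannot be observed with text_dir populated while organized_at is still NULL.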

Back-fill

One-time SQL update for the 2,260 drain items from Phase 5c-2:

UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
);

Result: 2,260 rows updated. Post-count of matching unfiled drain items: 0.

The text_dir LIKE '/opt/recon/data/processing/%' filter scoped the update to drain items only, excluding the 276 STATE 2 pre-existing transcripts (which have NULL text_dir).
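
The scoping works because a NULL text_dir makes the LIKE comparison evaluate to NULL, which the WHERE clause treats as not matching. The sketch below shows this on two made-up rows, assuming SQLite and reduced, hypothetical schemas. It also assumes the STATE 2 row shares the same source value, so that text_dir is the only distinguishing condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        hash TEXT PRIMARY KEY, status TEXT, text_dir TEXT, organized_at TEXT
    );
    CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
    INSERT INTO documents VALUES
        ('drain1', 'complete', '/opt/recon/data/processing/drain1', NULL),
        ('state2', 'complete', NULL, NULL);
    INSERT INTO catalogue VALUES
        ('drain1', 'stream.echo6.co'),
        ('state2', 'stream.echo6.co');
""")

# Same selection as the back-fill subquery above.
SCOPE = """
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
"""

# Dry run first: list what would be touched before changing anything.
in_scope = [r[0] for r in conn.execute(SCOPE)]

with conn:  # commits on success, rolls back on error
    cur = conn.execute(
        "UPDATE documents SET organized_at = CURRENT_TIMESTAMP "
        f"WHERE hash IN ({SCOPE})"
    )
    updated = cur.rowcount

still_unfiled = [
    r[0] for r in conn.execute("SELECT hash FROM documents WHERE organized_at IS NULL")
]
```

Running the selection as a dry run before the UPDATE gives a row count to compare against the number of rows actually updated.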

Verification

Check	Expected	Actual
catalogue rows	29,812	29,812
documents rows	29,812	29,812
Unfiled drain items	0	0
Unfiled STATE 2 (untouched)	276	276
Watch URLs intact (catalogue.path)	~2,536	2,537
Filed library paths (Phase 5a)	16,596	16,596
Qdrant points	unchanged	2,322,853
Processing dirs	~2,262	2,263
Hopper	empty	0
Filing worker pending	0	0
Service errors (last 10 min)	0	0

(The actual counts for processing dirs and watch URLs are each one higher than expected because the live service processed a new transcript during this phase. This is expected behavior.)
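
Approximate expectations such as "~2,536" can be checked mechanically with an explicit tolerance. The helper below is an illustrative sketch. The check function and the tolerance of 1 are assumptions chosen to fit the drift seen in this phase, not part of the recon tooling. The numbers are copied from the table above.

```python
def check(name, expected, actual, tolerance=0):
    """Compare an actual count to its expected value within a tolerance."""
    ok = abs(actual - expected) <= tolerance
    return (name, expected, actual, "ok" if ok else "MISMATCH")


results = [
    check("catalogue rows", 29812, 29812),
    check("Unfiled drain items", 0, 0),
    check("Unfiled STATE 2 (untouched)", 276, 276),
    # Approximate rows: allow for one transcript processed by the live service.
    check("Watch URLs intact (catalogue.path)", 2536, 2537, tolerance=1),
    check("Processing dirs", 2262, 2263, tolerance=1),
]

failures = [r for r in results if r[3] != "ok"]
for name, expected, actual, status in results:
    print(f"{name}: expected {expected}, actual {actual} -> {status}")
```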

Service Restart

Service restarted at 22:47:50 UTC to pick up the code change. All threads came up cleanly: extract, enrich, embed, dispatcher, filing, progress, dashboard. Zero errors in the 60-second stability window.

Commit

Commit: df29d59 on refactor branch
Pushed to: forge.echo6.co/matt/recon (origin/refactor)

Known Inconsistencies (Backlog Items)

Phase 5a's 16,596 filed transcripts

These were filed into /mnt/library/Domain/Subdomain/ by Phase 5a's filing worker before this fix existed. Their catalogue.path was overwritten with a library path — the PeerTube watch URL is permanently lost from the DB. Their Qdrant download_url points to files.echo6.co (transcript text file), not the PeerTube video.

Recovery would require matching video titles against PeerTube's API. This is a separate backlog item.
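
One possible shape for the matching step is sketched below. It is an assumption-heavy illustration, not a planned implementation. It assumes the video list has already been fetched from PeerTube into plain dicts with hypothetical "name" and "url" keys, and that a filed transcript's title is available as a string. match_watch_urls and the 0.9 cutoff are illustrative choices.

```python
import difflib


def match_watch_urls(filed_titles, videos, cutoff=0.9):
    """Map each filed transcript title to a watch URL by closest video name.

    `videos` is a list of {"name": ..., "url": ...} dicts (hypothetical shape).
    Titles with no candidate at or above `cutoff` map to None so they can be
    reviewed by hand instead of being linked to the wrong video.
    """
    by_name = {}
    for video in videos:
        # If two videos share a name, the first one wins; real data would
        # need an explicit rule for such duplicates.
        by_name.setdefault(video["name"].casefold(), video["url"])
    names = list(by_name)

    matches = {}
    for title in filed_titles:
        close = difflib.get_close_matches(
            title.casefold(), names, n=1, cutoff=cutoff
        )
        matches[title] = by_name[close[0]] if close else None
    return matches
```

A high cutoff trades recall for safety: an unmatched transcript keeps its current library path, while a wrong match would link search results to the wrong video.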

276 STATE 2 pre-existing transcripts

These live at /opt/recon/data/text/{hash}/ (old scraper format). They have status='complete', organized_at IS NULL, and NULL text_dir. They were explicitly excluded from Phase 6a's scope and will be handled separately.

Files Modified

File	Change
`lib/processors/transcript_processor.py`	Set `organized_at` at end of `pre_flight()`
