
# Phase 6a: Transcript Organized-In-Place

## Objective

Fix the transcript filing gap: transcripts completing the pipeline were left
with `organized_at IS NULL`, the condition the filing worker's query uses to
find unfiled documents. Because their `documents.path` is a PeerTube watch URL
(not a filesystem path), the filing worker's
`path LIKE '/opt/recon/data/processing/%'` filter excluded them, so they were
never filed and remained permanently "unfiled" in the DB.

Resolution: transcripts are not primary source files, so they don't belong in
`library/Domain/Subdomain/` the way PDFs do. Instead, they're marked organized
in-place at the end of `pre_flight()`, preserving the PeerTube watch URL
in `catalogue.path`.
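
The gap can be reproduced in a throwaway in-memory SQLite database. This is a minimal sketch: the reduced table layout, the sample hashes, the sample file path, the watch URL, and both queries are assumptions reconstructed from the description above, not the filing worker's actual code.

```python
import sqlite3

# Assumed, simplified schema; the real `documents` table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " hash TEXT PRIMARY KEY, path TEXT, status TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents (hash, path, status) VALUES (?, ?, ?)",
    [
        # A PDF waiting in the processing area (hypothetical path).
        ("pdf01", "/opt/recon/data/processing/pdf01/source.pdf", "complete"),
        # A transcript whose path is a watch URL (hypothetical URL).
        ("vid01", "https://stream.echo6.co/w/example", "complete"),
    ],
)

# Everything that counts as "unfiled" in the DB.
unfiled = [r[0] for r in conn.execute(
    "SELECT hash FROM documents"
    " WHERE status = 'complete' AND organized_at IS NULL ORDER BY hash")]

# What the filing worker would actually act on, given its path filter.
candidates = [r[0] for r in conn.execute(
    "SELECT hash FROM documents"
    " WHERE status = 'complete' AND organized_at IS NULL"
    " AND path LIKE '/opt/recon/data/processing/%' ORDER BY hash")]

# Rows that are unfiled but can never be filed: the gap.
stuck = sorted(set(unfiled) - set(candidates))
print(stuck)  # ['vid01']
```

Only the transcript row ends up in `stuck`: it satisfies the "unfiled" condition but never matches the path filter.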

## Timestamp

- **Started:** 2026-04-14 ~22:40 UTC
- **Completed:** 2026-04-14 ~22:50 UTC

## Backup

- **File:** `/tmp/recon.db.phase6a.20260414.bak`
- **MD5:** `5eb8fcd3edd73eae864bc38e3fac560f`
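
Before relying on the backup for a rollback, its checksum can be re-computed and compared with the recorded value. The following is a small sketch using only the standard library; the helper names `md5_of_file` and `backup_matches` are hypothetical and are not part of the recon codebase.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_matches(path, expected_md5):
    """True if the file at `path` hashes to `expected_md5` (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the backup recorded here, the call would be `backup_matches("/tmp/recon.db.phase6a.20260414.bak", "5eb8fcd3edd73eae864bc38e3fac560f")`.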

## Architectural Rationale

Transcripts are **derived text** from PeerTube videos. The video is the
primary source; the transcript is a text representation extracted by Whisper
via peertube-runner on cortex.

Filing transcripts to `/mnt/library/Domain/Subdomain/` would:
1. Create redundant text files (the content is always recoverable from PeerTube)
2. Overwrite `catalogue.path` with a library path, destroying the watch URL
3. Overwrite `download_url` in Qdrant, breaking search→video linkage
4. Serve no purpose — users clicking search results should land on PeerTube,
   not a raw text file

Instead, transcripts are marked `organized_at = CURRENT_TIMESTAMP` at the
end of successful processing, so they no longer match the filing worker's
`organized_at IS NULL` condition and the worker never sees them.

## Code Change

- **File:** `lib/processors/transcript_processor.py`
- **Commit:** `df29d59` on `refactor` branch

Merged `organized_at = CURRENT_TIMESTAMP` into the existing UPDATE that sets
`text_dir` and `page_count` at the end of `pre_flight()`:

```python
# Before (Phase 3-5):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

# After (Phase 6a):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)
```

Plus a 6-line comment block explaining the rationale.

Total diff: +8 lines, -2 lines.
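
To show the effect of the merged statement in isolation, here is a self-contained sketch against an in-memory SQLite database. The helper name `mark_transcript_processed`, the reduced schema, and the sample values are assumptions for illustration; in the real code the UPDATE sits inline at the end of `pre_flight()`.

```python
import sqlite3

def mark_transcript_processed(conn, file_hash, proc_dir, page_count):
    """Record the output location and mark the row organized in one statement."""
    with conn:  # commits on success, rolls back if the UPDATE raises
        cur = conn.execute(
            "UPDATE documents"
            " SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP"
            " WHERE hash = ?",
            (proc_dir, page_count, file_hash),
        )
    return cur.rowcount

# Assumed, simplified schema with one transcript row (hypothetical hash).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " hash TEXT PRIMARY KEY, text_dir TEXT, page_count INTEGER, organized_at TEXT)"
)
conn.execute("INSERT INTO documents (hash) VALUES ('vid01')")

updated = mark_transcript_processed(
    conn, "vid01", "/opt/recon/data/processing/vid01", 3
)
row = conn.execute(
    "SELECT text_dir, page_count, organized_at IS NOT NULL"
    " FROM documents WHERE hash = 'vid01'"
).fetchone()
print(updated, row)
```

Because all three columns are set by a single UPDATE, a row cannot end up with `text_dir` populated while `organized_at` is still NULL.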

## Back-fill

One-time SQL update for the 2,260 drain items from Phase 5c-2:

```sql
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
);
```

**Result:** 2,260 rows updated. Post-count of matching unfiled drain items: 0.

The `text_dir LIKE '/opt/recon/data/processing/%'` filter scoped the update
to drain items only. It excludes the 276 STATE 2 pre-existing transcripts
because their `text_dir` is NULL, and a `LIKE` comparison against NULL never
evaluates to true.
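
A one-time update like this can be guarded so that it only commits when it touches the expected number of rows. The sketch below reuses the back-fill's selection logic against a toy database; the function name `backfill_organized_at`, the reduced schema, the sample rows, and the `other-source` value are assumptions for illustration, and the back-fill recorded here was run as plain SQL.

```python
import sqlite3

# Same selection logic as the back-fill statement above.
SELECTOR = (
    "SELECT d.hash FROM documents d"
    " JOIN catalogue c ON d.hash = c.hash"
    " WHERE c.source = 'stream.echo6.co'"
    "   AND d.status = 'complete'"
    "   AND d.organized_at IS NULL"
    "   AND d.text_dir LIKE '/opt/recon/data/processing/%'"
)

def backfill_organized_at(conn, expected_rows):
    """Apply the back-fill, rolling back unless exactly expected_rows change."""
    with conn:  # an exception inside this block triggers a rollback
        cur = conn.execute(
            "UPDATE documents SET organized_at = CURRENT_TIMESTAMP"
            " WHERE hash IN (" + SELECTOR + ")"
        )
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, got {cur.rowcount}"
            )
    return cur.rowcount

# Toy data: one drain item, one STATE 2 transcript, one unrelated document.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    hash TEXT PRIMARY KEY, status TEXT, organized_at TEXT, text_dir TEXT);
CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
INSERT INTO documents (hash, status, text_dir) VALUES
    ('drain01', 'complete', '/opt/recon/data/processing/drain01'),
    ('state2',  'complete', NULL),
    ('pdf01',   'complete', '/opt/recon/data/processing/pdf01');
INSERT INTO catalogue (hash, source) VALUES
    ('drain01', 'stream.echo6.co'),
    ('state2',  'stream.echo6.co'),
    ('pdf01',   'other-source');
""")
conn.commit()

updated = backfill_organized_at(conn, expected_rows=1)
still_null = [r[0] for r in conn.execute(
    "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash")]
print(updated, still_null)  # 1 ['pdf01', 'state2']
```

The STATE 2 row is skipped because of its NULL `text_dir`, and the unrelated document is skipped because of its source.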

## Verification

| Check | Expected | Actual |
|-------|----------|--------|
| catalogue rows | 29,812 | 29,812 |
| documents rows | 29,812 | 29,812 |
| Unfiled drain items | 0 | 0 |
| Unfiled STATE 2 (untouched) | 276 | 276 |
| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
| Filed library paths (Phase 5a) | 16,596 | 16,596 |
| Qdrant points | unchanged | 2,322,853 |
| Processing dirs | ~2,262 | 2,263 |
| Hopper | empty | 0 |
| Filing worker pending | 0 | 0 |
| Service errors (last 10 min) | 0 | 0 |

The +1 on processing dirs and watch URLs comes from a new transcript processed
by the live service during this phase, which is expected behavior.

## Service Restart

Service restarted at 22:47:50 UTC to pick up the code change. All threads
came up cleanly: extract, enrich, embed, dispatcher, filing, progress,
dashboard. Zero errors in the 60-second stability window that followed.

## Commit

- **Commit:** `df29d59` on `refactor` branch
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)

## Known Inconsistencies (Backlog Items)

### Phase 5a's 16,596 filed transcripts

These were filed into `/mnt/library/Domain/Subdomain/` by Phase 5a's filing
worker before this fix existed. Their `catalogue.path` was overwritten with
a library path, so the PeerTube watch URL is no longer stored anywhere in the
DB. Their Qdrant `download_url` points to `files.echo6.co` (the transcript
text file), not the PeerTube video.

Recovery would require matching video titles against PeerTube's API. This is
a separate backlog item.
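
One conservative way to approach that matching is to accept only unambiguous title matches and leave everything else for manual review. The sketch below is hypothetical throughout: it assumes catalogue titles are available per hash and that `(title, watch_url)` pairs have already been fetched from PeerTube by some other means, and the function names are illustrative only.

```python
import re
import unicodedata
from collections import defaultdict

def normalize_title(title):
    """Casefold, strip punctuation, and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def match_watch_urls(catalogue_titles, videos):
    """Map document hashes to watch URLs where the title match is unique.

    catalogue_titles: dict of {hash: title} (assumed input shape).
    videos: iterable of (title, watch_url) pairs (assumed input shape).
    Returns (matched, unresolved); unresolved hashes had zero or several
    candidate URLs and need manual review.
    """
    by_title = defaultdict(set)
    for title, url in videos:
        by_title[normalize_title(title)].add(url)

    matched, unresolved = {}, []
    for doc_hash, title in catalogue_titles.items():
        urls = by_title.get(normalize_title(title), set())
        if len(urls) == 1:
            matched[doc_hash] = next(iter(urls))
        else:
            unresolved.append(doc_hash)
    return matched, unresolved
```

Refusing to guess on duplicate or missing titles keeps a wrong watch URL from being written back to `catalogue.path`.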

### 276 STATE 2 pre-existing transcripts

These live at `/opt/recon/data/text/{hash}/` (old scraper format). They have
`status='complete'`, `organized_at IS NULL`, and NULL `text_dir`. They were
explicitly excluded from Phase 6a's scope and will be handled separately.

## Files Modified

| File | Change |
|------|--------|
| `lib/processors/transcript_processor.py` | Set `organized_at` at end of `pre_flight()` |
