# Phase 6a: Transcript Organized-In-Place

## Objective

Fix the transcript filing gap: transcripts completing the pipeline were left
with `organized_at IS NULL`, which made them candidates for the filing worker's
query. Because their `documents.path` is a PeerTube watch URL (not a filesystem
path), the filing worker's `path LIKE '/opt/recon/data/processing/%'` filter
excluded them, so they were never filed and remained permanently "unfiled" in
the DB.

Resolution: transcripts are not primary source files. They don't belong in
`library/Domain/Subdomain/` like PDFs. Instead, they're marked organized
in-place at the end of `pre_flight()`, preserving the PeerTube watch URL
in `catalogue.path`.
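
The gap described above can be reproduced with a minimal sketch against an in-memory SQLite database. The schema is reduced to columns named in this report, the hashes and the watch URL are made up, and the shape of the filing worker's query is an assumption pieced together from the two conditions mentioned here (`organized_at IS NULL` and the `path LIKE` filter); the real query may differ.

```python
import sqlite3

# Hypothetical, reduced schema: only columns this report mentions.
SCHEMA = """
CREATE TABLE documents (
    hash TEXT PRIMARY KEY,
    path TEXT,
    status TEXT,
    organized_at TEXT
);
"""

# Assumed shape of the filing worker's pending query (not copied from the codebase).
FILING_QUERY = """
SELECT hash FROM documents
WHERE status = 'complete'
  AND organized_at IS NULL
  AND path LIKE '/opt/recon/data/processing/%'
ORDER BY hash
"""

# Everything that still looks "unfiled" from the database's point of view.
UNFILED_QUERY = """
SELECT hash FROM documents
WHERE status = 'complete'
  AND organized_at IS NULL
ORDER BY hash
"""


def demo_gap():
    """Return (hashes the worker would pick up, hashes that look unfiled)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO documents (hash, path, status) VALUES (?, ?, 'complete')",
        [
            ("pdf1", "/opt/recon/data/processing/pdf1/source.pdf"),
            # Placeholder watch URL; host and video ID are invented.
            ("vid1", "https://peertube.example/w/abc123"),
        ],
    )
    pending = [row[0] for row in conn.execute(FILING_QUERY)]
    unfiled = [row[0] for row in conn.execute(UNFILED_QUERY)]
    conn.close()
    return pending, unfiled
```

Under these assumptions `demo_gap()` returns `(['pdf1'], ['pdf1', 'vid1'])`: the transcript row never reaches the worker, yet it still counts as unfiled.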

## Timestamp

**Started:** 2026-04-14 ~22:40 UTC
**Completed:** 2026-04-14 ~22:50 UTC

## Backup

- **File:** `/tmp/recon.db.phase6a.20260414.bak`
- **MD5:** `5eb8fcd3edd73eae864bc38e3fac560f`
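
Before relying on the backup for a rollback, its checksum can be re-verified. The following is a minimal sketch using only the standard library; the helper names `file_md5` and `backup_matches` are illustrative and not part of the recon codebase.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True when the file at `path` hashes to `expected_md5` (case-insensitive)."""
    return file_md5(path) == expected_md5.strip().lower()
```

For this phase the call would be `backup_matches("/tmp/recon.db.phase6a.20260414.bak", "5eb8fcd3edd73eae864bc38e3fac560f")`. MD5 is adequate here for detecting accidental corruption of the copy, not for guarding against tampering.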

## Architectural Rationale

Transcripts are **derived text** from PeerTube videos. The video is the
primary source; the transcript is a text representation extracted by Whisper
via peertube-runner on cortex.

Filing transcripts to `/mnt/library/Domain/Subdomain/` would:
1. Create redundant text files (the content is always recoverable from PeerTube)
2. Overwrite `catalogue.path` with a library path, destroying the watch URL
3. Overwrite `download_url` in Qdrant, breaking search→video linkage
4. Serve no purpose: users clicking search results should land on PeerTube,
   not a raw text file

Instead, transcripts are marked `organized_at = CURRENT_TIMESTAMP` at the
end of successful processing. The filing worker never sees them.

## Code Change

**File:** `lib/processors/transcript_processor.py`
**Commit:** `df29d59` on `refactor` branch

Merged `organized_at = CURRENT_TIMESTAMP` into the existing UPDATE that sets
`text_dir` and `page_count` at the end of `pre_flight()`:

```python
# Before (Phase 3-5):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ? WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)

# After (Phase 6a):
conn.execute(
    "UPDATE documents SET text_dir = ?, page_count = ?, organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    (proc_dir, len(pages), file_hash)
)
```

Plus a 6-line comment block explaining the rationale.

Total diff: +8 lines, -2 lines.
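
To illustrate what the merged statement leaves in the row, here is a minimal sketch against an in-memory SQLite database. It assumes the project database is SQLite (suggested by the `recon.db` file name and the `?` placeholders) and uses a reduced, hypothetical schema; `mark_processed` and `demo_update` are illustrative names, not functions from `transcript_processor.py`. In SQLite, `CURRENT_TIMESTAMP` evaluates to the current UTC time as `YYYY-MM-DD HH:MM:SS` text.

```python
import sqlite3


def mark_processed(conn, file_hash, proc_dir, page_count):
    """Apply a Phase 6a-style UPDATE and return the stored organized_at value."""
    with conn:  # commits on success, rolls back if the UPDATE raises
        conn.execute(
            "UPDATE documents SET text_dir = ?, page_count = ?, "
            "organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
            (proc_dir, page_count, file_hash),
        )
    row = conn.execute(
        "SELECT organized_at FROM documents WHERE hash = ?", (file_hash,)
    ).fetchone()
    return row[0] if row else None


def demo_update():
    """Create a one-row table, run the UPDATE, and return the timestamp text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (hash TEXT PRIMARY KEY, text_dir TEXT, "
        "page_count INTEGER, organized_at TEXT)"
    )
    conn.execute("INSERT INTO documents (hash) VALUES ('vid1')")
    stamp = mark_processed(conn, "vid1", "/opt/recon/data/processing/vid1", 12)
    conn.close()
    return stamp
```

Because all three columns are written by one statement, there is no intermediate state in which a transcript has its `text_dir` set but is still unorganized.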

## Back-fill

One-time SQL update for the 2,260 drain items from Phase 5c-2:

```sql
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
);
```

**Result:** 2,260 rows updated. Post-count of matching unfiled drain items: 0.

The `text_dir LIKE '/opt/recon/data/processing/%'` filter scoped the update
to drain items only, excluding the 276 STATE 2 pre-existing transcripts
(which have NULL `text_dir`).
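
The reason the `LIKE` filter skips rows with a NULL `text_dir` is that `NULL LIKE 'pattern'` evaluates to NULL rather than true, so those rows never satisfy the `WHERE` clause. A minimal sketch with an in-memory SQLite database shows this; the schema is reduced and hypothetical, the hashes are invented, and `build_demo_db` and `run_backfill` are illustrative helpers rather than project code.

```python
import sqlite3

BACKFILL_SQL = """
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE hash IN (
    SELECT d.hash
    FROM documents d
    JOIN catalogue c ON d.hash = c.hash
    WHERE c.source = 'stream.echo6.co'
      AND d.status = 'complete'
      AND d.organized_at IS NULL
      AND d.text_dir LIKE '/opt/recon/data/processing/%'
)
"""


def build_demo_db():
    """Two drain-style rows plus one STATE 2-style row with NULL text_dir."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE catalogue (hash TEXT PRIMARY KEY, source TEXT);
        CREATE TABLE documents (
            hash TEXT PRIMARY KEY, status TEXT, text_dir TEXT, organized_at TEXT
        );
        """
    )
    rows = [
        ("drain1", "/opt/recon/data/processing/drain1"),
        ("drain2", "/opt/recon/data/processing/drain2"),
        ("state2", None),  # NULL text_dir, as described for STATE 2 rows
    ]
    for file_hash, text_dir in rows:
        conn.execute(
            "INSERT INTO catalogue VALUES (?, 'stream.echo6.co')", (file_hash,)
        )
        conn.execute(
            "INSERT INTO documents (hash, status, text_dir) VALUES (?, 'complete', ?)",
            (file_hash, text_dir),
        )
    conn.commit()
    return conn


def run_backfill(conn):
    """Run the back-fill in one transaction; return (rows_updated, still_unfiled)."""
    with conn:  # commit on success, roll back on error
        updated = conn.execute(BACKFILL_SQL).rowcount
    still_unfiled = [
        row[0]
        for row in conn.execute(
            "SELECT hash FROM documents WHERE organized_at IS NULL ORDER BY hash"
        )
    ]
    return updated, still_unfiled
```

With this toy data, `run_backfill(build_demo_db())` returns `(2, ['state2'])`, mirroring the real run in miniature: drain items are stamped and the NULL-`text_dir` row is left untouched.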

## Verification

| Check | Expected | Actual |
|-------|----------|--------|
| catalogue rows | 29,812 | 29,812 |
| documents rows | 29,812 | 29,812 |
| Unfiled drain items | 0 | 0 |
| Unfiled STATE 2 (untouched) | 276 | 276 |
| Watch URLs intact (catalogue.path) | ~2,536 | 2,537 |
| Filed library paths (Phase 5a) | 16,596 | 16,596 |
| Qdrant points | unchanged | 2,322,853 |
| Processing dirs | ~2,262 | 2,263 |
| Hopper | empty | 0 |
| Filing worker pending | 0 | 0 |
| Service errors (last 10 min) | 0 | 0 |

(+1 on processing dirs and watch URLs from a new transcript processed by the
live service during this phase; this is expected behavior.)

## Service Restart

Service restarted at 22:47:50 UTC to pick up the code change. All threads
came up cleanly: extract, enrich, embed, dispatcher, filing, progress,
dashboard. Zero errors in 60-second stability window.

## Commit

- **Commit:** `df29d59` on `refactor` branch
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)

## Known Inconsistencies (Backlog Items)

### Phase 5a's 16,596 filed transcripts

These were filed into `/mnt/library/Domain/Subdomain/` by Phase 5a's filing
worker before this fix existed. Their `catalogue.path` was overwritten with
a library path, so the PeerTube watch URL is permanently lost from the DB.
Their Qdrant `download_url` points to `files.echo6.co` (transcript text file),
not the PeerTube video.

Recovery would require matching video titles against PeerTube's API. This is
a separate backlog item.
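
One possible shape for the title-matching step, offered only as a sketch: it assumes `(title, watch_url)` pairs have already been exported from PeerTube by some other means, and it does not call the PeerTube API itself. The function name `match_titles`, the normalization, and the 0.9 similarity cutoff are all illustrative choices, not decisions recorded in this report.

```python
import difflib


def match_titles(lost_titles, videos, cutoff=0.9):
    """Map each lost title to a watch URL, or None when no close match exists.

    `videos` is a list of (title, watch_url) pairs assumed to have been
    fetched from PeerTube beforehand; fetching is out of scope here.
    """
    by_title = {}
    ambiguous = set()
    for title, url in videos:
        key = title.strip().lower()
        if key in by_title:
            # Two videos share a title, so the title cannot identify either.
            ambiguous.add(key)
        by_title[key] = url
    for key in ambiguous:
        del by_title[key]

    candidates = list(by_title)
    result = {}
    for title in lost_titles:
        hits = difflib.get_close_matches(
            title.strip().lower(), candidates, n=1, cutoff=cutoff
        )
        result[title] = by_title[hits[0]] if hits else None
    return result
```

Titles that map to `None` (no close match, or a duplicated title) would need manual review. A high cutoff is used here on the assumption that a wrong watch URL is worse than a missing one.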

### 276 STATE 2 pre-existing transcripts

These live at `/opt/recon/data/text/{hash}/` (old scraper format). They have
`status='complete'`, `organized_at IS NULL`, and NULL `text_dir`. They were
explicitly excluded from Phase 6a's scope and will be handled separately.

## Files Modified

| File | Change |
|------|--------|
| `lib/processors/transcript_processor.py` | Set `organized_at` at end of `pre_flight()` |