Phase 5b: Transcript Unprocess
Executed: 2026-04-14, 17:35–18:00 UTC
Backup
| Item | Location | MD5 Hash / Baseline State |
|---|---|---|
| recon.db (pre-Phase 5b) | CT 130: /tmp/recon.db.phase5b.20260414.bak | ba114d3d733f15c9c292f7daca1052b9 |
| recon.db (pre-Phase 5b) | cortex: /tmp/recon.db.phase5b.20260414.bak | ba114d3d733f15c9c292f7daca1052b9 |
| Qdrant baseline | cortex:6333 recon_knowledge_hybrid | status=green, 2,320,710 points |
| Unprocess list | CT 130: /tmp/transcript_unprocess_list.20260414.json | n/a |
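The two recon.db backups are only useful if both copies hash to the value recorded in the table. A minimal sketch of that check follows, assuming each copy is hashed on the host where it lives; the helper names `md5_of` and `backup_matches` are illustrative and not part of the recon repo.

```python
import hashlib
from pathlib import Path

# Digest recorded in the backup table for both recon.db copies.
EXPECTED_MD5 = "ba114d3d733f15c9c292f7daca1052b9"

def md5_of(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def backup_matches(path, expected=EXPECTED_MD5):
    """True when the file at `path` hashes to the expected digest."""
    return md5_of(path) == expected
```

Running `backup_matches("/tmp/recon.db.phase5b.20260414.bak")` on CT 130 and on cortex before any destructive step would confirm that the two copies are identical.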
What This Phase Does
This phase takes the 2,259 transcripts flagged skip_unclassified_phase5a (no viable domain classification) and resets them to an unprocessed state for re-processing by the refactored pipeline, in two steps:
- Stage: Concatenate page files into acquired/stream/{hash}.txt + copy meta.json as {hash}.meta.json
- Delete: Remove DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories
End state: 2,259 content+sidecar pairs sitting in the hopper, ready for the dispatcher to pick up in Phase 5c.
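The Stage step can be sketched as follows. This is an illustration under stated assumptions, not the script that was run: the page-file pattern `page_*.txt`, the lexicographic page ordering, and the function name `stage_transcript` are all hypothetical.

```python
import shutil
from pathlib import Path

def stage_transcript(source_dir, hopper_dir, doc_hash):
    """Concatenate page files into {hash}.txt and copy meta.json as {hash}.meta.json."""
    source_dir, hopper_dir = Path(source_dir), Path(hopper_dir)
    hopper_dir.mkdir(parents=True, exist_ok=True)

    # ASSUMPTION: page files match "page_*.txt" and sort lexicographically
    # into reading order (e.g. zero-padded numbers). Adjust to the real layout.
    pages = sorted(source_dir.glob("page_*.txt"))
    if not pages:
        raise FileNotFoundError(f"no page files in {source_dir}")

    content_path = hopper_dir / f"{doc_hash}.txt"
    with content_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))

    # The sidecar is copied, not moved, so the source stays intact until Delete.
    sidecar_path = hopper_dir / f"{doc_hash}.meta.json"
    shutil.copyfile(source_dir / "meta.json", sidecar_path)
    return content_path, sidecar_path
```

Copying rather than moving keeps staging and deletion as separate steps, so a failed staging run leaves the sources untouched.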
Discovery: Skip Entries Have Qdrant Vectors
Pre-flight revealed that all 2,259 skip_unclassified entries had vectors_inserted > 0 in the DB (total: 11,450 points, avg 5.1/doc). These were embedded by the original pipeline before classification was attempted.
Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.
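The count and delete calls can be sketched against Qdrant's REST endpoints for counting and deleting points by filter. The sketch only builds the requests and sends nothing. The payload field name `doc_hash` and the `http://` scheme are assumptions; the report does not say how points are linked to documents.

```python
import json
import urllib.request

QDRANT_URL = "http://cortex:6333"   # host:port from the backup table; scheme assumed
COLLECTION = "recon_knowledge_hybrid"
HASH_KEY = "doc_hash"  # ASSUMPTION: payload field linking a point to its document

def hash_filter(doc_hashes):
    """Qdrant filter matching points whose HASH_KEY is any of doc_hashes."""
    return {"must": [{"key": HASH_KEY, "match": {"any": list(doc_hashes)}}]}

def _post(path, body):
    """Build (but do not send) a JSON POST request to Qdrant."""
    return urllib.request.Request(
        QDRANT_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def count_request(doc_hashes):
    """Request an exact count of the points belonging to doc_hashes."""
    body = {"filter": hash_filter(doc_hashes), "exact": True}
    return _post(f"/collections/{COLLECTION}/points/count", body)

def delete_request(doc_hashes):
    """Request deletion of those points; wait=true blocks until applied."""
    body = {"filter": hash_filter(doc_hashes)}
    return _post(f"/collections/{COLLECTION}/points/delete?wait=true", body)
```

Sending `count_request(chunk)` with `urllib.request.urlopen` before and after `delete_request(chunk)` would give a per-chunk confirmation. The count is expected under `result.count` in the JSON response.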
Plan Summary
| Metric | Count |
|---|---|
| Transcripts to unprocess | 2,259 |
| Unique channels | 113 |
| Total page files | 3,444 |
| Total Qdrant vectors to delete | 11,450 |
Execution
Section 2: Stage Hopper
Copied all 2,259 transcripts to acquired/stream/ in 23 chunks of up to 100 (22 full chunks plus a final chunk of 59).
- Time: 0.4 seconds (5,462 entries/sec)
- Errors: 0
- Result: 2,259 .txt + 2,259 .meta.json files in hopper
Section 3: Delete Sources, DB Rows, Qdrant Vectors
Processed in 23 chunks of up to 100, with DB count and Qdrant health verification at each chunk boundary.
- Time: 2.2 seconds (1,017 entries/sec)
- Errors: 0
- DB rows deleted: 2,259 catalogue + 2,259 documents
- Qdrant points deleted: 11,450 (2,320,710 → 2,309,260)
- Source directories deleted: 2,259
- Qdrant stayed green/ok throughout
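The chunk-and-verify loop described above can be sketched as follows. The two callables are hypothetical stand-ins for the real script's deletion and health-check steps, which are not shown in this report.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def run_in_chunks(items, process_chunk, verify_boundary, size=100):
    """Process items chunk by chunk, verifying state after each chunk.

    process_chunk(chunk) performs the deletions for one chunk.
    verify_boundary(done) should raise if DB counts or Qdrant health look wrong.
    Both callables are hypothetical placeholders.
    """
    done = 0
    for chunk in chunked(items, size):
        process_chunk(chunk)
        done += len(chunk)
        verify_boundary(done)  # an exception here stops before the next chunk
    return done
```

With 2,259 items and a chunk size of 100 this yields 22 full chunks and a final chunk of 59. The expected Qdrant total after the run is 2,320,710 − 11,450 = 2,309,260 points.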
NFS Root Squash Issue
The initial deletion script ran as root (via sudo). NFS root_squash maps root on the client to an unprivileged user on the server, so root could not delete zvx-owned files under _sources/streamecho6/. The shutil.rmtree(ignore_errors=True) call silently swallowed the permission errors.
Fix: Ran a separate deletion script as the zvx user. All 2,259 source dirs deleted successfully. Also removed all 131 now-empty channel directories.
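One way to keep such failures visible is to call shutil.rmtree without ignore_errors and record every path that could not be removed. The following is a minimal sketch, not the script that was run; the function name `remove_dirs_strict` is illustrative.

```python
import shutil
from pathlib import Path

def remove_dirs_strict(paths):
    """Delete each directory tree; return a list of (path, error) failures."""
    failures = []
    for path in map(Path, paths):
        try:
            shutil.rmtree(path)        # no ignore_errors: let failures raise
        except FileNotFoundError:
            continue                   # already gone counts as success
        except OSError as exc:         # includes PermissionError under root_squash
            failures.append((path, exc))
            continue
        if path.exists():              # confirm on disk rather than trusting the call
            failures.append((path, None))
    return failures
```

A non-empty return value would have surfaced the root_squash problem during the run instead of at verification time.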
Post-Execution Verification
| Check | Expected | Actual |
|---|---|---|
| Catalogue count | 27,553 | 27,553 |
| Documents count | 27,553 | 27,553 |
| skip_unclassified remaining | 0 | 0 |
| Orphaned documents | 0 | 0 |
| Orphaned catalogue | 0 | 0 |
| Hopper .txt files | 2,259 | 2,259 |
| Hopper .meta.json files | 2,259 | 2,259 |
| Hopper spot check (10 random) | All valid | 10/10 OK |
| _sources/streamecho6 transcript dirs | 0 | 0 |
| _sources/streamecho6 channel dirs | 0 | 0 |
| Qdrant status | green/ok | green/ok |
| Qdrant points | 2,309,260 | 2,309,260 |
| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
| Services | inactive | inactive |
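The DB rows of the table above can be checked with a handful of count queries. The sketch below assumes recon.db is a SQLite file and that catalogue and documents share a `hash` column, with the flag stored in a catalogue `status` column. None of that schema is confirmed by this report, so treat every column name as hypothetical.

```python
import sqlite3

# ASSUMPTION: column names `hash` and `status` are illustrative only.
CHECKS = {
    "catalogue_count": "SELECT COUNT(*) FROM catalogue",
    "documents_count": "SELECT COUNT(*) FROM documents",
    "skip_remaining": (
        "SELECT COUNT(*) FROM catalogue "
        "WHERE status = 'skip_unclassified_phase5a'"
    ),
    # Orphans: rows in one table with no partner row in the other.
    "orphaned_documents": (
        "SELECT COUNT(*) FROM documents d "
        "LEFT JOIN catalogue c ON c.hash = d.hash WHERE c.hash IS NULL"
    ),
    "orphaned_catalogue": (
        "SELECT COUNT(*) FROM catalogue c "
        "LEFT JOIN documents d ON d.hash = c.hash WHERE d.hash IS NULL"
    ),
}

def run_checks(conn):
    """Run every check query and return {check_name: count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
```

Calling `run_checks(sqlite3.connect("recon.db"))` and comparing the result against the Expected column would reproduce the first five rows of the table.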
Anomalies
- NFS root_squash: sudo python3 couldn't delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. shutil.rmtree(ignore_errors=True) masked the failure, which was caught during verification.
- Vectors existed (correction): The initial assumption was 0 Qdrant vectors for these entries. All 2,259 actually had vectors (11,450 total). Deletion was added to the execution scope.
No Code Changes
Phase 5b is a pure data migration. No files in the recon repo were modified or committed.