# Phase 5b: Transcript Unprocess
**Executed:** 2026-04-14, 17:35 to 18:00 UTC
---
## Backup
| Item | Location | MD5 Hash / Status |
|------|----------|----------|
| recon.db (pre-Phase 5b) | CT 130: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| recon.db (pre-Phase 5b) | cortex: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| Qdrant baseline | cortex:6333 `recon_knowledge_hybrid` | status=green, 2,320,710 points |
| Unprocess list | CT 130: `/tmp/transcript_unprocess_list.20260414.json` | — |
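
The two `recon.db` backups are only useful if both copies really carry the recorded hash. A minimal sketch of that check, assuming it is run on each host against its own copy; the helper names `md5_of_file` and `backup_matches` are illustrative and not part of the recon repo:

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def backup_matches(path: Path, expected_md5: str) -> bool:
    """Compare a backup file's MD5 against the value recorded in the table."""
    return md5_of_file(path) == expected_md5.lower()
```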
---
## What This Phase Does
Takes the 2,259 transcripts flagged `skip_unclassified_phase5a` (no viable domain classification) and resets them to an unprocessed state so the refactored pipeline can re-process them:

1. **Stage:** Concatenate each transcript's page files into `acquired/stream/{hash}.txt` and copy its `meta.json` as `{hash}.meta.json`.
2. **Delete:** Remove the DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories.

End state: 2,259 content+sidecar pairs sitting in the hopper, ready for the dispatcher to pick up in Phase 5c.
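
The stage step for a single transcript could look roughly like the sketch below. It is not the script that was run: the function name `stage_transcript` and the `page_*.txt` naming pattern are assumptions made for illustration, and only the `{hash}.txt` / `{hash}.meta.json` output names come from this report.

```python
import shutil
from pathlib import Path


def stage_transcript(source_dir: Path, hopper_dir: Path, doc_hash: str) -> tuple[Path, Path]:
    """Write {hash}.txt and {hash}.meta.json into the hopper for one transcript."""
    hopper_dir.mkdir(parents=True, exist_ok=True)
    content_path = hopper_dir / f"{doc_hash}.txt"
    sidecar_path = hopper_dir / f"{doc_hash}.meta.json"

    # Assumption: page files are named page_*.txt and are zero-padded,
    # so a lexicographic sort gives reading order.
    pages = sorted(source_dir.glob("page_*.txt"))
    if not pages:
        raise FileNotFoundError(f"no page files in {source_dir}")

    # Concatenate the pages, one newline after each page.
    with open(content_path, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))
            out.write("\n")

    # The sidecar is a byte-for-byte copy of the transcript's meta.json.
    shutil.copyfile(source_dir / "meta.json", sidecar_path)
    return content_path, sidecar_path
```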
---
## Discovery: Skip Entries Have Qdrant Vectors
Pre-flight revealed that all 2,259 skip_unclassified entries had `vectors_inserted > 0` in the DB (total: 11,450 points, avg 5.1/doc). These were embedded by the original pipeline before classification was attempted.
Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.
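
A pre-flight query of this kind might be sketched as follows. It assumes `recon.db` is a SQLite database and that the `documents` table has a `status` column holding the skip flag; only the `vectors_inserted` column and the flag value appear in this report, so the rest of the schema is hypothetical. With the figures above, 11,450 / 2,259 rounds to the reported 5.1 per document.

```python
import sqlite3

SKIP_FLAG = "skip_unclassified_phase5a"


def preflight_vector_stats(conn: sqlite3.Connection) -> dict:
    """Summarise how many flagged documents still have vectors recorded."""
    # Assumption: the flag lives in documents.status.
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(vectors_inserted > 0), 0),
               COALESCE(SUM(vectors_inserted), 0)
        FROM documents
        WHERE status = ?
        """,
        (SKIP_FLAG,),
    ).fetchone()
    flagged, with_vectors, total_vectors = row
    return {
        "flagged": flagged,
        "with_vectors": with_vectors,
        "total_vectors": total_vectors,
        "avg_per_doc": round(total_vectors / flagged, 1) if flagged else 0.0,
    }
```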
---
## Plan Summary
| Metric | Count |
|--------|-------|
| Transcripts to unprocess | 2,259 |
| Unique channels | 113 |
| Total page files | 3,444 |
| Total Qdrant vectors to delete | 11,450 |
---
## Execution
### Section 2: Stage Hopper
Copied all 2,259 transcripts to `acquired/stream/` in 23 chunks of up to 100 (22 full chunks plus a final chunk of 59).
- **Time:** 0.4 seconds (5,462 entries/sec)
- **Errors:** 0
- **Result:** 2,259 `.txt` + 2,259 `.meta.json` in hopper
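
The chunk arithmetic can be checked with a small helper. This is a generic sketch rather than the code that was run; `chunked` and the synthetic `hashes` list are stand-ins for the real unprocess list.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical stand-in for the 2,259 transcript hashes in the unprocess list.
hashes = [f"hash{i:04d}" for i in range(2259)]
chunks = list(chunked(hashes))
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 23 100 59
```

Working in fixed-size chunks gives natural checkpoints: a failure in one chunk leaves at most 100 entries in an uncertain state.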
### Section 3: Delete Sources, DB Rows, Qdrant Vectors
Processed in 23 chunks of up to 100, verifying the DB row counts and Qdrant health at each chunk boundary.
- **Time:** 2.2 seconds (1,017 entries/sec)
- **Errors:** 0
- **DB rows deleted:** 2,259 catalogue + 2,259 documents
- **Qdrant points deleted:** 11,450 (2,320,710 → 2,309,260)
- **Source directories deleted:** 2,259
- **Qdrant stayed green/ok throughout**
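
For the Qdrant side, one plausible shape for the per-chunk requests is delete-by-filter against the REST API, preceded by an exact count. The sketch below only builds the request paths and JSON bodies and sends nothing. The payload key `doc_hash` is an assumption, since this report does not say how points are linked to transcripts; the collection name is taken from the Backup table.

```python
import json

COLLECTION = "recon_knowledge_hybrid"  # collection name from the Backup table
HASH_FIELD = "doc_hash"  # assumption: payload key linking a point to its transcript


def hash_filter(doc_hashes: list[str]) -> dict:
    """Qdrant filter matching points whose payload hash is in `doc_hashes`."""
    return {"must": [{"key": HASH_FIELD, "match": {"any": list(doc_hashes)}}]}


def count_request(doc_hashes: list[str]) -> tuple[str, dict]:
    """Path and JSON body for an exact count of the matching points."""
    path = f"/collections/{COLLECTION}/points/count"
    return path, {"filter": hash_filter(doc_hashes), "exact": True}


def delete_request(doc_hashes: list[str]) -> tuple[str, dict]:
    """Path and JSON body for deleting the matching points, waiting for completion."""
    path = f"/collections/{COLLECTION}/points/delete?wait=true"
    return path, {"filter": hash_filter(doc_hashes)}


def expected_points_after(before: int, deleted: int) -> int:
    """Point count the collection should report once deletion is complete."""
    return before - deleted


path, body = delete_request(["abc123", "def456"])
print("POST", path)
print(json.dumps(body, indent=2))
print(expected_points_after(2_320_710, 11_450))  # 2309260
```

Counting before deleting lets each chunk boundary compare the collection total against `expected_points_after`, which is how the 2,320,710 → 2,309,260 figure can be validated.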
#### NFS Root Squash Issue
The initial deletion script ran as root (via sudo). NFS `root_squash` maps root to an unprivileged user on the server, so root could not delete the zvx-owned files under `_sources/streamecho6/`. Because `shutil.rmtree` was called with `ignore_errors=True`, the permission errors were silently discarded.
**Fix:** Ran a separate deletion script as the zvx user. All 2,259 source dirs deleted successfully. Also removed all 131 now-empty channel directories.
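
One way to keep this class of failure from going unnoticed is to let `shutil.rmtree` raise and to record every directory that could not be removed. The sketch below is illustrative only; `remove_dirs_strict` is a hypothetical helper, not the script used in this phase.

```python
import shutil
from pathlib import Path


def remove_dirs_strict(paths: list[Path]) -> dict[Path, str]:
    """Delete each directory and return {path: error} for every one that failed."""
    failures: dict[Path, str] = {}
    for path in paths:
        try:
            # No ignore_errors: PermissionError and friends surface as OSError.
            shutil.rmtree(path)
        except OSError as exc:
            failures[path] = f"{type(exc).__name__}: {exc}"
            continue
        # Second check: confirm the directory is really gone.
        if path.exists():
            failures[path] = "still present after rmtree"
    return failures
```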
---
## Post-Execution Verification
| Check | Expected | Actual |
|-------|----------|--------|
| Catalogue count | 27,553 | 27,553 |
| Documents count | 27,553 | 27,553 |
| skip_unclassified remaining | 0 | 0 |
| Orphaned documents | 0 | 0 |
| Orphaned catalogue | 0 | 0 |
| Hopper .txt files | 2,259 | 2,259 |
| Hopper .meta.json files | 2,259 | 2,259 |
| Hopper spot check (10 random) | All valid | 10/10 OK |
| _sources/streamecho6 transcript dirs | 0 | 0 |
| _sources/streamecho6 channel dirs | 0 | 0 |
| Qdrant status | green/ok | green/ok |
| Qdrant points | 2,309,260 | 2,309,260 |
| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
| Services | inactive | inactive |
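
The two orphan checks in the table can be expressed as anti-joins between the `catalogue` and `documents` tables. This sketch assumes a SQLite database and a shared `hash` column as the join key; the real key is not stated in this report.

```python
import sqlite3


def orphan_counts(conn: sqlite3.Connection) -> dict:
    """Count rows in each table that have no partner row in the other."""
    # Assumption: both tables share a `hash` column identifying the transcript.
    orphaned_documents = conn.execute(
        """
        SELECT COUNT(*) FROM documents AS d
        LEFT JOIN catalogue AS c ON c.hash = d.hash
        WHERE c.hash IS NULL
        """
    ).fetchone()[0]
    orphaned_catalogue = conn.execute(
        """
        SELECT COUNT(*) FROM catalogue AS c
        LEFT JOIN documents AS d ON d.hash = c.hash
        WHERE d.hash IS NULL
        """
    ).fetchone()[0]
    return {
        "orphaned_documents": orphaned_documents,
        "orphaned_catalogue": orphaned_catalogue,
    }
```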
---
## Anomalies
- **NFS root_squash:** `sudo python3` couldn't delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. `shutil.rmtree(ignore_errors=True)` masked the failure, which was only caught during verification.
- **Vectors existed (correction):** Initial assumption was 0 Qdrant vectors for these entries. All 2,259 actually had vectors (11,450 total). Deletion added to execution scope.
---
## No Code Changes
Phase 5b is pure data migration. No files in the recon repo were modified or committed.