diff --git a/phases/phase-5b-transcript-unprocess.md b/phases/phase-5b-transcript-unprocess.md
new file mode 100644
index 0000000..ef5cd54
--- /dev/null
+++ b/phases/phase-5b-transcript-unprocess.md
@@ -0,0 +1,107 @@
+# Phase 5b: Transcript Unprocess
+
+**Executed:** 2026-04-14, 17:35–18:00 UTC
+
+---
+
+## Backup
+
+| Item | Location | MD5 Hash / State |
+|------|----------|------------------|
+| recon.db (pre-Phase 5b) | CT 130: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
+| recon.db (pre-Phase 5b) | cortex: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
+| Qdrant baseline | cortex:6333 `recon_knowledge_hybrid` | status=green, 2,320,710 points |
+| Unprocess list | CT 130: `/tmp/transcript_unprocess_list.20260414.json` | n/a |
+
+---
+
+## What This Phase Does
+
+This phase takes the 2,259 transcripts flagged `skip_unclassified_phase5a` (no viable domain classification) and resets them to an unprocessed state so the refactored pipeline can re-process them:
+
+1. **Stage:** Concatenate page files into `acquired/stream/{hash}.txt` and copy `meta.json` as `{hash}.meta.json`
+2. **Delete:** Remove DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories
+
+End state: 2,259 content+sidecar pairs sitting in the hopper (`acquired/stream/`), ready for the dispatcher to pick up in Phase 5c.
+
+---
+
+## Discovery: Skip Entries Have Qdrant Vectors
+
+Pre-flight revealed that all 2,259 skip_unclassified entries had `vectors_inserted > 0` in the DB (total: 11,450 points, avg 5.1/doc). The original pipeline embedded them before classification was attempted.
+
+Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.
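
The pre-flight tally described above could be reproduced with a query along these lines. This is a minimal sketch, not the script that was run. It assumes `recon.db` is a SQLite database and that the `documents` table holds the skip flag in a `status` column next to `vectors_inserted`; the file format, the `status` column name, and the helper names `preflight_vector_tally` and `build_demo_db` are all assumptions made for illustration.

```python
import sqlite3

SKIP_FLAG = "skip_unclassified_phase5a"  # flag value taken from the phase notes


def preflight_vector_tally(conn: sqlite3.Connection, flag: str = SKIP_FLAG):
    """Return (flagged_rows, rows_with_vectors, total_vectors) for one flag.

    Assumed (hypothetical) schema: documents(hash, status, vectors_inserted).
    """
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(vectors_inserted > 0), 0),
               COALESCE(SUM(vectors_inserted), 0)
        FROM documents
        WHERE status = ?
        """,
        (flag,),
    ).fetchone()
    return tuple(row)


def build_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for recon.db so the sketch runs without real data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (hash TEXT PRIMARY KEY, status TEXT, "
        "vectors_inserted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [("aaa", SKIP_FLAG, 5), ("bbb", SKIP_FLAG, 6), ("ccc", "classified", 9)],
    )
    return conn


demo = build_demo_db()
flagged, with_vectors, total = preflight_vector_tally(demo)
print(f"{flagged} flagged, {with_vectors} with vectors, {total} vectors total")
```

A non-zero `with_vectors` means the delete step has to cover Qdrant points as well as DB rows, which is the situation this phase ran into.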
+
+---
+
+## Plan Summary
+
+| Metric | Count |
+|--------|-------|
+| Transcripts to unprocess | 2,259 |
+| Unique channels | 113 |
+| Total page files | 3,444 |
+| Total Qdrant vectors to delete | 11,450 |
+
+---
+
+## Execution
+
+### Section 2: Stage Hopper
+
+Copied all 2,259 transcripts to `acquired/stream/` in 23 chunks of up to 100 (22 full chunks plus a final chunk of 59).
+
+- **Time:** 0.4 seconds (5,462 entries/sec)
+- **Errors:** 0
+- **Result:** 2,259 `.txt` + 2,259 `.meta.json` in hopper
+
+### Section 3: Delete Sources, DB Rows, Qdrant Vectors
+
+Processed in the same 23 chunks, verifying DB counts and Qdrant health at each chunk boundary.
+
+- **Time:** 2.2 seconds (1,017 entries/sec)
+- **Errors:** 0
+- **DB rows deleted:** 2,259 catalogue + 2,259 documents
+- **Qdrant points deleted:** 11,450 (2,320,710 → 2,309,260)
+- **Source directories deleted:** 2,259
+- **Qdrant stayed green/ok throughout**
+
+#### NFS Root Squash Issue
+
+The initial deletion script ran as root (via sudo). NFS root_squash prevented root from deleting zvx-owned files under `_sources/streamecho6/`, and `shutil.rmtree(ignore_errors=True)` silently swallowed the resulting permission errors.
+
+**Fix:** Ran a separate deletion script as the zvx user. All 2,259 source dirs were deleted successfully. Also removed all 131 now-empty channel directories.
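
One way to keep a failed deletion like this from passing silently is to let `shutil.rmtree` raise and record each failure. The sketch below is illustrative only and is not the script used in this phase; `remove_dirs_strict` is a hypothetical helper name, and treating an already-missing directory as success is a design assumption.

```python
import shutil
from pathlib import Path


def remove_dirs_strict(dirs):
    """Delete each directory tree and return a list of (path, error) failures.

    Unlike rmtree(ignore_errors=True), permission errors are surfaced to the
    caller instead of being discarded.
    """
    failures = []
    for d in map(Path, dirs):
        try:
            shutil.rmtree(d)  # no ignore_errors: OSError propagates
        except FileNotFoundError:
            continue  # already gone; counted as success in this sketch
        except OSError as exc:  # includes PermissionError under root_squash
            failures.append((d, exc))
            continue
        if d.exists():
            # Defensive post-check: the tree should be gone after rmtree returns.
            failures.append((d, OSError(f"{d} still present after rmtree")))
    return failures
```

A caller could abort the chunk, or retry as the owning user, whenever the returned list is non-empty, rather than discovering the leftover directories during verification.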
+
+---
+
+## Post-Execution Verification
+
+| Check | Expected | Actual |
+|-------|----------|--------|
+| Catalogue count | 27,553 | 27,553 |
+| Documents count | 27,553 | 27,553 |
+| skip_unclassified remaining | 0 | 0 |
+| Orphaned documents | 0 | 0 |
+| Orphaned catalogue | 0 | 0 |
+| Hopper .txt files | 2,259 | 2,259 |
+| Hopper .meta.json files | 2,259 | 2,259 |
+| Hopper spot check (10 random) | All valid | 10/10 OK |
+| _sources/streamecho6 transcript dirs | 0 | 0 |
+| _sources/streamecho6 channel dirs | 0 | 0 |
+| Qdrant status | green/ok | green/ok |
+| Qdrant points | 2,309,260 | 2,309,260 |
+| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
+| Services | inactive | inactive |
+
+---
+
+## Anomalies
+
+- **NFS root_squash:** `sudo python3` couldn't delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. `shutil.rmtree(ignore_errors=True)` masked the failure, which was caught during verification.
+- **Vectors existed (correction):** The initial assumption was that these entries had 0 Qdrant vectors. All 2,259 actually had vectors (11,450 total), so Qdrant deletion was added to the execution scope.
+
+---
+
+## No Code Changes
+
+Phase 5b is a pure data migration. No files in the recon repo were modified or committed.