refactored-recon/phases/phase-5b-transcript-unprocess.md
2026-04-14 18:19:35 +00:00

# Phase 5b: Transcript Unprocess
**Executed:** 2026-04-14, 17:35–18:00 UTC
---
## Backup
| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (pre-Phase 5b) | CT 130: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| recon.db (pre-Phase 5b) | cortex: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| Qdrant baseline | cortex:6333 `recon_knowledge_hybrid` | status=green, 2,320,710 points |
| Unprocess list | CT 130: `/tmp/transcript_unprocess_list.20260414.json` | — |
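
The matching hashes in the table can be re-checked by recomputing the digest of each backup copy. The following is a minimal sketch, assuming both copies are ordinary files readable from wherever the check runs; the helper names `md5_of` and `backups_match` are illustrative and do not come from the recon repo.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def backups_match(paths: list[Path], expected: str) -> bool:
    """True only if every listed backup hashes to the expected digest."""
    return all(md5_of(path) == expected for path in paths)
```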
---
## What This Phase Does
Takes the 2,259 transcripts flagged `skip_unclassified_phase5a` (no viable domain classification) and resets them to an unprocessed state so the refactored pipeline can re-process them:
1. **Stage:** Concatenate page files into `acquired/stream/{hash}.txt` + copy `meta.json` as `{hash}.meta.json`
2. **Delete:** Remove DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories
End state: 2,259 content+sidecar pairs sitting in the hopper, ready for the dispatcher to pick up in Phase 5c.
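
The stage step could look roughly like the sketch below. It is not the script that was run: the `page_*.txt` naming pattern, the assumption that page names are zero-padded so that a lexicographic sort gives reading order, and the function name `stage_transcript` are all hypothetical. Only the output names (`{hash}.txt` and `{hash}.meta.json`) come from the description above.

```python
import shutil
from pathlib import Path


def stage_transcript(src_dir: Path, hopper: Path, doc_hash: str) -> Path:
    """Concatenate page files into {hash}.txt and copy meta.json beside it.

    Assumptions (not confirmed by the source): pages match `page_*.txt`
    and are zero-padded, so sorting by name yields reading order.
    """
    pages = sorted(src_dir.glob("page_*.txt"))
    if not pages:
        raise FileNotFoundError(f"no page files in {src_dir}")
    hopper.mkdir(parents=True, exist_ok=True)
    content = hopper / f"{doc_hash}.txt"
    with content.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))
            out.write("\n")  # keep a separator between pages
    # The sidecar is copied unchanged, only renamed to match the content file.
    shutil.copy2(src_dir / "meta.json", hopper / f"{doc_hash}.meta.json")
    return content
```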
---
## Discovery: Skip Entries Have Qdrant Vectors
Pre-flight revealed that all 2,259 skip_unclassified entries had `vectors_inserted > 0` in the DB (total: 11,450 points, avg 5.1/doc). These were embedded by the original pipeline before classification was attempted.
Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.
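
A per-document count query of this kind can be sketched against Qdrant's REST endpoint `POST /collections/{collection}/points/count`. The payload field that ties a point to its transcript is not stated in this report, so `doc_hash` below is a hypothetical key, and the helper names are illustrative.

```python
import json
import urllib.request


def count_request_body(doc_hash: str, key: str = "doc_hash") -> dict:
    """Build the JSON body for Qdrant's points/count endpoint.

    `key` is a hypothetical payload field linking a point to its document.
    """
    return {
        "filter": {"must": [{"key": key, "match": {"value": doc_hash}}]},
        "exact": True,  # ask for an exact count rather than an estimate
    }


def count_points(base_url: str, collection: str, doc_hash: str) -> int:
    """POST the count request and return the number of matching points."""
    request = urllib.request.Request(
        f"{base_url}/collections/{collection}/points/count",
        data=json.dumps(count_request_body(doc_hash)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["result"]["count"]
```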
---
## Plan Summary
| Metric | Count |
|--------|-------|
| Transcripts to unprocess | 2,259 |
| Unique channels | 113 |
| Total page files | 3,444 |
| Total Qdrant vectors to delete | 11,450 |
---
## Execution
### Section 2: Stage Hopper
Copied all 2,259 transcripts to `acquired/stream/` in 23 chunks of up to 100 (the last chunk held 59).
- **Time:** 0.4 seconds (5,462 entries/sec)
- **Errors:** 0
- **Result:** 2,259 `.txt` + 2,259 `.meta.json` in hopper
### Section 3: Delete Sources, DB Rows, Qdrant Vectors
Processed in 23 chunks of up to 100, with DB count and Qdrant health verification at each chunk boundary.
- **Time:** 2.2 seconds (1,017 entries/sec)
- **Errors:** 0
- **DB rows deleted:** 2,259 catalogue + 2,259 documents
- **Qdrant points deleted:** 11,450 (2,320,710 → 2,309,260)
- **Source directories deleted:** 2,259
- **Qdrant status:** green/ok throughout
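
The chunked delete with a count check at every boundary can be illustrated as follows. This is a sketch under stated assumptions, not the execution script: it assumes `recon.db` is a SQLite database, that both tables have a `hash` column identifying the transcript, and that each hash appears exactly once per table. The table names `catalogue` and `documents` are taken from this report; everything else is hypothetical. Qdrant deletion and the health check are left out.

```python
import sqlite3
from typing import Iterator, Sequence


def chunks(items: Sequence[str], size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_in_chunks(conn: sqlite3.Connection, hashes: Sequence[str], size: int = 100) -> int:
    """Delete rows chunk by chunk, checking the row counts at each boundary."""
    def count(table: str) -> int:
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    deleted = 0
    for chunk in chunks(hashes, size):
        before = {table: count(table) for table in ("catalogue", "documents")}
        marks = ",".join("?" * len(chunk))
        for table in before:
            # `hash` is an assumed column name, not confirmed by the source.
            conn.execute(f"DELETE FROM {table} WHERE hash IN ({marks})", list(chunk))
        conn.commit()
        # Boundary check: each table must have shrunk by exactly one chunk.
        for table, previous in before.items():
            if count(table) != previous - len(chunk):
                raise RuntimeError(f"{table}: unexpected count after chunk")
        deleted += len(chunk)
    return deleted
```

With 2,259 entries and a chunk size of 100, `chunks` yields 22 full chunks and a final chunk of 59, which matches the 23 chunks reported above.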
#### NFS Root Squash Issue
The initial deletion script ran as root (via `sudo`). NFS `root_squash` prevented root from deleting zvx-owned files under `_sources/streamecho6/`, and `shutil.rmtree(ignore_errors=True)` silently swallowed the resulting permission errors.
**Fix:** Ran a separate deletion script as the zvx user. All 2,259 source dirs deleted successfully. Also removed all 131 now-empty channel directories.
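
One way to keep such failures visible is to call `shutil.rmtree` without `ignore_errors` and record what goes wrong. The sketch below is an illustration of that idea, not the second-pass script that was actually run; the function name `remove_dirs` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_dirs(paths: list[Path]) -> dict[Path, str]:
    """Remove each directory tree; return the failures instead of hiding them."""
    failures: dict[Path, str] = {}
    for path in paths:
        try:
            shutil.rmtree(path)  # no ignore_errors: let OSError surface
        except OSError as exc:
            # PermissionError (e.g. from root_squash) is a subclass of OSError.
            failures[path] = str(exc)
            continue
        if path.exists():  # confirm the tree is really gone
            failures[path] = "still present after rmtree"
    return failures
```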
---
## Post-Execution Verification
| Check | Expected | Actual |
|-------|----------|--------|
| Catalogue count | 27,553 | 27,553 |
| Documents count | 27,553 | 27,553 |
| skip_unclassified remaining | 0 | 0 |
| Orphaned documents | 0 | 0 |
| Orphaned catalogue | 0 | 0 |
| Hopper .txt files | 2,259 | 2,259 |
| Hopper .meta.json files | 2,259 | 2,259 |
| Hopper spot check (10 random) | All valid | 10/10 OK |
| _sources/streamecho6 transcript dirs | 0 | 0 |
| _sources/streamecho6 channel dirs | 0 | 0 |
| Qdrant status | green/ok | green/ok |
| Qdrant points | 2,309,260 | 2,309,260 |
| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
| Services | inactive | inactive |
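
The hopper rows in this table can be reproduced with a pairing check plus a random spot check. The sketch below is illustrative only: what counted as "valid" in the original spot check is not stated, so it assumes a valid entry is a non-empty `.txt` file with a sidecar that parses as JSON. The function name `check_hopper` is hypothetical.

```python
import json
import random
from pathlib import Path


def check_hopper(hopper: Path, sample_size: int = 10) -> list[str]:
    """Return a list of problems found; an empty list means no issue was seen."""
    problems = []
    txt = {p.name[: -len(".txt")] for p in hopper.glob("*.txt")}
    meta = {p.name[: -len(".meta.json")] for p in hopper.glob("*.meta.json")}
    # Every content file needs a sidecar and vice versa.
    for name in sorted(txt ^ meta):
        problems.append(f"unpaired entry: {name}")
    paired = sorted(txt & meta)
    # Spot check a random sample (assumed validity criteria, see lead-in).
    for name in random.sample(paired, min(sample_size, len(paired))):
        if (hopper / f"{name}.txt").stat().st_size == 0:
            problems.append(f"empty content: {name}")
        try:
            json.loads((hopper / f"{name}.meta.json").read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append(f"invalid sidecar JSON: {name}")
    return problems
```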
---
## Anomalies
- **NFS root_squash:** `sudo python3` could not delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. `shutil.rmtree(ignore_errors=True)` masked the failure, which was caught during verification.
- **Vectors existed (correction):** The initial assumption was that these entries had no Qdrant vectors. All 2,259 actually had vectors (11,450 total), so deletion was added to the execution scope.
---
## No Code Changes
Phase 5b is pure data migration. No files in the recon repo were modified or committed.