mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 06:34:34 +02:00
Phase 5b: transcript unprocess — stage 2,259 skip_unclassified transcripts into hopper
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent
71dd0a1182
commit
6b49d4c107
1 changed files with 107 additions and 0 deletions
phases/phase-5b-transcript-unprocess.md
Normal file
@@ -0,0 +1,107 @@
# Phase 5b: Transcript Unprocess

**Executed:** 2026-04-14, 17:35–18:00 UTC

---

## Backup

| Item | Location | MD5 Hash / State |
|------|----------|------------------|
| recon.db (pre-Phase 5b) | CT 130: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| recon.db (pre-Phase 5b) | cortex: `/tmp/recon.db.phase5b.20260414.bak` | `ba114d3d733f15c9c292f7daca1052b9` |
| Qdrant baseline | cortex:6333 `recon_knowledge_hybrid` | status=green, 2,320,710 points |
| Unprocess list | CT 130: `/tmp/transcript_unprocess_list.20260414.json` | n/a |

---

## What This Phase Does

Takes the 2,259 transcripts flagged `skip_unclassified_phase5a` (no viable domain classification) and resets them to an unprocessed state so the refactored pipeline can re-process them:

1. **Stage:** Concatenate page files into `acquired/stream/{hash}.txt` and copy `meta.json` alongside it as `{hash}.meta.json`
2. **Delete:** Remove DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories

End state: 2,259 content+sidecar pairs sitting in the hopper, ready for the dispatcher to pick up in Phase 5c.

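The staging step can be sketched roughly as follows. This is a minimal illustration, not the script that was run: the function name `stage_transcript`, the `page_*.txt` naming pattern, and the newline separator between pages are all assumptions; only the `{hash}.txt` and `{hash}.meta.json` output names come from the description above.

```python
import shutil
from pathlib import Path


def stage_transcript(src_dir: Path, hopper_dir: Path, doc_hash: str) -> tuple[Path, Path]:
    """Concatenate page files into one .txt and copy meta.json as a sidecar."""
    # Assumption: page files match "page_*.txt" and sort correctly by name.
    pages = sorted(src_dir.glob("page_*.txt"))
    if not pages:
        raise FileNotFoundError(f"no page files in {src_dir}")

    hopper_dir.mkdir(parents=True, exist_ok=True)
    content_path = hopper_dir / f"{doc_hash}.txt"
    sidecar_path = hopper_dir / f"{doc_hash}.meta.json"

    # Assumption: pages are joined with a single newline between them.
    text = "\n".join(p.read_text(encoding="utf-8") for p in pages)
    content_path.write_text(text, encoding="utf-8")

    # Copy the metadata file next to the content under the hash-based name.
    shutil.copyfile(src_dir / "meta.json", sidecar_path)
    return content_path, sidecar_path
```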
---

## Discovery: Skip Entries Have Qdrant Vectors

Pre-flight revealed that all 2,259 skip_unclassified entries had `vectors_inserted > 0` in the DB (total: 11,450 points, avg 5.1/doc). The original pipeline embedded them before classification was attempted.

Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.

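A pre-flight check of this kind could look like the sketch below. It assumes `recon.db` is a SQLite database with a `documents` table and a `status` column holding the flag; those names are hypothetical, and only `vectors_inserted` appears in this report. It returns the number of flagged entries, how many of them have vectors, and the total vector count.

```python
import sqlite3


def preflight_vector_stats(conn: sqlite3.Connection, flag: str) -> tuple[int, int, int]:
    """Return (entries, entries_with_vectors, total_vectors) for a status flag."""
    # Assumption: table "documents" and column "status" are illustrative names.
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(vectors_inserted > 0), 0),
               COALESCE(SUM(vectors_inserted), 0)
        FROM documents
        WHERE status = ?
        """,
        (flag,),
    ).fetchone()
    return row[0], row[1], row[2]
```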
---

## Plan Summary

| Metric | Count |
|--------|-------|
| Transcripts to unprocess | 2,259 |
| Unique channels | 113 |
| Total page files | 3,444 |
| Total Qdrant vectors to delete | 11,450 |

---

## Execution

### Section 2: Stage Hopper

Copied all 2,259 transcripts to `acquired/stream/` in 23 chunks of up to 100 (22 full chunks plus a final chunk of 59).

- **Time:** 0.4 seconds (5,462 entries/sec)
- **Errors:** 0
- **Result:** 2,259 `.txt` + 2,259 `.meta.json` in hopper

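The chunk count follows from splitting 2,259 entries into slices of at most 100. A minimal sketch of such a splitter is shown here; the helper name `chunked` is hypothetical and is not taken from the execution script.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 2,259 entries in chunks of 100: 22 full chunks and one final chunk of 59.
chunks = list(chunked(list(range(2259)), 100))
print(len(chunks), len(chunks[-1]))  # 23 59
```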
### Section 3: Delete Sources, DB Rows, Qdrant Vectors

Processed in 23 chunks of up to 100, verifying the DB row count and Qdrant health at each chunk boundary.

- **Time:** 2.2 seconds (1,017 entries/sec)
- **Errors:** 0
- **DB rows deleted:** 2,259 catalogue + 2,259 documents
- **Qdrant points deleted:** 11,450 (2,320,710 → 2,309,260)
- **Source directories deleted:** 2,259
- **Qdrant stayed green/ok throughout**

#### NFS Root Squash Issue

The initial deletion script ran as root (via sudo). NFS `root_squash` prevented root from deleting zvx-owned files under `_sources/streamecho6/`, and `shutil.rmtree(ignore_errors=True)` silently swallowed the permission errors.

**Fix:** Ran a separate deletion script as the zvx user. All 2,259 source dirs were deleted successfully, and all 131 now-empty channel directories were removed as well.

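One way to keep failures like this from going unnoticed is to drop `ignore_errors=True` and record every directory that could not be removed. The sketch below illustrates that idea under stated assumptions; `remove_dirs_strict` is a hypothetical helper, not the script used for the second pass.

```python
import shutil
from pathlib import Path


def remove_dirs_strict(dirs: list[Path]) -> list[tuple[Path, str]]:
    """Delete each directory tree; return (path, reason) for every failure."""
    failures: list[tuple[Path, str]] = []
    for d in dirs:
        try:
            # No ignore_errors=True: permission errors raise instead of vanishing.
            shutil.rmtree(d)
        except OSError as exc:
            failures.append((d, f"{type(exc).__name__}: {exc}"))
            continue
        # Double-check that the directory is actually gone.
        if d.exists():
            failures.append((d, "still exists after rmtree"))
    return failures
```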
---

## Post-Execution Verification

| Check | Expected | Actual |
|-------|----------|--------|
| Catalogue count | 27,553 | 27,553 |
| Documents count | 27,553 | 27,553 |
| skip_unclassified remaining | 0 | 0 |
| Orphaned documents | 0 | 0 |
| Orphaned catalogue | 0 | 0 |
| Hopper .txt files | 2,259 | 2,259 |
| Hopper .meta.json files | 2,259 | 2,259 |
| Hopper spot check (10 random) | All valid | 10/10 OK |
| _sources/streamecho6 transcript dirs | 0 | 0 |
| _sources/streamecho6 channel dirs | 0 | 0 |
| Qdrant status | green/ok | green/ok |
| Qdrant points | 2,309,260 | 2,309,260 |
| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
| Services | inactive | inactive |

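The hopper file counts in the table above could be checked with something like the following sketch, which counts `.txt` and `.meta.json` files and reports any entry missing its counterpart. The function name `check_hopper_pairs` and the returned dictionary layout are assumptions made for illustration.

```python
from pathlib import Path


def check_hopper_pairs(hopper_dir: Path) -> dict[str, object]:
    """Count content and sidecar files and list any unpaired hashes."""
    suffix = ".meta.json"
    # Strip the full ".meta.json" suffix so "abc.meta.json" maps to "abc".
    sidecars = {p.name[: -len(suffix)] for p in hopper_dir.glob(f"*{suffix}")}
    contents = {p.stem for p in hopper_dir.glob("*.txt")}
    return {
        "txt": len(contents),
        "meta": len(sidecars),
        "missing_sidecar": sorted(contents - sidecars),
        "missing_content": sorted(sidecars - contents),
    }
```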
---

## Anomalies

- **NFS root_squash:** `sudo python3` couldn't delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. `shutil.rmtree(ignore_errors=True)` masked the failure, which was caught during verification.
- **Vectors existed (correction):** The initial assumption was 0 Qdrant vectors for these entries. All 2,259 actually had vectors (11,450 total), so deletion was added to the execution scope.

---

## No Code Changes

Phase 5b is a pure data migration. No files in the recon repo were modified or committed.