mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 06:34:34 +02:00
Phase 5a: transcript resweep (18855 transcripts)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
1d9727f26f
commit
71dd0a1182
1 changed files with 125 additions and 0 deletions
125
phases/phase-5a-transcript-resweep.md
Normal file
# Phase 5a: Transcript Resweep

**Executed:** 2026-04-14, 17:00–17:30 UTC

---

## Backup

| Item | Location | MD5 Hash / Status |
|------|----------|-------------------|
| recon.db (pre-Phase 5a) | CT 130: `/tmp/recon.db.phase5a.20260414.bak` | `143f6c887d76a1b6f9a4fe115d2d8284` |
| recon.db (pre-Phase 5a) | cortex: `/tmp/recon.db.phase5a.20260414.bak` | `143f6c887d76a1b6f9a4fe115d2d8284` |
| Qdrant baseline | cortex:6333 `recon_knowledge_hybrid` | status=green, 2,320,710 points |
| Resweep plan | CT 130: `/tmp/transcript_resweep_plan.20260414.json` | n/a |
| Skipped list | CT 130: `/tmp/transcript_resweep_skipped.20260414.txt` | n/a |

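The matching MD5 values in the table indicate that both backup copies are byte-identical. A minimal sketch of how such a check could be repeated, assuming the backup file is readable locally; the helper names `md5_of_file` and `backup_matches` are illustrative and not part of the recon tooling:

```python
import hashlib
from pathlib import Path

# Digest recorded in the backup table for recon.db (pre-Phase 5a).
EXPECTED_MD5 = "143f6c887d76a1b6f9a4fe115d2d8284"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def backup_matches(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """True when the file at `path` hashes to the expected digest."""
    return md5_of_file(path) == expected.lower()
```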
---
## What This Phase Does

Processes 18,855 existing transcript directories under `/mnt/library/_sources/streamecho6/{channel}/{title}__{hash8}/`. The 16,596 that have a domain classification are moved to `/mnt/library/{Domain}/{Subdomain}/{sanitized_title}.txt` based on their existing concept classifications; the remaining 2,259 are flagged and left in place. No new enrichment, no code changes, no service modifications.

For each moved transcript, the page files are concatenated into a single `.txt` file at the target location. The source directory is deleted only after a successful move. DB paths and Qdrant payloads are updated to reflect the new locations.

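The concatenate-then-delete step can be sketched as follows. This is an illustration only, not the script that was run: the function name `consolidate_transcript`, the `*.txt` page-file pattern, name-based page ordering, and the `.part` temporary file are all assumptions. Writing to a temporary file and renaming it means the source directory is removed only once the target is complete.

```python
import shutil
from pathlib import Path


def consolidate_transcript(source_dir: Path, target_file: Path) -> int:
    """Concatenate page files into target_file, then delete source_dir.

    Returns the number of pages written. Page files are assumed to be
    `*.txt` and to sort correctly by name (hypothetical layout).
    """
    pages = sorted(source_dir.glob("*.txt"))
    if not pages:
        raise ValueError(f"no page files in {source_dir}")
    if target_file.exists():
        raise FileExistsError(target_file)

    target_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = target_file.with_name(target_file.name + ".part")
    with tmp.open("w", encoding="utf-8") as out:
        for page in pages:
            # Normalise so every page ends with exactly one newline.
            out.write(page.read_text(encoding="utf-8").rstrip("\n") + "\n")
    tmp.replace(target_file)   # rename within the same directory
    shutil.rmtree(source_dir)  # only reached if the write succeeded
    return len(pages)
```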
---
## Plan Summary

| Metric | Count |
|--------|-------|
| Source channels scanned | 131 |
| Total transcript directories | 18,855 |
| Plan entries: MOVE | 16,596 |
| Plan entries: SKIP_UNCLASSIFIED | 2,259 |
| Plan errors | 0 |
| Intra-plan path collisions fixed | 18 |

### Domain Breakdown (moves)

| Domain | Count |
|--------|-------|
| Foundational Skills | 3,720 |
| Sustainment Systems | 3,487 |
| Communications | 3,115 |
| Defense & Tactics | 2,802 |
| Off-Grid Systems | 1,821 |
| Medical | 446 |
| Agriculture & Livestock | 197 |
| Technology | 171 |
| Food Systems | 159 |
| Tools & Equipment | 114 |
| Security | 107 |
| Power Systems | 98 |
| Shelter & Construction | 72 |
| Logistics | 59 |
| Vehicles | 50 |
| Preservation & Storage | 43 |
| Scenario Playbooks | 33 |
| Civil Organization | 25 |
| Navigation | 22 |
| Water Systems | 21 |
| Wilderness Skills | 10 |
| Operations | 10 |
| Community Coordination | 8 |
| Leadership | 6 |

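The plan figures can be cross-checked with simple arithmetic: the per-domain counts should add up to the MOVE total, and MOVE plus SKIP_UNCLASSIFIED should equal the number of transcript directories. A short sketch using only the numbers reported in the plan tables (the function name `plan_is_consistent` is illustrative):

```python
# Counts copied from the Domain Breakdown table.
DOMAIN_MOVES = {
    "Foundational Skills": 3720, "Sustainment Systems": 3487,
    "Communications": 3115, "Defense & Tactics": 2802,
    "Off-Grid Systems": 1821, "Medical": 446,
    "Agriculture & Livestock": 197, "Technology": 171,
    "Food Systems": 159, "Tools & Equipment": 114,
    "Security": 107, "Power Systems": 98,
    "Shelter & Construction": 72, "Logistics": 59,
    "Vehicles": 50, "Preservation & Storage": 43,
    "Scenario Playbooks": 33, "Civil Organization": 25,
    "Navigation": 22, "Water Systems": 21,
    "Wilderness Skills": 10, "Operations": 10,
    "Community Coordination": 8, "Leadership": 6,
}

# Counts copied from the Plan Summary table.
MOVE, SKIP, TOTAL = 16_596, 2_259, 18_855


def plan_is_consistent() -> bool:
    """Check that the domain rows sum to MOVE and MOVE + SKIP == TOTAL."""
    return sum(DOMAIN_MOVES.values()) == MOVE and MOVE + SKIP == TOTAL


print(plan_is_consistent())
```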
---
## Execution

Executed in 34 chunks of up to 500 entries each; skip entries were processed first.

- **Chunk processing rate:** 15–20 entries/sec
- **Total time:** 1,028 seconds (about 17 minutes)
- **Errors:** 0
- **Volume moved:** ~0.2 GB (avg 13.5 KB per transcript)

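The chunk count and throughput follow from the reported totals. The sketch below assumes the 34 chunks cover only the 16,596 MOVE entries (the text says skips were handled first) and uses a hypothetical `chunked` helper rather than the actual runner:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int = 500) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


move_entries = range(16_596)          # stand-in for the MOVE plan entries
chunks = list(chunked(move_entries))
overall_rate = len(move_entries) / 1_028   # entries per second overall

print(len(chunks), len(chunks[-1]), round(overall_rate, 1))
```

Under that assumption the final chunk is a partial one, and the overall rate lands inside the reported 15–20 entries/sec range.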
### Qdrant Status

Qdrant went from green to yellow after chunk 2 while the optimizer processed payload updates. `optimizer_status` remained `ok` throughout, and the point count held at 2,320,710 across all 34 chunk checkpoints. This is expected behavior: the optimizer merges segments after many small payload writes.

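A per-chunk checkpoint of this kind could look like the sketch below. It reads the collection info from Qdrant's REST endpoint `GET /collections/{name}` and treats yellow as acceptable as long as the optimizer reports `ok` and the point count has not drifted. The function names and the tolerance policy are assumptions, not the checks actually used during the run.

```python
import json
from urllib.request import urlopen


def fetch_collection_info(base_url: str, collection: str) -> dict:
    """GET /collections/{name} from Qdrant's REST API and return `result`."""
    with urlopen(f"{base_url}/collections/{collection}", timeout=10) as resp:
        return json.load(resp)["result"]


def checkpoint_ok(info: dict, expected_points: int) -> bool:
    """Yellow is tolerated; red, a non-ok optimizer, or point drift is not."""
    return (
        info["status"] in ("green", "yellow")
        and info["optimizer_status"] == "ok"
        and info["points_count"] == expected_points
    )
```

For example, `checkpoint_ok(fetch_collection_info("http://cortex:6333", "recon_knowledge_hybrid"), 2_320_710)` would be evaluated after each chunk, with the host and collection name taken from the backup table.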
### Skip Processing

The 2,259 transcripts without a domain classification (0 concepts or ambiguous) were flagged with `skip_unclassified_phase5a` in `metadata_provenance`, and `organized_at` was set to the current timestamp. Their source directories were left in place at `_sources/streamecho6/`.

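The flagging step might be expressed as a single transaction against the SQLite database. The schema here is hypothetical: a `documents` table with `id`, `metadata_provenance`, and `organized_at` columns is assumed for illustration, as is overwriting (rather than appending to) `metadata_provenance`. The real layout of `recon.db` is not described in this report.

```python
import sqlite3
from datetime import datetime, timezone

SKIP_FLAG = "skip_unclassified_phase5a"


def flag_skipped(conn: sqlite3.Connection, doc_ids: list) -> int:
    """Mark documents as skipped; returns the number of rows updated.

    Assumes a hypothetical `documents(id, metadata_provenance, organized_at)`
    table. All updates commit together or roll back together.
    """
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back on exception
        cur = conn.executemany(
            "UPDATE documents "
            "SET metadata_provenance = ?, organized_at = ? "
            "WHERE id = ?",
            [(SKIP_FLAG, now, doc_id) for doc_id in doc_ids],
        )
    return cur.rowcount
```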
---
## Post-Execution Verification

| Check | Expected | Actual |
|-------|----------|--------|
| Catalogue count | 29,812 | 29,812 |
| Documents count | 29,812 | 29,812 |
| Organized stream transcripts | 18,855 | 18,855 |
| Skip-flagged documents | 2,259 | 2,259 |
| Qdrant points | 2,320,710 | 2,320,710 |
| Qdrant payload sample (10 random) | All updated | 10/10 OK |
| Remaining dirs in `_sources/streamecho6/` | 2,259 | 2,259 |
| Moved files exist at target paths (10 random) | All exist | 10/10 OK |

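The random spot check on target paths can be sketched as below. The function name `sample_missing_targets` and the idea of passing in a flat list of target paths are assumptions; the structure of the resweep plan JSON is not shown in this report.

```python
import random
from pathlib import Path
from typing import Iterable, List, Optional


def sample_missing_targets(
    targets: Iterable[str], k: int = 10, seed: Optional[int] = None
) -> List[str]:
    """Pick up to k target paths at random and return those that are not files.

    An empty result corresponds to the "10/10 OK" row in the table.
    """
    pool = list(targets)
    rng = random.Random(seed)
    sample = rng.sample(pool, min(k, len(pool)))
    return [t for t in sample if not Path(t).is_file()]
```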
---
## Sample Moved Transcripts

| Source | Target | Domain |
|--------|--------|--------|
| `.../roger-wakefield/Real Plumber Reacts to Laborers Work__8d6e410e` | `/mnt/library/Foundational-Skills/Plumbing/Real Plumber Reacts to Laborer's Work.txt` | Foundational Skills / Plumbing |
| `.../pine-hollow-auto/This SHOULD Be Easy...Bonneville No Speedo - Part 2__5a824321` | `/mnt/library/Sustainment-Systems/Automotive/This SHOULD Be Easy.txt` | Sustainment Systems / Automotive |
| `.../greatscott/Electronic Basics 6 Standalone Arduino Circuit__292055be` | `/mnt/library/Communications/Microcontrollers/Electronic Basics #6 Standalone Arduino Circuit.txt` | Communications / Microcontrollers |
| `.../forgotten-weapons/Prototype Silenced Sten Mk4S at the Range__a37f0683` | `/mnt/library/Defense-and-Tactics/Firearms/Prototype Silenced Sten Mk4(S) at the Range.txt` | Defense & Tactics / Firearms |
| `.../huw-richards/Chop Drop for Tomatoes Polyculture Plantings...__bae64ca0` | `/mnt/library/Off-grid-Systems/Gardening/Chop & Drop for Tomatoes & Polyculture Plantings...txt` | Off-Grid Systems / Gardening |

---

## Anomalies

- **Qdrant yellow from chunk 2 onward:** Expected for batch payload updates on a 2.3M-point collection. Optimizer healthy, points stable.
- **18 intra-plan path collisions:** Resolved before execution by appending a `[hash6]` suffix to duplicate target filenames. The collisions came from same-titled videos across different channels (e.g., multiple "untitled" transcripts).
- **2,259 unclassifiable transcripts:** These have no usable domain classification (0 concepts or ambiguous), typically trivially short or non-knowledge content such as vlogs, pranks, and music videos. Left at `_sources/` for potential future re-enrichment.

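The collision fix described in the list above can be illustrated with a small sketch. Several details are assumptions rather than facts from the run: that `hash6` is the first six characters of the source directory's hash, that the suffix is written in square brackets before the extension, and that every member of a colliding group (not only the later ones) receives a suffix. The function name `resolve_collisions` is also hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import List, Tuple


def resolve_collisions(plan: List[Tuple[str, str]]) -> List[str]:
    """Return target paths with a hash suffix added wherever targets repeat.

    `plan` holds (source_hash, target_path) pairs in plan order.
    """
    counts = Counter(target for _, target in plan)
    resolved = []
    for source_hash, target in plan:
        if counts[target] > 1:
            p = PurePosixPath(target)
            # e.g. "untitled.txt" -> "untitled [8d6e41].txt"
            target = str(p.with_name(f"{p.stem} [{source_hash[:6]}]{p.suffix}"))
        resolved.append(target)
    return resolved
```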
---
## No Code Changes

Phase 5a is pure data migration. No files in the recon repo were modified or committed.