refactored-recon/phases/phase-5b-transcript-unprocess.md
2026-04-14 18:19:35 +00:00


Phase 5b: Transcript Unprocess

Executed: 2026-04-14, 17:35 to 18:00 UTC


Backup

| Item | Location | MD5 Hash / Status |
|---|---|---|
| recon.db (pre-Phase 5b) | CT 130: /tmp/recon.db.phase5b.20260414.bak | ba114d3d733f15c9c292f7daca1052b9 |
| recon.db (pre-Phase 5b) | cortex: /tmp/recon.db.phase5b.20260414.bak | ba114d3d733f15c9c292f7daca1052b9 |
| Qdrant baseline | cortex:6333 recon_knowledge_hybrid | status=green, 2,320,710 points |
| Unprocess list | CT 130: /tmp/transcript_unprocess_list.20260414.json | |

What This Phase Does

This phase takes the 2,259 transcripts flagged skip_unclassified_phase5a (no viable domain classification) and resets them to a clean, unprocessed state so the refactored pipeline can re-process them. It does so in two steps:

  1. Stage: Concatenate each transcript's page files into acquired/stream/{hash}.txt and copy its meta.json to acquired/stream/{hash}.meta.json
  2. Delete: Remove the DB rows (catalogue + documents), Qdrant vectors, source directories, and concepts directories

End state: 2,259 content+sidecar pairs sitting in the hopper, ready for the dispatcher to pick up in Phase 5c.
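The staging step can be sketched roughly as follows. This is an illustration only, not the script that was run: the function name `stage_transcript`, the `page_*.txt` naming pattern, and the newline written after each page are assumptions; only the output names `{hash}.txt` and `{hash}.meta.json` come from this report.

```python
import shutil
from pathlib import Path


def stage_transcript(source_dir: Path, hopper_dir: Path, doc_hash: str) -> tuple[Path, Path]:
    """Concatenate page files into one .txt and copy meta.json as a sidecar."""
    hopper_dir.mkdir(parents=True, exist_ok=True)
    content_path = hopper_dir / f"{doc_hash}.txt"
    sidecar_path = hopper_dir / f"{doc_hash}.meta.json"

    # Assumption: page files are named page_*.txt and sort into reading order.
    pages = sorted(source_dir.glob("page_*.txt"))
    if not pages:
        raise FileNotFoundError(f"no page files in {source_dir}")

    with content_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))
            out.write("\n")  # assumption: pages are separated by a newline

    # The sidecar is a byte-for-byte copy of the original metadata.
    shutil.copyfile(source_dir / "meta.json", sidecar_path)
    return content_path, sidecar_path
```

Staging before deleting means a failure in this step leaves the sources untouched, which is presumably why the two steps are kept separate.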


Discovery: Skip Entries Have Qdrant Vectors

Pre-flight revealed that all 2,259 skip_unclassified entries had vectors_inserted > 0 in the DB (total: 11,450 points, avg 5.1/doc). These were embedded by the original pipeline before classification was attempted.

Qdrant count queries confirmed that the vectors exist in the collection. They must be deleted alongside the DB rows to avoid orphaned data, so Qdrant point deletion was added to the execution script.
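For illustration, a count-then-delete by payload filter against Qdrant's REST API could be built as below. The sketch only constructs the request URLs and JSON bodies and sends nothing. The `http://` scheme and the payload field name `doc_hash` are assumptions; the report does not say how vectors are linked to their source document.

```python
QDRANT_URL = "http://cortex:6333"      # host:port from the backup table; scheme assumed
COLLECTION = "recon_knowledge_hybrid"
PAYLOAD_KEY = "doc_hash"               # hypothetical payload field identifying the document


def doc_filter(doc_hash: str) -> dict:
    """Qdrant filter matching every point that belongs to one document."""
    return {"must": [{"key": PAYLOAD_KEY, "match": {"value": doc_hash}}]}


def count_request(doc_hash: str) -> tuple[str, dict]:
    """URL and body for an exact count of a document's points."""
    url = f"{QDRANT_URL}/collections/{COLLECTION}/points/count"
    return url, {"filter": doc_filter(doc_hash), "exact": True}


def delete_request(doc_hash: str) -> tuple[str, dict]:
    """URL and body for deleting a document's points, waiting for completion."""
    url = f"{QDRANT_URL}/collections/{COLLECTION}/points/delete?wait=true"
    return url, {"filter": doc_filter(doc_hash)}
```

Both requests are sent as POST with a JSON body. Counting before and after the delete gives a per-document check that complements the collection-wide point total.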


Plan Summary

| Metric | Count |
|---|---|
| Transcripts to unprocess | 2,259 |
| Unique channels | 113 |
| Total page files | 3,444 |
| Total Qdrant vectors to delete | 11,450 |

Execution

Section 2: Stage Hopper

Staged all 2,259 transcripts to acquired/stream/ in 23 chunks of up to 100 (22 full chunks plus a final chunk of 59).

  • Time: 0.4 seconds (5,462 entries/sec)
  • Errors: 0
  • Result: 2,259 .txt + 2,259 .meta.json in hopper

Section 3: Delete Sources, DB Rows, Qdrant Vectors

Processed in 23 chunks of up to 100, verifying DB row counts and Qdrant health at each chunk boundary.

  • Time: 2.2 seconds (1,017 entries/sec)
  • Errors: 0
  • DB rows deleted: 2,259 catalogue + 2,259 documents
  • Qdrant points deleted: 11,450 (2,320,710 → 2,309,260)
  • Source directories deleted: 2,259
  • Qdrant stayed green/ok throughout
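The chunked execution with a check at every boundary can be sketched as below. The names `chunked` and `run_in_chunks` and the callback signatures are hypothetical; in the real run the `verify` callback would compare DB counts and Qdrant status against expected values.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(
    items: Sequence[str],
    size: int,
    process: Callable[[Sequence[str]], None],
    verify: Callable[[int], bool],
) -> int:
    """Process items chunk by chunk, stopping at the first failed check.

    `verify` receives the running total of processed items and returns
    False to abort. Returns the number of chunks that were run.
    """
    processed = 0
    chunks_run = 0
    for chunk in chunked(items, size):
        process(chunk)
        processed += len(chunk)
        chunks_run += 1
        if not verify(processed):
            raise RuntimeError(f"verification failed after chunk {chunks_run}")
    return chunks_run
```

Stopping at the first failed boundary check limits the damage from a bad run to at most one chunk of entries.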

NFS Root Squash Issue

The initial deletion script ran as root (via sudo). Because the NFS export uses root_squash, root on the client is mapped to an unprivileged user and could not delete the zvx-owned files under _sources/streamecho6/. shutil.rmtree(ignore_errors=True) silently swallowed the resulting permission errors.

Fix: Ran a separate deletion script as the zvx user. All 2,259 source dirs deleted successfully. Also removed all 131 now-empty channel directories.
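One way to keep such failures visible is to drop `ignore_errors=True` and collect the errors instead. The sketch below is not the script that was used, and the helper names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def remove_tree_checked(path: Path) -> Optional[str]:
    """Delete a directory tree; return an error message rather than hiding it."""
    try:
        shutil.rmtree(path)  # without ignore_errors, PermissionError propagates
    except OSError as exc:
        return f"{path}: {exc}"
    if path.exists():
        return f"{path}: still present after rmtree"
    return None


def remove_all(paths: list[Path]) -> list[str]:
    """Attempt every deletion and return the list of failure messages."""
    return [msg for p in paths if (msg := remove_tree_checked(p)) is not None]
```

A non-empty result from `remove_all` would have flagged the root_squash problem during the deletion pass instead of during later verification.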


Post-Execution Verification

| Check | Expected | Actual |
|---|---|---|
| Catalogue count | 27,553 | 27,553 |
| Documents count | 27,553 | 27,553 |
| skip_unclassified remaining | 0 | 0 |
| Orphaned documents | 0 | 0 |
| Orphaned catalogue | 0 | 0 |
| Hopper .txt files | 2,259 | 2,259 |
| Hopper .meta.json files | 2,259 | 2,259 |
| Hopper spot check (10 random) | All valid | 10/10 OK |
| _sources/streamecho6 transcript dirs | 0 | 0 |
| _sources/streamecho6 channel dirs | 0 | 0 |
| Qdrant status | green/ok | green/ok |
| Qdrant points | 2,309,260 | 2,309,260 |
| Qdrant spot check (5 random deleted) | 0 points each | 5/5 confirmed |
| Services | inactive | inactive |
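The DB-side checks could be expressed as queries like the ones below. This is a sketch under several assumptions: that recon.db is a SQLite database, that catalogue and documents share a `hash` column, and that the skip flag lives in a `status` column. Only the two table names and the flag value come from this report.

```python
import sqlite3

# Column names `hash` and `status` are assumptions made for this sketch.
CHECKS = {
    "catalogue_count": "SELECT COUNT(*) FROM catalogue",
    "documents_count": "SELECT COUNT(*) FROM documents",
    "skip_unclassified_remaining":
        "SELECT COUNT(*) FROM catalogue "
        "WHERE status = 'skip_unclassified_phase5a'",
    "orphaned_documents":
        "SELECT COUNT(*) FROM documents d "
        "LEFT JOIN catalogue c ON c.hash = d.hash WHERE c.hash IS NULL",
    "orphaned_catalogue":
        "SELECT COUNT(*) FROM catalogue c "
        "LEFT JOIN documents d ON d.hash = c.hash WHERE d.hash IS NULL",
}


def run_checks(conn: sqlite3.Connection) -> dict[str, int]:
    """Run every check query and return its single integer result."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


def failed_checks(actual: dict[str, int], expected: dict[str, int]) -> list[str]:
    """Names of checks whose actual value differs from the expected one."""
    return [name for name, want in expected.items() if actual.get(name) != want]
```

With the expected values from the table above (27,553 rows in each table, zero orphans, zero remaining skip entries), an empty result from `failed_checks` corresponds to a clean verification.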

Anomalies

  • NFS root_squash: sudo python3 could not delete zvx-owned files on the NFS mount, so a second pass running as zvx was required. shutil.rmtree(ignore_errors=True) masked the failure, which was caught during verification.
  • Vectors existed (correction): The initial assumption was that these entries had 0 Qdrant vectors. All 2,259 actually had vectors (11,450 total), so deletion was added to the execution scope.

No Code Changes

Phase 5b is a pure data migration. No files in the recon repo were modified or committed.