PROJECT-BIBLE: bring refactor history current through Phase 6k

Updates:
- Fix Phase 5a description (was incorrectly describing the un-file)
- Fix Phase 5b description (2,259 drain cohort)
- Add Phase 6f (text processor)
- Add Phase 6f-2 (format normalizer)
- Add Phase 6g (Gemini null bug fix)
- Add Phase 6h (STATE 2 cleanup + PeerTube transcription trigger)
- Add Phase 6i (dashboard upload migration, multi-format)
- Add Phase 6j (library cleanup, 51G freed)
- Add Phase 6k (Phase 5a un-file, 16,340 transcripts restored)
- Update Open Follow-ups with backlog items identified through Phase 6k
- Update footer to reflect refactor feature-complete state
Matt 2026-04-16 05:21:17 +00:00
commit d1cde5a56d

@@ -500,8 +500,8 @@ implementations are in the RECON repo; design lives here.
| 2 | Shared filing function — extract organizer logic into `filing.py` |
| 3 | Transcript processor — first end-to-end test of the new pattern |
| 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/<uuid>` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
| 5b | Transcript unprocess — clean up stale rows and processing dirs |
| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
| 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
| 5c-2 | Service start & transcript drain — clear the hopper backlog |
| 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
@@ -509,6 +509,13 @@ implementations are in the RECON repo; design lives here.
| 6c | Code cleanup — dead-code audit |
| 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
| 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |
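
The 6g fix in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in `pdf_processor` or `text_processor`: the function name `drop_null_strings`, the sample field names, and the choice to also drop real JSON `null` values and to match case-insensitively are all assumptions made for the example.

```python
import json
from typing import Any


def drop_null_strings(metadata: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `metadata` without placeholder values.

    Hypothetical helper: removes the literal string "null" (the case the
    6g fix targets) and, as an extra assumption, real None values too,
    so neither can win a metadata vote.
    """
    cleaned: dict[str, Any] = {}
    for key, value in metadata.items():
        if value is None:
            continue  # real JSON null, decoded by json.loads as None
        if isinstance(value, str) and value.strip().lower() == "null":
            continue  # the model wrote the word "null" as a string
        cleaned[key] = value
    return cleaned


# Sample response shape is invented for illustration only.
raw_response = '{"title": "Field Manual", "author": "null", "year": null}'
candidate = drop_null_strings(json.loads(raw_response))
print(candidate)
```

Running the filter before voting means a source that returned `"null"` simply abstains for that field instead of contributing a bogus candidate value.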
### Baseline pre-refactor (per `current-state.md`)
- 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
@@ -609,9 +616,9 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
## 18. Open Follow-ups
- **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
library paths post Phase 5a (audit trail at
`/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
tombstone.
library paths post Phase 5a/6k (audit trail at
`/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
hand-resolve or tombstone.
- **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
in config but not implemented. Next-up for Kiwix / web ingest.
- **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
@@ -623,7 +630,24 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
needs a redesign before reinstating.
- **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
yet; items pile up silently.
- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
extraction output from the pre-refactor pipeline, for documents
still in catalogue. Not touched by current pipeline. Can be cleaned
up once confirmed none are the sole text copy for any document.
- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
library management CLI tool, not the refactor's new pipeline.
Contains `update_qdrant_payload` helper that filing worker depends
on. Should be renamed (e.g., `library_ops.py`) when there's time.
- **SSH key for CT 130 forge access** — currently uses HTTPS with
embedded token in remote URL. Move to SSH key auth.
- **Backup policy for derived data.** `/opt/recon/data/concepts/` and
Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
loses its disks, these are the hardest to regenerate (Gemini calls
+ embedding compute).
- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
files, not library content. Matt intends these to "eventually
contribute" to the knowledge base but no ingestion path exists yet.
---
*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*
*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*