PROJECT-BIBLE: bring refactor history current through Phase 6k

Updates:
- Fix Phase 5a description (was incorrectly describing the un-file)
- Fix Phase 5b description (2,259 drain cohort)
- Add Phase 6f (text processor)
- Add Phase 6f-2 (format normalizer)
- Add Phase 6g (Gemini null bug fix)
- Add Phase 6h (STATE 2 cleanup + PeerTube transcription trigger)
- Add Phase 6i (dashboard upload migration, multi-format)
- Add Phase 6j (library cleanup, 51G freed)
- Add Phase 6k (Phase 5a un-file, 16,340 transcripts restored)
- Update Open Follow-ups with backlog items identified through Phase 6k
- Update footer to reflect refactor feature-complete state
2026-05-20 06:34:34 +02:00 · 2026-04-16 05:21:17 +00:00 · d1cde5a56d
commit d1cde5a56d
parent c9a8f1ecb5
1 changed file with 30 additions and 6 deletions
--- a/PROJECT-BIBLE.md
+++ b/PROJECT-BIBLE.md
@@ -500,8 +500,8 @@ implementations are in the RECON repo; design lives here.
 | 2 | Shared filing function — extract organizer logic into `filing.py` |
 | 3 | Transcript processor — first end-to-end test of the new pattern |
 | 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
-| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/<uuid>` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
-| 5b | Transcript unprocess — clean up stale rows and processing dirs |
+| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
+| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
 | 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
 | 5c-2 | Service start & transcript drain — clear the hopper backlog |
 | 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
@@ -509,6 +509,13 @@ implementations are in the RECON repo; design lives here.
 | 6c | Code cleanup — dead-code audit |
 | 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
 | 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
+| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
+| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
+| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
+| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
+| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
+| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
+| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |

 ### Baseline pre-refactor (per `current-state.md`)
 - 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
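
Note on the Phase 6g row: the fix filters the literal string `"null"` out of Gemini's JSON responses before metadata voting. The sketch below illustrates that idea only; it is not the code in `pdf_processor` or `text_processor`. The function name `drop_null_strings`, the case-insensitive comparison, and the dropping of real JSON `null` values are all assumptions made for illustration.

```python
import json
from typing import Any


def drop_null_strings(metadata: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `metadata` without placeholder "null" values.

    Hypothetical helper: removes keys whose value is None or the string
    "null" (compared case-insensitively after stripping whitespace, which
    is an assumption), so they cannot win a metadata vote.
    """
    cleaned: dict[str, Any] = {}
    for key, value in metadata.items():
        if value is None:
            continue  # real JSON null
        if isinstance(value, str) and value.strip().lower() == "null":
            continue  # the literal string "null"
        cleaned[key] = value
    return cleaned


# Example response body; the field names are made up for this sketch.
raw = '{"title": "Field Manual", "author": "null", "year": null}'
candidate = drop_null_strings(json.loads(raw))
print(candidate)
```

Filtering before the vote, rather than after, keeps a `"null"` string from being counted as a real candidate value against the filename-derived metadata.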
@@ -609,9 +616,9 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
 ## 18. Open Follow-ups

 - **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
-  library paths post Phase 5a (audit trail at
-  `/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
-  tombstone.
+  library paths post Phase 5a/6k (audit trail at
+  `/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
+  hand-resolve or tombstone.
 - **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
  in config but not implemented. Next-up for Kiwix / web ingest.
 - **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
@@ -623,7 +630,24 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
  needs a redesign before reinstating.
 - **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
  yet; items pile up silently.
+- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
+  extraction output from the pre-refactor pipeline, for documents
+  still in catalogue. Not touched by current pipeline. Can be cleaned
+  up once confirmed none are the sole text copy for any document.
+- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
+  library management CLI tool, not the refactor's new pipeline.
+  Contains `update_qdrant_payload` helper that filing worker depends
+  on. Should be renamed (e.g., `library_ops.py`) when there's time.
+- **SSH key for CT 130 forge access** — currently uses HTTPS with
+  embedded token in remote URL. Move to SSH key auth.
+- **Backup policy for derived data** — `/opt/recon/data/concepts/` and
+  Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
+  lose their disks, these are the hardest to regenerate (Gemini calls
+  + embedding compute).
+- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
+  files, not library content. Matt intends these to "eventually
+  contribute" to the knowledge base but no ingestion path exists yet.

 ---

-*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*
+*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*
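
Note on the Phase 6k row and the first Section 18 follow-up: the un-file matched catalogue rows to PeerTube videos by title and left MULTI_MATCH and UNMATCHED rows for later review. A minimal sketch of that three-way classification follows, assuming exact matching on a normalized title. The names `normalize` and `classify_by_title`, the `MATCHED` label, and the normalization rule (case-folding plus whitespace collapsing) are hypothetical; the actual matching logic is not shown in this commit.

```python
from collections import defaultdict


def normalize(title: str) -> str:
    """Case-fold and collapse whitespace (an assumed normalization rule)."""
    return " ".join(title.casefold().split())


def classify_by_title(catalogue_titles, videos):
    """Classify each catalogue title against a list of (uuid, name) videos.

    Returns {title: (status, uuid_or_None)} where status is "MATCHED"
    (exactly one video shares the title), "MULTI_MATCH" (several do),
    or "UNMATCHED" (none do).
    """
    by_title = defaultdict(list)
    for uuid, name in videos:
        by_title[normalize(name)].append(uuid)

    result = {}
    for title in catalogue_titles:
        uuids = by_title.get(normalize(title), [])
        if len(uuids) == 1:
            result[title] = ("MATCHED", uuids[0])
        elif uuids:
            result[title] = ("MULTI_MATCH", None)
        else:
            result[title] = ("UNMATCHED", None)
    return result


# Made-up sample data, for illustration only.
videos = [("u1", "Water Filter Basics"), ("u2", "Radio Setup"), ("u3", "Radio Setup")]
report = classify_by_title(["water filter  basics", "Radio Setup", "Knots"], videos)
for title, (status, uuid) in report.items():
    print(status, uuid, title)
```

Only rows with a single unambiguous match would have their path rewritten to a `stream.echo6.co/w/<uuid>` watch URL. The other two statuses carry no UUID, so the affected rows stay untouched until someone hand-resolves or tombstones them.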