From d1cde5a56dc27a3aae1f168b6d335d0fd24223ed Mon Sep 17 00:00:00 2001
From: Matt
Date: Thu, 16 Apr 2026 05:21:17 +0000
Subject: [PATCH] PROJECT-BIBLE: bring refactor history current through Phase 6k

Updates:
- Fix Phase 5a description (was incorrectly describing the un-file)
- Fix Phase 5b description (2,259 drain cohort)
- Add Phase 6f (text processor)
- Add Phase 6f-2 (format normalizer)
- Add Phase 6g (Gemini null bug fix)
- Add Phase 6h (STATE 2 cleanup + PeerTube transcription trigger)
- Add Phase 6i (dashboard upload migration, multi-format)
- Add Phase 6j (library cleanup, 51G freed)
- Add Phase 6k (Phase 5a un-file, 16,340 transcripts restored)
- Update Open Follow-ups with backlog items identified through Phase 6k
- Update footer to reflect refactor feature-complete state
---
 PROJECT-BIBLE.md | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/PROJECT-BIBLE.md b/PROJECT-BIBLE.md
index c9edfc5..666dc12 100644
--- a/PROJECT-BIBLE.md
+++ b/PROJECT-BIBLE.md
@@ -500,8 +500,8 @@ implementations are in the RECON repo; design lives here.
 | 2 | Shared filing function — extract organizer logic into `filing.py` |
 | 3 | Transcript processor — first end-to-end test of the new pattern |
 | 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
-| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
-| 5b | Transcript unprocess — clean up stale rows and processing dirs |
+| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library///` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
+| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
 | 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
 | 5c-2 | Service start & transcript drain — clear the hopper backlog |
 | 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
@@ -509,6 +509,13 @@ implementations are in the RECON repo; design lives here.
 | 6c | Code cleanup — dead-code audit |
 | 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
 | 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
+| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
+| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
+| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
+| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
+| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
+| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
+| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |
 
 ### Baseline pre-refactor (per `current-state.md`)
 - 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
@@ -609,9 +616,9 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
 
 ## 18. Open Follow-ups
 - **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
-  library paths post Phase 5a (audit trail at
-  `/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
-  tombstone.
+  library paths post Phase 5a/6k (audit trail at
+  `/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
+  hand-resolve or tombstone.
 - **HTML processor** (`lib/processors/html_processor.py`) is scaffolded in
   config but not implemented. Next-up for Kiwix / web ingest.
 - **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
@@ -623,7 +630,24 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
   needs a redesign before reinstating.
 - **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
   yet; items pile up silently.
+- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
+  extraction output from the pre-refactor pipeline, for documents
+  still in catalogue. Not touched by current pipeline. Can be cleaned
+  up once confirmed none are the sole text copy for any document.
+- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
+  library management CLI tool, not the refactor's new pipeline.
+  Contains `update_qdrant_payload` helper that filing worker depends
+  on. Should be renamed (e.g., `library_ops.py`) when there's time.
+- **SSH key for CT 130 forge access** — currently uses HTTPS with
+  embedded token in remote URL. Move to SSH key auth.
+- **Backup policy for derived data** — `/opt/recon/data/concepts/` and
+  Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
+  lose their disks, these are the hardest to regenerate (Gemini calls
+  + embedding compute).
+- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
+  files, not library content. Matt intends these to "eventually
+  contribute" to the knowledge base but no ingestion path exists yet.
 
 ---
 
-*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*
+*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*