Mirror of https://github.com/zvx-echo6/refactored-recon.git, synced 2026-05-20 14:44:39 +02:00
PROJECT-BIBLE: bring refactor history current through Phase 6k
Updates:
- Fix Phase 5a description (was incorrectly describing the un-file)
- Fix Phase 5b description (2,259 drain cohort)
- Add Phase 6f (text processor)
- Add Phase 6f-2 (format normalizer)
- Add Phase 6g (Gemini null bug fix)
- Add Phase 6h (STATE 2 cleanup + PeerTube transcription trigger)
- Add Phase 6i (dashboard upload migration, multi-format)
- Add Phase 6j (library cleanup, 51G freed)
- Add Phase 6k (Phase 5a un-file, 16,340 transcripts restored)
- Update Open Follow-ups with backlog items identified through Phase 6k
- Update footer to reflect refactor feature-complete state
parent c9a8f1ecb5
commit d1cde5a56d
1 changed file with 30 additions and 6 deletions
@@ -500,8 +500,8 @@ implementations are in the RECON repo; design lives here.
 | 2 | Shared filing function — extract organizer logic into `filing.py` |
 | 3 | Transcript processor — first end-to-end test of the new pattern |
 | 4 | PDF processor — layered A/B/C metadata vote, level-4 dedupe |
-| 5a | Transcript resweep — 16,340 transcripts migrated from `library/*.txt` path to `stream.echo6.co/w/<uuid>` watch URLs; catalogue/documents/Qdrant all updated atomically, physical `.txt` files deleted |
-| 5b | Transcript unprocess — clean up stale rows and processing dirs |
+| 5a | Transcript resweep — 16,596 transcripts moved from `/mnt/library/_sources/streamecho6/` into `/mnt/library/<Domain>/<Subdomain>/` via concept-driven domain classification; 2,259 skipped as unclassified (these became the 5b drain cohort) |
+| 5b | Transcript unprocess — 2,259 skip_unclassified transcripts staged into `data/acquired/stream/` as `.txt`+`.meta.json` pairs; DB rows deleted, Qdrant vectors removed, source dirs cleaned |
 | 5c-1 | Service loop rewire — retire old scan_library thread, wire dispatcher in |
 | 5c-2 | Service start & transcript drain — clear the hopper backlog |
 | 6a | Transcript organized-in-place — set `organized_at` during pre_flight so filing worker ignores transcripts |
@@ -509,6 +509,13 @@ implementations are in the RECON repo; design lives here.
 | 6c | Code cleanup — dead-code audit |
 | 6d | PeerTube acquisition module — replace ad-hoc ingester with `acquisition/peertube.py` |
 | 6e | ShadowLib skill + dashboard PeerTube endpoint cleanup (partial — 6e-2 reverted) |
+| 6f | Text processor — new `lib/processors/text_processor.py` handles `.txt` files with two-source metadata voting (filename + Gemini); new `data/acquired/text/` hopper subfolder; files to library like PDFs |
+| 6f-2 | Format normalizer in dispatcher — converts `.epub`/`.mobi` to PDF via Calibre's `ebook-convert`, `.doc`/`.docx` via `libreoffice --headless`, called per-subfolder before `_find_pairs()` |
+| 6g | Gemini "null" string bug fix — both `pdf_processor` and `text_processor` now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting |
+| 6h | STATE 2 transcript cleanup — deleted 283 zero-vector transcripts (DB rows, concepts, local text, Qdrant entries) and 1,198 orphan dirs in `data/text/`; triggered PeerTube transcription for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate` |
+| 6i | Dashboard upload migration — `POST /api/upload` now routes by extension to the appropriate hopper (pdf/text) with `.meta.json` sidecar, supports PDF/TXT/EPUB/DOC/DOCX/MOBI, removed direct library copy and `add_to_catalogue`/`queue_document` calls, added status endpoint fallback that checks `acquired/` and `processing/` dirs for the upload/dispatch gap |
+| 6j | Library cleanup — ~51G freed; 398 duplicate PDFs deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes); 2,274 non-PDF SCL files deleted (user confirmed backups); 57 files in 3 ghost domain folders (Community-Coordination, Leadership, Scenario-Playbooks) refiled through new pipeline; 201 unclassified SCL PDFs refiled; 1,240 `_unclassified/` PDFs refiled; `_ingest/_duplicates/` cleared; 5 loose root PDFs staged |
+| 6k | Phase 5a un-file — 16,340 of the 16,596 Phase 5a-filed transcripts had their `catalogue.path` restored from library filesystem path back to PeerTube watch URL via title-matching against PeerTube's video list (98.6% match rate); physical `.txt` files deleted from library; Qdrant `download_url` payload updated; 4,955 empty dirs cleaned up; 223 edge cases (82 MULTI_MATCH + 141 UNMATCHED) documented for later review |

 ### Baseline pre-refactor (per `current-state.md`)
 - 18,855 transcripts in `/mnt/library/_sources/streamecho6/`.
@@ -609,9 +616,9 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
 ## 18. Open Follow-ups

 - **82 MULTI_MATCH + 141 UNMATCHED** transcript rows still carry
-library paths post Phase 5a (audit trail at
-`/tmp/phase5a_remaining.txt` on CT 130). Either hand-resolve or
-tombstone.
+library paths post Phase 5a/6k (audit trail at
+`/tmp/phase5a_remaining.txt` on CT 130 — file still present). Either
+hand-resolve or tombstone.
 - **HTML processor** (`lib/processors/html_processor.py`) is scaffolded
 in config but not implemented. Next-up for Kiwix / web ingest.
 - **Crawler re-architecture.** The tier-1 sites list in `config.yaml`
@@ -623,7 +630,24 @@ RECON Gemini/PeerTube keys: `/opt/recon/.env` on CT 130.
 needs a redesign before reinstating.
 - **Level-4 dedupe review queue** (`duplicate_review` table) has no UI
 yet; items pile up silently.
+- **9,478 legacy dirs in `/opt/recon/data/text/`** — historical
+extraction output from the pre-refactor pipeline, for documents
+still in catalogue. Not touched by current pipeline. Can be cleaned
+up once confirmed none are the sole text copy for any document.
+- **`lib/new_pipeline.py` is misleadingly named** — it's actually a
+library management CLI tool, not the refactor's new pipeline.
+Contains `update_qdrant_payload` helper that filing worker depends
+on. Should be renamed (e.g., `library_ops.py`) when there's time.
+- **SSH key for CT 130 forge access** — currently uses HTTPS with
+embedded token in remote URL. Move to SSH key auth.
+- **Backup policy for derived data** — `/opt/recon/data/concepts/` and
+Qdrant snapshots are not in any backup rotation. If CT 130 or cortex
+lose their disks, these are the hardest to regenerate (Gemini calls
++ embedding compute).
+- **`signal-archive/` in `/mnt/library/`** — 44 Signal/Matrix chat log
+files, not library content. Matt intends these to "eventually
+contribute" to the knowledge base but no ingestion path exists yet.

 ---

-*Last updated: 2026-04-15 — Phase 5a transcript un-file complete, Phase 6e partial. Living document; edit in place as the system evolves.*
+*Last updated: 2026-04-15 — Refactor feature-complete. Phases 0 through 6k landed. Service operational with 7 daemon threads. Outstanding: 223 edge-case transcripts (see Section 18), HTML processor (scaffolded, not implemented), crawler re-architecture (deferred). Living document; edit in place as the system evolves.*
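Note on the Phase 6g entry in the diff above: it says both processors now filter the literal string `"null"` out of Gemini's JSON responses before metadata voting. The processor code is not part of this commit, so the following is only a minimal sketch of what such a filter could look like. The function name `drop_null_strings`, the sample response, and the choice to also drop real JSON nulls and to match case-insensitively are assumptions made for illustration.

```python
from typing import Any, Dict


def drop_null_strings(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `fields` without JSON nulls or the literal string "null".

    Hypothetical helper: the real pdf_processor/text_processor code may differ.
    """
    cleaned: Dict[str, Any] = {}
    for key, value in fields.items():
        if value is None:
            continue  # a real JSON null (assumed to be dropped as well)
        if isinstance(value, str) and value.strip().lower() == "null":
            continue  # the string "null" standing in for a missing value
        cleaned[key] = value
    return cleaned


# Hypothetical Gemini response, already parsed with json.loads().
response = {"title": "Field Sanitation", "author": "null", "year": None}
candidates = drop_null_strings(response)
```

Filtering before the vote means a field such as `author` is simply absent from the Gemini source rather than competing as the value `"null"` against the filename-derived candidate.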
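Note on the Phase 6f-2 entry in the diff above: it describes a format normalizer that converts `.epub`/`.mobi` via Calibre's `ebook-convert` and `.doc`/`.docx` via `libreoffice --headless`. As a rough sketch of that step, the helper below only builds the command line for a given file and leaves execution to the caller. The name `conversion_command`, writing the PDF next to the source file, and the `--convert-to pdf --outdir` arguments are assumptions; the commit does not show the exact invocation the dispatcher uses.

```python
from pathlib import Path
from typing import List, Optional

CALIBRE_EXTS = {".epub", ".mobi"}   # handled by Calibre's ebook-convert
OFFICE_EXTS = {".doc", ".docx"}     # handled by headless LibreOffice


def conversion_command(src: Path) -> Optional[List[str]]:
    """Return an argv list that converts `src` to PDF, or None if no conversion applies.

    Hypothetical helper; output location and flags are assumptions.
    """
    ext = src.suffix.lower()
    if ext in CALIBRE_EXTS:
        # ebook-convert takes the input path followed by the output path.
        return ["ebook-convert", str(src), str(src.with_suffix(".pdf"))]
    if ext in OFFICE_EXTS:
        # LibreOffice writes <stem>.pdf into the directory given by --outdir.
        return [
            "libreoffice", "--headless",
            "--convert-to", "pdf",
            "--outdir", str(src.parent),
            str(src),
        ]
    return None  # already a PDF/TXT, or an unsupported type


cmd = conversion_command(Path("/tmp/hopper/manual.epub"))
```

A caller would typically pass a non-`None` result to `subprocess.run(cmd, check=True)` and remove the source file only after the PDF exists, so a failed conversion leaves the original in the hopper.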