diff --git a/phases/phase-6c-code-cleanup.md b/phases/phase-6c-code-cleanup.md
new file mode 100644
index 0000000..7a6079e
--- /dev/null
+++ b/phases/phase-6c-code-cleanup.md
@@ -0,0 +1,110 @@
+# Phase 6c: Code Cleanup
+
+## Objective
+
+Remove dead code paths left over from the refactor. Investigation first,
+deletion second: remove only what is confirmed dead.
+
+## Investigation Findings
+
+### Expected dead code vs reality
+
+| Item | Expected status | Actual status |
+|------|----------------|---------------|
+| `scanner_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
+| `peertube_scanner_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
+| `crawler_scheduler_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
+| `organizer_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
+| Extract worker thread | Vestigial in cmd_service() | **Confirmed dead**: 0 items queued, silent 24h+ |
+| `lib/crawler.py` | Legacy module | **Confirmed dead**: only used by CLI subcommand |
+| `lib/web_scraper.py` | Legacy module | **ALIVE**: `chunk_text()` used by transcript_processor |
+| `lib/new_pipeline.py` | Legacy module | **ALIVE**: active Stream B library management tool (1,637 lines, created Apr 13) |
+| `lib/peertube_scraper.py` | Legacy module | **ALIVE**: only mechanism for transcript ingestion |
+| `lib/extractor.py` | Dead module | **ALIVE**: used by `cmd_run` CLI for batch processing |
+
+### Additional findings
+
+- **24 `.bak` files** found across `/opt/recon/` (untracked, manual pre-edit safety backups from Feb-Apr 2026). All originals preserved in git history.
+- **File ownership**: All 21 `.py` files + `recon.py` correctly owned by zvx. No corrections needed.
+- **No TODO/DEPRECATED comments** found in any lib/ file.
+- **All imports in recon.py** confirmed used (no dead imports at module level).
+- **PeerTube transcript ingestion** has no automatic mechanism since Phase 5c-1 removed `peertube_scanner_loop`. Ingestion is manual only (CLI or dashboard API endpoint).
+
+## What Was Removed
+
+### recon.py edits (-89 lines, +3 lines)
+
+1. **Extract worker thread** removed from `cmd_service()`:
+   - `from lib.extractor import run_extraction` import
+   - `extract_workers` variable
+   - `'extract': 0` from totals dict
+   - Extract `threading.Thread(target=stage_loop, ...)` from thread list
+   - Extract workers from startup log message
+
+2. **`cmd_crawl` function** deleted (65 lines): CLI handler for `recon crawl`
+
+3. **Crawl argparse subparser** deleted (15 lines): `recon crawl` subcommand registration
+
+4. **Docstring** updated to remove `crawl` from subcommand list
+
+### Files deleted
+
+| File | Lines | Reason |
+|------|-------|--------|
+| `lib/crawler.py` | 432 | Only referenced by deleted `cmd_crawl` CLI subcommand |
+
+### .bak files deleted (24 files, untracked)
+
+| File | Size |
+|------|------|
+| `recon.py.bak-pre-streamb` | 48K |
+| `recon.py.bak-pre-ux` | 35K |
+| `recon.py.bak-pre-crawler` | 35K |
+| `recon.py.bak.202602171647` | 33K |
+| `config.yaml.bak-pre-crawler` | 4K |
+| `config.yaml.bak-pre-streamb` | 13K |
+| `lib/api.py.bak` + 5 more api.py backups | 498K total |
+| `lib/embedder.py.bak` | 15K |
+| `lib/enricher.py.bak` | 17K |
+| `lib/extractor.py.bak` | 18K |
+| `lib/status.py.bak-pre-ux` | 10K |
+| `lib/status.py.bak-pre-streamb` | 13K |
+| `scripts/validate.py.bak` | 6K |
+| `scripts/rebuild_qdrant.py.bak` | 6K |
+| `static/js/dashboard.js.bak` | 11K |
+| `static/js/peertube.js.bak.20260223` | 5K |
+| `templates/search.html.bak` | 2K |
+| `templates/knowledge/dashboard.html.bak` | 3K |
+
+## What Was Kept (and why)
+
+| Module | Lines | Why kept |
+|--------|-------|----------|
+| `lib/web_scraper.py` | 324 | `transcript_processor.py` imports `chunk_text()` |
+| `lib/new_pipeline.py` | 1,637 | Active Stream B library management CLI (created Apr 13) |
+| `lib/peertube_scraper.py` | 580 | Only way to ingest PeerTube transcripts |
+| `lib/extractor.py` | 601 | Used by `cmd_run` CLI for batch PDF processing |
+
+## Verification
+
+| Check | Result |
+|-------|--------|
+| Compile (recon.py) | OK |
+| Import (recon module) | OK |
+| Import (dispatcher, filing, processors) | OK |
+| cmd_service assertions | extract worker absent, dispatch_loop present, filing_worker_loop present |
+| Zero crawler references in .py files | Confirmed |
+| Service restart | Clean, active |
+| Thread count | 13 tasks (was 14; extract removed) |
+| Threads started | enrich, embed, dispatcher, filing, progress, dashboard, metrics |
+| Extract thread | Absent (confirmed by logs: no `[extract] Stage started`) |
+| Errors (60s window) | 0 |
+| DB rows | catalogue=29,812, documents=29,812 (unchanged) |
+| Dashboard | Responsive |
+| Hopper | Empty |
+
+## Commit
+
+- **Commit:** `efae402` on `refactor` branch
+- **Diff:** 2 files changed, 3 insertions(+), 521 deletions(-)
+- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)
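
The report does not say how the "Zero crawler references in .py files" check in the Verification table was run. One way to reproduce a check of that kind is an import scan over the tree; the following is a minimal sketch, not the method used for this commit. The function name `find_module_references` is hypothetical, and the scan covers absolute `import` and `from ... import` statements only: relative imports, string references, and dynamic imports would still need a plain text search.

```python
import ast
from pathlib import Path


def find_module_references(root, module_name):
    """Return sorted (relative_path, line) pairs where module_name is imported.

    Hypothetical helper for illustration; only absolute import statements
    are inspected.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed as Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # "import lib.crawler" or "import lib.crawler as c"
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                # "from lib.crawler import x" and "from lib import crawler"
                base = node.module or ""
                names = [base] + [
                    f"{base}.{alias.name}" if base else alias.name
                    for alias in node.names
                ]
            else:
                continue
            if any(n == module_name or n.startswith(module_name + ".") for n in names):
                hits.append((path.relative_to(root).as_posix(), node.lineno))
    return sorted(hits)
```

Calling `find_module_references("/opt/recon", "lib.crawler")` before deleting `lib/crawler.py` would list the importing files; an empty list afterwards would be consistent with the "Confirmed" entry in the table.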