refactored-recon/phases/phase-6c-code-cleanup.md

# Phase 6c: Code Cleanup
## Objective
Remove dead code paths left over from the refactor. Investigation first,
deletion second — only remove what's confirmed dead.
## Investigation Findings
### Expected dead code vs reality
| Item | Expected status | Actual status |
|------|----------------|---------------|
| `scanner_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
| `peertube_scanner_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
| `crawler_scheduler_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
| `organizer_loop` | Dead function in recon.py | **Already removed** in Phase 5c-1 |
| Extract worker thread | Vestigial in cmd_service() | **Confirmed dead** — 0 items queued, silent 24h+ |
| `lib/crawler.py` | Legacy module | **Confirmed dead**: only referenced by the `cmd_crawl` handler behind the `recon crawl` CLI subcommand |
| `lib/web_scraper.py` | Legacy module | **ALIVE**: `chunk_text()` used by `transcript_processor.py` |
| `lib/new_pipeline.py` | Legacy module | **ALIVE** — active Stream B library management tool (1,637 lines, created Apr 13) |
| `lib/peertube_scraper.py` | Legacy module | **ALIVE** — only mechanism for transcript ingestion |
| `lib/extractor.py` | Dead module | **ALIVE** — used by `cmd_run` CLI for batch processing |
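
The "who still imports this module?" question behind the table above can be approximated with a small AST scan. This is a minimal sketch, not the procedure actually used in this phase: the sample sources and the `crawl_site` name are hypothetical stand-ins, and only static `import` statements are detected (dynamic imports via `importlib` or string references would be missed).

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return dotted module names imported anywhere in `source`,
    including imports nested inside functions."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
            # `from lib import crawler` names the submodule as an alias.
            modules.update(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def importers_of(target: str, sources: dict[str, str]) -> list[str]:
    """List the files whose source imports `target`."""
    return sorted(
        path for path, src in sources.items() if target in imported_modules(src)
    )


# Hypothetical file contents, for illustration only.
SAMPLE = {
    "recon.py": (
        "def cmd_crawl(args):\n"
        "    from lib.crawler import crawl_site  # hypothetical name\n"
        "    return crawl_site(args.url)\n"
    ),
    "lib/transcript_processor.py": "from lib.web_scraper import chunk_text\n",
    "lib/status.py": "import json\n",
}

print(importers_of("lib.crawler", SAMPLE))
print(importers_of("lib.web_scraper", SAMPLE))
```

A module whose only importer is a handler scheduled for deletion is a removal candidate; one with any other importer stays.
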
### Additional findings
- **24 `.bak` files** found across `/opt/recon/` (untracked, manual pre-edit safety backups from Feb-Apr 2026). All originals preserved in git history.
- **File ownership**: All 21 `.py` files + `recon.py` correctly owned by zvx. No corrections needed.
- **No TODO/DEPRECATED comments** found in any lib/ file.
- **All imports in recon.py** confirmed used (no dead imports at module level).
- **PeerTube transcript ingestion** has no automatic mechanism since Phase 5c-1 removed `peertube_scanner_loop`. Ingestion is manual only (CLI or dashboard API endpoint).
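
A check for dead module-level imports, like the one reported above for `recon.py`, could look roughly like the following. It is a sketch under simplifying assumptions: it only compares module-level import bindings against names loaded anywhere in the file, so re-exports via `__all__` and names used only inside strings are not handled. The sample source is hypothetical.

```python
import ast


def unused_module_imports(source: str) -> list[str]:
    """Return names bound by module-level imports that are never loaded."""
    tree = ast.parse(source)
    bound: set[str] = set()
    for node in tree.body:  # module level only
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                bound.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":
                    bound.add(alias.asname or alias.name)
    used = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    return sorted(bound - used)


# Hypothetical source, for illustration only.
SAMPLE = (
    "import json\n"
    "import os\n"
    "import threading\n"
    "\n"
    "def cmd_service():\n"
    "    t = threading.Thread(target=print)\n"
    "    return os.getpid(), t\n"
)

print(unused_module_imports(SAMPLE))
```
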
## What Was Removed
### recon.py edits (-89 lines, +3 lines)
1. **Extract worker thread** removed from `cmd_service()`:
   - `from lib.extractor import run_extraction` import
   - `extract_workers` variable
   - `'extract': 0` from totals dict
   - Extract `threading.Thread(target=stage_loop, ...)` from thread list
   - Extract workers from startup log message
2. **`cmd_crawl` function** deleted (65 lines) — CLI handler for `recon crawl`
3. **Crawl argparse subparser** deleted (15 lines) — `recon crawl` subcommand registration
4. **Docstring** updated to remove `crawl` from subcommand list
### Files deleted
| File | Lines | Reason |
|------|-------|--------|
| `lib/crawler.py` | 432 | Only referenced by deleted `cmd_crawl` CLI subcommand |
### .bak files deleted (24 files, untracked)
| File | Size |
|------|------|
| `recon.py.bak-pre-streamb` | 48K |
| `recon.py.bak-pre-ux` | 35K |
| `recon.py.bak-pre-crawler` | 35K |
| `recon.py.bak.202602171647` | 33K |
| `config.yaml.bak-pre-crawler` | 4K |
| `config.yaml.bak-pre-streamb` | 13K |
| `lib/api.py.bak` + 5 more api.py backups | 498K total |
| `lib/embedder.py.bak` | 15K |
| `lib/enricher.py.bak` | 17K |
| `lib/extractor.py.bak` | 18K |
| `lib/status.py.bak-pre-ux` | 10K |
| `lib/status.py.bak-pre-streamb` | 13K |
| `scripts/validate.py.bak` | 6K |
| `scripts/rebuild_qdrant.py.bak` | 6K |
| `static/js/dashboard.js.bak` | 11K |
| `static/js/peertube.js.bak.20260223` | 5K |
| `templates/search.html.bak` | 2K |
| `templates/knowledge/dashboard.html.bak` | 3K |
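
The backup names above follow a few shapes (`.bak`, `.bak-<label>`, `.bak.<timestamp>`). A minimal sketch of locating such files is shown below; the regular expression is an assumption inferred from the names in this table, not the command actually used, and the demo runs against a throwaway temporary directory rather than `/opt/recon/`.

```python
import re
import tempfile
from pathlib import Path

# Matches names ending in `.bak`, `.bak-<label>`, or `.bak.<stamp>`.
BACKUP_RE = re.compile(r"\.bak(?:[.-][\w-]+)?$")


def is_backup(name: str) -> bool:
    """True if the file name looks like one of the manual backups."""
    return BACKUP_RE.search(name) is not None


def find_backups(root: Path) -> list[Path]:
    """Return backup files under `root`, as paths relative to it."""
    return sorted(
        p.relative_to(root)
        for p in root.rglob("*")
        if p.is_file() and is_backup(p.name)
    )


# Demo on a temporary tree so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib").mkdir()
    for rel in ("recon.py", "recon.py.bak-pre-streamb", "lib/api.py", "lib/api.py.bak"):
        (root / rel).write_text("")
    found = [p.as_posix() for p in find_backups(root)]

print(found)
```

Listing first and deleting in a separate step keeps the review-before-removal order this phase follows.
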
## What Was Kept (and why)
| Module | Lines | Why kept |
|--------|-------|----------|
| `lib/web_scraper.py` | 324 | `transcript_processor.py` imports `chunk_text()` |
| `lib/new_pipeline.py` | 1,637 | Active Stream B library management CLI (created Apr 13) |
| `lib/peertube_scraper.py` | 580 | Only way to ingest PeerTube transcripts |
| `lib/extractor.py` | 601 | Used by `cmd_run` CLI for batch PDF processing |
## Verification
| Check | Result |
|-------|--------|
| Compile (recon.py) | OK |
| Import (recon module) | OK |
| Import (dispatcher, filing, processors) | OK |
| cmd_service assertions | extract worker absent, dispatch_loop present, filing_worker_loop present |
| Zero crawler references in .py files | Confirmed |
| Service restart | Clean, active |
| Thread count | 13 tasks (was 14 — extract removed) |
| Threads started | enrich, embed, dispatcher, filing, progress, dashboard, metrics |
| Extract thread | Absent (confirmed by logs: no `[extract] Stage started`) |
| Errors (60s window) | 0 |
| DB rows | catalogue=29,812, documents=29,812 (unchanged) |
| Dashboard | Responsive |
| Hopper | Empty |
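
The compile check and the `cmd_service` assertions in the table above can be approximated statically. The following is a sketch, not the verification script used here: it parses the source (which fails on a syntax error), extracts one function's text, and checks for expected and forbidden substrings. The sample `cmd_service` body is a hypothetical stand-in for the real one.

```python
import ast


def function_source(source: str, name: str) -> str:
    """Return the source text of function `name`.

    `ast.parse` raises SyntaxError if the file would not compile.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                return segment
    raise LookupError(f"function {name!r} not found")


def check_function(source: str, name: str, present: list[str], absent: list[str]) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    body = function_source(source, name)
    problems = [f"missing: {s}" for s in present if s not in body]
    problems += [f"still present: {s}" for s in absent if s in body]
    return problems


# Hypothetical post-cleanup source, for illustration only.
SAMPLE = '''
import threading

def cmd_service():
    threads = [
        threading.Thread(target=dispatch_loop),
        threading.Thread(target=filing_worker_loop),
    ]
    return threads
'''

print(check_function(
    SAMPLE,
    "cmd_service",
    present=["dispatch_loop", "filing_worker_loop"],
    absent=["run_extraction", "extract_workers"],
))
```

Substring checks are deliberately crude; they confirm that names are gone or present in the function text, not that the service behaves correctly, which is why the runtime checks (restart, thread count, logs) remain in the table.
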
## Commit
- **Commit:** `efae402` on `refactor` branch
- **Diff:** 2 files changed, 3 insertions(+), 521 deletions(-)
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)