From 1d4106643cffdf86712594f28dfa6d16b9d8d593 Mon Sep 17 00:00:00 2001 From: Matt Date: Tue, 14 Apr 2026 20:23:37 +0000 Subject: [PATCH] Add Phase 5c-2 doc: service start, failure analysis, and recovery Documents the initial 5c-2 failure (ignore_errors=True + root-owned legacy files), the recovery procedure (hopper reconstitution, orphan cleanup, processor fix), and the successful retry with pipeline drain in progress. --- phases/phase-5c2-service-start.md | 138 ++++++++++++++++++++++++++++++ 1 file changed, 138 insertions(+) create mode 100644 phases/phase-5c2-service-start.md diff --git a/phases/phase-5c2-service-start.md b/phases/phase-5c2-service-start.md new file mode 100644 index 0000000..8d05fab --- /dev/null +++ b/phases/phase-5c2-service-start.md @@ -0,0 +1,138 @@ +# Phase 5c-2: Service Start & Transcript Drain + +## Objective + +Start `recon.service` with the refactored dispatch loop and drain the 2,259 +PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline: +dispatch → extract → enrich → embed → complete. + +## Initial Attempt (2026-04-14 ~18:42 UTC) + +### What happened + +The service started cleanly — all threads (extract, enrich, embed, dispatcher, +filing, progress, dashboard) came up. The dispatcher found 2,259 `.txt` + +`.meta.json` pairs in `acquired/stream/` and began dispatching them to +`transcript_processor.pre_flight()`. + +**All 2,259 items crashed** with: +``` +Permission denied: '/opt/recon/data/processing/{hash}/meta.json' +``` + +### Root cause + +The Phase 5b staging script (`stage_hopper.py`) used `shutil.copy2()` to copy +`meta.json` sidecars from `/mnt/library/_sources/streamecho6/` into the hopper. +The source files were root-owned, and `copy2` preserves permissions — so the +hopper copies were also root-owned. + +When `transcript_processor.pre_flight()` ran: +1. It hashed the content file (worked — `.txt` was world-readable) +2. It ran stale cleanup: `shutil.rmtree(proc_dir, ignore_errors=True)` +3. The `ignore_errors=True` **silently failed** to remove root-owned files + inside processing dirs that already existed from a previous (legacy) + PeerTube scraper run +4. It tried to read the (now root-owned) `meta.json` inside the half-cleaned + processing dir → **Permission denied** + +The `ignore_errors=True` was the critical bug — it turned a loud, diagnosable +failure into a silent state corruption that cascaded into a different error +2 steps later. + +### Resulting state + +- 2,259 "dirty" processing dirs with root-owned `meta.json` + `transcript.txt` + and zvx-owned `page_*.txt` (from the partially-completed page split) +- Zero DB rows for any of the 2,259 items (pre_flight crashed before + `add_to_catalogue` / `queue_document`) +- 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows) +- Hopper drained (all files consumed by dispatcher before crash) + +--- + +## Recovery (2026-04-14 ~20:14 UTC) + +### Pre-recovery verification + +| Check | Expected | Actual | +|-------|----------|--------| +| recon.service | inactive | inactive | +| processing/ dirs | 2,262 | 2,262 | +| acquired/stream/ | empty | empty | +| catalogue rows | 27,553 | 27,553 | +| documents rows | 27,553 | 27,553 | + +DB backup: `/tmp/recon.db.phase5c2recovery.20260414.bak` +(md5: `99031c0994160e3a844375a9a635d616`) + +### Section 1: Classify processing dirs + +- **3 PRESERVE** — have catalogue rows (Phase 3/4 test items): + `172f39ae`, `380dbc78`, `93aad72f` +- **2,259 ORPHAN** — no catalogue or documents rows + +### Section 2: Reconstitute hopper + +Copied `transcript.txt` → `{hash}.txt` and `meta.json` → `{hash}.meta.json` +from each orphan processing dir back into `acquired/stream/`. + +Result: 2,259 `.txt` + 2,259 `.meta.json`, all zvx-owned, 0 errors. + +### Section 3: Delete orphan processing dirs + +`sudo rm -rf` on all 2,259 orphan dirs (sudo required for root-owned files +on local disk). 3 preserved dirs confirmed intact. + +### Section 4: Fix the bug + +Replaced `shutil.rmtree(proc_dir, ignore_errors=True)` with explicit +try/except that logs and re-raises: + +```python +if os.path.exists(proc_dir): + try: + shutil.rmtree(proc_dir) + except Exception as e: + logger.error("Stale cleanup failed for %s: %s", proc_dir, e) + raise +``` + +Applied to both `transcript_processor.py` and `pdf_processor.py`. + +### Section 5: Commit + +- **Commit:** `9fa60f9` on `refactor` branch +- **Message:** "Fix: stale cleanup in processors must fail loudly on permission errors" +- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor) + +### Section 6: Retry service start + +Service started at **20:15:57 UTC**. All threads came up cleanly. + +Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items +immediately began flowing through the pipeline: + +| Time (UTC) | extracted | enriching | enriched | complete | +|------------|-----------|-----------|----------|----------| +| 20:17:51 | 2,137 | 16 | 31 | 27,634 | +| 20:18:51 | 2,061 | 16 | 15 | 27,717 | +| 20:19:51 | 1,987 | 16 | 3 | 27,803 | +| 20:20:51 | 1,917 | 16 | 36 | 27,840 | +| 20:21:51 | 1,849 | 16 | 26 | 27,918 | + +Throughput: ~57 items/min completing. Zero errors in 5-minute stability window. + +ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC). + +Background monitor running (PID 554723), writing to +`/tmp/phase5c2recovery_initial_log.txt`. + +--- + +## Files Modified + +| File | Change | +|------|--------| +| `lib/processors/transcript_processor.py` | Stale cleanup: fail loudly | +| `lib/processors/pdf_processor.py` | Stale cleanup: fail loudly |