mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 14:44:39 +02:00
Documents the initial 5c-2 failure (ignore_errors=True + root-owned legacy files), the recovery procedure (hopper reconstitution, orphan cleanup, processor fix), and the successful retry with pipeline drain in progress.
138 lines
4.6 KiB
Markdown
138 lines
4.6 KiB
Markdown
# Phase 5c-2: Service Start & Transcript Drain
|
|
|
|
## Objective
|
|
|
|
Start `recon.service` with the refactored dispatch loop and drain the 2,259
|
|
PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline:
|
|
dispatch → extract → enrich → embed → complete.
|
|
|
|
## Initial Attempt (2026-04-14 ~18:42 UTC)
|
|
|
|
### What happened
|
|
|
|
The service started cleanly — all threads (extract, enrich, embed, dispatcher,
|
|
filing, progress, dashboard) came up. The dispatcher found 2,259 `.txt` +
|
|
`.meta.json` pairs in `acquired/stream/` and began dispatching them to
|
|
`transcript_processor.pre_flight()`.
|
|
|
|
**All 2,259 items crashed** with:
|
|
```
|
|
Permission denied: '/opt/recon/data/processing/{hash}/meta.json'
|
|
```
|
|
|
|
### Root cause
|
|
|
|
The Phase 5b staging script (`stage_hopper.py`) used `shutil.copy2()` to copy
|
|
`meta.json` sidecars from `/mnt/library/_sources/streamecho6/` into the hopper.
|
|
The source files were root-owned, and `copy2` preserves permissions — so the
|
|
hopper copies were also root-owned.
|
|
|
|
When `transcript_processor.pre_flight()` ran:
|
|
1. It hashed the content file (worked — `.txt` was world-readable)
|
|
2. It ran stale cleanup: `shutil.rmtree(proc_dir, ignore_errors=True)`
|
|
3. The `ignore_errors=True` **silently failed** to remove root-owned files
|
|
inside processing dirs that already existed from a previous (legacy)
|
|
PeerTube scraper run
|
|
4. It tried to read the (now root-owned) `meta.json` inside the half-cleaned
|
|
processing dir → **Permission denied**
|
|
|
|
The `ignore_errors=True` was the critical bug — it turned a loud, diagnosable
|
|
failure into a silent state corruption that cascaded into a different error
|
|
2 steps later.
|
|
|
|
### Resulting state
|
|
|
|
- 2,259 "dirty" processing dirs with root-owned `meta.json` + `transcript.txt`
|
|
and zvx-owned `page_*.txt` (from the partially-completed page split)
|
|
- Zero DB rows for any of the 2,259 items (pre_flight crashed before
|
|
`add_to_catalogue` / `queue_document`)
|
|
- 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)
|
|
- Hopper drained (all files consumed by dispatcher before crash)
|
|
|
|
---
|
|
|
|
## Recovery (2026-04-14 ~20:14 UTC)
|
|
|
|
### Pre-recovery verification
|
|
|
|
| Check | Expected | Actual |
|
|
|-------|----------|--------|
|
|
| recon.service | inactive | inactive |
|
|
| processing/ dirs | 2,262 | 2,262 |
|
|
| acquired/stream/ | empty | empty |
|
|
| catalogue rows | 27,553 | 27,553 |
|
|
| documents rows | 27,553 | 27,553 |
|
|
|
|
DB backup: `/tmp/recon.db.phase5c2recovery.20260414.bak`
|
|
(md5: `99031c0994160e3a844375a9a635d616`)
|
|
|
|
### Section 1: Classify processing dirs
|
|
|
|
- **3 PRESERVE** — have catalogue rows (Phase 3/4 test items):
|
|
`172f39ae`, `380dbc78`, `93aad72f`
|
|
- **2,259 ORPHAN** — no catalogue or documents rows
|
|
|
|
### Section 2: Reconstitute hopper
|
|
|
|
Copied `transcript.txt` → `{hash}.txt` and `meta.json` → `{hash}.meta.json`
|
|
from each orphan processing dir back into `acquired/stream/`.
|
|
|
|
Result: 2,259 `.txt` + 2,259 `.meta.json`, all zvx-owned, 0 errors.
|
|
|
|
### Section 3: Delete orphan processing dirs
|
|
|
|
`sudo rm -rf` on all 2,259 orphan dirs (sudo required for root-owned files
|
|
on local disk). 3 preserved dirs confirmed intact.
|
|
|
|
### Section 4: Fix the bug
|
|
|
|
Replaced `shutil.rmtree(proc_dir, ignore_errors=True)` with explicit
|
|
try/except that logs and re-raises:
|
|
|
|
```python
|
|
if os.path.exists(proc_dir):
|
|
try:
|
|
shutil.rmtree(proc_dir)
|
|
except Exception as e:
|
|
logger.error("Stale cleanup failed for %s: %s", proc_dir, e)
|
|
raise
|
|
```
|
|
|
|
Applied to both `transcript_processor.py` and `pdf_processor.py`.
|
|
|
|
### Section 5: Commit
|
|
|
|
- **Commit:** `9fa60f9` on `refactor` branch
|
|
- **Message:** "Fix: stale cleanup in processors must fail loudly on permission errors"
|
|
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)
|
|
|
|
### Section 6: Retry service start
|
|
|
|
Service started at **20:15:57 UTC**. All threads came up cleanly.
|
|
|
|
Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items
|
|
immediately began flowing through the pipeline:
|
|
|
|
| Time (UTC) | extracted | enriching | enriched | complete |
|
|
|------------|-----------|-----------|----------|----------|
|
|
| 20:17:51 | 2,137 | 16 | 31 | 27,634 |
|
|
| 20:18:51 | 2,061 | 16 | 15 | 27,717 |
|
|
| 20:19:51 | 1,987 | 16 | 3 | 27,803 |
|
|
| 20:20:51 | 1,917 | 16 | 36 | 27,840 |
|
|
| 20:21:51 | 1,849 | 16 | 26 | 27,918 |
|
|
|
|
Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.
|
|
|
|
ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).
|
|
|
|
Background monitor running (PID 554723), writing to
|
|
`/tmp/phase5c2recovery_initial_log.txt`.
|
|
|
|
---
|
|
|
|
## Files Modified
|
|
|
|
| File | Change |
|
|
|------|--------|
|
|
| `lib/processors/transcript_processor.py` | Stale cleanup: fail loudly |
|
|
| `lib/processors/pdf_processor.py` | Stale cleanup: fail loudly |
|