mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 14:44:39 +02:00
138 lines
4.6 KiB
Markdown
138 lines
4.6 KiB
Markdown
|
|
# Phase 5c-2: Service Start & Transcript Drain
|
||
|
|
|
||
|
|
## Objective
|
||
|
|
|
||
|
|
Start `recon.service` with the refactored dispatch loop and drain the 2,259
|
||
|
|
PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline:
|
||
|
|
dispatch → extract → enrich → embed → complete.
|
||
|
|
|
||
|
|
## Initial Attempt (2026-04-14 ~18:42 UTC)
|
||
|
|
|
||
|
|
### What happened
|
||
|
|
|
||
|
|
The service started cleanly — all threads (extract, enrich, embed, dispatcher,
|
||
|
|
filing, progress, dashboard) came up. The dispatcher found 2,259 `.txt` +
|
||
|
|
`.meta.json` pairs in `acquired/stream/` and began dispatching them to
|
||
|
|
`transcript_processor.pre_flight()`.
|
||
|
|
|
||
|
|
**All 2,259 items crashed** with:
|
||
|
|
```
|
||
|
|
Permission denied: '/opt/recon/data/processing/{hash}/meta.json'
|
||
|
|
```
|
||
|
|
|
||
|
|
### Root cause
|
||
|
|
|
||
|
|
The Phase 5b staging script (`stage_hopper.py`) used `shutil.copy2()` to copy
|
||
|
|
`meta.json` sidecars from `/mnt/library/_sources/streamecho6/` into the hopper.
|
||
|
|
The source files were root-owned, and `copy2` preserves permissions — so the
|
||
|
|
hopper copies were also root-owned.
|
||
|
|
|
||
|
|
When `transcript_processor.pre_flight()` ran:
|
||
|
|
1. It hashed the content file (worked — `.txt` was world-readable)
|
||
|
|
2. It ran stale cleanup: `shutil.rmtree(proc_dir, ignore_errors=True)`
|
||
|
|
3. The `ignore_errors=True` **silently failed** to remove root-owned files
|
||
|
|
inside processing dirs that already existed from a previous (legacy)
|
||
|
|
PeerTube scraper run
|
||
|
|
4. It tried to read the (now root-owned) `meta.json` inside the half-cleaned
|
||
|
|
processing dir → **Permission denied**
|
||
|
|
|
||
|
|
The `ignore_errors=True` was the critical bug — it turned a loud, diagnosable
|
||
|
|
failure into a silent state corruption that cascaded into a different error
|
||
|
|
2 steps later.
|
||
|
|
|
||
|
|
### Resulting state
|
||
|
|
|
||
|
|
- 2,259 "dirty" processing dirs with root-owned `meta.json` + `transcript.txt`
|
||
|
|
and zvx-owned `page_*.txt` (from the partially-completed page split)
|
||
|
|
- Zero DB rows for any of the 2,259 items (pre_flight crashed before
|
||
|
|
`add_to_catalogue` / `queue_document`)
|
||
|
|
- 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)
|
||
|
|
- Hopper drained (all files consumed by dispatcher before crash)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recovery (2026-04-14 ~20:14 UTC)
|
||
|
|
|
||
|
|
### Pre-recovery verification
|
||
|
|
|
||
|
|
| Check | Expected | Actual |
|
||
|
|
|-------|----------|--------|
|
||
|
|
| recon.service | inactive | inactive |
|
||
|
|
| processing/ dirs | 2,262 | 2,262 |
|
||
|
|
| acquired/stream/ | empty | empty |
|
||
|
|
| catalogue rows | 27,553 | 27,553 |
|
||
|
|
| documents rows | 27,553 | 27,553 |
|
||
|
|
|
||
|
|
DB backup: `/tmp/recon.db.phase5c2recovery.20260414.bak`
|
||
|
|
(md5: `99031c0994160e3a844375a9a635d616`)
|
||
|
|
|
||
|
|
### Section 1: Classify processing dirs
|
||
|
|
|
||
|
|
- **3 PRESERVE** — have catalogue rows (Phase 3/4 test items):
|
||
|
|
`172f39ae`, `380dbc78`, `93aad72f`
|
||
|
|
- **2,259 ORPHAN** — no catalogue or documents rows
|
||
|
|
|
||
|
|
### Section 2: Reconstitute hopper
|
||
|
|
|
||
|
|
Copied `transcript.txt` → `{hash}.txt` and `meta.json` → `{hash}.meta.json`
|
||
|
|
from each orphan processing dir back into `acquired/stream/`.
|
||
|
|
|
||
|
|
Result: 2,259 `.txt` + 2,259 `.meta.json`, all zvx-owned, 0 errors.
|
||
|
|
|
||
|
|
### Section 3: Delete orphan processing dirs
|
||
|
|
|
||
|
|
`sudo rm -rf` on all 2,259 orphan dirs (sudo required for root-owned files
|
||
|
|
on local disk). 3 preserved dirs confirmed intact.
|
||
|
|
|
||
|
|
### Section 4: Fix the bug
|
||
|
|
|
||
|
|
Replaced `shutil.rmtree(proc_dir, ignore_errors=True)` with explicit
|
||
|
|
try/except that logs and re-raises:
|
||
|
|
|
||
|
|
```python
|
||
|
|
if os.path.exists(proc_dir):
|
||
|
|
try:
|
||
|
|
shutil.rmtree(proc_dir)
|
||
|
|
except Exception as e:
|
||
|
|
logger.error("Stale cleanup failed for %s: %s", proc_dir, e)
|
||
|
|
raise
|
||
|
|
```
|
||
|
|
|
||
|
|
Applied to both `transcript_processor.py` and `pdf_processor.py`.
|
||
|
|
|
||
|
|
### Section 5: Commit
|
||
|
|
|
||
|
|
- **Commit:** `9fa60f9` on `refactor` branch
|
||
|
|
- **Message:** "Fix: stale cleanup in processors must fail loudly on permission errors"
|
||
|
|
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)
|
||
|
|
|
||
|
|
### Section 6: Retry service start
|
||
|
|
|
||
|
|
Service started at **20:15:57 UTC**. All threads came up cleanly.
|
||
|
|
|
||
|
|
Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items
|
||
|
|
immediately began flowing through the pipeline:
|
||
|
|
|
||
|
|
| Time (UTC) | extracted | enriching | enriched | complete |
|
||
|
|
|------------|-----------|-----------|----------|----------|
|
||
|
|
| 20:17:51 | 2,137 | 16 | 31 | 27,634 |
|
||
|
|
| 20:18:51 | 2,061 | 16 | 15 | 27,717 |
|
||
|
|
| 20:19:51 | 1,987 | 16 | 3 | 27,803 |
|
||
|
|
| 20:20:51 | 1,917 | 16 | 36 | 27,840 |
|
||
|
|
| 20:21:51 | 1,849 | 16 | 26 | 27,918 |
|
||
|
|
|
||
|
|
Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.
|
||
|
|
|
||
|
|
ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).
|
||
|
|
|
||
|
|
Background monitor running (PID 554723), writing to
|
||
|
|
`/tmp/phase5c2recovery_initial_log.txt`.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Files Modified
|
||
|
|
|
||
|
|
| File | Change |
|
||
|
|
|------|--------|
|
||
|
|
| `lib/processors/transcript_processor.py` | Stale cleanup: fail loudly |
|
||
|
|
| `lib/processors/pdf_processor.py` | Stale cleanup: fail loudly |
|