refactored-recon/phases/phase-5c2-service-start.md

# Phase 5c-2: Service Start & Transcript Drain

## Objective

Start `recon.service` with the refactored dispatch loop and drain the 2,259
PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline:
dispatch → extract → enrich → embed → complete.

## Initial Attempt (2026-04-14 ~18:42 UTC)

### What happened

The service started cleanly — all threads (extract, enrich, embed, dispatcher,
filing, progress, dashboard) came up. The dispatcher found 2,259 `.txt` +
`.meta.json` pairs in `acquired/stream/` and began dispatching them to
`transcript_processor.pre_flight()`.

**All 2,259 items crashed** with:
```
Permission denied: '/opt/recon/data/processing/{hash}/meta.json'
```

### Root cause

The Phase 5b staging script (`stage_hopper.py`) used `shutil.copy2()` to copy
`meta.json` sidecars from `/mnt/library/_sources/streamecho6/` into the hopper.
The source files were root-owned, and `copy2` preserves permissions — so the
hopper copies were also root-owned.

When `transcript_processor.pre_flight()` ran:
1. It hashed the content file (worked — `.txt` was world-readable)
2. It ran stale cleanup: `shutil.rmtree(proc_dir, ignore_errors=True)`
3. The `ignore_errors=True` **silently failed** to remove root-owned files
   inside processing dirs that already existed from a previous (legacy)
   PeerTube scraper run
4. It tried to read the (now root-owned) `meta.json` inside the half-cleaned
   processing dir → **Permission denied**

The `ignore_errors=True` was the critical bug — it turned a loud, diagnosable
failure into a silent state corruption that cascaded into a different error
2 steps later.

### Resulting state

- 2,259 "dirty" processing dirs with root-owned `meta.json` + `transcript.txt`
  and zvx-owned `page_*.txt` (from the partially-completed page split)
- Zero DB rows for any of the 2,259 items (pre_flight crashed before
  `add_to_catalogue` / `queue_document`)
- 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)
- Hopper drained (all files consumed by dispatcher before crash)

---

## Recovery (2026-04-14 ~20:14 UTC)

### Pre-recovery verification

| Check | Expected | Actual |
|-------|----------|--------|
| recon.service | inactive | inactive |
| processing/ dirs | 2,262 | 2,262 |
| acquired/stream/ | empty | empty |
| catalogue rows | 27,553 | 27,553 |
| documents rows | 27,553 | 27,553 |

DB backup: `/tmp/recon.db.phase5c2recovery.20260414.bak`
(md5: `99031c0994160e3a844375a9a635d616`)

### Section 1: Classify processing dirs

- **3 PRESERVE** — have catalogue rows (Phase 3/4 test items):
  `172f39ae`, `380dbc78`, `93aad72f`
- **2,259 ORPHAN** — no catalogue or documents rows

### Section 2: Reconstitute hopper

Copied `transcript.txt` → `{hash}.txt` and `meta.json` → `{hash}.meta.json`
from each orphan processing dir back into `acquired/stream/`.

Result: 2,259 `.txt` + 2,259 `.meta.json`, all zvx-owned, 0 errors.

### Section 3: Delete orphan processing dirs

`sudo rm -rf` on all 2,259 orphan dirs (sudo required for root-owned files
on local disk). 3 preserved dirs confirmed intact.

### Section 4: Fix the bug

Replaced `shutil.rmtree(proc_dir, ignore_errors=True)` with explicit
try/except that logs and re-raises:

```python
if os.path.exists(proc_dir):
    try:
        shutil.rmtree(proc_dir)
    except Exception as e:
        logger.error("Stale cleanup failed for %s: %s", proc_dir, e)
        raise
```

Applied to both `transcript_processor.py` and `pdf_processor.py`.

### Section 5: Commit

- **Commit:** `9fa60f9` on `refactor` branch
- **Message:** "Fix: stale cleanup in processors must fail loudly on permission errors"
- **Pushed to:** `forge.echo6.co/matt/recon` (origin/refactor)

### Section 6: Retry service start

Service started at **20:15:57 UTC**. All threads came up cleanly.

Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items
immediately began flowing through the pipeline:

| Time (UTC) | extracted | enriching | enriched | complete |
|------------|-----------|-----------|----------|----------|
| 20:17:51 | 2,137 | 16 | 31 | 27,634 |
| 20:18:51 | 2,061 | 16 | 15 | 27,717 |
| 20:19:51 | 1,987 | 16 | 3 | 27,803 |
| 20:20:51 | 1,917 | 16 | 36 | 27,840 |
| 20:21:51 | 1,849 | 16 | 26 | 27,918 |

Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.

ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).

Background monitor running (PID 554723), writing to
`/tmp/phase5c2recovery_initial_log.txt`.

---

## Files Modified

| File | Change |
|------|--------|
| `lib/processors/transcript_processor.py` | Stale cleanup: fail loudly |
| `lib/processors/pdf_processor.py` | Stale cleanup: fail loudly |
Add Phase 5c-2 doc: service start, failure analysis, and recovery Documents the initial 5c-2 failure (ignore_errors=True + root-owned legacy files), the recovery procedure (hopper reconstitution, orphan cleanup, processor fix), and the successful retry with pipeline drain in progress. 2026-04-14 20:23:37 +00:00			`# Phase 5c-2: Service Start & Transcript Drain`

			`## Objective`

			Start `recon.service` with the refactored dispatch loop and drain the 2,259
			`PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline:`
			`dispatch → extract → enrich → embed → complete.`

			`## Initial Attempt (2026-04-14 ~18:42 UTC)`

			`### What happened`

			`The service started cleanly — all threads (extract, enrich, embed, dispatcher,`
			filing, progress, dashboard) came up. The dispatcher found 2,259 `.txt` +
			`.meta.json` pairs in `acquired/stream/` and began dispatching them to
			`transcript_processor.pre_flight()`.

			`All 2,259 items crashed with:`
			```
			`Permission denied: '/opt/recon/data/processing/{hash}/meta.json'`
			```

			`### Root cause`

			The Phase 5b staging script (`stage_hopper.py`) used `shutil.copy2()` to copy
			`meta.json` sidecars from `/mnt/library/_sources/streamecho6/` into the hopper.
			The source files were root-owned, and `copy2` preserves permissions — so the
			`hopper copies were also root-owned.`

			When `transcript_processor.pre_flight()` ran:
			1. It hashed the content file (worked — `.txt` was world-readable)
			2. It ran stale cleanup: `shutil.rmtree(proc_dir, ignore_errors=True)`
			3. The `ignore_errors=True` silently failed to remove root-owned files
			`inside processing dirs that already existed from a previous (legacy)`
			`PeerTube scraper run`
			4. It tried to read the (now root-owned) `meta.json` inside the half-cleaned
			`processing dir → Permission denied`

			The `ignore_errors=True` was the critical bug — it turned a loud, diagnosable
			`failure into a silent state corruption that cascaded into a different error`
			`2 steps later.`

			`### Resulting state`

			- 2,259 "dirty" processing dirs with root-owned `meta.json` + `transcript.txt`
			and zvx-owned `page_*.txt` (from the partially-completed page split)
			`- Zero DB rows for any of the 2,259 items (pre_flight crashed before`
			`add_to_catalogue` / `queue_document`)
			`- 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)`
			`- Hopper drained (all files consumed by dispatcher before crash)`

			`---`

			`## Recovery (2026-04-14 ~20:14 UTC)`

			`### Pre-recovery verification`

			`\| Check \| Expected \| Actual \|`
			`\|-------\|----------\|--------\|`
			`\| recon.service \| inactive \| inactive \|`
			`\| processing/ dirs \| 2,262 \| 2,262 \|`
			`\| acquired/stream/ \| empty \| empty \|`
			`\| catalogue rows \| 27,553 \| 27,553 \|`
			`\| documents rows \| 27,553 \| 27,553 \|`

			DB backup: `/tmp/recon.db.phase5c2recovery.20260414.bak`
			(md5: `99031c0994160e3a844375a9a635d616`)

			`### Section 1: Classify processing dirs`

			`- 3 PRESERVE — have catalogue rows (Phase 3/4 test items):`
			`172f39ae`, `380dbc78`, `93aad72f`
			`- 2,259 ORPHAN — no catalogue or documents rows`

			`### Section 2: Reconstitute hopper`

			Copied `transcript.txt` → `{hash}.txt` and `meta.json` → `{hash}.meta.json`
			from each orphan processing dir back into `acquired/stream/`.

			Result: 2,259 `.txt` + 2,259 `.meta.json`, all zvx-owned, 0 errors.

			`### Section 3: Delete orphan processing dirs`

			`sudo rm -rf` on all 2,259 orphan dirs (sudo required for root-owned files
			`on local disk). 3 preserved dirs confirmed intact.`

			`### Section 4: Fix the bug`

			Replaced `shutil.rmtree(proc_dir, ignore_errors=True)` with explicit
			`try/except that logs and re-raises:`

			```python
			`if os.path.exists(proc_dir):`
			`try:`
			`shutil.rmtree(proc_dir)`
			`except Exception as e:`
			`logger.error("Stale cleanup failed for %s: %s", proc_dir, e)`
			`raise`
			```

			Applied to both `transcript_processor.py` and `pdf_processor.py`.

			`### Section 5: Commit`

			- Commit: `9fa60f9` on `refactor` branch
			`- Message: "Fix: stale cleanup in processors must fail loudly on permission errors"`
			- Pushed to: `forge.echo6.co/matt/recon` (origin/refactor)

			`### Section 6: Retry service start`

			`Service started at 20:15:57 UTC. All threads came up cleanly.`

			`Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items`
			`immediately began flowing through the pipeline:`

			`\| Time (UTC) \| extracted \| enriching \| enriched \| complete \|`
			`\|------------\|-----------\|-----------\|----------\|----------\|`
			`\| 20:17:51 \| 2,137 \| 16 \| 31 \| 27,634 \|`
			`\| 20:18:51 \| 2,061 \| 16 \| 15 \| 27,717 \|`
			`\| 20:19:51 \| 1,987 \| 16 \| 3 \| 27,803 \|`
			`\| 20:20:51 \| 1,917 \| 16 \| 36 \| 27,840 \|`
			`\| 20:21:51 \| 1,849 \| 16 \| 26 \| 27,918 \|`

			`Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.`

			`ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).`

			`Background monitor running (PID 554723), writing to`
			`/tmp/phase5c2recovery_initial_log.txt`.

			`---`

			`## Files Modified`

			`\| File \| Change \|`
			`\|------\|--------\|`
			\| `lib/processors/transcript_processor.py` \| Stale cleanup: fail loudly \|
			\| `lib/processors/pdf_processor.py` \| Stale cleanup: fail loudly \|