mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 06:34:34 +02:00

Matt 1d4106643c Add Phase 5c-2 doc: service start, failure analysis, and recovery

Documents the initial 5c-2 failure (ignore_errors=True + root-owned
legacy files), the recovery procedure (hopper reconstitution, orphan
cleanup, processor fix), and the successful retry with pipeline drain
in progress.

2026-04-14 20:23:37 +00:00

4.6 KiB

Raw Blame History

Phase 5c-2: Service Start & Transcript Drain

Objective

Start recon.service with the refactored dispatch loop and drain the 2,259 PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline: dispatch → extract → enrich → embed → complete.

Initial Attempt (2026-04-14 ~18:42 UTC)

What happened

The service started cleanly — all threads (extract, enrich, embed, dispatcher, filing, progress, dashboard) came up. The dispatcher found 2,259 .txt + .meta.json pairs in acquired/stream/ and began dispatching them to transcript_processor.pre_flight().

All 2,259 items crashed with:

Permission denied: '/opt/recon/data/processing/{hash}/meta.json'

Root cause

The Phase 5b staging script (stage_hopper.py) used shutil.copy2() to copy meta.json sidecars from /mnt/library/_sources/streamecho6/ into the hopper. The source files were root-owned, and copy2 preserves permissions — so the hopper copies were also root-owned.

When transcript_processor.pre_flight() ran:

It hashed the content file (worked — .txt was world-readable)
It ran stale cleanup: shutil.rmtree(proc_dir, ignore_errors=True)
The ignore_errors=True silently failed to remove root-owned files inside processing dirs that already existed from a previous (legacy) PeerTube scraper run
It tried to read the (now root-owned) meta.json inside the half-cleaned processing dir → Permission denied

The ignore_errors=True was the critical bug — it turned a loud, diagnosable failure into a silent state corruption that cascaded into a different error 2 steps later.

Resulting state

2,259 "dirty" processing dirs with root-owned meta.json + transcript.txt and zvx-owned page_*.txt (from the partially-completed page split)
Zero DB rows for any of the 2,259 items (pre_flight crashed before add_to_catalogue / queue_document)
3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)
Hopper drained (all files consumed by dispatcher before crash)

Recovery (2026-04-14 ~20:14 UTC)

Pre-recovery verification

Check	Expected	Actual
recon.service	inactive	inactive
processing/ dirs	2,262	2,262
acquired/stream/	empty	empty
catalogue rows	27,553	27,553
documents rows	27,553	27,553

DB backup: /tmp/recon.db.phase5c2recovery.20260414.bak (md5: 99031c0994160e3a844375a9a635d616)

Section 1: Classify processing dirs

3 PRESERVE — have catalogue rows (Phase 3/4 test items): 172f39ae, 380dbc78, 93aad72f
2,259 ORPHAN — no catalogue or documents rows

Section 2: Reconstitute hopper

Copied transcript.txt → {hash}.txt and meta.json → {hash}.meta.json from each orphan processing dir back into acquired/stream/.

Result: 2,259 .txt + 2,259 .meta.json, all zvx-owned, 0 errors.

Section 3: Delete orphan processing dirs

sudo rm -rf on all 2,259 orphan dirs (sudo required for root-owned files on local disk). 3 preserved dirs confirmed intact.

Section 4: Fix the bug

Replaced shutil.rmtree(proc_dir, ignore_errors=True) with explicit try/except that logs and re-raises:

if os.path.exists(proc_dir):
    try:
        shutil.rmtree(proc_dir)
    except Exception as e:
        logger.error("Stale cleanup failed for %s: %s", proc_dir, e)
        raise

Applied to both transcript_processor.py and pdf_processor.py.

Section 5: Commit

Commit: 9fa60f9 on refactor branch
Message: "Fix: stale cleanup in processors must fail loudly on permission errors"
Pushed to: forge.echo6.co/matt/recon (origin/refactor)

Section 6: Retry service start

Service started at 20:15:57 UTC. All threads came up cleanly.

Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items immediately began flowing through the pipeline:

Time (UTC)	extracted	enriching	enriched	complete
20:17:51	2,137	16	31	27,634
20:18:51	2,061	16	15	27,717
20:19:51	1,987	16	3	27,803
20:20:51	1,917	16	36	27,840
20:21:51	1,849	16	26	27,918

Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.

ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).

Background monitor running (PID 554723), writing to /tmp/phase5c2recovery_initial_log.txt.

Files Modified

File	Change
`lib/processors/transcript_processor.py`	Stale cleanup: fail loudly
`lib/processors/pdf_processor.py`	Stale cleanup: fail loudly

4.6 KiB Raw Blame History