refactored-recon/phases/phase-5c2-service-start.md
Matt 1d4106643c Add Phase 5c-2 doc: service start, failure analysis, and recovery
Documents the initial 5c-2 failure (ignore_errors=True + root-owned
legacy files), the recovery procedure (hopper reconstitution, orphan
cleanup, processor fix), and the successful retry with pipeline drain
in progress.
2026-04-14 20:23:37 +00:00

4.6 KiB

Phase 5c-2: Service Start & Transcript Drain

Objective

Start recon.service with the refactored dispatch loop and drain the 2,259 PeerTube transcripts (unprocessed in Phase 5b) through the full pipeline: dispatch → extract → enrich → embed → complete.

Initial Attempt (2026-04-14 ~18:42 UTC)

What happened

The service started cleanly — all threads (extract, enrich, embed, dispatcher, filing, progress, dashboard) came up. The dispatcher found 2,259 .txt + .meta.json pairs in acquired/stream/ and began dispatching them to transcript_processor.pre_flight().

All 2,259 items crashed with:

Permission denied: '/opt/recon/data/processing/{hash}/meta.json'

Root cause

The Phase 5b staging script (stage_hopper.py) used shutil.copy2() to copy meta.json sidecars from /mnt/library/_sources/streamecho6/ into the hopper. The source files were root-owned, and copy2 preserves permissions — so the hopper copies were also root-owned.

When transcript_processor.pre_flight() ran:

  1. It hashed the content file (worked — .txt was world-readable)
  2. It ran stale cleanup: shutil.rmtree(proc_dir, ignore_errors=True)
  3. The ignore_errors=True silently failed to remove root-owned files inside processing dirs that already existed from a previous (legacy) PeerTube scraper run
  4. It tried to read the (now root-owned) meta.json inside the half-cleaned processing dir → Permission denied

The ignore_errors=True was the critical bug — it turned a loud, diagnosable failure into a silent state corruption that cascaded into a different error 2 steps later.

Resulting state

  • 2,259 "dirty" processing dirs with root-owned meta.json + transcript.txt and zvx-owned page_*.txt (from the partially-completed page split)
  • Zero DB rows for any of the 2,259 items (pre_flight crashed before add_to_catalogue / queue_document)
  • 3 "clean" processing dirs from Phase 3/4 test items (intact, with DB rows)
  • Hopper drained (all files consumed by dispatcher before crash)

Recovery (2026-04-14 ~20:14 UTC)

Pre-recovery verification

Check Expected Actual
recon.service inactive inactive
processing/ dirs 2,262 2,262
acquired/stream/ empty empty
catalogue rows 27,553 27,553
documents rows 27,553 27,553

DB backup: /tmp/recon.db.phase5c2recovery.20260414.bak (md5: 99031c0994160e3a844375a9a635d616)

Section 1: Classify processing dirs

  • 3 PRESERVE — have catalogue rows (Phase 3/4 test items): 172f39ae, 380dbc78, 93aad72f
  • 2,259 ORPHAN — no catalogue or documents rows

Section 2: Reconstitute hopper

Copied transcript.txt{hash}.txt and meta.json{hash}.meta.json from each orphan processing dir back into acquired/stream/.

Result: 2,259 .txt + 2,259 .meta.json, all zvx-owned, 0 errors.

Section 3: Delete orphan processing dirs

sudo rm -rf on all 2,259 orphan dirs (sudo required for root-owned files on local disk). 3 preserved dirs confirmed intact.

Section 4: Fix the bug

Replaced shutil.rmtree(proc_dir, ignore_errors=True) with explicit try/except that logs and re-raises:

if os.path.exists(proc_dir):
    try:
        shutil.rmtree(proc_dir)
    except Exception as e:
        logger.error("Stale cleanup failed for %s: %s", proc_dir, e)
        raise

Applied to both transcript_processor.py and pdf_processor.py.

Section 5: Commit

  • Commit: 9fa60f9 on refactor branch
  • Message: "Fix: stale cleanup in processors must fail loudly on permission errors"
  • Pushed to: forge.echo6.co/matt/recon (origin/refactor)

Section 6: Retry service start

Service started at 20:15:57 UTC. All threads came up cleanly.

Dispatcher consumed all 2,259 hopper items on first cycle (~20:16). Items immediately began flowing through the pipeline:

Time (UTC) extracted enriching enriched complete
20:17:51 2,137 16 31 27,634
20:18:51 2,061 16 15 27,717
20:19:51 1,987 16 3 27,803
20:20:51 1,917 16 36 27,840
20:21:51 1,849 16 26 27,918

Throughput: ~57 items/min completing. Zero errors in 5-minute stability window.

ETA for full drain: ~32 minutes from 20:22 UTC (~20:54 UTC).

Background monitor running (PID 554723), writing to /tmp/phase5c2recovery_initial_log.txt.


Files Modified

File Change
lib/processors/transcript_processor.py Stale cleanup: fail loudly
lib/processors/pdf_processor.py Stale cleanup: fail loudly