Phase 6d: PeerTube Acquisition Module

Date: 2026-04-15
Commit: 277110d (refactor branch)
Status: Complete

What Changed

Created lib/acquisition/peertube.py — a new module that polls PeerTube for video transcripts and writes them as flat file pairs into data/acquired/stream/ for the dispatcher to pick up. This replaces the peertube_scanner_loop removed in Phase 5c-1.

New File: lib/acquisition/peertube.py (~170 lines)

  • _build_known_sets(db) — queries catalogue for source='stream.echo6.co', builds UUID + title dedup sets
  • list_new_videos(db, config) — calls get_videos(), filters against known sets, checks captions with rate limiting
  • acquire_one(video, caption_path, config) — fetches VTT, converts to text, writes .tmp files, hashes, renames atomically
  • acquire_batch(db, config) — orchestrates list + acquire, returns {acquired, skipped, errors}
  • acquisition_loop(stop_event, db, config, interval) — service loop, polls every interval seconds
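
The filtering step described for list_new_videos can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name filter_new_videos, the has_captions callback, the injectable sleep parameter, and the "uuid" and "name" dictionary keys are all assumptions made for the example. It checks the local dedup sets before any network call, then pauses after each caption lookup.

```python
import time
from typing import Callable, Iterable


def filter_new_videos(
    videos: Iterable[dict],
    known_uuids: set,
    known_titles: set,
    has_captions: Callable[[dict], bool],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Return videos that are not yet catalogued and that have captions.

    Hypothetical helper: `has_captions` stands in for the caption API call,
    and the "uuid" / "name" keys are assumed field names.
    """
    new_videos = []
    for video in videos:
        # Cheap local dedup first, so known videos never cost an API call.
        if video["uuid"] in known_uuids or video["name"] in known_titles:
            continue
        if has_captions(video):
            new_videos.append(video)
        # Pause after every caption lookup to stay under the rate limit.
        sleep(delay)
    return new_videos
```

Passing sleep as a parameter keeps the delay testable without actually waiting; in a real service the default time.sleep would apply.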

Edited: recon.py

  • cmd_service(): Added peertube-acq thread running acquisition_loop (interval from config, default 1800s)
  • cmd_ingest_peertube(): Replaced legacy ingest_channel/ingest_all with acquire_batch
  • Simplified argparse: removed --channel, --since, --enrich, --process; kept --stats

Edited: config.yaml

  • Added poll_interval: 1800 under peertube: section
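
As an illustration of the service wiring described above, the sketch below starts a thread named peertube-acq and reads the interval from a peertube.poll_interval config key, falling back to 1800 seconds. The names poll_loop, start_peertube_thread, and run_batch are hypothetical; the real acquisition_loop takes (stop_event, db, config, interval), and a plain callable stands in for acquire_batch here so the example stays self-contained.

```python
import threading

DEFAULT_POLL_INTERVAL = 1800  # seconds, the default stated in this report


def poll_loop(stop_event, run_batch, interval):
    """Call run_batch() repeatedly until stop_event is set (hypothetical sketch)."""
    while not stop_event.is_set():
        try:
            run_batch()
        except Exception as exc:
            # Assumption: a failed batch should not kill the service thread.
            print(f"peertube-acq: batch failed: {exc}")
        # Event.wait returns early once stop_event is set, unlike time.sleep.
        stop_event.wait(interval)


def start_peertube_thread(config, run_batch, stop_event):
    """Start the polling thread using the configured interval, or the default."""
    interval = config.get("peertube", {}).get("poll_interval", DEFAULT_POLL_INTERVAL)
    thread = threading.Thread(
        target=poll_loop,
        args=(stop_event, run_batch, interval),
        name="peertube-acq",
        daemon=True,
    )
    thread.start()
    return thread
```

Waiting on the event rather than sleeping means a shutdown request does not have to wait out a full 30-minute interval.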

Architecture

PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write)
    → data/acquired/stream/{hash}.txt + {hash}.meta.json
    → dispatcher _find_pairs() → transcript_processor pre_flight()
    → enrich → embed → complete
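
To make the hand-off in this flow concrete, here is one way a consumer could detect complete pairs in data/acquired/stream/. The dispatcher's actual _find_pairs() is not shown in this report, so find_complete_pairs below is a hypothetical stand-in that relies only on the {hash}.txt + {hash}.meta.json naming described above.

```python
from pathlib import Path


def find_complete_pairs(stream_dir):
    """Return (content_path, meta_path) tuples for every complete pair.

    Hypothetical sketch: a pair counts as complete when both {hash}.txt
    and {hash}.meta.json exist. Files still carrying a .tmp suffix do not
    match the "*.txt" pattern, so in-progress writes are skipped.
    """
    pairs = []
    for content_path in sorted(Path(stream_dir).glob("*.txt")):
        meta_path = content_path.with_name(content_path.stem + ".meta.json")
        if meta_path.exists():
            pairs.append((content_path, meta_path))
    return pairs
```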

Key Design Decisions

  1. No DB writes in acquisition: acquire_one only writes files. pre_flight() handles catalogue registration.
  2. Atomic writes: .tmp suffix during writes, rename meta first, then content. Dispatcher only sees complete pairs.
  3. Two dedup cohorts — UUID set (from URL paths) and title set (from filename column) cover both legacy and new catalogue entries.
  4. Rate limiting — 0.5s delay between caption API calls to avoid PeerTube 429s.
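
The atomic-write decision can be sketched as below. This is an illustration under stated assumptions rather than the code in acquire_one: the function name write_pair_atomically and the temporary file names are hypothetical, and MD5 is only a guess based on the 32-character hex hash quoted in the verification notes.

```python
import hashlib
import json
import os
from pathlib import Path


def write_pair_atomically(stream_dir, video_uuid, text, meta):
    """Write {hash}.txt + {hash}.meta.json without exposing partial files."""
    stream_dir = Path(stream_dir)
    stream_dir.mkdir(parents=True, exist_ok=True)

    # Stage both files under .tmp names that a "*.txt" consumer will not match.
    tmp_text = stream_dir / f"{video_uuid}.txt.tmp"
    tmp_meta = stream_dir / f"{video_uuid}.meta.json.tmp"
    tmp_text.write_text(text, encoding="utf-8")
    tmp_meta.write_text(json.dumps(meta), encoding="utf-8")

    # Assumption: the pair is named after the MD5 of the transcript bytes.
    digest = hashlib.md5(tmp_text.read_bytes()).hexdigest()

    # Rename meta first, then content: by the time the .txt file appears,
    # its metadata is already in place.
    os.replace(tmp_meta, stream_dir / f"{digest}.meta.json")
    os.replace(tmp_text, stream_dir / f"{digest}.txt")
    return digest
```

os.replace is atomic when source and destination sit on the same filesystem, which is why the temporary files are staged in the target directory itself.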

Verification

  • Import/compile: OK
  • Dry run: list_new_videos returns new videos not in catalogue
  • Real acquisition: hash a8893f3757295e347cb5b529cae350ff acquired and dispatched (returned 'duplicate' — already in catalogue from legacy ingest, confirming dedup works)
  • Service restart: 7 threads, peertube-acq in thread list, 0 errors in 90-second window
  • CLI: recon ingest-peertube --stats still works, recon ingest-peertube uses new path