# Phase 6d: PeerTube Acquisition Module **Date:** 2026-04-15 **Commit:** 277110d (refactor branch) **Status:** Complete ## What Changed Created `lib/acquisition/peertube.py` — a new module that polls PeerTube for video transcripts and writes them as flat file pairs into `data/acquired/stream/` for the dispatcher to pick up. This replaces the `peertube_scanner_loop` removed in Phase 5c-1. ### New File: `lib/acquisition/peertube.py` (~170 lines) - `_build_known_sets(db)` — queries catalogue for `source='stream.echo6.co'`, builds UUID + title dedup sets - `list_new_videos(db, config)` — calls `get_videos()`, filters against known sets, checks captions with rate limiting - `acquire_one(video, caption_path, config)` — fetches VTT, converts to text, writes `.tmp` files, hashes, renames atomically - `acquire_batch(db, config)` — orchestrates list + acquire, returns `{acquired, skipped, errors}` - `acquisition_loop(stop_event, db, config, interval)` — service loop, polls every `interval` seconds ### Edited: `recon.py` - `cmd_service()`: Added `peertube-acq` thread running `acquisition_loop` (interval from config, default 1800s) - `cmd_ingest_peertube()`: Replaced legacy `ingest_channel`/`ingest_all` with `acquire_batch` - Simplified argparse: removed `--channel`, `--since`, `--enrich`, `--process`; kept `--stats` ### Edited: `config.yaml` - Added `poll_interval: 1800` under `peertube:` section ## Architecture ``` PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write) → data/acquired/stream/{hash}.txt + {hash}.meta.json → dispatcher _find_pairs() → transcript_processor pre_flight() → enrich → embed → complete ``` ## Key Design Decisions 1. **No DB writes in acquisition** — `acquire_one` only writes files. `pre_flight()` handles catalogue registration. 2. **Atomic writes** — `.tmp` suffix during writes, rename meta first then content. Dispatcher only sees complete pairs. 3. **Two dedup cohorts** — UUID set (from URL paths) and title set (from filename column) cover both legacy and new catalogue entries. 4. **Rate limiting** — 0.5s delay between caption API calls to avoid PeerTube 429s. ## Verification - Import/compile: OK - Dry run: `list_new_videos` returns new videos not in catalogue - Real acquisition: hash `a8893f3757295e347cb5b529cae350ff` acquired and dispatched (returned 'duplicate' — already in catalogue from legacy ingest, confirming dedup works) - Service restart: 7 threads, `peertube-acq` in thread list, 0 errors in 90-second window - CLI: `recon ingest-peertube --stats` still works, `recon ingest-peertube` uses new path