refactored-recon/phases/phase-6d-peertube-acquisition.md

# Phase 6d: PeerTube Acquisition Module

**Date:** 2026-04-15
**Commit:** 277110d (refactor branch)
**Status:** Complete

## What Changed

Created `lib/acquisition/peertube.py` — a new module that polls PeerTube for
video transcripts and writes them as flat file pairs into `data/acquired/stream/`
for the dispatcher to pick up. This replaces the `peertube_scanner_loop` removed
in Phase 5c-1.

### New File: `lib/acquisition/peertube.py` (~170 lines)

- `_build_known_sets(db)` — queries catalogue for `source='stream.echo6.co'`, builds UUID + title dedup sets
- `list_new_videos(db, config)` — calls `get_videos()`, filters against known sets, checks captions with rate limiting
- `acquire_one(video, caption_path, config)` — fetches VTT, converts to text, writes `.tmp` files, hashes, renames atomically
- `acquire_batch(db, config)` — orchestrates list + acquire, returns `{acquired, skipped, errors}`
- `acquisition_loop(stop_event, db, config, interval)` — service loop, polls every `interval` seconds
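
The filtering step in `list_new_videos` can be sketched roughly as follows. This is a minimal illustration, not the module's code: `filter_new_videos`, the `has_captions` callback, and the injectable `sleep` are hypothetical names, and the `uuid` and `name` keys are assumed to be the fields carried by each video dict.

```python
import time
from typing import Callable, Iterable


def filter_new_videos(
    videos: Iterable[dict],
    known_uuids: set[str],
    known_titles: set[str],
    has_captions: Callable[[dict], bool],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Return videos absent from both dedup sets that also have captions."""
    fresh = []
    for video in videos:
        # Cheap in-memory checks first, so known videos cost no API call.
        if video["uuid"] in known_uuids or video["name"] in known_titles:
            continue
        # has_captions stands in for the caption API lookup (hypothetical).
        if has_captions(video):
            fresh.append(video)
        # Pause after every caption lookup to stay under the rate limit.
        sleep(delay)
    return fresh
```

Checking the sets before the caption lookup means already-catalogued videos never trigger a network call, so the delay only applies to genuinely unknown videos.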

### Edited: `recon.py`

- `cmd_service()`: Added `peertube-acq` thread running `acquisition_loop` (interval from config, default 1800s)
- `cmd_ingest_peertube()`: Replaced legacy `ingest_channel`/`ingest_all` with `acquire_batch`
- Simplified argparse: removed `--channel`, `--since`, `--enrich`, `--process`; kept `--stats`

### Edited: `config.yaml`

- Added `poll_interval: 1800` under `peertube:` section
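
A rough sketch of how the `poll_interval` setting could drive the service loop. It assumes the parsed config is a nested dict; `poll_interval_from` and `run_loop` are hypothetical helpers, and catching and logging batch errors is an assumption rather than documented behaviour of `acquisition_loop`.

```python
import logging
import threading
from typing import Callable

DEFAULT_POLL_INTERVAL = 1800  # seconds; matches the default noted above

log = logging.getLogger("peertube-acq")


def poll_interval_from(config: dict) -> int:
    """Read peertube.poll_interval, falling back to the default."""
    return int(config.get("peertube", {}).get("poll_interval", DEFAULT_POLL_INTERVAL))


def run_loop(
    stop_event: threading.Event,
    run_batch: Callable[[], dict],
    interval: float,
) -> int:
    """Call run_batch every `interval` seconds until stop_event is set.

    Returns the number of passes made (handy for testing).
    """
    passes = 0
    while not stop_event.is_set():
        try:
            run_batch()
        except Exception:
            # Assumption: one failed batch should not kill the service thread.
            log.exception("batch failed")
        passes += 1
        # Event.wait returns True as soon as the event is set, so shutdown
        # does not have to sit out the remainder of the interval.
        if stop_event.wait(interval):
            break
    return passes
```

Waiting on the event instead of calling `time.sleep` keeps a 30-minute interval from delaying service shutdown.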

## Architecture

```
PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write)
    → data/acquired/stream/{hash}.txt + {hash}.meta.json
    → dispatcher _find_pairs() → transcript_processor pre_flight()
    → enrich → embed → complete
```
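
The "fetch VTT" step implies a WebVTT-to-text conversion before the content is hashed. A minimal sketch of such a conversion, assuming standard WebVTT block layout; `vtt_to_text` is a hypothetical name, and dropping consecutive repeated lines is an assumption, not something this report states.

```python
import re

_TAG = re.compile(r"<[^>]+>")  # inline cue markup such as <b> or <c.foo>


def vtt_to_text(vtt: str) -> str:
    """Strip WebVTT headers, cue ids, timings and tags; keep the spoken text."""
    out: list[str] = []
    # Blocks (header, NOTE, STYLE, cues) are separated by blank lines.
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the timing line is the cue payload.
                payload = lines[i + 1:]
                break
        else:
            # No timing line: header or NOTE/STYLE/REGION block, skip it.
            continue
        for line in payload:
            text = _TAG.sub("", line).strip()
            # Assumption: collapse consecutive repeats, common in auto-captions.
            if text and (not out or out[-1] != text):
                out.append(text)
    return "\n".join(out)
```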

## Key Design Decisions

1. **No DB writes in acquisition** — `acquire_one` only writes files. `pre_flight()` handles catalogue registration.
2. **Atomic writes** — `.tmp` suffix during writes, rename meta first then content. Dispatcher only sees complete pairs.
3. **Two dedup cohorts** — UUID set (from URL paths) and title set (from filename column) cover both legacy and new catalogue entries.
4. **Rate limiting** — 0.5s delay between caption API calls to avoid PeerTube 429s.
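
The atomic-write decision can be illustrated with a short sketch. `write_pair` is a hypothetical helper, not the body of `acquire_one`. It assumes an MD5 digest of the transcript text (consistent with the 32-character hash shown under Verification, but not confirmed by this report), and it computes the digest before writing, which simplifies the write-then-hash order described above.

```python
import hashlib
import json
import os
from pathlib import Path


def write_pair(out_dir: Path, text: str, meta: dict) -> str:
    """Write {hash}.txt and {hash}.meta.json so that no partial pair is visible."""
    out_dir.mkdir(parents=True, exist_ok=True)
    data = text.encode("utf-8")
    digest = hashlib.md5(data).hexdigest()  # assumption: MD5 of the text

    txt_final = out_dir / f"{digest}.txt"
    meta_final = out_dir / f"{digest}.meta.json"
    txt_tmp = out_dir / f"{digest}.txt.tmp"
    meta_tmp = out_dir / f"{digest}.meta.json.tmp"

    # Write both files under .tmp names that a pair scanner would ignore.
    txt_tmp.write_bytes(data)
    meta_tmp.write_text(json.dumps(meta, ensure_ascii=False), encoding="utf-8")

    # Rename meta first, then content: by the time the .txt file appears,
    # its .meta.json is already in place.
    os.replace(meta_tmp, meta_final)
    os.replace(txt_tmp, txt_final)
    return digest
```

`os.replace` is atomic when source and destination are on the same filesystem, which is why the temporary files live in the target directory rather than in a system temp location.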

## Verification

- Import/compile: OK
- Dry run: `list_new_videos` returns new videos not in catalogue
- Real acquisition: hash `a8893f3757295e347cb5b529cae350ff` acquired and dispatched. The dispatcher returned 'duplicate' because the item was already in the catalogue from the legacy ingest, confirming that dedup works.
- Service restart: 7 threads, `peertube-acq` in thread list, 0 errors in 90-second window
- CLI: `recon ingest-peertube --stats` still works; `recon ingest-peertube` uses the new acquisition path