checkpoint: pre-audit working tree state — 4 untracked design docs

Matt 2026-04-27 02:08:28 +00:00
commit 3b5c24c7e7
4 changed files with 1181 additions and 0 deletions

@@ -0,0 +1,54 @@
# Phase 6d: PeerTube Acquisition Module
**Date:** 2026-04-15
**Commit:** 277110d (refactor branch)
**Status:** Complete
## What Changed
Created `lib/acquisition/peertube.py` — a new module that polls PeerTube for
video transcripts and writes them as flat file pairs into `data/acquired/stream/`
for the dispatcher to pick up. This replaces the `peertube_scanner_loop` removed
in Phase 5c-1.
### New File: `lib/acquisition/peertube.py` (~170 lines)
- `_build_known_sets(db)` — queries catalogue for `source='stream.echo6.co'`, builds UUID + title dedup sets
- `list_new_videos(db, config)` — calls `get_videos()`, filters against known sets, checks captions with rate limiting
- `acquire_one(video, caption_path, config)` — fetches VTT, converts to text, writes `.tmp` files, hashes, renames atomically
- `acquire_batch(db, config)` — orchestrates list + acquire, returns `{acquired, skipped, errors}`
- `acquisition_loop(stop_event, db, config, interval)` — service loop, polls every `interval` seconds
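The dedup and throttling behaviour described for `_build_known_sets` and `list_new_videos` can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the row shape `(url, filename)`, the `uuid` and `name` keys on each video, the case-insensitive title match, and the `has_captions` callback are all assumptions made for the example.

```python
import re
import time

# Canonical 8-4-4-4-12 hex UUID anywhere in a URL path (assumed format).
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def build_known_sets(rows):
    """Build (uuid_set, title_set) from hypothetical (url, filename) catalogue rows."""
    known_uuids, known_titles = set(), set()
    for url, filename in rows:
        match = UUID_RE.search(url or "")
        if match:
            known_uuids.add(match.group(0).lower())
        if filename:
            known_titles.add(filename.strip().lower())
    return known_uuids, known_titles


def list_new(videos, known_uuids, known_titles, has_captions,
             delay=0.5, sleep=time.sleep):
    """Return videos absent from both sets that also have captions.

    `has_captions` stands in for the caption API call; `sleep(delay)` runs
    after every such call so consecutive requests are spaced out.
    """
    fresh = []
    for video in videos:
        if video["uuid"].lower() in known_uuids:
            continue  # already catalogued by UUID
        if video["name"].strip().lower() in known_titles:
            continue  # already catalogued by title (legacy cohort)
        if has_captions(video):
            fresh.append(video)
        sleep(delay)
    return fresh
```

Filtering against the known sets before the caption check means already-catalogued videos cost no API calls and no delay.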
### Edited: `recon.py`
- `cmd_service()`: Added `peertube-acq` thread running `acquisition_loop` (interval from config, default 1800s)
- `cmd_ingest_peertube()`: Replaced legacy `ingest_channel`/`ingest_all` with `acquire_batch`
- Simplified argparse: removed `--channel`, `--since`, `--enrich`, `--process`; kept `--stats`
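As an illustration of how a stop-aware polling thread like `peertube-acq` might be structured, here is a hedged sketch. The names `poll_loop`, `resolve_interval`, and `run_batch` are hypothetical, the config is assumed to be a plain dict, and the real `acquisition_loop(stop_event, db, config, interval)` may differ in its error handling and logging.

```python
import threading

DEFAULT_POLL_INTERVAL = 1800  # seconds, the default named in this document


def resolve_interval(config):
    """Read peertube.poll_interval from a dict-shaped config, else the default."""
    return (config.get("peertube") or {}).get("poll_interval", DEFAULT_POLL_INTERVAL)


def poll_loop(stop_event, run_batch, interval):
    """Call run_batch() repeatedly until stop_event is set."""
    while not stop_event.is_set():
        try:
            run_batch()
        except Exception as exc:  # keep the thread alive across one bad poll
            print(f"acquisition batch failed: {exc!r}")
        # Event.wait doubles as an interruptible sleep: it returns early on set().
        stop_event.wait(interval)


def start_acquisition_thread(stop_event, run_batch, config):
    """Start the loop on a daemon thread named like the one in this document."""
    thread = threading.Thread(
        target=poll_loop,
        args=(stop_event, run_batch, resolve_interval(config)),
        name="peertube-acq",
        daemon=True,
    )
    thread.start()
    return thread
```

Waiting on the event rather than calling `time.sleep` lets a service shutdown interrupt a 30-minute wait immediately.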
### Edited: `config.yaml`
- Added `poll_interval: 1800` under `peertube:` section
## Architecture
```
PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write)
→ data/acquired/stream/{hash}.txt + {hash}.meta.json
→ dispatcher _find_pairs() → transcript_processor pre_flight()
→ enrich → embed → complete
```
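To make the hand-off between acquisition and dispatch concrete, the following sketch shows one way a dispatcher could discover complete `{hash}.txt` + `{hash}.meta.json` pairs. It is an assumption-laden stand-in for `_find_pairs()`, whose real implementation is not shown here; `find_complete_pairs` is a hypothetical name.

```python
from pathlib import Path

META_SUFFIX = ".meta.json"


def find_complete_pairs(directory):
    """Return (text_path, meta_path) tuples for which both files exist.

    In-progress files end in `.tmp`, so the `*.meta.json` glob never
    matches them and half-written pairs are skipped.
    """
    directory = Path(directory)
    pairs = []
    for meta_path in sorted(directory.glob(f"*{META_SUFFIX}")):
        stem = meta_path.name[: -len(META_SUFFIX)]
        text_path = directory / f"{stem}.txt"
        if text_path.exists():
            pairs.append((text_path, meta_path))
    return pairs
```

Because the sketch requires both files to be present, it tolerates either rename order on the writer side.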
## Key Design Decisions
1. **No DB writes in acquisition:** `acquire_one` only writes files; `pre_flight()` handles catalogue registration.
2. **Atomic writes:** files carry a `.tmp` suffix while being written and are then renamed, meta first, then content, so the dispatcher only sees complete pairs.
3. **Two dedup cohorts:** a UUID set (from URL paths) and a title set (from the filename column) cover both legacy and new catalogue entries.
4. **Rate limiting:** a 0.5s delay between caption API calls avoids PeerTube 429s.
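The atomic-write decision can be sketched as below. This is a simplified illustration under stated assumptions: the function name `write_pair` is hypothetical, the digest is taken over the transcript text before writing, and MD5 is assumed only because the 32-character hash shown under Verification has that length; the module's actual hash function and ordering of hashing versus writing are not stated here.

```python
import hashlib
import json
import os
from pathlib import Path


def write_pair(out_dir, text, meta):
    """Write {hash}.txt and {hash}.meta.json so readers never see partial files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: content hash is MD5 of the UTF-8 transcript text.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()

    text_final = out_dir / f"{digest}.txt"
    meta_final = out_dir / f"{digest}.meta.json"
    text_tmp = out_dir / f"{digest}.txt.tmp"
    meta_tmp = out_dir / f"{digest}.meta.json.tmp"

    # Stage both files under .tmp names first.
    text_tmp.write_text(text, encoding="utf-8")
    meta_tmp.write_text(json.dumps(meta), encoding="utf-8")

    # os.replace is an atomic rename when source and target share a filesystem.
    # Meta goes first, so the content file never appears without its meta.
    os.replace(meta_tmp, meta_final)
    os.replace(text_tmp, text_final)
    return digest
```

Staging the temporary files in the destination directory keeps both renames on one filesystem, which is what makes them atomic.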
## Verification
- Import/compile: OK
- Dry run: `list_new_videos` returns new videos not in catalogue
- Real acquisition: hash `a8893f3757295e347cb5b529cae350ff` acquired and dispatched (returned 'duplicate' — already in catalogue from legacy ingest, confirming dedup works)
- Service restart: 7 threads, `peertube-acq` in thread list, 0 errors in 90-second window
- CLI: `recon ingest-peertube --stats` still works, `recon ingest-peertube` uses new path