# Phase 6d: PeerTube Acquisition Module

**Date:** 2026-04-15
**Commit:** 277110d (refactor branch)
**Status:** Complete

## What Changed

Created `lib/acquisition/peertube.py`, a new module that polls PeerTube for
video transcripts and writes them as flat file pairs into `data/acquired/stream/`
for the dispatcher to pick up. This replaces the `peertube_scanner_loop` removed
in Phase 5c-1.

### New File: `lib/acquisition/peertube.py` (~170 lines)

- `_build_known_sets(db)`: queries the catalogue for `source='stream.echo6.co'` and builds the UUID and title dedup sets
- `list_new_videos(db, config)`: calls `get_videos()`, filters against the known sets, and checks captions with rate limiting
- `acquire_one(video, caption_path, config)`: fetches the VTT, converts it to text, writes `.tmp` files, hashes, and renames atomically
- `acquire_batch(db, config)`: orchestrates list + acquire and returns `{acquired, skipped, errors}`
- `acquisition_loop(stop_event, db, config, interval)`: service loop that polls every `interval` seconds

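The write step described for `acquire_one` can be sketched roughly as follows. This is a minimal illustration, not the module's code: the function name `write_pair`, the random temporary stem, and the choice of MD5 for the file hash are all assumptions (the report does not say which hash is used).

```python
import hashlib
import json
import os
import uuid
from pathlib import Path


def write_pair(out_dir: Path, text: str, meta: dict) -> str:
    """Write {hash}.txt and {hash}.meta.json without exposing partial files."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Temporary names: the final names depend on the hash computed below.
    stem = uuid.uuid4().hex
    content_tmp = out_dir / f"{stem}.txt.tmp"
    meta_tmp = out_dir / f"{stem}.meta.json.tmp"

    data = text.encode("utf-8")
    content_tmp.write_bytes(data)
    meta_tmp.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    # Assumption: the MD5 hex digest of the transcript bytes names the pair.
    digest = hashlib.md5(data).hexdigest()

    # Rename meta first, then content: once the .txt exists, its meta is already there.
    os.replace(meta_tmp, out_dir / f"{digest}.meta.json")
    os.replace(content_tmp, out_dir / f"{digest}.txt")
    return digest
```

`os.replace` is atomic when source and destination are on the same filesystem, which is why the temporary files are created inside the target directory rather than in a system temp location.
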
### Edited: `recon.py`

- `cmd_service()`: added a `peertube-acq` thread running `acquisition_loop` (interval from config, default 1800s)
- `cmd_ingest_peertube()`: replaced the legacy `ingest_channel`/`ingest_all` calls with `acquire_batch`
- Simplified argparse: removed `--channel`, `--since`, `--enrich`, `--process`; kept `--stats`

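A service thread of this kind might look like the sketch below. It is an assumption-laden illustration: `poll_loop` and `start_acquisition_thread` are hypothetical names, and `run_batch` stands in for a call such as `acquire_batch(db, config)`. Only the thread name `peertube-acq`, the `poll_interval` key, and the 1800s default come from this report.

```python
import threading
from typing import Callable


def poll_loop(stop_event: threading.Event, run_batch: Callable[[], object], interval: float) -> int:
    """Run run_batch repeatedly until stop_event is set; return the number of runs."""
    runs = 0
    while not stop_event.is_set():
        try:
            run_batch()
        except Exception as exc:
            # One failed poll should not kill the service thread.
            print(f"peertube-acq: batch failed: {exc}")
        runs += 1
        # wait() returns early if the event is set, so shutdown is not delayed by the interval.
        stop_event.wait(interval)
    return runs


def start_acquisition_thread(config: dict, run_batch: Callable[[], object]):
    """Start the polling thread; the interval falls back to 1800 seconds."""
    interval = config.get("peertube", {}).get("poll_interval", 1800)
    stop_event = threading.Event()
    thread = threading.Thread(
        target=poll_loop,
        args=(stop_event, run_batch, interval),
        name="peertube-acq",
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

Using `stop_event.wait(interval)` instead of `time.sleep(interval)` lets a service restart interrupt a 30-minute pause immediately.
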
### Edited: `config.yaml`

- Added `poll_interval: 1800` (seconds) under the `peertube:` section

## Architecture

```
PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write)
    → data/acquired/stream/{hash}.txt + {hash}.meta.json
    → dispatcher _find_pairs() → transcript_processor pre_flight()
    → enrich → embed → complete
```

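To make the hand-off to the dispatcher concrete, here is a hypothetical sketch of how complete pairs could be discovered in `data/acquired/stream/`. It is not the dispatcher's actual `_find_pairs()`; the function name `find_complete_pairs` and the matching rule (a `.txt` file counts only if its `.meta.json` sibling exists) are assumptions.

```python
from pathlib import Path


def find_complete_pairs(stream_dir: Path) -> list[tuple[Path, Path]]:
    """Return (content, meta) paths for every .txt file that has a matching .meta.json."""
    pairs = []
    # "*.txt" does not match in-progress "*.txt.tmp" files, so those are skipped.
    for content in sorted(stream_dir.glob("*.txt")):
        meta = content.with_name(content.stem + ".meta.json")
        if meta.exists():
            pairs.append((content, meta))
    return pairs
```
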
## Key Design Decisions

1. **No DB writes in acquisition:** `acquire_one` only writes files. `pre_flight()` handles catalogue registration.
2. **Atomic writes:** files are written with a `.tmp` suffix, then renamed, meta first and content second. By the time the content file appears, its meta file is already in place, so the dispatcher only sees complete pairs.
3. **Two dedup cohorts:** a UUID set (from URL paths) and a title set (from the filename column) together cover both legacy and new catalogue entries.
4. **Rate limiting:** a 0.5s delay between caption API calls avoids PeerTube 429 responses.

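Decisions 3 and 4 can be illustrated together. The sketch below is a simplified stand-in for the filtering in `list_new_videos`, under several assumptions: `filter_new_videos` is a hypothetical name, videos are dicts with `uuid` and `name` keys, and `has_captions` represents whatever call checks the caption API.

```python
import time
from typing import Callable, Iterable


def filter_new_videos(
    videos: Iterable[dict],
    known_uuids: set,
    known_titles: set,
    has_captions: Callable[[dict], bool],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Drop videos already catalogued by UUID or title, then keep those with captions."""
    new = []
    for video in videos:
        # Either cohort is enough to treat the video as already known.
        if video["uuid"] in known_uuids or video["name"] in known_titles:
            continue
        if has_captions(video):
            new.append(video)
        # Pause after every caption check so consecutive API calls are spaced out.
        sleep(delay)
    return new
```

Known videos are skipped before the caption check, so the delay is only paid for videos that actually trigger an API call.
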
## Verification

- Import/compile: OK
- Dry run: `list_new_videos` returns new videos not in the catalogue
- Real acquisition: hash `a8893f3757295e347cb5b529cae350ff` acquired and dispatched (returned 'duplicate' because it was already in the catalogue from the legacy ingest, confirming dedup works)
- Service restart: 7 threads, `peertube-acq` in thread list, 0 errors in 90-second window
- CLI: `recon ingest-peertube --stats` still works, `recon ingest-peertube` uses the new path