Phase 6d: PeerTube Acquisition Module
Date: 2026-04-15 Commit: 277110d (refactor branch) Status: Complete
What Changed
Created lib/acquisition/peertube.py — a new module that polls PeerTube for
video transcripts and writes them as flat file pairs into data/acquired/stream/
for the dispatcher to pick up. This replaces the peertube_scanner_loop removed
in Phase 5c-1.
New File: lib/acquisition/peertube.py (~170 lines)
- _build_known_sets(db): queries the catalogue for source='stream.echo6.co' and builds the UUID and title dedup sets
- list_new_videos(db, config): calls get_videos(), filters against the known sets, and checks captions with rate limiting
- acquire_one(video, caption_path, config): fetches the VTT, converts it to text, writes .tmp files, hashes, and renames atomically
- acquire_batch(db, config): orchestrates list + acquire and returns {acquired, skipped, errors}
- acquisition_loop(stop_event, db, config, interval): service loop, polls every interval seconds
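The filtering step in list_new_videos can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes each video is a dict with "uuid" and "name" keys, and the has_captions callable is a hypothetical stand-in for the caption API check.

```python
import time
from typing import Callable, Iterable


def filter_new_videos(
    videos: Iterable[dict],
    known_uuids: set,
    known_titles: set,
    has_captions: Callable[[dict], bool],
    delay: float = 0.5,
) -> list:
    """Return videos that are not yet known and that have captions.

    Assumed shape: each video is a dict with "uuid" and "name" keys.
    """
    new_videos = []
    for video in videos:
        # Skip anything already catalogued under either dedup key.
        if video["uuid"] in known_uuids or video["name"] in known_titles:
            continue
        # Only unknown videos cost a caption lookup; pause after each
        # lookup to stay under the server's rate limit.
        captioned = has_captions(video)
        time.sleep(delay)
        if captioned:
            new_videos.append(video)
    return new_videos
```

Checking the known sets before the caption lookup means already-catalogued videos never trigger an API call, so the delay is only paid for genuine candidates.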
Edited: recon.py
- cmd_service(): added a peertube-acq thread running acquisition_loop (interval from config, default 1800s)
- cmd_ingest_peertube(): replaced the legacy ingest_channel/ingest_all with acquire_batch
- Simplified argparse: removed --channel, --since, --enrich, --process; kept --stats
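A stoppable polling loop of the kind described for the peertube-acq thread might look like the sketch below. The names poll_interval_from and run_loop are hypothetical, the config is assumed to be a nested dict, and run_batch stands in for a call such as acquire_batch(db, config).

```python
import threading
from typing import Callable

DEFAULT_POLL_INTERVAL = 1800  # seconds; the default stated in this report


def poll_interval_from(config: dict) -> int:
    """Read peertube.poll_interval, falling back to the default."""
    peertube = config.get("peertube", {})
    return int(peertube.get("poll_interval", DEFAULT_POLL_INTERVAL))


def run_loop(
    stop_event: threading.Event,
    run_batch: Callable[[], dict],
    interval: float,
) -> int:
    """Call run_batch until stop_event is set; return the number of passes."""
    passes = 0
    while not stop_event.is_set():
        try:
            run_batch()
        except Exception as exc:
            # Keep the thread alive across one failed pass.
            print(f"acquisition pass failed: {exc}")
        passes += 1
        # Event.wait doubles as an interruptible sleep: it returns early
        # as soon as another thread sets the event.
        stop_event.wait(interval)
    return passes
```

Using stop_event.wait(interval) instead of time.sleep(interval) lets a service shutdown take effect immediately rather than after the remainder of a 30-minute sleep.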
Edited: config.yaml
- Added poll_interval: 1800 under the peertube: section
Architecture
PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write)
→ data/acquired/stream/{hash}.txt + {hash}.meta.json
→ dispatcher _find_pairs() → transcript_processor pre_flight()
→ enrich → embed → complete
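The hand-off between acquisition and the dispatcher relies on complete file pairs. The dispatcher's _find_pairs() is not shown in this report, so the following is only a sketch of one way such a scan could work, under the assumption that a pair is a {hash}.txt file next to a {hash}.meta.json file; find_complete_pairs is a hypothetical name.

```python
from pathlib import Path


def find_complete_pairs(stream_dir: Path) -> list:
    """Return (content_path, meta_path) tuples where both files exist."""
    pairs = []
    # The glob matches whole file names, so in-progress files ending
    # in .tmp are never picked up.
    for meta_path in sorted(stream_dir.glob("*.meta.json")):
        stem = meta_path.name[: -len(".meta.json")]
        content_path = stream_dir / f"{stem}.txt"
        if content_path.is_file():
            pairs.append((content_path, meta_path))
    return pairs
```

A meta file without its content file is simply skipped and picked up on a later scan once the second rename has landed.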
Key Design Decisions
- No DB writes in acquisition: acquire_one only writes files. pre_flight() handles catalogue registration.
- Atomic writes: .tmp suffix during writes, rename meta first then content. Dispatcher only sees complete pairs.
- Two dedup cohorts: UUID set (from URL paths) and title set (from filename column) cover both legacy and new catalogue entries.
- Rate limiting: 0.5s delay between caption API calls to avoid PeerTube 429s.
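The atomic-write decision can be illustrated with a short sketch. It is not the code in acquire_one: write_pair is a hypothetical helper, MD5 is an assumption chosen only because the hash quoted under Verification is 32 hex characters long, and the sketch hashes before writing for brevity.

```python
import hashlib
import json
import os
from pathlib import Path


def write_pair(out_dir: Path, text: str, meta: dict) -> str:
    """Write {hash}.txt and {hash}.meta.json atomically; return the hash."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: MD5 of the UTF-8 text; the real hash function is not stated.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    content_path = out_dir / f"{digest}.txt"
    meta_path = out_dir / f"{digest}.meta.json"
    content_tmp = out_dir / f"{digest}.txt.tmp"
    meta_tmp = out_dir / f"{digest}.meta.json.tmp"

    # Write both files fully under .tmp names first.
    content_tmp.write_text(text, encoding="utf-8")
    meta_tmp.write_text(json.dumps(meta), encoding="utf-8")

    # Rename meta first, then content. os.replace is atomic when source
    # and destination are on the same filesystem.
    os.replace(meta_tmp, meta_path)
    os.replace(content_tmp, content_path)
    return digest
```

Because the content file appears last, a reader that treats the .txt file as the signal for a ready pair never observes content without its metadata.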
Verification
- Import/compile: OK
- Dry run: list_new_videos returns new videos not in catalogue
- Real acquisition: hash a8893f3757295e347cb5b529cae350ff acquired and dispatched (returned 'duplicate' because it was already in the catalogue from legacy ingest, confirming dedup works)
- Service restart: 7 threads, peertube-acq in thread list, 0 errors in 90-second window
- CLI: recon ingest-peertube --stats still works, recon ingest-peertube uses new path