Replace on-disk concept file reads with Qdrant payload queries for
domain assignment. This unlocks assignment for ~10,120 items that had
missing or legacy-only concept files on disk while Qdrant held the
correct 18-domain taxonomy data.
Changes:
- domain_assigner.py: Replace _count_concept_domains (disk) with
_count_domains_from_qdrant and _count_domains_from_qdrant_batch
(Qdrant scroll queries). Add _get_qdrant_client helper. Remove
pass 3 defensive re-run (Qdrant reads are consistent). Add
no_concepts terminal status for zero-vector documents.
- embedder.py: Post-embed hook passes existing qdrant client to
compute_assignment, avoiding a second connection.
- recon.py: Backfill creates one QdrantClient for the batch. SQL
filter includes existing needs_reprocess items. Dry-run reports
no_concepts as a separate bucket. --reprocess-missing no longer
deletes concept dirs (assignment no longer reads from disk).
- docs/domain-assignment.md: Algorithm references Qdrant, documents
no_concepts status, removes pass 3 description.
Dry-run results: 20,453 assigned, 1,392 tied, 298 no_concepts,
0 needs_reprocess, 0 errors (previously 10,416 needs_reprocess).
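
For reference, the Qdrant-based counting described above could be sketched
roughly as follows. This is a minimal illustration, not the code in
domain_assigner.py: the payload key "domain", the caller-supplied filter,
and the majority-wins rule in classify() are assumptions. The client only
needs a scroll() method shaped like QdrantClient.scroll, which returns
(points, next_offset).

```python
from collections import Counter
from typing import Any, Optional, Tuple


def count_domains_from_qdrant(client: Any, collection: str,
                              doc_filter: Any, page_size: int = 256) -> Counter:
    """Tally payload domains for one document by paging through scroll()."""
    counts: Counter = Counter()
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection,
            scroll_filter=doc_filter,   # built by the caller (assumption)
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,         # only payloads are needed
        )
        for point in points:
            # "domain" is a hypothetical payload key.
            domain = (point.payload or {}).get("domain")
            if domain:
                counts[domain] += 1
        if offset is None:              # no further pages
            break
    return counts


def classify(counts: Counter) -> Tuple[str, Optional[str]]:
    """Map a tally to a status; the majority rule here is an assumption."""
    if not counts:
        return ("no_concepts", None)    # zero vectors for this document
    ranked = counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ("tied", None)
    return ("assigned", ranked[0][0])
```

Passing the client in, rather than constructing it inside the function,
mirrors the single-connection reuse noted for embedder.py and recon.py.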
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds assign-categories subcommand with flags:
  --backfill           Pass 1 domain assignment for all complete stream docs
  --tiebreaker-pass    Resolve ties via channel concept analysis
  --push-pending       Push assigned categories to PeerTube API (staged via --limit)
  --reprocess-missing  Re-queue items with missing/legacy concepts
  --dry-run            Preview without writes (enhanced for reprocess: shows
                       concept dir existence and file counts)
  --limit N            Cap processing count
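
The flag set above could be registered with argparse along these lines.
This is only a sketch: the flag names come from the list above, while
build_parser(), the prog name, and the help strings' exact wording are
illustrative, and whether the real parser makes any of the flags mutually
exclusive is not stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing an assign-categories subcommand."""
    parser = argparse.ArgumentParser(prog="recon.py")
    sub = parser.add_subparsers(dest="command", required=True)

    ac = sub.add_parser("assign-categories",
                        help="Domain assignment and category push")
    ac.add_argument("--backfill", action="store_true",
                    help="Pass 1 domain assignment for complete stream docs")
    ac.add_argument("--tiebreaker-pass", action="store_true",
                    help="Resolve ties via channel concept analysis")
    ac.add_argument("--push-pending", action="store_true",
                    help="Push assigned categories to the PeerTube API")
    ac.add_argument("--reprocess-missing", action="store_true",
                    help="Re-queue items with missing/legacy concepts")
    ac.add_argument("--dry-run", action="store_true",
                    help="Preview without writes")
    ac.add_argument("--limit", type=int, default=None, metavar="N",
                    help="Cap processing count")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```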
Includes pre-deletion audit logging for --reprocess-missing (logs path,
file count, and hash before each shutil.rmtree).
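
A pre-deletion audit of this kind might look like the sketch below. The
function name, the logger name, and the choice of SHA-256 over relative
paths plus file contents are assumptions; the message above only states
that path, file count, and a hash are logged before each shutil.rmtree.

```python
import hashlib
import logging
import shutil
from pathlib import Path

log = logging.getLogger("recon.audit")  # hypothetical logger name


def audit_and_remove(concept_dir: Path, dry_run: bool = False) -> dict:
    """Log path, file count, and a content hash, then delete the directory."""
    files = sorted(p for p in concept_dir.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    for f in files:
        # Hash the relative path and the bytes so renames change the digest.
        digest.update(f.relative_to(concept_dir).as_posix().encode("utf-8"))
        digest.update(f.read_bytes())
    record = {
        "path": str(concept_dir),
        "file_count": len(files),
        "sha256": digest.hexdigest(),
    }
    log.info("pre-deletion audit: %s", record)
    if not dry_run:
        shutil.rmtree(concept_dir)
    return record
```

Sorting the file list before hashing keeps the digest stable across runs
regardless of filesystem traversal order.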
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop.
Polls the PeerTube API every 30 min, dedupes against the catalogue (UUID + title),
writes flat file pairs to data/acquired/stream/ for the dispatcher.
- acquire_batch(): one-shot find-and-acquire with rate limiting
- acquisition_loop(): service thread wrapper (interval from config)
- list_new_videos(): dedup via _build_known_sets() against catalogue
- acquire_one(): fetch VTT, convert, write .tmp then rename atomically
cmd_service(): added peertube-acq daemon thread
cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/
--since/--enrich/--process (dispatcher handles full pipeline)
config.yaml: added peertube.poll_interval: 1800
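
The dedup and atomic-write steps above could be sketched as follows. This
is an illustration under stated assumptions, not the contents of
lib/acquisition/peertube.py: filter_new() and write_atomic() are
hypothetical names, videos are assumed to be dicts with "uuid" and "name"
keys, and title matching by stripped, case-folded comparison is a guess
at what "UUID + title" means.

```python
import os
from pathlib import Path
from typing import Iterable, List, Set


def filter_new(videos: Iterable[dict], known_uuids: Set[str],
               known_titles: Set[str]) -> List[dict]:
    """Drop videos already in the catalogue by UUID or normalized title."""
    fresh = []
    for video in videos:
        title_key = video["name"].strip().casefold()  # assumed normalization
        if video["uuid"] in known_uuids or title_key in known_titles:
            continue
        fresh.append(video)
    return fresh


def write_atomic(dest: Path, text: str) -> Path:
    """Write to dest.tmp, flush to disk, then rename over dest."""
    tmp = dest.with_name(dest.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic when tmp and dest are on the same filesystem,
    # so a reader of the directory never sees a half-written file.
    os.replace(tmp, dest)
    return dest
```

Writing the temporary file next to its destination keeps both on the same
filesystem, which is what makes the final rename atomic.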
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
recon.py:
- Remove extract stage_loop thread from cmd_service(). Confirmed
vestigial: 0 queued items and silent logs over a 24+ hour run. The new
processors do extraction inline in pre_flight().
- Remove cmd_crawl CLI subcommand and its argparse registration.
- Clean up associated imports and variables.
Deleted:
- lib/crawler.py (432 lines) -- old web crawler subsystem, only
referenced by the removed CLI subcommand.
- 24 .bak files (untracked pre-edit safety backups, originals
preserved in git history).
Investigation found the four old loop function definitions
(scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop) were already deleted in Phase 5c-1.
Modules investigated and KEPT:
- lib/web_scraper.py -- exports chunk_text() used by transcript_processor
- lib/new_pipeline.py -- active Stream B library management CLI tool
- lib/peertube_scraper.py -- only mechanism for transcript ingestion
- lib/extractor.py -- would activate for new PDFs via cmd_run CLI
Service restart verified: 6 threads (dispatcher, enrich, embed,
filing, progress, dashboard), no extract worker, zero errors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds dispatch_loop() alongside dispatch_once() for service-thread use.
Adds filing_worker_loop() that watches for status=complete items in
/opt/recon/data/processing/ and files them to library/Domain/Subdomain/.
Rewires cmd_service() to run the new architecture:
- Removed: scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop (all replaced by dispatcher + new filing worker)
- Kept: enrich and embed stage workers, progress, dashboard
- Kept (vestigial): extract stage worker -- will be removed in Phase 6 cleanup
- Added: dispatcher loop thread, filing worker thread
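
The relationship between a one-shot function such as dispatch_once() and
its service-thread wrapper could be sketched like this. run_loop() and
start_daemon() are hypothetical helpers, and the use of a threading.Event
for shutdown is an assumption; the actual dispatch_loop() may be
structured differently.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("recon.service")  # hypothetical logger name


def run_loop(step: Callable[[], None], interval: float,
             stop: threading.Event) -> None:
    """Call step() repeatedly until stop is set, sleeping between passes."""
    while not stop.is_set():
        try:
            step()
        except Exception:
            # One failed pass should not kill the service thread.
            log.exception("loop iteration failed; retrying in %ss", interval)
        # wait() doubles as an interruptible sleep.
        stop.wait(interval)


def start_daemon(name: str, step: Callable[[], None], interval: float,
                 stop: threading.Event) -> threading.Thread:
    """Start run_loop on a named daemon thread and return it."""
    thread = threading.Thread(target=run_loop, args=(step, interval, stop),
                              name=name, daemon=True)
    thread.start()
    return thread
```

Using Event.wait() instead of time.sleep() lets a shutdown request take
effect immediately rather than after the full interval.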
Phase 5c-1 of the refactor. Service not yet started; Phase 5c-2 will do that.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>