recon/docs/domain-assignment.md
Matt 3b37d96c4d Switch domain assignment to Qdrant as source of truth
Replace on-disk concept file reads with Qdrant payload queries for
domain assignment. This unlocks assignment for ~10,120 items that had
missing or legacy-only concept files on disk while Qdrant held the
correct 18-domain taxonomy data.

Changes:
- domain_assigner.py: Replace _count_concept_domains (disk) with
  _count_domains_from_qdrant and _count_domains_from_qdrant_batch
  (Qdrant scroll queries). Add _get_qdrant_client helper. Remove
  pass 3 defensive re-run (Qdrant reads are consistent). Add
  no_concepts terminal status for zero-vector documents.
- embedder.py: Post-embed hook passes existing qdrant client to
  compute_assignment, avoiding a second connection.
- recon.py: Backfill creates one QdrantClient for the batch. SQL
  filter includes existing needs_reprocess items. Dry-run reports
  no_concepts as separate bucket. --reprocess-missing removes
  concept-dir deletion step (no longer reads from disk).
- docs/domain-assignment.md: Algorithm references Qdrant, documents
  no_concepts status, removes pass 3 description.

Dry-run results: 20,453 assigned, 1,392 tied, 298 no_concepts,
0 needs_reprocess, 0 errors (previously 10,416 needs_reprocess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 03:59:06 +00:00

Domain Assignment — Algorithm & Operations Guide

Overview

RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concept vectors stored in Qdrant. Assignments are pushed to PeerTube as category metadata via a custom plugin.

Data Source

Domain counts are read from the domain payload field on concept vectors in Qdrant (recon_knowledge_hybrid collection on cortex:6333). Each concept vector has a domain string in its payload, set during enrichment and validated at embed time. This provides 100% coverage for all embedded documents with zero legacy domain residue.

Previously, domain counts were read from on-disk concept JSON files (data/concepts/{hash}/window_*.json). That approach was replaced by Qdrant queries on 2026-04-28 because ~10,120 items had missing or legacy-only concept files on disk, while Qdrant held the correct data for them.
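The Qdrant read described in this section can be sketched as a paginated scroll that tallies the domain payload field. This is a minimal sketch, not the code in domain_assigner.py: the function name count_domains, the page size, and the duck-typed client argument are assumptions made for illustration. The client is expected to behave like qdrant_client.QdrantClient, whose scroll() returns a (points, next_offset) tuple.

```python
from collections import Counter
from typing import Any, Iterable

COLLECTION = "recon_knowledge_hybrid"  # collection name taken from this guide


def count_domains(client: Any, scroll_filter: Any,
                  valid_domains: Iterable[str], page_size: int = 256) -> Counter:
    """Tally the `domain` payload over every point matching scroll_filter.

    Hypothetical helper. `client` is anything with a QdrantClient-style
    scroll() method; `scroll_filter` is a prebuilt filter, for example:

        from qdrant_client import QdrantClient, models
        client = QdrantClient(host="cortex", port=6333)
        scroll_filter = models.Filter(must=[models.FieldCondition(
            key="doc_hash", match=models.MatchValue(value=doc_hash))])
    """
    valid = set(valid_domains)
    counts: Counter = Counter()
    offset = None
    while True:
        # scroll() returns one page of points plus the offset of the next page.
        points, offset = client.scroll(
            collection_name=COLLECTION,
            scroll_filter=scroll_filter,
            limit=page_size,
            offset=offset,
            with_payload=["domain"],  # fetch only the field we count
            with_vectors=False,
        )
        for point in points:
            domain = (point.payload or {}).get("domain")
            if domain in valid:  # drop missing or non-taxonomy labels
                counts[domain] += 1
        if offset is None:  # no further pages
            break
    return counts
```

Requesting only the domain payload and no vectors keeps each page small, which matters when the same loop is reused for a channel-wide batch query.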

Algorithm

Pass 1: Concept Domain Count (inline, per-document)

Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via --backfill.

  1. Query Qdrant for all points with doc_hash matching the document
  2. Count domain payload occurrences, filtering to VALID_DOMAINS only
  3. If zero concept vectors → no_concepts (terminal)
  4. If single top domain → assigned
  5. If tied → tied_pass_1 (deferred to tiebreaker)
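The decision in steps 3 to 5 can be sketched as a pure function over the already-filtered domain counts. This is an illustrative sketch under assumptions: the function name decide_pass_1 and its return shape are hypothetical, and it treats "no valid-domain counts" the same as "zero concept vectors", which the real compute_assignment() may distinguish.

```python
from typing import List, Mapping, Optional, Tuple


def decide_pass_1(domain_counts: Mapping[str, int]) -> Tuple[str, Optional[str], List[str]]:
    """Return (status, winning_domain, tied_domains) for one document.

    Hypothetical helper; `domain_counts` is assumed to contain only
    VALID_DOMAINS entries (step 2 has already been applied).
    """
    counts = {d: n for d, n in domain_counts.items() if n > 0}
    if not counts:
        # Step 3: nothing to count, terminal status.
        return "no_concepts", None, []
    top = max(counts.values())
    leaders = sorted(d for d, n in counts.items() if n == top)
    if len(leaders) == 1:
        # Step 4: a single top domain wins outright.
        return "assigned", leaders[0], []
    # Step 5: two or more domains share the top count; defer to pass 2.
    return "tied_pass_1", None, leaders
```

Returning the tied domains alongside the status lets the tiebreaker work on exactly the candidates that pass 1 could not separate.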

Pass 2: Channel Tiebreaker (batch)

Runs via assign-categories --tiebreaker-pass.

For each tied_pass_1 document:

  1. Identify the tied domains from Qdrant
  2. Look up the document's channel (catalogue.category)
  3. Mega-channel rule: If channel has >500 videos, skip tiebreaking → tied_manual
  4. Query Qdrant for domain counts across all other videos in the same channel (single batch query with MatchAny filter)
  5. Among the tied domains only, pick the one with the highest channel-wide concept count
  6. If resolved → tied_pass_2
  7. If still tied → tied_manual (alphabetical fallback assigned, flagged for review)
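The tiebreaker steps above can be sketched as follows. This is a sketch under stated assumptions, not the body of run_tiebreaker_pass(): the function name break_tie and its arguments are hypothetical, the channel-wide counts are assumed to have been fetched already (for example with a MatchAny filter over the channel's other doc_hash values), the alphabetical fallback is assumed to be taken among the domains that remain tied, and mega-channel documents are assumed to receive no fallback domain.

```python
from typing import Mapping, Optional, Sequence, Tuple

MEGA_CHANNEL_THRESHOLD = 500  # ">500 videos" per the mega-channel rule


def break_tie(tied_domains: Sequence[str], channel_video_count: int,
              channel_domain_counts: Mapping[str, int]) -> Tuple[str, Optional[str]]:
    """Return (status, domain) for one tied_pass_1 document (hypothetical helper)."""
    if channel_video_count > MEGA_CHANNEL_THRESHOLD:
        # Step 3: mega-channels are not topically coherent; skip tiebreaking.
        return "tied_manual", None
    # Step 5: only the tied domains compete; other channel domains are ignored.
    scores = {d: channel_domain_counts.get(d, 0) for d in tied_domains}
    best = max(scores.values())
    leaders = sorted(d for d, n in scores.items() if n == best)
    if len(leaders) == 1:
        # Step 6: the channel context separates the candidates.
        return "tied_pass_2", leaders[0]
    # Step 7: still tied; take the alphabetical fallback and flag for review.
    return "tied_manual", leaders[0]
```

For example, break_tie(["b", "a"], 40, {"a": 10, "c": 99}) resolves to "a" with status tied_pass_2: domain "c" dominates the channel but was not among the tied candidates, so it is never considered.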

Mega-Channel Rule

Channels with >500 videos (like the "Transcript" catch-all with ~9,200 videos) are not topically coherent. Scanning their concepts produces meaningless aggregate data. These go straight to tied_manual for dashboard review.

Status Values

| Status | Meaning | Terminal? | Next Action |
|---|---|---|---|
| assigned | Clear winner from pass 1 | No | Push to PeerTube |
| tied_pass_1 | Concept tie, awaiting tiebreaker | No | Run --tiebreaker-pass |
| tied_pass_2 | Resolved by channel tiebreaker | No | Push to PeerTube |
| tied_manual | Needs human review | No | Review at /peertube/review |
| no_concepts | Zero concept vectors in Qdrant | Yes | None; typically non-topical content (vlogs, giveaways, announcements) |
| needs_reprocess | Transient failure (Qdrant error) | No | Run --reprocess-missing |
| manual_assigned | Human override from dashboard | No | Already pushed |

"Categorized" filter = {'assigned', 'tied_pass_2', 'manual_assigned'}

CLI Commands

cd /opt/recon && source venv/bin/activate

# Show current assignment status
python3 recon.py assign-categories

# Pass 1: backfill all unassigned complete stream documents
python3 recon.py assign-categories --backfill --dry-run
python3 recon.py assign-categories --backfill

# Pass 2: resolve ties via channel analysis
python3 recon.py assign-categories --tiebreaker-pass

# Push all assigned-but-unpushed categories to PeerTube API
python3 recon.py assign-categories --push-pending

# Re-queue items with transient failures for full re-processing
python3 recon.py assign-categories --reprocess-missing

# Limit processing count
python3 recon.py assign-categories --backfill --limit 100

Dashboard Review

The review UI at recon.echo6.co/peertube/review shows only tied_manual items. Each row displays:

  • Video title and channel
  • Top concept domains with counts
  • Dropdown to select the correct domain
  • Assign button (pushes to PeerTube immediately)

Items with no_concepts or needs_reprocess status do NOT appear in the review UI.

Pipeline Integration

New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in embedder.py:

  1. Runs compute_assignment() (pass 1 only), reusing the embedder's existing Qdrant client
  2. If clear winner: pushes category to PeerTube immediately
  3. If tied: marks as tied_pass_1 for the next tiebreaker batch run
  4. If no concepts: marks as no_concepts (terminal)
  5. On Qdrant error: logs warning and continues — does not block the pipeline
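The hook behaviour listed above can be sketched as a small wrapper. This is an illustrative sketch only: the signatures of compute_assignment and push_category, the set_status callback, and the logger name are assumptions, and the real hook in embedder.py may record errors differently (for example by setting needs_reprocess).

```python
import logging
from typing import Any, Callable, Optional, Tuple

log = logging.getLogger("recon.embedder")  # logger name is an assumption


def post_embed_hook(
    doc_hash: str,
    qdrant_client: Any,
    compute_assignment: Callable[[str, Any], Tuple[str, Optional[str]]],
    push_category: Callable[[str, str], None],
    set_status: Callable[[str, str, Optional[str]], None],
) -> Optional[str]:
    """Run pass 1 for a freshly embedded document; never raise (hypothetical sketch)."""
    try:
        # Reuse the embedder's existing client instead of opening a second connection.
        status, domain = compute_assignment(doc_hash, qdrant_client)
    except Exception as exc:  # broad on purpose: the pipeline must not be blocked
        log.warning("domain assignment skipped for %s: %s", doc_hash, exc)
        return None
    # Persist assigned, tied_pass_1 or no_concepts as returned by pass 1.
    set_status(doc_hash, status, domain)
    if status == "assigned" and domain is not None:
        # Clear winner: push the category to PeerTube immediately.
        push_category(doc_hash, domain)
    return status
```

Passing the collaborators in as callables keeps the sketch testable without a live Qdrant or PeerTube instance; tied documents simply wait for the next --tiebreaker-pass run.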

Source Files

| File | Purpose |
|---|---|
| lib/recon_domains.py | Domain↔Category ID mapping, VALID_DOMAINS |
| lib/domain_assigner.py | compute_assignment() + run_tiebreaker_pass() + Qdrant helpers |
| lib/peertube_writer.py | OAuth2 client, push_category(), push_pending() |
| lib/embedder.py | Post-embed hook (passes qdrant client) |
| lib/status.py | DB columns + helper methods |
| lib/api.py | Dashboard review routes |
| recon.py | CLI assign-categories command |