mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-05-20 06:34:40 +02:00
Switch domain assignment to Qdrant as source of truth
Replace on-disk concept file reads with Qdrant payload queries for domain assignment. This unlocks assignment for ~10,120 items that had missing or legacy-only concept files on disk while Qdrant held the correct 18-domain taxonomy data.

Changes:
- domain_assigner.py: Replace _count_concept_domains (disk) with _count_domains_from_qdrant and _count_domains_from_qdrant_batch (Qdrant scroll queries). Add _get_qdrant_client helper. Remove pass 3 defensive re-run (Qdrant reads are consistent). Add no_concepts terminal status for zero-vector documents.
- embedder.py: Post-embed hook passes existing qdrant client to compute_assignment, avoiding a second connection.
- recon.py: Backfill creates one QdrantClient for the batch. SQL filter includes existing needs_reprocess items. Dry-run reports no_concepts as separate bucket. --reprocess-missing removes concept-dir deletion step (no longer reads from disk).
- docs/domain-assignment.md: Algorithm references Qdrant, documents no_concepts status, removes pass 3 description.

Dry-run results: 20,453 assigned, 1,392 tied, 298 no_concepts, 0 needs_reprocess, 0 errors (previously 10,416 needs_reprocess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent c04ccc5011
commit 3b37d96c4d
4 changed files with 186 additions and 135 deletions
@@ -2,7 +2,13 @@
 
 ## Overview
 
-RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concepts extracted from its transcript. Assignments are pushed to PeerTube as category metadata via a custom plugin.
+RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concept vectors stored in Qdrant. Assignments are pushed to PeerTube as category metadata via a custom plugin.
 
+## Data Source
+
+Domain counts are read from the `domain` payload field on concept vectors in Qdrant (`recon_knowledge_hybrid` collection on cortex:6333). Each concept vector has a `domain` string in its payload, set during enrichment and validated at embed time. This provides 100% coverage for all embedded documents with zero legacy domain residue.
+
+Previously, domain counts were read from on-disk concept JSON files (`data/concepts/{hash}/window_*.json`). This was replaced with Qdrant queries on 2026-04-28 because ~10,000 items had missing or legacy-only concept files on disk while Qdrant had the correct data.
+
 ## Algorithm
 
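
Illustrative sketch, not part of this commit: the Data Source paragraph in the hunk above describes reading the `domain` payload field from concept vectors, which could be done with a paginated scroll roughly as follows. The names `build_doc_filter` and `count_domains_via_scroll` are hypothetical and are not the `_count_domains_from_qdrant` helper the commit adds; the sketch only assumes the qdrant-client calls `QdrantClient.scroll`, `models.Filter`, `models.FieldCondition`, and `models.MatchValue`, and that points carry `doc_hash` and `domain` payload keys as the doc states.

```python
from collections import Counter


def build_doc_filter(doc_hash):
    """Filter matching one document's concept vectors (requires qdrant-client)."""
    # Imported lazily so the counting logic below can be exercised without the library.
    from qdrant_client import models

    return models.Filter(
        must=[
            models.FieldCondition(
                key="doc_hash", match=models.MatchValue(value=doc_hash)
            )
        ]
    )


def count_domains_via_scroll(client, collection, scroll_filter, page_size=256):
    """Count `domain` payload values over every point matched by the filter.

    `client` is anything exposing a QdrantClient-style `scroll()` that returns
    `(points, next_offset)`; a `next_offset` of None marks the last page.
    """
    counts = Counter()
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection,
            scroll_filter=scroll_filter,
            limit=page_size,
            offset=offset,
            with_payload=["domain"],  # fetch only the field being counted
            with_vectors=False,       # vectors are not needed for counting
        )
        for point in points:
            domain = (point.payload or {}).get("domain")
            if domain:
                counts[domain] += 1
        if offset is None:
            return counts
```

Under these assumptions a call might look like `count_domains_via_scroll(QdrantClient(host="cortex", port=6333), "recon_knowledge_hybrid", build_doc_filter(doc_hash))`. Following the scroll offset until it is `None` matters because a single scroll call returns at most `limit` points.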
@@ -10,9 +16,9 @@ RECON's domain assignment feature maps each PeerTube video to one of 18 knowledg
 
 Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via `--backfill`.
 
-1. Read all `data/concepts/{hash}/window_*.json` files
-2. Count domain occurrences across all concepts, filtering to `VALID_DOMAINS` only (skips legacy domains)
-3. If no valid concepts → `needs_reprocess`
+1. Query Qdrant for all points with `doc_hash` matching the document
+2. Count `domain` payload occurrences, filtering to `VALID_DOMAINS` only
+3. If zero concept vectors → `no_concepts` (terminal)
 4. If single top domain → `assigned`
 5. If tied → `tied_pass_1` (deferred to tiebreaker)
 
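
Illustrative sketch, not part of this commit: steps 2 to 5 of the pass 1 list above reduce to a small pure function once the domain counts are in hand. The function name `assign_pass_1`, its return shape, and the three domain names are hypothetical; the real `VALID_DOMAINS` lives in `lib/recon_domains.py` and has 18 entries. The sketch also assumes that a document whose vectors all carry non-valid domains is treated like one with no vectors, which the doc does not spell out.

```python
from collections import Counter

# Hypothetical subset for illustration; the real set has 18 domains.
VALID_DOMAINS = {"medicine", "agriculture", "energy"}


def assign_pass_1(domain_counts):
    """Return (status, winning_domain, leaders) for one document's counts."""
    # Step 2: keep only domains in the current taxonomy.
    valid = Counter(
        {d: n for d, n in domain_counts.items() if d in VALID_DOMAINS and n > 0}
    )
    # Step 3: nothing to count (assumption: "no valid counts" == "no vectors").
    if not valid:
        return "no_concepts", None, []
    top = max(valid.values())
    leaders = sorted(d for d, n in valid.items() if n == top)
    # Step 4: a single top domain wins outright.
    if len(leaders) == 1:
        return "assigned", leaders[0], leaders
    # Step 5: defer to the tiebreaker pass, remembering which domains tied.
    return "tied_pass_1", None, leaders
```

Returning the tied leaders alongside the status is a design choice of this sketch; it lets a later tiebreaker restrict itself to those domains without recounting.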
@@ -22,20 +28,13 @@ Runs via `assign-categories --tiebreaker-pass`.
 
 For each `tied_pass_1` document:
 
-1. Identify the tied domains
+1. Identify the tied domains from Qdrant
 2. Look up the document's channel (`catalogue.category`)
 3. **Mega-channel rule:** If channel has >500 videos, skip tiebreaking → `tied_manual`
-4. Read concept files for all other videos in the same channel
+4. Query Qdrant for domain counts across all other videos in the same channel (single batch query with `MatchAny` filter)
 5. Among the tied domains only, pick the one with the highest channel-wide concept count
 6. If resolved → `tied_pass_2`
-7. If still tied → proceed to pass 3
-
-### Pass 3: Defensive Re-Run
-
-If pass 2 does not resolve the tie, re-read the same channel concept files and re-run identical counting logic. This catches concept-file changes that occurred mid-run (e.g. concurrent enrichment writing new windows during the batch). In steady state, pass 3 produces the same result as pass 2, but under concurrent writes it can resolve a tie that pass 2 missed.
-
-- If resolved → `tied_pass_2` (same status — the column tracks "channel scan resolved it")
-- If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
+7. If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
 
 ### Mega-Channel Rule
 
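
Illustrative sketch, not part of this commit: the tiebreaker steps above can be read as one batch filter plus one decision function. `build_channel_filter` and `break_tie` are hypothetical names, not the `run_tiebreaker_pass()` implementation. Two assumptions go beyond the text: the alphabetical fallback is taken over the original tied domains, and the same fallback is applied when the mega-channel rule fires.

```python
MEGA_CHANNEL_THRESHOLD = 500  # from the doc: channels with >500 videos are skipped


def build_channel_filter(doc_hashes):
    """One filter covering many documents at once (requires qdrant-client)."""
    from qdrant_client import models

    return models.Filter(
        must=[
            models.FieldCondition(
                key="doc_hash", match=models.MatchAny(any=list(doc_hashes))
            )
        ]
    )


def break_tie(tied_domains, channel_video_count, channel_domain_counts):
    """Return (status, domain) for one `tied_pass_1` document."""
    fallback = min(tied_domains)  # alphabetical fallback (assumed scope)
    # Step 3: mega-channels are too broad to say anything about one video.
    if channel_video_count > MEGA_CHANNEL_THRESHOLD:
        return "tied_manual", fallback
    # Step 5: compare the tied domains only, using channel-wide counts.
    scores = {d: channel_domain_counts.get(d, 0) for d in tied_domains}
    best = max(scores.values())
    leaders = [d for d, n in scores.items() if n == best]
    # Steps 6 and 7: a unique leader resolves the tie, otherwise flag for review.
    if len(leaders) == 1:
        return "tied_pass_2", leaders[0]
    return "tied_manual", fallback
```

Domains outside the tied set are ignored even when they dominate the channel, which matches step 5's "among the tied domains only".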
@@ -43,14 +42,15 @@ Channels with >500 videos (like the "Transcript" catch-all with ~9,200 videos) a
 
 ## Status Values
 
-| Status | Meaning | Next Action |
-|--------|---------|-------------|
-| `assigned` | Clear winner from pass 1 | Push to PeerTube |
-| `tied_pass_1` | Concept tie, awaiting tiebreaker | Run `--tiebreaker-pass` |
-| `tied_pass_2` | Resolved by channel tiebreaker | Push to PeerTube |
-| `tied_manual` | Needs human review | Review at `/peertube/review` |
-| `needs_reprocess` | Missing concepts or only legacy domains | Run `--reprocess-missing` |
-| `manual_assigned` | Human override from dashboard | Already pushed |
+| Status | Meaning | Terminal? | Next Action |
+|--------|---------|-----------|-------------|
+| `assigned` | Clear winner from pass 1 | No | Push to PeerTube |
+| `tied_pass_1` | Concept tie, awaiting tiebreaker | No | Run `--tiebreaker-pass` |
+| `tied_pass_2` | Resolved by channel tiebreaker | No | Push to PeerTube |
+| `tied_manual` | Needs human review | No | Review at `/peertube/review` |
+| `no_concepts` | Zero concept vectors in Qdrant | **Yes** | None — typically non-topical content (vlogs, giveaways, announcements) |
+| `needs_reprocess` | Transient failure (Qdrant error) | No | Run `--reprocess-missing` |
+| `manual_assigned` | Human override from dashboard | No | Already pushed |
 
 **"Categorized" filter** = `{'assigned', 'tied_pass_2', 'manual_assigned'}`
 
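
Illustrative sketch, not part of this commit: the new status table and the "Categorized" filter above could be mirrored in code as plain lookups. The constant and function names are hypothetical; the status strings, the terminal flag, and the next actions are copied from the table.

```python
# Statuses that count as "categorized", as listed under the table.
CATEGORIZED = frozenset({"assigned", "tied_pass_2", "manual_assigned"})

# Only `no_concepts` is marked terminal in the new table.
TERMINAL = frozenset({"no_concepts"})

# Next action per status; None means there is nothing left to do.
NEXT_ACTION = {
    "assigned": "Push to PeerTube",
    "tied_pass_1": "Run --tiebreaker-pass",
    "tied_pass_2": "Push to PeerTube",
    "tied_manual": "Review at /peertube/review",
    "no_concepts": None,
    "needs_reprocess": "Run --reprocess-missing",
    "manual_assigned": "Already pushed",
}


def is_categorized(status):
    """True when the status means a domain has been settled for the video."""
    return status in CATEGORIZED


def next_action(status):
    """Look up the follow-up step; raises KeyError for an unknown status."""
    return NEXT_ACTION[status]
```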
@@ -72,7 +72,7 @@ python3 recon.py assign-categories --tiebreaker-pass
 # Push all assigned-but-unpushed categories to PeerTube API
 python3 recon.py assign-categories --push-pending
 
-# Re-queue items with missing/legacy concepts
+# Re-queue items with transient failures for full re-processing
 python3 recon.py assign-categories --reprocess-missing
 
 # Limit processing count
@@ -87,25 +87,26 @@ The review UI at `recon.echo6.co/peertube/review` shows only `tied_manual` items
 - Dropdown to select the correct domain
 - Assign button (pushes to PeerTube immediately)
 
-Items with `needs_reprocess` status do NOT appear in the review UI — they are handled exclusively via the CLI `--reprocess-missing` command.
+Items with `no_concepts` or `needs_reprocess` status do NOT appear in the review UI.
 
 ## Pipeline Integration
 
 New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in `embedder.py`:
 
-1. Runs `compute_assignment()` (pass 1 only)
+1. Runs `compute_assignment()` (pass 1 only), reusing the embedder's existing Qdrant client
 2. If clear winner: pushes category to PeerTube immediately
 3. If tied: marks as `tied_pass_1` for the next tiebreaker batch run
-4. On error: logs warning and continues — does not block the pipeline
+4. If no concepts: marks as `no_concepts` (terminal)
+5. On Qdrant error: logs warning and continues — does not block the pipeline
 
 ## Source Files
 
 | File | Purpose |
 |------|---------|
 | `lib/recon_domains.py` | Domain↔Category ID mapping, VALID_DOMAINS |
-| `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` |
+| `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` + Qdrant helpers |
 | `lib/peertube_writer.py` | OAuth2 client, `push_category()`, `push_pending()` |
-| `lib/embedder.py` | Post-embed hook |
+| `lib/embedder.py` | Post-embed hook (passes qdrant client) |
 | `lib/status.py` | DB columns + helper methods |
 | `lib/api.py` | Dashboard review routes |
 | `recon.py` | CLI `assign-categories` command |
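
Illustrative sketch, not part of this commit: the Pipeline Integration steps above describe a hook that must never block the embed stage. The version below injects its collaborators as callables so it runs standalone; the signatures of `compute_assignment`, `push_category`, and `set_status` shown here are hypothetical and may differ from the real ones in `lib/`. Catching every exception is also an assumption, since the doc only mentions Qdrant errors.

```python
import logging

log = logging.getLogger("recon.sketch")


def post_embed_hook(doc_hash, qdrant_client, compute_assignment, push_category, set_status):
    """Run pass 1 for a freshly embedded document; return its status or None."""
    try:
        # Step 1: reuse the embedder's client instead of opening a second connection.
        status, domain = compute_assignment(doc_hash, qdrant_client)
    except Exception as exc:
        # Step 5: log and carry on so a failed assignment never stalls the pipeline.
        log.warning("domain assignment skipped for %s: %s", doc_hash, exc)
        return None
    # Steps 3 and 4: `tied_pass_1` and `no_concepts` are recorded and left alone.
    set_status(doc_hash, status, domain)
    # Step 2: only a clear winner is pushed straight away.
    if status == "assigned":
        push_category(doc_hash, domain)
    return status
```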