Mirror of https://github.com/zvx-echo6/recon.git, synced 2026-05-20 14:44:54 +02:00
Switch domain assignment to Qdrant as source of truth
Replace on-disk concept file reads with Qdrant payload queries for domain assignment. This unlocks assignment for ~10,120 items that had missing or legacy-only concept files on disk while Qdrant held the correct 18-domain taxonomy data.

Changes:
- domain_assigner.py: Replace _count_concept_domains (disk) with _count_domains_from_qdrant and _count_domains_from_qdrant_batch (Qdrant scroll queries). Add _get_qdrant_client helper. Remove pass 3 defensive re-run (Qdrant reads are consistent). Add no_concepts terminal status for zero-vector documents.
- embedder.py: Post-embed hook passes existing qdrant client to compute_assignment, avoiding a second connection.
- recon.py: Backfill creates one QdrantClient for the batch. SQL filter includes existing needs_reprocess items. Dry-run reports no_concepts as separate bucket. --reprocess-missing removes concept-dir deletion step (no longer reads from disk).
- docs/domain-assignment.md: Algorithm references Qdrant, documents no_concepts status, removes pass 3 description.

Dry-run results: 20,453 assigned, 1,392 tied, 298 no_concepts, 0 needs_reprocess, 0 errors (previously 10,416 needs_reprocess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
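
The decision logic this commit describes can be sketched without a Qdrant connection. The snippet below is a minimal illustration, not the code in `domain_assigner.py`: the function names `count_valid_domains`, `decide_pass_1`, and `break_tie` are hypothetical, the three domain names are placeholders for the real 18-domain `VALID_DOMAINS` set, and the pass 1 tie check is an assumption because that part of `compute_assignment()` falls outside the diff hunks shown here.

```python
from collections import Counter

# Placeholder for VALID_DOMAINS in lib/recon_domains.py (hypothetical values;
# the real taxonomy has 18 entries that this commit does not list).
EXAMPLE_VALID_DOMAINS = {"medicine", "navigation", "shelter"}


def count_valid_domains(payload_domains, valid_domains):
    """Count domain values; each payload may hold a string or a list."""
    counter = Counter()
    for dom in payload_domains:
        values = dom if isinstance(dom, list) else [dom]
        for d in values:
            # Anything outside the valid set (legacy names, None) is skipped.
            if isinstance(d, str) and d in valid_domains:
                counter[d] += 1
    return counter


def decide_pass_1(counter):
    """Map a domain Counter to a (domain, status) tuple for pass 1."""
    if not counter:
        return (None, "no_concepts")  # terminal: nothing to count
    top = counter.most_common(2)
    # Assumed tie rule: the two highest counts are equal.
    if len(top) > 1 and top[0][1] == top[1][1]:
        return (None, "tied_pass_1")
    return (top[0][0], "assigned")


def break_tie(tied_domains, channel_counts):
    """Pick the tied domain with the highest channel-wide count."""
    best = max(channel_counts.get(d, 0) for d in tied_domains)
    winners = [d for d in tied_domains if channel_counts.get(d, 0) == best]
    if len(winners) == 1:
        return (winners[0], "tied_pass_2")
    # Still tied: alphabetical fallback, flagged for manual review.
    return (sorted(tied_domains)[0], "tied_manual")
```

In the real pipeline the `payload_domains` values would come from the `domain` payload of points returned by the Qdrant scroll, and `channel_counts` from the batch query over the other videos in the channel.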
parent c04ccc5011
commit 3b37d96c4d
4 changed files with 186 additions and 135 deletions
|
|
@ -2,7 +2,13 @@
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concepts extracted from its transcript. Assignments are pushed to PeerTube as category metadata via a custom plugin.
|
RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concept vectors stored in Qdrant. Assignments are pushed to PeerTube as category metadata via a custom plugin.
|
||||||
|
|
||||||
|
## Data Source
|
||||||
|
|
||||||
|
Domain counts are read from the `domain` payload field on concept vectors in Qdrant (`recon_knowledge_hybrid` collection on cortex:6333). Each concept vector has a `domain` string in its payload, set during enrichment and validated at embed time. This provides 100% coverage for all embedded documents with zero legacy domain residue.
|
||||||
|
|
||||||
|
Previously, domain counts were read from on-disk concept JSON files (`data/concepts/{hash}/window_*.json`). This was replaced with Qdrant queries on 2026-04-28 because ~10,000 items had missing or legacy-only concept files on disk while Qdrant had the correct data.
|
||||||
|
|
||||||
## Algorithm
|
## Algorithm
|
||||||
|
|
||||||
|
|
@ -10,9 +16,9 @@ RECON's domain assignment feature maps each PeerTube video to one of 18 knowledg
|
||||||
|
|
||||||
Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via `--backfill`.
|
Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via `--backfill`.
|
||||||
|
|
||||||
1. Read all `data/concepts/{hash}/window_*.json` files
|
1. Query Qdrant for all points with `doc_hash` matching the document
|
||||||
2. Count domain occurrences across all concepts, filtering to `VALID_DOMAINS` only (skips legacy domains)
|
2. Count `domain` payload occurrences, filtering to `VALID_DOMAINS` only
|
||||||
3. If no valid concepts → `needs_reprocess`
|
3. If zero concept vectors → `no_concepts` (terminal)
|
||||||
4. If single top domain → `assigned`
|
4. If single top domain → `assigned`
|
||||||
5. If tied → `tied_pass_1` (deferred to tiebreaker)
|
5. If tied → `tied_pass_1` (deferred to tiebreaker)
|
||||||
|
|
||||||
|
|
@ -22,20 +28,13 @@ Runs via `assign-categories --tiebreaker-pass`.
|
||||||
|
|
||||||
For each `tied_pass_1` document:
|
For each `tied_pass_1` document:
|
||||||
|
|
||||||
1. Identify the tied domains
|
1. Identify the tied domains from Qdrant
|
||||||
2. Look up the document's channel (`catalogue.category`)
|
2. Look up the document's channel (`catalogue.category`)
|
||||||
3. **Mega-channel rule:** If channel has >500 videos, skip tiebreaking → `tied_manual`
|
3. **Mega-channel rule:** If channel has >500 videos, skip tiebreaking → `tied_manual`
|
||||||
4. Read concept files for all other videos in the same channel
|
4. Query Qdrant for domain counts across all other videos in the same channel (single batch query with `MatchAny` filter)
|
||||||
5. Among the tied domains only, pick the one with the highest channel-wide concept count
|
5. Among the tied domains only, pick the one with the highest channel-wide concept count
|
||||||
6. If resolved → `tied_pass_2`
|
6. If resolved → `tied_pass_2`
|
||||||
7. If still tied → proceed to pass 3
|
7. If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
|
||||||
|
|
||||||
### Pass 3: Defensive Re-Run
|
|
||||||
|
|
||||||
If pass 2 does not resolve the tie, re-read the same channel concept files and re-run identical counting logic. This catches concept-file changes that occurred mid-run (e.g. concurrent enrichment writing new windows during the batch). In steady state, pass 3 produces the same result as pass 2, but under concurrent writes it can resolve a tie that pass 2 missed.
|
|
||||||
|
|
||||||
- If resolved → `tied_pass_2` (same status — the column tracks "channel scan resolved it")
|
|
||||||
- If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
|
|
||||||
|
|
||||||
### Mega-Channel Rule
|
### Mega-Channel Rule
|
||||||
|
|
||||||
|
|
@ -43,14 +42,15 @@ Channels with >500 videos (like the "Transcript" catch-all with ~9,200 videos) a
|
||||||
|
|
||||||
## Status Values
|
## Status Values
|
||||||
|
|
||||||
| Status | Meaning | Next Action |
|
| Status | Meaning | Terminal? | Next Action |
|
||||||
|--------|---------|-------------|
|
|--------|---------|-----------|-------------|
|
||||||
| `assigned` | Clear winner from pass 1 | Push to PeerTube |
|
| `assigned` | Clear winner from pass 1 | No | Push to PeerTube |
|
||||||
| `tied_pass_1` | Concept tie, awaiting tiebreaker | Run `--tiebreaker-pass` |
|
| `tied_pass_1` | Concept tie, awaiting tiebreaker | No | Run `--tiebreaker-pass` |
|
||||||
| `tied_pass_2` | Resolved by channel tiebreaker | Push to PeerTube |
|
| `tied_pass_2` | Resolved by channel tiebreaker | No | Push to PeerTube |
|
||||||
| `tied_manual` | Needs human review | Review at `/peertube/review` |
|
| `tied_manual` | Needs human review | No | Review at `/peertube/review` |
|
||||||
| `needs_reprocess` | Missing concepts or only legacy domains | Run `--reprocess-missing` |
|
| `no_concepts` | Zero concept vectors in Qdrant | **Yes** | None — typically non-topical content (vlogs, giveaways, announcements) |
|
||||||
| `manual_assigned` | Human override from dashboard | Already pushed |
|
| `needs_reprocess` | Transient failure (Qdrant error) | No | Run `--reprocess-missing` |
|
||||||
|
| `manual_assigned` | Human override from dashboard | No | Already pushed |
|
||||||
|
|
||||||
**"Categorized" filter** = `{'assigned', 'tied_pass_2', 'manual_assigned'}`
|
**"Categorized" filter** = `{'assigned', 'tied_pass_2', 'manual_assigned'}`
|
||||||
|
|
||||||
|
|
@ -72,7 +72,7 @@ python3 recon.py assign-categories --tiebreaker-pass
|
||||||
# Push all assigned-but-unpushed categories to PeerTube API
|
# Push all assigned-but-unpushed categories to PeerTube API
|
||||||
python3 recon.py assign-categories --push-pending
|
python3 recon.py assign-categories --push-pending
|
||||||
|
|
||||||
# Re-queue items with missing/legacy concepts
|
# Re-queue items with transient failures for full re-processing
|
||||||
python3 recon.py assign-categories --reprocess-missing
|
python3 recon.py assign-categories --reprocess-missing
|
||||||
|
|
||||||
# Limit processing count
|
# Limit processing count
|
||||||
|
|
@ -87,25 +87,26 @@ The review UI at `recon.echo6.co/peertube/review` shows only `tied_manual` items
|
||||||
- Dropdown to select the correct domain
|
- Dropdown to select the correct domain
|
||||||
- Assign button (pushes to PeerTube immediately)
|
- Assign button (pushes to PeerTube immediately)
|
||||||
|
|
||||||
Items with `needs_reprocess` status do NOT appear in the review UI — they are handled exclusively via the CLI `--reprocess-missing` command.
|
Items with `no_concepts` or `needs_reprocess` status do NOT appear in the review UI.
|
||||||
|
|
||||||
## Pipeline Integration
|
## Pipeline Integration
|
||||||
|
|
||||||
New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in `embedder.py`:
|
New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in `embedder.py`:
|
||||||
|
|
||||||
1. Runs `compute_assignment()` (pass 1 only)
|
1. Runs `compute_assignment()` (pass 1 only), reusing the embedder's existing Qdrant client
|
||||||
2. If clear winner: pushes category to PeerTube immediately
|
2. If clear winner: pushes category to PeerTube immediately
|
||||||
3. If tied: marks as `tied_pass_1` for the next tiebreaker batch run
|
3. If tied: marks as `tied_pass_1` for the next tiebreaker batch run
|
||||||
4. On error: logs warning and continues — does not block the pipeline
|
4. If no concepts: marks as `no_concepts` (terminal)
|
||||||
|
5. On Qdrant error: logs warning and continues — does not block the pipeline
|
||||||
|
|
||||||
## Source Files
|
## Source Files
|
||||||
|
|
||||||
| File | Purpose |
|
| File | Purpose |
|
||||||
|------|---------|
|
|------|---------|
|
||||||
| `lib/recon_domains.py` | Domain↔Category ID mapping, VALID_DOMAINS |
|
| `lib/recon_domains.py` | Domain↔Category ID mapping, VALID_DOMAINS |
|
||||||
| `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` |
|
| `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` + Qdrant helpers |
|
||||||
| `lib/peertube_writer.py` | OAuth2 client, `push_category()`, `push_pending()` |
|
| `lib/peertube_writer.py` | OAuth2 client, `push_category()`, `push_pending()` |
|
||||||
| `lib/embedder.py` | Post-embed hook |
|
| `lib/embedder.py` | Post-embed hook (passes qdrant client) |
|
||||||
| `lib/status.py` | DB columns + helper methods |
|
| `lib/status.py` | DB columns + helper methods |
|
||||||
| `lib/api.py` | Dashboard review routes |
|
| `lib/api.py` | Dashboard review routes |
|
||||||
| `recon.py` | CLI `assign-categories` command |
|
| `recon.py` | CLI `assign-categories` command |
|
||||||
|
|
|
||||||
|
|
@ -1,24 +1,30 @@
|
||||||
"""
|
"""
|
||||||
RECON Domain Assigner
|
RECON Domain Assigner
|
||||||
|
|
||||||
Computes per-video domain assignments from concept extraction results.
|
Computes per-video domain assignments from Qdrant vector payloads.
|
||||||
Two functions, two execution modes:
|
Two functions, two execution modes:
|
||||||
|
|
||||||
compute_assignment() — pass 1, inline from post-embed hook
|
compute_assignment() — pass 1, inline from post-embed hook
|
||||||
run_tiebreaker_pass() — batch, resolves ties via channel concept scan
|
run_tiebreaker_pass() — batch, resolves ties via channel concept scan
|
||||||
|
|
||||||
|
Data source: Qdrant `domain` payload field on concept vectors.
|
||||||
|
Previously read on-disk concept JSON files; migrated to Qdrant as
|
||||||
|
single source of truth (2026-04-28).
|
||||||
|
|
||||||
Status values written to documents.recon_domain_status:
|
Status values written to documents.recon_domain_status:
|
||||||
assigned — clear winner from pass 1 concept count
|
assigned — clear winner from pass 1 concept count
|
||||||
tied_pass_1 — concept tie, awaiting channel tiebreaker
|
tied_pass_1 — concept tie, awaiting channel tiebreaker
|
||||||
tied_pass_2 — resolved by channel tiebreaker
|
tied_pass_2 — resolved by channel tiebreaker
|
||||||
tied_manual — needs human review (dashboard)
|
tied_manual — needs human review (dashboard)
|
||||||
needs_reprocess — missing concepts or only legacy domains
|
no_concepts — terminal, zero concept vectors in Qdrant
|
||||||
|
needs_reprocess — transient failure (Qdrant error, etc.)
|
||||||
manual_assigned — human override from dashboard
|
manual_assigned — human override from dashboard
|
||||||
"""
|
"""
|
||||||
import json
|
|
||||||
import os
|
|
||||||
from collections import Counter
|
from collections import Counter
|
||||||
|
|
||||||
|
from qdrant_client import QdrantClient
|
||||||
|
from qdrant_client.models import Filter, FieldCondition, MatchValue, MatchAny
|
||||||
|
|
||||||
from .recon_domains import VALID_DOMAINS, DOMAIN_CATEGORY_MAP
|
from .recon_domains import VALID_DOMAINS, DOMAIN_CATEGORY_MAP
|
||||||
from .utils import setup_logging
|
from .utils import setup_logging
|
||||||
|
|
||||||
|
|
@ -28,40 +34,51 @@ logger = setup_logging('recon.domain_assigner')
|
||||||
MEGA_CHANNEL_THRESHOLD = 500
|
MEGA_CHANNEL_THRESHOLD = 500
|
||||||
|
|
||||||
|
|
||||||
def _count_concept_domains(concepts_dir, file_hash):
|
def _get_qdrant_client(config):
|
||||||
"""Read concept files and count valid domain occurrences.
|
"""Create a QdrantClient from RECON config.
|
||||||
|
|
||||||
|
Callers should create one client and pass it through rather than
|
||||||
|
calling this repeatedly.
|
||||||
|
"""
|
||||||
|
logger.debug("Creating new QdrantClient (caller did not pass one)")
|
||||||
|
return QdrantClient(
|
||||||
|
host=config['vector_db']['host'],
|
||||||
|
port=config['vector_db']['port'],
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _count_domains_from_qdrant(qdrant, collection, doc_hash):
|
||||||
|
"""Count valid domain occurrences for a single document from Qdrant.
|
||||||
|
|
||||||
|
Scrolls all points matching doc_hash and counts domain values.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
concepts_dir: Base concepts directory (e.g. /opt/recon/data/concepts)
|
qdrant: QdrantClient instance
|
||||||
file_hash: Document hash
|
collection: Qdrant collection name
|
||||||
|
doc_hash: Document hash to query
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Counter of {domain_name: count} for valid domains only,
|
Counter of {domain_name: count} for valid domains.
|
||||||
or None if no concept directory exists.
|
Empty Counter if no points found (never None).
|
||||||
"""
|
"""
|
||||||
doc_concepts_dir = os.path.join(concepts_dir, file_hash)
|
|
||||||
if not os.path.isdir(doc_concepts_dir):
|
|
||||||
return None
|
|
||||||
|
|
||||||
domain_counter = Counter()
|
domain_counter = Counter()
|
||||||
|
offset = None
|
||||||
|
|
||||||
for fname in os.listdir(doc_concepts_dir):
|
while True:
|
||||||
if not fname.startswith('window_') or not fname.endswith('.json'):
|
results, next_offset = qdrant.scroll(
|
||||||
continue
|
collection_name=collection,
|
||||||
fpath = os.path.join(doc_concepts_dir, fname)
|
scroll_filter=Filter(must=[
|
||||||
try:
|
FieldCondition(key="doc_hash", match=MatchValue(value=doc_hash))
|
||||||
with open(fpath, 'r') as f:
|
]),
|
||||||
concepts = json.load(f)
|
with_payload=["domain"],
|
||||||
except (json.JSONDecodeError, OSError):
|
with_vectors=False,
|
||||||
continue
|
limit=200,
|
||||||
|
offset=offset,
|
||||||
|
)
|
||||||
|
|
||||||
if not isinstance(concepts, list):
|
for point in results:
|
||||||
continue
|
dom = point.payload.get('domain')
|
||||||
|
|
||||||
for concept in concepts:
|
|
||||||
if not isinstance(concept, dict):
|
|
||||||
continue
|
|
||||||
dom = concept.get('domain')
|
|
||||||
if isinstance(dom, str) and dom in VALID_DOMAINS:
|
if isinstance(dom, str) and dom in VALID_DOMAINS:
|
||||||
domain_counter[dom] += 1
|
domain_counter[dom] += 1
|
||||||
elif isinstance(dom, list):
|
elif isinstance(dom, list):
|
||||||
|
|
@ -69,30 +86,95 @@ def _count_concept_domains(concepts_dir, file_hash):
|
||||||
if isinstance(d, str) and d in VALID_DOMAINS:
|
if isinstance(d, str) and d in VALID_DOMAINS:
|
||||||
domain_counter[d] += 1
|
domain_counter[d] += 1
|
||||||
|
|
||||||
|
if next_offset is None:
|
||||||
|
break
|
||||||
|
offset = next_offset
|
||||||
|
|
||||||
return domain_counter
|
return domain_counter
|
||||||
|
|
||||||
|
|
||||||
def compute_assignment(file_hash, db, config):
|
def _count_domains_from_qdrant_batch(qdrant, collection, doc_hashes):
|
||||||
|
"""Count valid domain occurrences across multiple documents from Qdrant.
|
||||||
|
|
||||||
|
Single scroll with MatchAny filter, with offset pagination for large
|
||||||
|
result sets.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
qdrant: QdrantClient instance
|
||||||
|
collection: Qdrant collection name
|
||||||
|
doc_hashes: List of document hashes to query
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Counter of {domain_name: count} aggregated across all matching points.
|
||||||
|
"""
|
||||||
|
if not doc_hashes:
|
||||||
|
return Counter()
|
||||||
|
|
||||||
|
domain_counter = Counter()
|
||||||
|
offset = None
|
||||||
|
|
||||||
|
while True:
|
||||||
|
results, next_offset = qdrant.scroll(
|
||||||
|
collection_name=collection,
|
||||||
|
scroll_filter=Filter(must=[
|
||||||
|
FieldCondition(key="doc_hash", match=MatchAny(any=doc_hashes))
|
||||||
|
]),
|
||||||
|
with_payload=["domain"],
|
||||||
|
with_vectors=False,
|
||||||
|
limit=10000,
|
||||||
|
offset=offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
for point in results:
|
||||||
|
dom = point.payload.get('domain')
|
||||||
|
if isinstance(dom, str) and dom in VALID_DOMAINS:
|
||||||
|
domain_counter[dom] += 1
|
||||||
|
elif isinstance(dom, list):
|
||||||
|
for d in dom:
|
||||||
|
if isinstance(d, str) and d in VALID_DOMAINS:
|
||||||
|
domain_counter[d] += 1
|
||||||
|
|
||||||
|
if next_offset is None:
|
||||||
|
break
|
||||||
|
offset = next_offset
|
||||||
|
|
||||||
|
return domain_counter
|
||||||
|
|
||||||
|
|
||||||
|
def compute_assignment(file_hash, db, config, qdrant=None):
|
||||||
"""Compute domain assignment for a single document (pass 1).
|
"""Compute domain assignment for a single document (pass 1).
|
||||||
|
|
||||||
Counts domain occurrences across all concepts. If a single domain
|
Counts domain occurrences across all concept vectors in Qdrant.
|
||||||
wins, assigns it. If tied, defers to batch tiebreaker.
|
If a single domain wins, assigns it. If tied, defers to batch
|
||||||
|
tiebreaker.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
file_hash: Document hash
|
file_hash: Document hash
|
||||||
db: StatusDB instance
|
db: StatusDB instance
|
||||||
config: RECON config dict
|
config: RECON config dict
|
||||||
|
qdrant: Optional QdrantClient (created if not provided)
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
(domain, status) tuple where domain is a string or None,
|
(domain, status) tuple where domain is a string or None,
|
||||||
and status is one of: 'assigned', 'tied_pass_1', 'needs_reprocess'
|
and status is one of: 'assigned', 'tied_pass_1', 'no_concepts',
|
||||||
|
'needs_reprocess'
|
||||||
"""
|
"""
|
||||||
concepts_dir = config['paths']['concepts']
|
owns_client = False
|
||||||
domain_counter = _count_concept_domains(concepts_dir, file_hash)
|
if qdrant is None:
|
||||||
|
qdrant = _get_qdrant_client(config)
|
||||||
|
owns_client = True
|
||||||
|
|
||||||
if domain_counter is None or len(domain_counter) == 0:
|
collection = config['vector_db']['collection']
|
||||||
|
|
||||||
|
try:
|
||||||
|
domain_counter = _count_domains_from_qdrant(qdrant, collection, file_hash)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Qdrant query failed for {file_hash[:12]}: {e}")
|
||||||
return (None, 'needs_reprocess')
|
return (None, 'needs_reprocess')
|
||||||
|
|
||||||
|
if len(domain_counter) == 0:
|
||||||
|
return (None, 'no_concepts')
|
||||||
|
|
||||||
top = domain_counter.most_common(2)
|
top = domain_counter.most_common(2)
|
||||||
top_domain = top[0][0]
|
top_domain = top[0][0]
|
||||||
top_count = top[0][1]
|
top_count = top[0][1]
|
||||||
|
|
@ -104,9 +186,9 @@ def compute_assignment(file_hash, db, config):
|
||||||
return (None, 'tied_pass_1')
|
return (None, 'tied_pass_1')
|
||||||
|
|
||||||
|
|
||||||
def _get_tied_domains(concepts_dir, file_hash):
|
def _get_tied_domains(qdrant, collection, file_hash):
|
||||||
"""Get the set of domains tied for first place in a document's concepts."""
|
"""Get the set of domains tied for first place in a document's concepts."""
|
||||||
domain_counter = _count_concept_domains(concepts_dir, file_hash)
|
domain_counter = _count_domains_from_qdrant(qdrant, collection, file_hash)
|
||||||
if not domain_counter:
|
if not domain_counter:
|
||||||
return []
|
return []
|
||||||
|
|
||||||
|
|
@ -150,32 +232,32 @@ def _channel_video_count(db, channel_name):
|
||||||
return row['cnt'] if row else 0
|
return row['cnt'] if row else 0
|
||||||
|
|
||||||
|
|
||||||
def run_tiebreaker_pass(db, config):
|
def run_tiebreaker_pass(db, config, qdrant=None):
|
||||||
"""Resolve tied domain assignments using channel-level concept analysis.
|
"""Resolve tied domain assignments using channel-level Qdrant analysis.
|
||||||
|
|
||||||
Processes all documents where recon_domain_status = 'tied_pass_1'.
|
Processes all documents where recon_domain_status = 'tied_pass_1'.
|
||||||
|
|
||||||
Pass 2: For each tied document, reads concept files from all other
|
For each tied document, queries Qdrant for domain counts from all
|
||||||
videos in the same channel and picks the tied domain with the highest
|
other videos in the same channel and picks the tied domain with the
|
||||||
channel-wide count.
|
highest channel-wide count.
|
||||||
|
|
||||||
Pass 3 (defensive re-run): Re-reads the same channel concept files a
|
Mega-channels (>500 videos) skip tiebreaking and go straight to
|
||||||
second time with identical logic. This catches concept-file changes
|
|
||||||
that occurred mid-run (e.g. concurrent enrichment writing new windows).
|
|
||||||
In steady state pass 3 produces the same result as pass 2, but under
|
|
||||||
concurrent writes it can resolve a tie that pass 2 missed.
|
|
||||||
|
|
||||||
Mega-channels (>500 videos) skip both passes and go straight to
|
|
||||||
'tied_manual' for dashboard review.
|
'tied_manual' for dashboard review.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
db: StatusDB instance
|
db: StatusDB instance
|
||||||
config: RECON config dict
|
config: RECON config dict
|
||||||
|
qdrant: Optional QdrantClient (created if not provided)
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Dict with counts: resolved, manual, skipped, errors
|
Dict with counts: resolved, manual, skipped, errors
|
||||||
"""
|
"""
|
||||||
concepts_dir = config['paths']['concepts']
|
owns_client = False
|
||||||
|
if qdrant is None:
|
||||||
|
qdrant = _get_qdrant_client(config)
|
||||||
|
owns_client = True
|
||||||
|
|
||||||
|
collection = config['vector_db']['collection']
|
||||||
tied_items = db.get_items_by_domain_status('tied_pass_1')
|
tied_items = db.get_items_by_domain_status('tied_pass_1')
|
||||||
|
|
||||||
stats = {'resolved': 0, 'manual': 0, 'skipped': 0, 'errors': 0, 'total': len(tied_items)}
|
stats = {'resolved': 0, 'manual': 0, 'skipped': 0, 'errors': 0, 'total': len(tied_items)}
|
||||||
|
|
@ -189,9 +271,9 @@ def run_tiebreaker_pass(db, config):
|
||||||
channel = item.get('category', '')
|
channel = item.get('category', '')
|
||||||
|
|
||||||
try:
|
try:
|
||||||
tied_domains = _get_tied_domains(concepts_dir, file_hash)
|
tied_domains = _get_tied_domains(qdrant, collection, file_hash)
|
||||||
if not tied_domains:
|
if not tied_domains:
|
||||||
db.set_domain_assignment(file_hash, None, 'needs_reprocess')
|
db.set_domain_assignment(file_hash, None, 'no_concepts')
|
||||||
stats['skipped'] += 1
|
stats['skipped'] += 1
|
||||||
continue
|
continue
|
||||||
|
|
||||||
|
|
@ -215,12 +297,9 @@ def run_tiebreaker_pass(db, config):
|
||||||
|
|
||||||
# Channel tiebreaker: count domains across all other videos in channel
|
# Channel tiebreaker: count domains across all other videos in channel
|
||||||
other_hashes = _channel_video_hashes(db, channel, exclude_hash=file_hash)
|
other_hashes = _channel_video_hashes(db, channel, exclude_hash=file_hash)
|
||||||
channel_domain_counts = Counter()
|
channel_domain_counts = _count_domains_from_qdrant_batch(
|
||||||
|
qdrant, collection, other_hashes
|
||||||
for other_hash in other_hashes:
|
)
|
||||||
other_counts = _count_concept_domains(concepts_dir, other_hash)
|
|
||||||
if other_counts:
|
|
||||||
channel_domain_counts.update(other_counts)
|
|
||||||
|
|
||||||
# Among tied domains only, pick highest channel-wide count
|
# Among tied domains only, pick highest channel-wide count
|
||||||
best_domain = None
|
best_domain = None
|
||||||
|
|
@ -231,48 +310,21 @@ def run_tiebreaker_pass(db, config):
|
||||||
best_count = c
|
best_count = c
|
||||||
best_domain = dom
|
best_domain = dom
|
||||||
|
|
||||||
# Pass 2: check if channel tiebreaker resolved it
|
# Check if channel tiebreaker resolved it
|
||||||
tied_at_channel = [d for d in tied_domains
|
tied_at_channel = [d for d in tied_domains
|
||||||
if channel_domain_counts.get(d, 0) == best_count]
|
if channel_domain_counts.get(d, 0) == best_count]
|
||||||
|
|
||||||
if len(tied_at_channel) == 1:
|
if len(tied_at_channel) == 1:
|
||||||
db.set_domain_assignment(file_hash, best_domain, 'tied_pass_2')
|
db.set_domain_assignment(file_hash, best_domain, 'tied_pass_2')
|
||||||
stats['resolved'] += 1
|
stats['resolved'] += 1
|
||||||
logger.debug(f" {file_hash[:12]}: resolved → {best_domain} (pass 2 channel tiebreaker)")
|
logger.debug(f" {file_hash[:12]}: resolved → {best_domain} (channel tiebreaker)")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# Pass 3: defensive re-run — re-count channel concepts to catch
|
# Still tied after channel scan — mark for manual review
|
||||||
# concept-file changes that occurred mid-run. Identical logic to
|
|
||||||
# pass 2; resolves races where files were written between the
|
|
||||||
# two reads.
|
|
||||||
channel_domain_counts_p3 = Counter()
|
|
||||||
for other_hash in other_hashes:
|
|
||||||
other_counts = _count_concept_domains(concepts_dir, other_hash)
|
|
||||||
if other_counts:
|
|
||||||
channel_domain_counts_p3.update(other_counts)
|
|
||||||
|
|
||||||
best_domain_p3 = None
|
|
||||||
best_count_p3 = -1
|
|
||||||
for dom in tied_domains:
|
|
||||||
c = channel_domain_counts_p3.get(dom, 0)
|
|
||||||
if c > best_count_p3:
|
|
||||||
best_count_p3 = c
|
|
||||||
best_domain_p3 = dom
|
|
||||||
|
|
||||||
tied_at_p3 = [d for d in tied_domains
|
|
||||||
if channel_domain_counts_p3.get(d, 0) == best_count_p3]
|
|
||||||
|
|
||||||
if len(tied_at_p3) == 1:
|
|
||||||
db.set_domain_assignment(file_hash, best_domain_p3, 'tied_pass_2')
|
|
||||||
stats['resolved'] += 1
|
|
||||||
logger.debug(f" {file_hash[:12]}: resolved → {best_domain_p3} (pass 3 defensive re-run)")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Still tied after pass 3 — mark for manual review
|
|
||||||
fallback = sorted(tied_domains)[0]
|
fallback = sorted(tied_domains)[0]
|
||||||
db.set_domain_assignment(file_hash, fallback, 'tied_manual')
|
db.set_domain_assignment(file_hash, fallback, 'tied_manual')
|
||||||
stats['manual'] += 1
|
stats['manual'] += 1
|
||||||
logger.debug(f" {file_hash[:12]}: still tied after pass 3, → tied_manual")
|
logger.debug(f" {file_hash[:12]}: still tied after channel scan, → tied_manual")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f" Tiebreaker error for {file_hash[:12]}: {e}")
|
logger.warning(f" Tiebreaker error for {file_hash[:12]}: {e}")
|
||||||
|
|
|
||||||
|
|
@ -411,7 +411,7 @@ def embed_single(file_hash, db, config):
|
||||||
from .domain_assigner import compute_assignment
|
from .domain_assigner import compute_assignment
|
||||||
from .peertube_writer import push_category, extract_uuid
|
from .peertube_writer import push_category, extract_uuid
|
||||||
from .recon_domains import DOMAIN_CATEGORY_MAP
|
from .recon_domains import DOMAIN_CATEGORY_MAP
|
||||||
domain, status = compute_assignment(file_hash, db, config)
|
domain, status = compute_assignment(file_hash, db, config, qdrant=qdrant)
|
||||||
db.set_domain_assignment(file_hash, domain, status)
|
db.set_domain_assignment(file_hash, domain, status)
|
||||||
if domain and status == 'assigned':
|
if domain and status == 'assigned':
|
||||||
cat_id = DOMAIN_CATEGORY_MAP[domain]
|
cat_id = DOMAIN_CATEGORY_MAP[domain]
|
||||||
|
|
|
||||||
30
recon.py
|
|
@ -865,6 +865,7 @@ def cmd_ingest(args):
|
||||||
|
|
||||||
def cmd_assign_categories(args):
|
def cmd_assign_categories(args):
|
||||||
"""Assign RECON domains to PeerTube videos and push categories."""
|
"""Assign RECON domains to PeerTube videos and push categories."""
|
||||||
|
from qdrant_client import QdrantClient
|
||||||
from lib.domain_assigner import compute_assignment, run_tiebreaker_pass
|
from lib.domain_assigner import compute_assignment, run_tiebreaker_pass
|
||||||
from lib.peertube_writer import push_pending, extract_uuid
|
from lib.peertube_writer import push_pending, extract_uuid
|
||||||
from lib.recon_domains import DOMAIN_CATEGORY_MAP
|
from lib.recon_domains import DOMAIN_CATEGORY_MAP
|
||||||
|
|
@ -876,11 +877,13 @@ def cmd_assign_categories(args):
|
||||||
|
|
||||||
if args.backfill:
|
if args.backfill:
|
||||||
# Pass 1: assign domains to all complete stream docs with no assignment
|
# Pass 1: assign domains to all complete stream docs with no assignment
|
||||||
|
# or that previously got needs_reprocess
|
||||||
conn = db._get_conn()
|
conn = db._get_conn()
|
||||||
q = """SELECT d.hash FROM documents d
|
q = """SELECT d.hash FROM documents d
|
||||||
LEFT JOIN catalogue c ON d.hash = c.hash
|
LEFT JOIN catalogue c ON d.hash = c.hash
|
||||||
WHERE d.status = 'complete'
|
WHERE d.status = 'complete'
|
||||||
AND d.recon_domain IS NULL
|
AND (d.recon_domain IS NULL
|
||||||
|
OR d.recon_domain_status = 'needs_reprocess')
|
||||||
AND c.source = 'stream.echo6.co'
|
AND c.source = 'stream.echo6.co'
|
||||||
ORDER BY d.discovered_at"""
|
ORDER BY d.discovered_at"""
|
||||||
if limit:
|
if limit:
|
||||||
|
|
@ -895,10 +898,17 @@ def cmd_assign_categories(args):
|
||||||
print(f"Backfill: processing {len(hashes)} documents" +
|
print(f"Backfill: processing {len(hashes)} documents" +
|
||||||
(" [DRY RUN]" if dry_run else ""))
|
(" [DRY RUN]" if dry_run else ""))
|
||||||
|
|
||||||
stats = {'assigned': 0, 'tied_pass_1': 0, 'needs_reprocess': 0, 'errors': 0}
|
# Create one Qdrant client for the entire backfill
|
||||||
|
qdrant = QdrantClient(
|
||||||
|
host=config['vector_db']['host'],
|
||||||
|
port=config['vector_db']['port'],
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
stats = {'assigned': 0, 'tied_pass_1': 0, 'no_concepts': 0, 'needs_reprocess': 0, 'errors': 0}
|
||||||
for i, file_hash in enumerate(hashes):
|
for i, file_hash in enumerate(hashes):
|
||||||
try:
|
try:
|
||||||
domain, status = compute_assignment(file_hash, db, config)
|
domain, status = compute_assignment(file_hash, db, config, qdrant=qdrant)
|
||||||
stats[status] = stats.get(status, 0) + 1
|
stats[status] = stats.get(status, 0) + 1
|
||||||
if not dry_run:
|
if not dry_run:
|
||||||
db.set_domain_assignment(file_hash, domain, status)
|
db.set_domain_assignment(file_hash, domain, status)
|
||||||
|
|
@ -946,22 +956,10 @@ def cmd_assign_categories(args):
|
||||||
for item in items:
|
for item in items:
|
||||||
file_hash = item['hash']
|
file_hash = item['hash']
|
||||||
if dry_run:
|
if dry_run:
|
||||||
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
|
print(f" Would reprocess: {file_hash[:12]} — {item.get('filename', '?')}")
|
||||||
has_concepts = os.path.isdir(concepts_dir)
|
|
||||||
concept_count = len(os.listdir(concepts_dir)) if has_concepts else 0
|
|
||||||
detail = f"DELETE {concept_count} concept files" if has_concepts else "no concept dir"
|
|
||||||
print(f" Would reprocess: {file_hash[:12]} — {item.get('filename', '?')} ({detail})")
|
|
||||||
requeued += 1
|
requeued += 1
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# Remove stale concept files
|
|
||||||
import shutil
|
|
||||||
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
|
|
||||||
if os.path.isdir(concepts_dir):
|
|
||||||
logger.info(f" Deleting concept dir: {concepts_dir} "
|
|
||||||
f"({len(os.listdir(concepts_dir))} files, hash={file_hash})")
|
|
||||||
shutil.rmtree(concepts_dir)
|
|
||||||
|
|
||||||
# Reset document status to allow re-processing
|
# Reset document status to allow re-processing
|
||||||
conn = db._get_conn()
|
conn = db._get_conn()
|
||||||
conn.execute(
|
conn.execute(
|
||||||
|
|
|
||||||