Switch domain assignment to Qdrant as source of truth

Replace on-disk concept file reads with Qdrant payload queries for
domain assignment. This unlocks assignment for ~10,120 items that had
missing or legacy-only concept files on disk while Qdrant held the
correct 18-domain taxonomy data.

Changes:
- domain_assigner.py: Replace _count_concept_domains (disk) with
  _count_domains_from_qdrant and _count_domains_from_qdrant_batch
  (Qdrant scroll queries). Add _get_qdrant_client helper. Remove
  pass 3 defensive re-run (Qdrant reads are consistent). Add
  no_concepts terminal status for documents with zero concept vectors.
- embedder.py: Post-embed hook passes the embedder's existing Qdrant
  client to compute_assignment, avoiding a second connection.
- recon.py: Backfill creates one QdrantClient for the batch. The SQL
  filter now also selects existing needs_reprocess items. Dry-run
  reports no_concepts as a separate bucket. --reprocess-missing no
  longer deletes the concept directory (assignment no longer reads
  from disk).
- docs/domain-assignment.md: Algorithm references Qdrant, documents
  no_concepts status, removes pass 3 description.

Dry-run results: 20,453 assigned, 1,392 tied, 298 no_concepts,
0 needs_reprocess, 0 errors (previously 10,416 needs_reprocess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matt 2026-04-28 03:59:06 +00:00
commit 3b37d96c4d
4 changed files with 186 additions and 135 deletions

@ -2,7 +2,13 @@
## Overview ## Overview
RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concepts extracted from its transcript. Assignments are pushed to PeerTube as category metadata via a custom plugin. RECON's domain assignment feature maps each PeerTube video to one of 18 knowledge domains by analyzing the concept vectors stored in Qdrant. Assignments are pushed to PeerTube as category metadata via a custom plugin.
## Data Source
Domain counts are read from the `domain` payload field on concept vectors in Qdrant (`recon_knowledge_hybrid` collection on cortex:6333). Each concept vector has a `domain` string in its payload, set during enrichment and validated at embed time. This provides 100% coverage for all embedded documents with zero legacy domain residue.
Previously, domain counts were read from on-disk concept JSON files (`data/concepts/{hash}/window_*.json`). This was replaced with Qdrant queries on 2026-04-28 because ~10,000 items had missing or legacy-only concept files on disk while Qdrant had the correct data.
## Algorithm ## Algorithm
@ -10,9 +16,9 @@ RECON's domain assignment feature maps each PeerTube video to one of 18 knowledg
Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via `--backfill`. Runs automatically via post-embed hook when a video completes the pipeline, or in bulk via `--backfill`.
1. Read all `data/concepts/{hash}/window_*.json` files 1. Query Qdrant for all points with `doc_hash` matching the document
2. Count domain occurrences across all concepts, filtering to `VALID_DOMAINS` only (skips legacy domains) 2. Count `domain` payload occurrences, filtering to `VALID_DOMAINS` only
3. If no valid concepts → `needs_reprocess` 3. If zero concept vectors → `no_concepts` (terminal)
4. If single top domain → `assigned` 4. If single top domain → `assigned`
5. If tied → `tied_pass_1` (deferred to tiebreaker) 5. If tied → `tied_pass_1` (deferred to tiebreaker)
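
The pass 1 decision in the steps above can be sketched in isolation from Qdrant. This is a minimal illustration, not the code in `domain_assigner.py`: `classify_pass_1` is a hypothetical name, the three-entry `VALID_DOMAINS` set is a placeholder for the real 18-domain taxonomy, and the input is assumed to be the list of `domain` payload strings already fetched for one document.

```python
from collections import Counter

# Placeholder taxonomy for illustration; the real VALID_DOMAINS has 18 entries.
VALID_DOMAINS = {"medical", "navigation", "shelter"}


def classify_pass_1(domain_values):
    """Return (domain, status) from one document's `domain` payload values."""
    # Count only recognised domains; anything else is ignored.
    counts = Counter(d for d in domain_values if d in VALID_DOMAINS)
    if not counts:
        return (None, "no_concepts")
    top = counts.most_common(2)
    # A single domain, or a strict lead over the runner-up, is a clear winner.
    if len(top) == 1 or top[0][1] > top[1][1]:
        return (top[0][0], "assigned")
    return (None, "tied_pass_1")
```

As in `compute_assignment`, an empty count after filtering maps to `no_concepts`, so the sketch treats a document with only unrecognised domains the same as one with no vectors at all.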
@ -22,20 +28,13 @@ Runs via `assign-categories --tiebreaker-pass`.
For each `tied_pass_1` document: For each `tied_pass_1` document:
1. Identify the tied domains 1. Identify the tied domains from Qdrant
2. Look up the document's channel (`catalogue.category`) 2. Look up the document's channel (`catalogue.category`)
3. **Mega-channel rule:** If channel has >500 videos, skip tiebreaking → `tied_manual` 3. **Mega-channel rule:** If channel has >500 videos, skip tiebreaking → `tied_manual`
4. Read concept files for all other videos in the same channel 4. Query Qdrant for domain counts across all other videos in the same channel (single batch query with `MatchAny` filter)
5. Among the tied domains only, pick the one with the highest channel-wide concept count 5. Among the tied domains only, pick the one with the highest channel-wide concept count
6. If resolved → `tied_pass_2` 6. If resolved → `tied_pass_2`
7. If still tied → proceed to pass 3 7. If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
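
Steps 5 to 7 of the new tiebreaker can be sketched as a pure function over counts that have already been fetched. This is an illustrative sketch under assumptions: `resolve_tie` is a hypothetical name, `channel_counts` stands in for the aggregated channel-wide domain counts, and the mega-channel check from step 3 is assumed to have happened before the call.

```python
from collections import Counter


def resolve_tie(tied_domains, channel_counts):
    """Pick among tied domains using channel-wide counts.

    Returns (domain, status). Only the tied domains compete; other
    domains in channel_counts are ignored.
    """
    best_count = max(channel_counts.get(d, 0) for d in tied_domains)
    leaders = [d for d in tied_domains if channel_counts.get(d, 0) == best_count]
    if len(leaders) == 1:
        return (leaders[0], "tied_pass_2")
    # Still tied: alphabetical fallback, flagged for human review.
    return (sorted(tied_domains)[0], "tied_manual")
```

For example, with `tied_domains = ["medical", "shelter"]` and a channel whose other videos lean toward `medical`, the function returns `("medical", "tied_pass_2")`. If the channel counts are equal or empty, it returns the alphabetically first tied domain with `tied_manual`.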
### Pass 3: Defensive Re-Run
If pass 2 does not resolve the tie, re-read the same channel concept files and re-run identical counting logic. This catches concept-file changes that occurred mid-run (e.g. concurrent enrichment writing new windows during the batch). In steady state, pass 3 produces the same result as pass 2, but under concurrent writes it can resolve a tie that pass 2 missed.
- If resolved → `tied_pass_2` (same status — the column tracks "channel scan resolved it")
- If still tied → `tied_manual` (alphabetical fallback assigned, flagged for review)
### Mega-Channel Rule ### Mega-Channel Rule
@ -43,14 +42,15 @@ Channels with >500 videos (like the "Transcript" catch-all with ~9,200 videos) a
## Status Values ## Status Values
| Status | Meaning | Next Action | | Status | Meaning | Terminal? | Next Action |
|--------|---------|-------------| |--------|---------|-----------|-------------|
| `assigned` | Clear winner from pass 1 | Push to PeerTube | | `assigned` | Clear winner from pass 1 | No | Push to PeerTube |
| `tied_pass_1` | Concept tie, awaiting tiebreaker | Run `--tiebreaker-pass` | | `tied_pass_1` | Concept tie, awaiting tiebreaker | No | Run `--tiebreaker-pass` |
| `tied_pass_2` | Resolved by channel tiebreaker | Push to PeerTube | | `tied_pass_2` | Resolved by channel tiebreaker | No | Push to PeerTube |
| `tied_manual` | Needs human review | Review at `/peertube/review` | | `tied_manual` | Needs human review | No | Review at `/peertube/review` |
| `needs_reprocess` | Missing concepts or only legacy domains | Run `--reprocess-missing` | | `no_concepts` | Zero concept vectors in Qdrant | **Yes** | None — typically non-topical content (vlogs, giveaways, announcements) |
| `manual_assigned` | Human override from dashboard | Already pushed | | `needs_reprocess` | Transient failure (Qdrant error) | No | Run `--reprocess-missing` |
| `manual_assigned` | Human override from dashboard | No | Already pushed |
**"Categorized" filter** = `{'assigned', 'tied_pass_2', 'manual_assigned'}` **"Categorized" filter** = `{'assigned', 'tied_pass_2', 'manual_assigned'}`
@ -72,7 +72,7 @@ python3 recon.py assign-categories --tiebreaker-pass
# Push all assigned-but-unpushed categories to PeerTube API # Push all assigned-but-unpushed categories to PeerTube API
python3 recon.py assign-categories --push-pending python3 recon.py assign-categories --push-pending
# Re-queue items with missing/legacy concepts # Re-queue items with transient failures for full re-processing
python3 recon.py assign-categories --reprocess-missing python3 recon.py assign-categories --reprocess-missing
# Limit processing count # Limit processing count
@ -87,25 +87,26 @@ The review UI at `recon.echo6.co/peertube/review` shows only `tied_manual` items
- Dropdown to select the correct domain - Dropdown to select the correct domain
- Assign button (pushes to PeerTube immediately) - Assign button (pushes to PeerTube immediately)
Items with `needs_reprocess` status do NOT appear in the review UI — they are handled exclusively via the CLI `--reprocess-missing` command. Items with `no_concepts` or `needs_reprocess` status do NOT appear in the review UI.
## Pipeline Integration ## Pipeline Integration
New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in `embedder.py`: New videos ingested via the PeerTube collector are automatically assigned a domain when they complete the embed stage. The post-embed hook in `embedder.py`:
1. Runs `compute_assignment()` (pass 1 only) 1. Runs `compute_assignment()` (pass 1 only), reusing the embedder's existing Qdrant client
2. If clear winner: pushes category to PeerTube immediately 2. If clear winner: pushes category to PeerTube immediately
3. If tied: marks as `tied_pass_1` for the next tiebreaker batch run 3. If tied: marks as `tied_pass_1` for the next tiebreaker batch run
4. On error: logs warning and continues — does not block the pipeline 4. If no concepts: marks as `no_concepts` (terminal)
5. On Qdrant error: logs warning and continues — does not block the pipeline
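
The hook's control flow can be sketched with its collaborators passed in as plain callables. This is a hedged illustration rather than the code in `embedder.py`: `post_embed_hook` and its parameters `compute`, `record`, `push`, and `warn` are hypothetical stand-ins for `compute_assignment`, `db.set_domain_assignment`, the PeerTube push, and the logger.

```python
def post_embed_hook(file_hash, compute, record, push, warn):
    """Assign a domain after embedding without ever blocking the pipeline.

    All four callables are hypothetical stand-ins supplied by the caller.
    Returns the recorded status, or None if the hook itself failed.
    """
    try:
        domain, status = compute(file_hash)
        record(file_hash, domain, status)
        # Only a clear pass 1 winner is pushed immediately.
        if domain and status == "assigned":
            push(file_hash, domain)
        return status
    except Exception as exc:
        # Log and carry on; the embed stage has already succeeded.
        warn(f"domain assignment failed for {file_hash[:12]}: {exc}")
        return None
```

Passing the collaborators in makes the flow easy to exercise with stubs. Statuses such as `tied_pass_1` and `no_concepts` are recorded but not pushed.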
## Source Files ## Source Files
| File | Purpose | | File | Purpose |
|------|---------| |------|---------|
| `lib/recon_domains.py` | Domain↔Category ID mapping, VALID_DOMAINS | | `lib/recon_domains.py` | Domain↔Category ID mapping, VALID_DOMAINS |
| `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` | | `lib/domain_assigner.py` | `compute_assignment()` + `run_tiebreaker_pass()` + Qdrant helpers |
| `lib/peertube_writer.py` | OAuth2 client, `push_category()`, `push_pending()` | | `lib/peertube_writer.py` | OAuth2 client, `push_category()`, `push_pending()` |
| `lib/embedder.py` | Post-embed hook | | `lib/embedder.py` | Post-embed hook (passes qdrant client) |
| `lib/status.py` | DB columns + helper methods | | `lib/status.py` | DB columns + helper methods |
| `lib/api.py` | Dashboard review routes | | `lib/api.py` | Dashboard review routes |
| `recon.py` | CLI `assign-categories` command | | `recon.py` | CLI `assign-categories` command |

@ -1,24 +1,30 @@
""" """
RECON Domain Assigner RECON Domain Assigner
Computes per-video domain assignments from concept extraction results. Computes per-video domain assignments from Qdrant vector payloads.
Two functions, two execution modes: Two functions, two execution modes:
compute_assignment() pass 1, inline from post-embed hook compute_assignment() pass 1, inline from post-embed hook
run_tiebreaker_pass() batch, resolves ties via channel concept scan run_tiebreaker_pass() batch, resolves ties via channel concept scan
Data source: Qdrant `domain` payload field on concept vectors.
Previously read on-disk concept JSON files; migrated to Qdrant as
single source of truth (2026-04-28).
Status values written to documents.recon_domain_status: Status values written to documents.recon_domain_status:
assigned clear winner from pass 1 concept count assigned clear winner from pass 1 concept count
tied_pass_1 concept tie, awaiting channel tiebreaker tied_pass_1 concept tie, awaiting channel tiebreaker
tied_pass_2 resolved by channel tiebreaker tied_pass_2 resolved by channel tiebreaker
tied_manual needs human review (dashboard) tied_manual needs human review (dashboard)
needs_reprocess missing concepts or only legacy domains no_concepts terminal, zero concept vectors in Qdrant
needs_reprocess transient failure (Qdrant error, etc.)
manual_assigned human override from dashboard manual_assigned human override from dashboard
""" """
import json
import os
from collections import Counter from collections import Counter
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, MatchAny
from .recon_domains import VALID_DOMAINS, DOMAIN_CATEGORY_MAP from .recon_domains import VALID_DOMAINS, DOMAIN_CATEGORY_MAP
from .utils import setup_logging from .utils import setup_logging
@ -28,40 +34,51 @@ logger = setup_logging('recon.domain_assigner')
MEGA_CHANNEL_THRESHOLD = 500 MEGA_CHANNEL_THRESHOLD = 500
def _count_concept_domains(concepts_dir, file_hash): def _get_qdrant_client(config):
"""Read concept files and count valid domain occurrences. """Create a QdrantClient from RECON config.
Callers should create one client and pass it through rather than
calling this repeatedly.
"""
logger.debug("Creating new QdrantClient (caller did not pass one)")
return QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
def _count_domains_from_qdrant(qdrant, collection, doc_hash):
"""Count valid domain occurrences for a single document from Qdrant.
Scrolls all points matching doc_hash and counts domain values.
Args: Args:
concepts_dir: Base concepts directory (e.g. /opt/recon/data/concepts) qdrant: QdrantClient instance
file_hash: Document hash collection: Qdrant collection name
doc_hash: Document hash to query
Returns: Returns:
Counter of {domain_name: count} for valid domains only, Counter of {domain_name: count} for valid domains.
or None if no concept directory exists. Empty Counter if no points found (never None).
""" """
doc_concepts_dir = os.path.join(concepts_dir, file_hash)
if not os.path.isdir(doc_concepts_dir):
return None
domain_counter = Counter() domain_counter = Counter()
offset = None
for fname in os.listdir(doc_concepts_dir): while True:
if not fname.startswith('window_') or not fname.endswith('.json'): results, next_offset = qdrant.scroll(
continue collection_name=collection,
fpath = os.path.join(doc_concepts_dir, fname) scroll_filter=Filter(must=[
try: FieldCondition(key="doc_hash", match=MatchValue(value=doc_hash))
with open(fpath, 'r') as f: ]),
concepts = json.load(f) with_payload=["domain"],
except (json.JSONDecodeError, OSError): with_vectors=False,
continue limit=200,
offset=offset,
)
if not isinstance(concepts, list): for point in results:
continue dom = point.payload.get('domain')
for concept in concepts:
if not isinstance(concept, dict):
continue
dom = concept.get('domain')
if isinstance(dom, str) and dom in VALID_DOMAINS: if isinstance(dom, str) and dom in VALID_DOMAINS:
domain_counter[dom] += 1 domain_counter[dom] += 1
elif isinstance(dom, list): elif isinstance(dom, list):
@ -69,30 +86,95 @@ def _count_concept_domains(concepts_dir, file_hash):
if isinstance(d, str) and d in VALID_DOMAINS: if isinstance(d, str) and d in VALID_DOMAINS:
domain_counter[d] += 1 domain_counter[d] += 1
if next_offset is None:
break
offset = next_offset
return domain_counter return domain_counter
def compute_assignment(file_hash, db, config): def _count_domains_from_qdrant_batch(qdrant, collection, doc_hashes):
"""Count valid domain occurrences across multiple documents from Qdrant.
Single scroll with MatchAny filter, with offset pagination for large
result sets.
Args:
qdrant: QdrantClient instance
collection: Qdrant collection name
doc_hashes: List of document hashes to query
Returns:
Counter of {domain_name: count} aggregated across all matching points.
"""
if not doc_hashes:
return Counter()
domain_counter = Counter()
offset = None
while True:
results, next_offset = qdrant.scroll(
collection_name=collection,
scroll_filter=Filter(must=[
FieldCondition(key="doc_hash", match=MatchAny(any=doc_hashes))
]),
with_payload=["domain"],
with_vectors=False,
limit=10000,
offset=offset,
)
for point in results:
dom = point.payload.get('domain')
if isinstance(dom, str) and dom in VALID_DOMAINS:
domain_counter[dom] += 1
elif isinstance(dom, list):
for d in dom:
if isinstance(d, str) and d in VALID_DOMAINS:
domain_counter[d] += 1
if next_offset is None:
break
offset = next_offset
return domain_counter
def compute_assignment(file_hash, db, config, qdrant=None):
"""Compute domain assignment for a single document (pass 1). """Compute domain assignment for a single document (pass 1).
Counts domain occurrences across all concepts. If a single domain Counts domain occurrences across all concept vectors in Qdrant.
wins, assigns it. If tied, defers to batch tiebreaker. If a single domain wins, assigns it. If tied, defers to batch
tiebreaker.
Args: Args:
file_hash: Document hash file_hash: Document hash
db: StatusDB instance db: StatusDB instance
config: RECON config dict config: RECON config dict
qdrant: Optional QdrantClient (created if not provided)
Returns: Returns:
(domain, status) tuple where domain is a string or None, (domain, status) tuple where domain is a string or None,
and status is one of: 'assigned', 'tied_pass_1', 'needs_reprocess' and status is one of: 'assigned', 'tied_pass_1', 'no_concepts',
'needs_reprocess'
""" """
concepts_dir = config['paths']['concepts'] owns_client = False
domain_counter = _count_concept_domains(concepts_dir, file_hash) if qdrant is None:
qdrant = _get_qdrant_client(config)
owns_client = True
if domain_counter is None or len(domain_counter) == 0: collection = config['vector_db']['collection']
try:
domain_counter = _count_domains_from_qdrant(qdrant, collection, file_hash)
except Exception as e:
logger.warning(f"Qdrant query failed for {file_hash[:12]}: {e}")
return (None, 'needs_reprocess') return (None, 'needs_reprocess')
if len(domain_counter) == 0:
return (None, 'no_concepts')
top = domain_counter.most_common(2) top = domain_counter.most_common(2)
top_domain = top[0][0] top_domain = top[0][0]
top_count = top[0][1] top_count = top[0][1]
@ -104,9 +186,9 @@ def compute_assignment(file_hash, db, config):
return (None, 'tied_pass_1') return (None, 'tied_pass_1')
def _get_tied_domains(concepts_dir, file_hash): def _get_tied_domains(qdrant, collection, file_hash):
"""Get the set of domains tied for first place in a document's concepts.""" """Get the set of domains tied for first place in a document's concepts."""
domain_counter = _count_concept_domains(concepts_dir, file_hash) domain_counter = _count_domains_from_qdrant(qdrant, collection, file_hash)
if not domain_counter: if not domain_counter:
return [] return []
@ -150,32 +232,32 @@ def _channel_video_count(db, channel_name):
return row['cnt'] if row else 0 return row['cnt'] if row else 0
def run_tiebreaker_pass(db, config): def run_tiebreaker_pass(db, config, qdrant=None):
"""Resolve tied domain assignments using channel-level concept analysis. """Resolve tied domain assignments using channel-level Qdrant analysis.
Processes all documents where recon_domain_status = 'tied_pass_1'. Processes all documents where recon_domain_status = 'tied_pass_1'.
Pass 2: For each tied document, reads concept files from all other For each tied document, queries Qdrant for domain counts from all
videos in the same channel and picks the tied domain with the highest other videos in the same channel and picks the tied domain with the
channel-wide count. highest channel-wide count.
Pass 3 (defensive re-run): Re-reads the same channel concept files a Mega-channels (>500 videos) skip tiebreaking and go straight to
second time with identical logic. This catches concept-file changes
that occurred mid-run (e.g. concurrent enrichment writing new windows).
In steady state pass 3 produces the same result as pass 2, but under
concurrent writes it can resolve a tie that pass 2 missed.
Mega-channels (>500 videos) skip both passes and go straight to
'tied_manual' for dashboard review. 'tied_manual' for dashboard review.
Args: Args:
db: StatusDB instance db: StatusDB instance
config: RECON config dict config: RECON config dict
qdrant: Optional QdrantClient (created if not provided)
Returns: Returns:
Dict with counts: resolved, manual, skipped, errors Dict with counts: resolved, manual, skipped, errors
""" """
concepts_dir = config['paths']['concepts'] owns_client = False
if qdrant is None:
qdrant = _get_qdrant_client(config)
owns_client = True
collection = config['vector_db']['collection']
tied_items = db.get_items_by_domain_status('tied_pass_1') tied_items = db.get_items_by_domain_status('tied_pass_1')
stats = {'resolved': 0, 'manual': 0, 'skipped': 0, 'errors': 0, 'total': len(tied_items)} stats = {'resolved': 0, 'manual': 0, 'skipped': 0, 'errors': 0, 'total': len(tied_items)}
@ -189,9 +271,9 @@ def run_tiebreaker_pass(db, config):
channel = item.get('category', '') channel = item.get('category', '')
try: try:
tied_domains = _get_tied_domains(concepts_dir, file_hash) tied_domains = _get_tied_domains(qdrant, collection, file_hash)
if not tied_domains: if not tied_domains:
db.set_domain_assignment(file_hash, None, 'needs_reprocess') db.set_domain_assignment(file_hash, None, 'no_concepts')
stats['skipped'] += 1 stats['skipped'] += 1
continue continue
@ -215,12 +297,9 @@ def run_tiebreaker_pass(db, config):
# Channel tiebreaker: count domains across all other videos in channel # Channel tiebreaker: count domains across all other videos in channel
other_hashes = _channel_video_hashes(db, channel, exclude_hash=file_hash) other_hashes = _channel_video_hashes(db, channel, exclude_hash=file_hash)
channel_domain_counts = Counter() channel_domain_counts = _count_domains_from_qdrant_batch(
qdrant, collection, other_hashes
for other_hash in other_hashes: )
other_counts = _count_concept_domains(concepts_dir, other_hash)
if other_counts:
channel_domain_counts.update(other_counts)
# Among tied domains only, pick highest channel-wide count # Among tied domains only, pick highest channel-wide count
best_domain = None best_domain = None
@ -231,48 +310,21 @@ def run_tiebreaker_pass(db, config):
best_count = c best_count = c
best_domain = dom best_domain = dom
# Pass 2: check if channel tiebreaker resolved it # Check if channel tiebreaker resolved it
tied_at_channel = [d for d in tied_domains tied_at_channel = [d for d in tied_domains
if channel_domain_counts.get(d, 0) == best_count] if channel_domain_counts.get(d, 0) == best_count]
if len(tied_at_channel) == 1: if len(tied_at_channel) == 1:
db.set_domain_assignment(file_hash, best_domain, 'tied_pass_2') db.set_domain_assignment(file_hash, best_domain, 'tied_pass_2')
stats['resolved'] += 1 stats['resolved'] += 1
logger.debug(f" {file_hash[:12]}: resolved → {best_domain} (pass 2 channel tiebreaker)") logger.debug(f" {file_hash[:12]}: resolved → {best_domain} (channel tiebreaker)")
continue continue
# Pass 3: defensive re-run — re-count channel concepts to catch # Still tied after channel scan — mark for manual review
# concept-file changes that occurred mid-run. Identical logic to
# pass 2; resolves races where files were written between the
# two reads.
channel_domain_counts_p3 = Counter()
for other_hash in other_hashes:
other_counts = _count_concept_domains(concepts_dir, other_hash)
if other_counts:
channel_domain_counts_p3.update(other_counts)
best_domain_p3 = None
best_count_p3 = -1
for dom in tied_domains:
c = channel_domain_counts_p3.get(dom, 0)
if c > best_count_p3:
best_count_p3 = c
best_domain_p3 = dom
tied_at_p3 = [d for d in tied_domains
if channel_domain_counts_p3.get(d, 0) == best_count_p3]
if len(tied_at_p3) == 1:
db.set_domain_assignment(file_hash, best_domain_p3, 'tied_pass_2')
stats['resolved'] += 1
logger.debug(f" {file_hash[:12]}: resolved → {best_domain_p3} (pass 3 defensive re-run)")
continue
# Still tied after pass 3 — mark for manual review
fallback = sorted(tied_domains)[0] fallback = sorted(tied_domains)[0]
db.set_domain_assignment(file_hash, fallback, 'tied_manual') db.set_domain_assignment(file_hash, fallback, 'tied_manual')
stats['manual'] += 1 stats['manual'] += 1
logger.debug(f" {file_hash[:12]}: still tied after pass 3, → tied_manual") logger.debug(f" {file_hash[:12]}: still tied after channel scan, → tied_manual")
except Exception as e: except Exception as e:
logger.warning(f" Tiebreaker error for {file_hash[:12]}: {e}") logger.warning(f" Tiebreaker error for {file_hash[:12]}: {e}")

@ -411,7 +411,7 @@ def embed_single(file_hash, db, config):
from .domain_assigner import compute_assignment from .domain_assigner import compute_assignment
from .peertube_writer import push_category, extract_uuid from .peertube_writer import push_category, extract_uuid
from .recon_domains import DOMAIN_CATEGORY_MAP from .recon_domains import DOMAIN_CATEGORY_MAP
domain, status = compute_assignment(file_hash, db, config) domain, status = compute_assignment(file_hash, db, config, qdrant=qdrant)
db.set_domain_assignment(file_hash, domain, status) db.set_domain_assignment(file_hash, domain, status)
if domain and status == 'assigned': if domain and status == 'assigned':
cat_id = DOMAIN_CATEGORY_MAP[domain] cat_id = DOMAIN_CATEGORY_MAP[domain]

@ -865,6 +865,7 @@ def cmd_ingest(args):
def cmd_assign_categories(args): def cmd_assign_categories(args):
"""Assign RECON domains to PeerTube videos and push categories.""" """Assign RECON domains to PeerTube videos and push categories."""
from qdrant_client import QdrantClient
from lib.domain_assigner import compute_assignment, run_tiebreaker_pass from lib.domain_assigner import compute_assignment, run_tiebreaker_pass
from lib.peertube_writer import push_pending, extract_uuid from lib.peertube_writer import push_pending, extract_uuid
from lib.recon_domains import DOMAIN_CATEGORY_MAP from lib.recon_domains import DOMAIN_CATEGORY_MAP
@ -876,11 +877,13 @@ def cmd_assign_categories(args):
if args.backfill: if args.backfill:
# Pass 1: assign domains to all complete stream docs with no assignment # Pass 1: assign domains to all complete stream docs with no assignment
# or that previously got needs_reprocess
conn = db._get_conn() conn = db._get_conn()
q = """SELECT d.hash FROM documents d q = """SELECT d.hash FROM documents d
LEFT JOIN catalogue c ON d.hash = c.hash LEFT JOIN catalogue c ON d.hash = c.hash
WHERE d.status = 'complete' WHERE d.status = 'complete'
AND d.recon_domain IS NULL AND (d.recon_domain IS NULL
OR d.recon_domain_status = 'needs_reprocess')
AND c.source = 'stream.echo6.co' AND c.source = 'stream.echo6.co'
ORDER BY d.discovered_at""" ORDER BY d.discovered_at"""
if limit: if limit:
@ -895,10 +898,17 @@ def cmd_assign_categories(args):
print(f"Backfill: processing {len(hashes)} documents" + print(f"Backfill: processing {len(hashes)} documents" +
(" [DRY RUN]" if dry_run else "")) (" [DRY RUN]" if dry_run else ""))
stats = {'assigned': 0, 'tied_pass_1': 0, 'needs_reprocess': 0, 'errors': 0} # Create one Qdrant client for the entire backfill
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
stats = {'assigned': 0, 'tied_pass_1': 0, 'no_concepts': 0, 'needs_reprocess': 0, 'errors': 0}
for i, file_hash in enumerate(hashes): for i, file_hash in enumerate(hashes):
try: try:
domain, status = compute_assignment(file_hash, db, config) domain, status = compute_assignment(file_hash, db, config, qdrant=qdrant)
stats[status] = stats.get(status, 0) + 1 stats[status] = stats.get(status, 0) + 1
if not dry_run: if not dry_run:
db.set_domain_assignment(file_hash, domain, status) db.set_domain_assignment(file_hash, domain, status)
@ -946,22 +956,10 @@ def cmd_assign_categories(args):
for item in items: for item in items:
file_hash = item['hash'] file_hash = item['hash']
if dry_run: if dry_run:
concepts_dir = os.path.join(config['paths']['concepts'], file_hash) print(f" Would reprocess: {file_hash[:12]}{item.get('filename', '?')}")
has_concepts = os.path.isdir(concepts_dir)
concept_count = len(os.listdir(concepts_dir)) if has_concepts else 0
detail = f"DELETE {concept_count} concept files" if has_concepts else "no concept dir"
print(f" Would reprocess: {file_hash[:12]}{item.get('filename', '?')} ({detail})")
requeued += 1 requeued += 1
continue continue
# Remove stale concept files
import shutil
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
if os.path.isdir(concepts_dir):
logger.info(f" Deleting concept dir: {concepts_dir} "
f"({len(os.listdir(concepts_dir))} files, hash={file_hash})")
shutil.rmtree(concepts_dir)
# Reset document status to allow re-processing # Reset document status to allow re-processing
conn = db._get_conn() conn = db._get_conn()
conn.execute( conn.execute(