Phase 0: Baseline Capture
Captured: 2026-04-14T06:11Z
Backups
| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (SQLite) | CT 130: /tmp/recon.db.phase0.20260414.bak | `69d94a2c21686871c8c6863903710e3f` |
| recon.db (replica) | cortex: /tmp/recon.db.phase0.20260414.bak | `69d94a2c21686871c8c6863903710e3f` |
| config.yaml | CT 130: /tmp/config.yaml.phase0.20260414.bak | `6d70ed572dfb2e704abca3850ae33797` |
DB backup verified: opens cleanly, contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots).
The recon.db MD5 hashes match between the CT 130 backup and the cortex replica.
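The backup checks described above (matching MD5 digests and a database that opens cleanly with the expected tables) can be sketched in Python. This is a minimal sketch, not the procedure that was actually run: the helper names `md5_of` and `list_tables` are hypothetical, and it assumes each backup is readable as a local file.

```python
import hashlib
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(db_path):
    """Open a SQLite file read-only and return its user table names, sorted."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Comparing `md5_of(...)` for the two copies and checking that `list_tables(...)` returns the six expected names would reproduce both checks without writing to the backup.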
Service State
recon.service
Status: inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled: enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit: code=exited, status=0/SUCCESS
recon-watchdog.service
Status: active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled: enabled
Memory: 8.2M (peak 48.7M)
PID: 343
Command: /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
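For repeat captures, the two facts that matter here (is the unit running now, and will it start on boot) could be collected programmatically. The sketch below assumes a systemd host and uses `systemctl show` with the `ActiveState`, `SubState`, and `UnitFileState` properties; the function names are hypothetical and are not part of recon.

```python
import subprocess


def parse_systemctl_show(output):
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    state = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            state[key.strip()] = value.strip()
    return state


def unit_state(unit):
    """Query a unit's activity and enablement (requires systemd)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,UnitFileState"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)


def needs_attention(state):
    """True if the unit is running now or would start again on boot."""
    return (
        state.get("ActiveState") == "active"
        or state.get("UnitFileState") == "enabled"
    )
```

Calling `needs_attention(unit_state("recon-watchdog.service"))` on CT 130 would flag the watchdog on both counts, given the state recorded above.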
Qdrant Metrics
Collection: recon_knowledge_hybrid
| Metric | Value |
|---|---|
| Status | green |
| Optimizer status | ok |
| Points count | 2,320,695 |
| Indexed vectors count | 4,641,386 (~2x points: dense + sparse; 4 fewer than 2 × 2,320,695 = 4,641,390) |
| Segments | 8 |
| Vector size | 1024 (Cosine) |
| Sparse vectors | bge-m3-sparse (IDF modifier) |
| Distinct doc_hash values | 29,519 |
| Sum of point counts across all hashes | 2,320,695 (consistent) |
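The two consistency checks in this table are simple arithmetic and can be written down so they are repeatable after the refactor. The sketch below is illustrative only: the function names are hypothetical, and the per-hash counts are assumed to be available as a plain mapping from `doc_hash` to point count.

```python
def vector_count_delta(points, indexed_vectors, vectors_per_point=2):
    """Return expected-minus-reported indexed vectors for a collection."""
    return points * vectors_per_point - indexed_vectors


def counts_consistent(per_hash_counts, points):
    """Check that per-doc_hash point counts sum to the collection total."""
    return sum(per_hash_counts.values()) == points


# Baseline figures from the table above: 2 vectors (dense + sparse) per point.
delta = vector_count_delta(2_320_695, 4_641_386)
print(delta)  # 4
```

With the baseline figures the indexed vector count is 4 short of exactly twice the point count, so the "2x" relationship holds only approximately.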
SQLite Metrics
Database: /opt/recon/data/recon.db
Row Counts
| Query | Count |
|---|---|
| `SELECT COUNT(*) FROM catalogue` | 29,812 |
| `SELECT COUNT(*) FROM documents` | 29,812 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete'` | 29,809 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL` | 19,148 |
| `SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'` | 19,133 |
| `SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'` | 10,679 |
Non-Complete Document Status Breakdown
| Status | Count |
|---|---|
| skipped | 3 |
All documents are either complete (29,809) or skipped (3). Total: 29,812.
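The row counts and the status breakdown can be recaptured in one pass with Python's `sqlite3` module. This is a sketch under assumptions: the SQL is copied from the table above, while the dictionary labels and function names are hypothetical, and the `GROUP BY` query is one plausible way to produce the breakdown rather than the query that was actually used.

```python
import sqlite3

# Labels are illustrative; the SQL matches the baseline queries.
BASELINE_QUERIES = {
    "catalogue": "SELECT COUNT(*) FROM catalogue",
    "documents": "SELECT COUNT(*) FROM documents",
    "complete": "SELECT COUNT(*) FROM documents WHERE status = 'complete'",
    "complete_unorganized": (
        "SELECT COUNT(*) FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL"
    ),
    "stream_source": (
        "SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'"
    ),
    "library_pdfs": (
        "SELECT COUNT(*) FROM catalogue "
        "WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'"
    ),
}


def baseline_counts(conn):
    """Run every baseline query and return {label: count}."""
    return {
        label: conn.execute(sql).fetchone()[0]
        for label, sql in BASELINE_QUERIES.items()
    }


def status_breakdown(conn):
    """Return {status: count} over all rows in documents."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM documents GROUP BY status"
    ).fetchall()
    return dict(rows)
```

Opening the database read-only, for example `sqlite3.connect("file:/opt/recon/data/recon.db?mode=ro", uri=True)`, keeps the capture from modifying the file being measured.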
Filesystem Metrics
/mnt/library (data node, NFS)
| Metric | Value |
|---|---|
| Total PDFs in library | 15,446 |
| Channel directories (streamecho6, depth ≤2) | 18,987 |
| Transcript directories (streamecho6, depth 2) | 18,855 |
Underscore-prefixed staging directories
| Directory | Size |
|---|---|
| `/mnt/library/_sources` | 439M |
| `/mnt/library/_acquired` | 4.0K |
| `/mnt/library/_unclassified` | 6.3G |
| `/mnt/library/_ingest` | 2.4G |
/opt/recon/data (CT 130 local)
| Path | Size |
|---|---|
| `data/text/` | 4.5G |
| `data/concepts/` | 4.0G |
| `data/` (total) | 8.7G |
| Text subdirectories | 10,955 hash dirs (the raw count included the parent directory, which is excluded here) |
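Directory counts are easy to get off by one, because tools such as `find <dir> -type d` include the starting directory unless told otherwise. A minimal Python sketch that avoids the ambiguity is shown below; the function names are hypothetical, and it assumes hash directories sit directly under the parent and that PDFs are identified by file extension.

```python
from pathlib import Path


def count_subdirs(parent):
    """Count immediate subdirectories of `parent`, excluding `parent` itself."""
    return sum(1 for entry in Path(parent).iterdir() if entry.is_dir())


def count_pdfs(root):
    """Count files ending in .pdf (case-insensitive) anywhere under `root`."""
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

For example, `count_subdirs("/opt/recon/data/text")` would count only the hash directories, and `count_pdfs("/mnt/library")` would walk the NFS mount, which may be slow on a library of this size.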
Anomalies Noted
- recon-watchdog.service is still running. Expected inactive/dead per instructions, but it is active (running) with PID 343. It runs `recon.py pipeline watch`, which polls for pipeline work. This is a read-only watchdog, but it could trigger pipeline operations if it finds work. Consider stopping it before beginning refactor work.
- Both services are `enabled`. They will auto-start on the next reboot of CT 130. If the refactor requires the pipeline to stay down, these should be disabled before any reboot.
- catalogue and documents counts match (29,812 each). This is expected; they should be 1:1.
- Qdrant distinct doc_hash (29,519) < documents complete (29,809). Delta of 290 documents that are marked `complete` in SQLite but have no corresponding vectors in Qdrant. Possible explanations:
  - Documents completed after the last embedding run
  - Documents that failed embedding silently
  - The 3 `skipped` documents would not have vectors either, but they are not counted as `complete`, so they do not explain any of the gap
  - Gap measured from complete documents only: 29,809 - 29,519 = 290 unembedded complete documents
- Text subdirectory count (10,955) vs STATE 2 expected (~278). The survey predicted ~278 STATE 2 transcript dirs remaining after migration. 10,955 suggests a much larger set of text directories still exists; these likely include PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error; text dirs are created for all document types, not just PeerTube transcripts.
- 19,148 complete documents with `organized_at IS NULL`. These documents completed extraction+enrichment+embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents. The organizer thread may have been running behind, or these are PeerTube transcripts that don't go through the organizer path.
- _unclassified (6.3G) and _ingest (2.4G) staging dirs are non-trivial. These contain PDFs that haven't been classified into domain directories yet. Not blocking for the refactor, but worth tracking.
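The Qdrant gap above is derived by subtracting counts, which assumes every doc_hash in Qdrant belongs to a complete document; if that does not hold, the true number of unembedded documents is larger than 290. A set difference over the actual hashes would identify the affected documents directly. The sketch below is illustrative: it assumes both hash lists have already been exported (how the hashes are stored in SQLite is not stated in this report), and the function names are hypothetical.

```python
def unembedded(complete_hashes, qdrant_hashes):
    """Return hashes marked complete in SQLite but absent from Qdrant."""
    return sorted(set(complete_hashes) - set(qdrant_hashes))


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Count-based figures from this report.
gap = 29_809 - 29_519
print(gap)  # 290
```

The same `share` helper reproduces the organizer figure: 19,148 of 29,809 complete documents is roughly 64%.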