# Phase 0: Baseline Capture

**Captured:** 2026-04-14T06:11Z

---

## Backups

| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (SQLite) | CT 130: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| recon.db (replica) | cortex: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| config.yaml | CT 130: `/tmp/config.yaml.phase0.20260414.bak` | `6d70ed572dfb2e704abca3850ae33797` |

DB backup verified: it opens cleanly and contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots). The recon.db MD5 hashes match between the CT 130 copy and the cortex replica.

---

## Service State

### recon.service

```
Status:   inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled:  enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit:     code=exited, status=0/SUCCESS
```

### recon-watchdog.service

```
Status:   active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled:  enabled
Memory:   8.2M (peak 48.7M)
PID:      343
Command:  /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
```

---

## Qdrant Metrics

**Collection:** `recon_knowledge_hybrid`

| Metric | Value |
|--------|-------|
| Status | green |
| Optimizer status | ok |
| Points count | 2,320,695 |
| Indexed vectors count | 4,641,386 (≈2x points: dense + sparse) |
| Segments | 8 |
| Vector size | 1024 (Cosine) |
| Sparse vectors | bge-m3-sparse (IDF modifier) |
| Distinct doc_hash values | 29,519 |
| Sum of point counts across all hashes | 2,320,695 (consistent with points count) |

---

## SQLite Metrics

Database: `/opt/recon/data/recon.db`

### Row Counts

| Query | Count |
|-------|-------|
| `SELECT COUNT(*) FROM catalogue` | 29,812 |
| `SELECT COUNT(*) FROM documents` | 29,812 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete'` | 29,809 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL` | 19,148 |
| `SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'` | 19,133 |
| `SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'` | 10,679 |

### Non-Complete Document Status Breakdown

| Status | Count |
|--------|-------|
| skipped | 3 |

All documents are either `complete` (29,809) or `skipped` (3). Total: 29,812.

---

## Filesystem Metrics

### /mnt/library (data node, NFS)

| Metric | Value |
|--------|-------|
| Total PDFs in library | 15,446 |
| Channel directories (streamecho6, depth ≤2) | 18,987 |
| Transcript directories (streamecho6, depth 2) | 18,855 |

#### Underscore-prefixed staging directories

| Directory | Size |
|-----------|------|
| `/mnt/library/_sources` | 439M |
| `/mnt/library/_acquired` | 4.0K |
| `/mnt/library/_unclassified` | 6.3G |
| `/mnt/library/_ingest` | 2.4G |

### /opt/recon/data (CT 130 local)

| Path | Size |
|------|------|
| `data/text/` | 4.5G |
| `data/concepts/` | 4.0G |
| `data/` (total) | 8.7G |
| Text subdirectories | 10,955 hash dirs (parent directory excluded from the count) |

---

## Anomalies Noted

1. **recon-watchdog.service is still running.** It was expected to be inactive/dead per instructions, but it is active (running) with PID 343. It runs `recon.py pipeline watch`, which polls for pipeline work. The watchdog itself is read-only, but it could trigger pipeline operations if it finds work. Consider stopping it before beginning refactor work.

2. **Both services are `enabled`.** They will auto-start on the next reboot of CT 130. If the refactor requires the pipeline to stay down, both should be disabled before any reboot.

3. **catalogue and documents counts match (29,812 each).** This is expected; the two tables should be 1:1.

4. **Qdrant distinct doc_hash (29,519) < documents complete (29,809).** That is a delta of 290 documents marked `complete` in SQLite with no corresponding vectors in Qdrant (29,809 - 29,519 = 290). Possible explanations:
   - Documents completed after the last embedding run
   - Documents that failed embedding silently

   The 3 `skipped` documents also have no vectors, but they are not counted as `complete`, so they explain none of the 290. They account only for the difference between this gap and the gap against all documents (29,812 - 29,519 = 293).

5. **Text subdirectory count (10,955) vs STATE 2 expected (~278).** The survey predicted ~278 STATE 2 transcript dirs remaining after migration. A count of 10,955 suggests a much larger set of text directories still exists. These likely include PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error: text dirs are created for all document types, not just PeerTube transcripts.

6. **19,148 complete documents with `organized_at IS NULL`.** These documents completed extraction, enrichment, and embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents. The organizer thread may have been running behind, or these may be PeerTube transcripts that do not go through the organizer path. The count is close to the 19,133 catalogue rows with `source = 'stream.echo6.co'`, which is consistent with the second explanation but does not confirm it.

7. **`_unclassified` (6.3G) and `_ingest` (2.4G) staging dirs are non-trivial.** They contain PDFs that have not yet been classified into domain directories. Not blocking for the refactor, but worth tracking.
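
The arithmetic behind anomalies 3, 4, and 6 can be re-derived with a short script, which may be handy when this baseline is compared against a later phase. The following is a minimal sketch, not part of the recon codebase: the function names `collect_counts` and `reconcile` and the dictionary keys are hypothetical, the SQL is copied from the Row Counts table, and the Qdrant distinct `doc_hash` count is passed in as a plain number rather than fetched from Qdrant.

```python
import sqlite3

# Queries copied from the Row Counts table; the dictionary keys are
# illustrative labels, not names used by recon itself.
BASELINE_QUERIES = {
    "catalogue_total": "SELECT COUNT(*) FROM catalogue",
    "documents_total": "SELECT COUNT(*) FROM documents",
    "documents_complete":
        "SELECT COUNT(*) FROM documents WHERE status = 'complete'",
    "complete_unorganized":
        "SELECT COUNT(*) FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL",
}


def collect_counts(conn):
    """Run each baseline query and return {label: row count}."""
    return {
        label: conn.execute(sql).fetchone()[0]
        for label, sql in BASELINE_QUERIES.items()
    }


def reconcile(counts, qdrant_distinct_hashes):
    """Derive the cross-checks discussed under Anomalies Noted.

    Assumption: every distinct doc_hash in Qdrant belongs to a document
    marked 'complete' in SQLite. If that does not hold, the real number
    of complete documents without vectors is larger than reported here.
    """
    total = counts["documents_total"]
    complete = counts["documents_complete"]
    return {
        "catalogue_matches_documents": counts["catalogue_total"] == total,
        "non_complete": total - complete,
        "complete_without_vectors": complete - qdrant_distinct_hashes,
        "unorganized_share":
            counts["complete_unorganized"] / complete if complete else 0.0,
    }


# Values captured in this Phase 0 baseline.
PHASE0_COUNTS = {
    "catalogue_total": 29812,
    "documents_total": 29812,
    "documents_complete": 29809,
    "complete_unorganized": 19148,
}
PHASE0_QDRANT_DISTINCT_HASHES = 29519

report = reconcile(PHASE0_COUNTS, PHASE0_QDRANT_DISTINCT_HASHES)
for key, value in report.items():
    print(f"{key}: {value}")
```

To run the same checks against a live database rather than the recorded numbers, pass `collect_counts` a connection opened read-only, for example `sqlite3.connect("file:/opt/recon/data/recon.db?mode=ro", uri=True)`, so the check cannot modify the file while the watchdog is still active.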