refactored-recon/phases/phase-0-baseline.md

# Phase 0: Baseline Capture

**Captured:** 2026-04-14T06:11Z

---

## Backups

| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (SQLite) | CT 130: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| recon.db (replica) | cortex: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| config.yaml | CT 130: `/tmp/config.yaml.phase0.20260414.bak` | `6d70ed572dfb2e704abca3850ae33797` |

DB backup verified: opens cleanly, contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots).

The `recon.db` MD5 hashes match between the CT 130 backup and the cortex copy. `config.yaml` was backed up on CT 130 only.
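
The backup checks described above could be reproduced with a short script along these lines. This is a minimal sketch, not the procedure actually used for this capture: the function names (`md5_of`, `table_names`, `verify_backup`) are hypothetical, and it assumes the backup file is readable from wherever the script runs.

```python
import hashlib
import sqlite3
from pathlib import Path

# Table names taken from the verification note in this document.
EXPECTED_TABLES = {
    "catalogue", "documents", "duplicate_review",
    "file_operations", "intel", "metrics_snapshots",
}

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def table_names(db_path):
    """List user tables in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
    finally:
        conn.close()
    return {name for (name,) in rows}

def verify_backup(db_path, expected_md5):
    """True when the hash matches and exactly the expected tables exist."""
    return (
        md5_of(db_path) == expected_md5
        and table_names(db_path) == EXPECTED_TABLES
    )
```

For the CT 130 backup this would be called as `verify_backup("/tmp/recon.db.phase0.20260414.bak", "69d94a2c21686871c8c6863903710e3f")`. Opening the file with `mode=ro` keeps the check from modifying the backup.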

---

## Service State

### recon.service

```
Status:   inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled:  enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit:     code=exited, status=0/SUCCESS
```

### recon-watchdog.service

```
Status:   active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled:  enabled
Memory:   8.2M (peak 48.7M)
PID:      343
Command:  /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
```
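
The two state summaries above can be re-checked before refactor work starts. The sketch below is one possible way to do it, using the standard `systemctl is-active` and `systemctl is-enabled` verbs; the helper names (`service_states`, `blockers`) are hypothetical and it assumes the script runs on CT 130 itself.

```python
import subprocess

UNITS = ("recon.service", "recon-watchdog.service")

def _systemctl(verb, unit):
    """Run `systemctl <verb> <unit>` and return its one-word answer."""
    # Both verbs exit non-zero for states such as "inactive" or "disabled",
    # so the return code is deliberately not checked.
    result = subprocess.run(
        ["systemctl", verb, unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()

def service_states(units=UNITS, query=_systemctl):
    """Map each unit to its current active and enabled state."""
    return {
        unit: {
            "active": query("is-active", unit),
            "enabled": query("is-enabled", unit),
        }
        for unit in units
    }

def blockers(states):
    """Units that are running now or would start on the next boot."""
    return sorted(
        unit for unit, state in states.items()
        if state["active"] == "active" or state["enabled"] == "enabled"
    )
```

With the states recorded in this baseline, `blockers()` would list both units: `recon-watchdog.service` because it is running, and `recon.service` because it is still enabled.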

---

## Qdrant Metrics

**Collection:** `recon_knowledge_hybrid`

| Metric | Value |
|--------|-------|
| Status | green |
| Optimizer status | ok |
| Points count | 2,320,695 |
| Indexed vectors count | 4,641,386 (≈2x points: dense + sparse; 4 fewer than 2 × 2,320,695 = 4,641,390) |
| Segments | 8 |
| Vector size | 1024 (Cosine) |
| Sparse vectors | bge-m3-sparse (IDF modifier) |
| Distinct doc_hash values | 29,519 |
| Sum of point counts across all hashes | 2,320,695 (consistent) |
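
Most of the collection-level figures in this table are available from Qdrant's REST endpoint `GET /collections/{name}`. The sketch below shows one way to pull and sanity-check them. The base URL is an assumption (this document does not record where Qdrant listens), and `summarize` with its `vectors_per_point=2` default is a hypothetical helper that encodes the dense-plus-sparse expectation.

```python
import json
from urllib.request import urlopen

def fetch_collection_info(base_url, collection):
    """Return the `result` object from GET /collections/<collection>."""
    # base_url is an assumption, e.g. "http://localhost:6333".
    with urlopen(f"{base_url}/collections/{collection}", timeout=10) as resp:
        return json.load(resp)["result"]

def summarize(info, vectors_per_point=2):
    """Extract baseline metrics and compare indexed vectors to expectation."""
    points = info["points_count"]
    indexed = info["indexed_vectors_count"]
    return {
        "status": info["status"],
        "optimizer_status": info["optimizer_status"],
        "points": points,
        "indexed_vectors": indexed,
        "segments": info["segments_count"],
        # Positive when fewer vectors are indexed than points * vectors_per_point.
        "shortfall_vs_expected": points * vectors_per_point - indexed,
    }
```

Applied to the values captured here, `shortfall_vs_expected` comes out as 4 (4,641,390 expected, 4,641,386 indexed). The distinct `doc_hash` count is not part of this endpoint and would need a separate scan over point payloads.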

---

## SQLite Metrics

Database: `/opt/recon/data/recon.db`

### Row Counts

| Query | Count |
|-------|-------|
| `SELECT COUNT(*) FROM catalogue` | 29,812 |
| `SELECT COUNT(*) FROM documents` | 29,812 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete'` | 29,809 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL` | 19,148 |
| `SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'` | 19,133 |
| `SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'` | 10,679 |

### Non-Complete Document Status Breakdown

| Status | Count |
|--------|-------|
| skipped | 3 |

All documents are either `complete` (29,809) or `skipped` (3). Total: 29,812.
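
To make these counts repeatable in later phases, the queries from the table can be bundled into a small script. The following is a sketch under the assumption that the schema matches the queries as written above; the dictionary keys and the `check_invariants` helper are hypothetical names introduced for illustration.

```python
import sqlite3

# Queries copied from the Row Counts table; keys are illustrative labels.
BASELINE_QUERIES = {
    "catalogue": "SELECT COUNT(*) FROM catalogue",
    "documents": "SELECT COUNT(*) FROM documents",
    "complete": "SELECT COUNT(*) FROM documents WHERE status = 'complete'",
    "complete_unorganized": (
        "SELECT COUNT(*) FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL"
    ),
    "stream_echo6": (
        "SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'"
    ),
    "library_pdf": (
        "SELECT COUNT(*) FROM catalogue "
        "WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'"
    ),
}

def collect_counts(conn):
    """Run every baseline query and return {label: count}."""
    return {
        label: conn.execute(sql).fetchone()[0]
        for label, sql in BASELINE_QUERIES.items()
    }

def status_breakdown(conn):
    """Return {status: count} for the documents table."""
    return dict(
        conn.execute("SELECT status, COUNT(*) FROM documents GROUP BY status")
    )

def check_invariants(counts, breakdown):
    """Return a list of human-readable problems (empty when consistent)."""
    problems = []
    if counts["catalogue"] != counts["documents"]:
        problems.append("catalogue and documents row counts differ")
    if sum(breakdown.values()) != counts["documents"]:
        problems.append("status breakdown does not sum to documents total")
    return problems
```

Against the baseline figures, both checks pass: catalogue and documents are each 29,812, and 29,809 + 3 = 29,812. The two source-specific catalogue counts also sum to the total (19,133 + 10,679 = 29,812), although this document does not state that as a required invariant.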

---

## Filesystem Metrics

### /mnt/library (data node, NFS)

| Metric | Value |
|--------|-------|
| Total PDFs in library | 15,446 |
| Channel directories (streamecho6, depth ≤2) | 18,987 |
| Transcript directories (streamecho6, depth 2) | 18,855 |

#### Underscore-prefixed staging directories

| Directory | Size |
|-----------|------|
| `/mnt/library/_sources` | 439M |
| `/mnt/library/_acquired` | 4.0K |
| `/mnt/library/_unclassified` | 6.3G |
| `/mnt/library/_ingest` | 2.4G |

### /opt/recon/data (CT 130 local)

| Path | Size |
|------|------|
| `data/text/` | 4.5G |
| `data/concepts/` | 4.0G |
| `data/` (total) | 8.7G |
| Text subdirectories | 10,955 hash dirs (parent directory excluded from the count) |
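
The exact commands behind these filesystem counts are not recorded here. As a rough equivalent, the sketch below counts PDFs, immediate subdirectories, and underscore-prefixed staging directories with `pathlib`. The function names are hypothetical, and matching `.pdf` case-insensitively is an assumption that may not match how the original figures were produced.

```python
from pathlib import Path

def count_pdfs(root):
    """Count regular files under root whose extension is .pdf (any case)."""
    return sum(
        1 for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

def count_subdirs(root):
    """Count immediate subdirectories; the root itself is not included."""
    return sum(1 for p in Path(root).iterdir() if p.is_dir())

def staging_dirs(root):
    """Names of immediate subdirectories that start with an underscore."""
    return sorted(
        p.name for p in Path(root).iterdir()
        if p.is_dir() and p.name.startswith("_")
    )
```

Because `count_subdirs` iterates over children only, it avoids the off-by-one that arises when a recursive directory listing includes the parent. On an NFS mount the size of `/mnt/library`, the recursive PDF walk may take a while.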

---

## Anomalies Noted

1. **recon-watchdog.service is still running.** The instructions expected it to be inactive (dead), but it is active (running) with PID 343. It runs `recon.py pipeline watch`, which polls for pipeline work. The polling itself is read-only, but the watchdog could trigger pipeline operations if it finds work. Consider stopping it before beginning refactor work.

2. **Both services are `enabled`.** They will auto-start on next reboot of CT 130. If the refactor requires the pipeline to stay down, these should be disabled before any reboot.

3. **catalogue and documents counts match (29,812 each).** This is expected, since the two tables should be 1:1. It is recorded here as a consistency check rather than a problem.

4. **Qdrant distinct doc_hash (29,519) < documents complete (29,809).** The counts differ by 29,809 - 29,519 = 290. Assuming one doc_hash per document, at least 290 documents marked `complete` in SQLite have no corresponding vectors in Qdrant (exactly 290 if every doc_hash in Qdrant belongs to a complete document). Possible explanations:
   - Documents completed after the last embedding run
   - Documents that failed embedding silently
   - The 3 `skipped` documents do not explain any of the gap: they are already excluded from the 29,809 `complete` count

5. **Text subdirectory count (10,955) vs STATE 2 expected (~278).** The survey predicted ~278 STATE 2 transcript dirs remaining after migration. A count of 10,955 suggests a much larger set of text directories still exists, likely including PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error: text dirs are created for all document types, not just PeerTube transcripts. As a rough cross-check, 10,955 minus the 10,679 catalogued library PDFs leaves 276, which is close to the predicted ~278.

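6. **19,148 complete documents with `organized_at IS NULL`.** These documents completed extraction, enrichment, and embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents (19,148 / 29,809). The organizer thread may have been running behind, or these are PeerTube transcripts that don't go through the organizer path. The count is within 15 of the 19,133 catalogue rows with `source = 'stream.echo6.co'`, which would be consistent with the second explanation if that source is the PeerTube instance.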

7. **_unclassified (6.3G) and _ingest (2.4G) staging dirs are non-trivial.** These contain PDFs that haven't been classified into domain directories yet. Not blocking for the refactor, but worth tracking.
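
The count gap in anomaly 4 only gives a net difference. To find out which documents are affected, the two sets of hashes would need to be compared directly. The sketch below assumes the hashes have already been exported (for example, one set from SQLite and one from Qdrant payloads); how to export them is not covered by this document, and `reconcile` and `percent` are hypothetical helper names.

```python
def reconcile(complete_hashes, qdrant_hashes):
    """Compare doc hashes of complete documents against hashes seen in Qdrant."""
    complete = set(complete_hashes)
    embedded = set(qdrant_hashes)
    return {
        # Complete in SQLite but no vectors in Qdrant.
        "missing_vectors": sorted(complete - embedded),
        # Vectors in Qdrant with no matching complete document.
        "orphaned_vectors": sorted(embedded - complete),
    }

def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)

# Baseline arithmetic from this document.
COUNT_GAP = 29_809 - 29_519          # complete documents minus distinct doc_hash
UNORGANIZED_SHARE = percent(19_148, 29_809)
```

The net gap equals the number of missing hashes minus the number of orphaned ones, so 290 is a lower bound on `missing_vectors` and is exact only when `orphaned_vectors` is empty. `UNORGANIZED_SHARE` evaluates to 64.2, matching the ~64% quoted in anomaly 6.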