refactored-recon/phases/phase-0-baseline.md
Matt 878bc2744a Phase 0: baseline capture
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 06:14:35 +00:00


Phase 0: Baseline Capture

Captured: 2026-04-14T06:11Z


Backups

| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (SQLite) | CT 130: /tmp/recon.db.phase0.20260414.bak | 69d94a2c21686871c8c6863903710e3f |
| recon.db (replica) | cortex: /tmp/recon.db.phase0.20260414.bak | 69d94a2c21686871c8c6863903710e3f |
| config.yaml | CT 130: /tmp/config.yaml.phase0.20260414.bak | 6d70ed572dfb2e704abca3850ae33797 |

DB backup verified: opens cleanly, contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots).

The recon.db MD5 hashes match between the CT 130 backup and the cortex replica.
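The verification described above (matching MD5 digests, a backup that opens cleanly, and the expected six tables) could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this capture: the helper names `md5_of_file` and `list_tables` are hypothetical, and only the table names are taken from this report.

```python
import hashlib
import sqlite3
from pathlib import Path

# Table names as listed in this report.
EXPECTED_TABLES = [
    "catalogue", "documents", "duplicate_review",
    "file_operations", "intel", "metrics_snapshots",
]


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(db_path: Path) -> list[str]:
    """Open a SQLite file read-only and return its user table names, sorted."""
    # mode=ro ensures the backup is never created or modified by this check.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    # Skip SQLite's internal bookkeeping tables (e.g. sqlite_sequence).
    return [name for (name,) in rows if not name.startswith("sqlite_")]
```

Comparing `md5_of_file(...)` on both copies and checking `list_tables(...) == EXPECTED_TABLES` would reproduce the two statements above, assuming both backup paths are reachable from the machine running the script.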


Service State

recon.service

Status:   inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled:  enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit:     code=exited, status=0/SUCCESS

recon-watchdog.service

Status:   active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled:  enabled
Memory:   8.2M (peak 48.7M)
PID:      343
Command:  /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
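A service-state snapshot like the one above could be captured programmatically. The sketch below assumes `systemctl` is on the PATH of CT 130 and uses only its `is-active` and `is-enabled` subcommands; the function names and the injectable `run` parameter are illustrative, not part of recon.

```python
import subprocess
from typing import Callable

UNITS = ("recon.service", "recon-watchdog.service")


def _systemctl(*args: str) -> str:
    """Run systemctl and return its stripped stdout.

    is-active and is-enabled exit non-zero for inactive or disabled units,
    so a non-zero exit status is not treated as an error here.
    """
    result = subprocess.run(
        ["systemctl", *args], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


def unit_state(unit: str, run: Callable[..., str] = _systemctl) -> dict[str, str]:
    """Return the active and enabled state of one unit."""
    return {
        "unit": unit,
        "active": run("is-active", unit),
        "enabled": run("is-enabled", unit),
    }


def needs_attention(state: dict[str, str]) -> bool:
    """Flag units that are running now or would auto-start on boot."""
    return state["active"] == "active" or state["enabled"] == "enabled"
```

Calling `[unit_state(u) for u in UNITS]` on the host would give one record per service. With the values recorded above, both units would be flagged: recon.service because it is enabled, and recon-watchdog.service because it is both active and enabled.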

Qdrant Metrics

Collection: recon_knowledge_hybrid

| Metric | Value |
|---|---|
| Status | green |
| Optimizer status | ok |
| Points count | 2,320,695 |
| Indexed vectors count | 4,641,386 (about 2x points, dense + sparse; 4 fewer than 2 × 2,320,695 = 4,641,390) |
| Segments | 8 |
| Vector size | 1024 (Cosine) |
| Sparse vectors | bge-m3-sparse (IDF modifier) |
| Distinct doc_hash values | 29,519 |
| Sum of point counts across all hashes | 2,320,695 (consistent) |
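The two consistency checks implied by these metrics are plain arithmetic: the per-hash point counts should sum to the collection's point count, and the indexed vector count can be compared against two vectors per point. The sketch below works on values that have already been fetched; how they were collected from Qdrant for this report is not stated, so `points_per_hash` simply assumes one `doc_hash` payload value per point, and both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def points_per_hash(doc_hashes: Iterable[str]) -> Counter:
    """Count points per doc_hash, given one doc_hash value per point."""
    return Counter(doc_hashes)


def check_collection(
    points_count: int,
    indexed_vectors_count: int,
    per_hash: Counter,
    vectors_per_point: int = 2,  # dense + sparse, as in this collection
) -> dict:
    """Compare collection-level counts against per-hash counts."""
    return {
        "distinct_hashes": len(per_hash),
        "sum_matches_points": sum(per_hash.values()) == points_count,
        # Positive when fewer vectors are indexed than points * vectors_per_point.
        "indexed_shortfall": vectors_per_point * points_count - indexed_vectors_count,
    }
```

With the figures recorded here, `indexed_shortfall` comes out as 2 × 2,320,695 - 4,641,386 = 4, so the indexed vector count is very close to, but not exactly, twice the point count.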

SQLite Metrics

Database: /opt/recon/data/recon.db

Row Counts

| Query | Count |
|---|---|
| SELECT COUNT(*) FROM catalogue | 29,812 |
| SELECT COUNT(*) FROM documents | 29,812 |
| SELECT COUNT(*) FROM documents WHERE status = 'complete' | 29,809 |
| SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL | 19,148 |
| SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co' | 19,133 |
| SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf' | 10,679 |

Non-Complete Document Status Breakdown

| Status | Count |
|---|---|
| skipped | 3 |

All documents are either complete (29,809) or skipped (3). Total: 29,812.
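The row counts and the status breakdown can be re-captured in one pass, which makes it easier to diff a later run against this baseline. The sketch below uses only the tables and columns that appear in the queries above (`catalogue`, `documents`, `status`, `organized_at`); the function names are illustrative and the connection is assumed to point at a copy of recon.db.

```python
import sqlite3


def status_breakdown(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {status: row count} for the documents table."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM documents GROUP BY status"
    ).fetchall()
    return dict(rows)


def baseline_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Re-run the headline COUNT(*) queries from the baseline."""

    def scalar(sql: str) -> int:
        return conn.execute(sql).fetchone()[0]

    return {
        "catalogue": scalar("SELECT COUNT(*) FROM catalogue"),
        "documents": scalar("SELECT COUNT(*) FROM documents"),
        "complete": scalar(
            "SELECT COUNT(*) FROM documents WHERE status = 'complete'"
        ),
        "complete_unorganized": scalar(
            "SELECT COUNT(*) FROM documents "
            "WHERE status = 'complete' AND organized_at IS NULL"
        ),
    }
```

At baseline, `sum(status_breakdown(conn).values())` should equal the `documents` count (29,809 + 3 = 29,812), and `catalogue` should equal `documents`.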


Filesystem Metrics

/mnt/library (data node, NFS)

| Metric | Value |
|---|---|
| Total PDFs in library | 15,446 |
| Channel directories (streamecho6, depth ≤2) | 18,987 |
| Transcript directories (streamecho6, depth 2) | 18,855 |

Underscore-prefixed staging directories

| Directory | Size |
|---|---|
| /mnt/library/_sources | 439M |
| /mnt/library/_acquired | 4.0K |
| /mnt/library/_unclassified | 6.3G |
| /mnt/library/_ingest | 2.4G |

/opt/recon/data (CT 130 local)

| Path | Size |
|---|---|
| data/text/ | 4.5G |
| data/concepts/ | 4.0G |
| data/ (total) | 8.7G |
| Text subdirectories | 10,955 (if this count includes the parent directory, the number of hash dirs is 10,954, not 10,955) |
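Directory counts are easy to get off by one when the parent directory is included in the tally. The sketch below counts PDFs recursively and counts only the immediate subdirectories of a root, so the parent is never included. It assumes the hash directories are direct children of data/text/, which this report does not confirm; the function names are hypothetical and this is not necessarily how the figures above were produced.

```python
import os
from pathlib import Path


def count_pdfs(root: Path) -> int:
    """Count files ending in .pdf (case-insensitive) anywhere under root."""
    return sum(
        1
        for _dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
        if name.lower().endswith(".pdf")
    )


def count_subdirs(root: Path) -> int:
    """Count immediate subdirectories of root; root itself is not counted."""
    with os.scandir(root) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))
```

Running `count_subdirs(Path("/opt/recon/data/text"))` on CT 130 would settle whether the hash directory count is 10,955 or 10,954.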

Anomalies Noted

  1. recon-watchdog.service is still running. It was expected to be inactive (dead) per instructions, but it is active (running) with PID 343. It runs recon.py pipeline watch, which polls for pipeline work. The polling itself is read-only, but the watchdog could trigger pipeline operations if it finds work. Consider stopping it before beginning refactor work.

  2. Both services are enabled. They will auto-start on next reboot of CT 130. If the refactor requires the pipeline to stay down, these should be disabled before any reboot.

  3. catalogue and documents counts match (29,812 each). This is expected rather than anomalous: the two tables should be 1:1.

  4. Qdrant distinct doc_hash (29,519) < documents complete (29,809). The delta is 29,809 - 29,519 = 290 documents that are marked complete in SQLite but have no corresponding vectors in Qdrant. The 3 skipped documents are already excluded from the 29,809 complete count, so they account for none of this gap. Possible explanations:

    • Documents completed after the last embedding run
    • Documents that failed embedding silently
  5. Text subdirectory count (10,955) vs STATE 2 expected (~278). The survey predicted ~278 STATE 2 transcript dirs remaining after migration. 10,955 suggests a much larger set of text directories still exists — these likely include PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error; text dirs are created for all document types, not just PeerTube transcripts.

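  6. 19,148 complete documents with organized_at IS NULL. These documents completed extraction, enrichment, and embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents (19,148 / 29,809). The organizer thread may have been running behind, or these are PeerTube transcripts that don't go through the organizer path. The count is close to the 19,133 catalogue rows with source = 'stream.echo6.co', which would fit the second explanation if that source is the PeerTube instance.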

  7. _unclassified (6.3G) and _ingest (2.4G) staging dirs are non-trivial. These contain PDFs that haven't been classified into domain directories yet. Not blocking for the refactor, but worth tracking.
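For anomaly 4, the 290 figure is a difference of two counts, which only identifies the gap if every Qdrant doc_hash belongs to a complete document. A set comparison would name the affected documents directly. The sketch below assumes both sides can be exported as sets of hash strings; the input names and the `reconcile` helper are hypothetical, and the SQLite column holding the hash is not named in this report.

```python
def reconcile(sqlite_complete: set[str], qdrant_hashes: set[str]) -> dict[str, set[str]]:
    """Compare complete-document hashes in SQLite against doc_hash values in Qdrant."""
    return {
        # Marked complete in SQLite, but no vectors found in Qdrant.
        "missing_vectors": sqlite_complete - qdrant_hashes,
        # Vectors in Qdrant with no matching complete document in SQLite.
        "orphan_vectors": qdrant_hashes - sqlite_complete,
    }


# Tiny illustrative example with made-up hashes.
example = reconcile({"a", "b", "c"}, {"b", "d"})
```

Because |missing_vectors| - |orphan_vectors| equals the count difference, the number of unembedded complete documents is 290 plus the number of orphaned hashes, so 290 is a lower bound unless `orphan_vectors` is empty.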