mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 14:44:39 +02:00

Matt 878bc2744a Phase 0: baseline capture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 06:14:35 +00:00

Phase 0: Baseline Capture

Captured: 2026-04-14T06:11Z

Backups

Item	Location	MD5 Hash
recon.db (SQLite)	CT 130: `/tmp/recon.db.phase0.20260414.bak`	`69d94a2c21686871c8c6863903710e3f`
recon.db (replica)	cortex: `/tmp/recon.db.phase0.20260414.bak`	`69d94a2c21686871c8c6863903710e3f`
config.yaml	CT 130: `/tmp/config.yaml.phase0.20260414.bak`	`6d70ed572dfb2e704abca3850ae33797`

DB backup verified: opens cleanly, contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots).

The recon.db MD5 hashes match between CT 130 and the cortex replica. config.yaml was backed up on CT 130 only.
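The verification described above could be repeated with a short script along these lines. This is a minimal sketch, not the procedure actually used for the capture: the helper names `md5_of_file` and `list_tables` are hypothetical, and the database is opened read-only so the check cannot modify the backup.

```python
import hashlib
import sqlite3
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(db_path):
    """Open a SQLite file read-only and return its user table names, sorted."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Running `md5_of_file` on each backup and comparing the result with the hash in the table, then comparing `list_tables` with the six expected table names, would reproduce both checks.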

Service State

recon.service

Status:   inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled:  enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit:     code=exited, status=0/SUCCESS

recon-watchdog.service

Status:   active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled:  enabled
Memory:   8.2M (peak 48.7M)
PID:      343
Command:  /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
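For repeat captures, the runtime and boot-time state of both units could be read programmatically rather than copied from `systemctl status`. The sketch below assumes `systemctl show` is available on CT 130; the function names are hypothetical, and `needs_attention` encodes one possible reading of "must stay down" (not running and not enabled), which may be stricter than the refactor requires.

```python
import subprocess


def parse_show_output(text):
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unit_state(unit):
    """Query systemd for a unit's state (requires systemctl on the host)."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,UnitFileState,MainPID"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_show_output(out)


def needs_attention(props):
    """Flag a unit that is currently running or would start on boot."""
    return (props.get("ActiveState") == "active"
            or props.get("UnitFileState") == "enabled")
```

Calling `unit_state("recon.service")` and `unit_state("recon-watchdog.service")` and passing each result to `needs_attention` would flag both units in the state recorded above, since both are enabled.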

Qdrant Metrics

Collection: recon_knowledge_hybrid

Metric	Value
Status	green
Optimizer status	ok
Points count	2,320,695
Indexed vectors count	4,641,386 (≈2x points: dense + sparse; exactly 2x would be 4,641,390)
Segments	8
Vector size	1024 (Cosine)
Sparse vectors	bge-m3-sparse (IDF modifier)
Distinct doc_hash values	29,519
Sum of point counts across all hashes	2,320,695 (consistent)
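The consistency checks implied by this table can be expressed as plain arithmetic. The sketch below is illustrative only: `check_vector_counts` is a hypothetical helper, and `per_hash_counts` stands for a mapping of doc_hash to point count, however it was gathered from Qdrant.

```python
def check_vector_counts(points, indexed_vectors, per_hash_counts,
                        vectors_per_point=2):
    """Compare collection totals against per-doc_hash point counts.

    vectors_per_point=2 reflects one dense and one sparse vector per point.
    """
    expected_vectors = points * vectors_per_point
    return {
        "distinct_hashes": len(per_hash_counts),
        "points_match_hash_sum": sum(per_hash_counts.values()) == points,
        "indexed_vector_shortfall": expected_vectors - indexed_vectors,
    }
```

With the values above (2,320,695 points and 4,641,386 indexed vectors), `indexed_vector_shortfall` comes out as 4, so the indexed total is very close to, but not exactly, twice the point count.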

SQLite Metrics

Database: /opt/recon/data/recon.db

Row Counts

Query	Count
`SELECT COUNT(*) FROM catalogue`	29,812
`SELECT COUNT(*) FROM documents`	29,812
`SELECT COUNT(*) FROM documents WHERE status = 'complete'`	29,809
`SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL`	19,148
`SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'`	19,133
`SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'`	10,679

Non-Complete Document Status Breakdown

Status	Count
skipped	3

All documents are either complete (29,809) or skipped (3). Total: 29,812.
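To make later phases comparable with this baseline, the same queries could be run from a script. This is a sketch under the assumption that the schema matches the queries listed above; `run_counts` and `status_breakdown` are hypothetical helper names.

```python
import sqlite3

# The queries are copied verbatim from the Row Counts table.
BASELINE_QUERIES = [
    "SELECT COUNT(*) FROM catalogue",
    "SELECT COUNT(*) FROM documents",
    "SELECT COUNT(*) FROM documents WHERE status = 'complete'",
    "SELECT COUNT(*) FROM documents WHERE status = 'complete' "
    "AND organized_at IS NULL",
    "SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'",
    "SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' "
    "AND path LIKE '%.pdf'",
]


def run_counts(conn, queries=BASELINE_QUERIES):
    """Run each COUNT(*) query and return a {sql: count} mapping."""
    return {sql: conn.execute(sql).fetchone()[0] for sql in queries}


def status_breakdown(conn):
    """Return {status: count} for every status present in documents."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM documents GROUP BY status ORDER BY status"
    )
    return dict(rows.fetchall())
```

Passing a connection to a copy of `/opt/recon/data/recon.db` (ideally the backup, opened read-only) would reproduce the tables above. `status_breakdown` includes the `complete` row as well as the non-complete ones.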

Filesystem Metrics

/mnt/library (data node, NFS)

Metric	Value
Total PDFs in library	15,446
Channel directories (streamecho6, depth ≤2)	18,987
Transcript directories (streamecho6, depth 2)	18,855

Underscore-prefixed staging directories

Directory	Size
`/mnt/library/_sources`	439M
`/mnt/library/_acquired`	4.0K
`/mnt/library/_unclassified`	6.3G
`/mnt/library/_ingest`	2.4G

/opt/recon/data (CT 130 local)

Path	Size
`data/text/`	4.5G
`data/concepts/`	4.0G
`data/` (total)	8.7G
Text subdirectories	10,955 hash dirs (parent directory excluded from the count)
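Directory and file counts like these are sensitive to whether the starting directory itself is counted. The sketch below shows one way to count that excludes it. The helper names are hypothetical, and the exact commands used for the baseline are not recorded here, so results may differ slightly (for example, `rglob("*.pdf")` is case-sensitive on Linux and will not match `.PDF`).

```python
from pathlib import Path


def count_pdfs(root):
    """Count regular files ending in .pdf anywhere below root."""
    return sum(1 for p in Path(root).rglob("*.pdf") if p.is_file())


def count_child_dirs(root):
    """Count immediate subdirectories of root, excluding root itself."""
    return sum(1 for p in Path(root).iterdir() if p.is_dir())
```

For example, `count_child_dirs("/opt/recon/data/text")` would give the number of hash directories directly, with no adjustment for the parent.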

Anomalies Noted

recon-watchdog.service is still running. It was expected to be inactive (dead) per instructions, but it is active (running) with PID 343. It runs `recon.py pipeline watch`, which polls for pipeline work. The watchdog is nominally read-only, but it could trigger pipeline operations if it finds work, so consider stopping it before beginning refactor work.
Both services are enabled. They will auto-start on next reboot of CT 130. If the refactor requires the pipeline to stay down, these should be disabled before any reboot.
catalogue and documents counts match (29,812 each) — this is expected; they should be 1:1.
Qdrant distinct doc_hash (29,519) < documents complete (29,809). The delta is 29,809 - 29,519 = 290 documents that are marked complete in SQLite but appear to have no corresponding vectors in Qdrant. Possible explanations:
- Documents completed after the last embedding run
- Documents that failed embedding silently
- The 3 skipped documents are not part of this gap: they would not have vectors, but they are already excluded from the 29,809 complete count
Text subdirectory count (10,955) vs STATE 2 expected (~278). The survey predicted ~278 STATE 2 transcript dirs remaining after migration. The observed 10,955 implies that far more text directories still exist; they likely include PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error: text dirs are created for all document types, not just PeerTube transcripts.
19,148 complete documents have organized_at IS NULL. These documents completed extraction, enrichment, and embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents (19,148 / 29,809). The organizer thread may have been running behind, or these are PeerTube transcripts that don't go through the organizer path; the count is close to the 19,133 catalogue rows with source = 'stream.echo6.co', which is consistent with the latter.
_unclassified (6.3G) and _ingest (2.4G) staging dirs are non-trivial. These contain PDFs that haven't been classified into domain directories yet. Not blocking for the refactor, but worth tracking.
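The gap between complete documents and Qdrant doc_hash values noted above could be narrowed down by comparing the two sets of hashes directly. This is a sketch under stated assumptions: it presumes each complete row in SQLite carries a hash comparable to Qdrant's doc_hash payload (the column name is not given in this document), and `embedding_gap` is a hypothetical helper that takes both hash collections however they were exported.

```python
def embedding_gap(complete_hashes, qdrant_hashes):
    """Compare hashes of complete SQLite documents with Qdrant doc_hash values."""
    complete_list = list(complete_hashes)
    complete = set(complete_list)
    embedded = set(qdrant_hashes)
    return {
        # Complete in SQLite, but no points found in Qdrant.
        "missing_from_qdrant": sorted(complete - embedded),
        # Present in Qdrant, but not a complete document in SQLite.
        "orphaned_in_qdrant": sorted(embedded - complete),
        # Complete rows that share a hash with another complete row.
        "duplicate_complete_rows": len(complete_list) - len(complete),
    }
```

A non-zero `duplicate_complete_rows` would mean part of the delta comes from documents sharing a hash rather than from missing embeddings, which would change how the delta should be interpreted.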
