# Phase 0: Baseline Capture

**Captured:** 2026-04-14T06:11Z UTC

---

## Backups
| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (SQLite) | CT 130: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| recon.db (replica) | cortex: `/tmp/recon.db.phase0.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
| config.yaml | CT 130: `/tmp/config.yaml.phase0.20260414.bak` | `6d70ed572dfb2e704abca3850ae33797` |

DB backup verified: opens cleanly, contains 6 tables (catalogue, documents, duplicate_review, file_operations, intel, metrics_snapshots).

MD5 hashes match between CT 130 and cortex replicas.
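
The verification described above (hash comparison plus a table listing) could be repeated with a short script. This is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of `recon.py`, and the original check may have been done differently.

```python
import hashlib
import sqlite3
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(db_path):
    """Open the SQLite file read-only and return its user table names, sorted."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def verify_backup(db_path, expected_md5, expected_tables):
    """True if the file hash and the table list both match expectations."""
    return (
        md5_of_file(db_path) == expected_md5
        and list_tables(db_path) == sorted(expected_tables)
    )
```

Calling `verify_backup` with the backup path, the hash from the table, and the six table names listed above would confirm both properties in one step. Opening the file with `mode=ro` avoids modifying the backup while inspecting it.
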

---

## Service State

### recon.service

```
Status: inactive (dead) since Mon 2026-04-13 22:40:35 UTC
Enabled: enabled (will auto-start on boot)
Duration: 5h 31min 19.690s (last run)
Exit: code=exited, status=0/SUCCESS
```

### recon-watchdog.service

```
Status: active (running) since Mon 2026-04-13 17:09:08 UTC
Enabled: enabled
Memory: 8.2M (peak 48.7M)
PID: 343
Command: /opt/recon/venv/bin/python3 /opt/recon/recon.py pipeline watch
```
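
The service states above could also be captured programmatically. The sketch below assumes `systemctl show` is available on CT 130 and parses its `KEY=VALUE` output; the helper names and the `is_quiesced` rule (inactive and not enabled) are illustrative assumptions, not something this report defines.

```python
import subprocess


def parse_show_output(text):
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unit_state(unit):
    """Query selected properties of a systemd unit (requires systemctl)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,UnitFileState"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(result.stdout)


def is_quiesced(props):
    """Assumed rule: a unit is quiesced if inactive and not enabled at boot."""
    return (
        props.get("ActiveState") == "inactive"
        and props.get("UnitFileState") != "enabled"
    )
```

Under this rule neither service would count as quiesced at capture time: `recon.service` is inactive but enabled, and `recon-watchdog.service` is both active and enabled.
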

---

## Qdrant Metrics

**Collection:** `recon_knowledge_hybrid`

| Metric | Value |
|--------|-------|
| Status | green |
| Optimizer status | ok |
| Points count | 2,320,695 |
| Indexed vectors count | 4,641,386 (≈2x points: dense + sparse; 4 fewer than 2 × 2,320,695 = 4,641,390) |
| Segments | 8 |
| Vector size | 1024 (Cosine) |
| Sparse vectors | bge-m3-sparse (IDF modifier) |
| Distinct doc_hash values | 29,519 |
| Sum of point counts across all hashes | 2,320,695 (consistent) |
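
The two consistency claims in the table (indexed vectors relative to points, and per-hash counts summing to the point total) reduce to simple arithmetic. The following sketch works on plain numbers rather than a live Qdrant connection; the function names and the `vectors_per_point=2` default are assumptions drawn from the dense + sparse note in the table.

```python
def check_vector_counts(points, indexed_vectors, vectors_per_point=2):
    """Return (expected_vectors, shortfall) for a collection.

    shortfall is positive when fewer vectors are indexed than
    points * vectors_per_point would predict.
    """
    expected = points * vectors_per_point
    return expected, expected - indexed_vectors


def counts_are_consistent(points_per_hash, total_points):
    """True if the per-doc_hash point counts sum to the collection total."""
    return sum(points_per_hash.values()) == total_points


expected, shortfall = check_vector_counts(2_320_695, 4_641_386)
print(expected, shortfall)  # 4641390 4
```

A shortfall of 4 vectors is small next to the collection size, but recording it in the baseline makes it possible to tell later whether the refactor changed it.
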

---

## SQLite Metrics

Database: `/opt/recon/data/recon.db`

### Row Counts

| Query | Count |
|-------|-------|
| `SELECT COUNT(*) FROM catalogue` | 29,812 |
| `SELECT COUNT(*) FROM documents` | 29,812 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete'` | 29,809 |
| `SELECT COUNT(*) FROM documents WHERE status = 'complete' AND organized_at IS NULL` | 19,148 |
| `SELECT COUNT(*) FROM catalogue WHERE source = 'stream.echo6.co'` | 19,133 |
| `SELECT COUNT(*) FROM catalogue WHERE path LIKE '/mnt/library/%' AND path LIKE '%.pdf'` | 10,679 |

### Non-Complete Document Status Breakdown

| Status | Count |
|--------|-------|
| skipped | 3 |

All documents are either `complete` (29,809) or `skipped` (3). Total: 29,812.
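
To re-capture these counts after the refactor and compare them against this baseline, the queries from the table can be run in one pass. This is a sketch using the standard `sqlite3` module; the dictionary labels and function names are hypothetical, and the `GROUP BY` query for the status breakdown is an assumption about how that table was produced.

```python
import sqlite3

# Queries copied from the Row Counts table; the labels are illustrative.
BASELINE_QUERIES = {
    "catalogue_total": "SELECT COUNT(*) FROM catalogue",
    "documents_total": "SELECT COUNT(*) FROM documents",
    "documents_complete":
        "SELECT COUNT(*) FROM documents WHERE status = 'complete'",
    "complete_unorganized":
        "SELECT COUNT(*) FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL",
}


def capture_counts(conn, queries=BASELINE_QUERIES):
    """Run each COUNT(*) query and return {label: count}."""
    return {
        label: conn.execute(sql).fetchone()[0]
        for label, sql in queries.items()
    }


def status_breakdown(conn):
    """Return {status: count} for documents that are not 'complete'."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM documents "
        "WHERE status != 'complete' GROUP BY status ORDER BY status"
    ).fetchall()
    return dict(rows)
```

Passing a connection opened on `/opt/recon/data/recon.db` to `capture_counts` returns a dictionary that can be diffed against the numbers recorded here.
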

---

## Filesystem Metrics

### /mnt/library (data node, NFS)

| Metric | Value |
|--------|-------|
| Total PDFs in library | 15,446 |
| Channel directories (streamecho6, depth ≤2) | 18,987 |
| Transcript directories (streamecho6, depth 2) | 18,855 |

#### Underscore-prefixed staging directories

| Directory | Size |
|-----------|------|
| `/mnt/library/_sources` | 439M |
| `/mnt/library/_acquired` | 4.0K |
| `/mnt/library/_unclassified` | 6.3G |
| `/mnt/library/_ingest` | 2.4G |

### /opt/recon/data (CT 130 local)

| Path | Size |
|------|------|
| `data/text/` | 4.5G |
| `data/concepts/` | 4.0G |
| `data/` (total) | 8.7G |
| Text subdirectories | 10,955 (if this count includes the parent directory, the hash-dir count is 10,954) |
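
Directory counts are easy to get off by one: `find DIR -type d | wc -l`, for example, counts `DIR` itself. The commands used for this capture are not recorded, so the sketch below is only one way to count PDFs and child directories without including the parent; the function names are hypothetical.

```python
from pathlib import Path


def count_pdfs(root):
    """Count regular files under root whose extension is .pdf (any case)."""
    return sum(
        1
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def count_child_dirs(root):
    """Count immediate subdirectories of root, excluding root itself."""
    return sum(1 for p in Path(root).iterdir() if p.is_dir())
```

Running `count_child_dirs` on `data/text/` would settle whether the number of hash directories is 10,955 or 10,954.
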

---

## Anomalies Noted

1. **recon-watchdog.service is still running.** Expected inactive/dead per instructions, but it is active (running) with PID 343. It runs `recon.py pipeline watch`, which polls for pipeline work. The watchdog itself is read-only, but it could trigger pipeline operations if it finds work. Consider stopping it before beginning refactor work.

2. **Both services are `enabled`.** They will auto-start on the next reboot of CT 130. If the refactor requires the pipeline to stay down, disable them before any reboot.

3. **catalogue and documents counts match (29,812 each).** This is expected; the two tables should be 1:1.

4. **Qdrant distinct doc_hash (29,519) < documents complete (29,809).** There is a delta of 290 documents that are marked `complete` in SQLite but have no corresponding vectors in Qdrant. Possible explanations:
   - Documents completed after the last embedding run
   - Documents that failed embedding silently
   - The 3 `skipped` documents would not have vectors, but they are already excluded from the `complete` count, so they account for none of the gap
   - Gap among complete docs: 29,809 - 29,519 = 290 unembedded complete documents

5. **Text subdirectory count (10,955) vs STATE 2 expected (~278).** The survey predicted ~278 STATE 2 transcript dirs remaining after migration. A count of 10,955 suggests a much larger set of text directories still exists. These likely include PDF-sourced text dirs, web-scraped text dirs, and possibly unmigrated transcripts. This is not necessarily an error; text dirs are created for all document types, not just PeerTube transcripts.

6. **19,148 complete documents with `organized_at IS NULL`.** These documents completed extraction, enrichment, and embedding but were never organized (placed into the library filesystem by the organizer). This is ~64% of all complete documents (19,148 of 29,809). The organizer thread may have been running behind, or these are PeerTube transcripts that don't go through the organizer path. The count is close to the 19,133 catalogue rows with `source = 'stream.echo6.co'`, which is consistent with the second explanation but does not confirm it.

7. **`_unclassified` (6.3G) and `_ingest` (2.4G) staging dirs are non-trivial.** These contain PDFs that haven't been classified into domain directories yet. Not blocking for the refactor, but worth tracking.
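
To move anomaly 4 from a count to a list of affected documents, the two sets of hashes can be compared directly. The sketch below assumes the complete-document hashes and the Qdrant `doc_hash` values have already been exported to Python collections; how they are exported (column names, scroll queries) is not covered here, and the function names are hypothetical. The second helper reproduces the ~64% figure from anomaly 6.

```python
def missing_from_qdrant(complete_hashes, qdrant_hashes):
    """Return sorted hashes that are complete in SQLite but absent in Qdrant."""
    return sorted(set(complete_hashes) - set(qdrant_hashes))


def unorganized_share(unorganized, complete):
    """Fraction of complete documents that have no organized_at value."""
    return unorganized / complete


# Figures from this report.
print(29_809 - 29_519)                             # 290
print(f"{unorganized_share(19_148, 29_809):.1%}")  # 64.2%
```

If the list returned by `missing_from_qdrant` has fewer than 290 entries, some hashes in the `complete` set are probably shared by more than one document, which would be a different explanation from the ones listed above.
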