diff --git a/phases/phase-1-scaffolding.md b/phases/phase-1-scaffolding.md
new file mode 100644
index 0000000..187a479
--- /dev/null
+++ b/phases/phase-1-scaffolding.md
@@ -0,0 +1,96 @@
+# Phase 1: Scaffolding
+
+**Executed:** 2026-04-14T14:45Z
+
+---
+
+## Backups (taken before changes)
+
+| Item | Location | MD5 Hash |
+|------|----------|----------|
+| recon.db | CT 130: `/tmp/recon.db.phase1.20260414.bak` | `69d94a2c21686871c8c6863903710e3f` |
+| config.yaml | CT 130: `/tmp/config.yaml.phase1.20260414.bak` | `6d70ed572dfb2e704abca3850ae33797` |
+
+The DB hash matches the Phase 0 backup, so no changes occurred between phases.
+
+---
+
+## What Changed
+
+### 1. Filesystem: New directory tree
+
+Created under `/opt/recon/data/`:
+
+```
+acquired/
+  README.md
+  pdf/.keep
+  stream/.keep
+  html/.keep
+processing/
+  README.md
+```
+
+All owned by `zvx:zvx`, matching the existing data directory.
+
+### 2. Config: Three edits to `/opt/recon/config.yaml`
+
+**a) `new_pipeline.enabled` set to `false`**
+
+The Stream B library pipeline (watchdog-driven file intake from `_acquired/` and `_ingest/`) is disabled. This prevents the old pipeline from processing files while we build the replacement.
+
+**b) `crawler.sites` set to `[]`**
+
+All 44 crawl target site definitions are commented out and preserved as a historical reference. The crawler scheduler will find zero sites and do nothing if started.
+
+**c) New `pipeline:` section added at end of file**
+
+```yaml
+pipeline:
+  acquired_root: /opt/recon/data/acquired
+  processing_root: /opt/recon/data/processing
+  dispatch:
+    pdf: pdf_processor
+    stream: transcript_processor
+    html: html_processor
+  mtime_stability_seconds: 10
+```
+
+Scaffolding only: no code reads this section yet, and the processors do not exist.
+
+**Config diff stats:** 284 lines removed, 302 lines added (the bulk is the 44 site definitions changing from uncommented to commented-out).
+
+### 3. Schema: `text_dir` column added to `documents` table
+
+```sql
+ALTER TABLE documents ADD COLUMN text_dir TEXT;
+```
+
+All 29,812 existing rows have `text_dir = NULL`. This column will hold the path to each document's extracted text directory, replacing the convention-based `data/text/{hash}/` lookup.
+
+---
+
+## What Did Not Change
+
+- **No code modified:** `recon.py`, `lib/`, `scripts/`, templates, and static assets are all untouched
+- **No data modified:** catalogue and documents row counts remain 29,812 each
+- **No service state changed:** Both `recon.service` and `recon-watchdog.service` remain inactive (both are still `enabled`, so they will auto-start on reboot)
+- **No Qdrant changes:** Collection `recon_knowledge_hybrid` untouched (2,320,695 points)
+- **No file moves or deletions:** Existing `data/text/`, `data/concepts/`, NFS mounts all untouched
+
+---
+
+## Verification (post-change)
+
+| Check | Result |
+|-------|--------|
+| recon.service | inactive |
+| recon-watchdog.service | inactive |
+| catalogue rows | 29,812 |
+| documents rows | 29,812 |
+| text_dir NULL count | 29,812 (all rows) |
+| new_pipeline.enabled | `false` |
+| crawler.sites | `[]` |
+| pipeline.acquired_root | `/opt/recon/data/acquired` |
+| New directories exist | all 5 confirmed, zvx:zvx |
+| YAML validates | yes |
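
Reviewer note on the `pipeline:` section: since no code reads it yet, the following is only a minimal sketch of how a future dispatcher might consume it. The helper names `processor_for` and `is_stable` are hypothetical. The sketch assumes that the first directory below `acquired_root` selects the `dispatch` entry, and that `mtime_stability_seconds` means a file is eligible once it has gone unmodified for that long. Neither assumption is confirmed by the report.

```python
import os
import time
from pathlib import Path

# Values copied from the `pipeline:` section of config.yaml in this diff.
PIPELINE = {
    "acquired_root": "/opt/recon/data/acquired",
    "processing_root": "/opt/recon/data/processing",
    "dispatch": {
        "pdf": "pdf_processor",
        "stream": "transcript_processor",
        "html": "html_processor",
    },
    "mtime_stability_seconds": 10,
}


def processor_for(path, cfg=PIPELINE):
    """Return the processor name for a file under acquired_root, or None.

    Hypothetical helper: the first directory component below acquired_root
    (pdf/, stream/, html/) is used as the dispatch key.
    """
    try:
        rel = Path(path).relative_to(cfg["acquired_root"])
    except ValueError:
        return None  # path is not under acquired_root
    if len(rel.parts) < 2 or rel.name.startswith("."):
        return None  # skip acquired/README.md and .keep placeholders
    return cfg["dispatch"].get(rel.parts[0])


def is_stable(path, cfg=PIPELINE, now=None):
    """True once the file has gone unmodified for mtime_stability_seconds."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime >= cfg["mtime_stability_seconds"]
```

Under these assumptions, `processor_for("/opt/recon/data/acquired/pdf/report.pdf")` resolves to `pdf_processor`, while the `README.md` and `.keep` files created in this phase resolve to `None` and would be ignored.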
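
Reviewer note on the schema change: the `text_dir` checks in the verification table could be scripted roughly as below. This is a sketch that assumes `recon.db` is a SQLite database (the report does not say so explicitly). The function name `text_dir_status` and the demo table layout are illustrative, not taken from the real schema.

```python
import sqlite3


def text_dir_status(conn):
    """Return (column_present, total_rows, null_rows) for documents.text_dir."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
    if "text_dir" not in cols:
        total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
        return False, total, None
    total, nulls = conn.execute(
        "SELECT COUNT(*), SUM(text_dir IS NULL) FROM documents"
    ).fetchone()
    return True, total, nulls or 0  # SUM is NULL when the table is empty


# Demo against a throwaway in-memory database with an illustrative layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, hash TEXT)")
conn.executemany("INSERT INTO documents (hash) VALUES (?)", [("a",), ("b",), ("c",)])
before = text_dir_status(conn)
conn.execute("ALTER TABLE documents ADD COLUMN text_dir TEXT")
after = text_dir_status(conn)
```

Run against a read-only copy of the real database, the same function would be expected to report the column as present with the total and NULL counts equal, matching the "29,812 (all rows)" entry above.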