# Stream B: Post-Migration Validation Report

**Date:** 2026-04-13

**Pipeline version:** new_pipeline.py (Stream B v1, with 2 hotfixes applied during testing)

---

## Executive Summary

Both validation tasks passed. The Stream B pipeline is operational:

- **Task A (Aurora RAG):** All 3 queries returned correct Civil Organization results with updated download_urls. Migration has not broken RAG retrieval.
- **Task C (Watchdog Ingest):** Full two-phase ingest lifecycle validated end-to-end: acquire → extract → enrich → embed → place → reverse → re-place. Two bugs found and fixed during testing.

---

## Task A: Aurora RAG Retrieval Validation

**Verdict: PASS**

| Test | Result |
|------|--------|
| Query 1: Community governance principles | Relevant Civil Org results returned |
| Query 2: Emergency preparedness organization | Relevant Civil Org results returned |
| Query 3: Dispute resolution frameworks | Relevant Civil Org results returned |
| Download URL resolution (5 tested) | All 5 resolve to files on disk |
| Qdrant vectors have updated paths | YES |
| original_filename populated | YES |

**Conclusion:** The Phase 4 migration of 80 Civil Organization files has not degraded RAG quality. Qdrant vectors correctly reference the new standardized file paths.

---

## Task C: Watchdog Two-Phase Ingest Test

**Verdict: PASS**

### Test Document

- **Input:** `TestDoc_Civil_Governance_Framework_2024.pdf` (2,480 bytes, generated via reportlab)
- **Hash:** `346a65d9d72550df64490ad8e9998622`
- **Enriched title:** "Civil Governance Framework Analysis"
- **Enriched author:** "Dr. James Mitchell"
- **Domain:** Civil Organization / Governance

### Phase A (Acquisition)

| Step | Result |
|------|--------|
| File detected in `_acquired/` | PASS |
| Moved to `_ingest/` preserving original name | PASS |
| Catalogue entry created (status=queued) | PASS |
| Documents entry created (status=queued) | PASS |

### RECON Pipeline Processing

| Stage | Time | Duration |
|-------|------|----------|
| Extract | 05:50:40 | 26s |
| Enrich (Gemini) | 05:51:05 | 25s |
| Embed (TEI/Qdrant) | 05:51:25 | 20s |
| **Total processing** | | **~71s** |

### Phase B (Library Placement)

| Step | Result |
|------|--------|
| Filename standardized from book_title | `Civil_Governance_Framework_Analysis.pdf` |
| Domain classified | Civil Organization |
| Subdomain classified | Governance |
| Collision step | 1 (base, no collision) |
| File placed in library | `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| DB paths updated | PASS |
| Qdrant payloads updated (2 vectors) | PASS |
| original_filename preserved in Qdrant | PASS |
| file_operations audit entry created | PASS |

### Reverse + Re-place

| Step | Result |
|------|--------|
| Reverse moves file back to `_ingest/` | PASS |
| DB/Qdrant reverted to `_ingest` paths | PASS |
| Re-placement produces identical result | PASS |
| file_operations tracks both operations | PASS |

---

## Bugs Found & Fixed

### Bug 1: Phase B query overwhelmed by unorganized docs

**Severity:** Blocker (Phase B would never find new ingest docs)

**Root cause:** `get_unorganized(limit=50)` returns the oldest 50 unorganized docs out of 29,469 total. PeerTube transcripts fill the entire result set.

**Fix:** Added `get_ingest_pending(ingest_dir, limit)`, a path-filtered query. Updated `ingest_scan()` Phase B to use it.

**Impact:** Without this fix, the watchdog Phase B would never process new acquisitions.

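To make the Bug 1 failure mode concrete, here is a minimal sketch of the two queries against an in-memory SQLite table. Everything about the storage layer is an assumption for illustration: the report does not say which database RECON uses, and the `documents` columns (`file_path`, `status`, `organized_at`), the `conn` argument, the `/library/...` paths, and the `build_demo_db()` helper are hypothetical. Only the function names and the `(ingest_dir, limit)` parameters come from the report.

```python
import sqlite3


def get_unorganized(conn, limit=50):
    """Pre-fix behaviour: the oldest unorganized docs, wherever they live."""
    return conn.execute(
        "SELECT id, file_path FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL "
        "ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def get_ingest_pending(conn, ingest_dir, limit=50):
    """Post-fix behaviour: only unorganized docs stored under ingest_dir."""
    prefix = ingest_dir.rstrip("/") + "/"
    # Compare the leading characters directly instead of using LIKE, because
    # the "_" in "_ingest" is a single-character wildcard in LIKE patterns.
    return conn.execute(
        "SELECT id, file_path FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL "
        "AND substr(file_path, 1, ?) = ? "
        "ORDER BY id LIMIT ?",
        (len(prefix), prefix, limit),
    ).fetchall()


def build_demo_db():
    """Hypothetical stand-in: 60 old transcripts plus one freshly ingested PDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, file_path TEXT, status TEXT, organized_at TEXT)"
    )
    rows = [
        (f"/library/peertube/transcript_{i:04d}.txt", "complete", None)
        for i in range(60)
    ]
    rows.append(
        ("/library/_ingest/TestDoc_Civil_Governance_Framework_2024.pdf",
         "complete", None)
    )
    conn.executemany(
        "INSERT INTO documents (file_path, status, organized_at) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    oldest = get_unorganized(conn, limit=50)
    pending = get_ingest_pending(conn, "/library/_ingest", limit=50)
    print("ingest doc visible to old query:",
          any(path.startswith("/library/_ingest/") for _, path in oldest))
    print("ingest docs visible to new query:", pending)
```

In this toy setup the unfiltered query spends its whole limit on older transcript rows, which mirrors the starvation described above, while the path-filtered query is unaffected by how many unrelated unorganized rows exist.
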
### Bug 2: Reverse doesn't clear organized_at

**Severity:** Minor (reverse + re-trigger workflow broken)

**Root cause:** `reverse_operation()` moved files and updated DB paths but didn't clear `organized_at`, so Phase B wouldn't re-trigger placement.

**Fix:** Added `UPDATE documents SET organized_at = NULL` to `reverse_operation()`.

**Impact:** Only affects the reverse → re-place workflow. Normal forward flow unaffected.

---

## Files Modified During Validation

| File | Changes |
|------|---------|
| `/opt/recon/lib/new_pipeline.py` | Phase B query fix + organized_at clear in reverse |
| `/opt/recon/lib/status.py` | Added `get_ingest_pending()` method |

Both fixes synced to local copies at `/home/zvx/projects/recon/lib/`.

---

## Pipeline State After Validation

| Item | State |
|------|-------|
| `new_pipeline.enabled` | false (disabled after test) |
| Watchdog process | killed |
| Test document | Left in place at `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| `_acquired/` | Empty |
| `_ingest/` | Empty |
| `_ingest/_duplicates/` | Empty |
| `_ingest/_failed/` | Empty |
| Total file_operations records | 82 (80 from migration + 2 from test) |
| duplicate_review records | 0 |

---

## Recommendations

1. **Ready for production:** The two-phase ingest pipeline is functional. Set `new_pipeline.enabled: true` when ready to accept new acquisitions.
2. **Watchdog logging:** Consider calling `setup_logging('recon.pipeline')` at the start of `run_watchdog()` so logs appear in the main RECON log file even when run standalone via `recon.py pipeline watch`.
3. **Domain expansion:** The `pilot_domain: "Civil Organization"` restriction limits placement to Civil Org docs only. To enable for all domains, set `pilot_domain: null` or remove it.
4. **PeerTube organized_at:** 29,469 complete docs with `organized_at IS NULL` are mostly PeerTube transcripts. Consider bulk-setting `organized_at` for non-PDF docs to prevent the `get_unorganized()` query from growing unbounded (though the new `get_ingest_pending()` query sidesteps this issue for the pipeline).

---

## Final Verdict

**Stream B (New Library Pipeline): VALIDATED**

All components tested and operational:

- Phase A acquisition (watchdog → `_acquired/` → `_ingest/`)
- RECON pipeline integration (extract → enrich → embed)
- Phase B placement (standardized naming from book_title → collision ladder → library)
- Qdrant payload updates (download_url, filename, original_filename)
- Reverse operation (full rollback including Qdrant)
- Re-placement after reverse
- Aurora RAG retrieval (citations resolve to new paths)
- Audit trail (file_operations table)
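
The reverse and re-placement items above depend on the Bug 2 fix, so a small sketch of the database side of a reverse may help future readers. This is not the code in `new_pipeline.py`: it assumes a SQLite-style `documents` table with hypothetical `file_path` and `organized_at` columns, and the helper names, the `_ingest/` file name used in the demo, and the placeholder timestamp are all invented for illustration. The only detail taken from the report is that the reverse must set `organized_at` back to `NULL`.

```python
import sqlite3


def reverse_db_state(conn, doc_id, ingest_path):
    """DB half of a reverse: restore the _ingest path and clear organized_at."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE documents SET file_path = ?, organized_at = NULL "
            "WHERE id = ?",
            (ingest_path, doc_id),
        )


def is_pending_placement(conn, doc_id):
    """True when the doc exists and its organized_at is NULL again."""
    row = conn.execute(
        "SELECT organized_at FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row is not None and row[0] is None


def build_placed_doc_db():
    """Hypothetical table holding one document that was already placed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, file_path TEXT, organized_at TEXT)"
    )
    conn.execute(
        "INSERT INTO documents (id, file_path, organized_at) VALUES (1, ?, ?)",
        (
            "Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf",
            "2026-04-13",  # placeholder value, not a recorded timestamp
        ),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_placed_doc_db()
    print("pending before reverse:", is_pending_placement(conn, 1))
    # Illustrative _ingest path; the real reverse decides the restored name.
    reverse_db_state(conn, 1, "_ingest/TestDoc_Civil_Governance_Framework_2024.pdf")
    print("pending after reverse:", is_pending_placement(conn, 1))
```

Updating the path and clearing `organized_at` in the same statement keeps the two columns from drifting apart, which is the inconsistency Bug 2 describes.
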