# Stream B: Post-Migration Validation Report

**Date:** 2026-04-13
**Pipeline version:** `new_pipeline.py` (Stream B v1, with 2 hotfixes applied during testing)

---

## Executive Summary

Both validation tasks passed, and the Stream B pipeline is operational:

- **Task A (Aurora RAG):** All three queries returned relevant Civil Organization results with updated `download_url` values. The migration has not broken RAG retrieval.
- **Task C (Watchdog Ingest):** The full two-phase ingest lifecycle was validated end to end: acquire → extract → enrich → embed → place → reverse → re-place. Two bugs were found and fixed during testing.

---

## Task A: Aurora RAG Retrieval Validation

**Verdict: PASS**

| Test | Result |
|------|--------|
| Query 1: Community governance principles | Relevant Civil Org results returned |
| Query 2: Emergency preparedness organization | Relevant Civil Org results returned |
| Query 3: Dispute resolution frameworks | Relevant Civil Org results returned |
| Download URL resolution (5 tested) | All 5 resolve to files on disk |
| Qdrant vectors have updated paths | YES |
| `original_filename` populated | YES |

**Conclusion:** The Phase 4 migration of 80 Civil Organization files has not degraded RAG quality. Qdrant vectors correctly reference the new standardized file paths.

---

## Task C: Watchdog Two-Phase Ingest Test

**Verdict: PASS**

### Test Document

- **Input:** `TestDoc_Civil_Governance_Framework_2024.pdf` (2,480 bytes, generated via reportlab)
- **Hash:** `346a65d9d72550df64490ad8e9998622`
- **Enriched title:** "Civil Governance Framework Analysis"
- **Enriched author:** "Dr. James Mitchell"
- **Domain:** Civil Organization / Governance

### Phase A (Acquisition)

| Step | Result |
|------|--------|
| File detected in `_acquired/` | PASS |
| Moved to `_ingest/` preserving original name | PASS |
| Catalogue entry created (status=queued) | PASS |
| Documents entry created (status=queued) | PASS |
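
The Phase A steps above can be sketched roughly as follows. This is an illustration only: `acquire` and the record fields other than `status` and `original_filename` are hypothetical, and the real watchdog also writes catalogue and documents rows and handles duplicates, which this sketch leaves out.

```python
import shutil
from pathlib import Path


def acquire(acquired_dir: Path, ingest_dir: Path) -> list[dict]:
    """Move every file from acquired_dir into ingest_dir, keeping its name.

    Returns one in-memory queue record per moved file. A real pipeline
    would persist these records instead of returning them.
    """
    ingest_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for src in sorted(acquired_dir.iterdir()):
        if not src.is_file():
            continue  # skip subdirectories
        dest = ingest_dir / src.name  # original filename is preserved
        shutil.move(str(src), str(dest))
        records.append(
            {
                "original_filename": src.name,
                "file_path": str(dest),
                "status": "queued",
            }
        )
    return records
```
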

### RECON Pipeline Processing

| Stage | Time | Duration |
|-------|------|----------|
| Extract | 05:50:40 | 26s |
| Enrich (Gemini) | 05:51:05 | 25s |
| Embed (TEI/Qdrant) | 05:51:25 | 20s |
| **Total processing** | | **~71s** |
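
As a quick arithmetic check of the table, the sketch below sums the stage durations and compares the gaps between consecutive timestamps. The gaps match the Enrich and Embed durations, which suggests (but does not confirm) that the Time column records when each stage finished.

```python
from datetime import datetime

# (stage, logged time, duration in seconds), copied from the table above
stages = [
    ("Extract", "05:50:40", 26),
    ("Enrich", "05:51:05", 25),
    ("Embed", "05:51:25", 20),
]


def parse(clock: str) -> datetime:
    """Parse an HH:MM:SS timestamp (the date part is irrelevant here)."""
    return datetime.strptime(clock, "%H:%M:%S")


total_seconds = sum(duration for _, _, duration in stages)

# Seconds elapsed between each logged timestamp and the next one.
gaps = [
    int((parse(later[1]) - parse(earlier[1])).total_seconds())
    for earlier, later in zip(stages, stages[1:])
]

print(total_seconds)  # 71
print(gaps)           # [25, 20]
```
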

### Phase B (Library Placement)

| Step | Result |
|------|--------|
| Filename standardized from `book_title` | `Civil_Governance_Framework_Analysis.pdf` |
| Domain classified | Civil Organization |
| Subdomain classified | Governance |
| Collision step | 1 (base, no collision) |
| File placed in library | `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| DB paths updated | PASS |
| Qdrant payloads updated (2 vectors) | PASS |
| `original_filename` preserved in Qdrant | PASS |
| `file_operations` audit entry created | PASS |
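
A minimal sketch of the naming and collision steps, under stated assumptions. The report only shows the result of step 1 (the base name), so the numeric-suffix rungs used for later steps are a guess at what a collision ladder might do, and both function names are hypothetical.

```python
import re
from pathlib import Path


def standardize_filename(book_title: str, extension: str = ".pdf") -> str:
    """Join the alphanumeric words of a title with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", book_title)
    return "_".join(words) + extension


def resolve_collision(
    target_dir: Path, filename: str, max_steps: int = 100
) -> tuple[Path, int]:
    """Return the first free path and the ladder step that produced it.

    Step 1 is the base name. Later steps append _2, _3, ... which is an
    assumed scheme, not one documented in the report.
    """
    stem, suffix = Path(filename).stem, Path(filename).suffix
    for step in range(1, max_steps + 1):
        name = filename if step == 1 else f"{stem}_{step}{suffix}"
        candidate = target_dir / name
        if not candidate.exists():
            return candidate, step
    raise FileExistsError(f"no free name for {filename} in {target_dir}")
```

With the enriched title from the test document, `standardize_filename("Civil Governance Framework Analysis")` yields `Civil_Governance_Framework_Analysis.pdf`, matching the table.
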

### Reverse + Re-place

| Step | Result |
|------|--------|
| Reverse moves file back to `_ingest/` | PASS |
| DB/Qdrant reverted to `_ingest` paths | PASS |
| Re-placement produces identical result | PASS |
| `file_operations` tracks both operations | PASS |
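
The round trip in the table can be illustrated with a small filesystem-only sketch. Everything here is hypothetical: `place`, `reverse`, and the in-memory `audit_log` list stand in for the real functions and the `file_operations` table, whose schema the report does not show, and the DB and Qdrant updates are omitted.

```python
import shutil
from pathlib import Path


def place(doc: dict, library_path: Path, audit_log: list) -> None:
    """Move the document into the library and record the operation."""
    src = Path(doc["file_path"])
    library_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(library_path))
    audit_log.append({"op": "place", "src": str(src), "dest": str(library_path)})
    doc["file_path"] = str(library_path)


def reverse(doc: dict, audit_log: list) -> None:
    """Undo the most recent placement using the audit log."""
    last = next(op for op in reversed(audit_log) if op["op"] == "place")
    shutil.move(last["dest"], last["src"])
    audit_log.append({"op": "reverse", "src": last["dest"], "dest": last["src"]})
    doc["file_path"] = last["src"]
```

Because `reverse` restores the exact source path from the log, calling `place` again with the same target reproduces the original placement.
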

---

## Bugs Found & Fixed

### Bug 1: Phase B query overwhelmed by unorganized docs

- **Severity:** Blocker (Phase B would never find new ingest docs)
- **Root cause:** `get_unorganized(limit=50)` returns the 50 oldest unorganized docs out of 29,469 in total. PeerTube transcripts fill the entire result set, so newly ingested documents never appear in it.
- **Fix:** Added `get_ingest_pending(ingest_dir, limit)`, a path-filtered query, and updated Phase B of `ingest_scan()` to use it.
- **Impact:** Without this fix, the watchdog's Phase B would never process new acquisitions.
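
The difference between the two queries can be reproduced with an in-memory database. This is a sketch under assumptions: the report does not show the actual SQL, the database engine, or the schema, so SQLite, the `file_path` and `id` columns, and ordering by `id` as a proxy for "oldest" are all illustrative choices. Only the function names, the `documents` table, and `organized_at` come from the report.

```python
import sqlite3


def get_unorganized(conn: sqlite3.Connection, limit: int = 50) -> list:
    """Oldest-first query with no path filter (the pre-fix behaviour)."""
    return conn.execute(
        "SELECT id, file_path FROM documents "
        "WHERE organized_at IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def get_ingest_pending(
    conn: sqlite3.Connection, ingest_dir: str, limit: int = 50
) -> list:
    """Same filter, restricted to files that live under ingest_dir."""
    prefix = ingest_dir.rstrip("/") + "/"
    # substr() comparison avoids LIKE, where "_" in "_ingest" is a wildcard.
    return conn.execute(
        "SELECT id, file_path FROM documents "
        "WHERE organized_at IS NULL "
        "AND substr(file_path, 1, length(?)) = ? "
        "ORDER BY id LIMIT ?",
        (prefix, prefix, limit),
    ).fetchall()


def make_demo_db(backlog: int = 200) -> sqlite3.Connection:
    """Build a toy table: a transcript backlog followed by one new ingest doc."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, file_path TEXT, organized_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (file_path) VALUES (?)",
        [(f"/data/peertube/transcript_{i}.txt",) for i in range(backlog)],
    )
    conn.execute(
        "INSERT INTO documents (file_path) VALUES (?)",
        ("/data/_ingest/TestDoc_Civil_Governance_Framework_2024.pdf",),
    )
    conn.commit()
    return conn
```

With a backlog larger than the limit, `get_unorganized(conn, 50)` returns only transcripts, while `get_ingest_pending(conn, "/data/_ingest", 50)` returns the new document regardless of backlog size.
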

### Bug 2: Reverse doesn't clear `organized_at`

- **Severity:** Minor (the reverse + re-trigger workflow was broken)
- **Root cause:** `reverse_operation()` moved files and updated DB paths but didn't clear `organized_at`, so Phase B wouldn't re-trigger placement.
- **Fix:** Added `UPDATE documents SET organized_at = NULL` to `reverse_operation()`.
- **Impact:** Only the reverse → re-place workflow was affected. The normal forward flow is unaffected.
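
A minimal sketch of the database side of this fix, assuming SQLite and a `documents` table with `id`, `file_path`, and `organized_at` columns. The file move and Qdrant update performed by the real `reverse_operation()` are omitted, and `is_pending` and `make_db` are hypothetical helpers added for the demonstration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy documents table holding one already-organized document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, file_path TEXT, organized_at TEXT)"
    )
    conn.execute(
        "INSERT INTO documents (id, file_path, organized_at) VALUES (?, ?, ?)",
        (1, "/library/Civil-Organization/Governance/doc.pdf", "2026-04-13"),
    )
    conn.commit()
    return conn


def reverse_operation(conn: sqlite3.Connection, doc_id: int, ingest_path: str) -> None:
    """Point the record back at the ingest path and clear organized_at."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE documents SET file_path = ?, organized_at = NULL WHERE id = ?",
            (ingest_path, doc_id),
        )


def is_pending(conn: sqlite3.Connection, doc_id: int) -> bool:
    """True when the document would be picked up again by Phase B."""
    row = conn.execute(
        "SELECT organized_at FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row is not None and row[0] is None
```

Restoring the path and clearing the timestamp in a single statement keeps the record from ever being observed in a half-reverted state.
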

---

## Files Modified During Validation

| File | Changes |
|------|---------|
| `/opt/recon/lib/new_pipeline.py` | Phase B query fix + `organized_at` clear in reverse |
| `/opt/recon/lib/status.py` | Added `get_ingest_pending()` method |

Both fixes were synced to local copies at `/home/zvx/projects/recon/lib/`.

---

## Pipeline State After Validation

| Item | State |
|------|-------|
| `new_pipeline.enabled` | false (disabled after test) |
| Watchdog process | killed |
| Test document | Left in place at `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| `_acquired/` | Empty |
| `_ingest/` | Empty |
| `_ingest/_duplicates/` | Empty |
| `_ingest/_failed/` | Empty |
| Total `file_operations` records | 82 (80 from migration + 2 from test) |
| `duplicate_review` records | 0 |

---

## Recommendations

1. **Ready for production:** The two-phase ingest pipeline is functional. Set `new_pipeline.enabled: true` when ready to accept new acquisitions.

2. **Watchdog logging:** Consider calling `setup_logging('recon.pipeline')` at the start of `run_watchdog()` so that logs appear in the main RECON log file even when the watchdog is run standalone via `recon.py pipeline watch`.

3. **Domain expansion:** The `pilot_domain: "Civil Organization"` setting restricts placement to Civil Organization docs. To enable placement for all domains, set `pilot_domain: null` or remove the key.

4. **PeerTube `organized_at`:** The 29,469 complete docs with `organized_at IS NULL` are mostly PeerTube transcripts. Consider bulk-setting `organized_at` for non-PDF docs so that the backlog scanned by `get_unorganized()` stops growing without bound (the new `get_ingest_pending()` query already sidesteps this issue for the pipeline).
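
The `pilot_domain` behaviour described in recommendation 3 could be expressed as a small gate like the one below. This is a sketch of the described semantics only; `placement_allowed` is a hypothetical name and the real check in `new_pipeline.py` is not shown in this report.

```python
from typing import Optional


def placement_allowed(doc_domain: str, pilot_domain: Optional[str]) -> bool:
    """Allow placement for every domain unless a pilot domain is configured."""
    if pilot_domain is None:
        return True  # key removed or set to null: no restriction
    return doc_domain == pilot_domain


# dict.get() returns None for a missing key, so removing the key and
# setting it to null behave the same way.
config = {"pilot_domain": "Civil Organization"}
print(placement_allowed("Civil Organization", config.get("pilot_domain")))  # True
print(placement_allowed("Medicine", config.get("pilot_domain")))            # False
print(placement_allowed("Medicine", {}.get("pilot_domain")))                # True
```
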

---

## Final Verdict

**Stream B (New Library Pipeline): VALIDATED**

All components were tested and are operational:

- Phase A acquisition (watchdog → `_acquired/` → `_ingest/`)
- RECON pipeline integration (extract → enrich → embed)
- Phase B placement (standardized naming from `book_title` → collision ladder → library)
- Qdrant payload updates (`download_url`, `filename`, `original_filename`)
- Reverse operation (full rollback including Qdrant)
- Re-placement after reverse
- Aurora RAG retrieval (citations resolve to new paths)
- Audit trail (`file_operations` table)