Files changed: docs/services/services.md reports/logistics_migration.md reports/post_validation_report.md reports/task_a_aurora_validation.md reports/task_c_watchdog_test.md
Stream B — Post-Migration Validation Report
Date: 2026-04-13
Pipeline version: new_pipeline.py (Stream B v1, with 2 hotfixes applied during testing)
Executive Summary
Both validation tasks passed. The Stream B pipeline is operational:
- Task A (Aurora RAG): All 3 queries returned correct Civil Organization results with updated download_urls. Migration has not broken RAG retrieval.
- Task C (Watchdog Ingest): Full two-phase ingest lifecycle validated end-to-end: acquire → extract → enrich → embed → place → reverse → re-place. Two bugs found and fixed during testing.
Task A — Aurora RAG Retrieval Validation
Verdict: PASS
| Test | Result |
|---|---|
| Query 1: Community governance principles | Relevant Civil Org results returned |
| Query 2: Emergency preparedness organization | Relevant Civil Org results returned |
| Query 3: Dispute resolution frameworks | Relevant Civil Org results returned |
| Download URL resolution (5 tested) | All 5 resolve to files on disk |
| Qdrant vectors have updated paths | YES |
| original_filename populated | YES |
Conclusion: The Phase 4 migration of 80 Civil Organization files has not degraded RAG quality. Qdrant vectors correctly reference the new standardized file paths.
Task C — Watchdog Two-Phase Ingest Test
Verdict: PASS
Test Document
- Input: TestDoc_Civil_Governance_Framework_2024.pdf (2,480 bytes, generated via reportlab)
- Hash: 346a65d9d72550df64490ad8e9998622
- Enriched title: "Civil Governance Framework Analysis"
- Enriched author: "Dr. James Mitchell"
- Domain: Civil Organization / Governance
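The hash above is 32 hexadecimal digits, which is consistent with an MD5 digest, although the report does not name the algorithm. A minimal sketch of computing such a digest, assuming MD5 and using a hypothetical helper name (file_md5 is not taken from the pipeline code):

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Assumption: the pipeline's document hash is MD5. The report only
    shows a 32-hex-digit value, so this is an inference.
    """
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large PDFs; for a 2,480-byte test file a single read would work equally well.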
Phase A (Acquisition)
| Step | Result |
|---|---|
| File detected in _acquired/ | PASS |
| Moved to _ingest/ preserving original name | PASS |
| Catalogue entry created (status=queued) | PASS |
| Documents entry created (status=queued) | PASS |
RECON Pipeline Processing
| Stage | Time | Duration |
|---|---|---|
| Extract | 05:50:40 | 26s |
| Enrich (Gemini) | 05:51:05 | 25s |
| Embed (TEI/Qdrant) | 05:51:25 | 20s |
| Total processing | | ~71s |
Phase B (Library Placement)
| Step | Result |
|---|---|
| Filename standardized from book_title | Civil_Governance_Framework_Analysis.pdf |
| Domain classified | Civil Organization |
| Subdomain classified | Governance |
| Collision step | 1 (base, no collision) |
| File placed in library | Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf |
| DB paths updated | PASS |
| Qdrant payloads updated (2 vectors) | PASS |
| original_filename preserved in Qdrant | PASS |
| file_operations audit entry created | PASS |
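The naming and collision steps in the table can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, the underscore-joining rule is inferred from the single example in this report, and the numeric-suffix ladder for steps after the base name is an assumption (the report only confirms that step 1 is the base name with no collision).

```python
import re
from pathlib import Path


def standardize_filename(title: str, ext: str = ".pdf") -> str:
    """Join the alphanumeric words of a title with underscores.

    Assumption: rule inferred from the report's one example title.
    """
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "_".join(words) + ext


def next_free_path(target_dir: Path, title: str, max_steps: int = 10) -> tuple[Path, int]:
    """Return (path, collision_step) for the first unused filename.

    Step 1 is the base name. Later steps append _2, _3, ... which is a
    hypothetical ladder; the real pipeline's rungs are not documented here.
    """
    base = Path(standardize_filename(title))
    for step in range(1, max_steps + 1):
        name = base.name if step == 1 else f"{base.stem}_{step}{base.suffix}"
        candidate = target_dir / name
        if not candidate.exists():
            return candidate, step
    raise FileExistsError(f"no free name for {base.name} within {max_steps} steps")
```

With an empty target directory, the title "Civil Governance Framework Analysis" yields Civil_Governance_Framework_Analysis.pdf at step 1, matching the table.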
Reverse + Re-place
| Step | Result |
|---|---|
| Reverse moves file back to _ingest/ | PASS |
| DB/Qdrant reverted to _ingest paths | PASS |
| Re-placement produces identical result | PASS |
| file_operations tracks both operations | PASS |
Bugs Found & Fixed
Bug 1: Phase B query overwhelmed by unorganized docs
Severity: Blocker (Phase B would never find new ingest docs)
Root cause: get_unorganized(limit=50) returns oldest 50 unorganized docs out of 29,469 total. PeerTube transcripts fill the entire result set.
Fix: Added get_ingest_pending(ingest_dir, limit) — path-filtered query. Updated ingest_scan() Phase B to use it.
Impact: Without this fix, the watchdog Phase B would never process new acquisitions.
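A path-filtered query of the kind described in the fix might look like the sketch below. The signature here takes an explicit connection, and the SQLite backend, table layout, and column names (id, path, status, organized_at) are all assumptions made for illustration; the real method in status.py may differ.

```python
import sqlite3

# Hypothetical minimal schema, used only to make the sketch runnable.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    path TEXT,
    status TEXT,
    organized_at TEXT
)
"""


def get_ingest_pending(conn: sqlite3.Connection, ingest_dir: str, limit: int = 50):
    """Return unorganized, complete docs whose path lies under ingest_dir.

    Filtering by path prefix keeps older unorganized rows elsewhere
    (e.g. transcripts) from filling the LIMIT before new ingest docs.
    """
    prefix = ingest_dir.rstrip("/") + "/"
    # substr() comparison avoids LIKE, where "_" in "_ingest" is a wildcard.
    cur = conn.execute(
        "SELECT id, path FROM documents "
        "WHERE status = 'complete' AND organized_at IS NULL "
        "AND substr(path, 1, length(?)) = ? "
        "ORDER BY id LIMIT ?",
        (prefix, prefix, limit),
    )
    return cur.fetchall()
```

The prefix comparison is done with substr() rather than LIKE because the underscore in a directory name such as _ingest would otherwise match any single character.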
Bug 2: Reverse doesn't clear organized_at
Severity: Minor (reverse + re-trigger workflow broken)
Root cause: reverse_operation() moved files and updated DB paths but didn't clear organized_at, so Phase B wouldn't re-trigger placement.
Fix: Added UPDATE documents SET organized_at = NULL to reverse_operation().
Impact: Only affects the reverse → re-place workflow. Normal forward flow unaffected.
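The database side of the fix can be sketched as below, assuming a SQLite-backed documents table with id, path, and organized_at columns. The helper name and schema are hypothetical; only the UPDATE statement clearing organized_at comes from the report, and the real reverse_operation() also moves the file and updates Qdrant.

```python
import sqlite3


def revert_document_row(conn: sqlite3.Connection, doc_id: int, ingest_path: str) -> None:
    """Point the row back at its _ingest path and clear organized_at.

    Clearing organized_at is what lets Phase B pick the document up
    again; reverting the path alone leaves it marked as organized.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE documents SET path = ?, organized_at = NULL WHERE id = ?",
            (ingest_path, doc_id),
        )
```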
Files Modified During Validation
| File | Changes |
|---|---|
| /opt/recon/lib/new_pipeline.py | Phase B query fix + organized_at clear in reverse |
| /opt/recon/lib/status.py | Added get_ingest_pending() method |
Both fixes synced to local copies at /home/zvx/projects/recon/lib/.
Pipeline State After Validation
| Item | State |
|---|---|
| new_pipeline.enabled | false (disabled after test) |
| Watchdog process | killed |
| Test document | Left in place at Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf |
| _acquired/ | Empty |
| _ingest/ | Empty |
| _ingest/_duplicates/ | Empty |
| _ingest/_failed/ | Empty |
| Total file_operations records | 82 (80 from migration + 2 from test) |
| duplicate_review records | 0 |
Recommendations
- Ready for production: The two-phase ingest pipeline is functional. Enable new_pipeline.enabled: true when ready to accept new acquisitions.
- Watchdog logging: Consider calling setup_logging('recon.pipeline') at the start of run_watchdog() so logs appear in the main RECON log file even when run standalone via recon.py pipeline watch.
- Domain expansion: The pilot_domain: "Civil Organization" restriction limits placement to Civil Org docs only. To enable for all domains, set pilot_domain: null or remove it.
- PeerTube organized_at: 29,469 complete docs with organized_at IS NULL are mostly PeerTube transcripts. Consider bulk-setting organized_at for non-PDF docs to prevent the get_unorganized() query from growing unbounded (though the new get_ingest_pending() query sidesteps this issue for the pipeline).
Final Verdict
Stream B: New Library Pipeline — VALIDATED
All components tested and operational:
- Phase A acquisition (watchdog → _acquired/ → _ingest/)
- RECON pipeline integration (extract → enrich → embed)
- Phase B placement (standardized naming from book_title → collision ladder → library)
- Qdrant payload updates (download_url, filename, original_filename)
- Reverse operation (full rollback including Qdrant)
- Re-placement after reverse
- Aurora RAG retrieval (citations resolve to new paths)
- Audit trail (file_operations table)