Stream B — Post-Migration Validation Report

Date: 2026-04-13

Pipeline version: new_pipeline.py (Stream B v1, with 2 hotfixes applied during testing)


Executive Summary

Both validation tasks passed. The Stream B pipeline is operational:

  • Task A (Aurora RAG): All 3 queries returned correct Civil Organization results with updated download_url values. The migration has not broken RAG retrieval.
  • Task C (Watchdog Ingest): Full two-phase ingest lifecycle validated end-to-end: acquire → extract → enrich → embed → place → reverse → re-place. Two bugs found and fixed during testing.

Task A — Aurora RAG Retrieval Validation

Verdict: PASS

| Test | Result |
| --- | --- |
| Query 1: Community governance principles | Relevant Civil Org results returned |
| Query 2: Emergency preparedness organization | Relevant Civil Org results returned |
| Query 3: Dispute resolution frameworks | Relevant Civil Org results returned |
| Download URL resolution (5 tested) | All 5 resolve to files on disk |
| Qdrant vectors have updated paths | YES |
| original_filename populated | YES |

Conclusion: The Phase 4 migration of 80 Civil Organization files has not degraded RAG quality. Qdrant vectors correctly reference the new standardized file paths.
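The download URL resolution check above can be repeated with a small script. The following is a minimal sketch, assuming each Qdrant payload carries a `download_url` whose path maps onto a library root on disk. The `/library/` URL prefix and the function name `unresolved_urls` are hypothetical and not taken from the RECON code.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def unresolved_urls(payloads, library_root, url_prefix="/library/"):
    """Return the download_urls that do not map to a file under library_root.

    url_prefix is a hypothetical mount point, not taken from the RECON config.
    """
    root = Path(library_root).resolve()
    missing = []
    for payload in payloads:
        url = payload.get("download_url", "")
        path = unquote(urlparse(url).path)
        if not path.startswith(url_prefix):
            missing.append(url)
            continue
        candidate = (root / path[len(url_prefix):]).resolve()
        # Reject paths that escape the library root, then require a real file.
        if root not in candidate.parents or not candidate.is_file():
            missing.append(url)
    return missing
```

An empty return value corresponds to the "all resolve to files on disk" result in the table.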


Task C — Watchdog Two-Phase Ingest Test

Verdict: PASS

Test Document

  • Input: TestDoc_Civil_Governance_Framework_2024.pdf (2,480 bytes, generated via reportlab)
  • Hash: 346a65d9d72550df64490ad8e9998622
  • Enriched title: "Civil Governance Framework Analysis"
  • Enriched author: "Dr. James Mitchell"
  • Domain: Civil Organization / Governance
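The 32-hex-digit hash above is consistent with an MD5 digest, although this report does not name the algorithm. A minimal sketch of content hashing for duplicate detection, assuming MD5 over the raw file bytes; the function name `file_md5` is hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```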

Phase A (Acquisition)

| Step | Result |
| --- | --- |
| File detected in _acquired/ | PASS |
| Moved to _ingest/ preserving original name | PASS |
| Catalogue entry created (status=queued) | PASS |
| Documents entry created (status=queued) | PASS |
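The Phase A steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the watchdog's actual code: the `documents` table layout and the function name `acquire` are hypothetical, only one of the two database entries is modelled, and duplicate handling is omitted.

```python
import shutil
from pathlib import Path


def acquire(conn, acquired_dir, ingest_dir):
    """Move each file from acquired_dir to ingest_dir and queue it.

    Table and column names are illustrative, not the real RECON schema.
    conn is any DB-API connection using '?' placeholders (e.g. sqlite3).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(path TEXT PRIMARY KEY, original_filename TEXT, status TEXT)"
    )
    ingest = Path(ingest_dir)
    ingest.mkdir(parents=True, exist_ok=True)
    queued = []
    for src in sorted(Path(acquired_dir).iterdir()):
        if not src.is_file():
            continue
        dest = ingest / src.name  # the original name is preserved in Phase A
        shutil.move(str(src), str(dest))
        conn.execute(
            "INSERT INTO documents (path, original_filename, status) "
            "VALUES (?, ?, 'queued')",
            (str(dest), src.name),
        )
        queued.append(dest)
    conn.commit()
    return queued
```

Recording `original_filename` at this point is what allows Phase B to rename the file later without losing its provenance.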

RECON Pipeline Processing

| Stage | Time | Duration |
| --- | --- | --- |
| Extract | 05:50:40 | 26s |
| Enrich (Gemini) | 05:51:05 | 25s |
| Embed (TEI/Qdrant) | 05:51:25 | 20s |
| Total processing | | ~71s |

Phase B (Library Placement)

| Step | Result |
| --- | --- |
| Filename standardized from book_title | Civil_Governance_Framework_Analysis.pdf |
| Domain classified | Civil Organization |
| Subdomain classified | Governance |
| Collision step | 1 (base, no collision) |
| File placed in library | Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf |
| DB paths updated | PASS |
| Qdrant payloads updated (2 vectors) | PASS |
| original_filename preserved in Qdrant | PASS |
| file_operations audit entry created | PASS |
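The naming and collision steps above can be illustrated with a short sketch. It reproduces the observed mapping from the enriched title to `Civil_Governance_Framework_Analysis.pdf`, but the rules are assumptions: the report only shows step 1 of the collision ladder, so the numeric-suffix fallback and both function names are hypothetical.

```python
import re
from pathlib import Path


def standardize_filename(book_title, extension=".pdf"):
    """Turn an enriched title into an underscore-separated filename."""
    words = re.findall(r"[A-Za-z0-9]+", book_title)
    return "_".join(words) + extension


def resolve_collision(directory, filename, max_steps=100):
    """Return (step, path) for the first free name in the ladder.

    Step 1 is the base name; later steps append a numeric suffix, which is
    an assumed fallback and not necessarily what new_pipeline.py does.
    """
    directory = Path(directory)
    stem, suffix = Path(filename).stem, Path(filename).suffix
    for step in range(1, max_steps + 1):
        name = filename if step == 1 else f"{stem}_{step}{suffix}"
        candidate = directory / name
        if not candidate.exists():
            return step, candidate
    raise FileExistsError(f"no free name for {filename} after {max_steps} steps")
```

Returning the step number alongside the path makes it straightforward to record which rung of the ladder was used in the audit entry.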

Reverse + Re-place

| Step | Result |
| --- | --- |
| Reverse moves file back to _ingest/ | PASS |
| DB/Qdrant reverted to _ingest paths | PASS |
| Re-placement produces identical result | PASS |
| file_operations tracks both operations | PASS |

Bugs Found & Fixed

Bug 1: Phase B query overwhelmed by unorganized docs

Severity: Blocker (Phase B would never find new ingest docs)

Root cause: get_unorganized(limit=50) returns the oldest 50 unorganized docs out of 29,469 total. PeerTube transcripts fill the entire result set.

Fix: Added get_ingest_pending(ingest_dir, limit), a path-filtered query. Updated ingest_scan() Phase B to use it.

Impact: Without this fix, the watchdog Phase B would never process new acquisitions.
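The starvation and the path-filtered fix can be reproduced on a toy database. This is a sketch only: it uses an in-memory SQLite table as a stand-in, the column names and function signatures are assumptions rather than the real `status.py` code, and 200 rows stand in for the real backlog. The prefix test uses `substr` rather than `LIKE` because `_` (as in `_ingest`) is a single-character wildcard in SQL `LIKE` patterns.

```python
import sqlite3


def get_unorganized(conn, limit=50):
    """Oldest unorganized docs, regardless of location (the starved query)."""
    rows = conn.execute(
        "SELECT path FROM documents WHERE organized_at IS NULL "
        "ORDER BY created_at LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


def get_ingest_pending(conn, ingest_dir, limit=50):
    """Oldest unorganized docs whose path lies under ingest_dir."""
    prefix = ingest_dir.rstrip("/") + "/"
    rows = conn.execute(
        "SELECT path FROM documents WHERE organized_at IS NULL "
        "AND substr(path, 1, ?) = ? ORDER BY created_at LIMIT ?",
        (len(prefix), prefix, limit),
    )
    return [row[0] for row in rows]


def demo():
    """Build a toy table: a transcript backlog plus one new ingest doc."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (path TEXT, created_at INTEGER, organized_at TEXT)"
    )
    # 200 old transcript rows stand in for the much larger real backlog.
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, NULL)",
        [(f"/peertube/transcript_{i}.txt", i) for i in range(200)],
    )
    conn.execute(
        "INSERT INTO documents VALUES ('/library/_ingest/new.pdf', 1000, NULL)"
    )
    return conn
```

With this data, `get_unorganized(conn, 50)` returns only transcript paths, while `get_ingest_pending(conn, "/library/_ingest", 50)` returns the new document.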

Bug 2: Reverse doesn't clear organized_at

Severity: Minor (reverse + re-trigger workflow broken)

Root cause: reverse_operation() moved files and updated DB paths but didn't clear organized_at, so Phase B wouldn't re-trigger placement.

Fix: Added UPDATE documents SET organized_at = NULL to reverse_operation().

Impact: Only affects the reverse → re-place workflow. Normal forward flow unaffected.
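The database side of the fix amounts to resetting both columns in one statement. A minimal sketch, assuming a `documents` table with `id`, `path`, and `organized_at` columns; the helper name `revert_document_row` is hypothetical, and the file move and Qdrant payload revert that the real reverse_operation() also performs are left out.

```python
def revert_document_row(conn, doc_id, ingest_path):
    """Point the document back at ingest_path and make it placeable again.

    Clearing organized_at is what lets Phase B pick the document up on the
    next scan; updating the path alone leaves it marked as already placed.
    conn is any DB-API connection using '?' placeholders (e.g. sqlite3).
    """
    cursor = conn.execute(
        "UPDATE documents SET path = ?, organized_at = NULL WHERE id = ?",
        (ingest_path, doc_id),
    )
    conn.commit()
    return cursor.rowcount  # number of rows reverted (0 if doc_id is unknown)
```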


Files Modified During Validation

| File | Changes |
| --- | --- |
| /opt/recon/lib/new_pipeline.py | Phase B query fix + organized_at clear in reverse |
| /opt/recon/lib/status.py | Added get_ingest_pending() method |

Both fixes synced to local copies at /home/zvx/projects/recon/lib/.


Pipeline State After Validation

| Item | State |
| --- | --- |
| new_pipeline.enabled | false (disabled after test) |
| Watchdog process | killed |
| Test document | Left in place at Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf |
| _acquired/ | Empty |
| _ingest/ | Empty |
| _ingest/_duplicates/ | Empty |
| _ingest/_failed/ | Empty |
| Total file_operations records | 82 (80 from migration + 2 from test) |
| duplicate_review records | 0 |

Recommendations

  1. Ready for production: The two-phase ingest pipeline is functional. Set new_pipeline.enabled: true when ready to accept new acquisitions.

  2. Watchdog logging: Consider calling setup_logging('recon.pipeline') at the start of run_watchdog() so logs appear in the main RECON log file even when run standalone via recon.py pipeline watch.

  3. Domain expansion: The pilot_domain: "Civil Organization" restriction limits placement to Civil Org docs only. To enable for all domains, set pilot_domain: null or remove it.

  4. PeerTube organized_at: 29,469 complete docs with organized_at IS NULL are mostly PeerTube transcripts. Consider bulk-setting organized_at for non-PDF docs to prevent the get_unorganized() query from growing unbounded (though the new get_ingest_pending() query sidesteps this issue for the pipeline).
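The bulk update suggested in recommendation 4 could look roughly like this. It is a sketch under assumptions: SQLite stands in for the real database, non-PDF documents are identified by file extension on a hypothetical `path` column, and the function name `mark_non_pdf_organized` is invented for illustration. Running it against a copy of the database first would be prudent.

```python
from datetime import datetime, timezone


def mark_non_pdf_organized(conn):
    """Set organized_at on unorganized rows whose path does not end in .pdf.

    Returns the number of rows updated. conn is a sqlite3 connection.
    """
    now = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "UPDATE documents SET organized_at = ? "
        "WHERE organized_at IS NULL AND lower(substr(path, -4)) != '.pdf'",
        (now,),
    )
    conn.commit()
    return cursor.rowcount
```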


Final Verdict

Stream B: New Library Pipeline — VALIDATED

All components tested and operational:

  • Phase A acquisition (watchdog → _acquired/ → _ingest/)
  • RECON pipeline integration (extract → enrich → embed)
  • Phase B placement (standardized naming from book_title → collision ladder → library)
  • Qdrant payload updates (download_url, filename, original_filename)
  • Reverse operation (full rollback including Qdrant)
  • Re-placement after reverse
  • Aurora RAG retrieval (citations resolve to new paths)
  • Audit trail (file_operations table)