auto: docs sync 2026-04-13T12:00:05+00:00

Files changed: docs/services/services.md reports/logistics_migration.md reports/post_validation_report.md reports/task_a_aurora_validation.md reports/task_c_watchdog_test.md
echo6-autocommit 2026-04-13 12:00:05 +00:00
commit abb0bd0b7c
5 changed files with 611 additions and 2 deletions

@@ -0,0 +1,152 @@
# Stream B — Post-Migration Validation Report
**Date:** 2026-04-13
**Pipeline version:** new_pipeline.py (Stream B v1, with 2 hotfixes applied during testing)
---
## Executive Summary
Both validation tasks passed. The Stream B pipeline is operational:
- **Task A (Aurora RAG):** All 3 queries returned correct Civil Organization results with updated `download_url` values. The migration has not broken RAG retrieval.
- **Task C (Watchdog Ingest):** Full two-phase ingest lifecycle validated end-to-end: acquire → extract → enrich → embed → place → reverse → re-place. Two bugs found and fixed during testing.
---
## Task A — Aurora RAG Retrieval Validation
**Verdict: PASS**
| Test | Result |
|------|--------|
| Query 1: Community governance principles | Relevant Civil Org results returned |
| Query 2: Emergency preparedness organization | Relevant Civil Org results returned |
| Query 3: Dispute resolution frameworks | Relevant Civil Org results returned |
| Download URL resolution (5 tested) | All 5 resolve to files on disk |
| Qdrant vectors have updated paths | YES |
| original_filename populated | YES |
**Conclusion:** The Phase 4 migration of 80 Civil Organization files has not degraded RAG quality. Qdrant vectors correctly reference the new standardized file paths.
---
## Task C — Watchdog Two-Phase Ingest Test
**Verdict: PASS**
### Test Document
- **Input:** `TestDoc_Civil_Governance_Framework_2024.pdf` (2,480 bytes, generated via reportlab)
- **Hash:** `346a65d9d72550df64490ad8e9998622`
- **Enriched title:** "Civil Governance Framework Analysis"
- **Enriched author:** "Dr. James Mitchell"
- **Domain:** Civil Organization / Governance
### Phase A (Acquisition)
| Step | Result |
|------|--------|
| File detected in `_acquired/` | PASS |
| Moved to `_ingest/` preserving original name | PASS |
| Catalogue entry created (status=queued) | PASS |
| Documents entry created (status=queued) | PASS |
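The Phase A steps in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in `new_pipeline.py`: the function name `acquire`, the record fields, and the temporary directory layout are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def acquire(acquired_dir: Path, ingest_dir: Path) -> list[dict]:
    """Move every file from acquired_dir into ingest_dir, keeping its name.

    Returns one 'queued' record per moved file. The record is a stand-in for
    the catalogue and documents entries; the real schema is not shown in
    this report, so these fields are hypothetical.
    """
    ingest_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for src in sorted(acquired_dir.iterdir()):
        if not src.is_file():
            continue
        dest = ingest_dir / src.name  # original filename is preserved
        shutil.move(str(src), str(dest))
        records.append(
            {"path": str(dest), "original_filename": src.name, "status": "queued"}
        )
    return records


def demo_acquire() -> list[dict]:
    """Run acquire() against a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as root:
        acquired = Path(root) / "_acquired"
        ingest = Path(root) / "_ingest"
        acquired.mkdir()
        name = "TestDoc_Civil_Governance_Framework_2024.pdf"
        (acquired / name).write_bytes(b"%PDF-1.4 placeholder")
        records = acquire(acquired, ingest)
        # After the move, _acquired/ is empty and the file sits in _ingest/.
        assert not any(acquired.iterdir())
        assert (ingest / name).exists()
        return records
```

Calling `demo_acquire()` returns a single record with `status` set to `"queued"` and the original filename intact.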
### RECON Pipeline Processing
| Stage | Time | Duration |
|-------|------|----------|
| Extract | 05:50:40 | 26s |
| Enrich (Gemini) | 05:51:05 | 25s |
| Embed (TEI/Qdrant) | 05:51:25 | 20s |
| **Total processing** | | **~71s** |
### Phase B (Library Placement)
| Step | Result |
|------|--------|
| Filename standardized from book_title | `Civil_Governance_Framework_Analysis.pdf` |
| Domain classified | Civil Organization |
| Subdomain classified | Governance |
| Collision step | 1 (base, no collision) |
| File placed in library | `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| DB paths updated | PASS |
| Qdrant payloads updated (2 vectors) | PASS |
| original_filename preserved in Qdrant | PASS |
| file_operations audit entry created | PASS |
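The naming and placement steps above might look something like the sketch below. The report only states that step 1 of the collision ladder is the base name; the numeric-suffix fallback used here for later steps is an assumption, as are the helper names `standardize_filename`, `pick_name`, and `library_path`. The space-to-hyphen rule for directory names is inferred from the single path shown in the table.

```python
import re
from pathlib import PurePosixPath


def standardize_filename(title: str, suffix: str = ".pdf") -> str:
    """Join the alphanumeric words of a title with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "_".join(words) + suffix


def pick_name(filename: str, taken: set[str]) -> tuple[str, int]:
    """Return (name, step). Step 1 is the base name.

    Later steps append _2, _3, ... before the extension. This fallback is
    hypothetical; the real ladder's later steps are not described.
    """
    path = PurePosixPath(filename)
    step = 1
    candidate = filename
    while candidate in taken:
        step += 1
        candidate = f"{path.stem}_{step}{path.suffix}"
    return candidate, step


def library_path(domain: str, subdomain: str, filename: str) -> str:
    """Build the library-relative path, hyphenating spaces in directory names."""
    return "/".join(
        [domain.replace(" ", "-"), subdomain.replace(" ", "-"), filename]
    )


base = standardize_filename("Civil Governance Framework Analysis")
name, step = pick_name(base, taken=set())
placed = library_path("Civil Organization", "Governance", name)
```

With an empty `taken` set, `step` is 1 and `placed` matches the path recorded in the table.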
### Reverse + Re-place
| Step | Result |
|------|--------|
| Reverse moves file back to _ingest/ | PASS |
| DB/Qdrant reverted to _ingest paths | PASS |
| Re-placement produces identical result | PASS |
| file_operations tracks both operations | PASS |
---
## Bugs Found & Fixed
### Bug 1: Phase B query overwhelmed by unorganized docs
**Severity:** Blocker (Phase B would never find new ingest docs)
**Root cause:** `get_unorganized(limit=50)` returns the oldest 50 unorganized docs out of 29,469 total. PeerTube transcripts fill the entire result set, so newly ingested docs never appear in it.
**Fix:** Added `get_ingest_pending(ingest_dir, limit)` — path-filtered query. Updated `ingest_scan()` Phase B to use it.
**Impact:** Without this fix, the watchdog Phase B would never process new acquisitions.
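To illustrate why the path filter helps, here is a small self-contained reproduction using an in-memory SQLite database. The table layout, column names, and standalone function signatures are assumptions for the example; the actual `get_ingest_pending()` in `status.py` may be written differently.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a toy documents table: 60 old transcripts plus 1 new ingest doc."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, path TEXT, created_at INTEGER, organized_at TEXT)"
    )
    rows = [(f"/data/peertube/transcript_{i}.txt", i, None) for i in range(60)]
    rows.append(("/library/_ingest/TestDoc.pdf", 1000, None))
    conn.executemany(
        "INSERT INTO documents (path, created_at, organized_at) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def get_unorganized(conn: sqlite3.Connection, limit: int = 50) -> list[str]:
    """Oldest unorganized docs, with no path filter (the failing behaviour)."""
    cur = conn.execute(
        "SELECT path FROM documents WHERE organized_at IS NULL "
        "ORDER BY created_at LIMIT ?",
        (limit,),
    )
    return [row[0] for row in cur]


def get_ingest_pending(
    conn: sqlite3.Connection, ingest_dir: str, limit: int = 50
) -> list[str]:
    """Unorganized docs whose path starts with ingest_dir.

    substr() is used rather than LIKE because '_' is a LIKE wildcard and
    would need escaping in a directory name such as '_ingest'.
    """
    prefix = ingest_dir.rstrip("/") + "/"
    cur = conn.execute(
        "SELECT path FROM documents WHERE organized_at IS NULL "
        "AND substr(path, 1, ?) = ? ORDER BY created_at LIMIT ?",
        (len(prefix), prefix, limit),
    )
    return [row[0] for row in cur]


conn = make_db()
unfiltered = get_unorganized(conn)
pending = get_ingest_pending(conn, "/library/_ingest")
```

In this toy data set the unfiltered query returns 50 transcripts and misses the new document, while the path-filtered query returns only the new document.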
### Bug 2: Reverse doesn't clear organized_at
**Severity:** Minor (reverse + re-trigger workflow broken)
**Root cause:** `reverse_operation()` moved files and updated DB paths but didn't clear `organized_at`, so Phase B wouldn't re-trigger placement.
**Fix:** Added `UPDATE documents SET organized_at = NULL` to `reverse_operation()`.
**Impact:** Only affects the reverse → re-place workflow. Normal forward flow unaffected.
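The effect of the fix can be shown with a minimal sketch, again against an in-memory SQLite table. The schema, the `needs_placement` helper, and the per-document `WHERE id = ?` scoping are assumptions; the report only states that `organized_at` is now set to `NULL` during reverse.

```python
import sqlite3


def reverse_operation(conn: sqlite3.Connection, doc_id: int, ingest_path: str) -> None:
    """Point the doc back at its _ingest path and clear organized_at.

    Clearing organized_at is what lets Phase B pick the doc up again.
    Scoping the UPDATE by id is an assumption made for this sketch.
    """
    conn.execute(
        "UPDATE documents SET path = ?, organized_at = NULL WHERE id = ?",
        (ingest_path, doc_id),
    )
    conn.commit()


def needs_placement(conn: sqlite3.Connection) -> list[int]:
    """Ids Phase B would consider: docs with no organized_at timestamp."""
    cur = conn.execute("SELECT id FROM documents WHERE organized_at IS NULL ORDER BY id")
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT, organized_at TEXT)"
)
conn.execute(
    "INSERT INTO documents (id, path, organized_at) VALUES (?, ?, ?)",
    (1, "Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf",
     "2026-04-13T05:52:00"),
)

before = needs_placement(conn)  # placed doc is not a candidate
reverse_operation(conn, 1, "_ingest/TestDoc_Civil_Governance_Framework_2024.pdf")
after = needs_placement(conn)   # reversed doc is a candidate again
```

Before the reverse, the placed document is invisible to Phase B; afterwards it is eligible for re-placement.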
---
## Files Modified During Validation
| File | Changes |
|------|---------|
| `/opt/recon/lib/new_pipeline.py` | Phase B query fix + organized_at clear in reverse |
| `/opt/recon/lib/status.py` | Added `get_ingest_pending()` method |
Both fixes were synced to local copies at `/home/zvx/projects/recon/lib/`.
---
## Pipeline State After Validation
| Item | State |
|------|-------|
| `new_pipeline.enabled` | false (disabled after test) |
| Watchdog process | killed |
| Test document | Left in place at `Civil-Organization/Governance/Civil_Governance_Framework_Analysis.pdf` |
| `_acquired/` | Empty |
| `_ingest/` | Empty |
| `_ingest/_duplicates/` | Empty |
| `_ingest/_failed/` | Empty |
| Total file_operations records | 82 (80 from migration + 2 from test) |
| duplicate_review records | 0 |
---
## Recommendations
1. **Ready for production:** The two-phase ingest pipeline is functional. Set `new_pipeline.enabled: true` when ready to accept new acquisitions.
2. **Watchdog logging:** Consider calling `setup_logging('recon.pipeline')` at the start of `run_watchdog()` so logs appear in the main RECON log file even when run standalone via `recon.py pipeline watch`.
3. **Domain expansion:** The `pilot_domain: "Civil Organization"` restriction limits placement to Civil Org docs only. To enable for all domains, set `pilot_domain: null` or remove it.
4. **PeerTube organized_at:** Most of the 29,469 complete docs with `organized_at IS NULL` are PeerTube transcripts. Consider bulk-setting `organized_at` for non-PDF docs so the `get_unorganized()` backlog does not keep growing (the new `get_ingest_pending()` query already sidesteps this issue for the pipeline).
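As an illustration of recommendation 3, the pilot restriction could be evaluated along these lines. This is a sketch under assumptions: the function `placement_allowed` and the flat config dictionary are hypothetical, and only the `pilot_domain` key comes from the report.

```python
from typing import Optional


def placement_allowed(doc_domain: str, config: dict) -> bool:
    """Return True if a document in doc_domain may be placed.

    A missing or null pilot_domain means every domain is allowed;
    otherwise only the pilot domain is.
    """
    pilot_domain: Optional[str] = config.get("pilot_domain")
    return pilot_domain is None or doc_domain == pilot_domain


pilot_config = {"pilot_domain": "Civil Organization"}
open_config = {"pilot_domain": None}
```

Under `pilot_config`, only Civil Organization docs pass the check. Under `open_config`, or with the key removed entirely, every domain passes.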
---
## Final Verdict
**Stream B: New Library Pipeline — VALIDATED**
All components tested and operational:
- Phase A acquisition (watchdog → `_acquired/` → `_ingest/`)
- RECON pipeline integration (extract → enrich → embed)
- Phase B placement (standardized naming from book_title → collision ladder → library)
- Qdrant payload updates (download_url, filename, original_filename)
- Reverse operation (full rollback including Qdrant)
- Re-placement after reverse
- Aurora RAG retrieval (citations resolve to new paths)
- Audit trail (file_operations table)