# Phase 2: Shared Filing Function
**Executed:** 2026-04-14T15:15Z
---
## Backup
| Item | Location | MD5 Hash |
|------|----------|----------|
| recon.db (pre-Phase 2) | CT 130: `/tmp/recon.db.phase2.20260414.bak` | `20ec1fec2247a999e7d42f6a716481b0` |
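
Before restoring from this backup, the recorded hash can be re-checked against the file. The following is a minimal sketch using only the standard library; the helper names `md5_of_file` and `backup_matches` are illustrative and not part of the RECON codebase. A call such as `backup_matches("/tmp/recon.db.phase2.20260414.bak", "20ec1fec2247a999e7d42f6a716481b0")` on CT 130 would confirm the backup is intact.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large database files are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True if the file's MD5 equals the recorded hash (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```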
---
## Git Setup (prerequisite work)
`/opt/recon` was not a git repository. Initialized and pushed:
- **Repo:** https://forge.echo6.co/matt/recon (private)
- **Auth:** HTTPS with API token (SSH key on CT 130 was already registered elsewhere in Forgejo)
- **Initial commit:** `563c16b` — full codebase baseline on `master`
- **Refactor branch:** `refactor` created from `master`
---
## What Was Created
### `lib/filing.py` — `file_processed_item()` function
**RECON branch:** `refactor`
**Commit:** `de2c59a`
A shared filing function that any future processor can call to file a completed item from the processing stage into the organized library.
**Signature:**
```python
def file_processed_item(doc_hash, source_file_path, db, config, dry_run=False) -> dict: ...
```
**Return dict keys:** `hash`, `action`, `source_path`, `target_path`, `domain`, `subdomain`, `qdrant_points_updated`, `error`
**Action values:** `filed`, `skip_unclassified`, `skip_already_filed`, `would_file`, `error`
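
For callers that want type checking, the return shape could be described roughly as follows. This is a sketch only: the key names and action values come from the lists above, but the value types (for example, which fields may be `None`) are assumptions, and `FilingResult`, `Action`, and `was_moved` are hypothetical names that do not exist in `lib/filing.py`.

```python
from typing import Literal, Optional, TypedDict

# The five action values listed in this report.
Action = Literal[
    "filed", "skip_unclassified", "skip_already_filed", "would_file", "error"
]


class FilingResult(TypedDict):
    """Assumed shape of the dict returned by file_processed_item()."""
    hash: str
    action: Action
    source_path: str
    target_path: Optional[str]   # assumed None when no target was derived
    domain: Optional[str]        # assumed None when unclassified
    subdomain: Optional[str]
    qdrant_points_updated: int
    error: Optional[str]         # assumed None on success


def was_moved(result: FilingResult) -> bool:
    """Only the 'filed' action means the file actually changed location."""
    return result["action"] == "filed"
```
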
**What it does (in order):**
1. Verifies source file exists
2. Calls `determine_dominant_domain()` to classify from concept JSONs
3. Looks up original filename from catalogue
4. Calls `_build_target_path()` with collision handling
5. Checks idempotency (source == target → skip_already_filed)
6. In dry_run: returns `would_file` without moving
7. Moves file with `shutil.move()`
8. Updates catalogue path, documents path, marks organized
9. Updates Qdrant payloads (download_url, filename, original_filename)
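
The order of these steps can be sketched as a self-contained stand-in. This is not the code in `lib/filing.py`: the real function takes `db` and `config`, whereas this sketch takes three hypothetical callables (`classify`, `build_target`, `record`) so that it runs without the RECON modules. It also skips the catalogue filename lookup (step 3) and uses the source file's own name instead.

```python
import shutil
from pathlib import Path


def file_item_sketch(doc_hash, source_path, classify, build_target, record,
                     dry_run=False):
    """Illustrative filing flow. All three callables are hypothetical:

    classify(doc_hash)                      -> (domain, subdomain) or None
    build_target(domain, subdomain, name)   -> target path
    record(doc_hash, target)                -> number of Qdrant points updated
    """
    source = Path(source_path)
    result = {"hash": doc_hash, "action": "error", "source_path": str(source),
              "target_path": None, "domain": None, "subdomain": None,
              "qdrant_points_updated": 0, "error": None}

    # Step 1: the source file must exist.
    if not source.is_file():
        result["error"] = "source file not found"
        return result

    # Step 2: classification; unclassified items are skipped, not failed.
    classified = classify(doc_hash)
    if classified is None:
        result["action"] = "skip_unclassified"
        return result
    result["domain"], result["subdomain"] = classified

    # Step 4: derive the target path (step 3 is omitted in this sketch).
    target = Path(build_target(result["domain"], result["subdomain"], source.name))
    result["target_path"] = str(target)

    # Step 5: idempotency check.
    if source.resolve() == target.resolve():
        result["action"] = "skip_already_filed"
        return result

    # Step 6: a dry run reports the plan and touches nothing.
    if dry_run:
        result["action"] = "would_file"
        return result

    # Steps 7-9: move the file, then record the new location.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    result["qdrant_points_updated"] = record(doc_hash, target)
    result["action"] = "filed"
    return result
```
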
---
## Dependencies on Existing Code
| Module | Function/Method | Purpose |
|--------|----------------|---------|
| `lib/organizer.py` | `determine_dominant_domain(doc_hash, data_dir)` | Domain classification from concept JSONs |
| `lib/organizer.py` | `_build_target_path(library_root, domain, subdomain, filename, doc_hash)` | Target path with collision handling |
| `lib/new_pipeline.py` | `update_qdrant_payload(doc_hash, new_path, new_filename, original_filename, config)` | Qdrant payload sync |
| `lib/status.py` | `StatusDB.update_catalogue_path(hash, path, filename)` | Catalogue DB update |
| `lib/status.py` | `StatusDB.sync_document_path(hash, path, filename)` | Documents DB update |
| `lib/status.py` | `StatusDB.mark_organized(hash)` | Set organized_at timestamp |
| `lib/status.py` | `StatusDB._get_conn()` | Thread-local SQLite connection |
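
To illustrate what "target path with collision handling" can mean, here is a hypothetical stand-in with the same parameter list as `_build_target_path()`. The directory naming rule (`&` becomes `and`, spaces become hyphens) is inferred from the dry-run output in this report, and the collision rule (append the first 8 characters of the hash) is purely an assumption; the real logic in `lib/organizer.py` may differ on both points.

```python
from pathlib import Path


def dir_name(label):
    """Assumed folder naming: 'Defense & Tactics' -> 'Defense-and-Tactics'."""
    return "-".join(label.replace("&", "and").split())


def build_target_path_sketch(library_root, domain, subdomain, filename, doc_hash):
    """Hypothetical path builder; not the function in lib/organizer.py."""
    folder = Path(library_root) / dir_name(domain) / dir_name(subdomain)
    candidate = folder / filename
    if candidate.exists():
        # Assumed collision rule: disambiguate with a short hash suffix.
        candidate = folder / f"{candidate.stem}-{doc_hash[:8]}{candidate.suffix}"
    return candidate
```
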
---
## Testing
### Import test
```
python3 -c "from lib.filing import file_processed_item; print('Import OK')"
→ Import OK
```
### Dry-run test against real data
Document: `3c8512868fa568a861c7994019ed5e88` (U.S. Army Reconnaissance And Surveillance Handbook)
```
action: would_file
domain: Defense & Tactics
subdomain: Reconnaissance
target_path: /mnt/library/Defense-and-Tactics/Reconnaissance/U.S. Army Reconnaissance And Surveillance Handbook.pdf
qdrant_points_updated: 0 (dry_run — no actual update)
error: None
```
The function correctly classified the document, derived the canonical path, and returned `would_file`. The source path uses underscores and the target uses spaces, so filing would also rename the file slightly.
---
## What Did NOT Change
- **No existing files modified:** `lib/organizer.py`, `lib/status.py`, `lib/new_pipeline.py`, `lib/utils.py`, `recon.py` — all untouched
- **No data modified:** catalogue=29,812, documents=29,812 (unchanged)
- **No service state changed:** `recon.service` and `recon-watchdog.service` both remain inactive
- **Processing directory empty:** No files placed in `/opt/recon/data/processing/`
- **Legacy `organize_document()` untouched** — remains available for existing code paths
---
## Verification
| Check | Result |
|-------|--------|
| catalogue rows | 29,812 |
| documents rows | 29,812 |
| processing/ files | 0 |
| recon.service | inactive |
| recon-watchdog.service | inactive |
| Import test | passed |
| Dry-run test | passed (would_file) |
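
The row-count checks can be repeated with a short script. This is a sketch that assumes the SQLite tables are literally named `catalogue` and `documents`, matching the labels in the table above; the actual schema in `recon.db` was not inspected for this example, and `row_counts` is an illustrative helper name.

```python
import sqlite3


def row_counts(db_path, tables=("catalogue", "documents")):
    """Return {table_name: row_count} for the given tables.

    Table names are assumed to match the labels used in this report.
    """
    conn = sqlite3.connect(db_path)
    try:
        return {
            table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            for table in tables
        }
    finally:
        conn.close()
```

Comparing the result against `{"catalogue": 29812, "documents": 29812}` before and after a phase gives a quick signal that no rows were added or removed.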