refactored-recon/phases/phase-2-shared-filing.md
Matt 2a1d211d7c Phase 2: shared filing function
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:04:13 +00:00


Phase 2: Shared Filing Function

Executed: 2026-04-14T15:15Z


Backup

| Item | Location | MD5 Hash |
|---|---|---|
| recon.db (pre-Phase 2) | CT 130: /tmp/recon.db.phase2.20260414.bak | 20ec1fec2247a999e7d42f6a716481b0 |
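
The recorded hash is only useful if the backup is checked against it before a restore. A minimal sketch of that check, using only the standard library; the helper names `md5_of_file` and `backup_matches` are hypothetical and are not part of the RECON codebase:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_md5):
    """True when the file exists and its MD5 equals the recorded hash."""
    p = Path(path)
    return p.is_file() and md5_of_file(p) == expected_md5.lower()
```

On CT 130 this would be called as `backup_matches("/tmp/recon.db.phase2.20260414.bak", "20ec1fec2247a999e7d42f6a716481b0")`.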

Git Setup (prerequisite work)

/opt/recon was not a git repository. Initialized and pushed:

  • Repo: https://forge.echo6.co/matt/recon (private)
  • Auth: HTTPS with API token (SSH key on CT 130 was already registered elsewhere in Forgejo)
  • Initial commit: 563c16b — full codebase baseline on master
  • Refactor branch: refactor created from master

What Was Created

lib/filing.py: file_processed_item() function

RECON branch: refactor (commit de2c59a)

A shared filing function that any future processor can call to file a completed item from the processing stage into the organized library.

Signature:

def file_processed_item(doc_hash, source_file_path, db, config, dry_run=False) -> dict

Return dict keys: hash, action, source_path, target_path, domain, subdomain, qdrant_points_updated, error

Action values: filed, skip_unclassified, skip_already_filed, would_file, error
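
For callers, the return contract can be written down as a typed dictionary. The sketch below is illustrative only: the key names and action values come from the lists above, but the value types (for example, which fields may be `None`) are assumptions, and `FilingResult` and `summarize` are hypothetical names that do not exist in lib/filing.py.

```python
from typing import Literal, Optional, TypedDict

# The five action values listed in this report.
Action = Literal[
    "filed", "skip_unclassified", "skip_already_filed", "would_file", "error"
]


class FilingResult(TypedDict):
    # Value types are assumed; only the key names come from the report.
    hash: str
    action: Action
    source_path: str
    target_path: Optional[str]
    domain: Optional[str]
    subdomain: Optional[str]
    qdrant_points_updated: int
    error: Optional[str]


def summarize(result: FilingResult) -> str:
    """Build a one-line log message by dispatching on the action value."""
    action = result["action"]
    if action == "error":
        return f"{result['hash']}: error: {result['error']}"
    if action in ("filed", "would_file"):
        return f"{result['hash']}: {action} -> {result['target_path']}"
    # skip_unclassified and skip_already_filed carry no extra detail.
    return f"{result['hash']}: {action}"
```

A processor would typically branch on `action` in this way instead of inspecting `error` alone, since the two skip values are not failures.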

What it does (in order):

  1. Verifies source file exists
  2. Calls determine_dominant_domain() to classify from concept JSONs
  3. Looks up original filename from catalogue
  4. Calls _build_target_path() with collision handling
  5. Checks idempotency (source == target → skip_already_filed)
  6. In dry_run: returns would_file without moving
  7. Moves file with shutil.move()
  8. Updates catalogue path, documents path, marks organized
  9. Updates Qdrant payloads (download_url, filename, original_filename)
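
The ordering of these steps can be sketched as follows. This is not the code in lib/filing.py: the real function takes `db` and `config` and calls the helpers listed under Dependencies, whereas this simplified version takes hypothetical callables (`classify`, `lookup_filename`, `build_target`, `persist`) so that it runs on its own. It also assumes that classification yields a `(domain, subdomain)` pair and that a missing original filename falls back to the source file's name.

```python
import shutil
from pathlib import Path


def file_item_sketch(doc_hash, source_file_path, *, classify, lookup_filename,
                     build_target, persist, dry_run=False):
    """Simplified filing flow; every callable argument is a stand-in."""
    result = {"hash": doc_hash, "action": None,
              "source_path": str(source_file_path), "target_path": None,
              "domain": None, "subdomain": None,
              "qdrant_points_updated": 0, "error": None}
    source = Path(source_file_path)

    # Step 1: the source file must exist.
    if not source.is_file():
        result.update(action="error", error=f"source file not found: {source}")
        return result

    # Step 2: classify; an empty domain means the item cannot be filed yet.
    domain, subdomain = classify(doc_hash)
    if not domain:
        result["action"] = "skip_unclassified"
        return result
    result.update(domain=domain, subdomain=subdomain)

    # Steps 3-4: original filename, then the target path.
    filename = lookup_filename(doc_hash) or source.name
    target = Path(build_target(domain, subdomain, filename, doc_hash))
    result["target_path"] = str(target)

    # Step 5: idempotency, nothing to do if the file is already in place.
    if source.resolve() == target.resolve():
        result["action"] = "skip_already_filed"
        return result

    # Step 6: a dry run reports the plan and touches nothing.
    if dry_run:
        result["action"] = "would_file"
        return result

    # Steps 7-9: move the file, then update the stores via persist().
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        result["qdrant_points_updated"] = persist(doc_hash, target)
        result["action"] = "filed"
    except OSError as exc:
        result.update(action="error", error=str(exc))
    return result
```

Placing the idempotency and dry-run checks before the move means a repeated or dry-run call never reaches `shutil.move()`, which matches the order given in the list.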

Dependencies on Existing Code

| Module | Function/Method | Purpose |
|---|---|---|
| lib/organizer.py | determine_dominant_domain(doc_hash, data_dir) | Domain classification from concept JSONs |
| lib/organizer.py | _build_target_path(library_root, domain, subdomain, filename, doc_hash) | Target path with collision handling |
| lib/new_pipeline.py | update_qdrant_payload(doc_hash, new_path, new_filename, original_filename, config) | Qdrant payload sync |
| lib/status.py | StatusDB.update_catalogue_path(hash, path, filename) | Catalogue DB update |
| lib/status.py | StatusDB.sync_document_path(hash, path, filename) | Documents DB update |
| lib/status.py | StatusDB.mark_organized(hash) | Set organized_at timestamp |
| lib/status.py | StatusDB._get_conn() | Thread-local SQLite connection |

Testing

Import test

python3 -c "from lib.filing import file_processed_item; print('Import OK')"
→ Import OK

Dry-run test against real data

Document: 3c8512868fa568a861c7994019ed5e88 (U.S. Army Reconnaissance And Surveillance Handbook)

action: would_file
domain: Defense & Tactics
subdomain: Reconnaissance
target_path: /mnt/library/Defense-and-Tactics/Reconnaissance/U.S. Army Reconnaissance And Surveillance Handbook.pdf
qdrant_points_updated: 0 (dry_run — no actual update)
error: None

The function correctly classified the document, derived the canonical path, and returned would_file. The source filename uses underscores while the target uses spaces, so filing would also rename the file slightly.


What Did NOT Change

  • No existing files modified: lib/organizer.py, lib/status.py, lib/new_pipeline.py, lib/utils.py, recon.py — all untouched
  • No data modified: catalogue=29,812, documents=29,812 (unchanged)
  • No service state changed: Both services remain inactive
  • Processing directory empty: No files placed in /opt/recon/data/processing/
  • Legacy organize_document() untouched — remains available for existing code paths

Verification

| Check | Result |
|---|---|
| catalogue rows | 29,812 |
| documents rows | 29,812 |
| processing/ files | 0 |
| recon.service | inactive |
| recon-watchdog.service | inactive |
| Import test | passed |
| Dry-run test | passed (would_file) |
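
The two row-count checks can be repeated with a short script. This is a sketch under stated assumptions: it presumes that recon.db contains SQLite tables named `catalogue` and `documents`, as the check names suggest, and `table_counts` is a hypothetical helper, not an existing RECON function.

```python
import sqlite3


def table_counts(db_path, tables=("catalogue", "documents")):
    """Return {table_name: row_count} for the given SQLite database.

    The default table names are assumed from this report. Note that
    sqlite3.connect() creates an empty database if db_path does not exist.
    """
    conn = sqlite3.connect(db_path)
    try:
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

Running it before and after a phase and comparing the two dictionaries gives the same "unchanged" evidence as the table above.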