mirror of
https://github.com/zvx-echo6/refactored-recon.git
synced 2026-05-20 14:44:39 +02:00
First commit of the cleanup log to the repo — previously maintained as an uncommitted working document across sessions. 31 original items triaged. 11 moved to Resolved section with phase references (6a through 6k, PROJECT-BIBLE rewrite, pi-nas decommission). 5 new backlog items added (duplicate consolidation, legacy data/text dirs, backup architecture, signal-archive, Phase 5a edge cases). 4,771 duplicate PDFs marked PARTIALLY RESOLVED (hash-match dupes handled; same-content-different-bytes clusters split to new item).
584 lines
40 KiB
Markdown
# Cleanup Log

A running list of non-blocking issues discovered during the RECON refactor. These are things to handle later: they don't block any current phase, but they shouldn't be forgotten.

Each item has a discovery date, a description, where it came from, and a suggested resolution. As items get handled, they move to the "Resolved" section at the bottom with a date.

This log is updated as new things are noticed. Commit history shows when each item appeared and when each was closed out.

---

## Open

### Cruft in /opt/recon/ - PARTIALLY RESOLVED

**Discovered:** 2026-04-14 (Phase 2)
**Partially resolved:** 2026-04-14 (Phase 6c): 24 .bak files (~800KB) removed during Phase 6c cleanup
**Source:** CC's initial `git add` caught a bunch of files that don't belong in version control

The 24 .bak files identified at the start of the refactor were all manual pre-edit safety backups. All originals are in git history. Phase 6c removed them cleanly:
- 4 recon.py.bak* variants
- 2 config.yaml.bak* variants
- 7 lib/api.py.bak* variants
- 7 other lib/*.py.bak files
- 4 templates/scripts/static .bak files

Still potentially present (not investigated by Phase 6c):
- `-.png` (stray image with hyphen filename)
- `pipeline.log` (log file in working directory)
- Any non-.bak cruft

**Remaining cleanup:** quick once-over for any non-.bak cruft as part of final closeout. Likely just a few files. Not blocking.

---

### `.git-credentials` plaintext token on CT 130

**Discovered:** 2026-04-14 (Phase 2)
**Source:** SSH key on CT 130 wasn't usable for forge auth, so git fell back to HTTPS with an API token

CT 130 has `~/.git-credentials` containing a plaintext Forgejo API token for the matt user. This was needed because the existing SSH key (`zvx@recon-lxc`) was already registered as a non-deploy key elsewhere on forge and couldn't be re-added.

The token works for git operations but is a security risk: it grants full-account access and sits in plaintext on disk.

**Suggested resolution:** at end of refactor, either:
1. Generate a new SSH key on CT 130 specifically for forge access and add it to the matt account
2. Rotate the API token, store the new one in a credential helper that doesn't write plaintext (e.g., libsecret)
3. Convert to a deploy key with limited scope to just the recon and refactored-recon repos

Lowest friction: option 1.

---

### `_unclassified` and `_ingest` staging dirs are non-empty - RESOLVED

**Discovered:** 2026-04-14 (Phase 0 baseline)
**Resolved:** 2026-04-15 (Phase 6j library cleanup)

`_unclassified/` (1,240 PDFs, 6.3G) was fully refiled through the new pipeline: all files were un-processed (removed from DB/Qdrant) and dropped into `acquired/pdf/`, and the directory was recreated empty. `_ingest/_duplicates/` (328 files, 2.4G) was confirmed to contain only hash-match duplicates of content already filed in domain folders and was deleted entirely. Both directories are now clean.

---

### 13 PDFs in `_unclassified/` with stale `embedded_at` and no Qdrant vectors - RESOLVED

**Discovered:** 2026-04-14 (Phase 0 anomaly investigation)
**Resolved:** 2026-04-15 (Phase 6j): subsumed by the full `_unclassified` refile pass

13 PDFs were moved to `/mnt/library/_unclassified/` by the previous day's `sweep_unclassified` operation. The sweep correctly:
- Moved the files
- Updated catalogue paths
- Set `organized_at` timestamps

But it incorrectly left `embedded_at` set in the documents table while deleting the corresponding Qdrant points. Result: 13 docs that claim to be embedded but have zero vectors in Qdrant.

Affected:
- 11 from Survival-Companion-Library
- 1 from Defense-and-Tactics
- 1 from `_ingest`

Files are intact. The state is internally consistent with "this needs re-processing"; the contradiction is only visible if you compare `documents.embedded_at` against Qdrant point counts.

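
The comparison described above could be sketched roughly as follows. This is only an illustration: the `hash` column name and the `vector_counts` mapping (document hash to number of Qdrant points, gathered separately) are assumptions, not the actual RECON schema or code.

```python
import sqlite3


def find_phantom_embedded(conn, vector_counts):
    """Return hashes whose documents row claims an embedding but has no vectors.

    vector_counts is a hypothetical mapping of document hash -> Qdrant point count.
    """
    rows = conn.execute(
        "SELECT hash FROM documents WHERE embedded_at IS NOT NULL ORDER BY hash"
    ).fetchall()
    return [h for (h,) in rows if vector_counts.get(h, 0) == 0]


# Tiny in-memory demonstration with made-up hashes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, embedded_at TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("aaa", "2026-04-13"), ("bbb", "2026-04-13"), ("ccc", None)],
)
# "aaa" has vectors, "bbb" claims to be embedded but has none, "ccc" was never embedded.
phantoms = find_phantom_embedded(conn, {"aaa": 12})
```
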
**Suggested resolution:** these will get re-processed naturally as part of `_unclassified/` triage in Phase 5. No special action needed beyond the broader unclassified cleanup. Verify after re-processing that the `embedded_at` and `organized_at` columns reflect reality.

This is also a data point illustrating the atomicity problem the new architecture solves: any operation that touches Qdrant should update the corresponding DB columns in the same transition.

---

### 4,771 duplicate physical PDF copies on disk - PARTIALLY RESOLVED

**Discovered:** 2026-04-13 (forensic catalogue audit, pre-refactor)
**Partially resolved:** 2026-04-15 (Phase 6j): 398 byte-for-byte hash-match duplicates deleted (Army_Pubs, Acquired, Scenario-Playbooks dupes, loose root). Remaining: same-content-different-bytes duplicates from pre-refactor ingestion. See new backlog item "Pre-refactor library contains duplicate clusters requiring one-time consolidation."

---

### pi-nas 283 GB orphaned NFS export - RESOLVED

**Discovered:** 2026-04-13 (during B.3a library reorganization)
**Resolved:** 2026-04-15: pi-nas `/export/library` wiped (283G deleted), NFS + SMB shares removed via OMV web UI, shared folder and fstab bind-mount entry removed. Pi-nas now designated as backup target (planned, not yet configured).

---

### 277 STATE 2 PeerTube transcripts deferred at `/opt/recon/data/text/` - RESOLVED

**Discovered:** 2026-04-13 (Phase A transcript migration)
**Resolved:** 2026-04-15 (Phase 6h): 283 zero-vector transcripts deleted (DB rows, concepts, local text, Qdrant entries). 1,198 orphan dirs in `data/text/` also deleted (269 MB freed). PeerTube transcription re-triggered for 332 videos without captions via `POST /api/v1/videos/{uuid}/captions/generate`. Runner on cortex processes Whisper jobs; peertube-acq picks up new captions automatically.

---

### `recon-watchdog.service` and `recon.service` still systemd-enabled

**Discovered:** 2026-04-14 (Phase 0 baseline + Phase 0 watchdog stop)
**Source:** `systemctl is-enabled` checks during Phase 0

Both services are stopped but still `enabled` in systemd, meaning a host reboot or manual intervention would start them. This is intentional for now (we don't want to forget to re-enable them after the refactor), but it's a footgun if anything triggers a service start before the cutover (Phase 5).

**Suggested resolution:** decide before Phase 5 whether to disable both services and re-enable at cutover, or leave enabled and rely on the stop holding. Belt-and-suspenders argues for disable. Document the decision in `decisions.md`.

---

### PROJECT-BIBLE.md is stale - RESOLVED

**Discovered:** 2026-04-14 (codebase review)
**Resolved:** 2026-04-15: complete rewrite of PROJECT-BIBLE.md in refactored-recon repo. Covers full post-refactor architecture (pipeline lifecycle, 3 content types, 7 daemon threads, data locations, domain taxonomy, config, CLI, API, refactor history through Phase 6k, operational runbook, known gotchas). Follow-up commit corrected storage topology (LXC bind-mount, not NFS).

---

### Phase 3 noise: dispatcher logs errors for unregistered processors

**Discovered:** 2026-04-14 (Phase 3 test run)
**Source:** dispatcher logs during end-to-end test

The dispatcher iterates over all subfolders configured in `pipeline.dispatch` and tries to import the processor module for each. In Phase 3 only `transcript_processor` exists. Imports for `pdf_processor` and `html_processor` fail and the dispatcher logs them as ERROR-level messages. The dispatcher correctly skips the failing subfolders and continues, but the log noise is misleading: the messages look like real errors when they are really "expected absence."

**Suggested resolution:** in the Phase 4 prompt (when we add pdf_processor), have CC change the dispatcher to skip subfolders where the registered processor module doesn't import, logging at DEBUG level instead of ERROR. Or: only try to import a processor when there's actually content in its subfolder waiting to be processed (lazy import). Either approach is fine.

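
A minimal sketch of the first approach (skip and log at DEBUG), assuming the dispatch config boils down to a subfolder-to-module-name mapping. The function names and the example map are hypothetical, not the dispatcher's real code. One caveat of this design: catching `ImportError` also hides a processor that exists but fails on a missing dependency, so the DEBUG message keeps the exception text.

```python
import importlib
import logging

logger = logging.getLogger("dispatcher")


def load_processor(module_name):
    """Import a processor module, or return None if it is not available yet."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Expected absence: the processor for this subfolder has not been written.
        logger.debug("processor %s not importable, skipping: %s", module_name, exc)
        return None


def load_processors(dispatch_map):
    """Map subfolder -> imported module, omitting subfolders with no processor."""
    loaded = {}
    for subfolder, module_name in dispatch_map.items():
        module = load_processor(module_name)
        if module is not None:
            loaded[subfolder] = module
    return loaded


# Hypothetical dispatch map; the stdlib "json" module stands in for a processor that exists.
processors = load_processors({"stream": "json", "pdf": "recon_missing_pdf_processor"})
```
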
---
### Stale processing/ and concepts/ directories cause re-run failures

**Discovered:** 2026-04-14 (Phase 3 test run)
**Source:** the first enricher run on the test transcript failed because `/opt/recon/data/concepts/{hash}/` already existed from the original ingestion months ago

The transcript processor moves files into `/opt/recon/data/processing/{hash}/` but does not check for or clean up any pre-existing scratch state (concepts dir, partial processing dir). When a hash has been processed before (as in the unprocess-and-test case), the stale state collides with the new run.

**Suggested resolution:** add a pre-flight cleanup step to the transcript processor (and PDF processor when we build it) that removes any existing `processing/{hash}/` and `concepts/{hash}/` directories before starting. This is safe because the contract is "the processor owns the processing directory for this hash for the duration of processing." Pre-existing scratch is by definition stale.

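
The pre-flight step could look roughly like this. It is a sketch under the assumption that both scratch trees live under one data root as `processing/{hash}/` and `concepts/{hash}/`; the function name is hypothetical and the demonstration runs against a throwaway directory, not `/opt/recon/data/`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_stale_scratch(data_root, doc_hash):
    """Remove leftover processing/ and concepts/ dirs for one hash; return what was removed."""
    removed = []
    for subdir in ("processing", "concepts"):
        target = Path(data_root) / subdir / doc_hash
        if target.is_dir():
            # Errors propagate on purpose: a failed cleanup should stop the run.
            shutil.rmtree(target)
            removed.append(subdir)
    return removed


# Demonstration: only a stale concepts dir exists for this made-up hash.
root = Path(tempfile.mkdtemp())
(root / "concepts" / "abc123").mkdir(parents=True)
first = clear_stale_scratch(root, "abc123")
second = clear_stale_scratch(root, "abc123")  # nothing left, so a no-op
```
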
Lower priority: this only matters in the unprocess-and-test workflow. In normal operation, brand-new content won't have stale state. But fixing it makes the migration in Phase 5 safer because all 277 STATE 2 transcripts have stale concept directories.

---

### `lib/embedder.py` and `lib/enricher.py` were root-owned

**Discovered:** 2026-04-14 (Phase 3 code edits)
**Source:** CC tried to edit them and got permission denied

Two files in `/opt/recon/lib/` were owned by root:root instead of zvx:zvx. CC chowned them to zvx during Phase 3. The rest of `lib/` should be audited for ownership consistency, since anything that's not zvx:zvx is a footgun for future edits.

**Suggested resolution:** quick audit of `/opt/recon/lib/` and `/opt/recon/scripts/` ownership. `find /opt/recon -not -user zvx` would show anything anomalous. Fix any stragglers with chown. Probably hours of work if there are many; minutes if just a few. Worth doing during Phase 6 cleanup pass.

---

### Phase 3 test artifacts left on disk

**Discovered:** 2026-04-14 (Phase 3 test, expected)
**Source:** copy-and-unprocess test workflow

The Phase 3 test left two artifacts on disk that we deliberately did not clean up:

1. `/opt/recon/data/text/172f39ae7fc6f5b02e0fabcea450c0e4/`: the original processed form of the test transcript (one of the 277 STATE 2). We used copy-and-unprocess so this is intact.
2. `/opt/recon/data/processing/172f39ae7fc6f5b02e0fabcea450c0e4/`: the new pipeline's processing dir for the same hash. The transcript was filed as `skip_unclassified` so the file never moved out.

Both are useful diagnostic state and harmless. They'll get cleaned up either by Phase 5's bulk migration of the 277 STATE 2 transcripts, or by Phase 6 cleanup.

**Suggested resolution:** none for now. Track that they exist so they don't surprise us later.

---

### Phase 5 design note: 277 STATE 2 transcripts likely all empty content

**Discovered:** 2026-04-14 (Phase 3 investigation + Phase 3 test result)
**Source:** all 277 STATE 2 transcripts have `concepts_extracted = 0`; the Phase 3 test transcript was a YouTube channel announcement that correctly produced zero concepts

The 277 STATE 2 transcripts that we planned to "reprocess through the new pipeline in Phase 5" are likely all thin content: channel announcements, intros, Q&A clips, short reactions. These have dialogue but no extractable knowledge. The original pipeline correctly identified that they had no concepts to embed and marked them complete with 0 vectors.

**Phase 5 design implication:** we should NOT reprocess all 277 through the full pipeline. They'll all hit `skip_unclassified` and waste cycles. Instead, mark them as organized with a "no concepts, intentionally skipped" state, leave them where they are (or move them to an archive folder), and skip the reprocessing step entirely.

This needs to be reflected in `migration-plan.md` when we get to Phase 5 detailed planning.

**Suggested resolution:** when Phase 5 is being scoped, update the migration plan to handle the 277 differently than originally proposed. Possible approaches:
- SQL update to set `organized_at` and add a flag like `skipped_no_concepts` on the documents row
- Move the original `/opt/recon/data/text/{hash}/` directories to an archive location for retention
- Document in PROJECT-BIBLE.md that empty-concept content is an intentional state, not a bug

---

### Filing function hardcodes `.pdf` extension regardless of content type

**Discovered:** 2026-04-14 (Phase 3 re-test)
**Source:** the soldering transcript test was filed as `How to Solder Copper Pipe in a Wall (Complete Guide) GOT2LEARN.pdf`, a text file with a `.pdf` extension

The `_build_target_path()` helper in `lib/organizer.py` (called by `lib/filing.py`'s `file_processed_item`) hardcodes `.pdf` as the extension when constructing the canonical library filename. This was correct for the old PDF-only pipeline but is wrong now that transcripts and (eventually) HTML content flow through the same filing path.

Concrete consequences if left unfixed:
- Browsers and file managers will try to open transcript files as PDFs and fail
- Download URLs in Qdrant point at `.pdf` files that don't render
- Dashboard search results show `.pdf` icons next to transcripts
- The PDF processor in Phase 4 will use the same helper, so the `_build_target_path()` collision logic will compare `.pdf` against `.pdf` even when one file is a real PDF and the other is a transcript pretending to be one. That is a real correctness issue when level-2/3/4 name dedupe runs

**Suggested resolution:** fix as part of the Phase 4 prompt scope. Either:
- Have `_build_target_path()` accept an `extension` parameter and pass it through from the calling processor
- Or: derive the extension from the source file path (`os.path.splitext(source_file_path)[1]`) inside `file_processed_item()`

Second option is simpler and self-documenting. Each processor's source file already has the right extension when it arrives in `_processing/`, so the filing function can just preserve it.

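
A sketch of the second option. The helper name and signature here are hypothetical (the real `_build_target_path()` and `file_processed_item()` signatures are not shown in this log), and the source paths are made up; the only point is that the extension follows the source file instead of being hardcoded.

```python
import os


def build_target_filename(canonical_title, source_file_path):
    """Build the library filename, keeping the extension of the source file."""
    # splitext returns ".txt", ".pdf", or "" when the source has no extension.
    extension = os.path.splitext(source_file_path)[1]
    return f"{canonical_title}{extension}"


transcript_name = build_target_filename(
    "How to Solder Copper Pipe in a Wall (Complete Guide) GOT2LEARN",
    "/opt/recon/data/processing/abc123/transcript.txt",
)
pdf_name = build_target_filename(
    "Some Manual", "/opt/recon/data/processing/def456/manual.pdf"
)
```
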
Phase 4 also needs to back-fix the Phase 3 re-test artifact: rename `/mnt/library/Shelter-and-Construction/Plumbing/How to Solder Copper Pipe in a Wall (Complete Guide) GOT2LEARN.pdf` to `.txt`, and update catalogue + documents + Qdrant payloads to match. One file, one hash, three places to update.

---

### Domain classification is non-deterministic across runs

**Discovered:** 2026-04-14 (Phase 4 end-to-end test)
**Source:** re-running the hydroelectric PDF through the new pipeline put it at `Power-Systems/*` instead of the baseline `Off-grid-Systems/Hydroelectric-Systems/*`

The domain classifier (`determine_dominant_domain()`) reads all concept JSONs for a document and picks the most common domain tag. Gemini enrichment is non-deterministic (the same text produces different concept counts and slightly different tags on different runs), so the winning domain can shift between runs of the same content.

In the Phase 4 test, the hydroelectric book had 26 concepts in the new run vs 20 in the baseline. Enough of the new concepts were tagged `Power Systems` instead of `Off-grid Systems` to flip the domain winner.

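
To make the flip concrete, here is a toy majority-vote classifier. It is not the real `determine_dominant_domain()`; the `domain` key and the per-tag splits are invented for illustration, and only the totals (20 and 26 concepts) come from the test described above.

```python
from collections import Counter


def dominant_domain(concepts):
    """Return the most common domain tag across a document's concepts."""
    counts = Counter(c["domain"] for c in concepts)
    return counts.most_common(1)[0][0]


def make_concepts(off_grid, power):
    """Build a fake concept list with the given number of tags per domain."""
    return ([{"domain": "Off-grid Systems"}] * off_grid
            + [{"domain": "Power Systems"}] * power)


# Hypothetical splits: 11 + 9 = 20 concepts in the baseline, 12 + 14 = 26 in the re-run.
baseline = dominant_domain(make_concepts(off_grid=11, power=9))
rerun = dominant_domain(make_concepts(off_grid=12, power=14))
```

With these made-up splits, a handful of extra `Power Systems` tags is enough to change the winner even though the `Off-grid Systems` count barely moved.
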
**Neither classification is wrong.** Both are defensible categories for a hydroelectric book. But the implication is real: **reprocessing existing content through the new pipeline will shuffle some files to different library locations.**

**Phase 5 design implication:** when we resweep the 18,855 transcripts to file them by domain instead of by source, the classifications will be driven by the existing concept JSONs (no new enrichment) so they'll be stable. Good.

But for any future reprocessing that re-runs enrichment, we should be explicit that "reprocessing is not idempotent for domain placement." Either:
1. Accept the shuffling as a feature: documents migrate toward their most-current best classification
2. Pin classifications to their first-computed value by storing `classified_domain` on the documents row and never recomputing

Option 1 is simpler and matches the spirit of "let the system learn over time." Option 2 is more predictable but adds state that has to be invalidated somehow.

**Suggested resolution:** document this behavior in the new PROJECT-BIBLE.md during Phase 6. No code change needed unless we decide option 2 is worth the complexity. For Phase 5 specifically, use the existing concept JSONs (don't re-enrich) so classifications stay stable during the resweep.

---

### Sidecar contract is implicit per content type

**Discovered:** 2026-04-14 (Phase 4 Fix 1.5 investigation)
**Source:** the dispatcher was originally written assuming "content + sidecar pair" was universal, but the PDF processor doesn't need a sidecar

Different processors have different sidecar requirements:
- **Transcript processor:** requires `.meta.json` sidecar. Metadata is born at the PeerTube API call and cannot be derived from the transcript text.
- **PDF processor:** sidecar is optional. Metadata is derivable from the PDF itself (internal dict, filename, Gemini extraction).
- **Future processors:** TBD per content type.

The dispatcher was relaxed in Phase 4 (Fix 1.5) to accept both paired and solo content files. But there's no formal documentation or type annotation that says "content type X requires/doesn't require a sidecar." A future processor author could easily assume the wrong thing.

**Suggested resolution:** add a module-level constant to each processor module declaring its sidecar expectation. Something like:

```python
# In lib/processors/transcript_processor.py
SIDECAR_REQUIRED = True

# In lib/processors/pdf_processor.py
SIDECAR_REQUIRED = False
```

The dispatcher reads this when loading each processor and enforces the contract accordingly: if a processor declares `SIDECAR_REQUIRED = True` and a solo content file appears in its subfolder, the dispatcher skips it (or moves it to review). Makes the contract explicit and self-documenting.

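
The dispatcher-side check might be sketched like this. The `<stem>.meta.json` naming, the `accepts()` helper, and the lenient default for processors that don't declare the constant are all assumptions made for the example; `SimpleNamespace` objects stand in for the real processor modules.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def sidecar_path(content_path):
    """Sidecar lives next to the content file as <stem>.meta.json (assumed naming)."""
    return content_path.with_suffix(".meta.json")


def accepts(processor, content_path):
    """True if the content file satisfies the processor's declared sidecar contract."""
    required = getattr(processor, "SIDECAR_REQUIRED", False)
    return (not required) or sidecar_path(content_path).exists()


# Stand-ins for the real processor modules.
transcript_processor = SimpleNamespace(SIDECAR_REQUIRED=True)
pdf_processor = SimpleNamespace(SIDECAR_REQUIRED=False)

inbox = Path(tempfile.mkdtemp())
solo = inbox / "a1.txt"
solo.write_text("transcript without metadata")
paired = inbox / "b2.txt"
paired.write_text("transcript with metadata")
(inbox / "b2.meta.json").write_text("{}")

results = (
    accepts(transcript_processor, solo),    # solo file, sidecar required
    accepts(transcript_processor, paired),  # paired file, sidecar required
    accepts(pdf_processor, solo),           # solo file, sidecar optional
)
```
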
Low priority. Phase 4 works correctly as-is. Worth doing before the next content type is added.

---

### /tmp/recon_phase4/ clutter on cortex

**Discovered:** 2026-04-14 (Phase 4 execution)
**Source:** CC writes Python files locally on cortex, scp's them to CT 130, leaves originals in /tmp

Leftover files on cortex in /tmp/recon_phase4/:
- `pdf_processor.py` (656 lines; the source-of-record before it was scp'd to CT 130)
- `setup_test.py` (77 lines; the test setup helper)

Same pattern exists in /tmp/recon_phase3/ and /tmp/recon_phase2/ from earlier phases.

These are harmless: /tmp gets cleared on reboot, and cortex isn't the canonical source anyway (CT 130 is). But it's clutter.

**Suggested resolution:** none. Let /tmp clear naturally. If we care later, add a cleanup step to each phase prompt that removes the /tmp scratch dir after successful commit and push.

---

### CC cloned refactored-recon to the wrong location during Phase 4

**Discovered:** 2026-04-14 (Phase 4 doc commit step)
**Source:** CC couldn't find the existing clone and cloned fresh into `/tmp/refactored-recon/`

The canonical refactored-recon clone on cortex is at `/home/zvx/projects/repos/recon_refactor/`. CC searched in the wrong places (`/opt/refactored-recon/`, `/home/zvx/projects/refactored-recon/`) and didn't find it, then cloned fresh into `/tmp/refactored-recon/`.

Consequences:
- The Phase 4 doc was committed from `/tmp/refactored-recon/` by `Ubuntu <zvx@cortex.echo6.co>`, a different committer identity than earlier commits
- The /tmp clone persists and is now a second working copy that could drift from the canonical one
- The canonical clone at `/home/zvx/projects/repos/recon_refactor/` is one commit behind forge master until the next pull

**Suggested resolution:** in the Phase 5 prompt, explicitly tell CC the canonical path: `/home/zvx/projects/repos/recon_refactor/`. Before committing any phase doc, CC should `cd` there, `git pull`, then make the edit. After Phase 5, delete `/tmp/refactored-recon/` so there's only one clone.

Minor but worth fixing because committer identity drift is annoying to clean up later.

---

### `shutil.rmtree(ignore_errors=True)` is a footgun

**Discovered:** 2026-04-14 (Phase 5b execution)
**Source:** the first deletion pass in Phase 5b claimed to delete 2,259 source directories but silently did nothing. Deletions run as root were being denied (attributed at the time to NFS root_squash, an explanation the file-ownership item below corrects), and `ignore_errors=True` made every failure invisible

The `delete_and_clean.py` script used `shutil.rmtree(source_dir, ignore_errors=True)` to remove transcript directories from `/mnt/library/_sources/streamecho6/`. The script completed without logging any errors. Post-execution verification then showed the directories still existed, which is how the permissions issue was caught.

**Lesson:** `ignore_errors=True` should almost never be used in destructive operations. It's fine for "this file may not exist, don't crash" cases but not for "I am expecting to delete this." The correct pattern is explicit error handling where failures are logged and counted, with a post-execution assertion that no deletions failed.

**Suggested resolution:** when writing future phase scripts that use `shutil.rmtree` or similar destructive helpers, default to `ignore_errors=False` and wrap in try/except that logs the failure with its path. Add a counter for "deletion failures" that shows up in the final summary. No silent failures.

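
One way the pattern could look in a phase script. This is a sketch, not the `delete_and_clean.py` code: the function name and logger name are hypothetical, and the demonstration uses a throwaway directory where one target is deliberately missing so that a failure gets counted.

```python
import logging
import shutil
import tempfile
from pathlib import Path

logger = logging.getLogger("phase_cleanup")


def delete_dirs(paths):
    """Delete each directory, logging and counting failures instead of hiding them."""
    deleted, failed = 0, []
    for path in paths:
        try:
            shutil.rmtree(path)  # ignore_errors defaults to False
            deleted += 1
        except OSError as exc:
            logger.error("failed to delete %s: %s", path, exc)
            failed.append(str(path))
    return deleted, failed


root = Path(tempfile.mkdtemp())
present = root / "present"
present.mkdir()
missing = root / "missing"  # never created, so rmtree raises FileNotFoundError

deleted, failed = delete_dirs([present, missing])

# Post-execution verification: check the filesystem, not the script's own claims.
still_there = [p for p in (present, missing) if p.exists()]
```
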
Not a code change to existing RECON; it's a pattern note for future phase prompts.

---

### File ownership on /mnt/library/: run as zvx (updated: not NFS root_squash)

**Discovered:** 2026-04-14 (Phase 5b execution)
**Updated:** 2026-04-15: the NFS root_squash explanation was wrong. `/mnt/library/` is NOT NFS; it's an LXC bind-mount from the data host's local SSD. The "run as zvx" guidance is still correct but for a different reason: files are owned by zvx, the service runs as zvx, and running ad-hoc operations as root creates root-owned files that the service can't manage on subsequent runs. This was hit in Phase 5c-2.

**Rule unchanged:** do not use sudo for file operations on `/mnt/library/` or `/opt/recon/data/`. Use sudo ONLY for systemctl and privileged system operations.

---

### CC's habitual use of `sudo` on /opt/recon/ is unnecessary and inconsistent with the zvx-ownership model

**Discovered:** 2026-04-14 (Phase 5c-1 execution)
**Source:** CC used `sudo` for every operation on /opt/recon/ during Phase 5c-1 edits, despite files being owned by zvx

The refactor has been chown'ing files to zvx:zvx as needed (first noted in Phase 3 when lib/enricher.py and lib/embedder.py were root-owned). Most of lib/ is now zvx-owned, and CC's Phase 5c-1 edits to lib/dispatcher.py, lib/filing.py, and recon.py did not require sudo. CC used it anyway out of habit.

This isn't a correctness issue for edits to existing local-disk files like those in /opt/recon/: sudo just runs the command as root and the result is the same. But it's inconsistent with the Phase 5b rule ("don't use sudo for operations that touch NFS") and it obscures which operations actually need root. It also means files edited through sudo show modifications made as root, which can confuse file-ownership audits.

**Suggested resolution:** in Phase 5c-2 and beyond, tell CC explicitly: "Run Python and file edits as zvx. Use sudo ONLY for systemctl and for chown/chmod operations that need it." Operations on /opt/recon/ code files, /opt/recon/data/ database files, and the /mnt/library/ filesystem should all run as zvx.

If CC finds a file it can't write because it's root-owned, that file needs to be chown'd to zvx as a one-off fix, and the fix logged here. Don't work around ownership with sudo.

---

### CT 130 git push uses HTTPS with no stored credentials

**Discovered:** 2026-04-14 (Phase 5c-1 commit step)
**Source:** when CC tried to `git push origin refactor` from /opt/recon/, it failed because the origin was set to HTTPS and no credentials were configured

Git remote config on /opt/recon/:
- Originally HTTPS (from Phase 2 bootstrap when we couldn't use SSH due to key registration issues)
- CC tried to switch to SSH, but CT 130's SSH key isn't in matt's forge account
- Fell back to embedding the API token in the URL: `https://matt:<token>@forge.echo6.co/matt/recon.git`
- After push, cleaned up the URL back to plain HTTPS, which means the next push from CT 130 will hit the same problem

This is the second time this has come up (first was Phase 2 when the repo was first created). Root cause is still the SSH key registration issue on forge: CT 130's key was already registered as a non-deploy key elsewhere and can't be re-added to the matt account.

**Suggested resolution:** pick one of these and stop patching around it:

1. **Generate a new SSH key specifically for matt's account on CT 130** and add it to the matt account on forge. Remove the original `zvx@recon-lxc` key if it's obsolete. Set `/opt/recon/` remote to `ssh://git@forge.echo6.co:2222/matt/recon.git`. Clean, permanent, matches how refactored-recon works from cortex.

2. **Use `git credential-store`** on CT 130 with the API token. Set remote back to HTTPS, cache credentials in `~/.git-credentials` with restrictive permissions. Token stays on disk in plaintext (already an open cleanup item).

3. **Do all commits from cortex instead of CT 130.** Would require CC to always copy the modified files back to cortex over SSH and commit from there. More SSH round-trips but keeps credentials off CT 130 entirely.

My lean: option 1. It's the cleanest and matches existing patterns. The key registration issue is solvable by just generating a new key pair specifically for this purpose. This could be done during Phase 6 cleanup or earlier if we get annoyed by the friction.

For Phase 5c-2: the service start doesn't require any git push, so the issue won't come up in that phase. But Phase 6 will definitely hit it again, so earlier is better.

---

### Dashboard shows transcripts as "Untitled" / "WEB" - RESOLVED

**Discovered:** 2026-04-14 (Phase 5c-2 drain monitoring)
**Resolved:** 2026-04-14 (Phase 6b): two fixes in `lib/api.py`. Title display fixed via COALESCE with `catalogue.filename` for transcripts; type badge fixed by adding a `transcript` branch keyed on `source='stream.echo6.co'`. Added `.badge-transcript` CSS (purple).

---

### Transcripts don't get filed into library tree: they stay in /opt/recon/data/processing/

**Discovered:** 2026-04-14 (Phase 5c-2 drain post-verification)
**Resolution:** Phase 6a: accept as correct behavior. Transcripts stay in processing/, watch URL is their permanent download_url.

After Phase 5c-2's drain, all 2,259 newly-completed transcripts had `organized_at IS NULL` because the filing worker's query filter (`path LIKE '/opt/recon/data/processing/%'`) didn't match them: their `documents.path` is the PeerTube watch URL, not a filesystem path.

My initial instinct was to "fix" this by making the filing worker pick up transcripts and file them to `library/Domain/Subdomain/*.txt` like PDFs. That would have lost the watch URL (filing overwrites catalogue.path and Qdrant download_url with the library path).

**The correct resolution is Option 3:** transcripts are NOT files the library stores. They're extracted text from videos that live on PeerTube. The library's `Domain/Subdomain/` tree is for primary source documents (PDFs, manuals, books). Transcripts are derived data whose canonical source is the video, and whose searchable form is the Qdrant vectors.

Phase 6a implements Option 3:
1. Transcript processor sets `organized_at = CURRENT_TIMESTAMP` at the end of successful pre_flight
2. Back-fill SQL sets `organized_at` on the 2,259 already-completed transcripts
3. Watch URL stays in catalogue.path and Qdrant download_url, so users clicking search results go to the PeerTube video
4. Transcripts remain at `/opt/recon/data/processing/{hash}/` as their permanent home
5. Filing worker query filter naturally excludes transcripts, so no worker changes are needed

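
For illustration, the back-fill in step 2 might look something like the statement below. The log does not show the actual SQL, so the table layout, the use of `embedded_at` as the "completed" marker, and the watch-URL pattern are all assumptions; the demonstration runs against an in-memory database with made-up rows.

```python
import sqlite3

BACKFILL_SQL = """
UPDATE documents
SET organized_at = CURRENT_TIMESTAMP
WHERE organized_at IS NULL
  AND embedded_at IS NOT NULL          -- assumed marker for "completed"
  AND path LIKE 'https://%/w/%'        -- assumed shape of a PeerTube watch URL
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (hash TEXT PRIMARY KEY, path TEXT, "
    "embedded_at TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        # Completed transcript: should be back-filled.
        ("t1", "https://stream.echo6.co/w/abc", "2026-04-14", None),
        # Transcript still in flight: left alone.
        ("t2", "https://stream.echo6.co/w/def", None, None),
        # PDF in processing/: left for the filing worker.
        ("p1", "/opt/recon/data/processing/p1/book.pdf", "2026-04-14", None),
    ],
)
updated = conn.execute(BACKFILL_SQL).rowcount
organized = [
    h for (h,) in conn.execute(
        "SELECT hash FROM documents WHERE organized_at IS NOT NULL ORDER BY hash"
    )
]
```
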
**Semantic drift to document:** `/opt/recon/data/processing/` used to mean "transient scratch for in-flight items." After Phase 6a, it also holds at-rest transcript content. Acceptable trade-off vs adding a new `_transcripts/` directory and another move operation per item. PROJECT-BIBLE.md should reflect this in Phase 6.

---

### Phase 5a retroactively filed 16,596 transcripts into library/Domain/Subdomain/ - RESOLVED

**Discovered:** 2026-04-14 (Phase 6a design conversation)
**Resolved:** 2026-04-15 (Phase 6k): 16,340 of 16,596 transcripts un-filed via title-matching against PeerTube video list (98.6% match rate). catalogue.path restored to PeerTube watch URLs, physical .txt files deleted from library, Qdrant download_url payloads updated, 4,955 empty dirs cleaned. 223 edge cases remain (82 MULTI_MATCH + 141 UNMATCHED, documented at `/tmp/phase5a_remaining.txt` on CT 130).

---

### Backup policy for derived data (concepts + Qdrant) is not defined

**Discovered:** 2026-04-14 (Phase 6a design conversation)
**Severity:** Medium. Not blocking, but should be addressed before the refactor is "released" (merged to master, tagged).

The RECON data layers have different backup needs:

**Primary source layers (MUST be backed up):**
- `/mnt/library/*.pdf`: the actual PDF files. Only copy. Irreplaceable.
- PeerTube video files on CT 110: backed up by PeerTube's own backup system.

**Derived data layers (regeneratable but expensive):**
- `/opt/recon/data/concepts/{hash}/`: Gemini enrichment outputs. Regeneratable by re-running enrichment at Gemini API cost.
- Qdrant collection `recon_knowledge_hybrid`: vector embeddings and payloads. Regeneratable by re-running embedding (free, but requires concepts to exist).
- `/opt/recon/data/recon.db`: SQLite catalogue + documents + status index. Regeneratable from the other layers.
- `/opt/recon/data/processing/{hash}/`: transcript text storage (post-Phase 6a). Regeneratable by re-fetching from PeerTube.

**The worst-case scenario:** concepts AND Qdrant both lost at the same time. Recovery requires re-enriching every document via fresh Gemini calls. At current scale (~30K documents), that's thousands of API calls and real money.

**Either of these alone prevents that worst case:**
- Back up `/opt/recon/data/concepts/`: a few GB of JSON files, easy to rsync
- Take Qdrant snapshots: Qdrant has native snapshot support, can be scheduled

The concepts backup is simpler and smaller. The Qdrant snapshot is more comprehensive.

**Suggested resolution (Phase 6 backlog):**

1. Add a weekly rsync of `/opt/recon/data/concepts/` to the backup target
2. Add a weekly Qdrant snapshot via its HTTP API
3. Document the restore procedure for each loss scenario
4. Add a backup verification step

Not blocking for Phase 6 correctness work. Should be done before the refactor is declared "released."
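A minimal sketch of steps 1 and 2 from the suggested resolution, for whoever picks this up. Assumptions, none of which are confirmed by this log: Qdrant listens on `http://localhost:6333` (its default port), and `BACKUP_DEST` is a placeholder rsync target since the real backup target is still TBD. The snapshot call uses Qdrant's documented `POST /collections/{collection_name}/snapshots` endpoint. rsync is deliberately run without `--delete`, so losing the source concepts can't wipe the backup on the next run.

```python
import json
import subprocess
import urllib.request

QDRANT_URL = "http://localhost:6333"            # assumption: default Qdrant port
COLLECTION = "recon_knowledge_hybrid"           # collection name from this log
CONCEPTS_DIR = "/opt/recon/data/concepts/"      # path from this log
BACKUP_DEST = "backup-host:/backups/recon/concepts/"  # hypothetical target


def snapshot_url(base_url, collection):
    """Build the Qdrant snapshot-creation URL for one collection."""
    return f"{base_url.rstrip('/')}/collections/{collection}/snapshots"


def rsync_command(src, dest):
    """Archive-mode rsync, no --delete: the backup never shrinks on its own."""
    return ["rsync", "-a", src, dest]


def create_qdrant_snapshot(base_url=QDRANT_URL, collection=COLLECTION, timeout=600):
    """Ask Qdrant to create a snapshot and return the parsed JSON response."""
    request = urllib.request.Request(
        snapshot_url(base_url, collection), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def backup_concepts(src=CONCEPTS_DIR, dest=BACKUP_DEST):
    """Copy the concepts tree to the backup target; raises if rsync fails."""
    subprocess.run(rsync_command(src, dest), check=True)
```

Calling `backup_concepts()` followed by `create_qdrant_snapshot()` from a weekly timer would cover both layers. The snapshot file itself still lives on the Qdrant host, so it would need to be copied off-box as well.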
|
|
|
|
---

### Caption drift on PeerTube — RESOLVED

**Discovered:** 2026-04-14 (Phase 6a architecture investigation)
**Resolved:** 2026-04-15 (Phase 6h) — the re-transcription workflow via `POST /api/v1/videos/{uuid}/captions/generate` gives us a way to trigger fresh captions for any video. 332 videos without captions were triggered. PeerTube is the source of truth for captions; peertube-acq picks up new/changed captions automatically on its 30-minute polling cycle.

---
### No automatic PeerTube ingestion — RESOLVED in Phase 6d

**Discovered:** 2026-04-14 (Phase 6c investigation)
**Resolved:** 2026-04-14 (Phase 6d) — `lib/acquisition/peertube.py` built and integrated
**Severity:** HIGH at discovery (feature regression). Now closed.

Phase 6d built `lib/acquisition/peertube.py` as the first acquisition module in the new architecture:
- `acquisition_loop()` runs as a daemon thread inside `recon.service`, polling every 30 minutes
- `acquire_batch()` dedups against catalogue (UUID for URL-path rows, title for Phase 5a library-path rows), fetches new English captions via existing `peertube_scraper` helpers, writes pairs to `/opt/recon/data/acquired/stream/{hash}.txt` + `.meta.json`
- `max_per_pass=50` cap prevents flooding the dispatcher on cold start (~258 unindexed videos at deployment time will drain over ~3 hours)
- The CLI handler `cmd_ingest_peertube` was rewired to call `acquire_batch()` instead of the old broken `peertube_scraper.ingest_video()` path
- End-to-end verification confirmed: the dispatcher picks up acquired files, the transcript processor's hash dedup works correctly, no crashes

CLI behavior change to note: `--channel`, `--since`, `--enrich`, and `--process` flags were removed. The acquisition module fetches all videos and dedup filters them; the service handles the full pipeline automatically. `--stats` is unchanged.

Service thread count: 6 → 7 (added peertube-acq).

The feature regression is closed. New PeerTube content now flows automatically.

---
### Phase 6d optional refinements (not blocking)

**Discovered:** 2026-04-14 (Phase 6d implementation)
**Source:** Items deferred during Phase 6d for non-blocking future improvement

Three Phase 6d refinements identified but deferred:

1. **CLI lost `--channel` and `--since` filtering.** The original `recon ingest-peertube` accepted these flags. Phase 6d's rewrite removed them because the daemon-driven flow doesn't need them. If you ever want channel-specific or time-bounded acquisition runs, `list_new_videos()` could accept those parameters and pass through to `get_videos()`. Trivial to add back.

2. **Title dedup is an exact string match and may occasionally let duplicates through.** Matching PeerTube's `video['name']` against catalogue's `filename` (with `.txt` stripped) can fail on whitespace, Unicode normalization, or special character differences. The transcript processor's hash dedup is the safety net: duplicates get caught at processing time and the staged content gets cleaned up. A more robust title normalization function would reduce wasted API calls.

3. **`_build_known_sets` rebuilds in-memory dedup sets every batch.** Loads ~19K catalogue rows on each acquisition cycle. Fine at current scale (milliseconds), but at hundreds of thousands of rows would benefit from caching. Backlog for future scale.

None of these affect correctness. All are optimizations or feature additions.
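For refinement 2, a possible shape for the title normalization, as a sketch only. `normalize_title` is a hypothetical helper, not something that exists in `lib/acquisition/peertube.py`; the idea is to apply it to both the PeerTube `name` and the catalogue `filename` before comparing. It trades missed duplicates for a small risk of false matches, which is why the hash dedup should stay in place as the safety net.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical helper)."""
    # NFKC folds compatibility forms (full-width letters, ligatures) together.
    text = unicodedata.normalize("NFKC", title).strip()
    # Catalogue filenames carry a .txt suffix; PeerTube names do not.
    if text.lower().endswith(".txt"):
        text = text[:-4]
    text = text.casefold()
    # Punctuation (including straight and curly apostrophes) becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Underscores and runs of whitespace collapse to a single space.
    text = re.sub(r"[\s_]+", " ", text)
    return text.strip()
```

Usage would be along the lines of `normalize_title(video['name']) in known_titles`, where `known_titles` is built by applying the same function to every catalogue filename.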
|
|
|
|
---

### lib/new_pipeline.py is misleadingly named — it's a library management CLI tool, not the refactor's new pipeline

**Discovered:** 2026-04-14 (Phase 6c investigation)
**Severity:** Low — naming confusion, not functional

`lib/new_pipeline.py` is 1,637 lines and was created on 2026-04-13, the day before this item was logged. Despite the name, it's NOT the refactor's new pipeline architecture (dispatcher + processors). It's a separate CLI tool for library management operations: `recon pipeline {status|migrate|reverse|watch|sweep}`.

The naming creates confusion because:
- "new_pipeline" sounds like it might be the active pipeline implementation
- The refactor we're doing IS a new pipeline architecture, but it lives in `lib/dispatcher.py`, `lib/processors/`, etc., not in `lib/new_pipeline.py`
- Future-you (or anyone reading the codebase) might assume `new_pipeline.py` is the active orchestration layer and waste time figuring out the relationship

**Suggested resolution (Phase 6 docs phase):** rename `lib/new_pipeline.py` to something descriptive like `lib/library_manager.py` or `lib/library_tools.py`. Update the import in `recon.py`. Add a clear docstring at the top of the file describing what it does. Update the CLI subcommand name from `pipeline` to `library` if that's not too disruptive.

If renaming is too much friction, at minimum add a docstring at the top of the file: "This is NOT the refactor's pipeline. This is a CLI tool for library management. The active pipeline lives in dispatcher.py + processors/. See PROJECT-BIBLE.md."

---
### Pre-refactor library contains duplicate clusters requiring one-time consolidation

**Discovered:** 2026-04-15 (visual audit of `Defense-and-Tactics/Improvised-Weapons/`)
**Severity:** HIGH — wastes storage, pollutes search results with multiple copies of the same book

User visually inspected one folder (360 MB total) and found ~50% duplication. Examples from that one folder:
- Poor Man's James Bond Vol 3 — 3 copies at 85.5 MB each (~256 MB total, ~171 MB of it redundant)
- ZIPS/Improvised Weapons Pens & Pipes — 3 copies at 20.1 MB each
- David's Tool Kit — 2 copies at 14.9 MB each
- Knuckle Gun / THE_KNUCKLE_GUN — 3 copies at 518 KB each

These are NOT hash-identical (hash dedup correctly passed them). They're same-content-different-bytes: different scans, different OCR, different PDF library versions, cosmetic filename variants ("Vol" vs "Volume", apostrophe differences, underscore vs space).

**NOT a pipeline bug.** These files entered the library before the refactored pipeline's hash + level-4 dedup was operational. They came through pre-refactor `scan_library()`, old new_pipeline Stream-B, or the SCL scan shortcut during Phase 6j. The current pdf_processor would catch these at ingestion time.

**Will NOT be re-dropped through the pipeline.** Each copy already has Gemini enrichment, so reprocessing would burn real money for no new information.

**Proposed resolution — surgical consolidation (three phases):**

Phase A — Cluster identification: write a script that walks `/mnt/library/` domain folders, aggressively normalizes filenames (lowercase, strip punctuation, collapse whitespace, strip "vol/volume/part" tokens), groups by normalized form, produces a report of all clusters with 2+ members including filenames, sizes, hashes, DB metadata.

Phase B — User review: human inspects report, marks each cluster as COLLAPSE (pick one, delete rest), KEEP_ALL (legitimate variants), or UNCLEAR.

Phase C — Execute: for each duplicate to delete — remove physical file, delete documents + catalogue rows, delete Qdrant vectors, delete concepts. Zero reprocessing.

**Estimated scope:** half-day to full day depending on cluster count.
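A rough starting point for the Phase A script described above, as a sketch under stated assumptions. `normalize_filename` and `find_clusters` are hypothetical names; clusters are grouped across the whole tree rather than per folder (an assumption, since the log doesn't say which); and the report carries only paths and sizes, so hashes and DB metadata would still need to be joined in. Note it would not merge "Knuckle Gun" with "THE_KNUCKLE_GUN" because the leading article is kept.

```python
import re
import unicodedata
from collections import defaultdict
from pathlib import Path

NOISE_TOKENS = {"vol", "volume", "part"}  # tokens the log says to strip


def normalize_filename(name: str) -> str:
    """Aggressively normalize a PDF filename into a grouping key."""
    stem = Path(name).stem
    text = unicodedata.normalize("NFKC", stem).lower()
    # Drop apostrophes outright so "Man's" and "Mans" compare equal.
    text = re.sub(r"['\u2019]", "", text)
    # Every other non-alphanumeric run (punctuation, underscores) is a space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = [token for token in text.split() if token not in NOISE_TOKENS]
    return " ".join(tokens)


def find_clusters(root):
    """Map normalized name -> [(path, size_bytes), ...] for groups of 2+."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() == ".pdf":
            key = normalize_filename(path.name)
            groups[key].append((str(path), path.stat().st_size))
    return {key: sorted(members) for key, members in groups.items() if len(members) >= 2}
```

Running `find_clusters("/mnt/library")` and dumping the result to a file would give the Phase B reviewer something to mark up. The ASCII-only character class is a simplification and would flatten non-Latin titles to an empty key.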
|
|
|
|
---

### 9,478 legacy dirs in `/opt/recon/data/text/`

**Discovered:** 2026-04-15 (Phase 6h cleanup investigation)
**Severity:** Low — historical extraction output, 3.9 GB

Pre-refactor pipeline wrote extracted text to `data/text/{hash}/`. Current pipeline uses `data/processing/{hash}/`. The 9,478 remaining dirs are all for documents still in catalogue. Can be cleaned up once confirmed none are the sole text copy for any document. Not blocking.
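One way the "sole text copy" check could be done before deleting anything, sketched under a loud assumption: a non-empty `data/processing/{hash}/` directory is treated as proof that the text also exists in the current layout. That may not hold for every document type, so the output should be read as a list of hashes to look at, not a delete list. `sole_copy_hashes` is a hypothetical helper.

```python
from pathlib import Path


def sole_copy_hashes(text_root, processing_root):
    """Return legacy hash dirs with no non-empty counterpart in processing/."""
    text_root = Path(text_root)
    processing_root = Path(processing_root)
    sole = []
    for legacy_dir in sorted(p for p in text_root.iterdir() if p.is_dir()):
        counterpart = processing_root / legacy_dir.name
        # Assumption: any file in the counterpart dir means text exists there.
        has_counterpart = counterpart.is_dir() and any(counterpart.iterdir())
        if not has_counterpart:
            sole.append(legacy_dir.name)
    return sole
```

For example, `sole_copy_hashes("/opt/recon/data/text", "/opt/recon/data/processing")` would list the hashes that need a closer look; an empty result would suggest the legacy tree is safe to archive or remove.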
|
|
|
|
---

### Backup architecture not yet implemented

**Discovered:** 2026-04-15 (operational review)
**Severity:** Medium — no automated backup exists for derived data

Targets identified: pi-nas (192.168.1.245, ~22 TB available, library export wiped and NFS/SMB decommissioned 2026-04-15) and Contabo (offsite). Scope, tool choice (rsync / restic / borg), schedule, retention policy, and monitoring all TBD. The `recon-backup.timer` referenced in early documentation never existed.

---
### `signal-archive/` in `/mnt/library/`

**Discovered:** 2026-04-15 (library cleanup audit)
**Severity:** Low — 44 files, 25 MB, not library content

Signal/Matrix chat log exports (images, text logs). Matt says these will "eventually contribute" to the knowledge base but no ingestion path exists yet. Left in place.

---
### 223 Phase 5a edge-case transcripts still have library paths

**Discovered:** 2026-04-15 (Phase 6k un-file)
**Severity:** Low — 82 MULTI_MATCH + 141 UNMATCHED

The Phase 6k un-file matched 98.6% of transcripts. The remaining 223 either match multiple PeerTube videos (ambiguous) or match none (video likely removed from PeerTube). Documented at `/tmp/phase5a_remaining.txt` on CT 130. Can be hand-resolved or tombstoned later.

---
## Resolved

Items moved here with resolution date and phase reference as they're closed out.

- **2026-04-14 Phase 6a:** Transcripts don't get filed into library tree → accepted as correct behavior (organized in place)
- **2026-04-14 Phase 6b:** Dashboard shows transcripts as "Untitled" / "WEB" → fixed via COALESCE + transcript type branch
- **2026-04-14 Phase 6c:** Cruft in /opt/recon/ → 24 .bak files removed
- **2026-04-14 Phase 6d:** No automatic PeerTube ingestion → acquisition module built and integrated
- **2026-04-15 Phase 6h:** 277 STATE 2 PeerTube transcripts → 283 deleted, re-transcription triggered for 332 videos
- **2026-04-15 Phase 6h:** Caption drift on PeerTube → re-transcription workflow established
- **2026-04-15 Phase 6j:** _unclassified and _ingest staging dirs non-empty → fully cleaned (1,240 PDFs refiled, 328 dupes deleted)
- **2026-04-15 Phase 6j:** 13 PDFs in _unclassified with stale embedded_at → subsumed by full refile pass
- **2026-04-15 Phase 6k:** Phase 5a 16,596 transcripts filed to library → 16,340 un-filed via title matching (223 edge cases remain)
- **2026-04-15:** PROJECT-BIBLE.md stale → complete rewrite + topology fix
- **2026-04-15:** pi-nas 283 GB orphaned NFS export → wiped, NFS/SMB decommissioned via OMV