Files changed: failed_documents/README.md failed_documents/corrupt_pdfs.md failed_documents/deleted_peertube_videos.md failed_documents/drm_encrypted_ia_pdfs.md failed_documents/macos_resource_forks.md failed_documents/test_artifacts.md
# Failed Documents Cleanup Log

**Date:** 2026-04-14
**Total rows purged:** 56
**Source:** RECON pipeline `documents` table, `status='failed'`

All 56 entries failed during PDF extraction or transcript ingestion and never produced vectors, concepts, or usable text. They cannot be recovered without manual intervention (re-acquisition, DRM removal, or file repair).

## Category Breakdown

| Category | Count | File |
|----------|-------|------|
| DRM-encrypted Internet Archive PDFs | 18 | [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md) |
| Corrupt/malformed PDFs | 13 | [corrupt_pdfs.md](corrupt_pdfs.md) |
| macOS resource forks (`._` files) | 22 | [macos_resource_forks.md](macos_resource_forks.md) |
| Deleted PeerTube videos | 2 | [deleted_peertube_videos.md](deleted_peertube_videos.md) |
| Test artifacts | 1 | [test_artifacts.md](test_artifacts.md) |
| **Total** | **56** | |

## What Was Deleted

- 56 rows from `catalogue` table
- 56 rows from `documents` table
- Physical PDF files on `/mnt/library/` (where they still existed)
- Text directories under `/opt/recon/data/text/{hash}/`
- Concept directories under `/opt/recon/data/concepts/{hash}/`
- No Qdrant vectors existed (failed before embedding stage)

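The deletion steps listed above could be scripted roughly as follows. This is a minimal sketch, not the script that was actually run: the database engine is not stated in this log, so `sqlite3` stands in for it, and the `hash` column name, the `purge_failed` and `artifact_dirs` helpers, and the `dry_run` flag are all hypothetical. Only the table names and the two directory roots come from the list above. Removal of the physical PDFs on `/mnt/library/` is left out because their paths are not described here.

```python
import shutil
import sqlite3
from pathlib import Path

# Directory roots taken from the log; everything else here is an assumption.
TEXT_ROOT = Path("/opt/recon/data/text")
CONCEPT_ROOT = Path("/opt/recon/data/concepts")


def artifact_dirs(doc_hash, text_root=TEXT_ROOT, concept_root=CONCEPT_ROOT):
    """Return the per-document directories that a purge would remove."""
    return [Path(text_root) / doc_hash, Path(concept_root) / doc_hash]


def purge_failed(conn, doc_hashes, text_root=TEXT_ROOT,
                 concept_root=CONCEPT_ROOT, dry_run=True):
    """Delete rows and on-disk artifacts for the given document hashes.

    Assumes (hypothetically) that both tables are keyed by a `hash` column.
    Returns the directories that were, or in a dry run would be, removed.
    """
    removed = []
    for doc_hash in doc_hashes:
        for directory in artifact_dirs(doc_hash, text_root, concept_root):
            if directory.is_dir():
                removed.append(directory)
                if not dry_run:
                    shutil.rmtree(directory)
        if not dry_run:
            # Table names are fixed literals, so the f-string is safe here.
            for table in ("catalogue", "documents"):
                conn.execute(f"DELETE FROM {table} WHERE hash = ?", (doc_hash,))
    if not dry_run:
        conn.commit()
    return removed
```

Defaulting to `dry_run=True` means a first pass only reports which directories exist, which is a cheap way to review the scope of a purge before anything is removed.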
## Regrab Candidates

The 18 DRM-encrypted PDFs are the only category worth re-acquiring. Their Internet Archive identifiers are listed in [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md). These books can potentially be re-downloaded in non-DRM format using `ia download <identifier>`, or borrowed via Open Library.
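A re-acquisition pass over those identifiers might look like the sketch below. It only wraps the `ia download <identifier>` command mentioned above; the `build_ia_commands` and `regrab` helpers and the `dry_run` flag are hypothetical, and the identifiers would have to be read from drm_encrypted_ia_pdfs.md, whose layout is not shown here. Whether a given item actually yields a non-DRM file depends on the item, so the result still needs to be checked by hand.

```python
import shutil
import subprocess


def build_ia_commands(identifiers):
    """Return one `ia download <identifier>` argv list per identifier."""
    return [["ia", "download", identifier] for identifier in identifiers]


def regrab(identifiers, dry_run=True):
    """Run the download commands, or only print them in a dry run.

    Falls back to printing when the `ia` executable is not on PATH.
    Returns the list of commands that were built.
    """
    commands = build_ia_commands(identifiers)
    for argv in commands:
        if dry_run or shutil.which("ia") is None:
            print(" ".join(argv))
        else:
            # check=False: one failed item should not abort the whole batch.
            subprocess.run(argv, check=False)
    return commands
```

Calling `regrab(ids)` with the default `dry_run=True` prints the commands for review; passing `dry_run=False` executes them one at a time.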