# Failed Documents Cleanup Log
**Date:** 2026-04-14
**Total rows purged:** 56
**Source:** RECON pipeline `documents` table, `status='failed'`

All 56 entries failed during PDF extraction or transcript ingestion and never produced vectors, concepts, or usable text. They cannot be recovered without manual intervention (re-acquisition, DRM removal, or file repair).
## Category Breakdown
| Category | Count | File |
|----------|-------|------|
| DRM-encrypted Internet Archive PDFs | 18 | [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md) |
| Corrupt/malformed PDFs | 13 | [corrupt_pdfs.md](corrupt_pdfs.md) |
| macOS resource forks (`._` files) | 22 | [macos_resource_forks.md](macos_resource_forks.md) |
| Deleted PeerTube videos | 2 | [deleted_peertube_videos.md](deleted_peertube_videos.md) |
| Test artifacts | 1 | [test_artifacts.md](test_artifacts.md) |
| **Total** | **56** | |
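
The largest category, macOS resource forks, can be recognised from the file name alone. As a minimal sketch (the helper names `is_resource_fork` and `ingestable` are hypothetical and not part of the RECON pipeline), a pre-ingestion filter might look like this:

```python
from pathlib import Path


def is_resource_fork(path):
    """Return True if the file name starts with the macOS `._` prefix."""
    # Only the final path component is checked, so a directory whose
    # name happens to start with `._` does not affect the result.
    return Path(path).name.startswith("._")


def ingestable(paths):
    """Drop `._` sidecar files from a list of candidate paths (hypothetical helper)."""
    return [p for p in paths if not is_resource_fork(p)]
```

Applying such a filter before a file enters the `documents` table would likely keep `._` sidecar files from being recorded as failed extractions.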
## What Was Deleted
- 56 rows from `catalogue` table
- 56 rows from `documents` table
- Physical PDF files on `/mnt/library/` (where they still existed)
- Text directories under `/opt/recon/data/text/{hash}/`
- Concept directories under `/opt/recon/data/concepts/{hash}/`
- Qdrant vectors: none to delete (every entry failed before the embedding stage)
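
The filesystem part of this cleanup can be sketched as follows. This is an illustration, not the script that was actually run: the function name `purge_hash_dirs`, the `dry_run` flag, and the assumption that `{hash}` is a plain directory name are all hypothetical.

```python
import shutil
from pathlib import Path

# Root directories as listed above.
TEXT_ROOT = Path("/opt/recon/data/text")
CONCEPTS_ROOT = Path("/opt/recon/data/concepts")


def purge_hash_dirs(hashes, roots=(TEXT_ROOT, CONCEPTS_ROOT), dry_run=True):
    """Remove <root>/<hash>/ for each hash and return the directories affected.

    Hypothetical helper. With dry_run=True nothing is deleted; the returned
    list shows what would be removed.
    """
    affected = []
    for root in roots:
        for doc_hash in hashes:
            # Refuse anything that could escape the root directory.
            if not doc_hash or doc_hash in {".", ".."} or "/" in doc_hash:
                raise ValueError(f"unsafe hash value: {doc_hash!r}")
            target = Path(root) / doc_hash
            if target.is_dir():
                if not dry_run:
                    shutil.rmtree(target)
                affected.append(target)
    return affected
```

Running once with the default `dry_run=True` and reviewing the returned paths before repeating with `dry_run=False` is a cautious way to use it, since `shutil.rmtree` cannot be undone.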
## Regrab Candidates
The 18 DRM-encrypted PDFs are the only category worth re-acquiring. Their Internet Archive identifiers are listed in [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md). These books can potentially be re-downloaded in a non-DRM format using `ia download <identifier>`, or borrowed via Open Library.
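
A batch regrab could be scripted roughly as below. This is a sketch under assumptions: it expects the `ia` command-line tool to be installed and on `PATH`, uses only the bare `ia download <identifier>` form mentioned above, and the helper names `build_ia_commands` and `regrab` are hypothetical.

```python
import subprocess


def build_ia_commands(identifiers):
    """Return one `ia download <identifier>` argument list per non-empty identifier."""
    cleaned = (ident.strip() for ident in identifiers)
    return [["ia", "download", ident] for ident in cleaned if ident]


def regrab(identifiers):
    """Run each download and return the identifiers whose command failed.

    Raises FileNotFoundError if the `ia` executable is not installed.
    """
    failed = []
    for cmd in build_ia_commands(identifiers):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed
```

A zero exit code only means the download command succeeded; it does not by itself show that the retrieved PDF is free of DRM, so each file would still need to be checked before it is ingested again.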