diff --git a/failed_documents/README.md b/failed_documents/README.md
new file mode 100644
index 0000000..bd24bc4
--- /dev/null
+++ b/failed_documents/README.md
@@ -0,0 +1,31 @@
+# Failed Documents Cleanup Log
+
+**Date:** 2026-04-14
+**Total rows purged:** 56
+**Source:** RECON pipeline `documents` table, `status='failed'`
+
+All 56 entries failed during PDF extraction or transcript ingestion and never produced vectors, concepts, or usable text. They cannot be recovered without manual intervention (re-acquisition, DRM removal, or file repair).
+
+## Category Breakdown
+
+| Category | Count | File |
+|----------|-------|------|
+| DRM-encrypted Internet Archive PDFs | 18 | [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md) |
+| Corrupt/malformed PDFs | 13 | [corrupt_pdfs.md](corrupt_pdfs.md) |
+| macOS resource forks (`._` files) | 22 | [macos_resource_forks.md](macos_resource_forks.md) |
+| Deleted PeerTube videos | 2 | [deleted_peertube_videos.md](deleted_peertube_videos.md) |
+| Test artifacts | 1 | [test_artifacts.md](test_artifacts.md) |
+| **Total** | **56** | |
+
+## What Was Deleted
+
+- 56 rows from `catalogue` table
+- 56 rows from `documents` table
+- Physical PDF files on `/mnt/library/` (where they still existed)
+- Text directories under `/opt/recon/data/text/{hash}/`
+- Concept directories under `/opt/recon/data/concepts/{hash}/`
+- No Qdrant vectors existed (failed before embedding stage)
+
+## Regrab Candidates
+
+The 18 DRM-encrypted PDFs are the only category worth re-acquiring. The Internet Archive identifiers are listed in [drm_encrypted_ia_pdfs.md](drm_encrypted_ia_pdfs.md). These books can potentially be re-downloaded in non-DRM format using `ia download {identifier}` or borrowed via Open Library.
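As a rough sketch of the regrab step, the download commands could be generated from the logged identifiers rather than typed by hand. The snippet assumes the `ia` command-line tool from the `internetarchive` package is installed and configured. The helper name `build_regrab_commands` is hypothetical, the three sample identifiers are copied from drm_encrypted_ia_pdfs.md, and nothing is executed. Whether a non-DRM copy is actually offered for a given item is not something this sketch can check.

```python
import shlex

# Three of the 18 identifiers listed in drm_encrypted_ia_pdfs.md.
IA_IDENTIFIERS = [
    "Root-Cellaring",
    "Storeys-Guide-Raising-Beef-Cattle",
    "barefootarchitec00leng",
]


def build_regrab_commands(identifiers):
    """Return one `ia download` argv list per identifier (hypothetical helper)."""
    commands = []
    for identifier in identifiers:
        identifier = identifier.strip()
        if not identifier:
            continue  # skip blank entries instead of emitting a bare `ia download`
        commands.append(["ia", "download", identifier])
    return commands


if __name__ == "__main__":
    for argv in build_regrab_commands(IA_IDENTIFIERS):
        # Print a shell-safe command line for review; nothing is run here.
        print(shlex.join(argv))
```

Printing the commands instead of running them keeps the step reviewable, which seems prudent given that some of these items may only be available through lending.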
diff --git a/failed_documents/corrupt_pdfs.md b/failed_documents/corrupt_pdfs.md
new file mode 100644
index 0000000..015f054
--- /dev/null
+++ b/failed_documents/corrupt_pdfs.md
@@ -0,0 +1,27 @@
+# Corrupt/Malformed PDFs
+
+**Count:** 13
+**Subcategories:**
+- Truncated (EOF marker not found): 10
+- Negative seek value: 2
+- Missing /Root object: 1
+
+**Failure reason:** These PDFs are structurally damaged: truncated during download, missing required PDF objects, or otherwise malformed. Both PyPDF2 and pdftotext/pdfinfo return 0 extractable pages. Manual repair is theoretically possible but not worth the effort for these titles.
+
+## Entries
+
+| # | Filename | Path | Hash | Size | Discovered | Error |
+|---|----------|------|------|------|------------|-------|
+| 1 | Depression Era Recipies.pdf | `/mnt/library/Survival-Companion-Library/Companion Survival Resource Library/Food, Nutrition & Recipes/Recipes/Depression Era Recipies.pdf` | `b9bcabfe1d0d9aac` | 11,483 | 2026-02-16 00:22:23 | EOF marker not found |
+| 2 | EMP-1.pdf | `/mnt/library/Survival-Companion-Library/EMP/EMP-1.pdf` | `ebdfc16840b35ad6` | 1,589 | 2026-04-13 01:06:51 | EOF marker not found |
+| 3 | Food Storage and Disaster Calendar.pdf | `/mnt/library/Survival-Companion-Library/Food Storage/Food Storage and Disaster Calendar.pdf` | `faf8212fea01d991` | 96,014 | 2026-02-16 00:22:23 | EOF marker not found |
+| 4 | Food_storage_guide.pdf | `/mnt/library/Survival-Companion-Library/Food Storage/Food_storage_guide.pdf` | `48905846ef1f395f` | 1,525,293 | 2026-02-16 00:22:23 | EOF marker not found |
+| 5 | Homemade C4 - A Recipe For Survival - Ragnar Benson.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/Homemade C4 - A Recipe For Survival - Ragnar Benson.pdf` | `68f01f006e5c05b2` | 8,146,967 | 2026-02-16 00:22:24 | '/Root' |
+| 6 | Homemade Grenade Launchers - Ragnar Benson.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/Homemade Grenade Launchers - Ragnar Benson.pdf` | `3d82ec1f8e0cefae` | 6,172,870 | 2026-02-16 00:22:24 | negative seek value -1 |
+| 7 | PPS_complete.pdf | `/mnt/library/Survival-Companion-Library/Medicine - Health - Hygiene - Sanitation/PPS_complete.pdf` | `30a694dbee39f98b` | 17,986 | 2026-02-16 00:22:24 | EOF marker not found |
+| 8 | Survivalist #01 - Premier Issue.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/American Survival Guide/Survivalist #01 - Premier Issue.pdf` | `c691de4341ac4ad0` | 33,372,463 | 2026-02-16 00:22:24 | EOF marker not found |
+| 9 | Survivalist #03 - Self-Reliance.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/American Survival Guide/Survivalist #03 - Self-Reliance.pdf` | `318b6a9749672666` | 53,833,074 | 2026-02-16 00:22:24 | EOF marker not found |
+| 10 | Survivalist #05 - Societal Collapse.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/American Survival Guide/Survivalist #05 - Societal Collapse.pdf` | `0c6505dcbaf7de70` | 55,616,920 | 2026-02-16 00:22:24 | EOF marker not found |
+| 11 | Survivalist #07 – When the Lights go Out!.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/American Survival Guide/Survivalist #07 – When the Lights go Out!.pdf` | `4b69384d20e64d31` | 75,493,362 | 2026-02-16 00:22:24 | EOF marker not found |
+| 12 | Survivalist #11 - Real Self Defense.pdf | `/mnt/library/Survival-Companion-Library/Books-Magazines/American Survival Guide/Survivalist #11 - Real Self Defense.pdf` | `bdcb548d8bfc99e0` | 86,043,525 | 2026-02-16 00:22:24 | EOF marker not found |
+| 13 | fm3-22-68_2006.pdf | `/mnt/library/Survival-Companion-Library/Companion Survival Resource Library/Army Field Manuals/fm3-22-68_2006.pdf` | `7c6695360d1f03c1` | 8,126,464 | 2026-04-13 01:06:51 | negative seek value -1 |
diff --git a/failed_documents/deleted_peertube_videos.md b/failed_documents/deleted_peertube_videos.md
new file mode 100644
index 0000000..9965a2e
--- /dev/null
+++ b/failed_documents/deleted_peertube_videos.md
@@ -0,0 +1,11 @@
+# Deleted PeerTube Videos
+
+**Count:** 2
+**Failure reason:** These PeerTube videos were deleted from the instance after RECON catalogued them but before enrichment/embedding completed. The transcript text was extracted, but the pipeline later marked them as failed when it could no longer reach the source URL.
+
+## Entries
+
+| # | Title | Video UUID | Channel | Hash | Discovered | URL |
+|---|-------|-----------|---------|------|------------|-----|
+| 1 | I brought home a farm animal we’ve never had before \| VLOG | `13761144-1bc7-46c4-895f-0d450b91007f` | Roots and Refuge Farm | `1cdc40b72b95db78` | 2026-04-05 15:41:48 | `https://stream.echo6.co/w/13761144-1bc7-46c4-895f-0d450b91007f` |
+| 2 | Christmas Giveaway Day 3: CCNA, CCNA Cyber Ops, Amazon eGift and more! | `f67b4a3a-02f0-4b0e-a3a4-2f18b8e20833` | David Bombal | `0dc816878cae5a9c` | 2026-04-06 00:46:33 | `https://stream.echo6.co/w/f67b4a3a-02f0-4b0e-a3a4-2f18b8e20833` |
diff --git a/failed_documents/drm_encrypted_ia_pdfs.md b/failed_documents/drm_encrypted_ia_pdfs.md
new file mode 100644
index 0000000..ba7ef28
--- /dev/null
+++ b/failed_documents/drm_encrypted_ia_pdfs.md
@@ -0,0 +1,29 @@
+# DRM-Encrypted Internet Archive PDFs
+
+**Count:** 18
+**Failure reason:** These are Adobe DRM (ACSM) encrypted PDFs downloaded from Internet Archive's lending library. The RECON pipeline's PyPDF2/pdftotext extractors cannot decrypt them; they require Adobe Digital Editions or equivalent DRM removal tooling.
+
+**Regrab note:** Most of these titles are available on Internet Archive. The `ia_identifier` column can be used with `ia download {identifier}` to re-download, or the books can be borrowed via Open Library in a non-DRM format.
+
+## Entries
+
+| # | Filename | IA Identifier | Path | Hash | Size | Discovered | Error |
+|---|----------|---------------|------|------|------|------------|-------|
+| 1 | Root-Cellaring.pdf | `Root-Cellaring` | `/mnt/library/Root-Cellaring.pdf` | `12be9cab19173c9f` | 14,056,345 | 2026-03-19 01:12:29 | encryption handler |
+| 2 | Storeys-Guide-Raising-Beef-Cattle.pdf | `Storeys-Guide-Raising-Beef-Cattle` | `/mnt/library/Storeys-Guide-Raising-Beef-Cattle.pdf` | `6f5127e5e861cd7f` | 23,168,056 | 2026-03-19 01:12:29 | encryption handler |
+| 3 | Storeys-Guide-Raising-Pigs.pdf | `Storeys-Guide-Raising-Pigs` | `/mnt/library/Storeys-Guide-Raising-Pigs.pdf` | `d58acf724f6e75c0` | 18,779,175 | 2026-03-19 01:12:29 | encryption handler |
+| 4 | Storeys-Guide-Raising-Rabbits.pdf | `Storeys-Guide-Raising-Rabbits` | `/mnt/library/Storeys-Guide-Raising-Rabbits.pdf` | `cc83e27c348205c4` | 12,058,415 | 2026-03-19 01:12:29 | encryption handler |
+| 5 | Storeys-Guide-Raising-Sheep.pdf | `Storeys-Guide-Raising-Sheep` | `/mnt/library/Storeys-Guide-Raising-Sheep.pdf` | `a2740665539480f0` | 22,925,624 | 2026-03-19 01:12:29 | encryption handler |
+| 6 | The-Complete-Medicinal-Herbal.pdf | `The-Complete-Medicinal-Herbal` | `/mnt/library/The-Complete-Medicinal-Herbal.pdf` | `166024bed0da3899` | 33,058,636 | 2026-03-19 01:12:29 | encryption handler |
+| 7 | barefootarchitec00leng_encrypted.pdf | `barefootarchitec00leng` | `/mnt/library/Shelter-and-Construction/barefootarchitec00leng_encrypted.pdf` | `47929d6547c949bc` | 27,130,488 | 2026-03-19 01:12:29 | encryption handler |
+| 8 | beginnersguideto0000shol_o6v9_encrypted.pdf | `beginnersguideto0000shol_o6v9` | `/mnt/library/Acquired/Food/beginnersguideto0000shol_o6v9_encrypted.pdf` | `d18fa3ce98f8d572` | 8,203,078 | 2026-04-13 01:06:51 | file not found (moved) |
+| 9 | bestloveddepress0000unse_encrypted.pdf | `bestloveddepress0000unse` | `/mnt/library/Acquired/Food/bestloveddepress0000unse_encrypted.pdf` | `d54acd11ed6a2faa` | 5,503,960 | 2026-04-13 01:06:51 | file not found (moved) |
+| 10 | bushcraftoutdoor0000mors_encrypted.pdf | `bushcraftoutdoor0000mors` | `/mnt/library/Wilderness-Skills/bushcraftoutdoor0000mors_encrypted.pdf` | `7bcee33c6dfca5e9` | 13,798,508 | 2026-03-19 01:12:29 | encryption handler |
+| 11 | completemedicina00odyp_encrypted.pdf | `completemedicina00odyp` | `/mnt/library/Acquired/Medical/Herbalism/completemedicina00odyp_encrypted.pdf` | `be52efb423c699b8` | 25,304,735 | 2026-04-13 01:06:51 | file not found (moved) |
+| 12 | hamradiofordummi0000silv_encrypted.pdf | `hamradiofordummi0000silv` | `/mnt/library/Acquired/Skills/hamradiofordummi0000silv_encrypted.pdf` | `127ad91c8035e3a2` | 21,170,450 | 2026-04-13 01:06:51 | file not found (moved) |
+| 13 | howtostayalivein0000angi_encrypted.pdf | `howtostayalivein0000angi` | `/mnt/library/Wilderness-Skills/howtostayalivein0000angi_encrypted.pdf` | `99dbc9394ef38b9b` | 12,855,750 | 2026-03-19 01:12:29 | encryption handler |
+| 14 | justincasehowtob0000harr_encrypted.pdf | `justincasehowtob0000harr` | `/mnt/library/Scenario-Playbooks/justincasehowtob0000harr_encrypted.pdf` | `41e8a2158b53f604` | 16,135,469 | 2026-03-19 01:12:29 | encryption handler |
+| 15 | livingreadypocke0000hubb_encrypted.pdf | `livingreadypocke0000hubb` | `/mnt/library/Acquired/Skills/livingreadypocke0000hubb_encrypted.pdf` | `4224c213bf549716` | 6,689,915 | 2026-04-13 01:06:51 | file not found (moved) |
+| 16 | multitudewardemo00hard_encrypted.pdf | `multitudewardemo00hard` | `/mnt/library/Acquired/Scenario/multitudewardemo00hard_encrypted.pdf` | `2807e76d8393e018` | 37,795,358 | 2026-04-13 01:06:51 | file not found (moved) |
+| 17 | seedtoseedseedsa0000ashw_encrypted.pdf | `seedtoseedseedsa0000ashw` | `/mnt/library/Agriculture-and-Livestock/seedtoseedseedsa0000ashw_encrypted.pdf` | `75c393d852a75f2d` | 20,084,059 | 2026-03-19 01:12:29 | encryption handler |
+| 18 | teamingwithmicro0000lowe_encrypted.pdf | `teamingwithmicro0000lowe` | `/mnt/library/Acquired/Permaculture/teamingwithmicro0000lowe_encrypted.pdf` | `9e4d5b170276627c` | 14,145,070 | 2026-04-13 01:06:51 | file not found (moved) |
diff --git a/failed_documents/macos_resource_forks.md b/failed_documents/macos_resource_forks.md
new file mode 100644
index 0000000..d5f32b7
--- /dev/null
+++ b/failed_documents/macos_resource_forks.md
@@ -0,0 +1,29 @@
+# macOS Resource Fork Files
+
+**Count:** 22
+**Failure reason:** These are macOS `._` resource fork / extended attribute sidecar files, not real PDFs. They are 4,096 bytes each (one filesystem block) and contain Apple-specific metadata. The RECON scanner picked them up because they end in `.pdf` but they have no extractable content.
+
+## Paths
+
+1. `/mnt/library/Survival-Companion-Library/Books-Magazines/._Life after Doomsday.pdf` (`b333d5d5e0796c3f`)
+2. `/mnt/library/Survival-Companion-Library/Books-Magazines/._Survivalist #09 - Urban Survival.pdf` (`5e9bbf4c04347ac4`)
+3. `/mnt/library/Survival-Companion-Library/Books-Magazines/._The Survival Medicine Handbook_ - Alton, Joseph.pdf` (`bd5519cb0db9e1be`)
+4. `/mnt/library/Survival-Companion-Library/Books-Magazines/._The Survival handbook.pdf` (`80ef6bcc79982214`)
+5. `/mnt/library/Survival-Companion-Library/Firearms - Defense/._Basic_Manual_On_Knife_Throwing_2003.pdf` (`7725b720e8f1799d`)
+6. `/mnt/library/Survival-Companion-Library/Firearms - Defense/._Home And Family Security System.pdf` (`951327b09f27d881`)
+7. `/mnt/library/Survival-Companion-Library/Firearms - Defense/._survival battery.pdf` (`9686c66a7c17d601`)
+8. `/mnt/library/Survival-Companion-Library/Food Storage/._3monthfoodsupply.pdf` (`58f4ed9073d2e6d0`)
+9. `/mnt/library/Survival-Companion-Library/Food Storage/._3monthsupplyoffoodschedule.pdf` (`c8c88f3cdd8dc88c`)
+10. `/mnt/library/Survival-Companion-Library/Food Storage/._5 dollar a week food storage plan - Unknown.pdf` (`14c4c65f03e45128`)
+11. `/mnt/library/Survival-Companion-Library/Food Storage/._Food+Prepping+Checklist.pdf` (`45222452dddcb0d1`)
+12. `/mnt/library/Survival-Companion-Library/Food Storage/._canning.pdf` (`4e85ec1524f6ebe8`)
+13. `/mnt/library/Survival-Companion-Library/General Survival/EXTREME FAMILY SURVIVAL/._Bonus-Riot_Safety_for_patriots.pdf` (`2cda0807321d0a8d`)
+14. `/mnt/library/Survival-Companion-Library/General Survival/Family Survival Course/._survive_any_disaster_v4.pdf` (`492643d20f610ff3`)
+15. `/mnt/library/Survival-Companion-Library/General Survival/Massive Download 2/._GunFlash.pdf` (`a2224706e1567229`)
+16. `/mnt/library/Survival-Companion-Library/General Survival/Prepping for Pennies/._BONUS 3 - What to Stockpile.pdf` (`6331b36047b611a8`)
+17. `/mnt/library/Survival-Companion-Library/General Survival/SurvivalSpin Stuff/._Camping+Supplies+Checklist.pdf` (`3413af6b7af47578`)
+18. `/mnt/library/Survival-Companion-Library/Medicine - Health - Hygiene - Sanitation/._FINAL COPY PDF DOOMSDAY BOOK OF MEDICINE.pdf` (`7f5a1c8b840c8229`)
+19. `/mnt/library/Survival-Companion-Library/Medicine - Health - Hygiene - Sanitation/._First Aid FM 4-25.pdf` (`e5583ff012cc17ba`)
+20. `/mnt/library/Survival-Companion-Library/Military guides - Manuals/._ar350-30_Survival_Evasion_Resistance_Escape.pdf` (`2536226e1dcdab11`)
+21. `/mnt/library/Survival-Companion-Library/Survival Uploads/._NATO-emergency-war-surgery.pdf` (`809748c7dafb7647`)
+22. `/mnt/library/Survival-Companion-Library/Survival Uploads/Bugging Out/._BugoutBag.pdf` (`c4fae8882cdb1441`)
diff --git a/failed_documents/test_artifacts.md b/failed_documents/test_artifacts.md
new file mode 100644
index 0000000..4e8b59a
--- /dev/null
+++ b/failed_documents/test_artifacts.md
@@ -0,0 +1,10 @@
+# Test Artifacts
+
+**Count:** 1
+**Failure reason:** This was a CLI test file created during RECON development and subsequently deleted from the library. The catalogue/documents rows were never cleaned up.
+
+## Entry
+
+| Filename | Path | Hash | Size | Discovered | Error |
+|----------|------|------|------|------------|-------|
+| recon-test-cli.pdf | `/mnt/library/Technical/recon-test-cli.pdf` | `f95452d04916154d` | 381 | 2026-04-13 01:06:51 | File not found: /mnt/library/Technical/recon-test-cli.pdf |
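Several of the failure categories logged in this patch could in principle be rejected before a row is ever written to the `documents` table. The following is a minimal sketch of such a pre-ingestion check, not RECON's actual scanner: `classify_candidate` and its return labels are hypothetical, and the 1,024-byte tail window used for the `%%EOF` search is an assumption. It flags `._` sidecar files by name, paths that no longer exist, and PDFs with no `%%EOF` marker near the end. It does not detect DRM, a missing /Root object, or bad seek offsets.

```python
import os

TAIL_BYTES = 1024  # assumed search window for the end-of-file marker


def classify_candidate(path):
    """Return a label for a scanned path (hypothetical labels, not RECON's)."""
    name = os.path.basename(path)
    if name.startswith("._"):
        return "resource_fork"  # macOS sidecar file, not a real PDF
    if not os.path.isfile(path):
        return "missing"  # moved or deleted since it was catalogued
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Read only the last TAIL_BYTES bytes (or the whole file if smaller).
        handle.seek(max(0, size - TAIL_BYTES))
        tail = handle.read()
    if b"%%EOF" not in tail:
        return "truncated"  # no end-of-file marker near the end
    return "ok"
```

In a scanner loop, anything other than `"ok"` would be skipped or logged rather than inserted as a row, so the name check runs first and needs no disk access.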