Migration: consolidate Echo6 docs to cortex with full infrastructure cleanup sync

- Documents recent infrastructure cleanup (8 CTs destroyed, 35 DNS records removed, Headscale cleanup)
- Adds 24 new runbooks covering Authentik, PeerTube, Meshtastic, RECON, Proxmox, Mailcow, Internet Archive, GPU routing
- Adds project documentation for headscale, vaultwarden, peertube, matrix, mmud, advbbs, arr stack
- Updates services.md, environment.md, caddy.md, authentik.md to match live infrastructure
- Removes 4 deprecated runbook duplicates (canonical versions live in projects/)
- Adds .gitignore for binary archives and editor temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 06:02:16 +00:00 · 2026-04-13 06:02:16 +00:00 · e9231ac24a
commit e9231ac24a
parent 89834796ff
93 changed files with 51223 additions and 254 deletions
--- a/runbooks/ia-download-mirror.md
+++ b/runbooks/ia-download-mirror.md
@@ -0,0 +1,429 @@
+# Download & Mirror from Internet Archive
+
+Procedures for downloading items, filtering by format/pattern, bulk downloading from collections, and mirroring entire collections via the `ia` CLI on pi-nas.
+
+---
+
+## Prerequisites
+
+- `ia` CLI installed on pi-nas (192.168.1.245) — v5.7.2
+- Authenticated if downloading restricted items: `ia configure`
+- Sufficient storage on pi-nas (check with `df -h`)
+- Reference: `ia-cli-reference.md` for search/query syntax
+
+---
+
+## 1. Download a Single Item
+
+An "item" is a logical unit on archive.org identified by its identifier (visible in the URL: `archive.org/details/<identifier>`).
+
+```bash
+# Download all files in an item to ./<identifier>/
+ia download <identifier>
+
+# Example
+ia download prelinger_films
+```
+
+By default, this creates a directory named after the identifier and downloads every file in the item into it (originals and derivatives).
+
+### Gate
+
+Verify the download directory exists and has files:
+
+```bash
+ls -la <identifier>/
+```
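+
+The gate above can be scripted so that it fails loudly when a download produced nothing. The following is a minimal sketch and not part of the `ia` CLI: `check_download_dir` is a hypothetical helper name, and it assumes GNU `find` (for `-quit`), which is the usual case on Debian-based hosts.
+
+```bash
+# check_download_dir DIR
+# Succeeds only if DIR exists and contains at least one non-empty regular file.
+check_download_dir() {
+  local dir="$1"
+  if [ ! -d "$dir" ]; then
+    echo "missing directory: $dir" >&2
+    return 1
+  fi
+  # -size +0c matches files larger than zero bytes; -quit stops at the first hit.
+  if [ -z "$(find "$dir" -type f -size +0c -print -quit)" ]; then
+    echo "no non-empty files in: $dir" >&2
+    return 1
+  fi
+  echo "ok: $dir"
+}
+
+# Usage (after a download): check_download_dir prelinger_films
+```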
+
+---
+
+## 2. Filtered Downloads
+
+### By glob pattern
+
+Download only files matching a shell glob pattern.
+
+```bash
+# Only PDFs
+ia download <identifier> --glob="*.pdf"
+
+# Only MP4 video files
+ia download <identifier> --glob="*.mp4"
+
+# Multiple patterns (pipe-separated)
+ia download <identifier> --glob="*.pdf|*.epub"
+```
+
+### With exclusions
+
+Exclude patterns require `--glob` to also be set.
+
+```bash
+# All MP4s except low-quality variants
+ia download <identifier> --glob="*.mp4" --exclude="*512kb*"
+
+# All files except metadata/review XMLs
+ia download <identifier> --glob="*" --exclude="*_meta.xml|*_reviews.xml|*_files.xml"
+
+# Multiple exclusions
+ia download <identifier> --glob="*.mp4" --exclude="*512kb*|*_thumb*"
+```
+
+### By format name
+
+Download files of a specific archive.org format (as shown by `ia metadata --formats`).
+
+```bash
+# Check available formats first
+ia metadata <identifier> --formats
+
+# Download only a specific format
+ia download <identifier> --format="512Kb MPEG4"
+ia download <identifier> --format="PDF"
+ia download <identifier> --format="EPUB"
+```
+
+**Note:** `--format` is incompatible with `--glob` and `--exclude`. Use one approach or the other.
+
+### On-the-fly formats
+
+Some formats (EPUB, MOBI, DAISY, MARCXML) are generated on demand.
+
+```bash
+ia download <identifier> --on-the-fly --format="EPUB"
+```
+
+---
+
+## 3. Download Options
+
+### Control output location
+
+```bash
+# Download to a specific directory
+ia download <identifier> --destdir=/mnt/archive/downloads/
+
+# Flatten directory structure (no subdirectory per item)
+ia download <identifier> --no-directories
+```
+
+### Resume interrupted downloads
+
+```bash
+# Resume — skips files that already exist and match checksum
+ia download <identifier> --checksum
+
+# Checksum mode compares MD5 hashes — safe to re-run
+```
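+
+To spot-check a single file by hand, the same MD5 comparison can be sketched as below. `verify_md5` is a hypothetical helper; it assumes `md5sum` (GNU coreutils) and bash 4 or later, and that the expected hash is read from the `md5` field of the item's file metadata.
+
+```bash
+# verify_md5 FILE EXPECTED_MD5
+# Succeeds if the MD5 of FILE equals EXPECTED_MD5 (hex, case-insensitive).
+verify_md5() {
+  local file="$1" expected="$2" actual
+  actual=$(md5sum "$file" | awk '{print $1}')
+  # Lowercase both sides so the comparison ignores hex case.
+  if [ "${actual,,}" = "${expected,,}" ]; then
+    echo "OK $file"
+  else
+    echo "MISMATCH $file (expected $expected, got $actual)" >&2
+    return 1
+  fi
+}
+
+# Usage (assumes jq is installed; <identifier> and <file> are placeholders):
+#   expected=$(ia metadata <identifier> | jq -r '.files[] | select(.name == "<file>") | .md5')
+#   verify_md5 <identifier>/<file> "$expected"
+```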
+
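+### File timestamps
+
+```bash
+# By default, ia sets each downloaded file's mtime to its archive.org timestamp.
+# --no-change-timestamp turns that off, so files keep the local download time.
+ia download <identifier> --no-change-timestamp
+```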
+
+### Dry run
+
+```bash
+# See what would be downloaded without actually downloading
+ia download <identifier> --dry-run
+```
+
+---
+
+## 4. Bulk Download from Search Results
+
+Pipe search results directly into download. This is the primary method for downloading multiple items.
+
+### Basic pattern
+
+```bash
+# Search → itemlist → download
+ia search 'collection:prelinger mediatype:movies' --itemlist | \
+  ia download --itemlist -
+
+# The - tells ia download to read identifiers from stdin
+```
+
+### With filters
+
+```bash
+# Download only PDFs from all items in a collection
+ia search 'collection:arrl_qst' --itemlist | \
+  ia download --itemlist - --glob="*.pdf"
+
+# Download only MP3s from an audio collection
+ia search 'collection:librivoxaudio' --itemlist | \
+  ia download --itemlist - --glob="*.mp3"
+```
+
+### With destination directory
+
+```bash
+# Download to a specific location
+ia search 'collection:prelinger' --itemlist | \
+  ia download --itemlist - --destdir=/mnt/archive/prelinger/
+```
+
+### Save itemlist for reuse
+
+When a search is large, save the itemlist first so you can resume without re-searching.
+
+```bash
+# Step 1: Save itemlist
+ia search 'collection:prelinger mediatype:movies' --itemlist > prelinger-items.txt
+
+# Step 2: Check count
+wc -l prelinger-items.txt
+
+# Step 3: Download from file
+ia download --itemlist prelinger-items.txt --glob="*.mp4"
+
+# Step 4: Resume if interrupted (just re-run with --checksum)
+ia download --itemlist prelinger-items.txt --glob="*.mp4" --checksum
+```
+
+---
+
+## 5. Bulk Download with GNU Parallel
+
+For faster bulk downloads, use GNU Parallel for concurrent item downloads.
+
+```bash
+# Install parallel if not present
+sudo apt install -y parallel
+
+# Download 5 items concurrently
+ia search 'collection:prelinger' --itemlist | \
+  parallel -j5 'ia download {} --glob="*.mp4"'
+
+# With destination directory
+ia search 'collection:prelinger' --itemlist | \
+  parallel -j5 'ia download {} --glob="*.mp4" --destdir=/mnt/archive/prelinger/'
+
+# From saved itemlist
+parallel -j5 'ia download {} --glob="*.pdf"' < items.txt
+```
+
+**Caution:** Be respectful of archive.org bandwidth. 3-5 concurrent downloads is reasonable. Higher parallelism may trigger rate limiting.
+
+---
+
+## 6. Mirror an Entire Collection
+
+Mirroring means downloading every item in a collection and being able to re-run the same job later to pick up new additions.
+
+### Initial mirror
+
+```bash
+# Step 1: Create working directory
+mkdir -p /mnt/archive/<collection-name>
+cd /mnt/archive/<collection-name>
+
+# Step 2: Generate itemlist
+ia search 'collection:<collection-name>' --itemlist > itemlist.txt
+echo "Found $(wc -l < itemlist.txt) items"
+
+# Step 3: Download all items (adjust --glob as needed)
+ia download --itemlist itemlist.txt --destdir=/mnt/archive/<collection-name>/
+
+# Or with format filter
+ia download --itemlist itemlist.txt --glob="*.pdf" --destdir=/mnt/archive/<collection-name>/
+```
+
+### Update an existing mirror
+
+Re-run the same commands. Use `--checksum` to skip already-downloaded files.
+
+```bash
+cd /mnt/archive/<collection-name>
+
+# Refresh itemlist (new items since last run)
+ia search 'collection:<collection-name>' --itemlist > itemlist-new.txt
+
+# Download only new/changed files
+ia download --itemlist itemlist-new.txt --checksum --destdir=/mnt/archive/<collection-name>/
+```
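+
+If only the items added since the last run are of interest, the two itemlists can be compared first. This is a minimal sketch, assuming the previous list was kept as `itemlist.txt`; `new_items` and `itemlist-delta.txt` are hypothetical names.
+
+```bash
+# new_items OLD_LIST NEW_LIST
+# Prints identifiers present in NEW_LIST but not in OLD_LIST, one per line.
+new_items() {
+  local old="$1" new="$2"
+  # comm needs sorted input; -13 hides lines unique to OLD and lines common to both.
+  comm -13 <(sort -u "$old") <(sort -u "$new")
+}
+
+# Usage:
+#   new_items itemlist.txt itemlist-new.txt > itemlist-delta.txt
+#   ia download --itemlist itemlist-delta.txt --destdir=/mnt/archive/<collection-name>/
+```
+
+This only detects new identifiers. Files that changed inside existing items still need the full `--checksum` pass.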
+
+### Mirror with a script
+
+For recurring mirrors, create a simple script:
+
+```bash
+#!/bin/bash
+# mirror-collection.sh <collection-name> [glob-pattern]
+set -euo pipefail
+
+# Require a collection name; the glob defaults to every file.
+if [ "$#" -lt 1 ]; then
+  echo "Usage: $0 <collection-name> [glob-pattern]" >&2
+  exit 1
+fi
+
+COLLECTION="$1"
+GLOB="${2:-*}"
+DEST="/mnt/archive/$COLLECTION"
+
+mkdir -p "$DEST"
+
+echo "Refreshing itemlist for $COLLECTION..."
+ia search "collection:$COLLECTION" --itemlist > "$DEST/itemlist.txt"
+COUNT=$(wc -l < "$DEST/itemlist.txt")
+echo "Found $COUNT items"
+
+echo "Downloading (glob: $GLOB)..."
+ia download --itemlist "$DEST/itemlist.txt" --glob="$GLOB" --checksum --destdir="$DEST/"
+
+echo "Mirror complete: $DEST"
+
+Usage:
+
+```bash
+chmod +x mirror-collection.sh
+./mirror-collection.sh arrl_qst "*.pdf"
+./mirror-collection.sh prelinger "*.mp4"
+```
+
+---
+
+## 7. Practical Patterns
+
+### Download all PDFs from a collection
+
+```bash
+ia search 'collection:arrl_qst' --itemlist | \
+  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-qst/
+```
+
+### Download specific media types from a collection
+
+```bash
+# One video format only (confirm the exact name with `ia metadata <identifier> --formats`)
+ia search 'collection:prelinger' --itemlist | \
+  ia download --itemlist - --format="MPEG4" --destdir=/mnt/archive/prelinger-video/
+
+# 64 kbps MP3 audio only
+ia search 'collection:librivoxaudio creator:"Mark Twain"' --itemlist | \
+  ia download --itemlist - --glob="*64kb*.mp3" --destdir=/mnt/archive/twain-audio/
+```
+
+### Download items matching a date range
+
+```bash
+ia search 'collection:arrl_qst date:[1950-01-01 TO 1959-12-31]' --itemlist | \
+  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-1950s/
+```
+
+### Download a single specific file from an item
+
+```bash
+# List files first
+ia list <identifier>
+
+# Download just one file
+ia download <identifier> specific-file.pdf
+```
+
+### Download and preserve directory structure
+
+```bash
+# Default behavior — each item gets its own subdirectory
+ia download --itemlist items.txt --destdir=/mnt/archive/output/
+# Result: /mnt/archive/output/<identifier1>/files...
+#         /mnt/archive/output/<identifier2>/files...
+```
+
+### Pipe a single file to stdout
+
+```bash
+# Stream a file without saving to disk (best suited to text formats)
+ia download <identifier> <identifier>_meta.xml --stdout | less
+ia download <identifier> data.json --stdout | jq .
+```
+
+---
+
+## 8. Storage Planning
+
+Before large downloads, estimate storage requirements.
+
+```bash
+# Count items in collection
+ia search 'collection:<name>' --num-found
+
+# Check a sample item's size
+ia metadata <sample-identifier> | jq '[.files[] | (.size // "0") | tonumber] | add / 1048576 | floor'
+# Output in MB (file entries that lack a size field count as 0 instead of aborting jq)
+
+# Check available storage on pi-nas
+df -h /mnt/
+```
+
+### Rule of thumb
+
+- Text collections (PDFs, EPUBs): ~10-100 MB per item
+- Audio collections: ~100 MB - 1 GB per item
+- Video collections: ~1-10 GB per item
+- Software archives: highly variable
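+
+The item count and a per-item figure combine into a rough total that can be compared against free space. The sketch below assumes a per-item size in whole MB, taken from the sample-item check or the ranges above; `estimate_total_gb` and `available_gb` are hypothetical helper names.
+
+```bash
+# estimate_total_gb ITEM_COUNT MB_PER_ITEM
+# Prints the estimated total in whole GB, rounded up.
+estimate_total_gb() {
+  local count="$1" mb_per_item="$2"
+  echo $(( (count * mb_per_item + 1023) / 1024 ))
+}
+
+# available_gb PATH
+# Prints free space on the filesystem holding PATH, in whole GB (rounded down).
+available_gb() {
+  # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
+  df -Pk "$1" | awk 'NR==2 {print int($4 / 1048576)}'
+}
+
+# Usage: 2500 items at roughly 300 MB each, checked against /mnt/
+#   need=$(estimate_total_gb 2500 300)
+#   have=$(available_gb /mnt/)
+#   [ "$need" -lt "$have" ] || echo "not enough space: need ${need} GB, have ${have} GB"
+```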
+
+---
+
+## Troubleshooting
+
+### Download hangs or stalls
+
+```bash
+# Kill and resume with checksum verification
+# Ctrl+C to stop, then re-run with --checksum
+ia download --itemlist items.txt --glob="*.pdf" --checksum
+```
+
+### "Item not found" errors in bulk download
+
+Some items in a collection may be restricted or taken down. Each such item fails on its own while the rest of the batch continues; review the command output for the identifiers that failed.
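+
+After a batch finishes, one way to find items that produced nothing is to compare the itemlist against the destination. This sketch assumes the default layout (one subdirectory per identifier, no `--no-directories`); `missing_items` is a hypothetical helper.
+
+```bash
+# missing_items ITEMLIST DESTDIR
+# Prints each identifier from ITEMLIST that has no directory under DESTDIR.
+missing_items() {
+  local itemlist="$1" destdir="$2" id
+  # The "|| [ -n ... ]" clause also handles a final line without a trailing newline.
+  while IFS= read -r id || [ -n "$id" ]; do
+    [ -z "$id" ] && continue          # skip blank lines
+    [ -d "$destdir/$id" ] || echo "$id"
+  done < "$itemlist"
+}
+
+# Usage: missing_items items.txt /mnt/archive/output/ > failed-items.txt
+```
+
+With `--glob` or `--format`, an item may also be listed simply because none of its files matched the filter.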
+
+### Disk full during bulk download
+
+```bash
+# Check what's using space
+du -sh /mnt/archive/*/ | sort -rh | head -20
+
+# Resume after freeing space — checksum mode skips completed files
+ia download --itemlist items.txt --checksum
+```
+
+### Rate limiting / 429 errors
+
+Archive.org may throttle aggressive downloads.
+
+- Reduce parallel jobs (if using GNU Parallel)
+- Add delays between items: `parallel -j2 --delay 5 'ia download {}' < items.txt`
+- Wait and retry later
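+
+"Wait and retry later" can be automated with an exponential backoff wrapper. The sketch below is an illustration only: `retry_with_backoff` and the `BACKOFF_BASE` variable are hypothetical, and it assumes `ia download` exits non-zero when a download fails.
+
+```bash
+# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
+# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the wait each time.
+# BACKOFF_BASE (seconds, default 30) is a hypothetical tuning knob, not an ia setting.
+retry_with_backoff() {
+  local max="$1" attempt=1 delay="${BACKOFF_BASE:-30}"
+  shift
+  until "$@"; do
+    if [ "$attempt" -ge "$max" ]; then
+      echo "giving up after $attempt attempts: $*" >&2
+      return 1
+    fi
+    echo "attempt $attempt failed; retrying in ${delay}s" >&2
+    sleep "$delay"
+    attempt=$((attempt + 1))
+    delay=$((delay * 2))
+  done
+}
+
+# Usage: retry_with_backoff 5 ia download <identifier> --checksum
+```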
+
+### Corrupt downloads
+
+```bash
+# Re-download with checksum verification — replaces corrupt files
+ia download <identifier> --checksum
+```
+
+### Permission denied on destination
+
+```bash
+# Ensure the download user (and its primary group) owns the target directory
+sudo chown -R "$(id -un)":"$(id -gn)" /mnt/archive/
+```
+
+---
+
+## Checklist: Collection Mirror
+
+```
+[ ] Identify collection identifier on archive.org
+[ ] Check available storage (df -h)
+[ ] Estimate collection size (--num-found + sample item size)
+[ ] Generate itemlist (ia search --itemlist > itemlist.txt)
+[ ] Review itemlist count (wc -l itemlist.txt)
+[ ] Start download with appropriate filters (--glob, --format)
+[ ] Verify downloaded files exist and are non-zero
+[ ] If interrupted, resume with --checksum
+[ ] Record collection details in project notes
+```
+
+---
+
+*Last updated: 2026-02-14*