Migration: consolidate Echo6 docs to cortex with full infrastructure cleanup sync
- Documents recent infrastructure cleanup (8 CTs destroyed, 35 DNS records removed, Headscale cleanup)
- Adds 24 new runbooks covering Authentik, PeerTube, Meshtastic, RECON, Proxmox, Mailcow, Internet Archive, GPU routing
- Adds project documentation for headscale, vaultwarden, peertube, matrix, mmud, advbbs, arr stack
- Updates services.md, environment.md, caddy.md, authentik.md to match live infrastructure
- Removes 4 deprecated runbook duplicates (canonical versions live in projects/)
- Adds .gitignore for binary archives and editor temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 89834796ff
commit e9231ac24a
93 changed files with 51223 additions and 254 deletions
429
runbooks/ia-download-mirror.md
Normal file
@@ -0,0 +1,429 @@
# Download & Mirror from Internet Archive

Procedures for downloading items, filtering by format/pattern, bulk downloading from collections, and mirroring entire collections via the `ia` CLI on pi-nas.

---

## Prerequisites

- `ia` CLI v5.7.2 installed on pi-nas (192.168.1.245)
- Authenticated if downloading restricted items: `ia configure`
- Sufficient storage on pi-nas (check with `df -h`)
- Reference: `ia-cli-reference.md` for search/query syntax

---

## 1. Download a Single Item

An "item" is a logical unit on archive.org identified by its identifier (visible in the URL: `archive.org/details/<identifier>`).

```bash
# Download all files in an item to ./<identifier>/
ia download <identifier>

# Example
ia download prelinger_films
```

The default creates a directory named after the identifier containing all files (originals + derivatives).

### Gate

Verify the download directory exists and has files:

```bash
ls -la <identifier>/
```
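
The gate above can also be scripted so it fails loudly inside a larger job. The following is a minimal sketch, assuming bash and standard `find`; `check_item_dir` is a hypothetical helper name and not part of the `ia` CLI. It treats zero-byte files as a failure, on the assumption that they usually indicate an interrupted transfer. Call it as `check_item_dir <identifier>` after a download.

```bash
# Sketch: fail if an item directory is missing, empty, or holds zero-byte files.
# check_item_dir is a hypothetical helper, not an ia subcommand.
check_item_dir() {
  local dir="$1" total empty
  [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
  # Count regular files anywhere under the item directory.
  total=$(( $(find "$dir" -type f | wc -l) ))
  [ "$total" -gt 0 ] || { echo "empty: $dir" >&2; return 1; }
  # Count files with a size of exactly zero bytes.
  empty=$(( $(find "$dir" -type f -size 0 | wc -l) ))
  [ "$empty" -eq 0 ] || { echo "$empty zero-byte file(s) in $dir" >&2; return 1; }
  echo "ok: $total file(s) in $dir"
}
```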

---

## 2. Filtered Downloads

### By glob pattern

Download only files matching a shell glob pattern.

```bash
# Only PDFs
ia download <identifier> --glob="*.pdf"

# Only MP4 video files
ia download <identifier> --glob="*.mp4"

# Multiple patterns (pipe-separated)
ia download <identifier> --glob="*.pdf|*.epub"
```

### With exclusions

Exclude patterns require `--glob` to also be set.

```bash
# All MP4s except low-quality variants
ia download <identifier> --glob="*.mp4" --exclude="*512kb*"

# All files except metadata/review XMLs
ia download <identifier> --glob="*" --exclude="*_meta.xml|*_reviews.xml|*_files.xml"

# Multiple exclusions
ia download <identifier> --glob="*.mp4" --exclude="*512kb*|*_thumb*"
```
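
Before committing to a large download, it can help to preview which filenames a pipe-separated glob and exclude pair would keep. The sketch below approximates that selection locally with bash pattern matching. It is an illustration only: `matches_any` and `select_files` are hypothetical helpers, and bash matching may differ from what `ia` does internally in edge cases such as case sensitivity.

```bash
# Sketch only: approximate --glob / --exclude selection with bash patterns.
# matches_any and select_files are hypothetical helpers, not ia internals.

# matches_any NAME "pat1|pat2|...": succeed if NAME matches any pattern.
matches_any() {
  local name="$1" pat
  local -a pats
  IFS='|' read -r -a pats <<< "$2"
  for pat in "${pats[@]}"; do
    # Unquoted right-hand side makes [[ ]] treat $pat as a glob pattern.
    [[ $name == $pat ]] && return 0
  done
  return 1
}

# select_files GLOB [EXCLUDE]: filter filenames read from stdin, one per line.
select_files() {
  local glob="$1" exclude="${2:-}" name
  while IFS= read -r name; do
    matches_any "$name" "$glob" || continue
    if [ -n "$exclude" ] && matches_any "$name" "$exclude"; then
      continue
    fi
    printf '%s\n' "$name"
  done
}
```

For example, `ia list <identifier> | select_files "*.mp4" "*512kb*"` would print the filenames this sketch expects to survive the filters; `ia download <identifier> --dry-run` with the same flags remains the authoritative check.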

### By format name

Download files of a specific archive.org format (as shown by `ia metadata --formats`).

```bash
# Check available formats first
ia metadata <identifier> --formats

# Download only a specific format
ia download <identifier> --format="512Kb MPEG4"
ia download <identifier> --format="PDF"
ia download <identifier> --format="EPUB"
```

**Note:** `--format` is incompatible with `--glob` and `--exclude`. Use one approach or the other.

### On-the-fly formats

Some formats (EPUB, MOBI, DAISY, MARCXML) are generated on demand.

```bash
ia download <identifier> --on-the-fly --format="EPUB"
```

---

## 3. Download Options

### Control output location

```bash
# Download to a specific directory
ia download <identifier> --destdir=/mnt/archive/downloads/

# Flatten directory structure (no subdirectory per item)
ia download <identifier> --no-directories
```

### Resume interrupted downloads

```bash
# Resume: skips files that already exist and match checksum
ia download <identifier> --checksum

# Checksum mode compares MD5 hashes, so it is safe to re-run
```

### Preserve timestamps

```bash
# By default, downloaded files get the source timestamps from archive.org
ia download <identifier>
# --no-change-timestamp opts out and leaves mtimes at local download time
ia download <identifier> --no-change-timestamp
```

### Dry run

```bash
# See what would be downloaded without actually downloading
ia download <identifier> --dry-run
```

---

## 4. Bulk Download from Search Results

Pipe search results directly into download. This is the primary method for downloading multiple items.

### Basic pattern

```bash
# Search → itemlist → download
ia search 'collection:prelinger mediatype:movies' --itemlist | \
  ia download --itemlist -

# The - tells ia download to read identifiers from stdin
```

### With filters

```bash
# Download only PDFs from all items in a collection
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf"

# Download only MP3s from an audio collection
ia search 'collection:librivoxaudio' --itemlist | \
  ia download --itemlist - --glob="*.mp3"
```

### With destination directory

```bash
# Download to a specific location
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --destdir=/mnt/archive/prelinger/
```

### Save itemlist for reuse

When a search is large, save the itemlist first so you can resume without re-searching.

```bash
# Step 1: Save itemlist
ia search 'collection:prelinger mediatype:movies' --itemlist > prelinger-items.txt

# Step 2: Check count
wc -l prelinger-items.txt

# Step 3: Download from file
ia download --itemlist prelinger-items.txt --glob="*.mp4"

# Step 4: Resume if interrupted (just re-run with --checksum)
ia download --itemlist prelinger-items.txt --glob="*.mp4" --checksum
```

---

## 5. Bulk Download with GNU Parallel

For faster bulk downloads, use GNU Parallel for concurrent item downloads.

```bash
# Install parallel if not present
sudo apt install -y parallel

# Download 5 items concurrently
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4"'

# With destination directory
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4" --destdir=/mnt/archive/prelinger/'

# From saved itemlist
parallel -j5 'ia download {} --glob="*.pdf"' < items.txt
```

**Caution:** Be respectful of archive.org bandwidth. 3 to 5 concurrent downloads is reasonable. Higher parallelism may trigger rate limiting.

---

## 6. Mirror an Entire Collection

Mirroring means downloading everything and being able to re-run to pick up new additions.

### Initial mirror

```bash
# Step 1: Create working directory
mkdir -p /mnt/archive/<collection-name>
cd /mnt/archive/<collection-name>

# Step 2: Generate itemlist
ia search 'collection:<collection-name>' --itemlist > itemlist.txt
echo "Found $(wc -l < itemlist.txt) items"

# Step 3: Download all items (adjust --glob as needed)
ia download --itemlist itemlist.txt --destdir=/mnt/archive/<collection-name>/

# Or with format filter
ia download --itemlist itemlist.txt --glob="*.pdf" --destdir=/mnt/archive/<collection-name>/
```

### Update an existing mirror

Re-run the same commands. Use `--checksum` to skip already-downloaded files.

```bash
cd /mnt/archive/<collection-name>

# Refresh itemlist (new items since last run)
ia search 'collection:<collection-name>' --itemlist > itemlist-new.txt

# Download only new/changed files
ia download --itemlist itemlist-new.txt --checksum --destdir=/mnt/archive/<collection-name>/
```
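
To see what a refresh actually added before downloading, the old and new itemlists can be compared. This is a small sketch, assuming bash (for process substitution) and `comm`; `new_items` is a hypothetical helper name. It reports only identifiers that are new to the collection, so it does not replace the `--checksum` pass, which also catches changed files inside existing items.

```bash
# Sketch: print identifiers present in the NEW itemlist but not in the OLD one.
# new_items is a hypothetical helper; both inputs are one identifier per line.
new_items() {
  local old="$1" new="$2"
  # comm needs sorted input; -13 keeps only lines unique to the second file.
  comm -13 <(sort -u "$old") <(sort -u "$new")
}
```

For example, `new_items itemlist.txt itemlist-new.txt | wc -l` gives the number of items added since the previous run.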

### Mirror with a script

For recurring mirrors, create a simple script:

```bash
#!/bin/bash
# mirror-collection.sh <collection-name> [glob-pattern]

# Abort with a usage message if no collection name is given
COLLECTION="${1:?usage: mirror-collection.sh <collection-name> [glob-pattern]}"
GLOB="${2:-*}"
DEST="/mnt/archive/$COLLECTION"

mkdir -p "$DEST"

echo "Refreshing itemlist for $COLLECTION..."
ia search "collection:$COLLECTION" --itemlist > "$DEST/itemlist.txt"
COUNT=$(wc -l < "$DEST/itemlist.txt")
echo "Found $COUNT items"

echo "Downloading (glob: $GLOB)..."
ia download --itemlist "$DEST/itemlist.txt" --glob="$GLOB" --checksum --destdir="$DEST/"

echo "Mirror complete: $DEST"
```

Usage:

```bash
chmod +x mirror-collection.sh
./mirror-collection.sh arrl_qst "*.pdf"
./mirror-collection.sh prelinger "*.mp4"
```
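
If the mirror script is run on a schedule, two runs for the same collection could overlap when a download takes longer than the interval. The sketch below guards against that with an atomic `mkdir` lock. It assumes scheduled use, which the runbook does not specify; `run_locked` and the lock path are hypothetical. A lock left behind by a killed run has to be removed by hand. A call might look like `run_locked /tmp/mirror-arrl_qst.lock ./mirror-collection.sh arrl_qst "*.pdf"`.

```bash
# Sketch: run a command only if no other run holds the same lock directory.
# run_locked is a hypothetical helper; mkdir is atomic, so only one caller wins.
run_locked() {
  local lockdir="$1" status=0
  shift
  if ! mkdir "$lockdir" 2>/dev/null; then
    echo "another run holds $lockdir, skipping" >&2
    return 75
  fi
  # Run the command, remember its exit status, then release the lock.
  "$@" || status=$?
  rmdir "$lockdir"
  return "$status"
}
```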

---

## 7. Practical Patterns

### Download all PDFs from a collection

```bash
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-qst/
```

### Download specific media types from a collection

```bash
# High-quality video only
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --format="MPEG4" --destdir=/mnt/archive/prelinger-video/

# Audio in MP3 format
ia search 'collection:librivoxaudio creator:"Mark Twain"' --itemlist | \
  ia download --itemlist - --glob="*64kb*.mp3" --destdir=/mnt/archive/twain-audio/
```

### Download items matching a date range

```bash
ia search 'collection:arrl_qst date:[1950-01-01 TO 1959-12-31]' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-1950s/
```

### Download a single specific file from an item

```bash
# List files first
ia list <identifier>

# Download just one file
ia download <identifier> specific-file.pdf
```

### Download and preserve directory structure

```bash
# Default behavior: each item gets its own subdirectory
ia download --itemlist items.txt --destdir=/mnt/archive/output/
# Result: /mnt/archive/output/<identifier1>/files...
#         /mnt/archive/output/<identifier2>/files...
```

### Pipe a single file to stdout

```bash
# Stream a file without saving to disk
ia download <identifier> specific-file.pdf --stdout | less
ia download <identifier> data.json --stdout | jq .
```

---

## 8. Storage Planning

Before large downloads, estimate storage requirements.

```bash
# Count items in collection
ia search 'collection:<name>' --num-found

# Check a sample item's size
ia metadata <sample-identifier> | jq '[.files[] | (.size // "0") | tonumber] | add / 1048576 | floor'
# Output in MB; files without a size field count as 0

# Check available storage on pi-nas
df -h /mnt/
```

### Rule of thumb

- Text collections (PDFs, EPUBs): ~10-100 MB per item
- Audio collections: ~100 MB - 1 GB per item
- Video collections: ~1-10 GB per item
- Software archives: highly variable
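
The item count and a sample item size can be combined into a rough total and compared with free space. The sketch below is back-of-the-envelope arithmetic only: `estimate_gb` and `free_gb` are hypothetical helpers, and a single sample item may be far from the collection average. For instance, `estimate_gb 2400 50` estimates 2,400 items at about 50 MB each, and `free_gb /mnt/` reports what is available.

```bash
# Sketch: rough storage estimate from an item count and a sample item size.
# estimate_gb and free_gb are hypothetical helpers, not ia features.

# estimate_gb COUNT AVG_MB: estimated total in GB (1 GB = 1024 MB here).
estimate_gb() {
  awk -v n="$1" -v mb="$2" 'BEGIN { printf "%.1f\n", n * mb / 1024 }'
}

# free_gb PATH: free space on the filesystem holding PATH, in whole GB.
free_gb() {
  # df -Pk prints POSIX-format output in 1 KB blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}
```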

---

## Troubleshooting

### Download hangs or stalls

```bash
# Kill and resume with checksum verification
# Ctrl+C to stop, then re-run with --checksum
ia download --itemlist items.txt --glob="*.pdf" --checksum
```

### "Item not found" errors in bulk download

Some items in a collection may be restricted or taken down. These will fail individually but the batch continues. Check errors in output.
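
One way to find the items that failed, without parsing the command output, is to compare the itemlist against the directories that exist after the batch. The sketch below assumes the default one-directory-per-item layout; `missing_items` is a hypothetical helper. A missing directory is only a hint: the item may be restricted or removed, or the `--glob` filter may simply have matched nothing in it.

```bash
# Sketch: list identifiers from an itemlist that have no directory in DESTDIR.
# missing_items is a hypothetical helper; it assumes one directory per item.
missing_items() {
  local itemlist="$1" destdir="$2" id
  # The "|| [ -n "$id" ]" handles a final line without a trailing newline.
  while IFS= read -r id || [ -n "$id" ]; do
    [ -n "$id" ] || continue   # skip blank lines
    [ -d "$destdir/$id" ] || printf '%s\n' "$id"
  done < "$itemlist"
}
```

For example, `missing_items items.txt /mnt/archive/output/ > retry.txt` produces a list that can be inspected on archive.org or passed back to `ia download --itemlist retry.txt`.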

### Disk full during bulk download

```bash
# Check what's using space
du -sh /mnt/archive/*/ | sort -rh | head -20

# Resume after freeing space; checksum mode skips completed files
ia download --itemlist items.txt --checksum
```

### Rate limiting / 429 errors

Archive.org may throttle aggressive downloads.

- Reduce parallel jobs (if using GNU Parallel)
- Add delays between items: `parallel -j2 --delay 5 'ia download {}' < items.txt`
- Wait and retry later
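
"Wait and retry later" can be automated with a retry wrapper that doubles the wait after each failure. This is a generic sketch, not an `ia` feature: `retry_backoff` is a hypothetical helper, and it assumes the wrapped command exits non-zero when it is throttled or otherwise fails.

```bash
# Sketch: retry a command with exponential backoff.
# retry_backoff MAX_TRIES BASE_DELAY_SECONDS CMD... (hypothetical helper)
retry_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))        # double the wait before the next try
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_backoff 4 60 ia download --itemlist items.txt --checksum` would wait 60, 120, then 240 seconds between up to four attempts; because of `--checksum`, each retry skips files that already completed.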

### Corrupt downloads

```bash
# Re-download with checksum verification; replaces corrupt files
ia download <identifier> --checksum
```

### Permission denied on destination

```bash
# Ensure the download user (and its primary group) owns the target directory
sudo chown -R "$(id -un):$(id -gn)" /mnt/archive/
```

---

## Checklist: Collection Mirror

```
[ ] Identify collection identifier on archive.org
[ ] Check available storage (df -h)
[ ] Estimate collection size (--num-found + sample item size)
[ ] Generate itemlist (ia search --itemlist > itemlist.txt)
[ ] Review itemlist count (wc -l itemlist.txt)
[ ] Start download with appropriate filters (--glob, --format)
[ ] Verify downloaded files exist and are non-zero
[ ] If interrupted, resume with --checksum
[ ] Record collection details in project notes
```

---

*Last updated: 2026-02-14*