# Download & Mirror from Internet Archive

Procedures for downloading items, filtering by format/pattern, bulk downloading from collections, and mirroring entire collections via the `ia` CLI on pi-nas.

---

## Prerequisites

- `ia` CLI installed on pi-nas (192.168.1.245) — v5.7.2
- Authenticated if downloading restricted items: `ia configure`
- Sufficient storage on pi-nas (check with `df -h`)
- Reference: `ia-cli-reference.md` for search/query syntax

---

## 1. Download a Single Item

An "item" is a logical unit on archive.org identified by its identifier (visible in the URL: `archive.org/details/<identifier>`).

```bash
# Download all files in an item to ./<identifier>/
ia download <identifier>

# Example
ia download prelinger_films
```

By default, `ia download` creates a directory named after the identifier and downloads every file in the item into it (originals and derivatives).

### Gate

Verify the download directory exists and has files:

```bash
ls -la <identifier>/
```
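
For unattended runs, the gate can be scripted so that it fails with a non-zero status. The sketch below is a hypothetical helper (`gate_item_dir` is not part of the `ia` CLI). It assumes an item counts as downloaded when its directory exists and holds at least one file, and it only warns about zero-byte files because some items may legitimately contain them.

```bash
# Hypothetical helper, not part of the ia CLI.
# Usage: gate_item_dir <identifier-directory>
gate_item_dir() {
  local dir="$1" count
  if [ ! -d "$dir" ]; then
    echo "FAIL: $dir does not exist" >&2
    return 1
  fi
  # Count regular files at any depth; strip padding some wc versions add.
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "FAIL: $dir contains no files" >&2
    return 1
  fi
  # Zero-byte files can indicate an interrupted transfer; report them as warnings.
  find "$dir" -type f -size 0 -exec printf 'WARN: empty file %s\n' {} \; >&2
  echo "OK: $dir contains $count file(s)"
}
```

Example: `ia download <identifier> && gate_item_dir <identifier>`.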

---

## 2. Filtered Downloads

### By glob pattern

Download only files matching a shell glob pattern.

```bash
# Only PDFs
ia download <identifier> --glob="*.pdf"

# Only MP4 video files
ia download <identifier> --glob="*.mp4"

# Multiple patterns (pipe-separated)
ia download <identifier> --glob="*.pdf|*.epub"
```

### With exclusions

Exclude patterns require `--glob` to also be set.

```bash
# All MP4s except low-quality variants
ia download <identifier> --glob="*.mp4" --exclude="*512kb*"

# All files except metadata/review XMLs
ia download <identifier> --glob="*" --exclude="*_meta.xml|*_reviews.xml|*_files.xml"

# Multiple exclusions
ia download <identifier> --glob="*.mp4" --exclude="*512kb*|*_thumb*"
```
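
Before starting a large download, it can help to preview which file names a `--glob` and `--exclude` pair would select. The sketch below approximates that locally with bash pattern matching. `matches_any` and `preview_filter` are hypothetical helpers, and the assumption that `ia` applies shell-style patterns in the same way is not verified here, so treat the result as an estimate.

```bash
# Hypothetical helpers that approximate --glob / --exclude matching locally.

# matches_any <name> <pipe-separated-patterns>
matches_any() {
  local name="$1" pat
  local -a pats
  # read splits on '|' without expanding the patterns against local files.
  IFS='|' read -r -a pats <<< "$2"
  for pat in "${pats[@]}"; do
    case "$name" in
      $pat) return 0 ;;   # unquoted $pat is treated as a glob pattern
    esac
  done
  return 1
}

# preview_filter <glob> [exclude]; reads file names on stdin, prints the survivors.
preview_filter() {
  local glob="$1" exclude="${2:-}" name
  while IFS= read -r name; do
    matches_any "$name" "$glob" || continue
    if [ -n "$exclude" ] && matches_any "$name" "$exclude"; then
      continue
    fi
    printf '%s\n' "$name"
  done
}
```

Example: `ia list <identifier> | preview_filter '*.mp4' '*512kb*|*_thumb*'`.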

### By format name

Download files of a specific archive.org format (as shown by `ia metadata --formats`).

```bash
# Check available formats first
ia metadata <identifier> --formats

# Download only a specific format
ia download <identifier> --format="512Kb MPEG4"
ia download <identifier> --format="PDF"
ia download <identifier> --format="EPUB"
```

**Note:** `--format` is incompatible with `--glob` and `--exclude`. Use one approach or the other.

### On-the-fly formats

Some formats (EPUB, MOBI, DAISY, MARCXML) are generated on demand.

```bash
ia download <identifier> --on-the-fly --format="EPUB"
```

---

## 3. Download Options

### Control output location

```bash
# Download to a specific directory
ia download <identifier> --destdir=/mnt/archive/downloads/

# Flatten directory structure (no subdirectory per item)
ia download <identifier> --no-directories
```
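
A destination on pi-nas may not exist yet or may not be writable by the current user. The sketch below prepares it before `--destdir` is used. `prepare_destdir` is a hypothetical helper, and it assumes `--destdir` expects an existing directory, which may vary between `ia` versions.

```bash
# Hypothetical helper: make sure a --destdir target is usable.
# Usage: prepare_destdir <directory>
prepare_destdir() {
  local dest="$1"
  mkdir -p "$dest" || return 1
  if [ ! -w "$dest" ]; then
    echo "ERROR: $dest is not writable by $(id -un)" >&2
    return 1
  fi
  # Show free space on the filesystem that holds the destination.
  df -h "$dest"
}
```

Example: `prepare_destdir /mnt/archive/downloads/ && ia download <identifier> --destdir=/mnt/archive/downloads/`.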

### Resume interrupted downloads

```bash
# Resume — skips files that already exist and match checksum
ia download <identifier> --checksum

# Checksum mode compares MD5 hashes — safe to re-run
```

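### File timestamps

```bash
# By default, downloaded files get the modification time recorded on archive.org.
# Pass --no-change-timestamp to keep the local download time instead.
ia download <identifier> --no-change-timestamp
```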

### Dry run

```bash
# See what would be downloaded without actually downloading
ia download <identifier> --dry-run
```
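
The dry-run output can be summarized to see which file types a download would fetch. The sketch below assumes `--dry-run` prints one download URL per line; `summarize_extensions` is a hypothetical helper that counts URLs by file extension.

```bash
# Hypothetical helper: count URLs (one per line on stdin) by file extension.
summarize_extensions() {
  awk -F/ '
    NF {
      # Take the last path segment, then the text after its final dot.
      n = split($NF, parts, ".")
      ext = (n > 1) ? tolower(parts[n]) : "(none)"
      count[ext]++
    }
    END { for (e in count) printf "%d %s\n", count[e], e }
  ' | sort -k1,1nr -k2,2
}
```

Example: `ia download <identifier> --dry-run | summarize_extensions`.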

---

## 4. Bulk Download from Search Results

Pipe search results directly into download. This is the primary method for downloading multiple items.

### Basic pattern

```bash
# Search → itemlist → download
ia search 'collection:prelinger mediatype:movies' --itemlist | \
  ia download --itemlist -

# The - tells ia download to read identifiers from stdin
```

### With filters

```bash
# Download only PDFs from all items in a collection
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf"

# Download only MP3s from an audio collection
ia search 'collection:librivoxaudio' --itemlist | \
  ia download --itemlist - --glob="*.mp3"
```

### With destination directory

```bash
# Download to a specific location
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --destdir=/mnt/archive/prelinger/
```

### Save itemlist for reuse

When a search is large, save the itemlist first so you can resume without re-searching.

```bash
# Step 1: Save itemlist
ia search 'collection:prelinger mediatype:movies' --itemlist > prelinger-items.txt

# Step 2: Check count
wc -l prelinger-items.txt

# Step 3: Download from file
ia download --itemlist prelinger-items.txt --glob="*.mp4"

# Step 4: Resume if interrupted (just re-run with --checksum)
ia download --itemlist prelinger-items.txt --glob="*.mp4" --checksum
```
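
A very large itemlist can be split into smaller batches so that each batch is downloaded and checked separately. This is a sketch, not part of the `ia` workflow itself: `split_itemlist` is a hypothetical helper, and the default batch size of 100 and the `batch-` prefix are arbitrary choices.

```bash
# Hypothetical helper: split an itemlist into fixed-size batch files.
# Usage: split_itemlist <itemlist> [batch-size] [prefix]
split_itemlist() {
  local list="$1" size="${2:-100}" prefix="${3:-batch-}"
  # Drop blank lines and duplicate identifiers, then write chunks of $size lines
  # named <prefix>aa, <prefix>ab, ...
  grep -v '^[[:space:]]*$' "$list" | sort -u | split -l "$size" - "$prefix"
}
```

Example: `split_itemlist prelinger-items.txt 100`, then `for f in batch-*; do ia download --itemlist "$f" --glob="*.mp4" --checksum; done`.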

---

## 5. Bulk Download with GNU Parallel

For faster bulk downloads, use GNU Parallel for concurrent item downloads.

```bash
# Install parallel if not present
sudo apt install -y parallel

# Download 5 items concurrently
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4"'

# With destination directory
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4" --destdir=/mnt/archive/prelinger/'

# From saved itemlist
parallel -j5 'ia download {} --glob="*.pdf"' < items.txt
```

**Caution:** Be respectful of archive.org bandwidth. 3-5 concurrent downloads is reasonable. Higher parallelism may trigger rate limiting.
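
Where GNU Parallel is not installed, `xargs -P` (available in GNU and BSD xargs, though not required by POSIX) gives similar concurrency. The sketch below also enforces the job ceiling from the caution above. `run_capped` and `MAX_JOBS` are hypothetical names, and the cap of 5 simply mirrors the guidance in this section.

```bash
# Hypothetical helper: run a command once per stdin line with a capped job count.
MAX_JOBS=5

# Usage: run_capped <jobs> <command...>   (each input line is appended as the last argument)
run_capped() {
  local jobs="$1"
  shift
  if [ "$jobs" -gt "$MAX_JOBS" ]; then
    echo "Capping jobs from $jobs to $MAX_JOBS" >&2
    jobs="$MAX_JOBS"
  fi
  xargs -n 1 -P "$jobs" "$@"
}
```

Example: `run_capped 3 ia download --glob="*.mp4" < items.txt`.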

---

## 6. Mirror an Entire Collection

Mirroring means downloading everything and being able to re-run to pick up new additions.

### Initial mirror

```bash
# Step 1: Create working directory
mkdir -p /mnt/archive/<collection-name>
cd /mnt/archive/<collection-name>

# Step 2: Generate itemlist
ia search 'collection:<collection-name>' --itemlist > itemlist.txt
echo "Found $(wc -l < itemlist.txt) items"

# Step 3: Download all items (adjust --glob as needed)
ia download --itemlist itemlist.txt --destdir=/mnt/archive/<collection-name>/

# Or with format filter
ia download --itemlist itemlist.txt --glob="*.pdf" --destdir=/mnt/archive/<collection-name>/
```

### Update an existing mirror

Re-run the same commands. Use `--checksum` to skip already-downloaded files.

```bash
cd /mnt/archive/<collection-name>

# Refresh itemlist (the full list, including items added since the last run)
ia search 'collection:<collection-name>' --itemlist > itemlist-new.txt

# Download only new/changed files
ia download --itemlist itemlist-new.txt --checksum --destdir=/mnt/archive/<collection-name>/
```
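
To see which identifiers were added since the previous run, the old and refreshed itemlists can be compared. The sketch below uses `comm`; `new_items` is a hypothetical helper, and it assumes the previous list was kept as `itemlist.txt`.

```bash
# Hypothetical helper: print identifiers present in the new list but not the old one.
# Usage: new_items <old-itemlist> <new-itemlist>
new_items() {
  local old="$1" new="$2"
  # comm needs sorted input; -13 keeps only lines unique to the second file.
  comm -13 <(sort -u "$old") <(sort -u "$new")
}
```

Example: `new_items itemlist.txt itemlist-new.txt > itemlist-added.txt`, then pass `itemlist-added.txt` to `ia download --itemlist`.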

### Mirror with a script

For recurring mirrors, create a simple script:

```bash
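#!/bin/bash
# mirror-collection.sh <collection-name> [glob-pattern]
# Stop on the first error, on unset variables, and on failures inside pipelines.
set -euo pipefail

if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <collection-name> [glob-pattern]" >&2
  exit 1
fi

COLLECTION="$1"
GLOB="${2:-*}"
DEST="/mnt/archive/$COLLECTION"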

mkdir -p "$DEST"

echo "Refreshing itemlist for $COLLECTION..."
ia search "collection:$COLLECTION" --itemlist > "$DEST/itemlist.txt"
COUNT=$(wc -l < "$DEST/itemlist.txt")
echo "Found $COUNT items"

echo "Downloading (glob: $GLOB)..."
ia download --itemlist "$DEST/itemlist.txt" --glob="$GLOB" --checksum --destdir="$DEST/"

echo "Mirror complete: $DEST"
```

Usage:

```bash
chmod +x mirror-collection.sh
./mirror-collection.sh arrl_qst "*.pdf"
./mirror-collection.sh prelinger "*.mp4"
```

---

## 7. Practical Patterns

### Download all PDFs from a collection

```bash
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-qst/
```

### Download specific media types from a collection

```bash
# High-quality video only
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --format="MPEG4" --destdir=/mnt/archive/prelinger-video/

# Audio in MP3 format
ia search 'collection:librivoxaudio creator:"Mark Twain"' --itemlist | \
  ia download --itemlist - --glob="*64kb*.mp3" --destdir=/mnt/archive/twain-audio/
```

### Download items matching a date range

```bash
ia search 'collection:arrl_qst date:[1950-01-01 TO 1959-12-31]' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-1950s/
```

### Download a single specific file from an item

```bash
# List files first
ia list <identifier>

# Download just one file
ia download <identifier> specific-file.pdf
```

### Download and preserve directory structure

```bash
# Default behavior — each item gets its own subdirectory
ia download --itemlist items.txt --destdir=/mnt/archive/output/
# Result: /mnt/archive/output/<identifier1>/files...
#         /mnt/archive/output/<identifier2>/files...
```

### Pipe a single file to stdout

```bash
# Stream a file without saving to disk
ia download <identifier> specific-file.pdf --stdout | less
ia download <identifier> data.json --stdout | jq .
```

---

## 8. Storage Planning

Before large downloads, estimate storage requirements.

```bash
# Count items in collection
ia search 'collection:<name>' --num-found

# Check a sample item's size in MB (files without a size field count as 0)
ia metadata <sample-identifier> | jq '[.files[] | (.size // "0") | tonumber] | add / 1048576 | floor'

# Check available storage on pi-nas
df -h /mnt/
```

### Rule of thumb

- Text collections (PDFs, EPUBs): ~10-100 MB per item
- Audio collections: ~100 MB - 1 GB per item
- Video collections: ~1-10 GB per item
- Software archives: highly variable
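
The item count and a sample item size can be combined into a rough total and compared with free space. This is a back-of-the-envelope sketch: `estimate_gb` and `fits_on_disk` are hypothetical helpers, and a single sample item may not be representative of the whole collection.

```bash
# Hypothetical helpers for a rough storage estimate.

# estimate_gb <item-count> <sample-item-size-in-MB>
estimate_gb() {
  local items="$1" sample_mb="$2"
  # Integer arithmetic, rounded up to the next whole GB.
  echo $(( (items * sample_mb + 1023) / 1024 ))
}

# fits_on_disk <needed-GB> <path>: succeeds if the filesystem has that much free space.
fits_on_disk() {
  local needed_gb="$1" path="$2" free_kb
  # df -Pk prints a stable POSIX layout; column 4 is available space in KB.
  free_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$free_kb" -ge $(( needed_gb * 1024 * 1024 )) ]
}
```

For example, 1200 items at roughly 350 MB each gives `estimate_gb 1200 350`, which prints `411`. `fits_on_disk 411 /mnt/` then reports through its exit status whether that fits.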

---

## Troubleshooting

### Download hangs or stalls

```bash
# Kill and resume with checksum verification
# Ctrl+C to stop, then re-run with --checksum
ia download --itemlist items.txt --glob="*.pdf" --checksum
```

### "Item not found" errors in bulk download

Some items in a collection may be restricted or taken down. These fail individually, but the batch continues. Check the command output for the identifiers that failed.
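
To keep a record of which identifiers failed, the items can be downloaded one at a time and the failures written to a file. The sketch below assumes `ia download` exits with a non-zero status when an item cannot be downloaded; `download_each` is a hypothetical helper.

```bash
# Hypothetical helper: run a command per identifier and record the failures.
# Usage: download_each <itemlist> <failed-list> <command...>
download_each() {
  local list="$1" failed="$2" id
  shift 2
  : > "$failed"   # start with an empty failure list
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    # Redirect stdin so the command cannot consume the rest of the itemlist.
    if ! "$@" "$id" < /dev/null; then
      echo "$id" >> "$failed"
    fi
  done < "$list"
  echo "$(wc -l < "$failed" | tr -d ' ') item(s) failed; see $failed" >&2
}
```

Example: `download_each items.txt failed.txt ia download --glob="*.pdf" --checksum`. The resulting `failed.txt` can be retried later with `ia download --itemlist failed.txt`.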

### Disk full during bulk download

```bash
# Check what's using space
du -sh /mnt/archive/*/ | sort -rh | head -20

# Resume after freeing space — checksum mode skips completed files
ia download --itemlist items.txt --checksum
```

### Rate limiting / 429 errors

Archive.org may throttle aggressive downloads.

- Reduce parallel jobs (if using GNU Parallel)
- Add delays between items: `parallel -j2 --delay 5 'ia download {}' < items.txt`
- Wait and retry later
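
"Wait and retry later" can be automated with an exponential backoff wrapper. This is a generic shell sketch, not an `ia` feature: `retry_backoff` is a hypothetical helper, and the attempt count and delays in the example are illustrative values.

```bash
# Hypothetical helper: retry a command, doubling the wait after each failure.
# Usage: retry_backoff <max-attempts> <initial-delay-seconds> <command...>
retry_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempt(s)" >&2
      return 1
    fi
    echo "Attempt $attempt failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}
```

Example: `retry_backoff 4 60 ia download --itemlist items.txt --glob="*.pdf" --checksum` waits 60, 120, then 240 seconds between attempts.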

### Corrupt downloads

```bash
# Re-download with checksum verification — replaces corrupt files
ia download <identifier> --checksum
```
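
Downloaded files can also be verified independently with `md5sum`. The sketch below assumes GNU coreutils on pi-nas and a manifest in `md5sum` format built from the item metadata. `verify_md5` is a hypothetical helper, and the `jq` line in the comment relies on the `md5` and `name` fields of each file entry.

```bash
# Hypothetical helper: check files in a directory against an md5sum-style manifest.
# Manifest lines look like: "<md5>  <relative-file-name>"
# One way to build it (assumes jq is installed):
#   ia metadata <identifier> | jq -r '.files[] | select(.md5) | "\(.md5)  \(.name)"' > manifest.md5
# Usage: verify_md5 <directory> <manifest>
verify_md5() {
  local dir="$1" manifest
  # Resolve the manifest to an absolute path before changing directory.
  manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
  # md5sum -c resolves file names relative to the current directory.
  ( cd "$dir" && md5sum -c --quiet "$manifest" )
}
```

Example: `verify_md5 <identifier>/ manifest.md5`. Files that were filtered out of the download are reported as missing.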

### Permission denied on destination

```bash
# Ensure the download user owns the target directory
sudo chown -R "$(id -un):$(id -gn)" /mnt/archive/
```

---

## Checklist: Collection Mirror

```
[ ] Identify collection identifier on archive.org
[ ] Check available storage (df -h)
[ ] Estimate collection size (--num-found + sample item size)
[ ] Generate itemlist (ia search --itemlist > itemlist.txt)
[ ] Review itemlist count (wc -l itemlist.txt)
[ ] Start download with appropriate filters (--glob, --format)
[ ] Verify downloaded files exist and are non-zero
[ ] If interrupted, resume with --checksum
[ ] Record collection details in project notes
```

---

*Last updated: 2026-02-14*