# Download & Mirror from Internet Archive

Procedures for downloading items, filtering by format/pattern, bulk downloading from collections, and mirroring entire collections via the `ia` CLI on pi-nas.

---

## Prerequisites

- `ia` CLI installed on pi-nas (192.168.1.245), v5.7.2
- Authenticated if downloading restricted items: `ia configure`
- Sufficient storage on pi-nas (check with `df -h`)
- Reference: `ia-cli-reference.md` for search/query syntax

---

## 1. Download a Single Item

An "item" is a logical unit on archive.org identified by its identifier (visible in the URL: `archive.org/details/<identifier>`).

```bash
# Download all files in an item to ./<identifier>/
ia download <identifier>

# Example
ia download prelinger_films
```

The default creates a directory named after the identifier containing all files (originals + derivatives).

### Gate

Verify the download directory exists and has files:

```bash
ls -la <identifier>/
```

---

## 2. Filtered Downloads

### By glob pattern

Download only files matching a shell glob pattern.

```bash
# Only PDFs
ia download <identifier> --glob="*.pdf"

# Only MP4 video files
ia download <identifier> --glob="*.mp4"

# Multiple patterns (pipe-separated)
ia download <identifier> --glob="*.pdf|*.epub"
```

### With exclusions

Exclude patterns require `--glob` to also be set.

```bash
# All MP4s except low-quality variants
ia download <identifier> --glob="*.mp4" --exclude="*512kb*"

# All files except metadata/review XMLs
ia download <identifier> --glob="*" --exclude="*_meta.xml|*_reviews.xml|*_files.xml"

# Multiple exclusions
ia download <identifier> --glob="*.mp4" --exclude="*512kb*|*_thumb*"
```

### By format name

Download files of a specific archive.org format (as shown by `ia metadata <identifier> --formats`).

```bash
# Check available formats first
ia metadata <identifier> --formats

# Download only a specific format
ia download <identifier> --format="512Kb MPEG4"
ia download <identifier> --format="PDF"
ia download <identifier> --format="EPUB"
```

**Note:** `--format` is incompatible with `--glob` and `--exclude`. Use one approach or the other.

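
For the glob-based approach, the pipe-separated pattern string passed to `--glob` and `--exclude` can be assembled from a list of extensions. The following is a minimal sketch: `build_glob` is a hypothetical helper name introduced here for illustration, not part of the `ia` CLI, and it only builds the string without calling `ia`.

```bash
#!/usr/bin/env bash
# build_glob is a hypothetical helper (not part of the ia CLI).
# It joins file extensions into a pipe-separated glob string,
# in the form used by the --glob and --exclude examples in this section.
build_glob() {
  local ext
  local patterns=()
  for ext in "$@"; do
    patterns+=("*.${ext}")
  done
  # "${patterns[*]}" joins array elements with the first character of IFS
  local IFS='|'
  echo "${patterns[*]}"
}

build_glob pdf epub
```

One possible use is `ia download <identifier> --glob="$(build_glob pdf epub)"`, which keeps the extension list in one place when the same filter is reused across several commands.
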
### On-the-fly formats

Some formats (EPUB, MOBI, DAISY, MARCXML) are generated on demand.

```bash
ia download <identifier> --on-the-fly --format="EPUB"
```

---

## 3. Download Options

### Control output location

```bash
# Download to a specific directory
ia download <identifier> --destdir=/mnt/archive/downloads/

# Flatten directory structure (no subdirectory per item)
ia download <identifier> --no-directories
```

### Resume interrupted downloads

```bash
# Resume: skips files that already exist and match checksum
ia download <identifier> --checksum

# Checksum mode compares MD5 hashes, so it is safe to re-run
```

### Preserve timestamps

```bash
# Original archive.org timestamps are kept by default (no flag needed):
# ia download sets each file's mtime to the source mtime
ia download <identifier>

# To keep the local download time instead, turn that behavior off
ia download <identifier> --no-change-timestamp
```

### Dry run

```bash
# See what would be downloaded without actually downloading
ia download <identifier> --dry-run
```

---

## 4. Bulk Download from Search Results

Pipe search results directly into download. This is the primary method for downloading multiple items.

### Basic pattern

```bash
# Search → itemlist → download
ia search 'collection:prelinger mediatype:movies' --itemlist | \
  ia download --itemlist -

# The - tells ia download to read identifiers from stdin
```

### With filters

```bash
# Download only PDFs from all items in a collection
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf"

# Download only MP3s from an audio collection
ia search 'collection:librivoxaudio' --itemlist | \
  ia download --itemlist - --glob="*.mp3"
```

### With destination directory

```bash
# Download to a specific location
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --destdir=/mnt/archive/prelinger/
```

### Save itemlist for reuse

When a search is large, save the itemlist first so you can resume without re-searching.

```bash
# Step 1: Save itemlist
ia search 'collection:prelinger mediatype:movies' --itemlist > prelinger-items.txt

# Step 2: Check count
wc -l prelinger-items.txt

# Step 3: Download from file
ia download --itemlist prelinger-items.txt --glob="*.mp4"

# Step 4: Resume if interrupted (just re-run with --checksum)
ia download --itemlist prelinger-items.txt --glob="*.mp4" --checksum
```

---

## 5. Bulk Download with GNU Parallel

For faster bulk downloads, use GNU Parallel for concurrent item downloads.

```bash
# Install parallel if not present
sudo apt install -y parallel

# Download 5 items concurrently
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4"'

# With destination directory
ia search 'collection:prelinger' --itemlist | \
  parallel -j5 'ia download {} --glob="*.mp4" --destdir=/mnt/archive/prelinger/'

# From saved itemlist
parallel -j5 'ia download {} --glob="*.pdf"' < items.txt
```

**Caution:** Be respectful of archive.org bandwidth. 3-5 concurrent downloads is reasonable. Higher parallelism may trigger rate limiting.

---

## 6. Mirror an Entire Collection

Mirroring means downloading everything and being able to re-run to pick up new additions.

### Initial mirror

```bash
# Step 1: Create working directory
mkdir -p /mnt/archive/<collection>
cd /mnt/archive/<collection>

# Step 2: Generate itemlist
ia search 'collection:<collection>' --itemlist > itemlist.txt
echo "Found $(wc -l < itemlist.txt) items"

# Step 3: Download all items (adjust --glob as needed)
ia download --itemlist itemlist.txt --destdir=/mnt/archive/<collection>/

# Or with format filter
ia download --itemlist itemlist.txt --glob="*.pdf" --destdir=/mnt/archive/<collection>/
```

### Update an existing mirror

Re-run the same commands. Use `--checksum` to skip already-downloaded files.

```bash
cd /mnt/archive/<collection>

# Refresh itemlist (new items since last run)
ia search 'collection:<collection>' --itemlist > itemlist-new.txt

# Download only new/changed files
ia download --itemlist itemlist-new.txt --checksum --destdir=/mnt/archive/<collection>/
```

### Mirror with a script

For recurring mirrors, create a simple script:

```bash
#!/bin/bash
# mirror-collection.sh <collection> [glob-pattern]

# Fail early if no collection is given; otherwise the script would search
# for an empty collection name and write into /mnt/archive/ itself
COLLECTION="${1:?usage: mirror-collection.sh <collection> [glob-pattern]}"
GLOB="${2:-*}"
DEST="/mnt/archive/$COLLECTION"

mkdir -p "$DEST"

echo "Refreshing itemlist for $COLLECTION..."
ia search "collection:$COLLECTION" --itemlist > "$DEST/itemlist.txt"

COUNT=$(wc -l < "$DEST/itemlist.txt")
echo "Found $COUNT items"

echo "Downloading (glob: $GLOB)..."
ia download --itemlist "$DEST/itemlist.txt" --glob="$GLOB" --checksum --destdir="$DEST/"

echo "Mirror complete: $DEST"
```

Usage:

```bash
chmod +x mirror-collection.sh
./mirror-collection.sh arrl_qst "*.pdf"
./mirror-collection.sh prelinger "*.mp4"
```

---

## 7. Practical Patterns

### Download all PDFs from a collection

```bash
ia search 'collection:arrl_qst' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-qst/
```

### Download specific media types from a collection

```bash
# High-quality video only
ia search 'collection:prelinger' --itemlist | \
  ia download --itemlist - --format="MPEG4" --destdir=/mnt/archive/prelinger-video/

# Audio in MP3 format
ia search 'collection:librivoxaudio creator:"Mark Twain"' --itemlist | \
  ia download --itemlist - --glob="*64kb*.mp3" --destdir=/mnt/archive/twain-audio/
```

### Download items matching a date range

```bash
ia search 'collection:arrl_qst date:[1950-01-01 TO 1959-12-31]' --itemlist | \
  ia download --itemlist - --glob="*.pdf" --destdir=/mnt/archive/arrl-1950s/
```

### Download a single specific file from an item

```bash
# List files first
ia list <identifier>

# Download just one file
ia download <identifier> specific-file.pdf
```

### Download and preserve directory structure

```bash
# Default behavior: each item gets its own subdirectory
ia download --itemlist items.txt --destdir=/mnt/archive/output/

# Result: /mnt/archive/output/<identifier-1>/files...
#         /mnt/archive/output/<identifier-2>/files...
```

### Pipe a single file to stdout

```bash
# Stream a file without saving to disk
ia download <identifier> specific-file.pdf --stdout | less
ia download <identifier> data.json --stdout | jq .
```

---

## 8. Storage Planning

Before large downloads, estimate storage requirements.

```bash
# Count items in collection
ia search 'collection:<collection>' --num-found

# Check a sample item's size (file entries without a size field count as 0)
ia metadata <identifier> | jq '[.files[] | (.size // "0") | tonumber] | add / 1048576 | floor'
# Output in MB

# Check available storage on pi-nas
df -h /mnt/
```

### Rule of thumb

- Text collections (PDFs, EPUBs): ~10-100 MB per item
- Audio collections: ~100 MB - 1 GB per item
- Video collections: ~1-10 GB per item
- Software archives: highly variable

---

## Troubleshooting

### Download hangs or stalls

```bash
# Kill and resume with checksum verification
# Ctrl+C to stop, then re-run with --checksum
ia download --itemlist items.txt --glob="*.pdf" --checksum
```

### "Item not found" errors in bulk download

Some items in a collection may be restricted or taken down. These fail individually, but the batch continues. Check the command output for errors.

### Disk full during bulk download

```bash
# Check what's using space
du -sh /mnt/archive/*/ | sort -rh | head -20

# Resume after freeing space; checksum mode skips completed files
ia download --itemlist items.txt --checksum
```

### Rate limiting / 429 errors

Archive.org may throttle aggressive downloads.

- Reduce parallel jobs (if using GNU Parallel)
- Add delays between items: `parallel -j2 --delay 5 'ia download {}' < items.txt`
- Wait and retry later

### Corrupt downloads

```bash
# Re-download with checksum verification (replaces corrupt files)
ia download <identifier> --checksum
```

### Permission denied on destination

```bash
# Ensure the download user owns the target directory
sudo chown -R $(whoami):$(whoami) /mnt/archive/
```

---

## Checklist: Collection Mirror

```
[ ] Identify collection identifier on archive.org
[ ] Check available storage (df -h)
[ ] Estimate collection size (--num-found + sample item size)
[ ] Generate itemlist (ia search --itemlist > itemlist.txt)
[ ] Review itemlist count (wc -l itemlist.txt)
[ ] Start download with appropriate filters (--glob, --format)
[ ] Verify downloaded files exist and are non-zero
[ ] If interrupted, resume with --checksum
[ ] Record collection details in project notes
```

---

*Last updated: 2026-02-14*