GPU/CPU Fallback Routing
Route workloads to GPU or CPU based on pre-flight inspection of job properties (duration, file size, resolution, complexity). Small jobs go to GPU for speed; large jobs fall back to CPU to avoid VRAM exhaustion. Concurrent job control via flock prevents OOM kills — excess jobs fail fast and re-queue instead of competing for memory.
Use this when you have a GPU workload where some jobs exceed VRAM capacity, and the system needs to handle both small and large jobs without manual intervention or OOM kills.
Prerequisites
- NVIDIA GPU with working drivers (nvidia-smi returns output)
- Both GPU and CPU execution paths available for the workload
- A probe tool to inspect job properties before execution (e.g., ffprobe, mediainfo, file, wc)
- A caller that retries on non-zero exit codes (scheduler, job queue, runner)
Inputs
Prompt the user for all of these before executing:
TARGET_HOST= # Machine with GPU (e.g., cortex)
WORKLOAD_BINARY= # The tool that processes jobs (e.g., "whisper-ctranslate2-real")
PROBE_TOOL= # Tool to inspect job properties (e.g., "ffprobe", "mediainfo")
GPU_VRAM_MB= # Total VRAM available (e.g., 16384 for 16GB)
WORKLOAD_VRAM_MB= # VRAM used per GPU job (e.g., 3700)
WORKLOAD_RAM_MB= # RAM used per CPU job (e.g., 11000)
THRESHOLD_VALUE= # Cutoff for GPU vs CPU routing (e.g., 3600 for seconds)
THRESHOLD_UNIT= # What the threshold measures (e.g., "seconds", "bytes", "pixels")
MAX_GPU_JOBS= # Max concurrent GPU jobs (e.g., 2)
MAX_CPU_JOBS= # Max concurrent CPU jobs (e.g., 1)
GPU_ARGS= # Arguments for GPU execution (e.g., "--device cuda --compute_type float16")
CPU_ARGS= # Arguments for CPU execution (e.g., "--device cpu --compute_type int8")
Step 1: Determine the Routing Threshold
Profile representative workloads to find the VRAM crossover point.
# Run a small job on GPU, monitor VRAM
ssh $TARGET_HOST "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits"
# Run the workload...
# Check peak VRAM during execution
# Run a large job on GPU, watch for OOM
# If it OOM-kills or exceeds VRAM, that's your upper bound
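One way to capture peak VRAM during a profiling run is to poll nvidia-smi until the job's process exits and keep the largest sample. The sketch below assumes a single GPU (it reads only the first line of nvidia-smi output) and is meant to run on the GPU host; `peak_of` and `sample_vram_while_running` are hypothetical helper names, not part of any tool.

```bash
# Print the largest integer seen on stdin (one sample per line).
# Non-numeric lines such as "N/A" are skipped; empty input prints 0.
peak_of() {
    awk '/^[ ]*[0-9]+[ ]*$/ { if ($1 + 0 > m) m = $1 + 0 } END { print m + 0 }'
}

# Emit one memory.used sample (MiB, first GPU) per second
# for as long as the process with PID $1 is alive.
sample_vram_while_running() {
    while kill -0 "$1" 2>/dev/null; do
        nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1
        sleep 1
    done
}

# Usage (on the GPU host):
#   "$WORKLOAD_BINARY" /path/to/input $GPU_ARGS &
#   sample_vram_while_running $! | peak_of
```

Repeating this for inputs of increasing size gives the VRAM curve from which the threshold can be read. One-second polling can miss short spikes, so treat the result as a lower bound on the true peak.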
The threshold should be set conservatively below the point where GPU jobs start failing. Common strategies:
| Workload Type | Probe Property | Typical Threshold |
|---|---|---|
| Audio transcription | Duration (seconds) | 3600-7200 (1-2 hours) |
| Image generation | Resolution (megapixels) | Based on model VRAM curve |
| Video encoding | Duration × resolution | Derived from VRAM budget |
| LLM inference | Token count / context length | Model-specific |
Gate
You must have a clear, measurable property that predicts VRAM usage. If the relationship between job properties and VRAM is unpredictable, this pattern won't work — use a different strategy (e.g., try GPU first, fall back on OOM).
Step 2: Write the Probe Function
The probe function inspects the job input and returns the routing metric.
# Generic probe template
probe_workload() {
    local INPUT="$1"
    local METRIC=0
    if [[ -n "$INPUT" && -f "$INPUT" ]]; then
        # Example: audio/video duration via ffprobe (whole seconds)
        METRIC=$("$PROBE_TOOL" -v quiet -show_entries format=duration \
            -of csv=p=0 "$INPUT" 2>/dev/null | cut -d. -f1)
        # Example: file size in bytes
        # METRIC=$(stat -c%s "$INPUT" 2>/dev/null)
        # Example: image resolution (width × height)
        # METRIC=$("$PROBE_TOOL" -v quiet -show_entries stream=width,height \
        #     -of csv=p=0 "$INPUT" 2>/dev/null | awk -F, '{print $1*$2}')
    fi
    # Fall back to 0 when the probe prints nothing or a non-number (e.g. "N/A"),
    # so the router's arithmetic comparison cannot fail. A metric of 0 routes to GPU.
    [[ "$METRIC" =~ ^[0-9]+$ ]] || METRIC=0
    echo "$METRIC"
}
Gate
Test the probe against known inputs:
# Small workload (should route to GPU)
probe_workload /path/to/small/input # Should be < THRESHOLD_VALUE
# Large workload (should route to CPU)
probe_workload /path/to/large/input # Should be >= THRESHOLD_VALUE
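The gate can also be checked without real media files by testing the routing rule on its own. The sketch below mirrors the rule used by the router in Step 3 (metric below threshold goes to GPU, everything else to CPU); `route_for` is a hypothetical helper, and unlike the probe template it reports a non-numeric metric as an error rather than defaulting it to 0.

```bash
# Hypothetical helper: print the route for a given metric and threshold.
# Usage: route_for METRIC THRESHOLD
route_for() {
    local metric="$1"
    local threshold="$2"
    case "$metric" in
        ''|*[!0-9]*) echo "ERROR"; return 2 ;;   # empty or non-numeric probe result
    esac
    if [ "$metric" -lt "$threshold" ]; then
        echo "GPU"
    else
        echo "CPU"
    fi
}

route_for 1800 3600     # 30-minute input, 1-hour threshold: prints GPU
route_for 34200 3600    # 9.5-hour input, same threshold: prints CPU
```

Feeding it the output of probe_workload for a known small and a known large input gives a quick pass/fail check before the router is deployed.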
Step 3: Implement the Router
The router uses the probe result to select GPU or CPU execution path.
#!/bin/bash
# GPU/CPU Fallback Router
# Routes jobs based on $THRESHOLD_UNIT inspection
THRESHOLD=$THRESHOLD_VALUE
LOGFILE="/tmp/workload-router.log"
GPU_LOCK="/tmp/gpu-workload.lock"
CPU_LOCK="/tmp/cpu-workload.lock"
# ──── Probe ────
# Assumption: the input file is the last argument; adjust to the workload's CLI
INPUT=""
(( $# > 0 )) && INPUT="${!#}"
METRIC=$(probe_workload "$INPUT")
# ──── Route ────
if (( METRIC < THRESHOLD )); then
MODE="GPU"
DEVICE_ARGS="$GPU_ARGS"
LOCK_FILE="$GPU_LOCK"
MAX_CONCURRENT=$MAX_GPU_JOBS
else
MODE="CPU"
DEVICE_ARGS="$CPU_ARGS"
LOCK_FILE="$CPU_LOCK"
MAX_CONCURRENT=$MAX_CPU_JOBS
fi
# ──── Concurrency control ────
if (( MAX_CONCURRENT == 1 )); then
    # Single-job lock: flock with fail-fast
    exec 9>"$LOCK_FILE"
    if ! flock --nonblock 9; then
        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (slot full, exiting)" >> "$LOGFILE"
        exit 1 # Caller should retry later
    fi
else
    # Multiple slots: one numbered lock file per slot, take the first free one
    GOT_SLOT=0
    for (( i = 0; i < MAX_CONCURRENT; i++ )); do
        exec 9>"${LOCK_FILE}.${i}"
        if flock --nonblock 9; then
            GOT_SLOT=1 # Got a slot
            break
        fi
    done
    if (( ! GOT_SLOT )); then
        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (all slots full)" >> "$LOGFILE"
        exit 1 # Caller should retry later
    fi
fi
# ──── Log and execute ────
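echo "[ROUTER] $(date) mode=$MODE metric=${METRIC} args: $*" >> "$LOGFILE"
# $DEVICE_ARGS is left unquoted on purpose so it splits into separate arguments.
# fd 9 stays open across exec, so the lock is held until the workload exits.
exec "$WORKLOAD_BINARY" "$@" $DEVICE_ARGS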
Key design decisions
- flock --nonblock: Non-blocking lock attempt. If the slot is taken, exit immediately instead of waiting. This prevents queue starvation where all runner slots are blocked waiting for CPU jobs.
- Exit code 1: The caller (runner, scheduler) should interpret this as "retry later." Most job queues do this by default.
- exec: Replace the router process with the workload binary. Signals, exit codes, and resource limits pass through cleanly.
- Lock files in /tmp: Automatically cleaned on reboot. The kernel releases a flock lock when the holding process exits, so a crash leaves no stale lock behind.
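The fail-fast behaviour can be seen in isolation with a throwaway lock file. This is a minimal sketch, assuming the util-linux flock command is installed; it uses -n, the short form of --nonblock, and the variable names are illustrative only.

```bash
LOCK=$(mktemp)   # throwaway lock file for the demo

# Holder: take the lock on fd 9 and keep it for 2 seconds.
( exec 9>"$LOCK"; flock -n 9 && sleep 2 ) &
sleep 0.5        # give the holder time to acquire the lock

# Second attempt while the holder is alive: fails at once instead of queueing.
if ( exec 9>"$LOCK"; flock -n 9 ); then FIRST="acquired"; else FIRST="blocked"; fi
echo "while held: $FIRST"

wait             # the holder exits and its lock is released with it

# Third attempt after the holder is gone: succeeds with no cleanup step.
if ( exec 9>"$LOCK"; flock -n 9 ); then SECOND="acquired"; else SECOND="blocked"; fi
echo "after exit: $SECOND"
rm -f "$LOCK"
```

On a system with flock this should report "blocked" while the holder is running and "acquired" once it has exited, even though the lock file itself was never deleted in between.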
Step 4: Integrate with the Caller
Deploy the router using the binary wrapper interception pattern (see binary-wrapper-interception.md):
- Rename the real binary: mv "$BINARY" "${BINARY}-real"
- Write the router script
- Symlink: ln -sf /path/to/router "$BINARY"
Or, if the caller supports configurable command paths, point it directly at the router.
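The rename-and-symlink steps can be wrapped in a small function that is safe to run twice. This is only a sketch: `install_router` is a hypothetical name, and it assumes the router script already exists at the path passed as the second argument.

```bash
# Hypothetical installer for the wrapper pattern.
# $1 = path that callers invoke, $2 = path to the router script.
install_router() {
    local binary="$1"
    local router="$2"
    if [ -L "$binary" ]; then
        echo "already wrapped: $binary" >&2     # second run is a no-op
        return 0
    fi
    if [ -e "${binary}-real" ]; then
        echo "refusing to overwrite ${binary}-real" >&2
        return 1
    fi
    # Move the real binary aside, then put the router in its place.
    mv "$binary" "${binary}-real" && ln -s "$router" "$binary"
}

# Usage (paths are placeholders):
#   install_router "$BINARY" /path/to/router
```

The symlink check keeps a repeated run from renaming the router symlink over the real binary.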
Step 5: Verify Both Paths
GPU path
# Submit a small job
ssh $TARGET_HOST "$BINARY <small-input-args>"
# Verify GPU usage
ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
# Check log
ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
# Should show: mode=GPU
CPU path
# Submit a large job
ssh $TARGET_HOST "$BINARY <large-input-args>"
# Verify CPU usage (no GPU spike)
ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
# GPU should be idle
# Check RAM
ssh $TARGET_HOST "free -h"
# Check log
ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
# Should show: mode=CPU
Concurrency control
# Start a CPU job, then immediately try a second one
# The & goes outside the quotes so the local shell backgrounds the ssh session
ssh $TARGET_HOST "$BINARY <large-input-1>" &
sleep 2
ssh $TARGET_HOST "$BINARY <large-input-2>"
# Second job should exit immediately with code 1
# Check log
ssh $TARGET_HOST "grep BLOCKED /tmp/workload-router.log"
# Should show: mode=CPU-BLOCKED
Step 6: Tune and Monitor
After initial deployment, monitor for a day and adjust:
# Distribution of GPU vs CPU jobs (the trailing space keeps -BLOCKED lines out of these counts)
ssh $TARGET_HOST "grep -c 'mode=GPU ' /tmp/workload-router.log"
ssh $TARGET_HOST "grep -c 'mode=CPU ' /tmp/workload-router.log"
ssh $TARGET_HOST "grep -c 'BLOCKED' /tmp/workload-router.log"
If BLOCKED count is high relative to CPU count, the threshold may be too aggressive (routing too many jobs to CPU). Consider raising the threshold or increasing MAX_CPU_JOBS if RAM allows.
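For a single-pass view of the same numbers, the log can be tallied by its mode= field. The sketch below assumes the log line format written by the router above; `summarize_router_log` is a hypothetical helper name.

```bash
# Count router log lines per mode (GPU, CPU, GPU-BLOCKED, CPU-BLOCKED).
# Usage: summarize_router_log /tmp/workload-router.log
summarize_router_log() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^mode=/) {
                sub(/^mode=/, "", $i)   # keep only the mode name
                count[$i]++
                break                   # first mode= field only; ignore job args
            }
        }
    }
    END { for (m in count) printf "%s=%d\n", m, count[m] }' "$1" | sort
}
```

Because it matches the whole field, GPU and GPU-BLOCKED are counted separately, which makes the blocked-to-executed ratio for each path easy to read off.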
Memory Budget Worksheet
GPU path:
VRAM per job: $WORKLOAD_VRAM_MB MB
Max GPU jobs: $MAX_GPU_JOBS
Total GPU VRAM: $GPU_VRAM_MB MB
Headroom: GPU_VRAM_MB - (WORKLOAD_VRAM_MB × MAX_GPU_JOBS) MB
→ Headroom must be positive
CPU path:
RAM per job: $WORKLOAD_RAM_MB MB
Max CPU jobs: $MAX_CPU_JOBS
System RAM: $(free -m | awk '/Mem:/{print $2}') MB
Other processes: ~2-4 GB (OS, services, buffers)
Headroom: SystemRAM - OtherProcs - (WORKLOAD_RAM_MB × MAX_CPU_JOBS) MB
→ Headroom must be positive
systemd MemoryMax: Should be set to MAX(GPU peak, CPU peak) + 20% buffer
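The worksheet arithmetic can be checked with a one-line helper. The GPU and per-job figures below are the example values from the Inputs section; the 32768 MB of system RAM and 4096 MB reserved for other processes are assumptions made up for illustration, and `headroom_mb` is a hypothetical name.

```bash
# Headroom in MB: total - reserved - (per_job * jobs).
# Usage: headroom_mb TOTAL RESERVED PER_JOB JOBS
headroom_mb() {
    echo $(( $1 - $2 - $3 * $4 ))
}

# GPU path: 16384 MB VRAM, nothing reserved, 3700 MB per job, 2 jobs
headroom_mb 16384 0 3700 2        # prints 8984

# CPU path: ASSUMED 32768 MB RAM and 4096 MB reserved, 11000 MB per job, 1 job
headroom_mb 32768 4096 11000 1    # prints 17672
```

A negative result means the chosen job count does not fit, so MAX_GPU_JOBS or MAX_CPU_JOBS has to come down before deployment.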
Troubleshooting
GPU job OOM-kills despite being under threshold
The threshold is too high, or VRAM usage varies by input characteristics beyond what the probe measures. Lower the threshold or add a secondary probe (e.g., check resolution in addition to duration).
CPU jobs pile up and exhaust RAM
MAX_CPU_JOBS is too high, or the flock mechanism isn't working. Check that lock files are being created in /tmp/ and that the exec 9> file descriptor redirect is correct.
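All jobs route to GPU
The probe is returning 0 or failing silently, and a metric of 0 is always below the threshold. (If everything routes to CPU instead, check that THRESHOLD_VALUE is set in the router's environment; an empty threshold evaluates to 0.) Test the probe manually: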
$PROBE_TOOL -v quiet -show_entries format=duration -of csv=p=0 /path/to/input
If it returns empty, the input file may not be accessible to the probe tool (permissions, path issues).
Blocked jobs never get retried
The caller doesn't retry on exit code 1. Check the caller's retry behavior. Some systems need specific exit codes (e.g., 75 for "temporary failure" in some mail systems). Adjust the exit code in the router to match what the caller expects.
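If the caller cannot be configured to retry, a thin retry loop around the router is one option. This is a sketch under assumptions: `retry_when_blocked` is a hypothetical helper, and it presumes the router's blocked exit has been changed to a dedicated code (75 in the usage line) so that a full slot can be told apart from a genuine workload failure.

```bash
# Re-run a command while it exits with the "slot full" code.
# Usage: retry_when_blocked BUSY_CODE MAX_TRIES DELAY_SECONDS command [args...]
retry_when_blocked() {
    local busy_code="$1"
    local max_tries="$2"
    local delay="$3"
    shift 3
    local try=1
    local rc
    while :; do
        rc=0
        "$@" || rc=$?
        # Success or a real failure: hand the exit code straight back.
        [ "$rc" -ne "$busy_code" ] && return "$rc"
        # Still blocked after the last allowed attempt: give up.
        [ "$try" -ge "$max_tries" ] && return "$rc"
        try=$((try + 1))
        sleep "$delay"
    done
}

# Usage (placeholder arguments): up to 10 attempts, 60 seconds apart
#   retry_when_blocked 75 10 60 "$BINARY" <input-args>
```

With the router's default exit code of 1 the loop still works, but it will also retry jobs that failed for other reasons.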
Lock files persist after crash
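A leftover lock file is harmless: a flock lock belongs to the open file descriptor and is released when the holding process exits, even after a crash. /tmp is also cleaned on reboot. To remove the file anyway, make sure no job is running and then: rm /tmp/cpu-workload.lock. The next job will recreate it.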
Usage Examples
Whisper auto-captioning on PeerTube runner (cortex)
WORKLOAD_BINARY=/usr/local/bin/whisper-ctranslate2-real
PROBE_TOOL=ffprobe
GPU_VRAM_MB=16384 # RTX A4000
WORKLOAD_VRAM_MB=3700 # Whisper medium on float16
WORKLOAD_RAM_MB=11000 # Whisper medium on CPU int8 (peak for 9.5hr video)
THRESHOLD_VALUE=3600 # 1 hour in seconds
THRESHOLD_UNIT=seconds
MAX_GPU_JOBS=2 # Runner concurrency=2, but both can be GPU
MAX_CPU_JOBS=1 # Only 1 CPU job at a time (11GB peak, 20G MemoryMax)
GPU_ARGS="--device cuda --compute_type float16"
CPU_ARGS="--device cpu --compute_type int8"
Result: 4100+ videos captioned. ~20 videos over 1 hour routed to CPU.
GPU jobs: ~3.7GB VRAM, 88-99% GPU utilization
CPU jobs: ~8-11GB RAM, serialized via flock
MemoryMax=20G on the runner service as safety net.
Last updated: 2026-02-17