GPU/CPU Fallback Routing

Route workloads to GPU or CPU based on pre-flight inspection of job properties (duration, file size, resolution, complexity). Small jobs go to GPU for speed; large jobs fall back to CPU to avoid VRAM exhaustion. Concurrent job control via flock prevents OOM kills: excess jobs fail fast with a non-zero exit code and are re-queued by the caller instead of competing for memory.

Use this when you have a GPU workload where some jobs exceed VRAM capacity, and the system needs to handle both small and large jobs without manual intervention or OOM kills.


Prerequisites

  • NVIDIA GPU with working drivers (nvidia-smi returns output)
  • Both GPU and CPU execution paths available for the workload
  • A probe tool to inspect job properties before execution (e.g., ffprobe, mediainfo, file, wc)
  • A caller that retries on non-zero exit codes (scheduler, job queue, runner)

Inputs

Prompt the user for all of these before executing:

TARGET_HOST=            # Machine with GPU (e.g., cortex)
WORKLOAD_BINARY=        # The tool that processes jobs (e.g., "whisper-ctranslate2-real")
PROBE_TOOL=             # Tool to inspect job properties (e.g., "ffprobe", "mediainfo")
GPU_VRAM_MB=            # Total VRAM available (e.g., 16384 for 16GB)
WORKLOAD_VRAM_MB=       # VRAM used per GPU job (e.g., 3700)
WORKLOAD_RAM_MB=        # RAM used per CPU job (e.g., 11000)
THRESHOLD_VALUE=        # Cutoff for GPU vs CPU routing (e.g., 3600 for seconds)
THRESHOLD_UNIT=         # What the threshold measures (e.g., "seconds", "bytes", "pixels")
MAX_GPU_JOBS=           # Max concurrent GPU jobs (e.g., 2)
MAX_CPU_JOBS=           # Max concurrent CPU jobs (e.g., 1)
GPU_ARGS=               # Arguments for GPU execution (e.g., "--device cuda --compute_type float16")
CPU_ARGS=               # Arguments for CPU execution (e.g., "--device cpu --compute_type int8")
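
Every later step depends on these values, so it can help to confirm that none was left empty before continuing. A minimal sketch, assuming the inputs are set as shell variables in the current session; `require_inputs` is a hypothetical helper name, not part of the existing tooling:

```shell
#!/bin/bash
# Hypothetical helper: report any runbook input that is unset or empty.
require_inputs() {
    local name missing=0
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name
        if [[ -z "${!name:-}" ]]; then
            echo "missing input: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (commented out so the sketch can be sourced safely):
# require_inputs TARGET_HOST WORKLOAD_BINARY PROBE_TOOL GPU_VRAM_MB WORKLOAD_VRAM_MB \
#     WORKLOAD_RAM_MB THRESHOLD_VALUE THRESHOLD_UNIT MAX_GPU_JOBS MAX_CPU_JOBS \
#     GPU_ARGS CPU_ARGS || exit 1
```

The function only checks for emptiness; it does not validate that numeric inputs are actually numbers.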

Step 1: Determine the Routing Threshold

Profile representative workloads to find the VRAM crossover point.

# Run a small job on GPU, monitor VRAM
ssh $TARGET_HOST "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits"
# Run the workload...
# Check peak VRAM during execution

# Run a large job on GPU, watch for OOM
# If it OOM-kills or exceeds VRAM, that's your upper bound
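
The "check peak VRAM during execution" step can be automated by polling while the job runs. The following is a sketch to run on the GPU host itself, under two assumptions: the GPU is otherwise idle (the query reports total memory in use, not per-process usage), and only the first GPU is of interest. `VRAM_SAMPLER` and `peak_vram_while` are hypothetical names introduced here.

```shell
#!/bin/bash
# Assumption: VRAM_SAMPLER prints one integer (MB in use) per call.
# The default is the nvidia-smi query shown in the step above.
VRAM_SAMPLER=${VRAM_SAMPLER:-"nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits"}

# peak_vram_while PID [INTERVAL_SECONDS]
# Poll until PID exits, then print the highest sample seen (0 if none was valid).
peak_vram_while() {
    local pid="$1" interval="${2:-1}" peak=0 sample
    while kill -0 "$pid" 2>/dev/null; do
        sample=$($VRAM_SAMPLER 2>/dev/null | head -n1)
        sample=${sample//[[:space:]]/}          # strip stray whitespace
        if [[ "$sample" =~ ^[0-9]+$ ]] && (( sample > peak )); then
            peak=$sample
        fi
        sleep "$interval"
    done
    echo "$peak"
}

# Usage (commented out; start the workload in the background first):
# $WORKLOAD_BINARY <small-input-args> $GPU_ARGS &
# peak_vram_while $!
```

Because sampling happens once per interval, short spikes between samples can be missed; treat the result as a lower bound on the true peak.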

The threshold should be set conservatively below the point where GPU jobs start failing. Common strategies:

| Workload Type | Probe Property | Typical Threshold |
| --- | --- | --- |
| Audio transcription | Duration (seconds) | 1-2 hours (3600-7200 seconds) |
| Image generation | Resolution (megapixels) | Based on model VRAM curve |
| Video encoding | Duration × resolution | Derived from VRAM budget |
| LLM inference | Token count / context length | Model-specific |

Gate

You must have a clear, measurable property that predicts VRAM usage. If the relationship between job properties and VRAM is unpredictable, this pattern won't work — use a different strategy (e.g., try GPU first, fall back on OOM).


Step 2: Write the Probe Function

The probe function inspects the job input and returns the routing metric.

# Generic probe template
probe_workload() {
    local INPUT="$1"
    local METRIC=0

    if [[ -n "$INPUT" && -f "$INPUT" ]]; then
        # Example: audio/video duration via ffprobe
        METRIC=$($PROBE_TOOL -v quiet -show_entries format=duration \
            -of csv=p=0 "$INPUT" 2>/dev/null | cut -d. -f1)
        METRIC=${METRIC:-0}

        # Example: file size in bytes
        # METRIC=$(stat -c%s "$INPUT" 2>/dev/null)

        # Example: image resolution (width × height)
        # METRIC=$($PROBE_TOOL -v quiet -show_entries stream=width,height \
        #     -of csv=p=0 "$INPUT" 2>/dev/null | awk -F, '{print $1*$2}')
    fi

    echo "$METRIC"
}

Gate

Test the probe against known inputs:

# Small workload (should route to GPU)
probe_workload /path/to/small/input    # Should be < THRESHOLD_VALUE

# Large workload (should route to CPU)
probe_workload /path/to/large/input    # Should be >= THRESHOLD_VALUE
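
To make the gate mechanical, the routing decision can be checked without running any workload. A minimal sketch that applies the same `metric < threshold` comparison used by the router in Step 3; `route_for_metric` is a hypothetical helper, and reporting non-numeric values as INVALID is an assumption added here, not behavior of the router itself:

```shell
#!/bin/bash
# route_for_metric METRIC THRESHOLD
# Print GPU or CPU using the router's comparison (metric < threshold means GPU).
# Assumption: non-integer input is reported as INVALID rather than routed.
route_for_metric() {
    local metric="$1" threshold="$2"
    if ! [[ "$metric" =~ ^[0-9]+$ && "$threshold" =~ ^[0-9]+$ ]]; then
        echo "INVALID"
        return 2
    fi
    # 10# forces base 10 so values with leading zeros are not read as octal
    if (( 10#$metric < 10#$threshold )); then
        echo "GPU"
    else
        echo "CPU"
    fi
}

# Usage (commented out):
# route_for_metric "$(probe_workload /path/to/small/input)" "$THRESHOLD_VALUE"
```

Note that `probe_workload` prints 0 when the input is missing or the probe produces no output, and 0 always compares as below the threshold, so a silently failing probe shows up here as GPU.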

Step 3: Implement the Router

The router uses the probe result to select GPU or CPU execution path.

#!/bin/bash
# GPU/CPU Fallback Router
# Routes jobs based on $THRESHOLD_UNIT inspection

THRESHOLD=$THRESHOLD_VALUE
LOGFILE="/tmp/workload-router.log"
GPU_LOCK="/tmp/gpu-workload.lock"
CPU_LOCK="/tmp/cpu-workload.lock"

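# ──── Probe ────
# probe_workload is the function from Step 2; define or source it above this point.
# Assumption: the job input is the first argument that names an existing file.
# Adjust this if the workload takes its input through a flag or in another position.
INPUT=""
for ARG in "$@"; do
    if [[ -f "$ARG" ]]; then INPUT="$ARG"; break; fi
done
METRIC=$(probe_workload "$INPUT")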

# ──── Route ────
if (( METRIC < THRESHOLD )); then
    MODE="GPU"
    DEVICE_ARGS="$GPU_ARGS"
    LOCK_FILE="$GPU_LOCK"
    MAX_CONCURRENT=$MAX_GPU_JOBS
else
    MODE="CPU"
    DEVICE_ARGS="$CPU_ARGS"
    LOCK_FILE="$CPU_LOCK"
    MAX_CONCURRENT=$MAX_CPU_JOBS
fi

# ──── Concurrency control ────
if (( MAX_CONCURRENT == 1 )); then
    # Single-job lock: flock with fail-fast
    exec 9>"$LOCK_FILE"
    if ! flock --nonblock 9; then
        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (slot full, exiting)" >> "$LOGFILE"
        exit 1  # Caller should retry later
    fi
fi
# For MAX_CONCURRENT > 1, use numbered lock files:
# for i in $(seq 0 $((MAX_CONCURRENT - 1))); do
#     SLOT_LOCK="${LOCK_FILE}.${i}"
#     exec 9>"$SLOT_LOCK"
#     if flock --nonblock 9; then
#         break  # Got a slot
#     fi
#     if (( i == MAX_CONCURRENT - 1 )); then
#         echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (all slots full)" >> "$LOGFILE"
#         exit 1
#     fi
# done

# ──── Log and execute ────
echo "[ROUTER] $(date) mode=$MODE metric=${METRIC} args: $*" >> "$LOGFILE"

exec $WORKLOAD_BINARY "$@" $DEVICE_ARGS

Key design decisions

  • flock --nonblock: Non-blocking lock attempt. If the slot is taken, exit immediately instead of waiting. This prevents queue starvation where all runner slots are blocked waiting for CPU jobs.
  • Exit code 1: The caller (runner, scheduler) should interpret this as "retry later." Most job queues do this by default.
  • exec: Replace the router process with the workload binary. Signals, exit codes, and resource limits pass through cleanly.
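  • Lock files in /tmp: The kernel releases a flock as soon as the holding process exits, including after a crash, so a leftover lock file never blocks new jobs. /tmp is also cleaned on reboot.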
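
The commented-out loop for MAX_CONCURRENT > 1 can be packaged as a function. The following is a sketch, assuming util-linux `flock` is installed; `acquire_slot` and `SLOT_INDEX` are illustrative names that do not appear in the router above.

```shell
#!/bin/bash
# acquire_slot LOCK_BASE MAX_SLOTS
# Try LOCK_BASE.0 .. LOCK_BASE.(MAX_SLOTS-1) without blocking.
# On success, file descriptor 9 holds the lock and SLOT_INDEX is set.
# Call it directly, not inside $(...): a command substitution runs in a
# subshell, and the lock would be released when that subshell exits.
acquire_slot() {
    local base="$1" max="$2" i
    for (( i = 0; i < max; i++ )); do
        exec 9>"${base}.${i}"        # reopening fd 9 drops the previous attempt
        if flock --nonblock 9; then
            SLOT_INDEX=$i
            return 0
        fi
    done
    exec 9>&-                        # no slot free: close the descriptor
    return 1
}

# Usage in the router (commented out):
# acquire_slot "$LOCK_FILE" "$MAX_CONCURRENT" || exit 1
# exec $WORKLOAD_BINARY "$@" $DEVICE_ARGS   # fd 9 is inherited, so the slot stays held
```

Since the lock lives on the open file descriptor, the slot stays taken for as long as the exec'd workload runs and frees itself when that process ends.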

Step 4: Integrate with the Caller

Deploy the router using the binary wrapper interception pattern (see binary-wrapper-interception.md):

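  1. Rename the real binary: mv $BINARY ${BINARY}-real (here $BINARY is the path the caller invokes; the renamed file is what WORKLOAD_BINARY must point to)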
  2. Write the router script
  3. Symlink: ln -sf /path/to/router $BINARY

Or, if the caller supports configurable command paths, point it directly at the router.
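
The three wrapper steps can be scripted so that they are not accidentally run twice. This is a sketch only, not the content of binary-wrapper-interception.md; `install_router` is a hypothetical helper, and the guard against an existing `-real` file is an assumption added for safety.

```shell
#!/bin/bash
# install_router BINARY ROUTER
# Rename BINARY to BINARY-real and put a symlink to ROUTER in its place.
install_router() {
    local binary="$1" router="$2"
    # Guard: a second run would move the symlink over the real binary
    if [[ -e "${binary}-real" ]]; then
        echo "already installed: ${binary}-real exists" >&2
        return 1
    fi
    if [[ ! -f "$binary" || ! -f "$router" ]]; then
        echo "binary or router script not found" >&2
        return 1
    fi
    mv "$binary" "${binary}-real"
    chmod +x "$router"
    ln -sf "$router" "$binary"
}

# Usage (commented out; paths are placeholders):
# install_router /path/to/binary /path/to/router
```

To roll back, remove the symlink and move the `-real` file back to its original name.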


Step 5: Verify Both Paths

GPU path

# Submit a small job
ssh $TARGET_HOST "$BINARY <small-input-args>"

# Verify GPU usage
ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"

# Check log
ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
# Should show: mode=GPU

CPU path

# Submit a large job
ssh $TARGET_HOST "$BINARY <large-input-args>"

# Verify CPU usage (no GPU spike)
ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
# GPU should be idle

# Check RAM
ssh $TARGET_HOST "free -h"

# Check log
ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
# Should show: mode=CPU

Concurrency control

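# Start a CPU job in the background, then immediately try a second one.
# The & goes outside the quotes so the local ssh is backgrounded; with the &
# inside the quotes, ssh typically stays attached until the remote job finishes.
ssh $TARGET_HOST "$BINARY <large-input-1>" &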
sleep 2
ssh $TARGET_HOST "$BINARY <large-input-2>"
# Second job should exit immediately with code 1

# Check log
ssh $TARGET_HOST "grep BLOCKED /tmp/workload-router.log"
# Should show: mode=CPU-BLOCKED

Step 6: Tune and Monitor

After initial deployment, monitor for a day and adjust:

# Distribution of GPU vs CPU jobs
# (the trailing space after GPU/CPU keeps the -BLOCKED entries out of these counts)
ssh $TARGET_HOST "grep -c 'mode=GPU ' /tmp/workload-router.log"
ssh $TARGET_HOST "grep -c 'mode=CPU ' /tmp/workload-router.log"
ssh $TARGET_HOST "grep -c 'BLOCKED' /tmp/workload-router.log"

If BLOCKED count is high relative to CPU count, the threshold may be too aggressive (routing too many jobs to CPU). Consider raising the threshold or increasing MAX_CPU_JOBS if RAM allows.
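
Because the substring mode=CPU also appears inside mode=CPU-BLOCKED, an exact match on the mode field gives cleaner counts than a plain substring search. A sketch that tallies all four modes in one pass, assuming the log line format written by the router in Step 3; `summarize_router_log` is a hypothetical helper name:

```shell
#!/bin/bash
# summarize_router_log [LOGFILE]
# Count log lines per routing mode, matching the first mode=... field exactly.
summarize_router_log() {
    local log="${1:-/tmp/workload-router.log}"
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^mode=/) {
                    count[substr($i, 6)]++   # text after "mode="
                    break                    # ignore any mode= inside the job args
                }
            }
        }
        END {
            printf "GPU=%d CPU=%d GPU-BLOCKED=%d CPU-BLOCKED=%d\n",
                count["GPU"], count["CPU"], count["GPU-BLOCKED"], count["CPU-BLOCKED"]
        }
    ' "$log"
}

# Usage (commented out):
# ssh $TARGET_HOST "$(declare -f summarize_router_log); summarize_router_log"
```

A CPU-BLOCKED count that is large compared with the CPU count is the signal described above.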


Memory Budget Worksheet

GPU path:
  VRAM per job:     $WORKLOAD_VRAM_MB MB
  Max GPU jobs:     $MAX_GPU_JOBS
  Total GPU VRAM:   $GPU_VRAM_MB MB
  Headroom:         GPU_VRAM_MB - (WORKLOAD_VRAM_MB × MAX_GPU_JOBS) MB
  → Headroom must be positive

CPU path:
  RAM per job:      $WORKLOAD_RAM_MB MB
  Max CPU jobs:     $MAX_CPU_JOBS
  System RAM:       $(free -m | awk '/Mem:/{print $2}') MB
  Other processes:  ~2-4 GB (OS, services, buffers)
  Headroom:         SystemRAM - OtherProcs - (WORKLOAD_RAM_MB × MAX_CPU_JOBS) MB
  → Headroom must be positive

systemd MemoryMax:  Should be set to MAX(GPU-path peak host RAM, CPU-path peak host RAM) + 20% buffer (MemoryMax limits host RAM, not VRAM)
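
Both headroom lines in the worksheet are the same subtraction, so one small function covers them. A minimal sketch; `headroom_mb` is a hypothetical helper, and the optional fourth argument stands for the "other processes" reserve on the CPU path:

```shell
#!/bin/bash
# headroom_mb TOTAL_MB PER_JOB_MB MAX_JOBS [RESERVED_MB]
# Memory left (in MB) when every job slot is busy. Must be positive.
headroom_mb() {
    local total="$1" per_job="$2" max_jobs="$3" reserved="${4:-0}"
    echo $(( total - reserved - per_job * max_jobs ))
}

# Usage (commented out):
# headroom_mb "$GPU_VRAM_MB" "$WORKLOAD_VRAM_MB" "$MAX_GPU_JOBS"
# headroom_mb "$(free -m | awk '/Mem:/{print $2}')" "$WORKLOAD_RAM_MB" "$MAX_CPU_JOBS" 4096
```

With the values from the Whisper example in Usage Examples, the GPU path gives 16384 - (3700 × 2) = 8984 MB of headroom.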

Troubleshooting

GPU job OOM-kills despite being under threshold

The threshold is too high, or VRAM usage varies by input characteristics beyond what the probe measures. Lower the threshold or add a secondary probe (e.g., check resolution in addition to duration).

CPU jobs pile up and exhaust RAM

MAX_CPU_JOBS is too high, or the flock mechanism isn't working. Check that lock files are being created in /tmp/ and that the exec 9> file descriptor redirect is correct.

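All jobs route to GPU

The probe is returning 0 or failing silently, and a metric of 0 is always below the threshold. Test the probe manually:

$PROBE_TOOL -v quiet -show_entries format=duration -of csv=p=0 /path/to/input

If it returns empty, the input file may not be accessible to the probe tool (permissions, path issues). If every job routes to CPU instead, check that THRESHOLD_VALUE is set in the router's environment: an empty threshold evaluates to 0 in the arithmetic test, so no metric is ever below it.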

Blocked jobs never get retried

The caller doesn't retry on exit code 1. Check the caller's retry behavior. Some systems need specific exit codes (e.g., 75 for "temporary failure" in some mail systems). Adjust the exit code in the router to match what the caller expects.
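
If the caller cannot be configured to retry, a thin retry wrapper around the router is one option. The sketch below assumes the router has been changed to exit with a dedicated code when no slot is free; the router above uses 1, which cannot be told apart from an ordinary workload failure. `BLOCKED_EXIT_CODE` and `retry_when_blocked` are hypothetical names; 75 is EX_TEMPFAIL from sysexits.h.

```shell
#!/bin/bash
# Assumption: the router exits with this code when all slots are taken.
BLOCKED_EXIT_CODE=${BLOCKED_EXIT_CODE:-75}

# retry_when_blocked MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND only while it reports "blocked"; any other status is final.
retry_when_blocked() {
    local max_attempts="$1" delay="$2" attempt rc=0
    shift 2
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if "$@"; then rc=0; else rc=$?; fi
        if (( rc != BLOCKED_EXIT_CODE )); then
            return "$rc"             # success or a real failure: stop retrying
        fi
        if (( attempt < max_attempts )); then
            sleep "$delay"
        fi
    done
    return "$rc"                     # still blocked after the last attempt
}

# Usage (commented out):
# retry_when_blocked 10 60 $BINARY <input-args>
```

A fixed delay keeps the sketch short; a real scheduler would more likely re-queue the job than sleep in place.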

Lock files persist after crash

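A leftover lock file is harmless: the kernel releases a flock when the holding process exits, so a crashed job never leaves the slot locked, and /tmp is cleaned on reboot anyway. If jobs are still being blocked, a live process holds the lock; find it with fuser /tmp/cpu-workload.lock. Avoid deleting the lock file while a job is running, because the next job would create and lock a fresh file and run concurrently with the first.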
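
A flock is tied to an open file descriptor and disappears when its holder exits, so the presence of the lock file alone says little about whether a slot is taken. A sketch for checking whether the lock is actually held, assuming util-linux `flock`; `lock_is_held` is a hypothetical helper name:

```shell
#!/bin/bash
# lock_is_held FILE
# Succeed (status 0) if some process currently holds the flock on FILE.
lock_is_held() {
    local lock_file="$1"
    # No file means nothing can be holding it; also avoids creating the file
    [[ -e "$lock_file" ]] || return 1
    # flock --nonblock fails when another process holds the lock.
    # If it succeeds, `true` runs and the lock is released again at once.
    ! flock --nonblock "$lock_file" true
}

# Usage (commented out):
# if lock_is_held /tmp/cpu-workload.lock; then echo "CPU slot busy"; fi
```

The check also reports "held" if flock fails for another reason, such as missing permissions on the file, so run it as the same user as the router.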


Usage Examples

Whisper auto-captioning on PeerTube runner (cortex)

WORKLOAD_BINARY=/usr/local/bin/whisper-ctranslate2-real
PROBE_TOOL=ffprobe
GPU_VRAM_MB=16384          # RTX A4000
WORKLOAD_VRAM_MB=3700      # Whisper medium on float16
WORKLOAD_RAM_MB=11000      # Whisper medium on CPU int8 (peak for 9.5hr video)
THRESHOLD_VALUE=3600       # 1 hour in seconds
THRESHOLD_UNIT=seconds
MAX_GPU_JOBS=2             # Runner concurrency=2, but both can be GPU
MAX_CPU_JOBS=1             # Only 1 CPU job at a time (11GB peak, 20G MemoryMax)
GPU_ARGS="--device cuda --compute_type float16"
CPU_ARGS="--device cpu --compute_type int8"

Result: 4100+ videos captioned. ~20 videos over 1 hour routed to CPU.
GPU jobs: ~3.7GB VRAM, 88-99% GPU utilization
CPU jobs: ~8-11GB RAM, serialized via flock
MemoryMax=20G on the runner service as safety net.

Last updated: 2026-02-17