# GPU/CPU Fallback Routing

Route workloads to GPU or CPU based on pre-flight inspection of job properties (duration, file size, resolution, complexity). Small jobs go to GPU for speed; large jobs fall back to CPU to avoid VRAM exhaustion. Concurrent job control via `flock` prevents OOM kills: excess jobs fail fast and re-queue instead of competing for memory.

Use this when you have a GPU workload where some jobs exceed VRAM capacity, and the system needs to handle both small and large jobs without manual intervention or OOM kills.

---

## Prerequisites

- NVIDIA GPU with working drivers (`nvidia-smi` returns output)
- Both GPU and CPU execution paths available for the workload
- A probe tool to inspect job properties before execution (e.g., `ffprobe`, `mediainfo`, `file`, `wc`)
- A caller that retries on non-zero exit codes (scheduler, job queue, runner)

---

## Inputs

Prompt the user for all of these before executing:

```
TARGET_HOST=          # Machine with GPU (e.g., cortex)
WORKLOAD_BINARY=      # The tool that processes jobs (e.g., "whisper-ctranslate2-real")
PROBE_TOOL=           # Tool to inspect job properties (e.g., "ffprobe", "mediainfo")
GPU_VRAM_MB=          # Total VRAM available (e.g., 16384 for 16GB)
WORKLOAD_VRAM_MB=     # VRAM used per GPU job (e.g., 3700)
WORKLOAD_RAM_MB=      # RAM used per CPU job (e.g., 11000)
THRESHOLD_VALUE=      # Cutoff for GPU vs CPU routing (e.g., 3600 for seconds)
THRESHOLD_UNIT=       # What the threshold measures (e.g., "seconds", "bytes", "pixels")
MAX_GPU_JOBS=         # Max concurrent GPU jobs (e.g., 2)
MAX_CPU_JOBS=         # Max concurrent CPU jobs (e.g., 1)
GPU_ARGS=             # Arguments for GPU execution (e.g., "--device cuda --compute_type float16")
CPU_ARGS=             # Arguments for CPU execution (e.g., "--device cpu --compute_type int8")
```

---

## Step 1: Determine the Routing Threshold

Profile representative workloads to find the VRAM crossover point.

```bash
# Run a small job on GPU, monitor VRAM
ssh "$TARGET_HOST" "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits"

# Run the workload...
# Check peak VRAM during execution

# Run a large job on GPU, watch for OOM
# If it OOM-kills or exceeds VRAM, that's your upper bound
```

Set the threshold conservatively below the point where GPU jobs start failing. Common strategies:

| Workload Type | Probe Property | Typical Threshold |
|---------------|----------------|-------------------|
| Audio transcription | Duration (seconds) | 1-2 hours |
| Image generation | Resolution (megapixels) | Based on model VRAM curve |
| Video encoding | Duration × resolution | Derived from VRAM budget |
| LLM inference | Token count / context length | Model-specific |

### Gate

You must have a clear, measurable property that predicts VRAM usage. If the relationship between job properties and VRAM is unpredictable, this pattern won't work; use a different strategy (e.g., try GPU first, fall back on OOM).

---

## Step 2: Write the Probe Function

The probe function inspects the job input and returns the routing metric.

```bash
# Generic probe template
probe_workload() {
    local INPUT="$1"
    local METRIC=0

    if [[ -n "$INPUT" && -f "$INPUT" ]]; then
        # Example: audio/video duration via ffprobe (integer seconds)
        METRIC=$($PROBE_TOOL -v quiet -show_entries format=duration \
            -of csv=p=0 "$INPUT" 2>/dev/null | cut -d. -f1)

        # Example: file size in bytes
        # METRIC=$(stat -c%s "$INPUT" 2>/dev/null)

        # Example: image resolution (width × height)
        # METRIC=$($PROBE_TOOL -v quiet -show_entries stream=width,height \
        #     -of csv=p=0 "$INPUT" 2>/dev/null | awk -F, '{print $1*$2}')
    fi

    # Fall back to 0 if the probe printed nothing or a non-number (e.g., "N/A"),
    # so the arithmetic comparison in the router cannot fail.
    # Note: 0 is below any threshold, so unprobeable inputs route to GPU.
    [[ "$METRIC" =~ ^[0-9]+$ ]] || METRIC=0
    echo "$METRIC"
}
```

### Gate

Test the probe against known inputs:

```bash
# Small workload (should route to GPU)
probe_workload /path/to/small/input
# Should be < THRESHOLD_VALUE

# Large workload (should route to CPU)
probe_workload /path/to/large/input
# Should be >= THRESHOLD_VALUE
```

---

## Step 3: Implement the Router

The router uses the probe result to select the GPU or CPU execution path.

```bash
#!/bin/bash
# GPU/CPU Fallback Router
# Routes jobs based on $THRESHOLD_UNIT inspection

THRESHOLD=$THRESHOLD_VALUE
LOGFILE="/tmp/workload-router.log"
GPU_LOCK="/tmp/gpu-workload.lock"
CPU_LOCK="/tmp/cpu-workload.lock"

# ──── Probe ────
# Define or source probe_workload() from Step 2 above this point.
# Assumption: the job input is the first argument that names an existing file;
# adapt this to how your workload receives its input.
INPUT=""
for ARG in "$@"; do
    if [[ -f "$ARG" ]]; then INPUT="$ARG"; break; fi
done
METRIC=$(probe_workload "$INPUT")

# ──── Route ────
if (( METRIC < THRESHOLD )); then
    MODE="GPU"
    DEVICE_ARGS="$GPU_ARGS"
    LOCK_FILE="$GPU_LOCK"
    MAX_CONCURRENT=$MAX_GPU_JOBS
else
    MODE="CPU"
    DEVICE_ARGS="$CPU_ARGS"
    LOCK_FILE="$CPU_LOCK"
    MAX_CONCURRENT=$MAX_CPU_JOBS
fi

# ──── Concurrency control ────
if (( MAX_CONCURRENT == 1 )); then
    # Single-job lock: flock with fail-fast
    exec 9>"$LOCK_FILE"
    if ! flock --nonblock 9; then
        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (slot full, exiting)" >> "$LOGFILE"
        exit 1  # Caller should retry later
    fi
else
    # For MAX_CONCURRENT > 1, use numbered lock files, one per slot
    GOT_SLOT=0
    for i in $(seq 0 $((MAX_CONCURRENT - 1))); do
        exec 9>"${LOCK_FILE}.${i}"
        if flock --nonblock 9; then
            GOT_SLOT=1
            break  # Got a slot
        fi
    done
    if (( ! GOT_SLOT )); then
        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (all slots full)" >> "$LOGFILE"
        exit 1
    fi
fi

# ──── Log and execute ────
echo "[ROUTER] $(date) mode=$MODE metric=${METRIC} args: $*" >> "$LOGFILE"
# fd 9 stays open across exec, so the lock is held until the workload exits.
# $DEVICE_ARGS is intentionally unquoted so it splits into separate arguments.
exec $WORKLOAD_BINARY "$@" $DEVICE_ARGS
```

### Key design decisions

- **`flock --nonblock`**: Non-blocking lock attempt. If the slot is taken, exit immediately instead of waiting. This prevents queue starvation where all runner slots are blocked waiting for CPU jobs.
- **Exit code 1**: The caller (runner, scheduler) should interpret this as "retry later." Many job queues retry failed jobs by default; confirm that yours does.
- **`exec`**: Replace the router process with the workload binary. Signals, exit codes, and resource limits pass through cleanly, and the lock on file descriptor 9 is inherited by the workload and released when it exits.
- **Lock files in `/tmp`**: Cleaned on reboot on most distributions. A `flock` lock is released by the kernel when the holding process exits, so a crash leaves no stale lock, only a harmless empty file.

---

## Step 4: Integrate with the Caller

Deploy the router using the binary wrapper interception pattern (see `binary-wrapper-interception.md`):

1. Rename the real binary: `mv "$BINARY" "${BINARY}-real"`
2. Write the router script, with `WORKLOAD_BINARY` pointing at `${BINARY}-real` (otherwise the router would exec itself)
3. Symlink: `ln -sf /path/to/router "$BINARY"`

Or, if the caller supports configurable command paths, point it directly at the router.
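
The rename-and-symlink steps above can be sketched as a small shell function. This is a minimal sketch under assumptions, not the method from `binary-wrapper-interception.md`: `install_router` is a hypothetical helper name, and the guard against double installation is an addition that the steps above do not mention.

```bash
# Hypothetical helper: put a router script in front of an existing binary.
# Usage: install_router /path/to/binary /path/to/router
install_router() {
    local binary="$1" router="$2"

    # Both arguments are required, and the router script must exist.
    if [[ -z "$binary" || ! -f "$router" ]]; then
        echo "usage: install_router BINARY ROUTER" >&2
        return 2
    fi

    # Guard (an assumption, not part of the original steps): if the -real
    # copy already exists, the wrapper is installed; do not overwrite it.
    if [[ -e "${binary}-real" ]]; then
        echo "already installed: ${binary}-real exists" >&2
        return 1
    fi

    # Step 1: move the real binary aside.
    mv -- "$binary" "${binary}-real" || return 1

    # Step 3: make the original name resolve to the router.
    ln -sf -- "$router" "$binary"
}
```

Without the guard, a second run would move the symlink itself to `${BINARY}-real` and leave the router pointing at itself.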

---

## Step 5: Verify Both Paths

### GPU path

```bash
# Submit a small job
ssh "$TARGET_HOST" "$BINARY /path/to/small/input"

# Verify GPU usage (run from a second terminal while the job is active)
ssh "$TARGET_HOST" "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"

# Check log
ssh "$TARGET_HOST" "tail -1 /tmp/workload-router.log"
# Should show: mode=GPU
```

### CPU path

```bash
# Submit a large job
ssh "$TARGET_HOST" "$BINARY /path/to/large/input"

# Verify CPU usage (no GPU spike; run while the job is active)
ssh "$TARGET_HOST" "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
# GPU should be idle

# Check RAM
ssh "$TARGET_HOST" "free -h"

# Check log
ssh "$TARGET_HOST" "tail -1 /tmp/workload-router.log"
# Should show: mode=CPU
```

### Concurrency control

```bash
# Start a CPU job in the background, then immediately try a second one
ssh "$TARGET_HOST" "$BINARY /path/to/large/input" &
sleep 2
ssh "$TARGET_HOST" "$BINARY /path/to/large/input"
echo "exit code: $?"
# Second job should exit immediately with code 1

# Check log
ssh "$TARGET_HOST" "grep BLOCKED /tmp/workload-router.log"
# Should show: mode=CPU-BLOCKED
```

---

## Step 6: Tune and Monitor

After initial deployment, monitor for a day and adjust:

```bash
# Distribution of GPU vs CPU jobs (the trailing space excludes -BLOCKED lines)
ssh "$TARGET_HOST" "grep -c 'mode=GPU ' /tmp/workload-router.log"
ssh "$TARGET_HOST" "grep -c 'mode=CPU ' /tmp/workload-router.log"
ssh "$TARGET_HOST" "grep -c 'BLOCKED' /tmp/workload-router.log"
```

If the BLOCKED count is high relative to the CPU count, the threshold may be too aggressive (routing too many jobs to CPU). Consider raising the threshold if VRAM measurements allow it, or increasing `MAX_CPU_JOBS` if RAM allows.
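
The counts above can also be gathered in one pass, with executed and blocked jobs kept apart (a plain `mode=CPU` pattern also matches `mode=CPU-BLOCKED` lines). The following is a minimal sketch: `summarize_router_log` is a hypothetical helper name, and it assumes the log line format written by the router in Step 3.

```bash
# Hypothetical helper: count routing decisions in the router log.
# Usage: summarize_router_log [logfile]
summarize_router_log() {
    local log="${1:-/tmp/workload-router.log}"
    awk '
        # Blocked attempts are matched first so they are not counted as runs.
        /mode=GPU-BLOCKED/ { gpu_blocked++; next }
        /mode=CPU-BLOCKED/ { cpu_blocked++; next }
        # Executed jobs: the mode is followed by a space in the log line.
        /mode=GPU /        { gpu++; next }
        /mode=CPU /        { cpu++; next }
        END {
            printf "gpu=%d cpu=%d gpu_blocked=%d cpu_blocked=%d\n",
                   gpu, cpu, gpu_blocked, cpu_blocked
        }
    ' "$log"
}
```

Run it on the target host, for example `summarize_router_log /tmp/workload-router.log`, and compare `cpu_blocked` against `cpu` when deciding whether to adjust the threshold.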

---

## Memory Budget Worksheet

```
GPU path:
  VRAM per job:     $WORKLOAD_VRAM_MB MB
  Max GPU jobs:     $MAX_GPU_JOBS
  Total GPU VRAM:   $GPU_VRAM_MB MB
  Headroom:         GPU_VRAM_MB - (WORKLOAD_VRAM_MB × MAX_GPU_JOBS) MB
  → Headroom must be positive

CPU path:
  RAM per job:      $WORKLOAD_RAM_MB MB
  Max CPU jobs:     $MAX_CPU_JOBS
  System RAM:       $(free -m | awk '/Mem:/{print $2}') MB
  Other processes:  ~2-4 GB (OS, services, buffers)
  Headroom:         SystemRAM - OtherProcs - (WORKLOAD_RAM_MB × MAX_CPU_JOBS) MB
  → Headroom must be positive

systemd MemoryMax (limits host RAM, not VRAM):
  Should be set to MAX(GPU peak, CPU peak) + 20% buffer
```

---

## Troubleshooting

### GPU job OOM-kills despite being under threshold

The threshold is too high, or VRAM usage varies by input characteristics beyond what the probe measures. Lower the threshold or add a secondary probe (e.g., check resolution in addition to duration).

### CPU jobs pile up and exhaust RAM

`MAX_CPU_JOBS` is too high, or the `flock` mechanism isn't working. Check that lock files are being created in `/tmp/` and that the `exec 9>` file descriptor redirect is correct.

### All jobs route to GPU

The probe is returning 0 or failing silently. A metric of 0 is below any threshold, so every job takes the GPU path. Test the probe manually:

```bash
$PROBE_TOOL -v quiet -show_entries format=duration -of csv=p=0 /path/to/input
```

If it returns empty, the input file may not be accessible to the probe tool (permissions, path issues). If every job routes to CPU instead, check that `THRESHOLD_VALUE` is set: an empty threshold evaluates to 0 in the comparison, so no metric is ever below it.

### Blocked jobs never get retried

The caller doesn't retry on exit code 1. Check the caller's retry behavior. Some systems need specific exit codes (e.g., 75 for "temporary failure" in some mail systems). Adjust the exit code in the router to match what the caller expects.

### Lock files persist after crash

A leftover lock file does not block new jobs: the `flock` lock itself is released when the holding process exits, and `/tmp` is cleaned on reboot. To remove the file anyway: `rm /tmp/cpu-workload.lock`. Do this only when no job is running, because deleting the file while a job holds the lock lets a second job lock a fresh file. The next job will recreate it.
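
For the "Blocked jobs never get retried" case above, one option when the caller cannot be reconfigured is a thin retry loop around the router. The following is a minimal sketch under assumptions: `retry_while_busy` is a hypothetical helper name, and the busy code, attempt count, and delay are values you choose, not settings from the original setup.

```bash
# Hypothetical helper: re-run a command while it exits with a "busy" code.
# Usage: retry_while_busy BUSY_CODE MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_while_busy() {
    local busy_code="$1" max_attempts="$2" delay="$3"
    shift 3
    local attempt rc

    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        rc=0
        "$@" || rc=$?

        # Any status other than the busy code (including success) is final.
        (( rc == busy_code )) || return "$rc"

        # Wait before the next attempt, but not after the last one.
        (( attempt < max_attempts )) && sleep "$delay"
    done

    # Still busy after all attempts: report the busy code to the caller.
    return "$busy_code"
}
```

For example, `retry_while_busy 75 10 60 "$BINARY" /path/to/input` would retry for about ten minutes. Because the router uses `exec`, the workload's own exit status passes through, so a busy code of 1 cannot be told apart from a workload failure that also exits 1; a distinct code such as 75 in the router's blocked branch avoids that ambiguity.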

---

## Usage Examples

### Whisper auto-captioning on PeerTube runner (cortex)

```
WORKLOAD_BINARY=/usr/local/bin/whisper-ctranslate2-real
PROBE_TOOL=ffprobe
GPU_VRAM_MB=16384          # RTX A4000
WORKLOAD_VRAM_MB=3700      # Whisper medium on float16
WORKLOAD_RAM_MB=11000      # Whisper medium on CPU int8 (peak for 9.5hr video)
THRESHOLD_VALUE=3600       # 1 hour in seconds
THRESHOLD_UNIT=seconds
MAX_GPU_JOBS=2             # Runner concurrency=2, but both can be GPU
MAX_CPU_JOBS=1             # Only 1 CPU job at a time (11GB peak, 20G MemoryMax)
GPU_ARGS="--device cuda --compute_type float16"
CPU_ARGS="--device cpu --compute_type int8"

Result: 4100+ videos captioned. ~20 videos over 1 hour routed to CPU.
GPU jobs: ~3.7GB VRAM, 88-99% GPU utilization
CPU jobs: ~8-11GB RAM, serialized via flock
MemoryMax=20G on the runner service as safety net.
```

---

*Last updated: 2026-02-17*