Migration: consolidate Echo6 docs to cortex with full infrastructure cleanup sync

- Documents recent infrastructure cleanup (8 CTs destroyed, 35 DNS records removed, Headscale cleanup)
- Adds 24 new runbooks covering Authentik, PeerTube, Meshtastic, RECON, Proxmox, Mailcow, Internet Archive, GPU routing
- Adds project documentation for headscale, vaultwarden, peertube, matrix, mmud, advbbs, arr stack
- Updates services.md, environment.md, caddy.md, authentik.md to match live infrastructure
- Removes 4 deprecated runbook duplicates (canonical versions live in projects/)
- Adds .gitignore for binary archives and editor temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 06:02:16 +00:00 · 2026-04-13 06:02:16 +00:00 · e9231ac24a
commit e9231ac24a
parent 89834796ff
93 changed files with 51223 additions and 254 deletions
--- a/runbooks/gpu-cpu-fallback-routing.md
+++ b/runbooks/gpu-cpu-fallback-routing.md
@ -0,0 +1,333 @@
+# GPU/CPU Fallback Routing
+
+Route workloads to GPU or CPU based on pre-flight inspection of job properties (duration, file size, resolution, complexity). Small jobs go to GPU for speed; large jobs fall back to CPU to avoid VRAM exhaustion. Concurrent job control via flock prevents OOM kills — excess jobs fail fast and re-queue instead of competing for memory.
+
+Use this when you have a GPU workload where some jobs exceed VRAM capacity, and the system needs to handle both small and large jobs without manual intervention or OOM kills.
+
+---
+
+## Prerequisites
+
+- NVIDIA GPU with working drivers (`nvidia-smi` returns output)
+- Both GPU and CPU execution paths available for the workload
+- A probe tool to inspect job properties before execution (e.g., `ffprobe`, `mediainfo`, `file`, `wc`)
+- A caller that retries on non-zero exit codes (scheduler, job queue, runner)
+
+---
+
+## Inputs
+
+Prompt the user for all of these before executing:
+
+```
+TARGET_HOST=            # Machine with GPU (e.g., cortex)
+WORKLOAD_BINARY=        # The tool that processes jobs (e.g., "whisper-ctranslate2-real")
+PROBE_TOOL=             # Tool to inspect job properties (e.g., "ffprobe", "mediainfo")
+GPU_VRAM_MB=            # Total VRAM available (e.g., 16384 for 16GB)
+WORKLOAD_VRAM_MB=       # VRAM used per GPU job (e.g., 3700)
+WORKLOAD_RAM_MB=        # RAM used per CPU job (e.g., 11000)
+THRESHOLD_VALUE=        # Cutoff for GPU vs CPU routing (e.g., 3600 for seconds)
+THRESHOLD_UNIT=         # What the threshold measures (e.g., "seconds", "bytes", "pixels")
+MAX_GPU_JOBS=           # Max concurrent GPU jobs (e.g., 2)
+MAX_CPU_JOBS=           # Max concurrent CPU jobs (e.g., 1)
+GPU_ARGS=               # Arguments for GPU execution (e.g., "--device cuda --compute_type float16")
+CPU_ARGS=               # Arguments for CPU execution (e.g., "--device cpu --compute_type int8")
+```
+
+---
+
+## Step 1: Determine the Routing Threshold
+
+Profile representative workloads to find the VRAM crossover point.
+
+```bash
+# Run a small job on GPU, monitor VRAM
+ssh $TARGET_HOST "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits"
+# Run the workload...
+# Check peak VRAM during execution
+
+# Run a large job on GPU, watch for OOM
+# If it OOM-kills or exceeds VRAM, that's your upper bound
+```
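
A single `nvidia-smi` reading only shows VRAM at that instant. One way to capture the peak is to sample repeatedly while the job runs and keep the maximum. The sketch below assumes a single GPU and GNU `timeout`; `peak_mb` is a hypothetical helper name, not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the largest integer seen on stdin.
peak_mb() {
    local max=0 value
    while read -r value; do
        # Skip blank or non-numeric lines
        [[ "$value" =~ ^[0-9]+$ ]] || continue
        if (( 10#$value > max )); then
            max=$(( 10#$value ))
        fi
    done
    echo "$max"
}

# Usage (assumption: sample once per second for up to 10 minutes while the job runs):
# ssh "$TARGET_HOST" "timeout 600 nvidia-smi --query-gpu=memory.used \
#     --format=csv,noheader,nounits -l 1" | peak_mb
```

Run the sampler in one terminal and the workload in another; the number printed at the end is the peak `memory.used` in MB over the sampling window.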
+
+The threshold should be set conservatively below the point where GPU jobs start failing. Common strategies:
+
+| Workload Type | Probe Property | Typical Threshold |
+|---------------|----------------|-------------------|
+| Audio transcription | Duration (seconds) | 3600-7200 (1-2 hours) |
+| Image generation | Resolution (megapixels) | Based on model VRAM curve |
+| Video encoding | Duration × resolution | Derived from VRAM budget |
+| LLM inference | Token count / context length | Model-specific |
+
+### Gate
+
+You must have a clear, measurable property that predicts VRAM usage. If the relationship between job properties and VRAM is unpredictable, this pattern won't work — use a different strategy (e.g., try GPU first, fall back on OOM).
+
+---
+
+## Step 2: Write the Probe Function
+
+The probe function inspects the job input and returns the routing metric.
+
+```bash
+# Generic probe template
+probe_workload() {
+    local INPUT="$1"
+    local METRIC=0
+
+    if [[ -n "$INPUT" && -f "$INPUT" ]]; then
+        # Example: audio/video duration via ffprobe
+        METRIC=$($PROBE_TOOL -v quiet -show_entries format=duration \
+            -of csv=p=0 "$INPUT" 2>/dev/null | cut -d. -f1)
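+        # Fall back to 0 if the probe printed nothing or a non-number (e.g. "N/A"),
+        # which would otherwise break the arithmetic comparison in the router
+        [[ "$METRIC" =~ ^[0-9]+$ ]] || METRIC=0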
+
+        # Example: file size in bytes
+        # METRIC=$(stat -c%s "$INPUT" 2>/dev/null)
+
+        # Example: image resolution (width × height)
+        # METRIC=$($PROBE_TOOL -v quiet -show_entries stream=width,height \
+        #     -of csv=p=0 "$INPUT" 2>/dev/null | awk -F, '{print $1*$2}')
+    fi
+
+    echo "$METRIC"
+}
+```
+
+### Gate
+
+Test the probe against known inputs:
+
+```bash
+# Small workload (should route to GPU)
+probe_workload /path/to/small/input    # Should be < THRESHOLD_VALUE
+
+# Large workload (should route to CPU)
+probe_workload /path/to/large/input    # Should be >= THRESHOLD_VALUE
+```
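
The gate check can also be scripted. The sketch below mirrors the comparison the router in Step 3 makes (a metric below the threshold goes to GPU, anything else to CPU) and rejects non-numeric values. `route_for` is a hypothetical helper name introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a probe metric against the threshold.
# Prints GPU, CPU, or INVALID (non-numeric input).
route_for() {
    local metric="$1" threshold="$2"
    if ! [[ "$metric" =~ ^[0-9]+$ && "$threshold" =~ ^[0-9]+$ ]]; then
        echo "INVALID"
        return 2
    fi
    # 10# forces base 10 so values with leading zeros are not read as octal
    if (( 10#$metric < 10#$threshold )); then
        echo "GPU"
    else
        echo "CPU"
    fi
}

# Usage (assumes probe_workload from Step 2 is defined in the same shell):
# route_for "$(probe_workload /path/to/small/input)" "$THRESHOLD_VALUE"
```

A metric exactly equal to the threshold routes to CPU, matching the `>=` note in the gate above.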
+
+---
+
+## Step 3: Implement the Router
+
+The router uses the probe result to select GPU or CPU execution path.
+
+```bash
+#!/bin/bash
+# GPU/CPU Fallback Router
+# Routes jobs based on $THRESHOLD_UNIT inspection
+
+THRESHOLD=$THRESHOLD_VALUE
+LOGFILE="/tmp/workload-router.log"
+GPU_LOCK="/tmp/gpu-workload.lock"
+CPU_LOCK="/tmp/cpu-workload.lock"
+
+# ──── Probe ────
+# probe_workload is the function from Step 2; define or source it above this point.
+INPUT="<extract from $@>"
+METRIC=$(probe_workload "$INPUT")
+
+# ──── Route ────
+if (( METRIC < THRESHOLD )); then
+    MODE="GPU"
+    DEVICE_ARGS="$GPU_ARGS"
+    LOCK_FILE="$GPU_LOCK"
+    MAX_CONCURRENT=$MAX_GPU_JOBS
+else
+    MODE="CPU"
+    DEVICE_ARGS="$CPU_ARGS"
+    LOCK_FILE="$CPU_LOCK"
+    MAX_CONCURRENT=$MAX_CPU_JOBS
+fi
+
+# ──── Concurrency control ────
+if (( MAX_CONCURRENT == 1 )); then
+    # Single-job lock: flock with fail-fast
+    exec 9>"$LOCK_FILE"
+    if ! flock --nonblock 9; then
+        echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (slot full, exiting)" >> "$LOGFILE"
+        exit 1  # Caller should retry later
+    fi
+fi
+# For MAX_CONCURRENT > 1, use numbered lock files:
+# for i in $(seq 0 $((MAX_CONCURRENT - 1))); do
+#     SLOT_LOCK="${LOCK_FILE}.${i}"
+#     exec 9>"$SLOT_LOCK"
+#     if flock --nonblock 9; then
+#         break  # Got a slot
+#     fi
+#     if (( i == MAX_CONCURRENT - 1 )); then
+#         echo "[ROUTER] $(date) mode=${MODE}-BLOCKED metric=${METRIC} (all slots full)" >> "$LOGFILE"
+#         exit 1
+#     fi
+# done
+
+# ──── Log and execute ────
+echo "[ROUTER] $(date) mode=$MODE metric=${METRIC} args: $*" >> "$LOGFILE"
+
+exec $WORKLOAD_BINARY "$@" $DEVICE_ARGS
+```
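
How to fill in the `INPUT="<extract from $@>"` placeholder depends on how the workload binary receives its input. One possible approach, assuming the input is passed as a plain file path somewhere in the argument list, is to take the first argument that names an existing file. `first_file_arg` is a hypothetical helper, not something the workload tool provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first argument that names an existing regular file.
# Returns 1 (and prints nothing) if no argument is a file.
first_file_arg() {
    local arg
    for arg in "$@"; do
        if [[ -f "$arg" ]]; then
            printf '%s\n' "$arg"
            return 0
        fi
    done
    return 1
}

# In the router (assumption: exactly one input file per invocation):
# INPUT=$(first_file_arg "$@") || INPUT=""
```

If the binary takes its input through a named flag instead, parse that flag explicitly rather than relying on this heuristic.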
+
+### Key design decisions
+
+- **`flock --nonblock`**: Non-blocking lock attempt. If the slot is taken, exit immediately instead of waiting. This prevents queue starvation where all runner slots are blocked waiting for CPU jobs.
+- **Exit code 1**: The caller (runner, scheduler) should interpret this as "retry later." Many job queues retry failed jobs by default, but confirm that yours does before relying on it.
+- **`exec`**: Replace the router process with the workload binary. Signals, exit codes, and resource limits pass through cleanly.
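+- **Lock files in `/tmp`**: The lock belongs to the open file descriptor, not to the file itself, so the kernel releases it when the holding process exits or crashes. A leftover lock file is harmless, and `/tmp` is typically cleared on reboot.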
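
The fail-fast behaviour can be checked in isolation before wiring it into the router. This is a minimal sketch, assuming util-linux `flock` is installed; it uses a throwaway lock file from `mktemp` rather than the router's real lock paths.

```bash
#!/usr/bin/env bash
LOCK=$(mktemp)

# Holder: take the lock on fd 9 and keep it for 3 seconds.
( exec 9>"$LOCK"; flock --nonblock 9 && sleep 3 ) &
sleep 1   # give the holder time to acquire the lock

# Contender: a second non-blocking attempt returns immediately instead of waiting.
if ( exec 9>"$LOCK"; flock --nonblock 9 ); then
    RESULT="acquired"
else
    RESULT="blocked"
fi
echo "while held: $RESULT"

# Once the holder exits, its lock is gone even though the file still exists.
wait
if ( exec 9>"$LOCK"; flock --nonblock 9 ); then
    AFTER="acquired"
else
    AFTER="blocked"
fi
echo "after release: $AFTER"
rm -f "$LOCK"
```

The second check illustrates why a lock file that outlives its process does not block later jobs.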
+
+---
+
+## Step 4: Integrate with the Caller
+
+Deploy the router using the binary wrapper interception pattern (see `binary-wrapper-interception.md`):
+
+1. Rename the real binary: `mv $BINARY ${BINARY}-real` (`$BINARY` is the path the caller invokes)
+2. Write the router script
+3. Symlink: `ln -sf /path/to/router $BINARY`
+
+Or, if the caller supports configurable command paths, point it directly at the router.
+
+---
+
+## Step 5: Verify Both Paths
+
+### GPU path
+
+```bash
+# Submit a small job
+ssh $TARGET_HOST "$BINARY <small-input-args>"
+
+# Verify GPU usage
+ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
+
+# Check log
+ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
+# Should show: mode=GPU
+```
+
+### CPU path
+
+```bash
+# Submit a large job
+ssh $TARGET_HOST "$BINARY <large-input-args>"
+
+# Verify CPU usage (no GPU spike)
+ssh $TARGET_HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader"
+# GPU should be idle
+
+# Check RAM
+ssh $TARGET_HOST "free -h"
+
+# Check log
+ssh $TARGET_HOST "tail -1 /tmp/workload-router.log"
+# Should show: mode=CPU
+```
+
+### Concurrency control
+
+```bash
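+# Start a CPU job in the background, then immediately try a second one
+# (the & goes outside the quotes so the local shell does not wait for ssh)
+ssh $TARGET_HOST "$BINARY <large-input-1>" &
+sleep 2
+ssh $TARGET_HOST "$BINARY <large-input-2>"; echo "exit code: $?"
+# Second job should exit immediately with code 1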
+
+# Check log
+ssh $TARGET_HOST "grep BLOCKED /tmp/workload-router.log"
+# Should show: mode=CPU-BLOCKED
+```
+
+---
+
+## Step 6: Tune and Monitor
+
+After initial deployment, monitor for a day and adjust:
+
+```bash
+# Distribution of GPU vs CPU jobs
+# (the trailing space keeps GPU-BLOCKED / CPU-BLOCKED lines out of the first two counts)
+ssh $TARGET_HOST "grep -c 'mode=GPU ' /tmp/workload-router.log"
+ssh $TARGET_HOST "grep -c 'mode=CPU ' /tmp/workload-router.log"
+ssh $TARGET_HOST "grep -c 'BLOCKED' /tmp/workload-router.log"
+```
+
+If BLOCKED count is high relative to CPU count, the threshold may be too aggressive (routing too many jobs to CPU). Consider raising the threshold or increasing MAX_CPU_JOBS if RAM allows.
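
To compare these counts side by side, the log can be summarized in one pass. The sketch below assumes the log line format written by the router in Step 3 (`mode=<MODE>` as a whitespace-separated field); `summarize_router_log` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count router log lines by exact mode value.
summarize_router_log() {
    local log="${1:-/tmp/workload-router.log}"
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^mode=/) {
                    count[substr($i, 6)]++   # text after "mode="
                    break                    # ignore any later mode= inside args
                }
            }
        }
        END {
            printf "GPU=%d CPU=%d GPU-BLOCKED=%d CPU-BLOCKED=%d\n",
                count["GPU"], count["CPU"], count["GPU-BLOCKED"], count["CPU-BLOCKED"]
        }
    ' "$log"
}

# Usage on the target host:
# summarize_router_log /tmp/workload-router.log
```

Because it matches the whole mode value, blocked attempts are counted separately from jobs that actually ran.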
+
+---
+
+## Memory Budget Worksheet
+
+```
+GPU path:
+  VRAM per job:     $WORKLOAD_VRAM_MB MB
+  Max GPU jobs:     $MAX_GPU_JOBS
+  Total GPU VRAM:   $GPU_VRAM_MB MB
+  Headroom:         GPU_VRAM_MB - (WORKLOAD_VRAM_MB × MAX_GPU_JOBS) MB
+  → Headroom must be positive
+
+CPU path:
+  RAM per job:      $WORKLOAD_RAM_MB MB
+  Max CPU jobs:     $MAX_CPU_JOBS
+  System RAM:       $(free -m | awk '/Mem:/{print $2}') MB
+  Other processes:  ~2-4 GB (OS, services, buffers)
+  Headroom:         SystemRAM - OtherProcs - (WORKLOAD_RAM_MB × MAX_CPU_JOBS) MB
+  → Headroom must be positive
+
+systemd MemoryMax:  Should be set to MAX(GPU peak, CPU peak) + 20% buffer
+```
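
The headroom lines in the worksheet are simple arithmetic and can be checked in the shell. This sketch uses the Whisper figures from Usage Examples for the GPU path; `headroom_mb`, `SYSTEM_RAM_MB`, and `OTHER_PROCS_MB` are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: headroom = total - reserved - (per_job * jobs), all in MB.
headroom_mb() {
    local total="$1" per_job="$2" jobs="$3" reserved="${4:-0}"
    echo $(( total - reserved - per_job * jobs ))
}

# GPU path: 16384 MB VRAM, 3700 MB per job, 2 concurrent jobs
headroom_mb 16384 3700 2

# CPU path: substitute measured values for the placeholders
# headroom_mb "$SYSTEM_RAM_MB" "$WORKLOAD_RAM_MB" "$MAX_CPU_JOBS" "$OTHER_PROCS_MB"
```

With those GPU figures the headroom is 8984 MB, so two concurrent GPU jobs fit with room to spare. A negative result means the job count or per-job budget needs to come down.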
+
+---
+
+## Troubleshooting
+
+### GPU job OOM-kills despite being under threshold
+
+The threshold is too high, or VRAM usage varies by input characteristics beyond what the probe measures. Lower the threshold or add a secondary probe (e.g., check resolution in addition to duration).
+
+### CPU jobs pile up and exhaust RAM
+
+`MAX_CPU_JOBS` is too high, or the `flock` mechanism isn't working. Check that lock files are being created in `/tmp/` and that the `exec 9>` file descriptor redirect is correct.
+
+### All jobs route to GPU
+
+The probe is returning 0 or failing silently, and a metric of 0 always falls below the threshold. Test the probe manually:
+
+```bash
+$PROBE_TOOL -v quiet -show_entries format=duration -of csv=p=0 /path/to/input
+```
+
+If it returns empty, the input file may not be accessible to the probe tool (permissions, path issues).
+
+### Blocked jobs never get retried
+
+The caller doesn't retry on exit code 1. Check the caller's retry behavior. Some systems need specific exit codes (e.g., 75, `EX_TEMPFAIL` in `sysexits.h`, which mail systems treat as a temporary failure). Adjust the exit code in the router to match what the caller expects.
+
+### Lock files persist after crash
+
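+`flock` locks are held by the running process, not by the file, so a crashed job releases its lock automatically and a leftover file in `/tmp` does not block new jobs. The files themselves are typically cleared on reboot. Do not delete a lock file while a job is still running: the next job would lock a freshly created file and run concurrently with the old one.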
+
+---
+
+## Usage Examples
+
+### Whisper auto-captioning on PeerTube runner (cortex)
+
+```
+WORKLOAD_BINARY=/usr/local/bin/whisper-ctranslate2-real
+PROBE_TOOL=ffprobe
+GPU_VRAM_MB=16384          # RTX A4000
+WORKLOAD_VRAM_MB=3700      # Whisper medium on float16
+WORKLOAD_RAM_MB=11000      # Whisper medium on CPU int8 (peak for 9.5hr video)
+THRESHOLD_VALUE=3600       # 1 hour in seconds
+THRESHOLD_UNIT=seconds
+MAX_GPU_JOBS=2             # Runner concurrency=2, but both can be GPU
+MAX_CPU_JOBS=1             # Only 1 CPU job at a time (11GB peak, 20G MemoryMax)
+GPU_ARGS="--device cuda --compute_type float16"
+CPU_ARGS="--device cpu --compute_type int8"
+
+Result: 4100+ videos captioned. ~20 videos over 1 hour routed to CPU.
+GPU jobs: ~3.7GB VRAM, 88-99% GPU utilization
+CPU jobs: ~8-11GB RAM, serialized via flock
+MemoryMax=20G on the runner service as safety net.
+```
+
+---
+
+*Last updated: 2026-02-17*