⚠️ This page exists because something broke.
These are real crash logs from actual testing sessions. Your results may vary depending on hardware, drivers, and OpenClaw version.

Hardware Reality Check

CUDA OOM Errors & The Cold, Hard Math

💡 Stop Debugging. Start Calculating.

You've read the crash logs in the Survival Guide. You know the pain.

This page is no longer about fixing errors. It's about fixing your decision-making process.

I have tested every option below with my own wallet. Here is the cold, hard math on how to stop losing time.

Snapshot from February 2026. Information may go stale as software updates. Always verify with current documentation.

Choose Your Path

Don't guess. Identify your profile and see the math.

Option A: The Local Purist (RTX 3090/4090)

Who is this for:

You have $1500+ sunk cost and enjoy heat.

The Reality:

24GB VRAM is the absolute floor for decent R1 performance.

Verdict:

Great for privacy, terrible for ROI unless you run it 24/7.

Still want to try local? Read the crash logs below ↓

Option B: The Pragmatist (Cloud VPS / H100)

Who is this for:

You value your time at more than minimum hourly rates.

The Reality:

Spin up an H100, run your heavy batch job, kill it for the price of a coffee.

The Math:

A used 3090 costs ~$800. At ~$0.80/hr, that's 1,000 hours of rental time. Will you actually use it that much?
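
If you want to sanity-check that claim against your own usage, here is a minimal break-even sketch. Every number is an assumption pulled from this page; swap in your own prices and hours.

python
# Break-even sketch: buying a used 3090 vs. renting cloud GPU hours.
# Every figure here is an assumption -- plug in your own prices and usage.
USED_3090_PRICE_USD = 800      # rough used-market price quoted above
CLOUD_RATE_USD_HR   = 0.80     # the rental rate quoted on this page
HOURS_PER_WEEK      = 10       # hours of heavy inference you actually run

break_even_hours = USED_3090_PRICE_USD / CLOUD_RATE_USD_HR
weeks_to_break_even = break_even_hours / HOURS_PER_WEEK

print(f"Break-even after {break_even_hours:.0f} rental hours "
      f"(~{weeks_to_break_even:.0f} weeks at {HOURS_PER_WEEK} h/week)")
# -> Break-even after 1000 rental hours (~100 weeks at 10 h/week)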

Deploy on Vultr (Check Pricing) →

Real Crash Logs (From My Testing)

Still unconvinced? Here's exactly what broke on my hardware.

💔 Crash 1: The "Almost Made It" Heartbreak (RTX 3090 / 24GB)

Context: Running DeepSeek-R1-Distill-Qwen-32B (Q4_K_M quantization). The model loaded successfully and started reasoning, but once the context hit ~6k tokens, the KV cache spiked and killed it.

terminal
user@dev-machine:~/openclaw$ openclaw start --model deepseek-r1-distill-qwen-32b
[2026-02-01 14:23:07] INFO: Initializing Gateway...
[2026-02-01 14:23:08] INFO: Loading Model [32B Q4_K_M] via Ollama...
[2026-02-01 14:23:15] INFO: Model loaded (23.1 GB / 24.0 GB)
[2026-02-01 14:23:15] INFO: Starting agent loop...
[2026-02-01 14:24:42] WARN: KV Cache growing (context: 5,847 tokens)
[2026-02-01 14:24:43] WARN: KV Cache full, attempting eviction...
[2026-02-01 14:24:43] ERROR: CUDA out of memory. Tried to allocate 128.00 MiB
  (GPU 0; 23.99 GiB total capacity; 23.10 GiB already allocated; 0 bytes free)
  PyTorch attempted to reserve residual memory but failed due to fragmentation.
[System Halted] Agent crashed during reasoning chain.

💡 Technical Note: Ollama reports unified VRAM + spillover usage, while PyTorch reports physical device capacity. The "23.99 GiB total capacity" is real hardware. The 23.1 GB allocation includes the KV cache, which grows with context length.
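
Want to see why ~6k tokens is the breaking point? Here's a rough KV-cache sizing sketch. The architecture numbers (layers, KV heads, head dim) are assumptions for a 32B-class model with grouped-query attention; check the actual model config before trusting them.

python
# Rough KV-cache sizing for a 32B-class model with grouped-query attention.
# The architecture numbers are assumptions -- read them from the model's
# config.json for real planning.
n_layers      = 64      # transformer layers (assumed)
n_kv_heads    = 8       # KV heads under GQA (assumed)
head_dim      = 128     # per-head dimension (assumed)
bytes_per_val = 2       # fp16 cache; quantized KV cache types shrink this

def kv_cache_bytes(context_tokens: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * context_tokens

for ctx in (2048, 6144, 16384):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV cache")
# With only ~0.9 GiB of headroom left after ~23 GiB of weights, even a
# ~6k-token context (~1.5 GiB of cache under these assumptions) blows past 24 GiB.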

Things I tried that did NOT help:

  • Reducing batch size (OLLAMA_NUM_BATCH=1)
  • Setting OLLAMA_NUM_PARALLEL=1 (single request queue)
  • Offloading 2 layers to CPU (made it painfully slow, still OOM'd)
  • Aggressive quantization (Q3_K_M broke reasoning quality entirely)

Verdict:

24GB is the "Uncanny Valley". It loads, but crashes once context hits ~6k tokens. You can't run agent workflows without either (a) crippling the model's reasoning with tiny contexts, or (b) upgrading to 40GB+ cards.
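
If you still want to try option (a), you don't have to rebuild anything: Ollama accepts a num_ctx option per request. A minimal sketch against the local API follows; the model tag and prompt are just placeholders.

python
# Minimal sketch: cap Ollama's context window per request via num_ctx.
# Assumes Ollama is serving on its default port and the 32B distill is pulled.
import json
import urllib.request

payload = {
    "model": "deepseek-r1:32b",          # placeholder model tag
    "prompt": "Summarize the trade-offs of a small context window.",
    "stream": False,
    "options": {"num_ctx": 2048},        # smaller KV cache, but less working memory
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])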

Tested on: Ubuntu 22.04 / RTX 3090 24GB / 64GB RAM / Feb 1, 2026

Fix (The Cost of Sanity)

I tried aggressive quantization. I tried CPU offloading. I closed every Chrome tab.
At this point, renting a GPU isn't giving up; it's basic math.

$0.80/hr (Cloud) < 4 hours of debugging hardware (Your Rate)
Stop Debugging & Start Shipping (Path A) →

Crash 2: MacBook Air M2 (16GB) + R1 8B - Usable but Slow

terminal
$ ollama run deepseek-r1:8b
[pulling model] downloading...
[running] model loaded, generating...
user@macbook ~$ # Response takes 40+ seconds to complete
# 3.2 tokens/sec - technically works, practically unusable

Things I tried that did NOT help:

  • Closing all other apps (freed ~2GB, negligible speed improvement)
  • Reducing context to 2048 tokens (slightly faster, but model loses coherence)
  • Using Metal acceleration flags (already enabled by default in Ollama)

Technical Reality:

Measured with default Ollama settings (ctx ≈ 4k). Lowering the context improves speed slightly, but makes the model too dumb for complex agent tasks; the waiting-time arithmetic is sketched after the list below.

  • Tested on: macOS 14.5 / MacBook Air M2 / 16GB RAM / Jan 28, 2026
  • What broke: 3.2 tokens/sec is too slow for interactive agent workflows
  • What NOT tested: Larger Mac models (M3 Max, Mac Studio with more RAM)
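
Putting a number on "too slow": only the 3.2 tok/s figure below comes from my test. Tokens per turn and turns per session are assumptions; swap in your own workflow.

python
# How much wall-clock time does 3.2 tok/s cost over one agent session?
TOKENS_PER_SEC    = 3.2    # measured throughput on the M2 Air above
TOKENS_PER_TURN   = 400    # typical reasoning + answer length (assumed)
TURNS_PER_SESSION = 30     # tool calls and retries in one session (assumed)

seconds_per_turn = TOKENS_PER_TURN / TOKENS_PER_SEC
total_minutes = seconds_per_turn * TURNS_PER_SESSION / 60

print(f"{seconds_per_turn:.0f} s per turn, ~{total_minutes:.0f} minutes "
      f"of pure waiting per session")
# -> 125 s per turn, ~62 minutes of pure waiting per session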

Fix (The Cost of Sanity)

Apple Silicon is great for inference, but agent workflows need low latency. Every 40-second response adds up to hours of waiting.

$0.80/hr (Cloud) < Time value of waiting for slow inference
Stop Debugging & Start Shipping (Path A) →

Crash 3: System Hangs (Kernel Swap Death Spiral)

terminal
$ top
PID    COMMAND        %CPU  %MEM
1234   ollama         120   45.2
5678   python         95    38.7
# System becomes unresponsive
# Everything slows to crawl
# Force reboot required

Things I tried that did NOT help:

  • Increasing swap size (just delayed the inevitable)
  • Setting OLLAMA_NUM_GPU=0 to force CPU (still ate 32GB+ RAM)
  • Killing all background processes (bought ~5 minutes before OOM)

Verdict:

Don't run large models on RAM-only systems. The kernel swap death spiral is real: once it starts swapping, everything freezes and you need a hard reboot.

  • Tested on: Ubuntu 22.04 / 32GB RAM (no GPU) / Feb 2, 2026
  • What broke: System ran out of RAM and started swapping, became unresponsive
  • What NOT tested: Systems with 64GB+ RAM (might work, but why suffer?)
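
If you insist on trying a RAM-only box, at least run a pre-flight estimate before loading anything. The sizing heuristic below is a rough assumption (~4.5 bits/weight for Q4_K_M plus overhead), and it ignores whatever your agent process itself eats.

python
# Pre-flight sanity check on a Linux box: estimate model RAM vs. physical RAM
# before loading, instead of discovering the swap spiral the hard way.
import os

def estimate_model_ram_gib(params_billions: float,
                           bits_per_weight: float = 4.5,  # ~Q4_K_M (assumed)
                           overhead: float = 1.2) -> float:
    # overhead roughly covers KV cache and runtime buffers (assumed)
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30 * overhead

def physical_ram_gib() -> float:
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30

needed = estimate_model_ram_gib(32)   # e.g. a 32B model at a Q4_K_M-class quant
have = physical_ram_gib()
print(f"Estimated need: ~{needed:.0f} GiB, machine has: {have:.0f} GiB")
if needed > have * 0.6:               # leave headroom for the OS and the agent process
    print("Too tight. This is how the swap death spiral starts.")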

Fix (The Cost of Sanity)

RAM-only inference is a false economy. You'll spend more time waiting for crashes than you will shipping code.

$0.80/hr (Cloud) < Dealing with frozen systems & lost work
Stop Debugging & Start Shipping (Path A) →

Final Calculation

If you spend 4 hours fixing a CUDA driver to save $2 on hosting, you are valuing your time at fifty cents an hour.

If you are an engineer, your time is worth more.
Stop fighting physics. Rent the compute.