Troubleshooting • 2026-02-03

Fix OpenClaw CUDA OOM: The $0.50 Solution vs. The 4-Hour Debug

Stop fighting VRAM physics. Copy my config to fix OOM on RTX 3090, or see why renting an H100 is cheaper than your hourly rate.

By: LazyDev
#CUDA #OOM #DeepSeek #OpenClaw #VRAM #Troubleshooting

Fix OpenClaw CUDA Out of Memory Errors

Error Confirmation

Error: CUDA out of memory. Tried to allocate 2.5GiB
  (GPU 0: NVIDIA GeForce RTX 3080; 10GiB total capacity;
   8.2GiB already allocated; 1.5GiB free; 9.7GiB reserved)

Stack trace:
  at /pytorch/aten/src/ATen/cuda/CUDAGraphs.cuh:287
  at openclaw/runtime/gpu_allocator.py:142
  at Model.load_weights (/lib/model_loader.py:89)

Or the raw PyTorch error:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 10.00 GiB
of which 14.20 MiB is free. Process included: PID 2812 (python3)
- using 9.98 GiB.

Scope: OpenClaw crashes when the GPU runs out of VRAM (Video RAM). This is not a software bug — it's a hardware constraint. The model requires more memory than your GPU has available.

Error Code: CUDA out of memory — GPU VRAM is a fixed physical resource. When the model + KV cache exceeds available VRAM, CUDA cannot allocate more and the process crashes.
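
You can confirm these numbers from inside Python instead of waiting for the crash. A minimal sketch, assuming only that PyTorch with CUDA support is installed; it prints the same total/allocated/reserved figures the error message quotes:

# vram_report.py - print the VRAM figures the OOM error reports
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible - this is not a CUDA OOM problem")

props = torch.cuda.get_device_properties(0)
gib = 1024 ** 3
print(f"{props.name}: {props.total_memory / gib:.1f} GiB total, "
      f"{torch.cuda.memory_allocated(0) / gib:.1f} GiB allocated, "
      f"{torch.cuda.memory_reserved(0) / gib:.1f} GiB reserved")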

Verified Environment

| Component | Version | Last Verified |
|---|---|---|
| OpenClaw | Latest stable | 2026-02-06 |
| CUDA | 11.8+, 12.x | 2026-02-06 |
| PyTorch | 2.0+ | 2026-02-06 |
| NVIDIA Driver | 525+ | 2026-02-06 |
| Models | DeepSeek R1 8B, 32B, 70B | 2026-02-06 |

VRAM Requirements (FP16):

| Model | VRAM Required (FP16) | VRAM Required (Q4) |
|---|---|---|
| DeepSeek R1 8B | ~16GB | ~6GB |
| DeepSeek R1 32B | ~64GB | ~20GB |
| DeepSeek R1 70B | ~140GB | ~42GB |

Note: These are minimums. KV cache, conversation history, and other processes add to VRAM usage.
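
These figures follow from simple arithmetic: weight memory is roughly parameter count × bytes per parameter (2 bytes at FP16, around 0.6 bytes at Q4_K_M), before KV cache and runtime overhead are added. A back-of-the-envelope sketch, not a measurement from the actual loader:

# Rough weight-memory estimate; the table above adds KV cache and allocator overhead.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params at 1 byte/param ~= 1 GB

for name, params in [("8B", 8), ("32B", 32), ("70B", 70)]:
    print(f"DeepSeek R1 {name}: ~{weights_gb(params, 2.0):.0f} GB FP16, "
          f"~{weights_gb(params, 0.6):.0f} GB Q4 (weights only)")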


3-Minute Sanity Check

Run these commands to confirm VRAM capacity and usage:

# 1. Check your GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits
# Expected: e.g., "RTX 3080,10240,2048" (name,total MB,free MB)

# 2. Check current VRAM usage while OpenClaw runs
nvidia-smi
# Look at "Memory-Usage" column - if near 100%, you're OOM

# 3. Check what model you're trying to run
ollama list | grep deepseek
# Expected: Shows installed DeepSeek R1 variants

# 4. Calculate required VRAM
# For DeepSeek R1 32B Q4: ~20GB minimum
# For DeepSeek R1 8B Q4: ~6GB minimum
python3 -c "print(f'Required: {int(20 * 1024)} MB for 32B Q4')"

If step 1 shows < 6GB free: You cannot run DeepSeek R1 8B even at Q4 without offloading layers to the CPU.

If step 1 shows < 20GB free: You cannot run DeepSeek R1 32B even at Q4 without extreme measures.

If your GPU is integrated (Intel/AMD): This is not CUDA OOM. You're running on the CPU or another non-CUDA backend, which is much slower. Check whether you're actually using an NVIDIA GPU.
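
To turn steps 1 and 4 into a single pass/fail answer, you can parse the nvidia-smi output and compare it against the Q4 minimums above. A sketch; the thresholds are the approximate minimums from the VRAM table, and only the first GPU is checked:

# fits_check.py - does the Q4 model fit in the VRAM that is currently free?
import subprocess

REQUIRED_MB = {"DeepSeek R1 8B (Q4)": 6 * 1024, "DeepSeek R1 32B (Q4)": 20 * 1024}

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
free_mb = int(out.stdout.strip().splitlines()[0])  # first GPU only

for model, need_mb in REQUIRED_MB.items():
    verdict = "fits" if free_mb >= need_mb else "will OOM"
    print(f"{model}: needs ~{need_mb} MB, {free_mb} MB free -> {verdict}")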


Decision Gate

Stop fighting VRAM physics.

Should you keep debugging CUDA OOM locally?

Continue local debugging only if:

  • Your GPU has the minimum VRAM for your target model (6GB for 8B Q4, 20GB for 32B Q4)
  • You have not tried quantization (Q4_K_M, Q5_K_M) yet
  • You have not tried reducing the context window (--num_ctx 4096 or lower)
  • You are willing to accept reduced model quality for local execution

Stop here if any apply:

  • Your GPU VRAM is mathematically insufficient for the model (e.g., 8GB VRAM for a 32B model)
  • You already quantized (Q4) and reduced context, and still hit OOM
  • You have spent more than ~30 minutes on VRAM optimization
  • You need production reliability (consumer GPUs are not designed for 24/7 AI workloads)

Past this point, you are fighting physics, not software. VRAM is a fixed resource. No config change can make 8GB physically hold 20GB.


Primary Exit Path: Local Optimization

Use when: Your GPU has sufficient VRAM for the model, or you're willing to use a smaller model.

Why this works:

  • Quantization reduces VRAM usage by 60-75%
  • Context reduction limits KV cache growth
  • Model selection matches VRAM to requirements

Time investment: 10-15 minutes

Solution 1: Use a Quantized Model (Recommended)

Quantized models use fewer bits per parameter, dramatically reducing VRAM requirements.

# Check available quantized versions
ollama list | grep deepseek

# Run Q4 quantized version (uses ~60% less VRAM)
ollama run deepseek-r1:32b-q4_K_M

# Or use the 8B model instead (much lower VRAM)
ollama run deepseek-r1:8b-q4_K_M

# Configure OpenClaw to use quantized model
export OPENCLAW_MODEL="deepseek-r1:32b-q4_K_M"
openclaw serve

Solution 2: Reduce Context Window

Limiting the context window reduces KV cache size and prevents Agent Loop OOM.

# Default is usually 16384, cut to 4096 or 2048
ollama run deepseek-r1:32b-q4_K_M \
  --num_ctx 4096 \
  --num-gpu 99 \
  --repeat-penalty 1.1   # penalty values below 1 encourage repetition
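
To put numbers on it: the KV cache grows roughly as 2 (K and V) × layers × context tokens × KV heads × head dimension × bytes per element, so it scales linearly with num_ctx. A sketch with illustrative architecture numbers; the layer and head counts below are assumptions for demonstration, not the exact DeepSeek R1 32B configuration:

# KV cache scales linearly with num_ctx, so halving the context halves the cache.
def kv_cache_gb(num_ctx: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * layers * num_ctx * kv_heads * head_dim * bytes_per_elem / 1024**3

for ctx in (16384, 4096, 2048):
    print(f"num_ctx={ctx:>5}: ~{kv_cache_gb(ctx):.1f} GB of KV cache")

With these assumed numbers, dropping from 16384 to 4096 tokens frees roughly 3 GB of VRAM on its own.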

Solution 3: Partial GPU Offload

Let GPU handle some layers, CPU handles the rest. Slower, but uses less VRAM.

# Only load 35 layers on GPU, rest on CPU
ollama run deepseek-r1:32b-q4_K_M \
  --num-gpu-layers 35 \
  --num_ctx 2048

# Warning: CPU inference is 5-10x slower
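
To pick a sensible layer count instead of guessing, divide the VRAM you can spare for weights by the approximate per-layer size (quantized model size divided by its layer count). A rough sketch; the 20 GB and 64-layer figures are the Q4 estimate from the table plus an assumed layer count, not measured values:

# Estimate how many layers fit on the GPU, keeping headroom for the KV cache.
def layers_that_fit(free_vram_gb: float, model_weights_gb: float = 20.0,
                    total_layers: int = 64, reserve_gb: float = 2.0) -> int:
    per_layer_gb = model_weights_gb / total_layers   # ~0.3 GB per layer at Q4
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb / per_layer_gb))

print(layers_that_fit(12.0))   # e.g. ~32 layers fit on a 12 GB card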

Solution 4: Enable Conversation Pruning

OpenClaw can automatically drop old messages to prevent KV cache overflow.

# openclaw_config.py
from openclaw import Client

client = Client(model="deepseek-r1:8b-q4_K_M")

# Auto-prune conversation history
client.max_history_messages = 10
client.context_window = 4096

# OpenClaw drops old messages to prevent OOM
response = client.generate("Your prompt here")
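
If your OpenClaw version does not expose these settings, the same idea can be implemented by hand: keep the system prompt plus only the most recent messages before each request. A generic sketch that assumes the common role/content message format, independent of any particular client API:

# Keep the system prompt plus the N most recent turns before each request.
def prune_history(messages: list[dict], max_messages: int = 10) -> list[dict]:
    system = [m for m in messages if m.get("role") == "system"]
    recent = [m for m in messages if m.get("role") != "system"][-max_messages:]
    return system + recent

history = [{"role": "system", "content": "You are a coding agent."}]
# ...user/assistant turns accumulate here as the agent loop runs...
history = prune_history(history, max_messages=10)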

Verification:

# Monitor VRAM while running
watch -n 1 nvidia-smi

# In another terminal, run OpenClaw
# VRAM should stabilize below 100%

Secondary Exit Path (Conditional)

Use when: Primary Exit Path fails — your GPU physically cannot hold the model.

This is the only remaining technically valid option when:

  • Your GPU VRAM < 6GB (cannot run 8B even with quantization)
  • You need DeepSeek R1 32B or 70B for your use case
  • You require production reliability (consumer GPUs are not designed for 24/7 workloads)
  • You've quantized, reduced context, and still hit OOM

Solution: Cloud GPU with Dedicated VRAM

Cloud GPUs provide the VRAM needed for larger models. This is not a "workaround" — it's the correct solution when hardware constraints are absolute.

Why this is valid:

  • VRAM is a physical resource. You cannot add VRAM to a consumer GPU.
  • Cloud GPUs (H100, A100) offer 40GB-80GB of VRAM per card: enough for any DeepSeek R1 variant at Q4, and for the 8B and 32B models at full precision.
  • Hourly billing lets you test and scale.
  • Production isolation prevents AI workloads from affecting your development machine.

Steps:

  1. Choose a GPU provider with NVIDIA H100 or A100 availability
  2. Select GPU VRAM based on your model:
    • DeepSeek R1 8B: 16GB+ VRAM (enough for FP16)
    • DeepSeek R1 32B: 24GB+ VRAM for Q4, 80GB for FP16
    • DeepSeek R1 70B: 48GB+ VRAM for Q4, multiple GPUs for FP16
  3. Install Ollama on the VPS
  4. Run DeepSeek R1 at full precision where the VRAM allows it, without the quantization and context compromises required locally

Time investment: 20-30 minutes (including VPS setup)

Note: This is not "giving up" on local debugging. It's recognizing that VRAM is a fixed resource and selecting the appropriate hardware for your workload.


Approved Next Action

When local VRAM is exhausted, the remaining option is a cloud GPU with dedicated VRAM. Start with a clean Linux environment and NVIDIA GPU (H100, A100, or L40S are recommended for DeepSeek R1 32B/70B).

View Cloud GPU Options →


Why NOT Other Options

| Option | Rejection Reason |
|---|---|
| Add more system RAM | CUDA errors are about GPU VRAM, not system RAM. Adding 64GB of system RAM won't help if your GPU has 8GB VRAM. |
| Extreme quantization (Q2/Q3) | Model quality degrades significantly. The output becomes incoherent for OpenClaw agent loops. Not technically viable. |
| Swap to SSD | Impossibly slow. GPU memory access is ~1000x faster than SSD. Swapping GPU memory to disk would make inference unusable. |
| Run multiple GPUs in parallel | Doesn't increase per-GPU VRAM. Multi-GPU splits the model across GPUs, but each GPU still needs sufficient VRAM for its shard. |
| Wait for model optimization | DeepSeek R1 is already optimized. The VRAM requirements are fundamental to the model architecture. |
| Upgrade consumer GPU | RTX 4090 (24GB) costs $1600+. For that price, you can run a cloud GPU for months. And the 4090 still can't run R1 70B at full precision. |

Summary

| Check | Command | Pass Criteria |
|---|---|---|
| GPU VRAM available | nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | Shows free VRAM in MB |
| Model VRAM requirement | Calculate based on model size | Free VRAM ≥ Required VRAM |
| Quantization tested | ollama run deepseek-r1:32b-q4_K_M | Model loads without OOM |
| Context reduction tested | ollama run --num_ctx 4096 | Stable after 10+ messages |

Decision:

  • VRAM ≥ Required VRAM: Use Primary Exit Path (local optimization). Quantize and reduce context.
  • VRAM < Required VRAM after optimization: Use Secondary Exit Path (cloud GPU). This is physics, not failure.
  • Production reliability required: Use cloud GPU regardless. Consumer GPUs are not designed for 24/7 AI workloads.

Hardware Reality Check:

| GPU | VRAM | Can Run 8B Q4? | Can Run 32B Q4? |
|---|---|---|---|
| RTX 3060 | 12GB | Yes (tight) | No |
| RTX 3080 | 10GB | No | No |
| RTX 3090 | 24GB | Yes | Yes (tight) |
| RTX 4090 | 24GB | Yes | Yes (tight) |
| Cloud H100 | 80GB | Yes | Yes (full precision) |

Last resort: If you have spent more than 30 minutes on VRAM optimization and still hit OOM, your GPU is not sufficient for your target model. Use a cloud GPU or switch to a smaller model. This is not a failure — it's matching hardware to workload requirements.



FAQ

Q: Will adding more system RAM fix CUDA OOM?

A: No. CUDA errors are about GPU VRAM, not system RAM. Adding 64GB of system RAM won't help if your GPU has 8GB VRAM. The model must fit in GPU memory to run. You can offload some layers to CPU, but performance drops significantly.

Q: Can I run DeepSeek R1 32B on an RTX 3060 (12GB)?

A: Not practically. The Q4 quantized version requires ~20GB VRAM. You could try extreme quantization (Q2), but output quality degrades significantly. Better option: use DeepSeek R1 8B, or use a cloud GPU with 24GB+ VRAM.

Q: Why does it work for 10 messages then crash?

A: That's Agent Loop OOM. OpenClaw accumulates conversation history in the KV cache, which grows with each message. After 10-15 messages, the cache fills your VRAM. Fix: reduce context window (--num_ctx 4096) or enable history pruning (client.max_history_messages = 10).

Q: Is cloud GPU worth it for OpenClaw?

A: If your local GPU is insufficient, yes. Cloud GPUs provide VRAM that no consumer GPU has (80GB on H100). You pay for what you use, and you get production reliability. For occasional testing, local is fine. For production or heavy usage, cloud GPU is the correct technical choice.


Still Stuck? Check Your Hardware

Sometimes the code is fine, but the GPU is simply refusing to cooperate. Before you waste another hour debugging, compare your specs against the Hardware Reality Table to see if you are fighting impossible physics.

Bookmark this site

New fixes are added as soon as they appear on GitHub Issues.

Browse Error Index →