Fix OpenClaw CUDA OOM: The $0.50 Solution vs. The 4-Hour Debug
Stop fighting VRAM physics. Copy my config to fix OOM on RTX 3090, or see why renting an H100 is cheaper than your hourly rate.
Fix OpenClaw CUDA Out of Memory Errors
Error Confirmation
Error: CUDA out of memory. Tried to allocate 2.5GiB
(GPU 0: NVIDIA GeForce RTX 3080; 10GiB total capacity;
8.2GiB already allocated; 1.5GiB free; 9.7GiB reserved)
Stack trace:
at /pytorch/aten/src/ATen/cuda/CUDAGraphs.cuh:287
at openclaw/runtime/gpu_allocator.py:142
at Model.load_weights (/lib/model_loader.py:89)
Or the raw PyTorch error:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 10.00 GiB
of which 14.20 MiB is free. Process included: PID 2812 (python3)
- using 9.98 GiB.
Scope: OpenClaw crashes when the GPU runs out of VRAM (Video RAM). This is not a software bug; it's a hardware constraint. The model requires more memory than your GPU has available.
Error Code: CUDA out of memory. GPU VRAM is a fixed physical resource: when the model plus KV cache exceeds available VRAM, CUDA cannot allocate more and the process crashes.
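To see the same arithmetic before a load fails, a minimal PyTorch check (a sketch, assuming a CUDA build of PyTorch 2.x) compares free VRAM against the allocation size reported in the error:
# Minimal sketch: compare free VRAM against the allocation that just failed.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()   # bytes on the current device
    needed_bytes = int(2.5 * 1024**3)                      # the 2.5 GiB allocation from the error above
    print(f"Free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB total")
    print(f"Need: {needed_bytes / 1024**3:.2f} GiB")
    if needed_bytes > free_bytes:
        print("This allocation will raise CUDA out of memory.")
else:
    print("No CUDA device visible - this is not a CUDA OOM scenario.")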
Verified Environment
| Component | Version | Last Verified |
|---|---|---|
| OpenClaw | Latest stable | 2026-02-06 |
| CUDA | 11.8+, 12.x | 2026-02-06 |
| PyTorch | 2.0+ | 2026-02-06 |
| NVIDIA Driver | 525+ | 2026-02-06 |
| Models | DeepSeek R1 8B, 32B, 70B | 2026-02-06 |
VRAM Requirements:
| Model | VRAM (FP16) | VRAM (Q4) |
|---|---|---|
| DeepSeek R1 8B | ~16GB | ~6GB |
| DeepSeek R1 32B | ~64GB | ~20GB |
| DeepSeek R1 70B | ~140GB | ~42GB |
Note: These are minimums. KV cache, conversation history, and other processes add to VRAM usage.
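These figures are essentially parameter count times bytes per weight; KV cache and runtime overhead come on top. A rough sketch of the arithmetic (the ~5 bits/weight figure for Q4_K_M-class quantization is an approximation):
# Rough sketch: weights-only VRAM = parameters x bytes per weight.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8   # billions of params x bytes/param = GB

for name, params in [("8B", 8), ("32B", 32), ("70B", 70)]:
    fp16 = weights_gb(params, 16)
    q4 = weights_gb(params, 5)   # Q4_K_M averages roughly 5 bits per weight
    print(f"DeepSeek R1 {name}: ~{fp16:.0f} GB FP16, ~{q4:.0f} GB Q4 (before KV cache)")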
3-Minute Sanity Check
Run these commands to confirm VRAM capacity and usage:
# 1. Check your GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits
# Expected: e.g., "RTX 3080,10240,2048" (name,total MB,free MB)
# 2. Check current VRAM usage while OpenClaw runs
nvidia-smi
# Look at "Memory-Usage" column - if near 100%, you're OOM
# 3. Check what model you're trying to run
ollama list | grep deepseek
# Expected: Shows installed DeepSeek R1 variants
# 4. Calculate required VRAM
# For DeepSeek R1 32B Q4: ~20GB minimum
# For DeepSeek R1 8B Q4: ~6GB minimum
python3 -c "print(f'Required: {int(20 * 1024)} MB for 32B Q4')"
If step 1 shows < 6GB free: You cannot run DeepSeek R1 8B even quantized (Q4) without CPU offloading.
If step 1 shows < 20GB free: You cannot run DeepSeek R1 32B without extreme measures.
If your GPU is integrated (Intel/AMD): This is not CUDA OOM. You're running on CPU/Metal, which is much slower. Check if you're actually using an NVIDIA GPU.
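To automate steps 1 and 4, a small sketch can parse the same nvidia-smi query and compare free VRAM against the Q4 figures from the table above (the thresholds are this guide's estimates, not hard limits):
# Sketch: parse free VRAM and report which DeepSeek R1 Q4 variants will fit.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
free_mb = int(out.splitlines()[0].strip())   # first GPU only

requirements_mb = {
    "DeepSeek R1 8B (Q4)": 6 * 1024,
    "DeepSeek R1 32B (Q4)": 20 * 1024,
    "DeepSeek R1 70B (Q4)": 42 * 1024,
}

print(f"Free VRAM: {free_mb} MB")
for model, need in requirements_mb.items():
    verdict = "fits" if free_mb >= need else "will OOM"
    print(f"  {model}: needs ~{need} MB -> {verdict}")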
Decision Gate
Stop fighting VRAM physics.
Should you keep debugging CUDA OOM locally?
Continue local debugging only if:
- Your GPU's free VRAM meets the Q4 requirement for your target model (see the table above), or
- You are willing to switch to a smaller model (e.g., DeepSeek R1 8B) or accept slower partial CPU offload.
Stop here if any apply:
- Your GPU has less than 6GB of free VRAM (even 8B Q4 will not fit).
- You need DeepSeek R1 32B or 70B and have less than 24GB of VRAM.
- You have already quantized, reduced the context window, and offloaded layers, and still hit OOM.
- You need production reliability and your hardware keeps hitting OOM.
Past this point, you are fighting physics, not software. VRAM is a fixed resource. No config change can make 8GB physically hold 20GB.
Primary Exit Path: Local Optimization
Use when: Your GPU has sufficient VRAM for the model, or you're willing to use a smaller model.
Why this works:
- Quantization reduces VRAM usage by 60-75%
- Context reduction limits KV cache growth
- Model selection matches VRAM to requirements
Time investment: 10-15 minutes
Solution 1: Use a Quantized Model (Recommended)
Quantized models use fewer bits per parameter, dramatically reducing VRAM requirements.
# Check available quantized versions
ollama list | grep deepseek
# Run Q4 quantized version (uses ~60% less VRAM)
ollama run deepseek-r1:32b-q4_K_M
# Or use the 8B model instead (much lower VRAM)
ollama run deepseek-r1:8b-q4_K_M
# Configure OpenClaw to use quantized model
export OPENCLAW_MODEL="deepseek-r1:32b-q4_K_M"
openclaw serve
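Before exporting OPENCLAW_MODEL, it is worth confirming the quantized tag is actually installed. A small sketch against Ollama's local /api/tags endpoint (the tag name is an example; match whatever ollama list prints):
# Sketch: confirm a quantized DeepSeek R1 tag is installed before OpenClaw starts.
import json
import urllib.request

TARGET = "deepseek-r1:32b-q4_K_M"   # example tag - use the name from `ollama list`

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]

if TARGET in models:
    print(f"{TARGET} is installed - safe to export OPENCLAW_MODEL={TARGET}")
else:
    print(f"{TARGET} not found. Installed: {models}")
    print(f"Run: ollama pull {TARGET}")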
Solution 2: Reduce Context Window
Limiting the context window reduces KV cache size and prevents Agent Loop OOM.
# Cut the context window (num_ctx) to 4096 or 2048 to cap KV cache growth
ollama run deepseek-r1:32b-q4_K_M
# Then, inside the interactive session:
/set parameter num_ctx 4096
# To make it permanent, put `PARAMETER num_ctx 4096` in a Modelfile and run `ollama create`
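If your client talks to Ollama over HTTP instead of the CLI, the same cap can be passed per request through the options field of /api/generate. A sketch (the prompt is a placeholder):
# Sketch: cap num_ctx per request through Ollama's HTTP API.
import json
import urllib.request

payload = {
    "model": "deepseek-r1:32b-q4_K_M",
    "prompt": "Summarize the last error message.",   # placeholder prompt
    "stream": False,
    "options": {"num_ctx": 4096},                     # caps the KV cache for this request
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"][:200])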
Solution 3: Partial GPU Offload
Let GPU handle some layers, CPU handles the rest. Slower, but uses less VRAM.
# Only load 35 layers on the GPU, keep the rest on the CPU
ollama run deepseek-r1:32b-q4_K_M
# Then, inside the interactive session:
/set parameter num_gpu 35
/set parameter num_ctx 2048
# Warning: layers that fall back to the CPU run 5-10x slower
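To measure what the offload actually costs on your hardware, the /api/generate response includes eval_count and eval_duration (in nanoseconds), which give tokens per second. A rough benchmark sketch using the same settings:
# Sketch: measure tokens/sec with partial offload to see the CPU-fallback penalty.
import json
import urllib.request

payload = {
    "model": "deepseek-r1:32b-q4_K_M",
    "prompt": "Explain KV cache in two sentences.",
    "stream": False,
    "options": {"num_gpu": 35, "num_ctx": 2048},   # partial-offload settings from above
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

tokens = body["eval_count"]
seconds = body["eval_duration"] / 1e9   # reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")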
Solution 4: Enable Conversation Pruning
OpenClaw can automatically drop old messages to prevent KV cache overflow.
# openclaw_config.py
from openclaw import Client
client = Client(model="deepseek-r1:8b-q4_K_M")
# Auto-prune conversation history
client.max_history_messages = 10
client.context_window = 4096
# OpenClaw drops old messages to prevent OOM
response = client.generate("Your prompt here")
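Conceptually, pruning is just a bounded message list: keep the system prompt plus the last N messages and drop everything older. A minimal sketch of that idea, independent of the OpenClaw client (the names here are illustrative):
# Sketch: bounded conversation history - drop the oldest messages once the cap is hit.
from collections import deque

MAX_HISTORY = 10                      # mirrors client.max_history_messages above
system_prompt = {"role": "system", "content": "You are a coding agent."}
history = deque(maxlen=MAX_HISTORY)   # deque silently evicts the oldest entries

def add_turn(role: str, content: str) -> list[dict]:
    """Append a message and return the pruned prompt to send to the model."""
    history.append({"role": role, "content": content})
    return [system_prompt, *history]

for i in range(25):                   # simulate a long agent loop
    messages = add_turn("user", f"step {i}")
print(f"Messages sent on turn 25: {len(messages)} (capped, so the KV cache stays bounded)")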
Verification:
# Monitor VRAM while running
watch -n 1 nvidia-smi
# In another terminal, run OpenClaw
# VRAM should stabilize below 100%
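watch gives a live view; to keep a record of gradual KV-cache growth across an agent session, a simple poller that logs timestamped readings works too (a sketch using the same nvidia-smi query):
# Sketch: log free VRAM every 5 seconds so gradual KV-cache growth is visible afterwards.
import subprocess
import time
from datetime import datetime

while True:
    free_mb = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0].strip()
    print(f"{datetime.now():%H:%M:%S} free VRAM: {free_mb} MB", flush=True)
    time.sleep(5)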
Secondary Exit Path (Conditional)
Use when: The Primary Exit Path fails because your GPU physically cannot hold the model.
This is the only remaining technically valid option when:
- Your GPU VRAM < 6GB (cannot run 8B even with quantization)
- You need DeepSeek R1 32B or 70B for your use case
- You require production reliability (consumer GPUs are not designed for 24/7 workloads)
- You've quantized, reduced context, and still hit OOM
Solution: Cloud GPU with Dedicated VRAM
Cloud GPUs provide the VRAM needed for larger models. This is not a "workaround"; it's the correct solution when hardware constraints are absolute.
Why this is valid:
- VRAM is a physical resource. You cannot add VRAM to a consumer GPU.
- Cloud GPUs (H100, A100) have 40GB-80GB of VRAM, enough for any quantized DeepSeek R1 variant and for 32B at full precision.
- Hourly billing lets you test and scale.
- Production isolation prevents AI workloads from affecting your development machine.
Steps:
- Choose a GPU provider with NVIDIA H100 or A100 availability
- Select GPU VRAM based on your model:
- DeepSeek R1 8B: 16GB+ VRAM
- DeepSeek R1 32B: 24GB+ VRAM
- DeepSeek R1 70B: 48GB+ VRAM
- Install Ollama on the VPS
- Run DeepSeek R1 without local VRAM compromises (an 80GB H100 fits 32B at full precision or 70B at Q4)
Time investment: 20-30 minutes (including VPS setup)
Note: This is not "giving up" on local debugging. It's recognizing that VRAM is a fixed resource and selecting the appropriate hardware for your workload.
Approved Next Action
When local VRAM is exhausted, the remaining option is a cloud GPU with dedicated VRAM. Start with a clean Linux environment and NVIDIA GPU (H100, A100, or L40S are recommended for DeepSeek R1 32B/70B).
Why NOT Other Options
| Option | Rejection Reason |
|---|---|
| Add more system RAM | CUDA errors are about GPU VRAM, not system RAM. Adding 64GB of system RAM won't help if your GPU has 8GB VRAM. |
| Extreme quantization (Q2/Q3) | Model quality degrades significantly. The output becomes incoherent for OpenClaw agent loops. Not technically viable. |
| Swap to SSD | Impossibly slow. GPU memory access is ~1000x faster than SSD. Swapping GPU memory to disk would make inference unusable. |
| Run multiple GPUs in parallel | Doesn't increase per-GPU VRAM. Multi-GPU splits the model across GPUs, but each GPU still needs sufficient VRAM for its shard. |
| Wait for model optimization | DeepSeek R1 is already optimized. The VRAM requirements are fundamental to the model architecture. |
| Upgrade consumer GPU | RTX 4090 (24GB) costs $1600+. For that price, you can run a cloud GPU for months. And 4090 still can't run R1 70B at full precision. |
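The break-even claim in the last row is easy to check with your own numbers. A sketch with example prices (the card price comes from the table; the cloud rate and usage hours are assumptions you should replace with real quotes):
# Sketch: GPU purchase vs cloud rental break-even, using assumed example prices.
CARD_PRICE_USD = 1600          # RTX 4090 street price from the table above
CLOUD_RATE_USD_PER_HR = 2.50   # assumed hourly rate for a rented high-VRAM GPU
HOURS_PER_MONTH = 40           # assumed actual usage - change to your workload

monthly_cloud_cost = CLOUD_RATE_USD_PER_HR * HOURS_PER_MONTH
breakeven_months = CARD_PRICE_USD / monthly_cloud_cost
print(f"Cloud: ${monthly_cloud_cost:.0f}/month at {HOURS_PER_MONTH} h/month")
print(f"Break-even vs buying the card: ~{breakeven_months:.0f} months")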
Summary
| Check | Command | Pass Criteria |
|---|---|---|
| GPU VRAM available | nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | Shows free VRAM in MB |
| Model VRAM requirement | Calculate based on model size | Free VRAM ≥ Required VRAM |
| Quantization tested | ollama run deepseek-r1:32b-q4_K_M | Model loads without OOM |
| Context reduction tested | /set parameter num_ctx 4096 (inside the ollama run session) | Stable after 10+ messages |
Decision:
- VRAM ≥ Required VRAM: Use Primary Exit Path (local optimization). Quantize and reduce context.
- VRAM < Required VRAM after optimization: Use Secondary Exit Path (cloud GPU). This is physics, not failure.
- Production reliability required: Use cloud GPU regardless. Consumer GPUs are not designed for 24/7 AI workloads.
Hardware Reality Check:
| GPU | VRAM | Can Run 8B Q4? | Can Run 32B Q4? |
|---|---|---|---|
| RTX 3060 | 12GB | Yes (tight) | No |
| RTX 3080 | 10GB | Yes (tight) | No |
| RTX 3090 | 24GB | Yes | Yes (tight) |
| RTX 4090 | 24GB | Yes | Yes (tight) |
| Cloud H100 | 80GB | Yes | Yes (full precision) |
Last resort: If you have spent more than 30 minutes on VRAM optimization and still hit OOM, your GPU is not sufficient for your target model. Use a cloud GPU or switch to a smaller model. This is not a failure; it's matching hardware to workload requirements.
Related Guides
- Hardware Requirements Reality Check - Can your PC run OpenClaw?
- OpenClaw Agent API Cost Model - API vs GPU breakpoint analysis
- Fix OpenClaw JSON Mode Errors - DeepSeek thinking tags
FAQ
Q: Will adding more system RAM fix CUDA OOM?
A: No. CUDA errors are about GPU VRAM, not system RAM. Adding 64GB of system RAM won't help if your GPU has 8GB VRAM. The model must fit in GPU memory to run. You can offload some layers to CPU, but performance drops significantly.
Q: Can I run DeepSeek R1 32B on an RTX 3060 (12GB)?
A: Not practically. The Q4 quantized version requires ~20GB VRAM. You could try extreme quantization (Q2), but output quality degrades significantly. Better option: use DeepSeek R1 8B, or use a cloud GPU with 24GB+ VRAM.
Q: Why does it work for 10 messages then crash?
A: That's Agent Loop OOM. OpenClaw accumulates conversation history in the KV cache, which grows with each message. After 10-15 messages, the cache fills your VRAM. Fix: reduce the context window (set num_ctx to 4096) or enable history pruning (client.max_history_messages = 10).
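To put numbers on that growth: each generated or ingested token adds a key and a value vector per layer to the cache. A rough sketch (the layer/head figures are illustrative assumptions, not DeepSeek R1's actual architecture; substitute the real config for exact numbers):
# Sketch: rough KV cache size as the context fills - illustrative architecture numbers.
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    # 2x for keys + values, per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1024**3

LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128   # assumed values for a 32B-class model
for ctx in (2048, 4096, 16384):
    print(f"num_ctx={ctx:>5}: ~{kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM):.2f} GB KV cache")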
Q: Is cloud GPU worth it for OpenClaw?
A: If your local GPU is insufficient, yes. Cloud GPUs provide VRAM that no consumer GPU has (80GB on H100). You pay for what you use, and you get production reliability. For occasional testing, local is fine. For production or heavy usage, cloud GPU is the correct technical choice.
Still Stuck? Check Your Hardware
Sometimes the code is fine, but the GPU is simply refusing to cooperate. Before you waste another hour debugging, compare your specs against the Hardware Reality Table to see if you are fighting impossible physics.
Bookmark this site
New fixes are added as soon as they appear on GitHub Issues.
Browse Error Index →