OpenClaw Slow Inference? Why 3.5s/token Is Normal (And How to Fix It)
OpenClaw generating at 3.5 seconds per token? That's painfully slow. Learn why Mac RAM bandwidth kills inference speed, and see the config that gets you to 100 t/s.
TL;DR: The Fix
OpenClaw feels slow because your hardware can't move data fast enough. A MacBook M2 gets ~3 t/s. An RTX 4090 gets ~80 t/s. A cloud A100 gets ~100-120 t/s.
Quick Fix #1 (Reduce Context):
{ "num_ctx": 2048, "n_gpu_layers": 35 }Quick Fix #2 (Use Cloud):
Stop debugging physics. Deploy on Vultr (H100/A100 Ready) (High Availability & Limited Time Promotion for new accounts): rent an A100 by the hour (~$1.50/hr) and get 100+ t/s.
The Log: What "Slow" Actually Looks Like
I ran OpenClaw with DeepSeek R1 8B on a MacBook Air M2 (16GB RAM). Here's the actual log:
[2026-02-04 09:23:11] INFO: Model loaded: deepseek-r1:8b (Q4_K_M)
[2026-02-04 09:23:11] INFO: Starting inference...
[2026-02-04 09:23:12] INFO: Token 1 generated
[2026-02-04 09:23:15] INFO: Token 2 generated
[2026-02-04 09:23:18] INFO: Token 3 generated
...
[2026-02-04 09:25:47] INFO: Token 50 generated
[STATS]
eval time = 3450.22 ms / token
tokens per second = 0.29
load time = 12.3 seconds
Read that again: 0.29 tokens per second. A 100-token response took almost 6 minutes. This is not "working" — this is broken.
The Physics: Why It's Slow
Why MacBooks Are Terrible for Inference
Your MacBook's unified memory sounds great on paper. "16GB shared between CPU and GPU!" But for inference, it's a bottleneck:
| Hardware | Memory Bandwidth | Real-World Speed |
|---|---|---|
| MacBook Air M2 (16GB) | ~100 GB/s | 0.3 - 3 t/s |
| MacBook Pro M2 Max (32GB) | ~400 GB/s | 8 - 15 t/s |
| RTX 3090 (24GB VRAM) | ~936 GB/s | 45 - 60 t/s |
| RTX 4090 (24GB VRAM) | ~1,008 GB/s | 70 - 90 t/s |
| A100 (40GB VRAM) | ~1,555 GB/s | 100 - 120 t/s |
The math: Inference speed is limited by memory bandwidth. The model weights need to be read for every token generated. If your RAM can only push 100 GB/s, you're going to be slow.
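To put rough numbers on it: decode speed is bounded above by bandwidth divided by model size, because the weights stream through memory once per token. Here's a back-of-envelope sketch (the ~5 GB figure for an 8B Q4_K_M model is an approximation, and real-world throughput lands well below this ceiling once compute, KV-cache reads, and memory pressure are factored in):
# Rough upper bound: every generated token streams the full set of model
# weights through memory, so tokens/sec <= bandwidth / model size.
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 5.0  # assumed size of an 8B model at Q4_K_M
for name, bw in [("MacBook Air M2", 100), ("RTX 4090", 1008), ("A100 40GB", 1555)]:
    print(f"{name}: ceiling ~{max_tokens_per_second(bw, MODEL_GB):.0f} t/s")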
Why GPU VRAM Is Different
Dedicated GPU memory (GDDR6X, HBM2e) has 10-20x the bandwidth of system RAM. That's why:
- An RTX 4090 with 24GB VRAM is roughly 5-10x faster in practice than a MacBook M2 Max with 32GB unified memory (see the table above)
- Bandwidth matters more than capacity for inference
The Fix: Config Tweaks (For Local Hardware)
If you're stuck with local hardware, squeeze every drop of performance:
Fix #1: Reduce Context Window
{
"num_ctx": 2048
}
Why it works: Smaller context = less memory to read per token = faster inference.
Trade-off: You lose conversation history. OpenClaw will "forget" earlier messages.
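Where you set num_ctx depends on your stack. If OpenClaw sits on top of a local Ollama server (as the ollama run command under Fix #3 suggests), you can pass it per request through Ollama's options field. A minimal sketch, assuming the default endpoint on localhost:11434:
import requests

# Generation request to a local Ollama server with a reduced context window.
# The "options" keys are the same ones used in the JSON config above.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",
        "prompt": "Summarize the tradeoffs of a 2048-token context in one sentence.",
        "stream": False,
        "options": {"num_ctx": 2048},
    },
    timeout=600,
)
print(resp.json()["response"])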
Fix #2: Increase GPU Layer Offloading
{
"n_gpu_layers": 35
}
Why it works: every layer offloaded to the GPU reads its weights from fast VRAM; any layers left behind run from system RAM on the CPU and become the bottleneck.
Trade-off: If you don't have enough VRAM, this will crash with OOM errors. See our CUDA OOM Fix Guide.
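If you're unsure what value fits your card, a rough heuristic is to divide your free VRAM (minus some headroom for the KV cache) by the per-layer size of the quantized model. The layer count and model size below are illustrative assumptions; substitute your own model's figures:
# Heuristic: layers that fit ~= (free VRAM - headroom) / (model size / layer count).
def estimate_gpu_layers(free_vram_gb: float, model_size_gb: float,
                        total_layers: int, headroom_gb: float = 1.5) -> int:
    per_layer_gb = model_size_gb / total_layers  # average weight size per layer
    usable_gb = max(free_vram_gb - headroom_gb, 0)  # leave room for KV cache
    return min(total_layers, int(usable_gb / per_layer_gb))

# Example: ~5 GB quantized model with 32 layers on a GPU with 6 GB free.
print(estimate_gpu_layers(free_vram_gb=6, model_size_gb=5.0, total_layers=32))  # -> 28
Start at the estimate, then nudge it up or down until you stop hitting OOM errors.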
Fix #3: Use Quantized Models
# Use Q4_K_M instead of Q5 or Q6
ollama run deepseek-r1:8b-q4_K_M
Why it works: Smaller model = less memory bandwidth needed.
Trade-off: Output quality drops. The model makes more mistakes.
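Because decoding is bandwidth-bound, the expected speedup from a lower quant is roughly the ratio of the model sizes, which you can estimate from bits per weight alone. A sketch with approximate bits-per-weight figures:
# Approximate model size in GB = parameters (billions) * bits per weight / 8.
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

q6 = approx_size_gb(8, 6.6)  # Q6_K is roughly 6.6 bits/weight
q4 = approx_size_gb(8, 4.8)  # Q4_K_M is roughly 4.8 bits/weight
print(f"Q6_K ~{q6:.1f} GB, Q4_K_M ~{q4:.1f} GB, expected speedup ~{q6 / q4:.2f}x")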
Complete Config Example
{
"num_ctx": 2048,
"n_gpu_layers": 35,
"num_batch": 512,
"num_thread": 8
}
Expected results:
- MacBook M2: ~3-5 t/s (still slow)
- RTX 3090: ~50-70 t/s (usable)
- RTX 4090: ~80-100 t/s (fast)
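To see where your machine actually lands, time a fixed prompt before and after applying the config. A minimal sketch against a local Ollama server (same assumption as under Fix #1; eval_count and eval_duration are fields Ollama returns in non-streaming responses):
import requests

# Measure decode speed from Ollama's own timing fields: eval_count is tokens
# generated, eval_duration is decode time in nanoseconds.
def benchmark(model: str, num_ctx: int = 2048) -> float:
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Explain memory bandwidth in 200 words.",
            "stream": False,
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    ).json()
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(f"{benchmark('deepseek-r1:8b-q4_K_M'):.1f} t/s")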
When Local Optimization Isn't Enough
The Hard Truth
I spent 2 weeks "optimizing" my OpenClaw setup on a MacBook Air M2. I:
- Tried every llama.cpp flag
- Switched between Q4, Q5, Q6 quantizations
- Closed every app to free RAM
- Overclocked my CPU (killed battery life)
Final result: 3.2 tokens/second. Still unusable.
The "Instant" Fix: Cloud H100
Local hardware has limits. If you need 100+ tokens/sec for production:
👉 Deploy on Vultr (H100/A100 Ready) (High Availability & Limited Time Promotion for new accounts)
| Cloud GPU | Hourly Cost | Tokens/sec | Break-Even vs Your Time |
|---|---|---|---|
| RTX 4090 | ~$0.80/hr | 80 t/s | Worth it for any serious work |
| A100 40GB | ~$1.50/hr | 100-120 t/s | Cheaper than your hourly rate debugging |
| H100 80GB | ~$3.00/hr | 150+ t/s | Overkill for most, but fun |
My recommendation: Start with an RTX 4090 equivalent. It's 20x faster than your MacBook and costs less per hour than a coffee.
Common Failure Modes
"It Works But It's Incredibly Slow"
Diagnosis: You're on CPU-only inference (Mac or low-VRAM GPU).
Check:
# Check if GPU is being used
nvidia-smi # Linux/Windows
# macOS: check Activity Monitor → Window → GPU History, or trust the benchmarks above
Fix: Move to a GPU server or accept that it will be slow.
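If nvidia-smi is available, you can confirm the GPU is actually doing the work by sampling utilization while a generation runs. A small sketch using nvidia-smi's query flags; run it in a second terminal mid-generation:
import subprocess

# Sample GPU utilization and VRAM via nvidia-smi's CSV query output.
# Near-0% utilization while tokens are being generated means CPU fallback.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout.strip()
for i, line in enumerate(out.splitlines()):
    util, used, total = [part.strip() for part in line.split(",")]
    print(f"GPU {i}: {util}% busy, {used}/{total} MiB VRAM in use")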
"Sometimes It's Fast, Sometimes It's Slow"
Diagnosis: You're hitting thermal throttling (laptop) or memory pressure (too many apps open).
Fix:
- Close Chrome, Slack, and other RAM-hungry apps (see the memory-pressure check below)
- Use a cooling pad (laptops)
- Reduce context window
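Before blaming the model, it's worth checking whether the weights even fit in available RAM, since swapping is what makes throughput collapse intermittently. A quick sketch using psutil (the 5 GB model size is an assumed figure for an 8B Q4_K_M model):
import psutil

MODEL_SIZE_GB = 5.0  # assumed size of an 8B Q4_K_M model

# If the weights don't fit in *available* RAM, the OS swaps pages to disk
# mid-generation and token speed collapses intermittently.
available_gb = psutil.virtual_memory().available / 1e9
if available_gb < MODEL_SIZE_GB * 1.2:  # ~20% headroom for KV cache and OS
    print(f"Only {available_gb:.1f} GB free: close other apps before running OpenClaw.")
else:
    print(f"{available_gb:.1f} GB free: memory pressure probably isn't your problem.")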
Complete Working Example
Here's a complete OpenClaw config optimized for speed:
# openclaw_config.py
from openclaw import Client
# Use a smaller, faster model
client = Client(model="deepseek-r1:8b-q4_K_M")
# Aggressive speed settings
client.context_window = 2048
client.num_gpu_layers = 35
client.num_batch = 512
# For production: use a cloud GPU
# client = Client(
# model="deepseek-r1:32b",
# base_url="https://api.openclaw-cloud.com/v1", # Example
# api_key="your-key-here"
# )
response = client.generate("Your prompt here")
print(f"Generated {len(response.tokens)} tokens in {response.duration}s")
print(f"Speed: {response.tokens_per_second} t/s")
FAQ
Q: Why is my OpenClaw so slow on Mac?
A: Macs use unified memory with ~100-400 GB/s bandwidth. Inference needs to read model weights for every token. GPU VRAM has ~1,000-2,000 GB/s bandwidth. The math doesn't lie: your Mac is 10-20x slower than a dedicated GPU.
Q: Will more RAM help OpenClaw speed?
A: No. RAM capacity (16GB vs 64GB) doesn't matter for speed. Bandwidth does. A 24GB RTX 4090 is faster than a 128GB MacBook Pro because its memory bandwidth is far higher (~1,000 GB/s vs ~100-400 GB/s).
Q: Is 3 tokens per second normal for OpenClaw?
A: It's "normal" for a MacBook or CPU-only inference. But it's not usable for real work. You need at least 20+ t/s for interactive chat, 50+ t/s for agent loops. Rent a GPU if you need speed.
Q: Can I get 100+ t/s locally?
A: Only with an RTX 4090 or better. And even then, only with smaller models (8B). For 32B or larger models, you need cloud GPUs (A100/H100). Local optimization has limits.
Related Fixes
- How to Fix OpenClaw OOM Errors - VRAM optimization tips
- How to Fix OpenClaw JSON Parsing Errors - DeepSeek thinking tags break JSON mode
- Running OpenClaw with DeepSeek R1: The Complete Guide - Setup and configuration
Need 100+ t/s? Stop debugging Mac bandwidth limits. Deploy on Vultr (H100/A100 Ready) (High Availability & Limited Time Promotion for new accounts): rent an A100 by the hour and get 100-120 t/s.
Still Stuck? Check Your Hardware
Sometimes the code is fine, but the GPU is simply refusing to cooperate. Before you waste another hour debugging, compare your specs against the Hardware Reality Table to see if you are fighting impossible physics.