Beginner-Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 11 min read 🛠️ Troubleshooting
Slow Ollama is almost always one of six things: (1) GPU not detected (Ollama falls back to CPU), (2) model too big for VRAM (spilling to RAM), (3) quantization too high for your VRAM, (4) context window too large, (5) concurrent requests competing for the GPU, or (6) other applications holding VRAM. Run nvidia-smi and ollama ps to diagnose in 30 seconds. The table below maps each problem to its fix.
Read this if: Ollama feels slow, tokens are crawling, the GPU fan isn't spinning up, or you switched from ChatGPT and local feels unusably slow.
Skip if: Ollama is already fast — you don't have a problem.
Machine: MSI laptop (dual GPU)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026
Run these two commands first. They tell you 80% of what you need to know:
# 1. Is your GPU detected and being used?
nvidia-smi
# Look for:
# - Your GPU listed at the top
# - An ollama process in the process list
# - VRAM usage (should be 3-20GB if model is loaded)
# 2. What's loaded in Ollama and where?
ollama ps
# Sample output:
# NAME SIZE PROCESSOR UNTIL
# qwen2.5:14b 9.8 GB 100% GPU 4 minutes from now
#
# BAD signs:
# - PROCESSOR shows "CPU" instead of "GPU"
# - PROCESSOR shows "50%/50% CPU/GPU" (split = spilling)
# - SIZE is larger than your VRAM
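If you'd rather not eyeball the PROCESSOR column, the check can be scripted. Here's a minimal sketch, assuming the column text looks like the sample above ("100% GPU", "100% CPU", or a split such as "45%/55% CPU/GPU"). The function name classify_processor is my own, not part of Ollama.

```bash
# classify_processor: read `ollama ps` output on stdin and print one
# verdict per loaded model, based only on the PROCESSOR text.
# Assumption: the first line is the header and the first field is the model name.
classify_processor() {
  awk 'NR > 1 && NF {
    if ($0 ~ /CPU\/GPU/)       verdict = "split: spilling to system RAM"
    else if ($0 ~ /100% GPU/)  verdict = "ok: fully on GPU"
    else if ($0 ~ /CPU/)       verdict = "cpu: no GPU in use"
    else                       verdict = "unknown"
    print $1 ": " verdict
  }'
}

# Usage (needs a running Ollama with a model loaded):
#   ollama ps | classify_processor
```

If the output format differs on your Ollama version, adjust the patterns rather than trusting the verdict.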
Is Ollama slow?
│
├─ nvidia-smi shows GPU? ──── NO →→ Fix #1: Install CUDA drivers
│ (ollama falls back to CPU)
│
├─ YES → ollama ps: GPU or CPU?
│ │
│ ├─ CPU only ─────→→→ Fix #2: Model too big for VRAM,
│ │ or driver mismatch
│ │
│ ├─ GPU/CPU split →→→ Fix #3: Model spilling to RAM.
│ │ Use smaller model or quantization
│ │
│ └─ 100% GPU but slow?
│ │
│ ├─ Context >8K? →→ Fix #4: Reduce context length
│ │
│ ├─ Multiple requests? → Fix #5: Concurrency limit
│ │
│ └─ Still slow? →→→ Fix #6: Quantization too high
│ (try Q4 instead of Q8)
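The tree above can also be written as a small shell function. This is only a sketch of the same branching: suggest_fix and its four arguments are hypothetical names, and you supply the inputs yourself from nvidia-smi and ollama ps.

```bash
# suggest_fix GPU_VISIBLE PROCESSOR NUM_CTX ACTIVE_REQUESTS
#   GPU_VISIBLE      "yes" if nvidia-smi lists your GPU, anything else otherwise
#   PROCESSOR        the PROCESSOR text from `ollama ps`, e.g. "100% GPU"
#   NUM_CTX          context length in tokens
#   ACTIVE_REQUESTS  how many requests hit Ollama at the same time
suggest_fix() {
  gpu_visible=$1; processor=$2; num_ctx=$3; active_requests=$4

  if [ "$gpu_visible" != "yes" ]; then
    echo "Fix #1: install CUDA drivers"
    return
  fi

  # Check the split pattern first, because it also contains "CPU".
  case "$processor" in
    *CPU/GPU*) echo "Fix #3: model spilling to RAM, use a smaller model or quantization"; return ;;
    *CPU*)     echo "Fix #2: model too big for VRAM, or driver mismatch"; return ;;
  esac

  # From here on the model is fully on the GPU but still slow.
  if [ "$num_ctx" -gt 8192 ]; then
    echo "Fix #4: reduce context length"
  elif [ "$active_requests" -gt 1 ]; then
    echo "Fix #5: limit concurrency"
  else
    echo "Fix #6: try Q4 instead of Q8"
  fi
}

# Example: GPU visible, fully offloaded, 16K context, one request.
suggest_fix yes "100% GPU" 16384 1
```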
| Problem | Cause | Fix | Speed Impact |
|---|---|---|---|
| Model runs but very slow (2–8 t/s) | Running on CPU, not GPU | Install NVIDIA drivers + CUDA toolkit | 5–10× faster |
| GPU detected but model uses CPU | Model exceeds VRAM, spills to RAM | Use smaller model or lower quantization (Q4) | 3–5× faster |
| ollama ps shows CPU/GPU split | Partial offload — model too big | Switch to Q4 or smaller parameter model | 2–4× faster |
| Fast at first, slows down mid-conversation | Context window growing, KV cache bloating | Reduce num_ctx to 4096–8192 | 1.5–3× faster |
| Multiple users, all slow | Concurrent requests competing for GPU | Set OLLAMA_NUM_PARALLEL=1 | Smoother per-request |
| Q8 or FP16 model, barely faster than CPU | Quantization too high for VRAM bandwidth | Use Q4_K_M quantization | 1.5–2× faster |
| First token fast, rest slow | Prompt processing is fine, generation is VRAM-bandwidth limited | Close other VRAM apps (browser, games) | 1.2–1.5× faster |
This is the most common cause of slow Ollama. If your GPU isn't detected, Ollama silently falls back to CPU — which is 5–10× slower.
# Check if NVIDIA driver is loaded:
nvidia-smi
# If you get "command not found" or "no devices":
sudo ubuntu-drivers install   # "autoinstall" is the older, deprecated spelling
sudo reboot
# After reboot, verify:
nvidia-smi # Should show your GPU
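# Then load a model and check that Ollama uses the GPU
# (ollama ps only lists models that are currently loaded):
ollama run qwen2.5:14b "hello"
ollama ps # PROCESSOR column should say "100% GPU"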
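The driver check above can be wrapped in a function if you want it in a setup script. A rough sketch: gpu_status is a hypothetical helper, and it relies on nvidia-smi -L printing one "GPU <n>: ..." line per device.

```bash
# gpu_status [SMI_COMMAND]: report whether the NVIDIA driver tool is
# present and how many GPUs it lists. SMI_COMMAND defaults to nvidia-smi
# (the parameter exists only so the function can be tried with a stand-in).
gpu_status() {
  smi=${1:-nvidia-smi}
  if ! command -v "$smi" >/dev/null 2>&1; then
    echo "driver-missing"
    return 1
  fi
  # `nvidia-smi -L` prints one "GPU <n>: ..." line per device.
  gpu_count=$("$smi" -L 2>/dev/null | grep -c '^GPU ' || true)
  if [ "$gpu_count" -eq 0 ]; then
    echo "no-devices"
    return 1
  fi
  echo "gpus=$gpu_count"
}

# Usage:
#   gpu_status && echo "driver looks fine, check ollama ps next"
```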
If your model needs 16GB VRAM but you have 12GB, Ollama offloads part of it to system RAM. This causes a massive speed penalty because system RAM is much slower than VRAM.
# Check what's happening:
ollama ps
# If you see a split such as "45%/55% CPU/GPU", the model is spilling.
# Solution: use a lower quantization and/or a smaller context window.
# Check the size and quantization of the model you have:
ollama show qwen2.5:14b
# Create a variant with a smaller context window (smaller KV cache):
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF
ollama create qwen14b-fast -f Modelfile
ollama run qwen14b-fast
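The spill condition in this section is plain arithmetic: compare the SIZE that ollama ps reports with your card's VRAM. A minimal sketch, where vram_verdict is a made-up helper and both numbers are typed in by hand in GB:

```bash
# vram_verdict MODEL_GB VRAM_GB
# Prints "fits" when the loaded size is within VRAM, "spills" otherwise.
# Assumption: MODEL_GB is the SIZE column from `ollama ps`, which already
# includes the context (KV cache) for the current num_ctx.
vram_verdict() {
  awk -v size="$1" -v vram="$2" 'BEGIN {
    verdict = (size + 0 > vram + 0) ? "spills" : "fits"
    print verdict
  }'
}

vram_verdict 9.8 12    # 4K context on a 12GB card
vram_verdict 16 12     # the 16GB-on-12GB case described above
```

In practice you want some headroom below the card's total, since the desktop and other apps hold VRAM too.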
The context window (num_ctx) determines how much conversation history the model processes. Larger context = more VRAM for KV cache = slower generation. Default is often 4096, but some setups push it to 32K+.
# Check current context:
# In Open WebUI: Settings → Advanced → Default context length
# Or in a Modelfile:
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF
# Context length impact on VRAM (qwen2.5:14b):
# 4096 tokens → ~9.8GB VRAM, ~32 t/s
# 8192 tokens → ~11.2GB VRAM, ~28 t/s
# 16384 tokens → ~14.1GB VRAM, won't fit 12GB → CPU spill!
Most people don't need 32K context for chat. 4096–8192 is plenty for most conversations. If you need long context (analyzing documents), expect slower speeds or use a model designed for it.
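To see why context eats VRAM, here is a back-of-the-envelope KV cache estimate. Treat it as a sketch under stated assumptions: the formula (keys plus values, per layer, per KV head) is the standard one for transformer caches, but the architecture numbers I plug in for qwen2.5:14b (48 layers, 8 KV heads, head dimension 128, 2 bytes per value) are my assumption, so check them with ollama show before relying on the result.

```bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM NUM_CTX [BYTES_PER_VALUE]
# Estimated KV cache size in MiB. BYTES_PER_VALUE defaults to 2 (16-bit cache).
kv_cache_mib() {
  layers=$1; kv_heads=$2; head_dim=$3; num_ctx=$4; bytes=${5:-2}
  # Leading 2 = one tensor for keys plus one for values.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * num_ctx / 1048576 ))
}

# Assumed architecture values, not measured ones:
kv_cache_mib 48 8 128 4096    # prints 768
kv_cache_mib 48 8 128 8192    # prints 1536
kv_cache_mib 48 8 128 16384   # prints 3072
```

The point is the shape, not the exact figures: the cache doubles every time num_ctx doubles, on top of the fixed cost of the weights. Measured totals will be higher because of extra buffers this estimate ignores.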
Testing on RTX 5070 Ti (12GB), same model (qwen2.5:14b), measuring tokens/sec:
| Configuration | Processor | Tokens/sec | Verdict |
|---|---|---|---|
| CPU fallback (no GPU driver) | 100% CPU | 4 t/s | ❌ Unusable |
| GPU + Q8 quantization + 16K ctx | 60% CPU / 40% GPU | 11 t/s | ⚠️ Spilling to RAM |
| GPU + Q4 + 8K context | 100% GPU | 28 t/s | ✅ Fast |
| GPU + Q4 + 4K context | 100% GPU | 32 t/s | ✅ Best balance |
| Dual GPU (5070 Ti + 5070) + Q4 + 4K | 100% GPU (split) | 35 t/s | ✅ Optimal |
Result: From 4 t/s to 35 t/s, roughly a 9× improvement from fixing the driver, quantization, and context length (the last step also adds the second GPU).
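To get comparable numbers on your own machine, you can compute tokens/sec from the eval_count and eval_duration fields that Ollama's /api/generate response includes (eval_duration is in nanoseconds). A small sketch: tokens_per_sec is my own helper name, and the curl example in the comments assumes curl and jq are installed and Ollama is listening on its default port. Running ollama run with --verbose prints an eval rate too, if you'd rather skip the scripting.

```bash
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Generation speed = generated tokens / generation time in seconds.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", n / (ns / 1e9)
  }'
}

# 320 tokens generated in 10 seconds:
tokens_per_sec 320 10000000000   # prints 32.0

# Usage against a local Ollama (not run here):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"qwen2.5:14b","prompt":"Why is the sky blue?","stream":false}' \
#     | jq -r '"\(.eval_count) \(.eval_duration)"' \
#     | { read -r n ns; tokens_per_sec "$n" "$ns"; }
```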
If you're running Open WebUI with multiple users (or multiple tabs), requests compete for GPU time. Each request adds latency for all others.
# Limit Ollama to process one request at a time:
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
# Restart:
sudo systemctl restart ollama
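After the restart it's worth confirming the service actually picked up the variable. A minimal sketch, assuming Ollama runs as the systemd unit named ollama as above. parallel_setting is a hypothetical helper that parses the line printed by systemctl show.

```bash
# parallel_setting: read the output of
#   systemctl show ollama --property=Environment
# on stdin and print the OLLAMA_NUM_PARALLEL value, or "unset".
parallel_setting() {
  awk '{
    sub(/^Environment=/, "")
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^OLLAMA_NUM_PARALLEL=/) {
        sub(/^OLLAMA_NUM_PARALLEL=/, "", $i)
        print $i
        found = 1
        exit
      }
    }
  }
  END { if (!found) print "unset" }'
}

# Usage:
#   systemctl show ollama --property=Environment | parallel_setting
```

If it prints "unset", the override file probably wasn't saved or the service wasn't restarted.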
Run nvidia-smi and ollama ps right now. If ollama ps shows anything other than "100% GPU", that's your problem. Fix the driver first, then drop to Q4 quantization, then set context to 4096. Those three changes take most people from "unusable" to "faster than ChatGPT." On our test machine, qwen2.5:14b at Q4 with 4K context hits 32 tokens/sec on a single GPU, which is faster than most cloud APIs.
The most common cause is running a model that exceeds your VRAM, forcing slower system RAM or CPU inference. An 8B model needs about 5-6GB VRAM at Q4 - if your GPU only has 4GB, performance drops dramatically. Run 'ollama ps' to check memory loading.
GPU VRAM typically delivers 10-50x faster inference due to massive parallel processing and higher memory bandwidth. An 8B model might generate 60 tokens/sec on an RTX 3060 but only 8 tokens/sec on CPU. Single-digit speeds mean CPU fallback.
Yes, dramatically - a 3B model runs 2-3x faster than 8B on the same hardware, and 70B is 5-10x slower than 8B. Choose the smallest model that meets your quality needs, and use quantized versions (Q4_K_M) to reduce memory by up to 70%.
Run 'ollama ps' and look at the PROCESSOR column - it should say '100% GPU'. On Windows, check Task Manager GPU memory during inference. If you see CPU or partial GPU, update drivers and ensure CUDA or ROCm is installed.
Quantization reduces model precision from 16-bit to 4-bit or 8-bit, shrinking memory by 50-75% with minimal quality loss. A Q4 Llama 3 (8B) uses about 4.7GB instead of 16GB. Ollama downloads Q4_K_M by default, the best speed-to-quality tradeoff.
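The memory math behind quantization is easy to check yourself: weights take roughly parameters × bits per weight ÷ 8 bytes. A rough sketch, with weights_gb as a made-up helper name. It only counts the weights, so the KV cache and runtime buffers come on top.

```bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate size of the weights alone, in GB.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 16   # prints 16.0 (8B model at 16-bit)
weights_gb 8 4    # prints 4.0  (same model at a flat 4 bits per weight)
```

Real Q4_K_M files come out somewhat above the flat 4-bit figure, which fits the roughly 4.7GB quoted above. My understanding is that this is because the format averages a bit more than 4 bits per weight, but treat that as an assumption.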
Last Updated: July 1, 2026. Benchmarks from RTX 5070 Ti + RTX 5070 dual-GPU testing. Ollama on Ubuntu 26.04.