Beginner-Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 11 min read 🛠️ Troubleshooting

Why Is Ollama So Slow? Common Fixes for Local AI Performance

⚡ Quick Answer

Slow Ollama is almost always one of five things: (1) GPU not detected, so Ollama falls back to CPU, (2) model too big for VRAM (spilling to RAM), (3) quantization too heavy for your VRAM, (4) context window too large, or (5) concurrent requests competing for resources. Run nvidia-smi and ollama ps to diagnose in 30 seconds. The table below maps every problem to its fix.

Who This Is For

Read this if: Ollama feels slow, tokens are crawling, the GPU fan isn't spinning up, or you switched from ChatGPT and local feels unusably slow.

Skip if: Ollama is already fast — you don't have a problem.

What You Need

🔬 Tested On

Machine: MSI laptop (dual GPU)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026

The 30-Second Diagnosis

Run these two commands first. They tell you 80% of what you need to know:

# 1. Is your GPU detected and being used?
nvidia-smi

# Look for:
#   - Your GPU listed at the top
#   - An ollama process in the process list
#   - VRAM usage (should be 3-20GB if model is loaded)

# 2. What's loaded in Ollama and where?
ollama ps

# Sample output (ID column omitted):
# NAME           SIZE     PROCESSOR     UNTIL
# qwen2.5:14b    9.8 GB   100% GPU      4 minutes from now
#
# BAD signs:
#   - PROCESSOR shows "CPU" instead of "GPU"
#   - PROCESSOR shows "50%/50% CPU/GPU" (split = spilling)
#   - SIZE is larger than your VRAM
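
If you want the same check in script form, here is a minimal sketch. It assumes the PROCESSOR column reads "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU"; the function name classify_offload is made up for this example and is not part of Ollama.

```shell
#!/bin/sh
# Classify each loaded model from `ollama ps` output read on stdin.
# Assumption: PROCESSOR is "100% GPU", "100% CPU", or "NN%/NN% CPU/GPU".
classify_offload() {
  awk 'NR > 1 && NF > 0 {
    if ($0 ~ /%\/[0-9]+% CPU\/GPU/)  status = "SPLIT (spilling to RAM)"
    else if ($0 ~ /100% GPU/)        status = "OK (fully on GPU)"
    else if ($0 ~ /100% CPU/)        status = "CPU ONLY"
    else                             status = "UNKNOWN"
    print $1 ": " status
  }'
}

# Usage (requires a running Ollama with a model loaded):
#   ollama ps | classify_offload
```

The function only reads text, so you can also paste saved output into it when debugging someone else's machine.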

The Diagnosis Flowchart

Is Ollama slow?
    │
    ├─ nvidia-smi shows GPU? ──── NO → Fix #1: Install NVIDIA drivers
    │                                  (Ollama falls back to CPU)
    │
    ├─ YES → ollama ps: GPU or CPU?
    │         │
    │         ├─ CPU only ──────→ Fix #2: Model too big for VRAM,
    │         │                   or driver mismatch (Fix #1)
    │         │
    │         ├─ CPU/GPU split ─→ Fix #2: Model spilling to RAM.
    │         │                   Use smaller model or quantization
    │         │
    │         └─ 100% GPU but slow?
    │               │
    │               ├─ Context >8K? ──────→ Fix #3: Reduce context length
    │               │
    │               ├─ Multiple requests? → Fix #4: Concurrency limit
    │               │
    │               └─ Still slow? ───────→ Quantization too high
    │                                       (try Q4 instead of Q8)

Problem → Cause → Fix → Impact Table

Problem | Cause | Fix | Speed Impact
Model runs but very slow (2–8 t/s) | Running on CPU, not GPU | Install NVIDIA drivers + CUDA toolkit | 5–10× faster
GPU detected but model uses CPU | Model exceeds VRAM, spills to RAM | Use smaller model or lower quantization (Q4) | 3–5× faster
ollama ps shows CPU/GPU split | Partial offload: model too big | Switch to Q4 or smaller parameter model | 2–4× faster
Fast at first, slows down mid-conversation | Context window growing, KV cache bloating | Reduce num_ctx to 4096–8192 | 1.5–3× faster
Multiple users, all slow | Concurrent requests competing for GPU | Set OLLAMA_NUM_PARALLEL=1 | Smoother per-request
Q8 or FP16 model, barely faster than CPU | Quantization too high for VRAM bandwidth | Use Q4_K_M quantization | 1.5–2× faster
First token fast, rest slow | Prompt processing is fine, generation is VRAM-bandwidth limited | Close other VRAM apps (browser, games) | 1.2–1.5× faster

Fix #1: GPU Not Detected (CPU Fallback)

This is the most common cause of slow Ollama. If your GPU isn't detected, Ollama silently falls back to CPU — which is 5–10× slower.

# Check if NVIDIA driver is loaded:
nvidia-smi

# If you get "command not found" or "no devices":
sudo ubuntu-drivers install
sudo reboot

# After reboot, verify:
nvidia-smi  # Should show your GPU

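# Then load a model and check that Ollama puts it on the GPU
# (ollama ps lists nothing until a model is loaded):
ollama run qwen2.5:14b "hello"
ollama ps  # PROCESSOR column should say "100% GPU"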
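
Once the driver is working, it can help to see free VRAM per GPU in a script-friendly form before loading anything. The sketch below assumes nvidia-smi's CSV query output with the columns name, memory.total, memory.used; vram_summary is a hypothetical helper name, not an NVIDIA or Ollama tool.

```shell
#!/bin/sh
# Summarise free VRAM per GPU from nvidia-smi CSV output on stdin.
# Expected input columns: name, memory.total, memory.used (MiB, no units).
vram_summary() {
  awk -F', *' 'NF >= 3 {
    printf "%s: %d MiB free of %d MiB\n", $1, $2 - $3, $2
  }'
}

# Usage (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=name,memory.total,memory.used \
#              --format=csv,noheader,nounits | vram_summary
```

If the free figure is already small before Ollama loads a model, another application is holding VRAM.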

Fix #2: Model Too Big for VRAM

If your model needs 16GB VRAM but you have 12GB, Ollama keeps the layers that don't fit in system RAM and runs them on the CPU. This causes a massive speed penalty because system RAM has far less bandwidth than VRAM.

# Check what's happening:
ollama ps

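# If you see a split such as "45%/55% CPU/GPU", it's spilling.
# Solution: use a smaller quantization and a smaller context.

# Check which quantization the model uses:
ollama show qwen2.5:14b  # look for the "quantization" line

# If it is Q8 or FP16, pull a Q4_K_M tag instead. Tag names vary by
# model, so confirm the exact tag in the Ollama library first.

# Then cap the context so the KV cache also fits in VRAM:
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF

ollama create qwen14b-fast -f Modelfile
ollama run qwen14b-fast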
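
To judge whether a quantization has any chance of fitting before you download it, a back-of-the-envelope estimate is parameters × bits per weight ÷ 8. The sketch below is only that arithmetic. The bits-per-weight figure for Q4_K_M (about 4.85) is an assumed average, and the result covers weights only, not the KV cache or runtime buffers.

```shell
#!/bin/sh
# Rough size of model weights in GB: parameters (billions) x bits per weight / 8.
# Ignores KV cache and runtime buffers, so treat the result as a lower bound.
est_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

est_weights_gb 14 16    # FP16
est_weights_gb 14 8.5   # Q8_0 (8.5 bits per weight)
est_weights_gb 14 4.85  # Q4_K_M, assuming ~4.85 bits per weight on average
```

On a 12GB card, only the last of these leaves room for context, which is why Q4 is the usual starting point for 14B models.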

Fix #3: Reduce Context Length

The context window (num_ctx) determines how much conversation history the model processes. Larger context = more VRAM for KV cache = slower generation. Default is often 4096, but some setups push it to 32K+.

# Check current context:
# In Open WebUI: Settings → Advanced → Default context length

# Or in a Modelfile:
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF

# Context length impact on VRAM (qwen2.5:14b):
# 4096 tokens  → ~9.8GB VRAM, ~32 t/s
# 8192 tokens  → ~11.2GB VRAM, ~28 t/s
# 16384 tokens → ~14.1GB VRAM, won't fit 12GB → CPU spill!

⚠️ Context Is a VRAM Killer

Most people don't need 32K context for chat. 4096–8192 is plenty for most conversations. If you need long context (analyzing documents), expect slower speeds or use a model designed for it.
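
The reason context costs VRAM is the KV cache, which grows linearly with num_ctx. Here is a rough sketch of that arithmetic. The architecture numbers (48 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions for a 14B-class model with grouped-query attention, not values read from Ollama. The measured totals above include more than the cache itself, so they will not match exactly.

```shell
#!/bin/sh
# KV cache size in MiB:
#   2 (keys + values) x layers x kv_heads x head_dim x context x bytes per element
# Arguments: layers kv_heads head_dim context_tokens bytes_per_element
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Assumed architecture: 48 layers, 8 KV heads, head_dim 128, 2-byte entries.
for n in 4096 8192 16384 32768; do
  echo "$n tokens: $(kv_cache_mib 48 8 128 "$n" 2) MiB"
done
```

Doubling the context doubles the cache, so going from 4K to 32K multiplies this part of the VRAM bill by eight.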

Real Benchmarks: Before and After Fixes

Testing on RTX 5070 Ti (12GB), same model (qwen2.5:14b), measuring tokens/sec:

Configuration | Processor | Tokens/sec | Verdict
CPU fallback (no GPU driver) | 100% CPU | 4 t/s | ❌ Unusable
GPU + Q8 quantization + 16K ctx | 60% CPU / 40% GPU | 11 t/s | ⚠️ Spilling to RAM
GPU + Q4 + 8K context | 100% GPU | 28 t/s | ✅ Fast
GPU + Q4 + 4K context | 100% GPU | 32 t/s | ✅ Best balance
Dual GPU (5070 Ti + 5070) + Q4 + 4K | 100% GPU (split) | 35 t/s | ✅ Optimal

Result: From 4 t/s to 32 t/s on a single GPU, an 8× improvement from fixing the driver, quantization, and context length. Adding the second GPU brings it to 35 t/s (about 9×).
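
To get comparable numbers on your own machine, one option is to read the "eval rate" line that ollama run prints in verbose mode. The sketch below assumes that line has the form "eval rate: 32.10 tokens/s" and that the statistics are written to stderr; eval_rate is a made-up helper name.

```shell
#!/bin/sh
# Extract generation speed from `ollama run --verbose` statistics on stdin.
# Matches the "eval rate:" line and skips "prompt eval rate:".
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

# Usage (requires Ollama; stats are assumed to arrive on stderr):
#   ollama run --verbose qwen2.5:14b "Explain the KV cache in one paragraph." \
#     2>&1 >/dev/null | eval_rate
```

Run it before and after each fix with the same prompt, so the numbers are comparable.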

Fix #4: Limit Concurrent Requests

If you're running Open WebUI with multiple users (or multiple tabs), requests compete for GPU time. Each request adds latency for all others.

# Limit Ollama to process one request at a time:
sudo systemctl edit ollama

# Add:
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"

# Restart:
sudo systemctl restart ollama
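
After the restart, it is worth confirming that the service actually picked up the setting. A minimal sketch, assuming Ollama runs as the systemd unit "ollama" and that systemctl show prints the variables on one Environment= line; num_parallel is a hypothetical helper name.

```shell
#!/bin/sh
# Pull OLLAMA_NUM_PARALLEL out of the service environment read on stdin.
# Prints "unset" when the variable is not present.
num_parallel() {
  value=$(sed -n 's/.*OLLAMA_NUM_PARALLEL=\([^ "]*\).*/\1/p' | head -n 1)
  echo "${value:-unset}"
}

# Usage (assumes the systemd unit is named "ollama"):
#   systemctl show ollama --property=Environment | num_parallel
```

If this prints "unset", the override file was not saved or the service was not restarted.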

What I Would Do

Run nvidia-smi and ollama ps right now. If ollama ps shows anything other than "100% GPU", that's your problem. Fix the driver first, then drop to Q4 quantization, then set context to 4096. Those three changes take most people from "unusable" to "faster than ChatGPT." On our test machine, qwen2.5:14b at Q4 with 4K context hits 32 tokens/sec on a single GPU — that's faster than most cloud APIs.

Frequently Asked Questions

Why is my Ollama model generating text so slowly?

The most common cause is running a model that exceeds your VRAM, forcing slower system RAM or CPU inference. An 8B model needs about 5-6GB VRAM at Q4 - if your GPU only has 4GB, performance drops dramatically. Run 'ollama ps' to check memory loading.

How much faster is GPU VRAM compared to system RAM?

GPU inference is typically 5-10x faster than CPU inference, thanks to massively parallel compute and much higher memory bandwidth. An 8B model might generate 60 tokens/sec on an RTX 3060 but only 8 tokens/sec on CPU. Single-digit speeds usually mean CPU fallback.

Does model size affect Ollama speed?

Yes, dramatically - a 3B model runs 2-3x faster than 8B on the same hardware, and 70B is 5-10x slower than 8B. Choose the smallest model that meets your quality needs, and use quantized versions (Q4_K_M) to reduce memory by up to 70%.

How do I check if Ollama is using my GPU?

Run 'ollama ps' and look at the PROCESSOR column - it should say '100% GPU'. On Windows, check Task Manager GPU memory during inference. If you see CPU or partial GPU, update drivers and ensure CUDA or ROCm is installed.

What is quantization and does it speed up Ollama?

Quantization reduces model precision from 16-bit to 4-bit or 8-bit, shrinking memory by 50-75% with minimal quality loss. A Q4 Llama 3 (8B) uses about 4.7GB instead of 16GB. Ollama downloads Q4_K_M by default, the best speed-to-quality tradeoff.

Last Updated: July 1, 2026 — Benchmarks from RTX 5070 Ti + RTX 5070 dual-GPU testing. Ollama on Ubuntu 26.04.