Beginner-Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 11 min read 🛠️ Troubleshooting

Why Is Ollama So Slow? Common Fixes for Local AI Performance

Q: Why is my Ollama model generating text so slowly?

The most common cause is running a model that exceeds your VRAM, forcing slower system RAM or CPU inference. An 8B model needs about 5-6GB VRAM at Q4 - if your GPU only has 4GB, performance drops dramatically. Run 'ollama ps' to check memory loading.

Q: How much faster is GPU VRAM compared to system RAM?

GPU VRAM typically delivers 10-50x faster inference due to massive parallel processing and higher memory bandwidth. An 8B model might generate 60 tokens/sec on an RTX 3060 but only 8 tokens/sec on CPU. Single-digit speeds mean CPU fallback.

Q: Does model size affect Ollama speed?

Yes, dramatically - a 3B model runs 2-3x faster than 8B on the same hardware, and 70B is 5-10x slower than 8B. Choose the smallest model that meets your quality needs, and use quantized versions (Q4_K_M) to reduce memory by up to 70%.

Q: How do I check if Ollama is using my GPU?

Run 'ollama ps' and look at the PROCESSOR column - it should say '100% GPU'. On Windows, check Task Manager GPU memory during inference. If you see CPU or partial GPU, update drivers and ensure CUDA or ROCm is installed.

Q: What is quantization and does it speed up Ollama?

Quantization reduces model precision from 16-bit to 4-bit or 8-bit, shrinking memory by 50-75% with minimal quality loss. A Q4 Llama 3 (8B) uses about 4.7GB instead of 16GB. Ollama downloads Q4_K_M by default, the best speed-to-quality tradeoff.
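
The 16GB figure comes from simple arithmetic: parameter count times bytes per weight. Here is a rough sketch of that estimate. `weights_gb` is a hypothetical helper written for this guide, not an Ollama command, and it ignores the per-block scales, KV cache, and runtime buffers that push a real Q4 download above the raw 4-bit number.

```shell
# Rough size of model weights in GB: parameters (in billions) x bits per weight / 8.
# Illustrative only: real model files add quantization scales and metadata.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 16   # 16-bit weights for an 8B model -> 16.0
weights_gb 8 4    # raw 4-bit weights for the same model -> 4.0
```

Treat the 4-bit result as a floor. The roughly 4.7GB quoted above is that floor plus overhead.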

⚡ Quick Answer

Slow Ollama is almost always one of five things: (1) GPU not detected, so Ollama falls back to CPU, (2) model too big for VRAM (spilling to RAM), (3) quantization heavier than your VRAM allows (Q8 or FP16 where Q4 would fit), (4) context window too large, or (5) concurrent requests competing for resources. Run nvidia-smi and ollama ps to diagnose in 30 seconds. The table below maps every problem to its fix.

Who This Is For

Read this if: Ollama feels slow, tokens are crawling, the GPU fan isn't spinning up, or you switched from ChatGPT and local feels unusably slow.

Skip if: Ollama is already fast — you don't have a problem.

What You Need

Ollama installed and a model loaded
Terminal access
An NVIDIA GPU (this guide focuses on GPU; CPU-only is covered as a fallback)

🔬 Tested On

Machine: MSI laptop (dual GPU)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026

The 30-Second Diagnosis

Run these two commands first. They tell you 80% of what you need to know:

# 1. Is your GPU detected and being used?
nvidia-smi

# Look for:
#   - Your GPU listed at the top
#   - An ollama process in the process list
#   - VRAM usage (should be 3-20GB if model is loaded)

# 2. What's loaded in Ollama and where?
ollama ps

# Sample output:
# NAME           SIZE     PROCESSOR     UNTIL
# qwen2.5:14b    9.8 GB   100% GPU      4 minutes from now
#
# BAD signs:
#   - PROCESSOR shows "CPU" instead of "GPU"
#   - PROCESSOR shows "50%/50% CPU/GPU" (split = spilling)
#   - SIZE is larger than your VRAM
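
If you check this often, the reading of the PROCESSOR column can be scripted. This is a minimal sketch, assuming the column only ever takes the three forms shown in the sample output above. `classify_processor` is a hypothetical helper name, not part of Ollama.

```shell
# Classify one PROCESSOR value copied from `ollama ps`.
# Assumes the formats shown above: "100% GPU", "100% CPU", "50%/50% CPU/GPU".
classify_processor() {
  case "$1" in
    "100% GPU") echo "ok: fully on GPU" ;;
    *CPU/GPU*)  echo "spilling: split between CPU and GPU" ;;
    *CPU*)      echo "cpu: GPU not in use" ;;
    *)          echo "unknown" ;;
  esac
}

classify_processor "100% GPU"
classify_processor "50%/50% CPU/GPU"
classify_processor "100% CPU"
```

The split pattern is checked before the plain CPU pattern because a split value also contains the text "CPU".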

The Diagnosis Flowchart

Is Ollama slow?
    │
    ├─ nvidia-smi shows GPU? ──── NO →→ Fix #1: Install NVIDIA drivers
    │                                   (ollama falls back to CPU)
    │
    ├─ YES → ollama ps: GPU or CPU?
    │         │
    │         ├─ CPU only ─────→→→ Fix #1 or #2: Driver mismatch,
    │         │                       or model too big for VRAM
    │         │
    │         ├─ GPU/CPU split →→→ Fix #2: Model spilling to RAM.
    │         │                       Use smaller model or quantization
    │         │
    │         └─ 100% GPU but slow?
    │               │
    │               ├─ Context >8K? →→ Fix #3: Reduce context length
    │               │
    │               ├─ Multiple requests? → Fix #4: Concurrency limit
    │               │
    │               └─ Still slow? →→→ Quantization heavier than needed
    │                                  (try Q4 instead of Q8)

Problem → Cause → Fix → Impact Table

Problem	Cause	Fix	Speed Impact
Model runs but very slow (2–8 t/s)	Running on CPU, not GPU	Install NVIDIA drivers (Fix #1)	5–10× faster
GPU detected but model uses CPU	Model exceeds VRAM, spills to RAM	Use smaller model or lower quantization (Q4)	3–5× faster
`ollama ps` shows CPU/GPU split	Partial offload — model too big	Switch to Q4 or smaller parameter model	2–4× faster
Fast at first, slows down mid-conversation	Context window growing, KV cache bloating	Reduce `num_ctx` to 4096–8192	1.5–3× faster
Multiple users, all slow	Concurrent requests competing for GPU	Set `OLLAMA_NUM_PARALLEL=1`	Smoother per-request
Q8 or FP16 model, barely faster than CPU	Quantization too high for VRAM bandwidth	Use Q4_K_M quantization	1.5–2× faster
First token fast, rest slow	Prompt processing is fine, generation is VRAM-bandwidth limited	Close other VRAM apps (browser, games)	1.2–1.5× faster

Fix #1: GPU Not Detected (CPU Fallback)

This is one of the most common causes of slow Ollama. If your GPU isn't detected, Ollama silently falls back to CPU, which is 5–10× slower.

# Check if NVIDIA driver is loaded:
nvidia-smi

# If you get "command not found" or "no devices":
sudo ubuntu-drivers autoinstall
sudo reboot

# After reboot, verify:
nvidia-smi  # Should show your GPU

# Then load a model and check that Ollama sees the GPU
# (ollama ps only lists models that are currently loaded):
ollama ps  # PROCESSOR column should say "100% GPU"

Fix #2: Model Too Big for VRAM

If your model needs 16GB VRAM but you have 12GB, Ollama offloads part of it to system RAM. This causes a massive speed penalty because system RAM is much slower than VRAM.

# Check what's happening:
ollama ps

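# If you see something like "45%/55% CPU/GPU", the model is spilling into system RAM.
# Solution: use a smaller quantization or a smaller context window.

# Check which quantization the model uses (Ollama's default tags are usually Q4_K_M):
ollama show qwen2.5:14b

# If it is already Q4, shrink the KV cache by capping the context in a Modelfile: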
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF

ollama create qwen14b-fast -f Modelfile
ollama run qwen14b-fast
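
To see how much room you have before a model spills, compare the SIZE column from ollama ps with your free VRAM. The sketch below uses sample numbers so it runs anywhere. `free_vram_mib` and `model_fits` are hypothetical helpers, and the conversion treats 1 GB as 1024 MiB, which is only an approximation.

```shell
# Reads "total, used" MiB pairs (one line per GPU) and prints free MiB per GPU.
free_vram_mib() {
  awk -F', *' '{ print $1 - $2 }'
}

# Succeeds if a model of the given size (GB, as shown by `ollama ps`) fits in
# the given free VRAM (MiB). Treats 1 GB as 1024 MiB.
model_fits() {
  awk -v gb="$1" -v free="$2" 'BEGIN { exit !(gb * 1024 <= free) }'
}

# Example with sample numbers rather than live hardware:
printf '12227, 1500\n' | free_vram_mib
model_fits 9.8 10727 && echo "fits" || echo "spills"

# On a real machine (assumes nvidia-smi is installed):
#   nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits | free_vram_mib
```

A model that only just fits can still spill once the context grows, so leave some headroom.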

Fix #3: Reduce Context Length

The context window (num_ctx) determines how much conversation history the model processes. Larger context = more VRAM for KV cache = slower generation. Default is often 4096, but some setups push it to 32K+.

# Check current context:
# In Open WebUI: Settings → Advanced → Default context length

# Or in a Modelfile:
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 4096
EOF

# Context length impact on VRAM (qwen2.5:14b):
# 4096 tokens  → ~9.8GB VRAM, ~32 t/s
# 8192 tokens  → ~11.2GB VRAM, ~28 t/s
# 16384 tokens → ~14.1GB VRAM, won't fit 12GB → CPU spill!
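
The growth with context comes mostly from the KV cache, which scales linearly with token count. Here is a back-of-the-envelope sketch of the usual formula. `kv_cache_mib` is a hypothetical helper, and the architecture numbers in the example (48 layers, 8 KV heads, head size 128, 16-bit cache) are assumptions for a 14B-class model, so check your own model's config before trusting them.

```shell
# KV cache size in MiB for a given context length.
# Formula: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes per element.
# Arguments: layers kv_heads head_dim tokens bytes_per_element
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Assumed, illustrative architecture values:
kv_cache_mib 48 8 128 4096 2    # 768
kv_cache_mib 48 8 128 16384 2   # 3072
```

Going from 4K to 16K context multiplies the cache by four. Measured totals can grow faster than this estimate, likely because other runtime buffers also scale with context.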


⚠️ Context Is a VRAM Killer

Most people don't need 32K context for chat. 4096–8192 is plenty for most conversations. If you need long context (analyzing documents), expect slower speeds or use a model designed for it.

Real Benchmarks: Before and After Fixes

Testing on RTX 5070 Ti (12GB), same model (qwen2.5:14b), measuring tokens/sec:


Configuration	Processor	Tokens/sec	Verdict
CPU fallback (no GPU driver)	100% CPU	4 t/s	❌ Unusable
GPU + Q8 quantization + 16K ctx	60% CPU / 40% GPU	11 t/s	⚠️ Spilling to RAM
GPU + Q4 + 8K context	100% GPU	28 t/s	✅ Fast
GPU + Q4 + 4K context	100% GPU	32 t/s	✅ Best balance
Dual GPU (5070 Ti + 5070) + Q4 + 4K	100% GPU (split)	35 t/s	✅ Optimal


Result: From 4 t/s to 35 t/s — a 9× improvement by fixing the driver, quantization, and context length.
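
To take the same measurement on your own machine, you can read the generation rate from Ollama's verbose output. This is a small sketch that assumes the stats include a line of the form "eval rate: N tokens/s". `eval_rate` is a hypothetical helper, and the sample text below is a stand-in, not real benchmark output.

```shell
# Extract the generation speed from `ollama run --verbose` output.
# Assumes a line of the form "eval rate:   32.10 tokens/s". The separate
# "prompt eval rate:" line is skipped on purpose.
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

# Sample text standing in for real output:
printf 'prompt eval rate:     410.00 tokens/s\neval rate:            32.10 tokens/s\n' | eval_rate

# On a real machine:
#   ollama run qwen2.5:14b --verbose "Explain TCP in two sentences." 2>&1 | eval_rate
```

Run it before and after each fix, with the same prompt, so the numbers are comparable.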

Fix #4: Limit Concurrent Requests

If you're running Open WebUI with multiple users (or multiple tabs), requests compete for GPU time. Each request adds latency for all others.

# Limit Ollama to process one request at a time:
sudo systemctl edit ollama

# Add:
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"

# Restart:
sudo systemctl restart ollama
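
After the restart, it is worth confirming the override actually reached the service. This is a minimal sketch, assuming systemd reports the unit's environment as space-separated NAME=value pairs. `unit_env_value` is a hypothetical helper, and it does not handle quoted values that contain spaces.

```shell
# Print the value of one variable from `systemctl show <unit> --property=Environment`.
# Assumes simple NAME=value pairs separated by spaces.
unit_env_value() {
  sed 's/^Environment=//' | tr ' ' '\n' | awk -F= -v name="$1" '$1 == name { print $2 }'
}

# Sample line standing in for real systemctl output:
echo 'Environment=OLLAMA_NUM_PARALLEL=1 OLLAMA_HOST=0.0.0.0' | unit_env_value OLLAMA_NUM_PARALLEL

# On a real machine:
#   systemctl show ollama --property=Environment | unit_env_value OLLAMA_NUM_PARALLEL
```

An empty result means the variable is not set on the unit, so the override did not apply.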

What I Would Do

Run nvidia-smi and ollama ps right now. If ollama ps shows anything other than "100% GPU", that's your problem. Fix the driver first, then drop to Q4 quantization, then set context to 4096. Those three changes take most people from "unusable" to "faster than ChatGPT." On our test machine, qwen2.5:14b at Q4 with 4K context hits 32 tokens/sec on a single GPU, which is faster than most cloud APIs.

  

    Last Updated: July 1, 2026 — Benchmarks from RTX 5070 Ti + RTX 5070 dual-GPU testing. Ollama on Ubuntu 26.04.
