Performance

Local AI Speed Benchmarks – How Fast Are Models on Your Hardware?

6 min read · Apr 11, 2026

Why Speed Matters

When running AI locally, speed directly affects your experience:

  • Interactive chat: You want responses to appear in real-time, not trickle in word by word
  • Batch processing: Analyzing documents or generating articles needs to complete in reasonable time
  • Code assistance: Waiting 30 seconds for code completion kills your flow
  • Automation: AI agents that run workflows need fast responses to be practical

The standard measurement is tokens per second (tok/s). More tokens per second = faster responses.

What’s a token? Roughly ¾ of a word in English. So 10 tok/s ≈ 7-8 words per second. A typical paragraph (50 words) takes about 6-7 seconds at 10 tok/s.
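The conversion above can be sketched in a few lines. The 0.75 words-per-token ratio is the rough heuristic from this article, not an exact figure; real tokenizers vary by model and language:

```python
# Rough reading-time math for tokens-per-second figures.
# Assumption: ~0.75 words per token in English (a heuristic, not exact).

WORDS_PER_TOKEN = 0.75

def words_per_second(tok_per_s):
    """Approximate words generated per second at a given token rate."""
    return tok_per_s * WORDS_PER_TOKEN

def seconds_for_words(word_count, tok_per_s):
    """Approximate time to generate a passage of the given word count."""
    tokens = word_count / WORDS_PER_TOKEN
    return tokens / tok_per_s

print(words_per_second(10))                  # 7.5 words/s
print(round(seconds_for_words(50, 10), 1))   # a 50-word paragraph at 10 tok/s: ~6.7 s
```

This matches the estimate above: a 50-word paragraph at 10 tok/s lands in the 6-7 second range.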

What Speed Should You Expect?

Comfort Levels

| Experience | Tokens/Second | What It Feels Like |
|---|---|---|
| Painfully slow | < 3 tok/s | Waiting; good for batch work only |
| Usable | 3-8 tok/s | Noticeable delay, but workable |
| Comfortable | 8-15 tok/s | Good chat experience |
| Fast | 15-30 tok/s | Near real-time, very smooth |
| Blazing | 30+ tok/s | As fast as or faster than cloud AI |

GPU Tier Benchmarks

How to Read These Numbers

  • Model sizes shown in parameters (e.g., 7B = 7 billion parameters)
  • Quantization affects both speed and quality (Q4 = 4-bit, Q8 = 8-bit, FP16 = full precision)
  • Real-world performance varies based on CPU, RAM speed, and system configuration
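A quick way to sanity-check the VRAM columns below: multiply the parameter count by bytes per weight for the quantization level. The sketch below uses an assumed 1.2× overhead factor for runtime buffers; real usage with a loaded context runs higher, which is why the tables show somewhat larger figures:

```python
# Back-of-envelope VRAM estimate: parameters × bytes-per-weight × overhead.
# The 1.2 overhead factor is an assumption for runtime buffers, not a
# measured value; KV cache for long contexts adds more on top.

BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q4": 0.5}

def vram_gb(params_billions, quant, overhead=1.2):
    """Rough VRAM footprint in GB for a model at a given quantization."""
    return params_billions * BYTES_PER_WEIGHT[quant] * overhead

print(round(vram_gb(8, "Q4"), 1))  # ~4.8 GB (tables show ~6 GB with context)
print(round(vram_gb(8, "Q8"), 1))  # ~9.6 GB (tables show ~10 GB)
```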

RTX 3060 (12GB VRAM)

Best budget GPU for local AI. Handles most 7B-13B models at Q4.

| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 45-55 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 40-50 |
| Llama 3.1 8B | Q4 | ~6 GB | 22-28 |
| Mistral 7B | Q4 | ~5.5 GB | 25-32 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 24-30 |
| Llama 3.1 8B | Q8 | ~10 GB | 12-16 |

RTX 4060 (8GB VRAM)

Popular mid-range. Limited by VRAM, but fast for what fits.

| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 55-70 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 50-65 |
| Llama 3.1 8B | Q4 | ~6 GB | 30-40 |
| Mistral 7B | Q4 | ~5.5 GB | 35-45 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 32-42 |
| Llama 3.1 8B | Q8 | ~10 GB | ⚠️ Exceeds 8 GB |

โš ๏ธ Note: RTX 4060’s 8GB VRAM limits you to Q4 quantization for 8B models. Can’t run Q8 or larger models on GPU alone.

RTX 4070 (12GB VRAM)

Sweet spot for price/performance. Similar capacity to 3060 but much faster.

| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 75-90 |
| Llama 3.1 8B | Q4 | ~6 GB | 40-55 |
| Llama 3.1 8B | Q8 | ~10 GB | 22-28 |
| Qwen 2.5 14B | Q4 | ~10 GB | 18-24 |
| Mistral 7B | Q4 | ~5.5 GB | 45-60 |
| Llama 3.1 70B | Q4 | ~42 GB | ⚠️ CPU offload |

RTX 4090 (24GB VRAM)

Enthusiast tier. Can run 70B models with partial GPU offloading.

| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.1 8B | Q4 | ~6 GB | 100-130 |
| Qwen 2.5 14B | Q4 | ~10 GB | 55-75 |
| Qwen 2.5 32B | Q4 | ~20 GB | 30-40 |
| Llama 3.1 70B | Q4 | ~42 GB | 12-18 (partial GPU) |
| Mistral 7B | Q4 | ~5.5 GB | 120-160 |
| Mixtral 8x7B | Q4 | ~28 GB | 8-12 (partial GPU) |

CPU-Only Performance

No GPU? You can still run AI, just slower.

| Model | Quant | RAM Needed | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~4 GB | 8-15 |
| Phi-3 Mini 3.8B | Q4 | ~5 GB | 6-12 |
| Llama 3.1 8B | Q4 | ~7 GB | 3-6 |
| Mistral 7B | Q4 | ~6.5 GB | 4-8 |

💡 Tip: CPU performance varies hugely based on your processor. Modern CPUs (Ryzen 7000+, Intel 13th Gen+) are significantly faster than older chips. More RAM bandwidth = faster AI on CPU.
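The RAM-bandwidth point can be made concrete: generating each token requires streaming roughly the whole model through memory once, so bandwidth divided by model size gives a theoretical ceiling on tok/s. The 90 GB/s figure below is an assumed value for dual-channel DDR5; real throughput lands well below the ceiling due to compute and cache effects:

```python
# Token generation is typically memory-bandwidth bound: each new token
# reads (roughly) every model weight from memory once.
# Theoretical ceiling: tok/s ≈ memory bandwidth / model size in bytes.

def max_tok_per_s(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed for a bandwidth-bound workload."""
    return bandwidth_gb_s / model_size_gb

# Assumptions: dual-channel DDR5 ≈ 90 GB/s, 8B Q4 model ≈ 5 GB of weights
print(round(max_tok_per_s(90, 5), 1))  # ~18 tok/s ceiling; the table shows 3-6 in practice
```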

Apple Silicon

Apple’s unified memory gives Macs a unique advantage for AI.

| Chip | Memory | Best Model | Speed (tok/s) |
|---|---|---|---|
| M1 | 16 GB | Llama 3.1 8B Q4 | 12-18 |
| M2 | 16 GB | Llama 3.1 8B Q4 | 15-22 |
| M2 Pro | 18 GB | Llama 3.1 8B Q8 | 14-20 |
| M3 | 24 GB | Qwen 2.5 14B Q4 | 18-25 |
| M3 Pro | 36 GB | Qwen 2.5 32B Q4 | 15-22 |
| M3 Max | 64 GB | Llama 3.1 70B Q4 | 10-15 |
| M4 | 32 GB | Qwen 2.5 32B Q4 | 22-30 |

See our Apple Silicon AI Guide for detailed Mac optimization tips.

How Quantization Affects Speed

Quantization reduces model precision to save memory and increase speed. Here’s the tradeoff:

| Quantization | Quality Loss | Memory Saved | Speed Gain |
|---|---|---|---|
| FP16 (full) | None (baseline) | None | Baseline |
| Q8 | Minimal | ~50% | ~10-20% faster |
| Q6 | Very slight | ~60% | ~15-25% faster |
| Q4 | Slight | ~75% | ~20-40% faster |
| Q3 | Noticeable | ~80% | ~30-50% faster |
| Q2 | Significant | ~85% | ~40-60% faster |

Recommendation: Use Q4 as your default. The quality loss is minimal and the speed/memory gains are substantial. Only use FP16/Q8 for critical tasks where maximum quality matters.
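To see what the "Memory Saved" column means for a concrete model, here's a sketch applying those percentages to an 8B model. The ~16 GB FP16 baseline comes from 8B parameters × 2 bytes per weight:

```python
# Apply the "Memory Saved" percentages to an 8B-parameter model.
# FP16 baseline assumption: 8B params × 2 bytes ≈ 16 GB of weights.

FP16_GB = 16
MEMORY_SAVED = {"FP16": 0.00, "Q8": 0.50, "Q6": 0.60,
                "Q4": 0.75, "Q3": 0.80, "Q2": 0.85}

def approx_gb(fp16_gb, saved_fraction):
    """Approximate weight footprint after quantization."""
    return fp16_gb * (1 - saved_fraction)

for quant, saved in MEMORY_SAVED.items():
    print(f"{quant}: ~{approx_gb(FP16_GB, saved):.1f} GB")
# Q4 comes out around 4 GB of weights, which is why an 8B Q4 model
# fits on an 8 GB card while Q8 (~8 GB of weights) does not.
```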

How to Benchmark Your Own Setup

Using Ollama

# Run any model with timing statistics
ollama run llama3.1 --verbose
# Type your prompt; speed stats print after each response

With the --verbose flag, Ollama prints an "eval rate" line (tokens per second) after each response.
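For scripted benchmarking, Ollama's REST API reports `eval_count` and `eval_duration` (in nanoseconds) in its generate response, which is enough to compute tok/s yourself. A minimal sketch, assuming a default local server on port 11434:

```python
# Compute tokens/second from Ollama's REST API response.
# Assumes a local Ollama server at the default port; eval_count and
# eval_duration are fields of the /api/generate response.
import json
import urllib.request

def tok_per_s(eval_count, eval_duration_ns):
    """Ollama reports eval_duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(prompt, model="llama3.1"):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
    )
    resp = json.loads(urllib.request.urlopen(req).read())
    return tok_per_s(resp["eval_count"], resp["eval_duration"])

print(round(tok_per_s(512, 20_000_000_000), 1))  # 512 tokens in 20 s: 25.6 tok/s
```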

Using llama.cpp

# Dedicated benchmark tool bundled with llama.cpp
./llama-bench -m model.gguf

# Or time a single generation (512 tokens, 8 threads); timings print at the end
./llama-cli -m model.gguf -p "Tell me about AI" -n 512 -t 8

Using LM Studio

LM Studio has a built-in benchmark tool in the settings.

Online Benchmarks

  • huggingface.co/spaces/open-llm-leaderboard – Community benchmarks
  • artificialanalysis.ai – Comprehensive model comparisons

What Makes Speed Vary?

Hardware Factors

  • GPU VRAM – More VRAM = larger models on GPU = faster
  • GPU Memory Bandwidth – Higher bandwidth = faster inference (this is why the RTX 4090 is so fast)
  • RAM Speed – For CPU inference, faster RAM = faster AI
  • CPU Cores – More cores = better CPU inference (up to a point)

Software Factors

  • Backend – Vulkan, CUDA, Metal, and CPU each have different performance
  • Context Length – Longer conversations use more memory and slow down
  • Batch Size – Larger batches are more efficient but use more memory
  • Number of Layers on GPU – Partial GPU offloading can be faster or slower than pure CPU depending on your hardware

Speed Optimization Tips

  1. Use Q4 quantization – Best speed/quality balance
  2. Reduce context length – Clear chat history periodically
  3. Use a GPU – Even a budget GPU is 5-10x faster than CPU
  4. Close other apps – Free up VRAM and RAM
  5. Update your drivers – Latest GPU drivers often improve AI performance
  6. Choose the right model – Don’t use a 70B model for simple tasks; 7B-8B is often enough

The Bottom Line

| Your Situation | Recommended Setup |
|---|---|
| Just trying it out | Any CPU with 16GB RAM, Llama 3.2 3B |
| Budget GPU | RTX 3060 12GB, Llama 3.1 8B Q4 |
| Best value | RTX 4070 12GB, Qwen 2.5 14B Q4 |
| Maximum performance | RTX 4090 24GB, Qwen 2.5 32B Q4 |
| Mac user | M3+ with 24GB+, Llama 3.1 8B or Qwen 14B |

💡 Pro Tip: Speed isn’t everything. A slower model that’s more accurate is often better than a fast model that hallucinates. Start with quality, then optimize for speed. Learn more about model selection in our Best Local LLMs 2026 guide and hardware requirements in our GPU & VRAM Guide.


Want the complete guide to running AI fast and efficiently? Get the Local AI Setup Kit – everything you need in one professional PDF.
