Why Speed Matters
When running AI locally, speed directly affects your experience:
- Interactive chat: You want responses to appear in real-time, not trickle in word by word
- Batch processing: Analyzing documents or generating articles needs to complete in reasonable time
- Code assistance: Waiting 30 seconds for code completion kills your flow
- Automation: AI agents that run workflows need fast responses to be practical
The standard measurement is tokens per second (tok/s). More tokens per second = faster responses.
What’s a token? Roughly ¾ of a word in English. So 10 tok/s ≈ 7-8 words per second. A typical paragraph (50 words) takes about 6-7 seconds at 10 tok/s.
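To see where those numbers come from, here’s a quick back-of-the-envelope sketch in Python (the ¾-words-per-token figure is an approximation for English text, so treat the result as a rough estimate):

```python
# Rough estimate of response time from word count and generation speed.
# Assumes ~0.75 words per token for English text (an approximation).
def response_time_seconds(words: int, tokens_per_second: float) -> float:
    tokens = words / 0.75              # approximate token count
    return tokens / tokens_per_second

print(round(response_time_seconds(50, 10), 1))  # ~6.7 s for a 50-word paragraph at 10 tok/s
```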
What Speed Should You Expect?
Comfort Levels
| Experience | Tokens/Second | What It Feels Like |
|---|---|---|
| Painfully slow | < 3 tok/s | Waiting, good for batch work only |
| Usable | 3-8 tok/s | Noticeable delay, but workable |
| Comfortable | 8-15 tok/s | Good chat experience |
| Fast | 15-30 tok/s | Near real-time, very smooth |
| Blazing | 30+ tok/s | As fast or faster than cloud AI |
GPU Tier Benchmarks
How to Read These Numbers
- Model sizes shown in parameters (e.g., 7B = 7 billion parameters)
- Quantization affects both speed and quality (Q4 = 4-bit, Q8 = 8-bit, FP16 = full precision); a rough way to estimate memory use from these numbers is sketched after this list
- Real-world performance varies based on CPU, RAM speed, and system configuration
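If you want to sanity-check the “VRAM Used” column in the tables below, here’s a rough rule-of-thumb sketch (an approximation I’m adding, not an exact formula; real usage also depends on context length, KV cache size, and runtime overhead):

```python
# Rough memory footprint: parameters x bits-per-weight / 8, plus fixed overhead.
# Approximate only -- real VRAM use also depends on context length and KV cache size.
def approx_memory_gb(params_billion: float, bits: int, overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * bits / 8    # e.g. an 8B model at 4-bit is ~4 GB of weights
    return weights_gb + overhead_gb

print(round(approx_memory_gb(8, 4), 1))   # ~5 GB, in the ballpark of the ~6 GB shown for Llama 3.1 8B Q4
print(round(approx_memory_gb(8, 8), 1))   # ~9 GB, in the ballpark of the ~10 GB shown for Q8
```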
RTX 3060 (12GB VRAM)
Best budget GPU for local AI. Handles most 7B-13B models at Q4.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 45-55 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 40-50 |
| Llama 3.1 8B | Q4 | ~6 GB | 22-28 |
| Mistral 7B | Q4 | ~5.5 GB | 25-32 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 24-30 |
| Llama 3.1 8B | Q8 | ~10 GB | 12-16 |
RTX 4060 (8GB VRAM)
Popular mid-range. Limited by VRAM, but fast for what fits.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 55-70 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 50-65 |
| Llama 3.1 8B | Q4 | ~6 GB | 30-40 |
| Mistral 7B | Q4 | ~5.5 GB | 35-45 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 32-42 |
| Llama 3.1 8B | Q8 | ~10 GB | ⚠️ Exceeds 8GB |
⚠️ Note: The RTX 4060’s 8GB of VRAM limits you to Q4 quantization for 8B models; you can’t run Q8 or larger models on the GPU alone.
RTX 4070 (12GB VRAM)
Sweet spot for price/performance. Similar capacity to 3060 but much faster.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 75-90 |
| Llama 3.1 8B | Q4 | ~6 GB | 40-55 |
| Llama 3.1 8B | Q8 | ~10 GB | 22-28 |
| Qwen 2.5 14B | Q4 | ~10 GB | 18-24 |
| Mistral 7B | Q4 | ~5.5 GB | 45-60 |
| Llama 3.1 70B | Q4 | ~42 GB | ⚠️ CPU offload |
RTX 4090 (24GB VRAM)
Enthusiast tier. Can run 70B models with partial GPU offloading.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.1 8B | Q4 | ~6 GB | 100-130 |
| Qwen 2.5 14B | Q4 | ~10 GB | 55-75 |
| Qwen 2.5 32B | Q4 | ~20 GB | 30-40 |
| Llama 3.1 70B | Q4 | ~42 GB | 12-18 (partial GPU) |
| Mistral 7B | Q4 | ~5.5 GB | 120-160 |
| Mixtral 8x7B | Q4 | ~28 GB | 8-12 (partial GPU) |
CPU-Only Performance
No GPU? You can still run AI, just slower.
| Model | Quant | RAM Needed | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~4 GB | 8-15 |
| Phi-3 Mini 3.8B | Q4 | ~5 GB | 6-12 |
| Llama 3.1 8B | Q4 | ~7 GB | 3-6 |
| Mistral 7B | Q4 | ~6.5 GB | 4-8 |
💡 Tip: CPU performance varies hugely based on your processor. Modern CPUs (Ryzen 7000+, Intel 13th Gen+) are significantly faster than older chips. More RAM bandwidth = faster AI on CPU.
Apple Silicon
Apple’s unified memory gives Macs a unique advantage for AI.
| Chip | Memory | Best Model | Speed (tok/s) |
|---|---|---|---|
| M1 | 16 GB | Llama 3.1 8B Q4 | 12-18 |
| M2 | 16 GB | Llama 3.1 8B Q4 | 15-22 |
| M2 Pro | 18 GB | Llama 3.1 8B Q8 | 14-20 |
| M3 | 24 GB | Qwen 2.5 14B Q4 | 18-25 |
| M3 Pro | 36 GB | Qwen 2.5 32B Q4 | 15-22 |
| M3 Max | 64 GB | Llama 3.1 70B Q4 | 10-15 |
| M4 | 32 GB | Qwen 2.5 32B Q4 | 22-30 |
See our Apple Silicon AI Guide for detailed Mac optimization tips.
How Quantization Affects Speed
Quantization reduces model precision to save memory and increase speed. Here’s the tradeoff:
| Quantization | Quality Loss | Memory Saved | Speed Gain |
|---|---|---|---|
| FP16 (full) | None (baseline) | None | Baseline |
| Q8 | Minimal | ~50% | ~10-20% faster |
| Q6 | Very slight | ~60% | ~15-25% faster |
| Q4 | Slight | ~75% | ~20-40% faster |
| Q3 | Noticeable | ~80% | ~30-50% faster |
| Q2 | Significant | ~85% | ~40-60% faster |
Recommendation: Use Q4 as your default. The quality loss is minimal and the speed/memory gains are substantial. Only use FP16/Q8 for critical tasks where maximum quality matters.
How to Benchmark Your Own Setup
Using Ollama
```bash
# Run any model with timing stats enabled
ollama run llama3.1 --verbose
# Type your prompt and note the stats printed after each response
```
With the --verbose flag, Ollama prints timing stats after every response, including the eval rate in tokens per second.
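If you’d rather capture a number you can log and compare across machines, you can also query Ollama’s local HTTP API. This is a minimal sketch, assuming Ollama is running on its default port (11434) and the model has already been pulled; the response’s eval_count and eval_duration fields give tokens generated and generation time in nanoseconds.

```python
import json
import urllib.request

# Minimal benchmark against a locally running Ollama server (default port 11434).
# Assumes the model has already been pulled with `ollama pull llama3.1`.
payload = json.dumps({
    "model": "llama3.1",
    "prompt": "Explain what a token is in one paragraph.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count = tokens generated, eval_duration = generation time in nanoseconds
tok_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```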
Using llama.cpp
```bash
# Generate 512 tokens from a prompt; timing stats (including tokens/second) print at the end
./llama-cli -m model.gguf -p "Tell me about AI" -n 512 -t 8
```
Using LM Studio
LM Studio has a built-in benchmark tool in the settings.
Online Benchmarks
- huggingface.co/spaces/open-llm-leaderboard – Community benchmarks
- artificialanalysis.ai – Comprehensive model comparisons
What Makes Speed Vary?
Hardware Factors
- GPU VRAM – More VRAM = larger models on GPU = faster
- GPU Memory Bandwidth – Higher bandwidth = faster inference (this is why the RTX 4090 is so fast)
- RAM Speed – For CPU inference, faster RAM = faster AI
- CPU Cores – More cores = better CPU inference (up to a point)
Software Factors
- Backend – Vulkan, CUDA, Metal, and CPU each have different performance
- Context Length – Longer conversations use more memory and slow generation down
- Batch Size – Larger batches are more efficient but use more memory
- Number of Layers on GPU – Partial GPU offloading can be faster or slower than pure CPU depending on your hardware (see the sketch after this list)
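To make the last three factors concrete, here’s a minimal sketch using the llama-cpp-python bindings (my choice for illustration; the CLI examples above expose the same knobs as the -c, -b, and -ngl flags). The values shown are placeholders, not recommendations for your hardware:

```python
from llama_cpp import Llama

# Sketch of the tuning knobs discussed above (llama-cpp-python bindings).
# Values here are illustrative, not recommendations for your hardware.
llm = Llama(
    model_path="model.gguf",   # any local GGUF file
    n_ctx=4096,                # context length: longer contexts use more memory and slow generation
    n_batch=256,               # batch size: larger is more efficient but uses more memory
    n_gpu_layers=32,           # layers offloaded to GPU: 0 = pure CPU, -1 = offload everything that fits
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```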
Speed Optimization Tips
- Use Q4 quantization – Best speed/quality balance
- Reduce context length – Clear chat history periodically (a sketch of this follows the list)
- Use a GPU – Even a budget GPU is 5-10x faster than CPU
- Close other apps – Free up VRAM and RAM
- Update your drivers – Latest GPU drivers often improve AI performance
- Choose the right model – Don’t use a 70B model for simple tasks; 7B-8B is often enough
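As a concrete example of the “reduce context length” tip, here’s a minimal sketch that keeps only the most recent messages before calling Ollama’s chat endpoint (assumptions on my part: Ollama on its default port, and an arbitrary window of 10 messages):

```python
import json
import urllib.request

# Keep only the most recent messages so the context stays short and generation stays fast.
# The window of 10 is arbitrary -- tune it to your model and context length.
def chat(history, keep_last=10, model="llama3.1"):
    trimmed = history[-keep_last:]
    payload = json.dumps({"model": model, "messages": trimmed, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

history = [{"role": "user", "content": "Summarize why context length affects speed."}]
print(chat(history))
```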
The Bottom Line
| Your Situation | Recommended Setup |
|---|---|
| Just trying it out | Any CPU with 16GB RAM, Llama 3.2 3B |
| Budget GPU | RTX 3060 12GB, Llama 3.1 8B Q4 |
| Best value | RTX 4070 12GB, Qwen 2.5 14B Q4 |
| Maximum performance | RTX 4090 24GB, Qwen 2.5 32B Q4 |
| Mac user | M3+ with 24GB+, Llama 3.1 8B or Qwen 14B |
💡 Pro Tip: Speed isn’t everything. A slower model that’s more accurate is often better than a fast model that hallucinates. Start with quality, then optimize for speed. Learn more about model selection in our Best Local LLMs 2026 guide and hardware requirements in our GPU & VRAM Guide.
Want the complete guide to running AI fast and efficiently? Get the Local AI Setup Kit – everything you need in one professional PDF.