Why Speed Matters
When running AI locally, speed directly affects your experience:
- Interactive chat: You want responses to appear in real-time, not trickle in word by word
- Batch processing: Analyzing documents or generating articles needs to complete in reasonable time
- Code assistance: Waiting 30 seconds for code completion kills your flow
- Automation: AI agents that run workflows need fast responses to be practical
The standard measurement is tokens per second (tok/s). More tokens per second = faster responses.
What’s a token? Roughly ¾ of a word in English. So 10 tok/s ≈ 7-8 words per second. A typical paragraph (50 words) takes about 6-7 seconds at 10 tok/s.
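To see where those numbers come from, here’s a quick back-of-the-envelope sketch in Python (the ¾-words-per-token figure is an approximation for English text, so treat the result as a rough estimate):

```python
# Rough estimate of response time from word count and generation speed.
# Assumes ~0.75 words per token for English text (an approximation).
def response_time_seconds(words: int, tokens_per_second: float) -> float:
    tokens = words / 0.75              # approximate token count
    return tokens / tokens_per_second

print(round(response_time_seconds(50, 10), 1))  # ~6.7 s for a 50-word paragraph at 10 tok/s
```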
What Speed Should You Expect?
Comfort Levels
| Experience | Tokens/Second | What It Feels Like |
|---|---|---|
| Painfully slow | < 3 tok/s | Waiting, good for batch work only |
| Usable | 3-8 tok/s | Noticeable delay, but workable |
| Comfortable | 8-15 tok/s | Good chat experience |
| Fast | 15-30 tok/s | Near real-time, very smooth |
| Blazing | 30+ tok/s | As fast or faster than cloud AI |
GPU Tier Benchmarks
How to Read These Numbers
- Model sizes shown in parameters (e.g., 7B = 7 billion parameters)
- Quantization affects both speed and quality (Q4 = 4-bit, Q8 = 8-bit, FP16 = full precision); a rough way to estimate memory use from these numbers is sketched after this list
- Real-world performance varies based on CPU, RAM speed, and system configuration
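If you want to sanity-check the “VRAM Used” column in the tables below, here’s a rough rule-of-thumb sketch (an approximation I’m adding, not an exact formula; real usage also depends on context length, KV cache size, and runtime overhead):

```python
# Rough memory footprint: parameters x bits-per-weight / 8, plus fixed overhead.
# Approximate only -- real VRAM use also depends on context length and KV cache size.
def approx_memory_gb(params_billion: float, bits: int, overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * bits / 8    # e.g. an 8B model at 4-bit is ~4 GB of weights
    return weights_gb + overhead_gb

print(round(approx_memory_gb(8, 4), 1))   # ~5 GB, in the ballpark of the ~6 GB shown for Llama 3.1 8B Q4
print(round(approx_memory_gb(8, 8), 1))   # ~9 GB, in the ballpark of the ~10 GB shown for Q8
```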
RTX 3060 (12GB VRAM)
Best budget GPU for local AI. Handles most 7B-13B models at Q4.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 45-55 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 40-50 |
| Llama 3.1 8B | Q4 | ~6 GB | 22-28 |
| Mistral 7B | Q4 | ~5.5 GB | 25-32 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 24-30 |
| Llama 3.1 8B | Q8 | ~10 GB | 12-16 |
RTX 4060 (8GB VRAM)
Popular mid-range. Limited by VRAM, but fast for what fits.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 55-70 |
| Phi-3 Mini 3.8B | Q4 | ~3.5 GB | 50-65 |
| Llama 3.1 8B | Q4 | ~6 GB | 30-40 |
| Mistral 7B | Q4 | ~5.5 GB | 35-45 |
| Qwen 2.5 7B | Q4 | ~5.5 GB | 32-42 |
| Llama 3.1 8B | Q8 | ~10 GB | ⚠️ Exceeds 8GB |
⚠️ Note: The RTX 4060’s 8GB of VRAM limits you to Q4 quantization for 8B models; you can’t run Q8 or larger models on the GPU alone.
RTX 4070 (12GB VRAM)
Sweet spot for price/performance. Similar capacity to 3060 but much faster.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~3 GB | 75-90 |
| Llama 3.1 8B | Q4 | ~6 GB | 40-55 |
| Llama 3.1 8B | Q8 | ~10 GB | 22-28 |
| Qwen 2.5 14B | Q4 | ~10 GB | 18-24 |
| Mistral 7B | Q4 | ~5.5 GB | 45-60 |
| Llama 3.1 70B | Q4 | ~42 GB | ⚠️ CPU offload |
RTX 4090 (24GB VRAM)
Enthusiast tier. Can run 70B models with partial GPU offloading.
| Model | Quant | VRAM Used | Speed (tok/s) |
|---|---|---|---|
| Llama 3.1 8B | Q4 | ~6 GB | 100-130 |
| Qwen 2.5 14B | Q4 | ~10 GB | 55-75 |
| Qwen 2.5 32B | Q4 | ~20 GB | 30-40 |
| Llama 3.1 70B | Q4 | ~42 GB | 12-18 (partial GPU) |
| Mistral 7B | Q4 | ~5.5 GB | 120-160 |
| Mixtral 8x7B | Q4 | ~28 GB | 8-12 (partial GPU) |
CPU-Only Performance
No GPU? You can still run AI, just slower.
| Model | Quant | RAM Needed | Speed (tok/s) |
|---|---|---|---|
| Llama 3.2 3B | Q4 | ~4 GB | 8-15 |
| Phi-3 Mini 3.8B | Q4 | ~5 GB | 6-12 |
| Llama 3.1 8B | Q4 | ~7 GB | 3-6 |
| Mistral 7B | Q4 | ~6.5 GB | 4-8 |
💡 Tip: CPU performance varies hugely based on your processor. Modern CPUs (Ryzen 7000+, Intel 13th Gen+) are significantly faster than older chips. More RAM bandwidth = faster AI on CPU.
Apple Silicon
Apple’s unified memory gives Macs a unique advantage for AI.
| Chip | Memory | Best Model | Speed (tok/s) |
|---|---|---|---|
| M1 | 16 GB | Llama 3.1 8B Q4 | 12-18 |
| M2 | 16 GB | Llama 3.1 8B Q4 | 15-22 |
| M2 Pro | 18 GB | Llama 3.1 8B Q8 | 14-20 |
| M3 | 24 GB | Qwen 2.5 14B Q4 | 18-25 |
| M3 Pro | 36 GB | Qwen 2.5 32B Q4 | 15-22 |
| M3 Max | 64 GB | Llama 3.1 70B Q4 | 10-15 |
| M4 | 32 GB | Qwen 2.5 32B Q4 | 22-30 |
See our Apple Silicon AI Guide for detailed Mac optimization tips.
How Quantization Affects Speed
Quantization reduces model precision to save memory and increase speed. Here’s the tradeoff:
| Quantization | Quality Loss | Memory Saved | Speed Gain |
|---|---|---|---|
| FP16 (full) | None (baseline) | None | Baseline |
| Q8 | Minimal | ~50% | ~10-20% faster |
| Q6 | Very slight | ~60% | ~15-25% faster |
| Q4 | Slight | ~75% | ~20-40% faster |
| Q3 | Noticeable | ~80% | ~30-50% faster |
| Q2 | Significant | ~85% | ~40-60% faster |
Recommendation: Use Q4 as your default. The quality loss is minimal and the speed/memory gains are substantial. Only use FP16/Q8 for critical tasks where maximum quality matters.
How to Benchmark Your Own Setup
Using Ollama
```bash
# Run any model with timing stats enabled
ollama run llama3.1 --verbose
# Type your prompt and note the stats printed after each response
```
With the --verbose flag, Ollama prints timing stats after every response, including the eval rate in tokens per second.
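If you’d rather capture a number you can log and compare across machines, you can also query Ollama’s local HTTP API. This is a minimal sketch, assuming Ollama is running on its default port (11434) and the model has already been pulled; the response’s eval_count and eval_duration fields give tokens generated and generation time in nanoseconds.

```python
import json
import urllib.request

# Minimal benchmark against a locally running Ollama server (default port 11434).
# Assumes the model has already been pulled with `ollama pull llama3.1`.
payload = json.dumps({
    "model": "llama3.1",
    "prompt": "Explain what a token is in one paragraph.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count = tokens generated, eval_duration = generation time in nanoseconds
tok_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```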
Using llama.cpp
```bash
# Generate 512 tokens from a prompt; timing stats (including tokens/second) print at the end
./llama-cli -m model.gguf -p "Tell me about AI" -n 512 -t 8
```
Using LM Studio
LM Studio has a built-in benchmark tool in the settings.
Online Benchmarks
- huggingface.co/spaces/open-llm-leaderboard – Community benchmarks
- artificialanalysis.ai – Comprehensive model comparisons
What Makes Speed Vary?
Hardware Factors
- GPU VRAM – More VRAM = larger models on GPU = faster
- GPU Memory Bandwidth – Higher bandwidth = faster inference (this is why the RTX 4090 is so fast)
- RAM Speed – For CPU inference, faster RAM = faster AI
- CPU Cores – More cores = better CPU inference (up to a point)
Software Factors
- Backend – Vulkan, CUDA, Metal, and CPU each have different performance
- Context Length – Longer conversations use more memory and slow generation down
- Batch Size – Larger batches are more efficient but use more memory
- Number of Layers on GPU – Partial GPU offloading can be faster or slower than pure CPU depending on your hardware (see the sketch after this list)
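To make the last three factors concrete, here’s a minimal sketch using the llama-cpp-python bindings (my choice for illustration; the CLI examples above expose the same knobs as the -c, -b, and -ngl flags). The values shown are placeholders, not recommendations for your hardware:

```python
from llama_cpp import Llama

# Sketch of the tuning knobs discussed above (llama-cpp-python bindings).
# Values here are illustrative, not recommendations for your hardware.
llm = Llama(
    model_path="model.gguf",   # any local GGUF file
    n_ctx=4096,                # context length: longer contexts use more memory and slow generation
    n_batch=256,               # batch size: larger is more efficient but uses more memory
    n_gpu_layers=32,           # layers offloaded to GPU: 0 = pure CPU, -1 = offload everything that fits
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```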
Speed Optimization Tips
- Use Q4 quantization – Best speed/quality balance
- Reduce context length – Clear chat history periodically (a sketch of this follows the list)
- Use a GPU – Even a budget GPU is 5-10x faster than CPU
- Close other apps – Free up VRAM and RAM
- Update your drivers – Latest GPU drivers often improve AI performance
- Choose the right model – Don’t use a 70B model for simple tasks; 7B-8B is often enough
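As a concrete example of the “reduce context length” tip, here’s a minimal sketch that keeps only the most recent messages before calling Ollama’s chat endpoint (assumptions on my part: Ollama on its default port, and an arbitrary window of 10 messages):

```python
import json
import urllib.request

# Keep only the most recent messages so the context stays short and generation stays fast.
# The window of 10 is arbitrary -- tune it to your model and context length.
def chat(history, keep_last=10, model="llama3.1"):
    trimmed = history[-keep_last:]
    payload = json.dumps({"model": model, "messages": trimmed, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

history = [{"role": "user", "content": "Summarize why context length affects speed."}]
print(chat(history))
```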
The Bottom Line
| Your Situation | Recommended Setup |
|---|---|
| Just trying it out | Any CPU with 16GB RAM, Llama 3.2 3B |
| Budget GPU | RTX 3060 12GB, Llama 3.1 8B Q4 |
| Best value | RTX 4070 12GB, Qwen 2.5 14B Q4 |
| Maximum performance | RTX 4090 24GB, Qwen 2.5 32B Q4 |
| Mac user | M3+ with 24GB+, Llama 3.1 8B or Qwen 14B |
💡 Pro Tip: Speed isn’t everything. A slower model that’s more accurate is often better than a fast model that hallucinates. Start with quality, then optimize for speed. Learn more about model selection in our Best Local LLMs 2026 guide and hardware requirements in our GPU & VRAM Guide.
Want the complete guide to running AI fast and efficiently? Get the Local AI Setup Kit – everything you need in one professional PDF.