Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 12 min read 💻 Hardware

How Much VRAM Do You Need for Local AI?

⚡ Quick Answer

6GB VRAM runs 7B models at 4-bit (good for basic chat). 12GB VRAM runs 8B–14B models (the sweet spot for quality and speed). 24GB VRAM runs up to 32B models or multiple smaller models simultaneously. 32GB+ can run 70B models only with heavy quantization (Q3 or lower); at Q4 a 70B model needs roughly 40–48GB. More VRAM lets you run bigger models, which generally means better quality. 4-bit quantization cuts the VRAM requirement to roughly a third of the uncompressed (FP16) size.

Who This Is For

Read this if: You are buying a GPU, deciding what model to run, or wondering why your current model is slow or crashing.

Skip if: You already know your VRAM and just want model recommendations — go to Best Local AI Models for Beginners.

The VRAM-to-Model Size Table

This is the most important table on this site. Bookmark it.

Your VRAM	Best Model Size	Example Models	Quality Level
4GB	1B–3B	phi3:mini, qwen2.5:1.5b, gemma2:2b	Basic — fast but limited reasoning
6GB	3B–7B	llama3.1:8b (q4), mistral:7b, qwen2.5:7b	Good — usable for chat and writing
8GB	7B–8B	llama3.1:8b, qwen2.5:7b, gemma2:9b (tight)	Good — comfortable chat + light coding
12GB	8B–14B	llama3.1:8b (full), qwen2.5:14b, command-r	Excellent — the sweet spot
16GB	14B–22B	qwen2.5:14b (full), deepseek-coder:33b (q3)	Excellent — strong reasoning
24GB	22B–34B	qwen2.5:32b (q4), command-r (full), mixtral 8x7b	Outstanding — near-GPT-4 quality
32GB+	34B–70B	llama3.1:70b (q4), qwen2.5:72b (q3)	Top-tier — best open models available
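
If you want the table in a script (for example, to print a suggestion based on a detected GPU), a minimal sketch could look like the following. The tiers are copied from the table above; the function name `recommended_model_size` is made up for this example and is not part of any tool.

```python
# VRAM tiers copied from the table above: (minimum VRAM in GB, model size range).
VRAM_TIERS = [
    (4, "1B-3B"),
    (6, "3B-7B"),
    (8, "7B-8B"),
    (12, "8B-14B"),
    (16, "14B-22B"),
    (24, "22B-34B"),
    (32, "34B-70B"),
]


def recommended_model_size(vram_gb):
    """Return the size range of the largest tier that fits in vram_gb.

    Returns None when vram_gb is below the smallest tier (4GB).
    """
    best = None
    for min_vram, size_range in VRAM_TIERS:
        if vram_gb >= min_vram:
            best = size_range
    return best


if __name__ == "__main__":
    for vram in (6, 10, 24):
        print(f"{vram}GB VRAM -> {recommended_model_size(vram)}")
```

A card that falls between tiers (say 10GB) is treated as the tier below it, which is the cautious reading of the table.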

🔬 Tested On

Machine: MSI laptop (dual GPU — 24GB total VRAM)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026

What Is Quantization? (The VRAM Multiplier)

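Quantization stores a model's weights at lower precision so the model uses less VRAM, usually with minimal quality loss. Most models in the Ollama library are already published in quantized form, so you normally get this without any extra steps. Here is how the common levels compare for an 8B model: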

Quantization	VRAM for 8B Model	Quality Loss	When to Use
FP16 (16-bit, uncompressed)	~16GB	None (reference)	Research, benchmarking
Q8 (8-bit)	~9GB	Negligible	Best quality if VRAM allows
Q4 (4-bit)	~5GB	Slight	Best balance — use this by default
Q3 (3-bit)	~4GB	Noticeable on complex tasks	Last resort for low VRAM

Rule of thumb: 4-bit quantization (typically Q4_0 or Q4_K_M, the default for most Ollama model tags) cuts VRAM use by roughly 65–70% compared to the uncompressed FP16 model (about 16GB down to about 5GB for an 8B model in the table above), with barely noticeable quality loss.
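
The VRAM column in the table above can be sanity-checked from the bit widths alone. The sketch below computes only the raw weight size (parameters × bits per weight), using the nominal bit counts from the table. Real quantized files come out somewhat larger, since they also store scaling data and the runtime needs working memory, which is why the table shows ~9GB rather than 8GB for Q8. The helper name `raw_weight_gb` is just for illustration.

```python
def raw_weight_gb(params_billions, bits_per_weight):
    """Size of the weights alone, in GB: parameters x bits, divided by 8 bits per byte."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal bit widths for the quantization levels in the table above.
    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q3", 3)]:
        print(f"{name}: {raw_weight_gb(8, bits):.1f} GB of raw weights for an 8B model")
```

Treat these numbers as a floor, not a prediction: the actual VRAM use is always a bit higher.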

How to Calculate Your VRAM Needs

VRAM needed ≈ (model parameters in billions) × (bytes per parameter at quantization level)

Examples (Q4 / 4-bit, using about 0.7 bytes per parameter, i.e. 0.7GB per billion parameters):
  7B model  → 7 × 0.7GB = ~5GB VRAM
  8B model  → 8 × 0.7GB = ~5.6GB VRAM
  14B model → 14 × 0.7GB = ~10GB VRAM
  32B model → 32 × 0.7GB = ~22GB VRAM
  70B model → 70 × 0.7GB = ~49GB VRAM (needs 2× 24GB GPUs or heavy quantization)

Add 1–2GB overhead for context window and KV cache. Larger context windows use more VRAM.
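
The rule above is easy to turn into a quick calculator. This is a rough sketch, not a precise predictor: the 0.7GB-per-billion factor comes from the Q4 examples above, and the default overhead of 1.5GB is simply the midpoint of the 1–2GB range (an assumption on my part). The function names are made up for this example.

```python
GB_PER_BILLION_Q4 = 0.7  # rule-of-thumb factor from the Q4 examples above


def estimate_vram_gb(params_billions, overhead_gb=1.5):
    """Estimate Q4 VRAM use: weights plus a flat allowance for context and KV cache.

    overhead_gb defaults to 1.5, the midpoint of the 1-2GB range (an assumption).
    """
    weights_gb = params_billions * GB_PER_BILLION_Q4
    return weights_gb + overhead_gb


def fits(params_billions, vram_gb, overhead_gb=1.5):
    """Return True if the estimated requirement fits in vram_gb."""
    return estimate_vram_gb(params_billions, overhead_gb) <= vram_gb


if __name__ == "__main__":
    for size in (7, 8, 14, 32, 70):
        need = estimate_vram_gb(size)
        print(f"{size}B at Q4: ~{need:.1f} GB, fits in 12GB: {fits(size, 12)}")
```

If you plan to use long context windows, pass a larger `overhead_gb` instead of relying on the default.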

Real Benchmarks from Our Test Machine

Testing on RTX 5070 Ti (12GB VRAM), Q4 quantization:

Model	Size	VRAM Used	Tokens/sec	Verdict
phi3:mini (3.8B)	2.3GB	2.8GB	~85 t/s	Blazing fast, basic quality
llama3.1:8b	4.7GB	5.5GB	~55 t/s	Great balance
qwen2.5:14b	8.9GB	9.8GB	~32 t/s	Best quality at 12GB
qwen2.5:32b (q4)	19.8GB	Won't fit single GPU	—	Needs 24GB or split across GPUs

With our dual-GPU setup (24GB total), the 32B model runs at ~18 t/s split across both GPUs.

What If You Don't Have a GPU? (CPU-Only)

You can run local AI on CPU only, but it is significantly slower:

Model Size	RAM Needed	Speed (Modern CPU)	Usable?
3B	8GB	15–25 t/s	✅ Yes — feels responsive
7B–8B	16GB	5–12 t/s	⚠️ Usable but slow
14B	32GB	2–5 t/s	❌ Painful

Recommendation: If you don't have a GPU, stick to 3B–8B models. phi3:mini or llama3.1:8b on CPU is usable for basic tasks.
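
To get a feel for what these speeds mean in practice, it helps to convert tokens per second into waiting time. The sketch below does that for a reply of 300 tokens, which is an assumed length for a medium chat answer, not a measured one. The speeds are taken from the CPU table above and from the llama3.1:8b GPU benchmark earlier in this guide.

```python
REPLY_TOKENS = 300  # assumed length of a medium-sized chat answer


def seconds_for_reply(reply_tokens, tokens_per_sec):
    """Time to generate reply_tokens at a steady tokens_per_sec rate."""
    return reply_tokens / tokens_per_sec


if __name__ == "__main__":
    # (label, tokens per second) pairs taken from the tables in this guide.
    speeds = [
        ("GPU, llama3.1:8b", 55),
        ("CPU, 3B (slow end)", 15),
        ("CPU, 7B-8B (slow end)", 5),
        ("CPU, 14B (slow end)", 2),
    ]
    for label, tps in speeds:
        wait = seconds_for_reply(REPLY_TOKENS, tps)
        print(f"{label}: ~{wait:.0f} s for {REPLY_TOKENS} tokens")
```

At 55 t/s the reply takes about 5.5 seconds; at 5 t/s the same reply takes a full minute, which is why 7B–8B on CPU is rated usable but slow.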

GPU Buying Guide by VRAM Tier

💰 Affiliate Note

Some links below may be affiliate links. We only recommend GPUs we believe offer good value for local AI.

6GB VRAM (Budget Entry)

Used RTX 3060 6GB (~$200)
Runs 7B models at Q4 with a short context window
Good for testing if local AI is for you

8GB VRAM (Minimum Recommended)

RTX 4060 8GB (~$300) or used RTX 3080 10GB
Runs 7B–8B models well
Best budget choice for beginners

12GB VRAM (The Sweet Spot)

RTX 4070 12GB (~$550) or used RTX 3060 12GB (~$280)
Runs up to 14B models — excellent quality
Our recommendation for most people

24GB VRAM (Enthusiast)

RTX 3090 24GB (~$700 used) or RTX 4090 24GB (~$1600)
Runs up to 32B models or multiple models simultaneously
For power users and developers

32GB+ VRAM (Multi-GPU or Datacenter)

2× RTX 3090/4090, or Mac Studio with 64GB+ unified memory
Runs 70B models
For researchers and serious tinkerers

Common Mistakes

Mistake 1: Buying More GPU Than You Need

If you just want to chat with AI locally, a 12GB card is plenty. Don't buy a 4090 unless you plan to run 32B+ models or train/fine-tune.

Mistake 2: Ignoring RAM

Your system RAM matters too. If your model spills from VRAM to system RAM, performance tanks. Get at least 32GB RAM if you're running 8B+ models.

Mistake 3: Not Using Quantization

Running FP16 (uncompressed) wastes VRAM. Q4 quantization is nearly indistinguishable in quality and saves 50%+ VRAM. Always use Q4 unless you have a specific reason not to.

What I Would Do

If you are starting: get a 12GB GPU (RTX 4070 or used RTX 3060 12GB). Run qwen2.5:14b at Q4. That combination gives you near-GPT-3.5 quality completely offline and private. If you already have 8GB, you're fine — just stick to 7B–8B models with Q4 quantization.

Frequently Asked Questions

How much VRAM do I need to run local AI models?

For a good beginner experience, 8GB VRAM comfortably runs 7B–8B models like Llama 3.1 at Q4. 12GB allows larger contexts or 14B models. 24GB (RTX 3090/4090) unlocks 30B–34B models. For 70B models at Q4, you need around 48GB, meaning dual GPUs or workstation cards.

Can I run an 8B model with 8GB of VRAM?

Yes, an 8B model at Q4_K_M uses about 4.7-6GB VRAM depending on context length, so 8GB is sufficient. With 8GB you can handle 4,000-8,000 token context windows. For 16K+ contexts or Q8 quantization, you need 12GB or more.
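
Context length matters because of the KV cache, which grows linearly with the number of tokens. The sketch below estimates its size using the standard formula (one key and one value vector per layer per token). The default values are my assumptions: 32 layers, 8 KV heads, and a head dimension of 128, which match the published Llama 3.1 8B configuration as far as I know, plus a 16-bit cache. Other models and runtimes with quantized caches will give different numbers.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB for a given context length.

    Defaults are assumptions based on the Llama 3.1 8B architecture
    with a 16-bit (2-byte) cache; adjust them for other models.
    """
    # Factor 2: each token stores both a key and a value vector in every layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384):
        print(f"{ctx} tokens -> {kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Under these assumptions the cache takes 0.5 GiB at 4,096 tokens, 1 GiB at 8,192, and 2 GiB at 16,384, which is in line with the 1–2GB overhead figure used earlier.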

How much VRAM does a 70B model require?

A 70B model at Q4 needs approximately 40-48GB VRAM, requiring an RTX A6000 (48GB) or dual RTX 3090/4090 (24GB each). At Q2 you can squeeze into 24-28GB, but quality degrades noticeably. Most home users cannot run 70B without significant investment.

Can I supplement GPU VRAM with system RAM?

Yes, Ollama supports offloading layers to system RAM when VRAM is insufficient, but this causes a 5-10x speed penalty. Apple Silicon Macs handle this better due to unified memory. On PC, shared RAM is a fallback, not a recommendation.

How do I measure VRAM usage while running models?

Use 'nvidia-smi -l 1' for real-time VRAM on NVIDIA, or 'ollama ps' to see model memory consumption. On macOS, check Activity Monitor GPU History. Keep usage below 90% to avoid crashes.
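
If you would rather check the 90% threshold from a script than watch the terminal, here is a minimal sketch. It relies on nvidia-smi's CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`), which reports memory in MiB. The function names are made up for this example, and the parser assumes every field is a plain integer, so it will fail on GPUs where nvidia-smi reports a value as not available.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, e.g. "0, 5632, 12288" (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_usage(csv_text):
    """Parse nvidia-smi CSV output into (index, used_mib, total_mib, fraction) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total, used / total))
    return rows


def over_threshold(rows, threshold=0.90):
    """Return the indexes of GPUs whose memory use exceeds the threshold."""
    return [index for index, _, _, fraction in rows if fraction > threshold]


def read_gpu_usage():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_usage(result.stdout)


if __name__ == "__main__":
    # Sample text in the format nvidia-smi produces, so this runs without a GPU.
    sample = "0, 5632, 12288\n1, 11500, 12288\n"
    print("GPUs above 90%:", over_threshold(parse_usage(sample)))
```

On a machine with an NVIDIA GPU, replace the sample with `read_gpu_usage()` to check live values.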


Last Updated: July 1, 2026 — Benchmarks from RTX 5070 Ti + RTX 5070 (dual 12GB). Model availability as of July 2026.