Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 12 min read 💻 Hardware

How Much VRAM Do You Need for Local AI?

⚡ Quick Answer

6GB VRAM runs 7B models (good for basic chat). 12GB VRAM runs 8B–14B models (the sweet spot — quality + speed). 24GB VRAM runs up to 32B models or multiple smaller models simultaneously. 32GB+ lets you run 70B models quantized. More VRAM = bigger models = better quality. Quantization (4-bit) roughly halves the VRAM requirement.

Who This Is For

Read this if: You are buying a GPU, deciding what model to run, or wondering why your current model is slow or crashing.

Skip if: You already know your VRAM and just want model recommendations — go to Best Local AI Models for Beginners.

The VRAM-to-Model Size Table

This is the most important table on this site. Bookmark it.

Your VRAMBest Model SizeExample ModelsQuality Level
4GB1B–3Bphi3:mini, qwen2.5:1.5b, gemma2:2bBasic — fast but limited reasoning
6GB3B–7Bllama3.1:8b (q4), mistral:7b, qwen2.5:7bGood — usable for chat and writing
8GB7B–8Bllama3.1:8b, qwen2.5:7b, gemma2:9b (tight)Good — comfortable chat + light coding
12GB8B–14Bllama3.1:8b (full), qwen2.5:14b, command-rExcellent — the sweet spot
16GB14B–22Bqwen2.5:14b (full), deepseek-coder:33b (q3)Excellent — strong reasoning
24GB22B–34Bqwen2.5:32b (q4), command-r (full), mixtral 8x7bOutstanding — near-GPT-4 quality
32GB+34B–70Bllama3.1:70b (q4), qwen2.5:72b (q3)Top-tier — best open models available

🔬 Tested On

Machine: MSI laptop (dual GPU — 24GB total VRAM)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026

What Is Quantization? (The VRAM Multiplier)

Quantization compresses models to use less VRAM with minimal quality loss. Ollama does this automatically. Here is how it works:

QuantizationVRAM for 8B ModelQuality LossWhen to Use
FP16 (16-bit, uncompressed)~16GBNone (reference)Research, benchmarking
Q8 (8-bit)~9GBNegligibleBest quality if VRAM allows
Q4 (4-bit)~5GBSlightBest balance — use this by default
Q3 (3-bit)~4GBNoticeable on complex tasksLast resort for low VRAM

Rule of thumb: Q4_0 quantization (the default for most Ollama models) gives you roughly 50–60% VRAM savings compared to the uncompressed model, with barely noticeable quality loss.

How to Calculate Your VRAM Needs

VRAM needed ≈ (model parameters in billions) × (bytes per parameter at quantization level)

Examples (Q4 / 4-bit):
  7B model  → 7 × 0.7GB = ~5GB VRAM
  8B model  → 8 × 0.7GB = ~5.6GB VRAM
  14B model → 14 × 0.7GB = ~10GB VRAM
  32B model → 32 × 0.7GB = ~22GB VRAM
  70B model → 70 × 0.7GB = ~49GB VRAM (needs 2× 24GB GPUs or heavy quantization)

Add 1–2GB overhead for context window and KV cache. Larger context windows use more VRAM.

Real Benchmarks from Our Test Machine

Testing on RTX 5070 Ti (12GB VRAM), Q4 quantization:

ModelSizeVRAM UsedTokens/secVerdict
phi3:mini (3.8B)2.3GB2.8GB~85 t/sBlazing fast, basic quality
llama3.1:8b4.7GB5.5GB~55 t/sGreat balance
qwen2.5:14b8.9GB9.8GB~32 t/sBest quality at 12GB
qwen2.5:32b (q4)19.8GBWon't fit single GPUNeeds 24GB or split across GPUs

With our dual-GPU setup (24GB total), the 32B model runs at ~18 t/s split across both GPUs.

What If You Don't Have a GPU? (CPU-Only)

You can run local AI on CPU only, but it is significantly slower:

Model SizeRAM NeededSpeed (Modern CPU)Usable?
3B8GB15–25 t/s✅ Yes — feels responsive
7B–8B16GB5–12 t/s⚠️ Usable but slow
14B32GB2–5 t/s❌ Painful

Recommendation: If you don't have a GPU, stick to 3B–7B models. phi3:mini or llama3.1:8b on CPU is usable for basic tasks.

GPU Buying Guide by VRAM Tier

💰 Affiliate Note

Some links below may be affiliate links. We only recommend GPUs we believe offer good value for local AI.

6GB VRAM (Budget Entry)

8GB VRAM (Minimum Recommended)

12GB VRAM (The Sweet Spot)

24GB VRAM (Enthusiast)

32GB+ VRAM (Multi-GPU or Datacenter)

Common Mistakes

Mistake 1: Buying More GPU Than You Need

If you just want to chat with AI locally, a 12GB card is plenty. Don't buy a 4090 unless you plan to run 32B+ models or train/fine-tune.

Mistake 2: Ignoring RAM

Your system RAM matters too. If your model spills from VRAM to system RAM, performance tanks. Get at least 32GB RAM if you're running 8B+ models.

Mistake 3: Not Using Quantization

Running FP16 (uncompressed) wastes VRAM. Q4 quantization is nearly indistinguishable in quality and saves 50%+ VRAM. Always use Q4 unless you have a specific reason not to.

What I Would Do

If you are starting: get a 12GB GPU (RTX 4070 or used RTX 3060 12GB). Run qwen2.5:14b at Q4. That combination gives you near-GPT-3.5 quality completely offline and private. If you already have 8GB, you're fine — just stick to 7B–8B models with Q4 quantization.

Frequently Asked Questions

How much VRAM do I need to run local AI models?

For a good beginner experience, 8GB VRAM runs 7B-8B models like Llama 3.1 at Q4 comfortably. 12GB allows larger contexts or 13B models. 24GB (RTX 3090/4090) unlocks 30B-34B models. For 70B models, you need 48GB+, meaning dual GPUs or workstation cards.

Can I run an 8B model with 8GB of VRAM?

Yes, an 8B model at Q4_K_M uses about 4.7-6GB VRAM depending on context length, so 8GB is sufficient. With 8GB you can handle 4,000-8,000 token context windows. For 16K+ contexts or Q8 quantization, you need 12GB or more.

How much VRAM does a 70B model require?

A 70B model at Q4 needs approximately 40-48GB VRAM, requiring an RTX A6000 (48GB) or dual RTX 3090/4090 (24GB each). At Q2 you can squeeze into 24-28GB, but quality degrades noticeably. Most home users cannot run 70B without significant investment.

Can I supplement GPU VRAM with system RAM?

Yes, Ollama supports offloading layers to system RAM when VRAM is insufficient, but this causes a 5-10x speed penalty. Apple Silicon Macs handle this better due to unified memory. On PC, shared RAM is a fallback, not a recommendation.

How do I measure VRAM usage while running models?

Use 'nvidia-smi -l 1' for real-time VRAM on NVIDIA, or 'ollama ps' to see model memory consumption. On macOS, check Activity Monitor GPU History. Keep usage below 90% to avoid crashes.

🔧 Want a GPU Recommendation for Your Exact Situation?

Send me your specs, budget, and goals. I will tell you the best GPU and model setup for your money. $99.

Get a $99 Setup Review →

📦 Get the Model Picker + All Checklists

The $19 Starter Kit includes a detailed model picker by VRAM with exact commands.

See the Starter Kit →

Want this guide as a printable checklist?

Get the free Local AI Setup Checklist delivered to your inbox.

Get the Free Checklist

Last Updated: July 1, 2026 — Benchmarks from RTX 5070 Ti + RTX 5070 (dual 12GB). Model availability as of July 2026.