Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 12 min read 💻 Hardware
6GB VRAM runs 7B models (good for basic chat). 12GB VRAM runs 8B–14B models (the sweet spot for quality and speed). 24GB VRAM runs up to 32B models or multiple smaller models simultaneously. 32GB+ lets you run 70B models at aggressive (Q2–Q3) quantization; a 70B model at Q4 needs roughly 40–48GB. More VRAM = bigger models = better quality. 4-bit quantization cuts the VRAM requirement to roughly a third of the uncompressed FP16 size.
Read this if: You are buying a GPU, deciding what model to run, or wondering why your current model is slow or crashing.
Skip if: You already know your VRAM and just want model recommendations — go to Best Local AI Models for Beginners.
This is the most important table on this site. Bookmark it.
| Your VRAM | Best Model Size | Example Models | Quality Level |
|---|---|---|---|
| 4GB | 1B–3B | phi3:mini, qwen2.5:1.5b, gemma2:2b | Basic — fast but limited reasoning |
| 6GB | 3B–7B | llama3.1:8b (q4), mistral:7b, qwen2.5:7b | Good — usable for chat and writing |
| 8GB | 7B–8B | llama3.1:8b, qwen2.5:7b, gemma2:9b (tight) | Good — comfortable chat + light coding |
| 12GB | 8B–14B | llama3.1:8b (full), qwen2.5:14b, command-r | Excellent — the sweet spot |
| 16GB | 14B–22B | qwen2.5:14b (full), deepseek-coder:33b (q3) | Excellent — strong reasoning |
| 24GB | 22B–34B | qwen2.5:32b (q4), command-r (full), mixtral 8x7b | Outstanding — near-GPT-4 quality |
| 32GB+ | 34B–70B | llama3.1:70b (q4, needs ~40–48GB), qwen2.5:72b (q3) | Top-tier: best open models available |
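The table can also be read as a simple lookup: find the largest VRAM tier that does not exceed what you have. The sketch below encodes only the first two columns; the function name `suggest_model_size` and the fallback message for cards under 4GB are illustrative assumptions, not part of any tool.

```python
from bisect import bisect_right

# (minimum VRAM in GB, suggested model size range), copied from the table above.
VRAM_TIERS = [
    (4, "1B–3B"),
    (6, "3B–7B"),
    (8, "7B–8B"),
    (12, "8B–14B"),
    (16, "14B–22B"),
    (24, "22B–34B"),
    (32, "34B–70B"),
]


def suggest_model_size(vram_gb: float) -> str:
    """Return the size range of the largest tier that fits in vram_gb."""
    thresholds = [tier for tier, _ in VRAM_TIERS]
    idx = bisect_right(thresholds, vram_gb) - 1
    if idx < 0:
        # Assumption: the table has no row below 4GB.
        return "below 4GB: consider CPU-only inference"
    return VRAM_TIERS[idx][1]


if __name__ == "__main__":
    for vram in (6, 10, 12, 24):
        print(f"{vram}GB -> {suggest_model_size(vram)}")
```

A card that sits between two rows (10GB, for example) falls back to the lower tier, which is the conservative choice.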
Machine: MSI laptop (dual GPU — 24GB total VRAM)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026
Quantization compresses models to use less VRAM with minimal quality loss. Ollama does this automatically. Here is how it works:
| Quantization | VRAM for 8B Model | Quality Loss | When to Use |
|---|---|---|---|
| FP16 (16-bit, uncompressed) | ~16GB | None (reference) | Research, benchmarking |
| Q8 (8-bit) | ~9GB | Negligible | Best quality if VRAM allows |
| Q4 (4-bit) | ~5GB | Slight | Best balance — use this by default |
| Q3 (3-bit) | ~4GB | Noticeable on complex tasks | Last resort for low VRAM |
Rule of thumb: 4-bit quantization such as Q4_0 or Q4_K_M (the default for most Ollama models) cuts VRAM use by roughly 65–70% compared with the uncompressed FP16 model (about 16GB down to about 5GB for an 8B model), with barely noticeable quality loss.
VRAM needed ≈ (model parameters in billions) × (bytes per parameter at quantization level)
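Examples (Q4 / 4-bit, about 0.7 bytes per parameter in practice, which works out to roughly 0.7GB per billion parameters):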
7B model → 7 × 0.7GB = ~5GB VRAM
8B model → 8 × 0.7GB = ~5.6GB VRAM
14B model → 14 × 0.7GB = ~10GB VRAM
32B model → 32 × 0.7GB = ~22GB VRAM
70B model → 70 × 0.7GB = ~49GB VRAM (needs 2× 24GB GPUs or heavy quantization)
Add 1–2GB overhead for context window and KV cache. Larger context windows use more VRAM.
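The formula and overhead rule above can be sketched as a small estimator. The 0.7 figure for Q4 comes from the examples; the FP16, Q8, and Q3 values are back-calculated from the 8B row of the quantization table and should be treated as rough assumptions. The function names and the 1.5GB default overhead (the midpoint of the 1–2GB range) are illustrative choices.

```python
# Approximate bytes per parameter. Q4 uses the 0.7 figure from the examples
# above; the others are back-calculated from the 8B quantization table
# (~16GB, ~9GB, ~4GB) and are assumptions, not measured values.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.1, "q4": 0.7, "q3": 0.5}


def estimate_vram_gb(params_billions: float, quant: str = "q4",
                     overhead_gb: float = 1.5) -> float:
    """Estimate VRAM: model weights plus context/KV-cache overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb


def fits(params_billions: float, vram_gb: float, quant: str = "q4",
         overhead_gb: float = 1.5) -> bool:
    """True if the estimate stays within the available VRAM."""
    return estimate_vram_gb(params_billions, quant, overhead_gb) <= vram_gb


if __name__ == "__main__":
    for size in (7, 8, 14, 32, 70):
        print(f"{size}B at Q4: ~{estimate_vram_gb(size):.1f}GB, "
              f"fits in 12GB: {fits(size, 12)}")
```

Treat the result as a first filter only. Real usage depends on the exact quantization variant and the context length you configure.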
Testing on RTX 5070 Ti (12GB VRAM), Q4 quantization:
| Model | Size | VRAM Used | Tokens/sec | Verdict |
|---|---|---|---|---|
| phi3:mini (3.8B) | 2.3GB | 2.8GB | ~85 t/s | Blazing fast, basic quality |
| llama3.1:8b | 4.7GB | 5.5GB | ~55 t/s | Great balance |
| qwen2.5:14b | 8.9GB | 9.8GB | ~32 t/s | Best quality at 12GB |
| qwen2.5:32b (q4) | 19.8GB | Won't fit single GPU | — | Needs 24GB or split across GPUs |
With our dual-GPU setup (24GB total), the 32B model runs at ~18 t/s split across both GPUs.
You can run local AI on CPU only, but it is significantly slower:
| Model Size | RAM Needed | Speed (Modern CPU) | Usable? |
|---|---|---|---|
| 3B | 8GB | 15–25 t/s | ✅ Yes — feels responsive |
| 7B–8B | 16GB | 5–12 t/s | ⚠️ Usable but slow |
| 14B | 32GB | 2–5 t/s | ❌ Painful |
Recommendation: If you don't have a GPU, stick to 3B–8B models. phi3:mini or llama3.1:8b on CPU is usable for basic tasks.
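To make the speed ranges in the CPU table more concrete, here is a minimal sketch that converts tokens per second into waiting time for a single reply. The 300-token reply length is an assumption chosen for illustration, and the calculation ignores prompt processing time, so real waits will be somewhat longer.

```python
# Speed ranges (tokens/sec) copied from the CPU table above.
CPU_SPEEDS = {
    "3B": (15, 25),
    "7B-8B": (5, 12),
    "14B": (2, 5),
}


def seconds_for_reply(reply_tokens: int, tokens_per_sec: float) -> float:
    """Generation time only; prompt processing is not included."""
    return reply_tokens / tokens_per_sec


if __name__ == "__main__":
    reply_tokens = 300  # assumed reply length, roughly a few paragraphs
    for size, (slow, fast) in CPU_SPEEDS.items():
        best = seconds_for_reply(reply_tokens, fast)
        worst = seconds_for_reply(reply_tokens, slow)
        print(f"{size}: {best:.0f}-{worst:.0f} seconds per reply")
```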
Some links below may be affiliate links. We only recommend GPUs we believe offer good value for local AI.
If you just want to chat with AI locally, a 12GB card is plenty. Don't buy a 4090 unless you plan to run 32B+ models or train/fine-tune.
Your system RAM matters too. If your model spills from VRAM to system RAM, performance tanks. Get at least 32GB RAM if you're running 8B+ models.
Running FP16 (uncompressed) wastes VRAM. Q4 quantization is nearly indistinguishable in quality and cuts VRAM use by roughly two-thirds. Use Q4 by default unless you have a specific reason not to.
If you are starting: get a 12GB GPU (RTX 4070 or used RTX 3060 12GB). Run qwen2.5:14b at Q4. That combination gives you near-GPT-3.5 quality completely offline and private. If you already have 8GB, you're fine: just stick to 7B–8B models with Q4 quantization.
For a good beginner experience, 8GB VRAM runs 7B-8B models like Llama 3.1 at Q4 comfortably. 12GB allows larger contexts or 13B models. 24GB (RTX 3090/4090) unlocks 30B-34B models. For 70B models, you need 48GB+, meaning dual GPUs or workstation cards.
Yes, an 8B model at Q4_K_M uses about 4.7-6GB VRAM depending on context length, so 8GB is sufficient. With 8GB you can handle 4,000-8,000 token context windows. For 16K+ contexts or Q8 quantization, you need 12GB or more.
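The dependence on context length comes mostly from the KV cache, which grows linearly with the number of tokens held in context. The sketch below uses the standard size formula (keys plus values, per layer, per KV head). The default architecture values (32 layers, 8 KV heads, head dimension 128) are those of Llama 3.1 8B and are not stated in this article; the 16-bit cache is also an assumption, since some runtimes can quantize it.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for a given context length.

    Defaults assume a Llama 3.1 8B-style architecture with a 16-bit cache.
    """
    # Factor of 2: one key tensor plus one value tensor per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1024**3


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384):
        print(f"{ctx} tokens -> {kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so doubling the context window doubles this part of the VRAM bill while the weights stay the same size.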
A 70B model at Q4 needs approximately 40-48GB VRAM, requiring an RTX A6000 (48GB) or dual RTX 3090/4090 (24GB each). At Q2 you can squeeze into 24-28GB, but quality degrades noticeably. Most home users cannot run 70B without significant investment.
Yes, Ollama supports offloading layers to system RAM when VRAM is insufficient, but this causes a 5-10x speed penalty. Apple Silicon Macs handle this better due to unified memory. On PC, shared RAM is a fallback, not a recommendation.
Use 'nvidia-smi -l 1' for real-time VRAM on NVIDIA, or 'ollama ps' to see model memory consumption. On macOS, check Activity Monitor GPU History. Keep usage below 90% to avoid crashes.
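If you would rather check the 90% guideline from a script than watch the terminal, one possible approach is to parse nvidia-smi's CSV query output. The helper names below are hypothetical, and the sketch assumes every GPU reports numeric values in MiB (some devices report `[N/A]`, which this code does not handle).

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def over_threshold(rows: list[tuple[int, int]], limit: float = 0.90) -> list[int]:
    """Return the indices of GPUs whose usage exceeds the limit."""
    return [i for i, (used, total) in enumerate(rows) if used / total > limit]


def read_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)
```

Calling `over_threshold(read_vram())` on a machine with an NVIDIA driver returns the indices of any GPUs above the limit; an empty list means all of them are within the guideline.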
Last Updated: July 1, 2026. Benchmarks from RTX 5070 Ti + RTX 5070 (dual 12GB). Model availability as of July 2026.