Intermediate 📅 Last Updated: July 1, 2026 ⏱️ 15 min read 💻 Hardware
12GB VRAM (single RTX 4070/5070): Run 8B–14B models at Q4, get 28–55 t/s. Best for individuals. ~$550 GPU. 24GB VRAM (RTX 3090/4090): Run 32B models or multiple models at once. 18–50 t/s. For power users. ~$700–$1,600. 32GB+ (dual 16GB or 3090+3090): Run 70B models quantized. The enthusiast tier. ~$1,400+. All three builds below include exact components, models, and benchmarks from our test machine.
Some GPU links below are affiliate links. We only recommend cards we've tested or that offer proven value for local AI. Prices are approximate as of July 2026.
Read this if: You're building or upgrading a machine for local AI and want to know exactly what to buy for your VRAM budget. You want a complete parts list, not vague advice.
Start here: If you don't know how much VRAM you need, read How Much VRAM Do You Need for Local AI? first.
Machine: MSI laptop (dual GPU)
GPU 1: NVIDIA RTX 5070 Ti Laptop (12GB)
GPU 2: NVIDIA RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX (20 cores)
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026
| Tier | VRAM | Best GPU | Max Model | Speed | GPU Cost |
|---|---|---|---|---|---|
| Entry Pro | 12GB | RTX 4070 / 5070 | 14B (Q4) | 28–55 t/s | ~$550 |
| Enthusiast | 24GB | RTX 3090 / 4090 | 32B (Q4) | 18–50 t/s | ~$700–$1,600 |
| Power User | 32GB+ | 2× 16GB or 2× 3090 | 70B (Q3/Q4) | 8–25 t/s | ~$1,400+ |
This is what we recommend for 90% of people. 12GB VRAM runs the best quality-to-size models at comfortable speeds. Our test machine's primary GPU is 12GB.
| Component | Recommendation | Price |
|---|---|---|
| GPU | NVIDIA RTX 4070 12GB or RTX 5070 12GB | ~$550 |
| CPU | Intel Core i5-13600K or Ryzen 5 7600X | ~$200 |
| RAM | 32GB DDR5 (64GB if budget allows) | ~$100 |
| Storage | 1TB NVMe SSD | ~$70 |
| PSU | 750W 80+ Gold | ~$90 |
| Total Build | ~$1,010–$1,100 |
| Model | VRAM Used | Tokens/sec | Best For |
|---|---|---|---|
| llama3.1:8b | 5.5GB | ~55 t/s | Fast chat, general tasks |
| qwen2.5:14b | 9.8GB | ~32 t/s | Best balance — our default |
| qwen2.5-coder:14b | 9.8GB | ~30 t/s | Coding assistant |
| command-r (35B, Q3) | 11.5GB | ~14 t/s | Pushing the limit — tight fit |
24GB VRAM is the sweet spot for power users. You can run 32B models (near-GPT-4 quality) or multiple smaller models simultaneously for agent workflows.
| Component | Recommendation | Price |
|---|---|---|
| GPU (Budget) | Used RTX 3090 24GB | ~$700 |
| GPU (New) | RTX 4090 24GB or RTX 5090 24GB | ~$1,600–$2,000 |
| CPU | Intel Core i7-14700K or Ryzen 9 7900X | ~$350 |
| RAM | 64GB DDR5 | ~$180 |
| Storage | 2TB NVMe SSD | ~$120 |
| PSU | 1000W 80+ Gold (850W minimum) | ~$150 |
| Total Build | ~$1,500–$2,800 |
| Model | VRAM Used | Tokens/sec | Best For |
|---|---|---|---|
| qwen2.5:32b (Q4) | 19.8GB | ~18 t/s | Top-tier reasoning, near GPT-4 |
| mixtral 8x7B (Q4) | 24GB (tight) | ~22 t/s | MoE — fast for its size |
| qwen2.5:14b (Q8) | 15GB | ~38 t/s | High quality + speed |
| 2 models simultaneously | 10+10GB | ~25 t/s each | Multi-agent workflows |
Check RTX 3090 Prices → Check RTX 4090 Prices →
This is our test machine's configuration — dual 12GB GPUs (24GB total, but the architecture lessons apply to 32GB+). For true 32GB+, use dual 16GB cards (RTX 4080 Super) or dual 24GB cards (2× RTX 3090).
| Component | Recommendation | Price |
|---|---|---|
| GPU Option A | 2× Used RTX 3090 24GB (48GB total) | ~$1,400 |
| GPU Option B | 2× RTX 4080 Super 16GB (32GB total) | ~$2,000 |
| CPU | Intel Core i9-14900K or Ryzen 9 7950X | ~$500 |
| RAM | 96GB–128GB DDR5 | ~$300 |
| Storage | 4TB NVMe SSD | ~$250 |
| PSU | 1200W–1600W 80+ Platinum | ~$250 |
| Total Build | ~$2,700–$3,500 |
Multi-GPU requires a motherboard with two PCIe x16 slots (or x8/x8 split). Ollama splits models across GPUs automatically, but there's a slight overhead. For best results, use identical GPUs. Our dual 5070 Ti + 5070 setup works but mixed models can have minor performance variance.
| Model | VRAM Needed | Tokens/sec | Notes |
|---|---|---|---|
| llama3.1:70b (Q3) | ~32GB | ~8 t/s | Usable — flagship open model |
| qwen2.5:72b (Q3) | ~33GB | ~8 t/s | Top-tier reasoning |
| qwen2.5:32b (Q4) + agents | 20GB + 10GB | ~18 t/s | Run main model + agent model |
Testing on RTX 5070 Ti (12GB) + RTX 5070 (12GB), 24GB total, Q4 quantization:
| Model | Single GPU (5070 Ti) | Dual GPU (5070 Ti + 5070) | Improvement |
|---|---|---|---|
| llama3.1:8b | 55 t/s | 58 t/s | Minimal (model fits on one GPU) |
| qwen2.5:14b | 32 t/s | 35 t/s | Slight (fits on one GPU) |
| qwen2.5:32b (Q4) | ❌ Won't fit | 18 t/s | Enables 32B models |
| 2× qwen2.5:14b parallel | ❌ Won't fit | 25 t/s each | Multi-agent workflows |
Key insight: Dual GPU's biggest win isn't speed for small models — it's enabling larger models (32B+) and running multiple models simultaneously for agent workflows.
If your system RAM is less than 2× your VRAM, models that spill will be doubly penalized. Get 64GB+ RAM for any 24GB+ build.
Dual 3090s can pull 700W under load. A 1000W PSU will trip. Get 1200W+ for dual-GPU builds. Check the 12VHPWR connector requirements for 40-series cards.
If you only run 8B–14B models for personal chat, a 12GB card gives identical performance to a $2,000 4090. The 4090 only pays off at 32B+ models.
qwen2.5:14b Q4. The sweet spot.qwen2.5:32b Q4. Near-GPT-4 quality.llama3.1:70b Q3 or multi-model agents.For most people: buy a used RTX 3090 24GB (~$700). It's the best value in local AI right now. You get 24GB VRAM — enough for 32B models at Q4 — at less than half the price of a 4090. Pair it with 64GB RAM and a decent CPU. That build runs models that rival GPT-4, completely offline, for under $1,500 total. If budget is tight, a single RTX 4070 12GB (~$550) with
qwen2.5:14bis the best bang-for-buck entry point.
The RTX 3090 or 4090 with 24GB VRAM is the sweet spot, handling 8B-34B models with large contexts. 12GB cards like RTX 3060 are great budget options for 7B-8B models. 32GB+ is only necessary for 70B models. A used RTX 3090 around $700-800 offers best value.
Technically yes with Ollama and vLLM, but not recommended for consumer GPUs. Mixed GPUs operate at the slowest card speed, and different architectures cause inefficiencies. Two identical GPUs (dual RTX 3090s) work best.
12GB handles 7B-13B models with 4K-8K token contexts at Q4. It runs Llama 3.1 (8B), Mistral (7B), and Qwen2.5 (7B) well, but limits anything above 14B. For coding and general chat, 12GB is a solid budget choice.
Install both GPUs with adequate PSU (1000W+ for dual 3090s) and an NVLink bridge for memory pooling. Ollama auto-detects multiple GPUs and splits layers automatically. Dual RTX 3090s give 48GB effective VRAM for under $1,500.
Laptop GPUs can run local AI but are VRAM-limited (6-8GB) and thermal throttle 20-30% vs desktops. Apple Silicon MacBooks (M2/M3 Max with 32GB+ unified memory) are actually better laptop options for local AI.
Send me your budget, current specs, and what you want to do with local AI. I will tell you the exact GPU, model, and build for your situation — no overselling. $99.
Get a $99 Setup Review →The $19 Starter Kit includes full parts lists for all three tiers, a used-GPU buying checklist, and a price tracking spreadsheet.
See the Starter Kit →Get the free Local AI Setup Checklist delivered to your inbox.
Get the Free ChecklistLast Updated: July 1, 2026 — Benchmarks from RTX 5070 Ti + RTX 5070 dual-GPU testing. Prices as of July 2026 and may vary.