Beginner 📅 Last Updated: July 1, 2026 ⏱️ 10 min read 🧠 Models
Start with ollama run llama3.1:8b — it is the best all-around beginner model. If you have less than 6GB VRAM, use phi3:mini. If you have 12GB+ VRAM, upgrade to qwen2.5:14b for noticeably better quality. All of these are free and run completely offline.
Read this if: You have Ollama installed and want to know which model to actually run. You are overwhelmed by the model zoo and want simple recommendations.
Skip if: You want deep technical comparison of model architectures. This is a practical guide, not a research paper.
Three things matter for local AI:
Machine: MSI laptop (dual GPU)
GPU: NVIDIA RTX 5070 Ti Laptop (12GB) + RTX 5070 (12GB)
CPU: Intel Core Ultra 7 255HX
RAM: 96GB
OS: Ubuntu 26.04 LTS
Date: July 2026
| Attribute | Detail |
|---|---|
| Command | ollama run llama3.1:8b |
| Size | 4.7GB (Q4) |
| VRAM Needed | 6GB (comfortable), 8GB (with large context) |
| Strengths | General chat, writing, basic coding, reasoning |
| Weaknesses | Not the best at any single task — a generalist |
| Best For | Anyone starting out. If you don't know what to pick, pick this. |
| Speed (12GB GPU) | ~55 tokens/sec |
| Attribute | Detail |
|---|---|
| Command | ollama run qwen2.5:14b |
| Size | 8.9GB (Q4) |
| VRAM Needed | 10–12GB |
| Strengths | Excellent reasoning, coding, multilingual, math |
| Weaknesses | Needs 12GB VRAM. Slower than 8B models. |
| Best For | Anyone with 12GB+ VRAM who wants noticeably better quality |
| Speed (12GB GPU) | ~32 tokens/sec |
| Attribute | Detail |
|---|---|
| Command | ollama run phi3 |
| Size | 2.3GB (Q4) |
| VRAM Needed | 4GB (fits on almost anything) |
| Strengths | Extremely fast, runs on low-end hardware, surprisingly capable for size |
| Weaknesses | Limited reasoning depth compared to larger models |
| Best For | Laptops without dedicated GPU, older PCs, quick tasks |
| Speed (CPU only) | ~20 tokens/sec |
| Attribute | Detail |
|---|---|
| Command | ollama run gemma2:9b |
| Size | 5.4GB (Q4) |
| VRAM Needed | 8GB |
| Strengths | Excellent creative writing, natural language, good instruction following |
| Weaknesses | Can be verbose. Not as strong at coding. |
| Best For | Writing, content creation, brainstorming |
| Speed (12GB GPU) | ~48 tokens/sec |
| Attribute | Detail |
|---|---|
| Command | ollama run deepseek-coder-v2 |
| Size | Varies (use 16B version) |
| VRAM Needed | 12GB+ |
| Strengths | Best open-source coding model. Excellent at code generation, debugging, explanation. |
| Weaknesses | Not great for general chat. Focused on code. |
| Best For | Developers who want a local coding assistant |
| Speed (12GB GPU) | ~28 tokens/sec |
| Your Situation | Use This Model |
|---|---|
| "I just want to try local AI" | llama3.1:8b |
| "I have less than 6GB VRAM" | phi3 |
| "I want the best quality I can get" | qwen2.5:14b (needs 12GB) |
| "I mainly write content" | gemma2:9b |
| "I mainly code" | deepseek-coder-v2 |
| "I want it fast and don't care about quality" | qwen2.5:1.5b or phi3 |
| "I have 24GB+ VRAM" | qwen2.5:32b — see VRAM setup guide |
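The table above can also be written as a small helper. This is a minimal sketch, not part of Ollama: `pick_model` is a hypothetical function name, and its thresholds are simply this guide's recommendations (under 6GB, 8GB for gemma2:9b, 12GB+, 24GB+). It takes VRAM as a whole number of GB and an optional task.

```shell
#!/usr/bin/env bash
# pick_model VRAM_GB [TASK]
# Hypothetical helper: maps VRAM (whole GB) and an optional task
# ("chat", "write", or "code") to a model tag recommended in this guide.
pick_model() {
  local vram_gb="$1"
  local task="${2:-chat}"   # default to general chat

  if [ "$vram_gb" -lt 6 ]; then
    echo "phi3"                    # under 6GB: smallest recommended model
  elif [ "$task" = "code" ] && [ "$vram_gb" -ge 12 ]; then
    echo "deepseek-coder-v2"       # coding assistant, 12GB+
  elif [ "$task" = "write" ] && [ "$vram_gb" -ge 8 ]; then
    echo "gemma2:9b"               # writing-focused, 8GB
  elif [ "$vram_gb" -ge 24 ]; then
    echo "qwen2.5:32b"
  elif [ "$vram_gb" -ge 12 ]; then
    echo "qwen2.5:14b"
  else
    echo "llama3.1:8b"             # safe default for 6GB to 11GB
  fi
}
```

For example, `ollama run "$(pick_model 12 code)"` would start deepseek-coder-v2. On NVIDIA hardware, `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` prints total VRAM in MiB, which you can divide by 1024 to get the number to pass in.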
```shell
# List installed models
ollama list

# Run a different model
ollama run qwen2.5:14b

# Remove a model you don't need
ollama rm gemma2:9b

# Pull a new model without running it
ollama pull deepseek-coder-v2
```
Bigger is not always better. A 70B model that doesn't fit in VRAM falls back to the CPU and will run slower than an 8B model that does fit. Match the model to your hardware. See How Much VRAM Do You Need?
Different models have different personalities. Test 2–3 models with the same prompt and compare. You will be surprised how much they differ.
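One way to run that side-by-side test is a short loop. This is a sketch under the assumption that the models are already pulled; `compare_models` is a hypothetical helper name. It relies on `ollama run MODEL "PROMPT"`, which answers a single prompt and exits instead of opening an interactive chat.

```shell
#!/usr/bin/env bash
# compare_models PROMPT MODEL [MODEL...]
# Hypothetical helper: sends the same prompt to each model and prints
# every answer under a header so the outputs are easy to compare.
compare_models() {
  local prompt="$1"
  shift                      # remaining arguments are model tags
  local model
  for model in "$@"; do
    printf '===== %s =====\n' "$model"
    ollama run "$model" "$prompt"
    printf '\n'
  done
}
```

For example: `compare_models "Explain recursion in two sentences." llama3.1:8b gemma2:9b phi3`. The first answer from each model is slower because the model has to load into memory.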
If a guide recommends Llama 2 or Vicuna, it is outdated. As of July 2026, the models above are the current best.
Install llama3.1:8b first. Chat with it for a day. Then try qwen2.5:14b if you have the VRAM. The jump from 8B to 14B is noticeable — better reasoning, better coding, better instruction following. If you code, add deepseek-coder-v2 to your rotation.
Llama 3.1 (8B) is the best place to start: it is well supported, widely documented, and runs comfortably on 6GB of VRAM (8GB if you use a large context). Install Ollama and run 'ollama run llama3.1:8b'. If your hardware is limited, try Phi-3 Mini (3.8B) or Llama 3.2 (3B), which run with 4GB of RAM.
8B means 8 billion parameters (bigger = smarter but slower). Q4_K_M is 4-bit quantization, shrinking the model with minimal quality loss. GGUF is the file format for llama.cpp and Ollama. Beginners can use Ollama defaults.
8GB VRAM: 3B-8B models. 12-16GB: 7B-14B models. 24GB (RTX 3090/4090): 14B-34B models. 48GB+: 70B models. Always match model size to VRAM to avoid the slow CPU fallback.
Quantization trades a slight loss in quality for dramatically smaller files. Q4_K_M (4-bit) is the community standard: about 70% smaller than the 16-bit original with negligible loss. Q8 offers near-original quality at roughly twice the memory of Q4. Avoid Q2/Q3, because reasoning degrades noticeably.
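The arithmetic behind those numbers is simple: file size is roughly parameters times bits per weight, divided by 8. The sketch below assumes approximate averages of 4.8 bits per weight for Q4_K_M and 16 for the unquantized FP16 original; these figures are assumptions for illustration, and `estimate_gb` is a hypothetical helper name. Real files differ slightly because of metadata and mixed-precision layers.

```shell
#!/usr/bin/env bash
# estimate_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough model file size in GB: parameters (in billions) * bits / 8.
# awk is used because shell arithmetic is integer-only.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# percent_saved QUANT_BITS ORIGINAL_BITS
# How much smaller the quantized file is than the original, in percent.
percent_saved() {
  awk -v q="$1" -v o="$2" 'BEGIN { printf "%.0f\n", (1 - q / o) * 100 }'
}
```

With these assumptions, `estimate_gb 8 4.8` prints `4.8` (close to the 4.7GB listed for llama3.1:8b), `estimate_gb 8 16` prints `16.0`, and `percent_saved 4.8 16` prints `70`.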
In Ollama, run 'ollama pull llama3.1' again to update. Check ollama.com/library for new releases. Remove old models with 'ollama rm [name]' to free disk space. Major models appear in Ollama within days of release.
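If you have several models installed, the update step can be looped. This is a minimal sketch, assuming `ollama list` prints a header row followed by one model per line with the name in the first column; `update_all_models` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# update_all_models
# Hypothetical helper: re-pulls every model reported by `ollama list`.
update_all_models() {
  local name
  # Skip the header row and keep only the first column (the model name).
  ollama list | awk 'NR > 1 { print $1 }' | while read -r name; do
    echo "Updating $name"
    # Redirect stdin so the pull cannot consume the list being read.
    ollama pull "$name" < /dev/null
  done
}
```

Pulling a model that is already current only verifies it, so the loop is safe to rerun.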
Last Updated: July 1, 2026. Verified against Ollama 0.4.0. Model rankings current as of July 2026.