Overview
The Ollama model registry lists hundreds of models; most of them are wrong for your hardware or task. The selection decision has two axes: what fits in memory, and what performs best on the task. The right process is to filter by hardware first, then benchmark on your actual task. Benchmark tables from model releases are a starting point, not the answer.
Filter by VRAM before anything else
A model that does not fit in VRAM spills to CPU RAM, which reduces throughput by 10x or more. Know your memory budget before opening the model registry.
| Hardware | VRAM / Unified Memory | Practical max model |
|---|---|---|
| M2/M3 MacBook Pro 16 GB | 16 GB unified | 8B at Q4, or 7B at Q8 |
| M2/M3 MacBook Pro 32 GB | 32 GB unified | 32B at Q4 |
| M3 Max / M4 Max 64 GB | 64 GB unified | 70B at Q4 |
| Single A100 80 GB | 80 GB VRAM | 70B at Q8, or 32B at FP16 |
| Dual A100 / H100 80 GB | 160 GB VRAM | 70B at FP16; 405B at Q4 needs ~240 GB and does not fit |
Add 10 to 20 percent headroom above the model size for the KV cache at your target context length. A 70B Q4 model is roughly 40 GB of weights; an 8K context adds another 4 to 6 GB.
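You can also check fit empirically rather than by arithmetic alone: load a candidate and ask Ollama where it ended up. A quick check using the 70B example above; the exact columns of `ollama ps` vary by version, but the in-memory size and GPU/CPU split are the numbers that matter:

```bash
# Load the model by sending it one short prompt (pulls it first if needed).
ollama run llama3.3:70b-instruct-q4_K_M "Reply with OK."

# List loaded models with their in-memory size and processor split.
# Anything other than 100% GPU means part of the model spilled to CPU RAM.
ollama ps
```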
Use Llama 3.3 70B for best general reasoning when memory allows
Llama 3.3 70B outperforms most open-weight models on reasoning, instruction following, and multi-step tasks as of 2026. It is the default pick when you have 48 GB or more of addressable memory.
```bash
ollama pull llama3.3:70b-instruct-q4_K_M
```
- Runs comfortably on 64 GB unified memory (M3 Max, M4 Max) or a single 80 GB A100.
- Q4_K_M is the right quantization for most tasks; Q8_0 if quality is the constraint and memory allows. See ollama-quantization.
- Not suitable for edge devices, laptops with less than 48 GB, or latency-sensitive batch jobs.
Use Qwen 2.5 32B for the best quality-per-GB tradeoff
Qwen 2.5 32B matches or exceeds Llama 3.3 70B on many benchmarks at half the memory footprint. Qwen 2.5 Coder 32B is the strongest open-weight model for code generation and review as of 2026.
```bash
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q8_0
```
- Fits on 24 GB VRAM (RTX 3090/4090) or 32 GB unified memory at Q4.
- Qwen 2.5 Coder 32B at Q8 fits on 36 GB; use it for code review agents and refactoring tasks.
- Both models follow structured output instructions reliably, making them good choices for extraction pipelines.
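For extraction pipelines specifically, Ollama's JSON mode keeps the output machine-parseable. A minimal sketch against the local /api/chat endpoint, assuming the default port and the Qwen tag pulled above; the invoice fields in the prompt are only illustrative:

```bash
# Ask for JSON output; describe the expected shape in the prompt itself.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:32b-instruct-q4_K_M",
  "format": "json",
  "stream": false,
  "messages": [{
    "role": "user",
    "content": "Extract {\"vendor\": string, \"amount\": number} as JSON from: Invoice from Acme Corp for $1,200."
  }]
}' | jq -r '.message.content'
```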
Use Mistral Small or Llama 3.1 8B for low-memory and high-throughput workloads
For laptops under 16 GB, edge deployments, or batch jobs where throughput matters more than quality, smaller models are the correct pick.
```bash
ollama pull mistral-small:24b-instruct-q4_K_M   # 16 GB sweet spot
ollama pull llama3.1:8b-instruct-q4_K_M         # fits in 8 GB
```
- Mistral Small 3 (24B) is a balanced mid-tier pick: it fits on a 16 GB MacBook Pro (unified) and punches above its weight on instruction following.
- Llama 3.1 8B runs in 8 GB of VRAM or unified memory. Use it for classification, short extraction, or high-volume preprocessing where errors are filtered downstream.
- At 8B, quality drops are real. Compensate with few-shot examples in the prompt (see prompt-design) and tighter output constraints via ollama-modelfile.
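As a sketch of that few-shot pattern, a ticket classifier against /api/generate with temperature 0 for stable labels; the labels and example tickets are placeholders for your own:

```bash
# Few-shot classification: two worked examples, then the ticket to label.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "stream": false,
  "options": {"temperature": 0},
  "prompt": "Classify the ticket as BUG, BILLING, or OTHER. Answer with one word.\n\nTicket: I was charged twice this month.\nLabel: BILLING\n\nTicket: The export button crashes the app.\nLabel: BUG\n\nTicket: Login fails with a 500 error after the update.\nLabel:"
}' | jq -r '.response'
```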
For embeddings, use a dedicated embedding model
Generative models (Llama, Qwen, Mistral) produce poor embeddings. Use a dedicated model.
```bash
ollama pull nomic-embed-text
ollama pull bge-large   # 1024 dimensions, strongest open-weight English retrieval
```
Run the embedding model as a separate Ollama instance or in a separate process to avoid memory contention with the generative model. See embeddings for model selection rules and rag for how embeddings plug into a retrieval pipeline.
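To produce a vector, call the embeddings endpoint directly. A minimal sketch, assuming nomic-embed-text is pulled and the server runs on the default port (newer Ollama releases also expose a batch-capable /api/embed endpoint); the second command shows one way to run the isolated instance mentioned above:

```bash
# Embed a single string; the response carries an "embedding" array.
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "How do I rotate the API keys?"
}' | jq '.embedding | length'

# Isolate embedding traffic from the generative model with a second Ollama
# instance bound to its own port (run in its own terminal or as a service).
OLLAMA_HOST=127.0.0.1:11435 ollama serve
```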
Benchmark on your own task before committing to a model
Public benchmarks (MMLU, HumanEval, GSM8K) measure general capability. They do not predict performance on your specific task, domain vocabulary, or output format.
- Build a 50- to 100-question golden set with known correct answers.
- Run each candidate model on the golden set with the same prompt template.
- Score correctness, format adherence, and latency separately.
- Pick the smallest model that meets the quality bar on your golden set.
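A minimal harness along these lines, assuming a local Ollama server, `jq`, and a `golden.jsonl` file with one {"prompt", "answer"} object per line (the file name and fields are placeholders); the substring check stands in for your real correctness and format-adherence rules, and latency is read from the `total_duration` field Ollama reports in nanoseconds:

```bash
#!/usr/bin/env bash
# Sketch: run each candidate model over the same golden set, report accuracy and mean latency.
set -euo pipefail

models=("qwen2.5:32b-instruct-q4_K_M" "llama3.1:8b-instruct-q4_K_M")

for model in "${models[@]}"; do
  correct=0 total=0 ns_sum=0
  while IFS= read -r row; do
    prompt=$(jq -r '.prompt' <<<"$row")
    expected=$(jq -r '.answer' <<<"$row")
    resp=$(curl -s http://localhost:11434/api/generate \
      -d "$(jq -cn --arg m "$model" --arg p "$prompt" \
             '{model: $m, prompt: $p, stream: false, options: {temperature: 0}}')")
    got=$(jq -r '.response' <<<"$resp")
    ns_sum=$(( ns_sum + $(jq -r '.total_duration' <<<"$resp") ))
    # Crude correctness rule: the expected answer appears in the output.
    # Swap in exact-match, regex, or a schema check to score format adherence.
    if [[ "$got" == *"$expected"* ]]; then
      correct=$((correct + 1))
    fi
    total=$((total + 1))
  done < golden.jsonl
  echo "$model  correct=$correct/$total  mean_latency_ms=$(( ns_sum / total / 1000000 ))"
done
```

Only the models array and the scoring rule change per task; the loop stays the same across candidates.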
The cheapest model that passes your eval is the right model. Reserve larger models for subtasks where smaller ones fail.