Overview
The Ollama model registry lists hundreds of models; most of them are wrong for your hardware or task. The selection decision has two axes: what fits in memory, and what performs best on the task. The right process is to filter by hardware first, then benchmark on your actual task. Benchmark tables from model releases are a starting point, not the answer.
Filter by VRAM before anything else
A model that does not fit in VRAM spills to CPU RAM, which reduces throughput by 10x or more. Know your memory budget before opening the model registry.
| Hardware | VRAM / Unified Memory | Practical max model |
|---|---|---|
| M2/M3 MacBook Pro 16 GB | 16 GB unified | 8B at Q4, or 7B at Q8 |
| M2/M3 MacBook Pro 32 GB | 32 GB unified | 32B at Q4 |
| M3 Max / M4 Max 64 GB | 64 GB unified | 70B at Q4 |
| Single A100 80 GB | 80 GB VRAM | 70B at Q8, or 32B at FP16 |
| Dual A100 / H100 80 GB | 160 GB VRAM | 70B at FP16; 405B at Q4 needs ~240 GB and does not fit |
Add 10 to 20 percent headroom above the model size for the KV cache at your target context length. A 70B Q4 model is roughly 40 GB of weights; an 8K context adds another 4 to 6 GB.
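You can also check fit empirically rather than by arithmetic alone: load a candidate and ask Ollama where it ended up. A quick check using the 70B example above; the exact columns of `ollama ps` vary by version, but the in-memory size and GPU/CPU split are the numbers that matter:

```bash
# Load the model by sending it one short prompt (pulls it first if needed).
ollama run llama3.3:70b-instruct-q4_K_M "Reply with OK."

# List loaded models with their in-memory size and processor split.
# Anything other than 100% GPU means part of the model spilled to CPU RAM.
ollama ps
```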
Use Llama 3.3 70B for best general reasoning when memory allows
Llama 3.3 70B outperforms most open-weight models on reasoning, instruction following, and multi-step tasks as of 2026. It is the default pick when you have 48 GB or more of addressable memory.
```bash
ollama pull llama3.3:70b-instruct-q4_K_M
```
- Runs comfortably on 64 GB unified memory (M3 Max, M4 Max) or a single 80 GB A100.
- Q4_K_M is the right quantization for most tasks; Q8_0 if quality is the constraint and memory allows. See ollama-quantization.
- Not suitable for edge devices, laptops with less than 48 GB, or latency-sensitive batch jobs.
Use Qwen 2.5 32B for the best quality-per-GB tradeoff
Qwen 2.5 32B matches or exceeds Llama 3.3 70B on many benchmarks at half the memory footprint. Qwen 2.5 Coder 32B is the strongest open-weight model for code generation and review as of 2026.
```bash
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q8_0
```
- Fits on 24 GB VRAM (RTX 3090/4090) or 32 GB unified memory at Q4.
- Qwen 2.5 Coder 32B at Q8 fits on 36 GB; use it for code review agents and refactoring tasks.
- Both models follow structured output instructions reliably, making them good choices for extraction pipelines.
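For extraction pipelines specifically, Ollama's JSON mode keeps the output machine-parseable. A minimal sketch against the local /api/chat endpoint, assuming the default port and the Qwen tag pulled above; the invoice fields in the prompt are only illustrative:

```bash
# Ask for JSON output; describe the expected shape in the prompt itself.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:32b-instruct-q4_K_M",
  "format": "json",
  "stream": false,
  "messages": [{
    "role": "user",
    "content": "Extract {\"vendor\": string, \"amount\": number} as JSON from: Invoice from Acme Corp for $1,200."
  }]
}' | jq -r '.message.content'
```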
Use Mistral Small or Llama 3.1 8B for low-memory and high-throughput workloads
For laptops under 16 GB, edge deployments, or batch jobs where throughput matters more than quality, smaller models are the correct pick.
```bash
ollama pull mistral-small:24b-instruct-q4_K_M   # 16 GB sweet spot
ollama pull llama3.1:8b-instruct-q4_K_M         # fits in 8 GB
```
- Mistral Small 3 (24B) is a balanced mid-tier pick: it fits on a 16 GB MacBook Pro (unified) and punches above its weight on instruction following.
- Llama 3.1 8B runs in 8 GB of VRAM or unified memory. Use it for classification, short extraction, or high-volume preprocessing where errors are filtered downstream.
- At 8B, quality drops are real. Compensate with few-shot examples in the prompt (see prompt-design) and tighter output constraints via ollama-modelfile.
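As a sketch of that few-shot pattern, a ticket classifier against /api/generate with temperature 0 for stable labels; the labels and example tickets are placeholders for your own:

```bash
# Few-shot classification: two worked examples, then the ticket to label.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "stream": false,
  "options": {"temperature": 0},
  "prompt": "Classify the ticket as BUG, BILLING, or OTHER. Answer with one word.\n\nTicket: I was charged twice this month.\nLabel: BILLING\n\nTicket: The export button crashes the app.\nLabel: BUG\n\nTicket: Login fails with a 500 error after the update.\nLabel:"
}' | jq -r '.response'
```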
For embeddings, use a dedicated embedding model
Generative models (Llama, Qwen, Mistral) produce poor embeddings. Use a dedicated model.
```bash
ollama pull nomic-embed-text
ollama pull bge-large   # 1024 dimensions, strongest open-weight English retrieval
```
Run the embedding model as a separate Ollama instance or in a separate process to avoid memory contention with the generative model. See embeddings for model selection rules and rag for how embeddings plug into a retrieval pipeline.
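To produce a vector, call the embeddings endpoint directly. A minimal sketch, assuming nomic-embed-text is pulled and the server runs on the default port (newer Ollama releases also expose a batch-capable /api/embed endpoint); the second command shows one way to run the isolated instance mentioned above:

```bash
# Embed a single string; the response carries an "embedding" array.
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "How do I rotate the API keys?"
}' | jq '.embedding | length'

# Isolate embedding traffic from the generative model with a second Ollama
# instance bound to its own port (run in its own terminal or as a service).
OLLAMA_HOST=127.0.0.1:11435 ollama serve
```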
Benchmark on your own task before committing to a model
Public benchmarks (MMLU, HumanEval, GSM8K) measure general capability. They do not predict performance on your specific task, domain vocabulary, or output format.
- Build a 50- to 100-question golden set with known correct answers.
- Run each candidate model on the golden set with the same prompt template.
- Score correctness, format adherence, and latency separately.
- Pick the smallest model that meets the quality bar on your golden set.
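A minimal harness along these lines, assuming a local Ollama server, `jq`, and a `golden.jsonl` file with one {"prompt", "answer"} object per line (the file name and fields are placeholders); the substring check stands in for your real correctness and format-adherence rules, and latency is read from the `total_duration` field Ollama reports in nanoseconds:

```bash
#!/usr/bin/env bash
# Sketch: run each candidate model over the same golden set, report accuracy and mean latency.
set -euo pipefail

models=("qwen2.5:32b-instruct-q4_K_M" "llama3.1:8b-instruct-q4_K_M")

for model in "${models[@]}"; do
  correct=0 total=0 ns_sum=0
  while IFS= read -r row; do
    prompt=$(jq -r '.prompt' <<<"$row")
    expected=$(jq -r '.answer' <<<"$row")
    resp=$(curl -s http://localhost:11434/api/generate \
      -d "$(jq -cn --arg m "$model" --arg p "$prompt" \
             '{model: $m, prompt: $p, stream: false, options: {temperature: 0}}')")
    got=$(jq -r '.response' <<<"$resp")
    ns_sum=$(( ns_sum + $(jq -r '.total_duration' <<<"$resp") ))
    # Crude correctness rule: the expected answer appears in the output.
    # Swap in exact-match, regex, or a schema check to score format adherence.
    if [[ "$got" == *"$expected"* ]]; then
      correct=$((correct + 1))
    fi
    total=$((total + 1))
  done < golden.jsonl
  echo "$model  correct=$correct/$total  mean_latency_ms=$(( ns_sum / total / 1000000 ))"
done
```

Only the models array and the scoring rule change per task; the loop stays the same across candidates.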
The cheapest model that passes your eval is the right model. Reserve larger models for subtasks where smaller ones fail.