Overview
Quantization compresses model weights from 16-bit or 32-bit floats to lower-bit integers, reducing memory footprint at the cost of some quality. Ollama uses GGUF quantization from llama.cpp. The default Q4_K_M is the right starting point for most workloads. Understanding the tradeoff ladder lets you make an informed decision when memory is tight or quality is non-negotiable.
Use Q4_K_M as the default quantization
Q4_K_M stores most weights at 4 bits using a K-quant scheme that concentrates precision in the layers that matter most. It offers the best quality-per-byte in the 4-bit tier.
ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
- Memory cost: roughly 0.55 GB per billion parameters. A 70B model needs about 38 GB; a 32B model about 18 GB.
- Quality loss: 1 to 2 percent on most benchmarks compared to FP16. Often imperceptible for instruction following and chat.
- When to change it: move up when quality on your eval set is insufficient; move down only when memory forces it.
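To confirm what you actually pulled, ollama show prints a model's metadata (in recent Ollama versions this includes the quantization level), and ollama list reports the on-disk size of each tag:

ollama show llama3.3:70b-instruct-q4_K_M
ollama list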
Choose Q8_0 when quality matters and memory permits
Q8_0 stores weights at 8 bits. Quality is nearly indistinguishable from FP16, and the format is simpler than K-quants, which makes dequantization faster on some hardware.
ollama pull qwen2.5-coder:32b-instruct-q8_0
ollama pull llama3.2:8b-instruct-q8_0
- Memory cost: roughly 1.1 GB per billion parameters. A 32B model needs about 34 GB.
- When to use it: code generation, reasoning chains, extraction tasks where errors are costly. The quality difference over Q4 is visible on structured output and multi-step math.
- On Apple Silicon, Q8_0 often matches or beats Q4 in tokens per second because of the simpler dequantization path.
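To see the runtime footprint rather than the on-disk size, load the model and check ollama ps, which reports resident memory and whether the model landed on GPU, CPU, or a split:

ollama run qwen2.5-coder:32b-instruct-q8_0 "Reply with OK" > /dev/null
ollama ps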
Use FP16 only when exact reproducibility or maximum fidelity is required
FP16 preserves the weights at full 16-bit precision. No quantization is applied, so it serves as the quality reference point.
ollama pull llama3.2:8b-instruct-fp16
- Memory cost: 2 GB per billion parameters. An 8B model needs 16 GB; a 32B model needs 64 GB.
- Use cases: evaluations where you need to isolate quantization error from other factors; reference implementations; small models where quality is critical and memory is available.
- FP16 is rarely the right choice in production on a GPU with less than 64 GB.
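If both variants are already pulled, ollama list makes the footprint gap concrete; the fp16 tag should show roughly twice the size of the q8_0 tag:

ollama list | grep llama3.2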
Understand the K-quant ladder
The full ladder from lowest to highest quality:
| Level | Bits/weight | Memory (70B) | Relative quality |
|---|---|---|---|
| Q2_K | ~2.5 | ~18 GB | Noticeable degradation |
| Q3_K_M | ~3.5 | ~26 GB | Acceptable for low-stakes tasks |
| Q4_K_M | ~4.5 | ~38 GB | Default; 1-2% loss |
| Q5_K_M | ~5.5 | ~46 GB | ~0.5% loss |
| Q6_K | ~6.5 | ~53 GB | Near Q8 quality |
| Q8_0 | ~8 | ~70 GB | Near FP16 |
| FP16 | 16 | ~140 GB | Reference |
Move down the ladder only when you cannot fit the model at Q4_K_M. Each step down is a quality tradeoff; measure it on your task.
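The memory column follows a simple rule of thumb: weight memory in GB ≈ parameters in billions × bits per weight ÷ 8. This counts weights only; the KV cache and runtime overhead add more on top, and actual GGUF file sizes vary slightly by architecture. A quick sanity check, assuming awk is available:

# weights-only estimate: billions of parameters * bits per weight / 8
awk 'BEGIN { printf "%.1f GB\n", 70 * 4.5 / 8 }'

This prints 39.4 GB for a 70B model at Q4_K_M, in line with the table before overhead.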
Measure quantization impact on your own eval set, not benchmarks
Public benchmarks (MMLU, HumanEval) mask quantization effects because they aggregate over many tasks. For your specific use case, run a comparison.
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5:32b-instruct-q8_0
python eval.py --model qwen2.5:32b-instruct-q4_K_M --golden golden.json
python eval.py --model qwen2.5:32b-instruct-q8_0 --golden golden.json
If Q4_K_M passes your quality bar, use it. If not, step up to Q5_K_M or Q6_K before jumping to Q8: each step up costs memory, and Q8 nearly doubles the Q4 footprint while the quality gain shrinks, so Q5 or Q6 is often enough.
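eval.py above stands in for whatever harness you already use, and golden.json for your task-specific golden set. If you have neither, a minimal sketch of the comparison is a loop over both tags against Ollama's local /api/generate endpoint (this assumes ollama serve is running on the default port 11434 and jq is installed; the prompt is a placeholder):

for m in qwen2.5:32b-instruct-q4_K_M qwen2.5:32b-instruct-q8_0; do
  echo "== $m =="
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$m\", \"prompt\": \"Extract the year: the company was founded in 1998.\", \"stream\": false}" \
    | jq -r '.response'
done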
Never mix quantization levels within an agent pipeline that compares outputs
Two calls to the same base model at different quantizations sample from different probability distributions. If you compare, merge, or score outputs across calls, use the same model and quantization throughout. See ollama-modelfile for how to pin quantization in a Modelfile.
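One way to pin it is to bake the fully qualified tag into a Modelfile and give the pipeline a single named model to call (pipeline-extractor is a hypothetical name):

cat > Modelfile <<'EOF'
FROM qwen2.5:32b-instruct-q8_0
EOF
ollama create pipeline-extractor -f Modelfile

Every stage then calls pipeline-extractor, and nothing can silently drift to a different quantization.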