Overview
Quantization compresses model weights from 16-bit or 32-bit floats to lower-bit integers, reducing memory footprint at the cost of some quality. Ollama uses GGUF quantization from llama.cpp. The default Q4_K_M is the right starting point for most workloads. Understanding the tradeoff ladder lets you make an informed decision when memory is tight or quality is non-negotiable.
Use Q4_K_M as the default quantization
Q4_K_M stores most weights at 4 bits using a K-quant scheme that concentrates precision in the layers that matter most. It offers the best quality-per-byte in the 4-bit tier.
ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
- Memory cost: roughly 0.55 GB per billion parameters. A 70B model needs about 38 GB; a 32B model about 18 GB.
- Quality loss: 1 to 2 percent on most benchmarks compared to FP16. Often imperceptible for instruction following and chat.
- When to change it: move up when quality on your eval set is insufficient; move down only when memory forces it.
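To confirm what you actually pulled, ollama show prints a model's metadata (in recent Ollama versions this includes the quantization level), and ollama list reports the on-disk size of each tag:

ollama show llama3.3:70b-instruct-q4_K_M
ollama list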
Choose Q8_0 when quality matters and memory permits
Q8_0 stores weights at 8 bits. Quality is nearly indistinguishable from FP16, and the format is simpler than K-quants, which makes dequantization faster on some hardware.
ollama pull qwen2.5-coder:32b-instruct-q8_0
ollama pull llama3.2:8b-instruct-q8_0
- Memory cost: roughly 1.1 GB per billion parameters. A 32B model needs about 34 GB.
- When to use it: code generation, reasoning chains, extraction tasks where errors are costly. The quality difference over Q4 is visible on structured output and multi-step math.
- On Apple Silicon, Q8_0 often matches or beats Q4 in tokens per second because of the simpler dequantization path.
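To see the runtime footprint rather than the on-disk size, load the model and check ollama ps, which reports resident memory and whether the model landed on GPU, CPU, or a split:

ollama run qwen2.5-coder:32b-instruct-q8_0 "Reply with OK" > /dev/null
ollama ps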
Use FP16 only when exact reproducibility or maximum fidelity is required
FP16 preserves the weights at full 16-bit precision. No quantization is applied, so it serves as the quality reference point.
ollama pull llama3.2:8b-instruct-fp16
- Memory cost: 2 GB per billion parameters. An 8B model needs 16 GB; a 32B model needs 64 GB.
- Use cases: evaluations where you need to isolate quantization error from other factors; reference implementations; small models where quality is critical and memory is available.
- FP16 is rarely the right choice in production on a GPU with less than 64 GB.
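If both variants are already pulled, ollama list makes the footprint gap concrete; the fp16 tag should show roughly twice the size of the q8_0 tag:

ollama list | grep llama3.2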
Understand the K-quant ladder
The full ladder from lowest to highest quality:
| Level | Bits/weight | Memory (70B) | Relative quality |
|---|---|---|---|
| Q2_K | ~2.5 | ~18 GB | Noticeable degradation |
| Q3_K_M | ~3.5 | ~26 GB | Acceptable for low-stakes tasks |
| Q4_K_M | ~4.5 | ~38 GB | Default; 1-2% loss |
| Q5_K_M | ~5.5 | ~46 GB | ~0.5% loss |
| Q6_K | ~6.5 | ~53 GB | Near Q8 quality |
| Q8_0 | ~8 | ~70 GB | Near FP16 |
| FP16 | 16 | ~140 GB | Reference |
Move down the ladder only when you cannot fit the model at Q4_K_M. Each step down is a quality tradeoff; measure it on your task.
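The memory column follows a simple rule of thumb: weight memory in GB ≈ parameters in billions × bits per weight ÷ 8. This counts weights only; the KV cache and runtime overhead add more on top, and actual GGUF file sizes vary slightly by architecture. A quick sanity check, assuming awk is available:

# weights-only estimate: billions of parameters * bits per weight / 8
awk 'BEGIN { printf "%.1f GB\n", 70 * 4.5 / 8 }'

This prints 39.4 GB for a 70B model at Q4_K_M, in line with the table before overhead.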
Measure quantization impact on your own eval set, not benchmarks
Public benchmarks (MMLU, HumanEval) mask quantization effects because they aggregate over many tasks. For your specific use case, run a comparison.
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5:32b-instruct-q8_0
python eval.py --model qwen2.5:32b-instruct-q4_K_M --golden golden.json
python eval.py --model qwen2.5:32b-instruct-q8_0 --golden golden.json
If Q4_K_M passes your quality bar, use it. If not, step up to Q5_K_M or Q6_K before jumping to Q8: each step up costs memory, and Q8 nearly doubles the Q4 footprint while the quality gain shrinks, so Q5 or Q6 is often enough.
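eval.py above stands in for whatever harness you already use, and golden.json for your task-specific golden set. If you have neither, a minimal sketch of the comparison is a loop over both tags against Ollama's local /api/generate endpoint (this assumes ollama serve is running on the default port 11434 and jq is installed; the prompt is a placeholder):

for m in qwen2.5:32b-instruct-q4_K_M qwen2.5:32b-instruct-q8_0; do
  echo "== $m =="
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$m\", \"prompt\": \"Extract the year: the company was founded in 1998.\", \"stream\": false}" \
    | jq -r '.response'
done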
Never mix quantization levels within an agent pipeline that compares outputs
Two calls to the same base model at different quantizations sample from different probability distributions. If you compare, merge, or score outputs across calls, use the same model and quantization throughout. See ollama-modelfile for how to pin quantization in a Modelfile.
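One way to pin it is to bake the fully qualified tag into a Modelfile and give the pipeline a single named model to call (pipeline-extractor is a hypothetical name):

cat > Modelfile <<'EOF'
FROM qwen2.5:32b-instruct-q8_0
EOF
ollama create pipeline-extractor -f Modelfile

Every stage then calls pipeline-extractor, and nothing can silently drift to a different quantization.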