Overview
Use Ollama for any task where the goal is “run a model on my machine” and the developer wants `ollama run qwen3` to work. Use llama.cpp when the goal is to ship a binary that runs inference inside a product, optimize for a specific CPU or GPU, or pick a quantization that Ollama does not package. Ollama is the user-facing front end; llama.cpp is the engine. See ollama for the Ollama rule set.
When Ollama wins
Ollama is the right pick for local development, prototypes, and small-team self-hosting.
- One command: `ollama pull llama3.2 && ollama run llama3.2`. No build flags, no quantization choice, no GPU detection script. Works on a fresh Mac in five minutes.
- OpenAI-compatible HTTP API on `localhost:11434`. Existing clients (LangChain, LlamaIndex, the OpenAI SDK with a base URL override) just work.
- Model library with sensible defaults: each model ships with a known-good quantization, context size, and chat template. The Modelfile syntax customizes prompts without recompiling.
- Multi-model serving: Ollama hot-swaps models in memory; you can run a small embedding model and a large chat model in the same process.
- Packaged installers: macOS Metal, Linux CUDA, and Windows builds ship as signed installers. llama.cpp asks the user to compile.
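The Modelfile customization mentioned above can be sketched as follows; the model name, parameter value, and system prompt are illustrative, not a recommended configuration:

```shell
# Illustrative Modelfile: base model, a sampling parameter, and a system prompt.
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM You are a concise technical assistant.
EOF
```

Then `ollama create concise-llama -f Modelfile` builds the customized model and `ollama run concise-llama` starts a chat with it, with no recompilation involved.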
When llama.cpp wins
llama.cpp is the right pick when the runtime is the product, you need a quantization Ollama does not ship, or you want to embed inference in another binary.
- Static binary: `llama-server` is one C++ executable. Drop it on a server and run it: no Go runtime, no daemon, no model library.
- Quantization control: you pick the exact format (Q4_K_M, Q5_K_S, IQ3_XXS, BF16) per model and per use case. Ollama exposes a small subset.
- Embedded use: link `libllama.so` into a Rust, Python, or Swift app and run inference inside the process. Ollama assumes a separate daemon.
- Custom builds: enable or disable specific CPU intrinsics (AVX-512, NEON), tune batch size, or build with Vulkan for a non-CUDA GPU. Ollama hides these knobs.
- Bare-metal performance tuning: llama.cpp lets you set `--threads`, `--ubatch-size`, `--mmap`, and `--mlock` per workload. Ollama uses defaults that work for most cases but are not always optimal.
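A launch sketch for the tuning flags above. The model path and flag values are illustrative, and exact flag spellings can vary between llama.cpp releases:

```shell
# One model per llama-server process, tuned per workload.
# The model path is hypothetical; flag values are examples, not recommendations.
llama-server \
  --model ./models/qwen3-8b-Q4_K_M.gguf \
  --threads 8 \
  --ctx-size 8192 \
  --mlock \
  --port 8080
```

Each flag is a per-workload decision: thread count matched to physical cores, context size matched to the prompt length you actually serve, `--mlock` to keep weights resident in RAM. Ollama makes these choices for you.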
Trade-offs at a glance
| Dimension | Ollama | llama.cpp |
|---|---|---|
| Install | One installer, two minutes | Build from source or download binary |
| Mental model | Docker for models | C++ inference library plus CLI server |
| API | OpenAI-compatible HTTP | OpenAI-compatible HTTP plus C/C++ ABI |
| Quantization | Curated GGUF per model | Any GGUF; you pick |
| Model swap | Automatic, in-process | One model per server process |
| Embedding | Run model behind daemon | Link as library |
| GPU support | Metal, CUDA, ROCm out of the box | Same, plus Vulkan and custom builds |
| Modelfile | Yes; templates and parameters | No; flags and config files |
| Tuning depth | Mostly fixed defaults | Every knob exposed |
| Ecosystem fit | LangChain, Cline, Continue, Cursor | llama-cpp-python, candle, ggml-runners |
| Update cadence | Slower; curated releases | Faster; tracks new models same week |
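The quantization row is largely a memory trade-off: estimated weight-file size is roughly parameters × bits-per-weight / 8. The bits-per-weight figures below are approximate averages for GGUF formats, and real files vary slightly by model:

```shell
# size_gb ≈ params_in_billions * bits_per_weight / 8
# BF16 is exactly 16 bpw; Q4_K_M averages roughly 4.85 bpw (approximate).
awk 'BEGIN {
  printf "8B at BF16:   %.1f GB\n", 8 * 16   / 8
  printf "8B at Q4_K_M: %.1f GB\n", 8 * 4.85 / 8
}'
```

This is why picking the exact format matters on constrained hardware: the same 8B model spans more than a 3x range in memory footprint between BF16 and a 3-bit format.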
Migration cost
Ollama-to-llama.cpp is the common direction when a project moves from prototype to production. The reverse is rare.
- Ollama to llama.cpp: pull the GGUF file Ollama stored under `~/.ollama/models`, point `llama-server` at it, and copy the chat template into a Jinja file. Plan one day per service plus a week to wire the build into CI.
- llama.cpp to Ollama: package the model with a Modelfile that mirrors your prompt template, then `ollama create`. Plan an afternoon per model. You lose any custom quantization that Ollama does not support.
- Cheaper path: use Ollama in development and llama.cpp in production. The HTTP API is compatible, so client code does not change.
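The Ollama-to-llama.cpp step can be sketched as follows. Ollama stores weights as content-addressed blobs, and the largest blob for a model is typically the GGUF file itself; the digest below is a placeholder, not a real path:

```shell
# Locate the GGUF blob Ollama downloaded (largest file is usually the weights).
ls -lhS ~/.ollama/models/blobs | head -n 3

# Serve it directly with llama-server; <digest> is a placeholder.
llama-server --model ~/.ollama/models/blobs/sha256-<digest> --port 8080
```

Because both servers expose an OpenAI-compatible endpoint, the client side of this migration is usually just a base URL change from `localhost:11434` to the new server's port.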
Recommendation
- Laptop dev loop, building a RAG app or running evals: Ollama. See ollama and rag.
- Embedded inference inside a desktop app or CLI tool: llama.cpp linked as a library.
- Production self-hosting at 1 to 10 QPS with no need for custom quantization: Ollama. Operational simplicity wins.
- Production self-hosting at 50-plus QPS or on bespoke hardware: llama.cpp with tuned flags, behind a load balancer.
- Hosted alternative when local inference is not a hard requirement: see claude-vs-gpt for the API comparison.
- Embeddings at scale: llama.cpp with a small embedding model in a server pool; see embeddings.