Overview
Ollama runs open-weight LLMs locally with one command. Use it when the workload demands offline operation, data privacy, or a hard cost ceiling on bulk inference. For the production-quality reasoning path, frontier models (Claude, GPT-4.1) remain the default; Ollama is the fallback, the fast path, and the privacy lane.
Reach for a local model when one of three conditions holds
Pick Ollama deliberately, not by default. The three conditions:
- Offline or air-gapped: no network to the API. Ollama runs entirely on the host.
- Privacy: data cannot leave the machine. Patient records, internal source code under restrictive licenses, customer PII under contract.
- Bulk cost: a million low-stakes classifications. The marginal token is free once the GPU is paid for.
If none of those apply, the frontier API is faster to set up, smarter per token, and cheaper at low volume. Hybrid is common: Ollama for bulk preprocessing, frontier model for the final reasoning step. See prompt-design for prompts that port across both.
Pick the model by GPU memory, then by task
The model has to fit in VRAM (or unified memory on Apple Silicon) with room left for the context window. As of 2026, three model families cover most use cases.
- Llama 3.3 70B: best general reasoning in open-weight; needs roughly 40 GB at Q4 for an 8K context.
- Qwen 2.5 32B and Qwen 2.5 Coder 32B: strong general and best-in-class open-weight code; runs on 24 GB at Q4.
- Mistral Small 3 (24B) and Llama 3.2 8B: lower-end machines, edge devices, batch jobs. 8B runs comfortably on 8 GB.
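The ~40 GB figure is back-of-envelope arithmetic: weights at the quantized bit-width, plus a KV cache that grows with context. A sketch of the estimate; the bits-per-weight, layer count, and KV dimension below are rough assumptions for Llama 3.3 70B at Q4_K_M, not measured values, so benchmark on real hardware.

```python
# Rough VRAM estimate: quantized weights plus fp16 KV cache.
# All constants are assumptions, not measurements.

def vram_gb(params_b: float, bits_per_weight: float,
            ctx: int, n_layers: int, kv_dim: int) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8   # bytes of weights
    kv_cache = 2 * n_layers * ctx * kv_dim * 2       # K and V, 2 bytes each
    return (weights + kv_cache) / 1e9

# Llama 3.3 70B at ~4.5 bits/weight (Q4_K_M), 8K context,
# 80 layers, 8 KV heads x 128 head dim = 1024 KV dim (GQA)
print(f"{vram_gb(70, 4.5, 8192, 80, 1024):.0f} GB")  # prints 42 GB
```

Leave headroom beyond the estimate; the runtime and the OS take a slice too.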
For embeddings, run nomic-embed-text or bge-large separately (see embeddings).
Use Q4_K_M as the default quantization
Quantization trades quality for memory. The standard ladder:
- Q4_K_M: the default. ~4 bits per weight, ~1 to 2 percent quality loss vs. full precision on most benchmarks.
- Q5_K_M: small step up, ~25 percent more memory.
- Q8_0: near full quality, roughly 2x memory of Q4. Use when the host has the headroom and quality matters.
- Q3 or lower: only when memory forces it. Quality drops noticeably.
```
ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q8_0
```

Benchmark on your own eval set; the quality gap between Q4 and Q8 varies by task.
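A minimal harness for that comparison, run against the native /api/generate endpoint. The eval set and the containment-check scorer are placeholders; substitute your own prompts and scoring.

```python
# Side-by-side quant eval against Ollama's native API.
# Assumes both quants have already been pulled.
import requests

MODELS = ["qwen2.5-coder:32b-instruct-q4_K_M",
          "qwen2.5-coder:32b-instruct-q8_0"]
EVAL_SET = [("Write a TypeScript type guard for a User object.", "type guard")]

def generate(model: str, prompt: str) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]

for model in MODELS:
    # Placeholder scorer: does the output contain the expected phrase?
    hits = sum(expected in generate(model, p) for p, expected in EVAL_SET)
    print(f"{model}: {hits}/{len(EVAL_SET)}")
```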
Match context length to actual need
A larger context window costs memory linearly and slows time to first token.
- Set `num_ctx` to the largest input you actually send, rounded up to a power of two.
- Chat: 4K to 8K is plenty. Long-document RAG: 16K to 32K, with retrieval narrowing the input.

```
OLLAMA_CONTEXT_LENGTH=16384 ollama serve
```

Long contexts also degrade attention quality in many open-weight models. Aggressive retrieval (see rag) beats a giant context window.
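`num_ctx` can also be set per request through the `options` field of the native API, so one long-document route pays for the large window while everything else keeps a small default. A sketch; the model tag and prompt are placeholders.

```python
# Per-request context override via the native API's "options" field,
# instead of a server-wide OLLAMA_CONTEXT_LENGTH.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3:70b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Summarize this document: ..."}],
        "options": {"num_ctx": 16384},  # sized to the largest input we send
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```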
Serve via the REST and OpenAI-compatible endpoints
Ollama exposes two HTTP surfaces on localhost:11434.
- Native: `POST /api/generate`, `POST /api/chat`, `POST /api/embeddings`. Streaming by default.
- OpenAI-compatible: `POST /v1/chat/completions`, `POST /v1/embeddings`. Lets any OpenAI client point at Ollama with one env var.

```
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"
```

The OpenAI-compatible mode is the fastest way to swap Ollama in for a frontier model when testing an agent loop locally.
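With those two variables exported, the stock OpenAI Python SDK talks to Ollama unchanged. A sketch; the model tag must be one you have pulled.

```python
# The SDK reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment,
# so no code changes are needed to point it at Ollama.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="qwen2.5-coder:32b-instruct-q4_K_M",  # an Ollama tag, not an OpenAI model
    messages=[{"role": "user", "content": "Explain this regex: ^[a-z]+$"}],
)
print(resp.choices[0].message.content)
```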
Customize behavior with a Modelfile
A Modelfile pins a base model together with a system prompt, parameters, and a template into a named local image.
```
FROM qwen2.5-coder:32b-instruct-q4_K_M
SYSTEM "You are a code reviewer for a TypeScript monorepo. Return JSON only."
PARAMETER temperature 0.2
PARAMETER num_ctx 16384
```

```
ollama create code-reviewer -f Modelfile
ollama run code-reviewer
```

Modelfiles are version-controllable. Check them in alongside the agent code so the model and the prompt move together.
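Calling the named model then looks like any other Ollama call. A sketch against the native endpoint; Ollama's `"format": "json"` option constrains decoding to valid JSON, which backs up the SYSTEM prompt's instruction.

```python
# Call the custom image and parse its JSON-only output.
import json
import requests

r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "code-reviewer",
                        "prompt": "Review: function add(a, b) { return a - b; }",
                        "format": "json", "stream": False},
                  timeout=300)
print(json.loads(r.json()["response"]))
```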
Wire local models in as a fallback or fast path
Two common integration patterns with frontier-model agents like claude-code:
- Fallback: try Claude first, fall back to Llama 3.3 when the API is down or quota is exhausted.
- Fast path: route low-stakes calls (classification, summarization, extraction) to Ollama; reserve the frontier model for the hard reasoning step.
Log both routes with the same schema so quality comparisons sit side by side.
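A compact sketch of the fallback route with the shared log schema. The frontier model name, the Ollama tag, and the blanket except are illustrative; production code wants narrower exception handling and retries.

```python
# Fallback pattern: try the frontier API first, fall back to Ollama on
# failure, log both routes with one schema.
import json
import time

import anthropic  # needs ANTHROPIC_API_KEY in the env
import requests

def complete(prompt: str) -> dict:
    t0 = time.time()
    try:
        msg = anthropic.Anthropic().messages.create(
            model="claude-sonnet-4-5",          # illustrative model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        route, text = "frontier", msg.content[0].text
    except Exception:  # API down, quota exhausted, no network
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "llama3.3:70b-instruct-q4_K_M",
                                "prompt": prompt, "stream": False},
                          timeout=300)
        route, text = "local", r.json()["response"]
    # Same schema on both routes, so quality comparisons sit side by side.
    record = {"route": route, "prompt": prompt, "output": text,
              "latency_s": round(time.time() - t0, 2)}
    print(json.dumps(record))
    return record
```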