Overview
Ollama exposes two HTTP surfaces on localhost:11434: a native API with streaming by default, and an OpenAI-compatible API that lets any OpenAI client point at a local model with one environment variable. Understanding both surfaces, their streaming behavior, and their concurrency limits is necessary before integrating Ollama into an agent pipeline.
Use the OpenAI-compatible endpoint to swap in any OpenAI client
The OpenAI-compatible surface (/v1/chat/completions, /v1/embeddings) is the fastest integration path. Any library that speaks the OpenAI API can use Ollama by changing the base URL.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the library; the value is ignored
)
response = client.chat.completions.create(
    model="qwen2.5-coder:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain HNSW indexing in one paragraph."}],
)
print(response.choices[0].message.content)
```
- Set OPENAI_BASE_URL=http://localhost:11434/v1 and OPENAI_API_KEY=ollama as environment variables to make existing OpenAI code use Ollama without code changes.
- The model name must be the exact Ollama model tag, not the OpenAI model name.
- Not every OpenAI feature is supported. Function calling, tool use, and vision are model-dependent.
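The environment-variable route can be sketched as follows; the script name is hypothetical, and the variables are read by the OpenAI Python client at construction time:

```shell
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama          # any non-empty value; Ollama ignores it
python existing_openai_script.py      # hypothetical script; runs unchanged
```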
Use the native API for full control over generation parameters
The native endpoint at /api/generate and /api/chat offers Ollama-specific options not available through the OpenAI-compatible surface.
```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "Summarize the GDPR in three bullets."}],
  "stream": false,
  "options": {
    "temperature": 0.3,
    "num_ctx": 8192,
    "num_predict": 512
  }
}'
```
- num_ctx: the context window length in tokens. Defaults to the model's native size; lower it to save KV cache memory.
- num_predict: max output tokens. Set this to prevent runaway generation in batch pipelines.
- stream: set to false for synchronous pipelines; true for interactive UIs.
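The request above can be wrapped in a small stdlib-only helper for synchronous pipelines. A sketch, assuming a local server on the default port; build_chat_payload and ollama_chat are hypothetical names:

```python
import json
import urllib.request

def build_chat_payload(prompt, model, stream=False, **options):
    """Assemble a native /api/chat request body; keyword arguments map
    directly into the "options" object (temperature, num_ctx, num_predict)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
        "options": options,
    }

def ollama_chat(prompt, model, host="http://localhost:11434", **options):
    """POST the payload synchronously and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model, **options)).encode()
    req = urllib.request.Request(
        f"{host}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```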
Stream responses for interactive use cases
Streaming sends tokens as they are generated rather than buffering the entire response. It is the right choice for chat interfaces, terminal tools, and any case where perceived latency matters.
```python
import httpx
import json

with httpx.stream("POST", "http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Write a haiku about vectors."}],
    "stream": True,
}) as r:
    for line in r.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk["message"]["content"], end="", flush=True)
            if chunk.get("done"):
                break
```
For batch processing pipelines where you need the full output before proceeding, set "stream": false. Streaming adds complexity without benefit when the consumer needs the complete text.
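When you stream but still need the complete text afterwards (for logging, say), the chunk handling can be factored into a small reassembly helper. A sketch over canned newline-delimited JSON chunks in the shape /api/chat emits; accumulate_stream is a hypothetical name:

```python
import json

def accumulate_stream(lines):
    """Reassemble the full reply from newline-delimited JSON chunks,
    stopping at the first chunk that reports done=true."""
    parts = []
    for line in lines:
        if not line:  # skip keep-alive blank lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

demo = [
    '{"message": {"content": "Vectors "}, "done": false}',
    '{"message": {"content": "align."}, "done": true}',
]
print(accumulate_stream(demo))  # Vectors align.
```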
Understand concurrency limits before running parallel requests
Ollama processes one request at a time by default. Concurrent requests queue behind the active request.
- OLLAMA_NUM_PARALLEL: set to 2 or 4 on systems with enough VRAM to hold multiple KV caches. Each parallel slot costs roughly 2 (K and V) * layers * context_length * hidden_size * 2 bytes at fp16; models using grouped-query attention need considerably less.
- OLLAMA_MAX_QUEUE: the request queue depth before Ollama returns 503. Default is 512. Lower it to fail fast in batch pipelines.

```shell
OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_QUEUE=100 ollama serve
```
For CPU-bound workloads on a multi-core machine with no GPU, higher parallelism can improve throughput. Benchmark with your actual workload before setting it.
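To size OLLAMA_NUM_PARALLEL, estimate the per-slot KV cache up front. A back-of-the-envelope sketch assuming fp16 and full multi-head attention (grouped-query attention shrinks this, so treat it as an upper bound); the model shape below is illustrative:

```python
def kv_cache_bytes(num_ctx, layers, hidden_size, bytes_per_elem=2):
    """K and V tensors per layer, each num_ctx * hidden_size elements,
    at bytes_per_elem each (fp16 = 2)."""
    return 2 * layers * num_ctx * hidden_size * bytes_per_elem

# Illustrative 70B-class shape: 80 layers, hidden size 8192, 8k context.
gib = kv_cache_bytes(8192, 80, 8192) / 2**30
print(f"{gib:.1f} GiB per parallel slot")  # 20.0 GiB
```

At that size, even OLLAMA_NUM_PARALLEL=2 demands serious VRAM headroom on top of the weights, which is why lowering num_ctx is the usual first lever.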
Use keep-alive to avoid model reloading between requests
By default, Ollama unloads a model from memory after 5 minutes of inactivity. If your workload sends requests in bursts with gaps, add a keep-alive parameter.
```python
response = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[...],
    extra_body={"keep_alive": "15m"},
)
```
Set keep_alive to match your request cadence. An always-on daemon should set it to -1 (never unload). See ollama-deployment for how to configure this in systemd or Docker.
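The native API accepts the same parameter at the top level of the request body, so a one-off warm-up request can pin a model in memory. A sketch assuming the default port; the prompt content is arbitrary:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:32b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "warm-up"}],
  "keep_alive": -1
}'
```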
Use structured output via the format parameter for reliable parsing
Ollama supports constrained JSON generation: set the native API's format parameter to "json" (or pass a JSON Schema), or set response_format={"type": "json_object"} on the OpenAI-compatible endpoint. Use it for any pipeline that parses the model's output.
```python
response = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{
        "role": "user",
        "content": "Extract: name, email, company from: 'Alice Smith, alice@acme.com, Acme Corp'",
    }],
    response_format={"type": "json_object"},
)
```
The output is constrained to valid JSON at the grammar level; you still need to validate the schema, but malformed JSON is rare. See structured-output for the broader structured output patterns.
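Because JSON mode guarantees syntax but not field names, a minimal post-hoc check closes the gap; parse_contact and the expected key set are hypothetical, matching the extraction prompt above:

```python
import json

def parse_contact(raw):
    """Parse the model's JSON reply and check the expected keys exist."""
    data = json.loads(raw)  # rarely fails under constrained decoding
    missing = {"name", "email", "company"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

parsed = parse_contact(
    '{"name": "Alice Smith", "email": "alice@acme.com", "company": "Acme Corp"}'
)
print(parsed["email"])  # alice@acme.com
```

For stricter contracts (types, formats), validate with a real schema library instead of a key check.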