Overview

Ollama exposes two HTTP surfaces on localhost:11434: a native API with streaming by default, and an OpenAI-compatible API that lets any OpenAI client point at a local model with one environment variable. Understanding both surfaces, their streaming behavior, and their concurrency limits is necessary before integrating Ollama into an agent pipeline.

Use the OpenAI-compatible endpoint to swap in any OpenAI client

The OpenAI-compatible surface (/v1/chat/completions, /v1/embeddings) is the fastest integration path. Any library that speaks the OpenAI API can use Ollama by changing the base URL.

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the library; value is ignored
)
 
response = client.chat.completions.create(
    model="qwen2.5-coder:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain HNSW indexing in one paragraph."}],
)
print(response.choices[0].message.content)
  • Set OPENAI_BASE_URL=http://localhost:11434/v1 and OPENAI_API_KEY=ollama as environment variables to make existing OpenAI code use Ollama without code changes (see the sketch after this list).
  • The model name must be the exact Ollama model tag, not the OpenAI model name.
  • Not every OpenAI feature is supported. Function calling, tool use, and vision are model-dependent.
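
As a sketch of the environment-variable route: the official openai Python library reads OPENAI_BASE_URL and OPENAI_API_KEY automatically, so a client constructed with no arguments talks to Ollama once both are exported (the model tag and prompt below are illustrative).

from openai import OpenAI

# Assumes the environment was prepared beforehand, e.g.:
#   export OPENAI_BASE_URL=http://localhost:11434/v1
#   export OPENAI_API_KEY=ollama
client = OpenAI()  # both values are picked up from the environment

response = client.chat.completions.create(
    model="qwen2.5-coder:32b-instruct-q4_K_M",  # exact Ollama tag, per the note above
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)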

Use the native API for full control over generation parameters

The native endpoints at /api/generate and /api/chat offer Ollama-specific options not available through the OpenAI-compatible surface.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "Summarize the GDPR in three bullets."}],
  "stream": false,
  "options": {
    "temperature": 0.3,
    "num_ctx": 8192,
    "num_predict": 512
  }
}'
  • num_ctx: the context window length in tokens. Ollama defaults this to a modest value (2048 tokens, 4096 in newer releases) rather than the model’s full native window; raise it for long prompts, lower it to save KV cache memory.
  • num_predict: max output tokens. Set this to prevent runaway generation in batch pipelines.
  • stream: the native API streams by default; set it to false for synchronous pipelines, leave it on for interactive UIs.
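
The same request from Python, as a minimal sketch using httpx; the disabled timeout and the done_reason check are the notable details (done_reason is "length" when num_predict cut generation short, "stop" on a natural finish):

import httpx

resp = httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Summarize the GDPR in three bullets."}],
    "stream": False,
    "options": {"temperature": 0.3, "num_ctx": 8192, "num_predict": 512},
}, timeout=None)  # large models can take minutes; disable the client-side timeout
resp.raise_for_status()
data = resp.json()
print(data["message"]["content"])
if data.get("done_reason") == "length":
    print("[warning] output truncated at num_predict tokens")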

Stream responses for interactive use cases

Streaming sends tokens as they are generated rather than buffering the entire response. It is the right choice for chat interfaces, terminal tools, and any case where perceived latency matters.

import httpx
import json
 
with httpx.stream("POST", "http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Write a haiku about vectors."}],
    "stream": True,
}) as r:
    for line in r.iter_lines():
        if not line:
            continue
        # Each line is one JSON object (newline-delimited JSON).
        chunk = json.loads(line)
        print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):  # the final chunk carries done=true plus timing stats
            break

For batch processing pipelines where you need the full output before proceeding, set "stream": false. Streaming adds complexity without benefit when the consumer needs the complete text.

Understand concurrency limits before running parallel requests

Out of the box, Ollama serves only a small number of requests at once; depending on version and available memory, the default can be a single request at a time. Requests beyond the parallel limit queue behind the active ones.

  • OLLAMA_NUM_PARALLEL: set to 2 or 4 on systems with enough VRAM to hold multiple KV caches. Each parallel slot costs roughly 2 (K and V) * context_length * layers * hidden_size * bytes_per_element, i.e. context_length * layers * hidden_size * 4 bytes at fp16; models with grouped-query attention need proportionally less. See the estimate sketch below the command.
  • OLLAMA_MAX_QUEUE: the request queue depth before Ollama returns 503. Default is 512. Lower it to fail fast in batch pipelines.
OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_QUEUE=100 ollama serve
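
A back-of-envelope sketch of the per-slot estimate from the bullet above; the layer count and hidden size are illustrative, not read from any real model config, and grouped-query-attention models need only a fraction of this:

def kv_cache_bytes(num_ctx: int, layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    # K and V per layer per token, at the given element width (2 bytes for fp16)
    return 2 * num_ctx * layers * hidden_size * bytes_per_elem

# Illustrative 70B-class shape: 80 layers, hidden size 8192, fp16 cache
per_slot = kv_cache_bytes(num_ctx=8192, layers=80, hidden_size=8192)
print(f"{per_slot / 2**30:.1f} GiB per parallel slot")  # ~20 GiB for this shape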

For CPU-bound workloads on a multi-core machine with no GPU, higher parallelism can improve throughput. Benchmark with your actual workload before setting it.
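
When in doubt, measure. A minimal benchmark sketch, assuming the server is already running with the settings above; it fires n identical requests concurrently and reports wall-clock throughput (model and prompt are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

import httpx

def one_request(_: int) -> None:
    httpx.post("http://localhost:11434/api/generate", json={
        "model": "llama3.3:70b-instruct-q4_K_M",
        "prompt": "Write one sentence about caching.",
        "stream": False,
    }, timeout=None).raise_for_status()

n = 8
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(one_request, range(n)))  # forces completion, re-raises errors
elapsed = time.perf_counter() - start
print(f"{n} requests in {elapsed:.1f}s ({n / elapsed:.2f} req/s)")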

Use keep-alive to avoid model reloading between requests

By default, Ollama unloads a model from memory after 5 minutes of inactivity. If your workload sends requests in bursts separated by longer gaps, each burst pays the full model-load cost; pass a keep_alive parameter to keep the model resident.

response = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[...],
    extra_body={"keep_alive": "15m"},
)

Set keep_alive to match your request cadence. An always-on daemon should set it to -1 (never unload). See ollama-deployment for how to configure this in systemd or Docker.
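
On the native API, keep_alive is a top-level request field, and an empty /api/generate request is the documented way to preload a model. A sketch of pinning a model in memory indefinitely (the model tag is illustrative):

import httpx

# Empty generate request: loads the model without producing output,
# and keep_alive=-1 tells Ollama never to unload it.
httpx.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5:32b-instruct-q4_K_M",
    "keep_alive": -1,
}, timeout=None).raise_for_status()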

Use structured output via the format parameter for reliable parsing

Ollama supports constrained JSON generation: pass "format": "json" (or a full JSON Schema) in a native API request, or response_format={"type": "json_object"} on the OpenAI-compatible surface. Use it for any pipeline that parses the model’s output.

response = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{
        "role": "user",
        "content": "Extract: name, email, company from: 'Alice Smith, alice@acme.com, Acme Corp'",
    }],
    response_format={"type": "json_object"},
)

The output is constrained to valid JSON at the grammar level; malformed JSON is rare, but you should still validate the result against your expected schema. See structured-output for the broader patterns.
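
A sketch of the schema-constrained variant on the native API; recent Ollama releases accept a JSON Schema object as the format value, and the schema plus the validation step here are illustrative:

import json

import httpx

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "company": {"type": "string"},
    },
    "required": ["name", "email", "company"],
}

resp = httpx.post("http://localhost:11434/api/chat", json={
    "model": "qwen2.5:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content":
        "Extract name, email, company from: 'Alice Smith, alice@acme.com, Acme Corp'"}],
    "format": schema,  # constrain decoding to this schema
    "stream": False,
}, timeout=None)
resp.raise_for_status()
record = json.loads(resp.json()["message"]["content"])
assert set(schema["required"]) <= record.keys()  # validate before trusting the output
print(record)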