Overview
Ollama runs in three deployment contexts: local developer machine (one user, one process), shared GPU server (multiple users, single GPU node), and production service (containerized, behind a reverse proxy, with access controls). Each context has different requirements for authentication, resource limits, and restart behavior. The default `ollama serve` startup is correct only for the first context; the others need explicit configuration.
Local development: use the default startup with a keep-alive adjustment
For a single developer machine, the default `ollama serve` is correct. The only configuration worth changing is the model keep-alive duration, which controls how long a loaded model stays in VRAM after the last request.
```
OLLAMA_KEEP_ALIVE=30m ollama serve
```

- The default keep-alive is 5 minutes. Extend it when your workflow sends requests in bursts with gaps.
- Set `OLLAMA_KEEP_ALIVE=-1` to keep models loaded indefinitely. Use this for dedicated workstations.
- On macOS, Ollama installs a launchd agent that starts on login. Unload it if you prefer manual control: `launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist`.
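The same knob is also exposed per request: the native API accepts a `keep_alive` field that overrides the server default for that model load. A minimal sketch (the model name is just an example):

```
# Keep the model resident for 1 hour after this request,
# regardless of the server-wide OLLAMA_KEEP_ALIVE setting
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "hello",
  "keep_alive": "1h"
}'
```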
Run as a systemd service on a GPU server
On a Linux GPU server shared by multiple users, run Ollama as a systemd unit so it starts on boot, restarts on crash, and runs as a dedicated service account.
```
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=5
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=15m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

[Install]
WantedBy=multi-user.target
```

```
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```

- `OLLAMA_HOST=0.0.0.0:11434` binds to all interfaces; add a reverse proxy with auth in front.
- `OLLAMA_MAX_LOADED_MODELS=1` prevents OOM errors when multiple users load different models concurrently.
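The unit assumes a dedicated `ollama` service account already exists. A sketch of creating one and then checking that the service came up cleanly (the home directory is just one common layout):

```
# Create the system account referenced by User=/Group=
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama

# After systemctl start: confirm the API answers and watch for crash loops
curl http://localhost:11434/api/tags
journalctl -u ollama -f
```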
Use Docker for containerized and cloud deployments
The official `ollama/ollama` image packages the binary with the CUDA runtime libraries; the GPU driver itself stays on the host and is passed through via the NVIDIA Container Toolkit. For AMD GPUs, use the separate `ollama/ollama:rocm` tag.
```
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_models:
```

- Mount a named volume for `~/.ollama` so models survive container restarts. Model files are large; pulling them on every container start is expensive.
- For CPU-only deployments, remove the `deploy.resources` block.
- For Apple Silicon, Docker Desktop does not expose the Metal GPU to containers. Run `ollama serve` directly on the host; point agent containers at `host.docker.internal:11434`.
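Bringing the stack up and seeding the volume with a first model might look like this (the model name is an example):

```
docker compose up -d

# Pull a model into the named volume so it survives container restarts
docker compose exec ollama ollama pull llama3.1:8b

# Smoke test through the published port
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"ping","stream":false}'
```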
Put a reverse proxy with authentication in front of the API
Ollama’s HTTP API has no built-in authentication. Any process that can reach port 11434 can run inference and load models. On a shared server or any network-accessible deployment, put a reverse proxy in front.
```
upstream ollama {
    server 127.0.0.1:11434;
}

server {
    listen 443 ssl;
    server_name ollama.internal.example.com;
    ssl_certificate     /etc/nginx/certs/ollama.crt;   # example paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    location / {
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://ollama;
        proxy_read_timeout 600s;
        proxy_buffering off;  # Required for streaming responses
    }
}
```

- `proxy_buffering off` is required for streaming responses; nginx buffers by default, which stalls the token stream (NDJSON on the native API, `text/event-stream` on the OpenAI-compatible endpoints).
- `proxy_read_timeout 600s` prevents nginx from closing the connection during slow generation of long outputs.
- For team deployments, use a proper identity provider (OAuth2 proxy, Cloudflare Access) instead of HTTP basic auth.
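Creating the credentials file referenced by `auth_basic_user_file` takes one command from apache2-utils (the username is an example):

```
# -c creates the file; omit it when adding further users
sudo htpasswd -c /etc/nginx/.htpasswd alice
sudo nginx -t && sudo systemctl reload nginx

# Clients authenticate with standard HTTP basic auth
curl -u alice https://ollama.internal.example.com/api/tags
```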
Separate the model storage path from the container image
Model files (`.gguf` files, manifests) live in `~/.ollama` by default. On servers, put this on a fast, large disk separate from the OS volume.
```
export OLLAMA_MODELS=/data/ollama/models
```

- NVMe or SSD is required for acceptable model load times. An 8B Q4 model loads in under 5 seconds from NVMe; the same model on spinning disk can take 30 seconds or more.
- Share the model directory across multiple Ollama instances (dev and staging) to avoid storing duplicate copies of large model files.
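Under the systemd deployment above, the equivalent is an `Environment=` line in the unit file, and the service account must own the new path (paths reuse the earlier examples):

```
# Give the service account ownership of the relocated model store
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama

# In /etc/systemd/system/ollama.service, [Service] section:
#   Environment="OLLAMA_MODELS=/data/ollama/models"
sudo systemctl restart ollama
```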