Overview

Ollama runs in three deployment contexts: local developer machine (one user, one process), shared GPU server (multiple users, single GPU node), and production service (containerized, behind a reverse proxy, with access controls). Each context has different requirements for authentication, resource limits, and restart behavior. The default ollama serve startup is correct only for the first context; the others need explicit configuration.

Local development: use the default startup with a keep-alive adjustment

For a single developer machine, the default ollama serve is correct. The only configuration worth changing is the model keep-alive duration, which controls how long a loaded model stays in VRAM after the last request.

OLLAMA_KEEP_ALIVE=30m ollama serve
  • The default keep-alive is 5 minutes. Extend it when your workflow sends requests in bursts with gaps.
  • Set OLLAMA_KEEP_ALIVE=-1 to keep models loaded indefinitely. Use this for dedicated workstations.
  • On macOS, Ollama installs a launchd agent that starts on login. Unload it if you prefer manual control: launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist.
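
The server-wide default can also be overridden per request: the API accepts a keep_alive field on generate and chat calls. A minimal sketch against the local server (the model name is illustrative):

# Pin this model for 30 minutes after the request; "0" unloads it immediately, "-1" keeps it indefinitely
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello",
  "keep_alive": "30m"
}'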

Run as a systemd service on a GPU server

On a Linux GPU server shared by multiple users, run Ollama as a systemd unit so it starts on boot, restarts on crash, and runs as a dedicated service account.

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network-online.target
 
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=5
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=15m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
 
[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
  • OLLAMA_HOST=0.0.0.0:11434 binds to all interfaces; add a reverse proxy with auth in front.
  • OLLAMA_MAX_LOADED_MODELS=1 keeps at most one model resident in VRAM; requests for a different model queue until the current one unloads, instead of triggering OOM errors when multiple users load different models concurrently.
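
The unit runs under an ollama account that systemd does not create for you (the official install script normally sets it up). A minimal sketch for creating the account and checking the service; the home directory and shell path are illustrative:

# create a locked system account for the service
sudo useradd -r -m -d /usr/share/ollama -s /usr/sbin/nologin ollama

# follow the logs to confirm startup and GPU detection
journalctl -u ollama -f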

Use Docker for containerized and cloud deployments

The official ollama/ollama image bundles the server with the CUDA runtime libraries (a separate ollama/ollama:rocm tag covers AMD GPUs); GPU passthrough still requires the NVIDIA Container Toolkit, or the ROCm device mappings, on the host.

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
 
volumes:
  ollama_models:
  • Mount a named volume for ~/.ollama so models survive container restarts. Model files are large; pulling them on every container start is expensive.
  • For CPU-only deployments, remove the deploy.resources block.
  • For Apple Silicon, Docker Desktop does not expose the Metal GPU to containers. Run ollama serve directly on the host; point agent containers at host.docker.internal:11434.
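
Without Compose, the equivalent docker run is a one-liner, and models are pulled inside the running container. A minimal sketch; the container name and model are illustrative:

# GPU run with a persistent model volume (drop --gpus=all on CPU-only hosts)
docker run -d --name ollama --gpus=all -p 11434:11434 \
  -v ollama_models:/root/.ollama --restart unless-stopped ollama/ollama

# pull a model into the running container
docker exec -it ollama ollama pull llama3.1:8b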

Put a reverse proxy with authentication in front of the API

Ollama’s HTTP API has no built-in authentication. Any process that can reach port 11434 can run inference and load models. On a shared server or any network-accessible deployment, put a reverse proxy in front.

# /etc/nginx/conf.d/ollama.conf
upstream ollama {
    server 127.0.0.1:11434;
}
 
server {
    listen 443 ssl;
    server_name ollama.internal.example.com;

    # certificate paths are illustrative; point these at your own cert and key
    ssl_certificate     /etc/nginx/ssl/ollama.crt;
    ssl_certificate_key /etc/nginx/ssl/ollama.key;

    location / {
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://ollama;
        proxy_read_timeout 600s;
        proxy_buffering off;  # Required for streaming responses
    }
}
  • proxy_buffering off is required for streaming responses; nginx buffers upstream responses by default, which holds back the incrementally streamed tokens (newline-delimited JSON on the native API, SSE on the OpenAI-compatible endpoints) until the response completes.
  • proxy_read_timeout 600s prevents nginx from closing the connection during slow generation of long outputs.
  • For team deployments, use a proper identity provider (OAuth2 proxy, Cloudflare Access) instead of HTTP basic auth.
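
The auth_basic_user_file referenced above has to be created before reloading nginx. A minimal sketch using htpasswd from apache2-utils, with an illustrative username, followed by a quick check through the proxy:

# create the credentials file (-c overwrites an existing file), then reload nginx
sudo htpasswd -c /etc/nginx/.htpasswd alice
sudo nginx -t && sudo systemctl reload nginx

# verify auth and connectivity by listing local models through the proxy
curl -u alice https://ollama.internal.example.com/api/tags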

Separate the model storage path from the container image

Model data (GGUF weight blobs and their manifests) lives under ~/.ollama by default. On servers, put this directory on a large, fast disk separate from the OS volume.

export OLLAMA_MODELS=/data/ollama/models
  • NVMe or SSD is required for acceptable model load times. An 8B Q4 model loads in under 5 seconds from NVMe; the same on spinning disk can take 30 seconds or more.
  • Share the model directory across multiple Ollama instances (dev and staging) to avoid storing duplicate copies of large model files.
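
Under the systemd deployment, an exported shell variable never reaches the service; set OLLAMA_MODELS in the unit instead and give the service account ownership of the directory. A minimal sketch as a drop-in, mirroring the path above:

# /etc/systemd/system/ollama.service.d/models.conf
[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"

Create the directory, run sudo chown -R ollama:ollama /data/ollama, then sudo systemctl daemon-reload and sudo systemctl restart ollama so the service picks up the new path.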