Overview
Ollama runs in three deployment contexts: local developer machine (one user, one process), shared GPU server (multiple users, single GPU node), and production service (containerized, behind a reverse proxy, with access controls). Each context has different requirements for authentication, resource limits, and restart behavior. The default `ollama serve` startup is correct only for the first context; the others need explicit configuration.
Local development: use the default startup with a keep-alive adjustment
For a single developer machine, the default `ollama serve` is correct. The only configuration worth changing is the model keep-alive duration, which controls how long a loaded model stays in VRAM after the last request.
```
OLLAMA_KEEP_ALIVE=30m ollama serve
```

- The default keep-alive is 5 minutes. Extend it when your workflow sends requests in bursts with gaps.
- Set `OLLAMA_KEEP_ALIVE=-1` to keep models loaded indefinitely. Use this for dedicated workstations.
- On macOS, Ollama installs a launchd agent that starts on login. Unload it if you prefer manual control: `launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist`.
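The same knob is also exposed per request: the native API accepts a `keep_alive` field that overrides the server default for that model load. A minimal sketch (the model name is just an example):

```
# Keep the model resident for 1 hour after this request,
# regardless of the server-wide OLLAMA_KEEP_ALIVE setting
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "hello",
  "keep_alive": "1h"
}'
```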
Run as a systemd service on a GPU server
On a Linux GPU server shared by multiple users, run Ollama as a systemd unit so it starts on boot, restarts on crash, and runs as a dedicated service account.
```
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=5
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=15m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

[Install]
WantedBy=multi-user.target
```

```
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```

- `OLLAMA_HOST=0.0.0.0:11434` binds to all interfaces; add a reverse proxy with auth in front.
- `OLLAMA_MAX_LOADED_MODELS=1` prevents OOM errors when multiple users load different models concurrently.
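The unit assumes a dedicated `ollama` service account already exists. A sketch of creating one and then checking that the service came up cleanly (the home directory is just one common layout):

```
# Create the system account referenced by User=/Group=
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama

# After systemctl start: confirm the API answers and watch for crash loops
curl http://localhost:11434/api/tags
journalctl -u ollama -f
```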
Use Docker for containerized and cloud deployments
The official `ollama/ollama` image packages the binary with the CUDA runtime libraries; the GPU driver itself stays on the host and is passed through via the NVIDIA Container Toolkit. For AMD GPUs, use the separate `ollama/ollama:rocm` tag.
```
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_models:
```

- Mount a named volume for `~/.ollama` so models survive container restarts. Model files are large; pulling them on every container start is expensive.
- For CPU-only deployments, remove the `deploy.resources` block.
- For Apple Silicon, Docker Desktop does not expose the Metal GPU to containers. Run `ollama serve` directly on the host; point agent containers at `host.docker.internal:11434`.
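Bringing the stack up and seeding the volume with a first model might look like this (the model name is an example):

```
docker compose up -d

# Pull a model into the named volume so it survives container restarts
docker compose exec ollama ollama pull llama3.1:8b

# Smoke test through the published port
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"ping","stream":false}'
```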
Put a reverse proxy with authentication in front of the API
Ollama’s HTTP API has no built-in authentication. Any process that can reach port 11434 can run inference and load models. On a shared server or any network-accessible deployment, put a reverse proxy in front.
```
upstream ollama {
    server 127.0.0.1:11434;
}

server {
    listen 443 ssl;
    server_name ollama.internal.example.com;
    ssl_certificate     /etc/nginx/certs/ollama.crt;   # example paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    location / {
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://ollama;
        proxy_read_timeout 600s;
        proxy_buffering off;  # Required for streaming responses
    }
}
```

- `proxy_buffering off` is required for streaming responses; nginx buffers by default, which stalls the token stream (NDJSON on the native API, `text/event-stream` on the OpenAI-compatible endpoints).
- `proxy_read_timeout 600s` prevents nginx from closing the connection during slow generation of long outputs.
- For team deployments, use a proper identity provider (OAuth2 proxy, Cloudflare Access) instead of HTTP basic auth.
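Creating the credentials file referenced by `auth_basic_user_file` takes one command from apache2-utils (the username is an example):

```
# -c creates the file; omit it when adding further users
sudo htpasswd -c /etc/nginx/.htpasswd alice
sudo nginx -t && sudo systemctl reload nginx

# Clients authenticate with standard HTTP basic auth
curl -u alice https://ollama.internal.example.com/api/tags
```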
Separate the model storage path from the container image
Model files (`.gguf` files, manifests) live in `~/.ollama` by default. On servers, put this on a fast, large disk separate from the OS volume.
```
export OLLAMA_MODELS=/data/ollama/models
```

- NVMe or SSD is required for acceptable model load times. An 8B Q4 model loads in under 5 seconds from NVMe; the same model on spinning disk can take 30 seconds or more.
- Share the model directory across multiple Ollama instances (dev and staging) to avoid storing duplicate copies of large model files.
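Under the systemd deployment above, the equivalent is an `Environment=` line in the unit file, and the service account must own the new path (paths reuse the earlier examples):

```
# Give the service account ownership of the relocated model store
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama

# In /etc/systemd/system/ollama.service, [Service] section:
#   Environment="OLLAMA_MODELS=/data/ollama/models"
sudo systemctl restart ollama
```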