Overview
Ollama runs open-weight LLMs on local hardware with a single binary and a REST API that includes OpenAI-compatible endpoints. This guide installs Ollama, pulls a model, runs a chat, sets a system prompt, and shows two integration paths: the raw REST API and LangChain. For the trade-off against llama.cpp, see ollama-vs-llamacpp.
Prerequisites
- macOS 13+ (Metal GPU), Linux with a GPU (NVIDIA or AMD with ROCm), or Linux CPU-only. Windows is supported via WSL2.
- At least 8 GB RAM for 7B models; 16 GB for 13B; 32 GB for 30B+ models.
- `curl` for API testing.
- Python 3.11+ for the LangChain integration.
Steps
1. Install Ollama
macOS
```
brew install ollama
```
Or download the .dmg from ollama.com.
Linux
```
curl -fsSL https://ollama.com/install.sh | sh
```
This installs the binary and sets up a systemd service. Start it:
```
sudo systemctl start ollama
sudo systemctl enable ollama
```
Confirm Ollama is running:
```
ollama --version
curl http://localhost:11434/api/tags
# expected: {"models": [...]}
```
2. Pull a model
Choose a model based on available RAM:
| Model | Size | Min RAM |
|---|---|---|
| llama3.2:3b | 2.0 GB | 8 GB |
| llama3.1:8b | 4.7 GB | 8 GB |
| mistral:7b | 4.1 GB | 8 GB |
| qwen2.5:14b | 9.0 GB | 16 GB |
```
ollama pull llama3.2:3b
```
List installed models:
```
ollama list
```
3. Run an interactive chat
```
ollama run llama3.2:3b
```
Type at the `>>>` prompt. Press Ctrl+D to exit, or type /bye inside the REPL to quit cleanly.
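The REPL keeps the conversation history for you; the HTTP API does not, so a script has to resend prior turns itself. A minimal stdlib-only sketch against the native /api/chat endpoint, assuming the server is on the default port and the model pulled above:
```python
# Programmatic equivalent of the REPL: the server is stateless, so the
# client resends the conversation history on every turn.
import json
import urllib.request

history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    body = json.dumps({
        "model": "llama3.2:3b",
        "messages": history,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]
    history.append(reply)  # keep the assistant turn for the next call
    return reply["content"]

print(chat("What is HNSW?"))
print(chat("Shorter, please."))  # sees the previous turn via history
```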
4. Set a system prompt with a Modelfile
Create a custom model variant with a baked-in system prompt:
```
cat > Modelfile << 'EOF'
FROM llama3.2:3b
SYSTEM """
You are a concise technical writer. Reply in plain text only.
Lead with the rule, then the rationale.
Avoid em-dashes. Use periods or commas instead.
"""
EOF
ollama create mymodel -f Modelfile
ollama run mymodel
```
See system-prompts for rules on writing effective system prompts.
5. Use the REST API
Ollama listens on port 11434, exposing its native REST API plus OpenAI-compatible endpoints under /v1.
Non-streaming
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain HNSW indexing in two sentences.",
  "stream": false
}'
```
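Streaming
Streaming is the default: leave "stream" unset and /api/generate returns one JSON object per line as tokens are generated. A stdlib-only consumer sketch, again assuming the default port and the model pulled above:
```python
# Streaming consumer for /api/generate: the response body is one
# standalone JSON object per line until a final object with "done": true.
import json
import urllib.request

body = json.dumps({
    "model": "llama3.2:3b",
    "prompt": "Explain HNSW indexing in two sentences.",
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:                      # one JSON object per line
        chunk = json.loads(line)
        print(chunk["response"], end="", flush=True)
        if chunk["done"]:                  # final object carries stats
            print()
            break
```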
OpenAI-compatible /v1/chat/completions endpoint
```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is RAG?"}]
  }'
```
The /v1 endpoint accepts the same payload as the OpenAI SDK. Swap the base URL to run any OpenAI SDK code against a local model.
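As a sketch of that swap, using the official openai Python package (`pip install openai`); the api_key argument is required by the SDK but ignored by Ollama:
```python
# Point the official OpenAI Python SDK at the local Ollama server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "What is RAG?"}],
)
print(resp.choices[0].message.content)
```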
6. Integrate with LangChain
```
pip install langchain langchain-ollama
```
```python
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.2:3b", base_url="http://localhost:11434")
response = llm.invoke("List three Postgres performance tips.")
print(response)
```
For embeddings (useful in RAG pipelines), pull the embedding model first with `ollama pull nomic-embed-text`, then:
```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("What is RAG?")
print(len(vector))  # 768 for nomic-embed-text
```
See rag for the full retrieval pipeline that consumes these embeddings.
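In miniature, that pipeline ranks candidate passages by cosine similarity against the query embedding. A hedged sketch; the passages and the ranking loop below are illustrative, not from the rag guide:
```python
# Miniature retrieval: embed a query and a few candidate passages,
# then rank the passages by cosine similarity to the query.
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

passages = [
    "RAG augments generation with retrieved documents.",
    "Postgres uses MVCC for concurrency control.",
]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

query_vec = embeddings.embed_query("What is RAG?")
passage_vecs = embeddings.embed_documents(passages)

ranked = sorted(zip(passages, passage_vecs),
                key=lambda p: cosine(query_vec, p[1]), reverse=True)
print(ranked[0][0])  # the passage most similar to the query
```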
Verify it worked
```
# 1. Ollama server is up.
curl -s http://localhost:11434/api/tags | python3 -m json.tool | head -5

# 2. Model is installed.
ollama list | grep llama3.2

# 3. Single-shot inference works.
ollama run llama3.2:3b "Reply with the word OK only." --nowordwrap

# 4. REST API responds.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Say OK"}]}' | \
  python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
```
Common errors
- `connection refused` on port 11434. The Ollama service is not running. Run `ollama serve` in a terminal or `sudo systemctl start ollama`.
- `pull model manifest: 404`. The model name is wrong. Check `ollama list` and the Ollama model library.
- Out-of-memory crash. The model does not fit in RAM or VRAM. Use a smaller quantized variant (e.g., `llama3.1:8b-instruct-q4_0`).
- Slow responses on CPU. Without a GPU, 7B models run at 2 to 5 tokens per second. Use a 3B model for interactive speeds.
- `OllamaLLM` not found in LangChain. Install `langchain-ollama`, not `langchain-community`. Ollama support moved out of the community package into a dedicated one.