Definition
Fine-tuning continues training a pre-trained language model on a supervised dataset of input-output examples. The model’s weights are updated to minimize loss on that dataset, adapting its behavior toward the target distribution without retraining from scratch.
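Concretely, each training step computes a cross-entropy loss over the target tokens and backpropagates it into the weights. A minimal sketch of one such step with Hugging Face Transformers, assuming a small open model and a single toy example (the model name and text here are illustrative, not a recipe):

# One supervised fine-tuning step (illustrative model and data)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One input-output pair concatenated into a single sequence; real SFT
# pipelines usually mask the prompt tokens out of the loss.
batch = tokenizer("Q: Output format? A: JSON only.", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # cross-entropy loss

outputs.loss.backward()  # gradients of the loss w.r.t. the weights
optimizer.step()         # nudge weights toward the target distribution
optimizer.zero_grad()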
Fine-tuning is appropriate when:
- A specific output format must be followed exactly and prompt engineering is inconsistent.
- Domain vocabulary or notation is unusual and the base model performs poorly.
- Latency constraints call for a smaller, specialized model rather than prompting a large general-purpose one.
- Tone, style, or persona must be locked across millions of requests.
Fine-tuning is NOT appropriate when:
- The goal is to teach new facts; fine-tuning does not reliably inject factual knowledge and can worsen hallucination.
- Prompt engineering achieves equivalent quality; fine-tuning is more expensive to maintain.
- The dataset is small (fewer than roughly 100 examples); the model will overfit.
Parameter-efficient fine-tuning methods (LoRA, QLoRA) update a small set of adapter weights rather than all model parameters, reducing VRAM and storage requirements.
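A minimal LoRA setup sketch using the Hugging Face peft library; the base model, rank, and target modules below are illustrative choices, and the module names depend on the architecture:

# LoRA adapters on a small open model (illustrative hyperparameters)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights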
Evaluation after fine-tuning requires a held-out test set from the same distribution as training. Use a golden set and an evaluation harness to measure improvement against the base model.
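A sketch of that comparison using the OpenAI API; the golden-set contents, the exact-match scorer, and the fine-tuned model ID are all placeholders for your own harness:

# Evaluation harness: base vs. fine-tuned on a held-out golden set (illustrative)
from openai import OpenAI

client = OpenAI()
golden_set = [
    {"input": "...", "expected": "..."},  # held-out examples, same distribution as training
]

def exact_match_rate(model_name: str) -> float:
    hits = 0
    for ex in golden_set:
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": ex["input"]}],
        )
        hits += resp.choices[0].message.content.strip() == ex["expected"]
    return hits / len(golden_set)

print("base: ", exact_match_rate("gpt-4o-mini-2024-07-18"))
print("tuned:", exact_match_rate("ft:gpt-4o-mini-2024-07-18:org::abc123"))  # placeholder ID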
When it applies
Prefer few-shot prompting and system prompt engineering before committing to fine-tuning. Build an evaluation harness first so you can measure whether fine-tuning helps. Start with a small LoRA run on 100-500 examples to validate the signal before scaling.
Example
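For chat-style fine-tuning, the training file is JSONL with one messages array per line. A sketch of writing one record (the conversation content is a placeholder):

# Write one chat-format training record to train.jsonl (placeholder content)
import json

record = {
    "messages": [
        {"role": "system", "content": "Reply in JSON only."},
        {"role": "user", "content": "Extract the order ID: #4512"},
        {"role": "assistant", "content": '{"order_id": 4512}'},
    ]
}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per example, one per line

With train.jsonl in place, upload it and start the job: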
# OpenAI fine-tuning API (illustrative)
from openai import OpenAI
client = OpenAI()
with open("train.jsonl", "rb") as f:
upload = client.files.create(file=f, purpose="fine-tune")
job = client.fine_tuning.jobs.create(
training_file=upload.id,
model="gpt-4o-mini-2024-07-18"
)
print(job.id)Related concepts
- distillation - distillation uses a large model’s outputs as training data for a smaller one; fine-tuning uses human-labeled data.
- evaluation-harness - always evaluate the fine-tuned model against the base model on a test set.
- golden-set - the gold standard for measuring fine-tuning effectiveness.
- few-shot-prompting - prefer few-shot prompting over fine-tuning when the task can be expressed in examples in the prompt.
- hallucination - fine-tuning does not reliably reduce hallucination and can increase it for facts absent from the training data.
Citing this term
See Fine-Tuning (llmbestpractices.com/glossary/fine-tuning).