Definition

Fine-tuning continues training a pre-trained language model on a supervised dataset of input-output examples. The model’s weights are updated to minimize loss on this dataset, adapting behavior toward the target distribution without retraining from scratch.
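As a minimal sketch of what that weight update looks like, the snippet below runs one supervised training step with a Hugging Face causal language model; the model name, learning rate, and example text are illustrative assumptions, not recommendations.

# One supervised fine-tuning step (illustrative; model, learning rate,
# and example text are assumptions, not recommendations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# For a causal LM, passing labels=input_ids gives next-token cross-entropy loss.
batch = tokenizer("Input: classify the ticket. Output: billing", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()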

Fine-tuning is appropriate when:

  • A specific output format must be followed exactly and prompt engineering is inconsistent.
  • Domain vocabulary or notation is unusual and the base model performs poorly.
  • Latency requires a smaller, specialized model rather than prompting a large general one.
  • Tone, style, or persona must be locked across millions of requests.

Fine-tuning is NOT appropriate when:

  • The goal is to teach new facts; fine-tuning does not reliably inject factual knowledge and can worsen hallucination.
  • Prompt engineering achieves equivalent quality; fine-tuning is more expensive to maintain.
  • The dataset is small (fewer than roughly 100 examples); the model is likely to overfit rather than generalize.

Parameter-efficient fine-tuning methods (LoRA, QLoRA) update a small set of adapter weights rather than all model parameters, reducing VRAM and storage requirements.
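As a sketch of how an adapter is attached, the snippet below uses the Hugging Face peft library; the base model, rank, and target modules are illustrative assumptions. Only the adapter weights are trained, while the base model stays frozen, which is what keeps VRAM and storage requirements small.

# LoRA adapter sketch with the Hugging Face peft library (illustrative;
# base model, rank, and target_modules are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                            # adapter rank
    lora_alpha=16,                  # scaling for adapter updates
    lora_dropout=0.05,
    target_modules=["c_attn"],      # GPT-2's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # typically well under 1% of all weights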

Evaluation after fine-tuning requires a held-out test set from the same distribution as training. Use a golden set and an evaluation harness to measure improvement against the base model.
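A harness can be as simple as scoring the base and fine-tuned models on the same golden set, as in the sketch below; generate_answer and score_answer are hypothetical stand-ins for your own inference and grading code.

# Minimal evaluation-harness sketch (generate_answer and score_answer are
# hypothetical stand-ins for your inference and grading code).
import json

def evaluate(model_id, golden_path):
    scores = []
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)  # e.g. {"input": ..., "expected": ...}
            output = generate_answer(model_id, example["input"])
            scores.append(score_answer(output, example["expected"]))
    return sum(scores) / len(scores)

base_score = evaluate("base-model", "golden_set.jsonl")
tuned_score = evaluate("fine-tuned-model", "golden_set.jsonl")
print(f"base={base_score:.3f}  fine-tuned={tuned_score:.3f}")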

When it applies

Prefer few-shot prompting and system prompt engineering before committing to fine-tuning. Build an evaluation harness first so you can measure whether fine-tuning helps. Start with a small LoRA run on 100-500 examples to validate the signal before scaling.

Example

# OpenAI fine-tuning API (illustrative)
from openai import OpenAI
client = OpenAI()

# Upload the JSONL training file to the Files API.
with open("train.jsonl", "rb") as f:
    upload = client.files.create(file=f, purpose="fine-tune")

# Launch the fine-tuning job against a base model snapshot.
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18"
)
print(job.id)
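
Each line of train.jsonl is one chat-formatted example in the messages format the fine-tuning endpoint expects; the sketch below writes a single made-up record for an exact-output-format task.

# Writing train.jsonl in the chat "messages" format (the record below is a
# made-up illustration).
import json

record = {"messages": [
    {"role": "system", "content": "Answer in exactly three bullet points."},
    {"role": "user", "content": "Summarize the incident report."},
    {"role": "assistant", "content": "- Root cause: ...\n- Impact: ...\n- Fix: ..."},
]}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")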

Related terms

  • distillation - distillation uses a large model’s outputs as training data for a smaller one; fine-tuning typically uses human-labeled data.
  • evaluation-harness - always evaluate the fine-tuned model against the base model on a held-out test set.
  • golden-set - a curated, trusted set of examples that serves as the reference for measuring fine-tuning effectiveness.
  • few-shot-prompting - prefer few-shot prompting over fine-tuning when the task can be expressed as examples in the prompt.
  • hallucination - fine-tuning does not reliably reduce hallucination and can increase it for facts absent from the training data.

Citing this term

See Fine-Tuning (llmbestpractices.com/glossary/fine-tuning).