Overview

Few-shot prompting shows the model what good output looks like by example. It is a powerful tool when the rule is hard to state in prose and a waste of context when the rule is easy to state. The patterns below apply to Claude Sonnet 4.6, GPT-5, Gemini 2.5, and local models served via Ollama.

Use few-shot when the rule is easier to show than to say

Examples earn their token cost when the output format, tone, or boundary is fiddly. Skip them when a one-line rule does the job.

  • Good case: classifying user intents into ten categories where two of them differ only in a subtle phrasing tell.
  • Good case: matching an idiosyncratic markdown format used by a specific tool.
  • Good case: producing a tone or voice that resists description but is obvious from samples.
  • Bad case: “write a summary.” The rule is statable; examples just inflate the prompt.

If you can state the rule in one sentence and a model already follows it, do not pay for examples.

Three to five examples is the sweet spot

One example is fragile; the model fits to its incidentals. Two examples can contradict each other without showing the model which signal to generalize. Ten examples drown the new input.

  • Three to five examples covers the common case, an edge case, and a counter-case.
  • More than seven examples rarely helps and inflates input cost. Run an eval to confirm before paying for more.
  • If your task needs twenty examples, you are doing fine-tuning, not prompting. See evaluation for the eval-driven path that surfaces this trade-off.

Include negative examples for boundary cases

Positive examples show what to do. Negative examples show what not to do. The pairing is what teaches the model the boundary.

Input: "Refund my order #401."
Output: { "intent": "refund_request", "order_id": "401" }
 
Input: "How do refunds work?"
Output: { "intent": "policy_question", "order_id": null }
 
Input: "Can you refund order 401?"
Output: { "intent": "refund_request", "order_id": "401" }

The second example is the negative one: it looks like a refund request but is a policy question. Without it, the model collapses both into refund_request.

Diversity beats volume

Five diverse examples beat ten near-duplicates. The model picks up the shape from the spread, not the count.

  • Cover the easy case, the edge case, the adversarial case.
  • Vary input length, vocabulary, and structure across examples.
  • If three examples look almost identical, drop two; you are paying for one signal three times.

A useful test: shuffle your example set and read the inputs cold. If you can guess the output from the input alone, the example earns its slot. If two inputs feel identical, you do not need both.
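
A minimal sketch of that test in Python, assuming a hand-written example list; the 0.8 overlap threshold is a placeholder to tune against your own data:

import random
from itertools import combinations

# Hypothetical example set: (input, output) pairs you plan to include.
examples = [
    ("Refund my order #401.", '{"intent": "refund_request", "order_id": "401"}'),
    ("How do refunds work?", '{"intent": "policy_question", "order_id": null}'),
    ("Can you refund order 401?", '{"intent": "refund_request", "order_id": "401"}'),
]

# Cold-read test: shuffle and print the inputs only. If you can guess each
# output from the input alone, that example earns its slot.
shuffled = examples[:]
random.shuffle(shuffled)
for text, _ in shuffled:
    print(text)

# Near-duplicate check: crude token-overlap (Jaccard) between input pairs.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

for (a, _), (b, _) in combinations(examples, 2):
    if jaccard(a, b) > 0.8:
        print(f"Near-duplicate inputs, keep one: {a!r} / {b!r}")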

Order matters; put the strongest example last

Models exhibit recency bias inside the prompt. The last example carries more weight than the first.

  • Put the canonical, prototypical case last.
  • Put quirky or adversarial cases in the middle.
  • The first example anchors the format; the last example anchors the behavior.

In long contexts with Claude Sonnet 4.6 or Gemini 2.5, this effect weakens but does not vanish. With short-context models, ordering dominates.
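
A sketch of that ordering when assembling the prompt; the variable names are illustrative, not part of any API:

# The first example anchors the format, quirky cases sit in the middle, and
# the canonical case goes last so recency bias works in your favor.
anchor = ("How do refunds work?", '{"intent": "policy_question", "order_id": null}')
quirky = ("Refund? Actually, how long do refunds usually take?", '{"intent": "policy_question", "order_id": null}')
canonical = ("Refund my order #401.", '{"intent": "refund_request", "order_id": "401"}')

ordered = [anchor, quirky, canonical]
prompt = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in ordered)
print(prompt)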

Label inputs and outputs unambiguously

Labels that vary in case and format from example to example confuse the model. Use a consistent delimiter pattern across every example, and use it in the live prompt too.

<example>
<input>Refund my order #401.</input>
<output>{ "intent": "refund_request", "order_id": "401" }</output>
</example>

XML tags work well for Claude. Markdown headers or Input: / Output: prefixes work for most other models. Pick one and apply it to every example and to the user message; mixing styles invites the model to invent its own.
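
A sketch of applying one pattern everywhere, mirroring the XML tags above; the helper function is made up for illustration:

# One delimiter pattern for every example and for the live input.
def render_example(inp: str, out: str) -> str:
    return f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"

examples = [
    ("Refund my order #401.", '{"intent": "refund_request", "order_id": "401"}'),
    ("How do refunds work?", '{"intent": "policy_question", "order_id": null}'),
]

live_input = "Can I get my money back for order 88?"

prompt = "\n\n".join(render_example(i, o) for i, o in examples)
# The live input uses the same <input> tag, leaving <output> open for the
# model to complete.
prompt += f"\n\n<input>{live_input}</input>\n<output>"
print(prompt)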

Examples are local; rules are global

A few-shot example shapes one decision; a rule in the system prompt shapes every decision. When a rule contradicts an example, the model often follows the nearer signal (the example), not the global signal (the rule). Cross-check that examples obey the rules. See examples-vs-rules for when to reach for which tool.
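
A minimal cross-check sketch, assuming the global rule is "output is JSON with exactly the keys intent and order_id, where order_id is digits or null"; the rule here is illustrative, not taken from your system prompt:

import json

def obeys_rules(output: str) -> bool:
    # Rule 1: output must parse as JSON.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Rule 2: exactly these two keys, no extras.
    if set(data) != {"intent", "order_id"}:
        return False
    # Rule 3: order_id is a digit string or null.
    oid = data["order_id"]
    return oid is None or (isinstance(oid, str) and oid.isdigit())

examples = [
    ("Refund my order #401.", '{"intent": "refund_request", "order_id": "401"}'),
    ("How do refunds work?", '{"intent": "policy_question", "order_id": null}'),
]

for inp, out in examples:
    if not obeys_rules(out):
        print(f"Example contradicts the rules, fix or drop it: {inp!r}")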

Test that examples earn their slot

Few-shot is not free. Every example you include costs input tokens on every call and slows time-to-first-token. Run the eval set with and without each example; drop any that does not lift the score.

A baseline pass with zero examples is a cheap reference point. If zero-shot scores within a few points of three-shot, ship zero-shot.
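
A sketch of the ablation loop, assuming a score callable that wraps your own eval harness and model call; the 0.03 margin stands in for "within a few points" and is a placeholder, not a recommendation:

from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected_output)

def ablate(examples: Sequence[Example],
           eval_set: Sequence[Example],
           score: Callable[[Sequence[Example], Sequence[Example]], float],
           margin: float = 0.03) -> None:
    baseline = score([], eval_set)              # zero-shot reference point
    full = score(list(examples), eval_set)      # all examples included
    if full - baseline < margin:
        print("Zero-shot is within a few points; ship zero-shot")
    for i in range(len(examples)):
        without = score(list(examples[:i]) + list(examples[i + 1:]), eval_set)
        if without >= full:                     # this example did not lift the score
            print(f"Example {i} does not earn its slot; drop it")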