Overview

Chain-of-thought (CoT) prompting asks the model to write out its reasoning steps before the final answer. It tends to help when the task is multi-step and to hurt when the task is simple retrieval. With modern reasoning models, much of CoT’s value has moved inside the model. The rules below say when to reach for explicit CoT and when to let the model think on its own.

Use CoT when the task is multi-step reasoning

CoT lifts accuracy on math, multi-hop questions, symbolic manipulation, and planning. The model uses its own output as a scratchpad; intermediate tokens carry intermediate state.

  • Good case: word problems that require arithmetic across three sentences.
  • Good case: deciding which of five tools to call based on a chain of conditions.
  • Good case: extracting a value that depends on resolving entity references first.

If a human would need to write something down to solve the task, CoT helps. If a human would answer immediately, CoT is overhead.
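
Concretely, the first good case looks like this. A minimal sketch; the problem and the exact CoT wording are illustrative:

    # Hypothetical three-sentence word problem: the answer chains three facts.
    prompt = (
        "A crate holds 12 boxes. Each box holds 8 widgets. "
        "A third of the widgets are defective.\n"
        "How many working widgets are in the crate?\n\n"
        "Work through the problem step by step, then state the final answer."
    )
    # Expected scratchpad shape: 12 * 8 = 96 widgets, 96 / 3 = 32 defective,
    # 96 - 32 = 64 working. The intermediate lines carry the intermediate state.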

Skip CoT for simple retrieval and classification

CoT is wasted budget when the task is one-step lookup, format conversion, or classification with clear features.

  • Skip: “translate this sentence.”
  • Skip: “is this email spam?” (once the classifier prompt is tuned).
  • Skip: “return the user’s name from this JSON.”

For short tasks, CoT inflates latency and cost without lifting accuracy. The “think step by step” tail in a classification prompt sometimes lowers accuracy because the model invents reasons to second-guess a correct first impression.
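
In code, the use/skip split can be a single gate on task type. A sketch, with hypothetical task labels:

    # Only multi-step tasks get the scratchpad; retrieval and classification
    # go out unchanged to avoid the latency tax. Labels are illustrative.
    MULTI_STEP = {"word_problem", "tool_planning", "reference_resolution"}

    COT_TAIL = "\n\nWork through the problem step by step before answering."

    def build_prompt(task_type: str, question: str) -> str:
        return question + (COT_TAIL if task_type in MULTI_STEP else "")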

Prefer reasoning models over manual CoT

Claude Sonnet 4.6 extended thinking, GPT-5 reasoning, and Gemini 2.5 thinking modes run CoT internally, with training optimized for it. The internal scratchpad needs no parsing, does not pollute the visible output, and is generally stronger than what prompted CoT produces.

  • Use reasoning models for hard problems where the latency budget allows it.
  • Reach for prompted “think step by step” on non-reasoning models, or when the reasoning model’s thinking budget is too expensive for the task volume.
  • A small reasoning model often beats a large non-reasoning model on multi-step tasks at the same dollar cost.
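
With the Anthropic API, extended thinking is a per-request flag. A minimal sketch; the model ID and token budgets are placeholders, so check the current docs for exact values:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use your current reasoning model
        max_tokens=16000,           # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 8000},  # cap internal CoT spend
        messages=[{"role": "user", "content": "your multi-step question here"}],
    )

    # Thinking arrives in separate blocks; render only the text blocks to users.
    answer = "".join(b.text for b in response.content if b.type == "text")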

Anchor the scratchpad with explicit structure

When you do prompt CoT, do not leave it freeform. Give the model named sections so the reasoning shape is consistent.

Before answering, write:
<reasoning>
Step 1: ...
Step 2: ...
Step 3: ...
</reasoning>
<answer>...</answer>

Named sections beat “let’s think this through.” They make the output parseable and let evals score the reasoning trace separately from the answer.
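
Because the sections are named, the trace can be split mechanically. A sketch:

    import re

    def split_trace(raw: str) -> tuple[str, str]:
        """Separate the reasoning trace from the answer for logging and scoring."""
        reasoning = re.search(r"<reasoning>(.*?)</reasoning>", raw, re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL)
        if answer is None:
            raise ValueError("model did not emit an <answer> block")
        return (
            reasoning.group(1).strip() if reasoning else "",
            answer.group(1).strip(),
        )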

Hide the scratchpad with structured output

Callers do not want to see the scratchpad. Put the reasoning in a field the API consumer can ignore, or use the model’s internal thinking mode that the API hides by default.

  • Tool-use schema with a reasoning field plus an answer field. The client reads answer; the log captures reasoning for debugging.
  • Anthropic extended thinking returns thinking blocks separately from the visible response. The API client renders only the response by default.
  • See structured-output for the JSON-schema-as-contract pattern.

Never let a chat-style “Here is my thinking…” preamble leak into a user-facing response.
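
One way to implement the first pattern is a response schema with both fields; the field names are illustrative, and any provider structured-output mode can enforce it:

    RESPONSE_SCHEMA = {
        "type": "object",
        "properties": {
            "reasoning": {
                "type": "string",
                "description": "Step-by-step working. Logged, never shown to users.",
            },
            "answer": {
                "type": "string",
                "description": "Final answer only. This is what the client renders.",
            },
        },
        "required": ["reasoning", "answer"],
    }

Order the fields with reasoning first so the model writes the scratchpad before the answer, assuming the provider preserves property order during generation.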

CoT can degrade short-answer accuracy

On simple multiple-choice or factual recall, CoT sometimes lowers accuracy because the model rationalizes itself into a wrong answer. Run the eval set with and without CoT before assuming it helps.

  • If pass@1 drops when you add “think step by step,” remove it.
  • If pass@1 rises, keep it and measure the latency tax.

A CoT trace that lifts accuracy on adversarial slices but drops it on easy slices is a real trade-off. Pick by which slice matters more. See evaluation.
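
The with/without comparison is a two-arm run over the same eval set. A sketch, assuming an ask(question) -> answer callable you supply:

    def pass_at_1(ask, dataset, cot: bool) -> float:
        """Fraction of items answered correctly on the first sample."""
        tail = "\n\nThink step by step before answering." if cot else ""
        hits = sum(
            1 for item in dataset
            if ask(item["question"] + tail) == item["gold"]
        )
        return hits / len(dataset)

    # Run both arms before shipping either prompt:
    #   baseline = pass_at_1(ask, dataset, cot=False)
    #   with_cot = pass_at_1(ask, dataset, cot=True)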

Self-consistency for high-stakes answers

When correctness matters more than latency, sample the model k times with non-zero temperature and majority-vote the answers. This is cheaper than fine-tuning and more reliable than a single CoT pass.

  • k = 5 is a common starting point.
  • Vote on the final answer, not the reasoning trace.
  • The cost is k times the per-call cost; budget accordingly. See cost-control.
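
A minimal voting loop; ask(question, temperature) is an assumed callable that returns only the parsed final answer, so the vote never touches the reasoning trace:

    from collections import Counter

    def self_consistent_answer(ask, question: str, k: int = 5) -> str:
        """Sample k independent passes and majority-vote the final answers."""
        # Normalize before voting so formatting differences don't split votes.
        votes = Counter(ask(question, temperature=0.7).strip() for _ in range(k))
        return votes.most_common(1)[0][0]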

Log the reasoning trace for debugging

A wrong answer with a visible reasoning trace tells you where the model went wrong. A wrong answer with no trace tells you nothing.

  • Store the reasoning field with the request log.
  • Use it as eval input: a judge can rate “did the reasoning support the answer?” separately from “was the answer correct?”

The scratchpad is a debugging artifact even when the user never sees it.
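
A minimal JSONL log-entry sketch; the field names are illustrative:

    import json
    import time

    def log_trace(logf, request_id: str, question: str, reasoning: str, answer: str) -> None:
        """Append the full trace to the request log; users only ever see answer."""
        logf.write(json.dumps({
            "ts": time.time(),
            "request_id": request_id,
            "question": question,
            "reasoning": reasoning,  # debugging artifact and judge-eval input
            "answer": answer,
        }) + "\n")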