Prompting reasoning models

Overview

Reasoning models (Claude Sonnet/Opus thinking, OpenAI o-series, DeepSeek R1, Gemini Thinking) run an internal chain of thought before they answer. Prompting them like a chat model wastes their strength and degrades their output. Strip the scaffolding, sharpen the instruction, and ask for the final answer in a structured form.

Do not instruct the model to think step by step

Reasoning models already think step by step. Adding “think step by step” or “show your work” duplicates that behavior in the visible output and crowds out the actual answer. The reasoning happens in hidden reasoning tokens; the visible response should be the conclusion.

Bad:  Think step by step, then explain your reasoning, then give
      the final answer.

Good: What is the optimal index strategy for this query plan?

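The raw reasoning is billed whether or not you can see it. OpenAI's API does not return the raw reasoning tokens. Anthropic returns thinking content blocks (summarized on some models) when extended thinking is enabled, and some other providers expose a reasoning field for debugging. Do not parse any of this for application logic; its format is not stable across model versions.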
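The rule in this section can be checked mechanically before a prompt ships. The sketch below is a minimal prompt lint, assuming a hand-maintained phrase list; `SCAFFOLDING_PATTERNS` and `find_scaffolding` are hypothetical names, not part of any provider SDK.

```python
import re

# Hypothetical phrase list; extend it with the scaffolding your own prompts use.
SCAFFOLDING_PATTERNS = [
    r"think\s+step[\s-]+by[\s-]+step",
    r"show\s+your\s+work",
    r"explain\s+your\s+reasoning",
]


def find_scaffolding(prompt: str) -> list[str]:
    """Return the scaffolding phrases found in a prompt, in pattern order."""
    hits = []
    for pattern in SCAFFOLDING_PATTERNS:
        match = re.search(pattern, prompt, flags=re.IGNORECASE)
        if match:
            hits.append(match.group(0))
    return hits
```

Run on the two prompts above, the "Bad" prompt would yield two hits and the "Good" prompt none. A lint like this only catches known phrasings, so treat an empty result as a weak signal.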

Do not include chain-of-thought examples

Few-shot examples that walk through “Step 1… Step 2… Step 3… Answer:” add noise to a reasoning model. The model treats the example chain as a constraint on its own output format and emits a visible chain instead of a concise answer. See chain-of-thought for the contrast with chat models, where CoT examples help.

If you want a worked example, show only the input and the final answer. Skip the intermediate reasoning.

Bad:  Input: 23 * 47
      Step 1: 23 * 40 = 920
      Step 2: 23 * 7 = 161
      Step 3: 920 + 161 = 1081
      Answer: 1081

Good: Input: 23 * 47
      Answer: 1081
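If existing few-shot examples already contain worked steps, they can be reduced to input and answer pairs automatically. The sketch below assumes each example uses the "Input:" and "Answer:" labels shown above; `strip_reasoning` is a hypothetical helper name.

```python
def strip_reasoning(example: str) -> str:
    """Keep only the 'Input:' and 'Answer:' lines of a worked example."""
    kept = [
        line.strip()
        for line in example.splitlines()
        # Assumes the labels used in this section; other lines are dropped.
        if line.strip().startswith(("Input:", "Answer:"))
    ]
    return "\n".join(kept)
```

Applied to the "Bad" example, this should leave exactly the two lines of the "Good" example.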

Lean on instruction clarity, not example count

Chat models benefit from three to five few-shot examples. Reasoning models do better with one example or none and a sharper instruction.

The rule: every example you remove from a reasoning-model prompt is one fewer constraint on the model’s internal reasoning. Replace example density with instruction precision.

Reply in this exact format:
{"answer": <integer>, "unit": "<currency_code>"}

Use ISO 4217 currency codes. Reply {"answer": null, "unit": null}
if the amount or currency is missing.

The schema and the escape hatch carry the contract that examples used to carry. See output-constraints.
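On the application side, the same contract can be enforced when the reply comes back. The following is a minimal sketch of a parser for the format above; `parse_amount_reply` is a hypothetical name, and the currency check only tests the three-uppercase-letter shape, not membership in the ISO 4217 list.

```python
import json
import re
from typing import Optional


def parse_amount_reply(reply: str) -> Optional[tuple[int, str]]:
    """Parse {"answer": <integer>, "unit": "<currency_code>"}.

    Returns (answer, unit) on success and None for the escape hatch
    {"answer": null, "unit": null}. Raises ValueError for anything else.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"answer", "unit"}:
        raise ValueError("reply must have exactly the keys 'answer' and 'unit'")
    answer, unit = data["answer"], data["unit"]
    if answer is None and unit is None:
        return None  # escape hatch: amount or currency was missing
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(answer, bool) or not isinstance(answer, int):
        raise ValueError("'answer' must be an integer")
    # Shape check only; this does not confirm the code exists in ISO 4217.
    if not isinstance(unit, str) or not re.fullmatch(r"[A-Z]{3}", unit):
        raise ValueError("'unit' must be a three-letter uppercase code")
    return answer, unit
```

Raising on malformed replies, rather than guessing, keeps format failures visible in the eval set instead of hiding them as wrong answers.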

Use higher max-tokens

Reasoning models spend tokens on hidden reasoning before they write the visible answer. A max_tokens of 1024 may be entirely consumed by reasoning, leaving zero tokens for the visible response.

Practical ceilings as of 2026:

Claude thinking models: budget 16k to 64k tokens for hard problems, 4k for easy ones. Set the budget with thinking.budget_tokens; it counts toward max_tokens, so max_tokens must be larger than the budget.
OpenAI o-series: max_completion_tokens covers both reasoning and visible output. Set to 16k or higher for non-trivial problems.
DeepSeek R1 and Gemini Thinking: similar dynamics; consult the current docs.

Underspecifying the budget is the most common reasoning-model failure mode in production. The response is cut off mid-thought without raising an error; the only signal is the stop reason in the response.
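The budgeting arithmetic and the truncation check can be sketched without calling any API. The example below assumes that, on the Anthropic Messages API, max_tokens must cover the thinking budget plus the visible answer and that the minimum thinking budget is 1024 tokens; it also assumes responses have been converted to plain dicts. The function names are hypothetical, and the field values (`stop_reason == "max_tokens"` for Anthropic, `finish_reason == "length"` for OpenAI Chat Completions) should be verified against current provider docs.

```python
def anthropic_token_params(thinking_budget: int, answer_allowance: int) -> dict:
    """Build the token-related request fields for a Claude thinking model."""
    if thinking_budget < 1024:
        # Assumed documented minimum; check the current Anthropic docs.
        raise ValueError("thinking budget is below the 1024-token minimum")
    if answer_allowance <= 0:
        raise ValueError("leave room for the visible answer")
    return {
        # max_tokens covers hidden reasoning plus the visible response.
        "max_tokens": thinking_budget + answer_allowance,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
    }


def hit_token_cap(response: dict) -> bool:
    """True if a response dict reports that it stopped at the token cap."""
    # Anthropic Messages API: top-level stop_reason.
    if response.get("stop_reason") == "max_tokens":
        return True
    # OpenAI Chat Completions: per-choice finish_reason.
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )
```

Logging `hit_token_cap` for every call is a cheap way to notice an undersized budget before users do.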

Ask for the final answer in a structured form

The model’s reasoning is internal. The visible output should be the conclusion only, in a form the application can parse.

<instructions>
Solve the optimization problem in the user input. Reply with this
JSON object and nothing else:

{
  "decision": "string",
  "expected_value": <number>,
  "confidence": <number between 0 and 1>
}
</instructions>

For high-stakes outputs, pair the structured reply with provider-side structured output (JSON schema mode on OpenAI, tool use on Anthropic). See structured-output.
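Even with provider-side structured output, a local check on the reply is cheap insurance. The sketch below validates the JSON object requested above using only the standard library; `parse_decision` is a hypothetical name, and the rules (exact key set, numeric types, confidence in [0, 1]) are read directly from the prompt's schema.

```python
import json

EXPECTED_KEYS = {"decision", "expected_value", "confidence"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_decision(reply: str) -> dict:
    """Parse and validate the decision object; raise ValueError if malformed."""
    data = json.loads(reply)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError(f"reply must have exactly the keys {sorted(EXPECTED_KEYS)}")
    if not isinstance(data["decision"], str):
        raise ValueError("'decision' must be a string")
    if not _is_number(data["expected_value"]):
        raise ValueError("'expected_value' must be a number")
    if not _is_number(data["confidence"]) or not 0 <= data["confidence"] <= 1:
        raise ValueError("'confidence' must be a number between 0 and 1")
    return data
```

A reply that wraps the object in extra prose fails at `json.loads`, which is the intended behavior given the "and nothing else" instruction.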

Skip temperature tuning for most cases

Reasoning models are less sensitive to temperature than chat models because the heavy lifting happens inside the reasoning budget at a fixed sampling regime. Default to temperature 1.0 (or whatever the provider recommends) and tune only the prompt.

Some providers (OpenAI o-series) lock temperature, and may reject or ignore any other value. Do not waste cycles tuning a parameter the model does not honor.
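One way to keep this out of application code is to decide per model family whether temperature is sent at all. The sketch below is illustrative only: the family labels and the `TEMPERATURE_TUNABLE` table are hypothetical, and the o-series entry simply mirrors the statement above rather than any verified provider behavior.

```python
from typing import Optional

# Hypothetical family labels, not provider identifiers. Verify every entry
# against the provider's current documentation before relying on it.
TEMPERATURE_TUNABLE = {
    "chat": True,
    "openai-o-series": False,
}


def sampling_params(family: str, temperature: Optional[float] = None) -> dict:
    """Return sampling fields to merge into a request for the given family."""
    if temperature is None or not TEMPERATURE_TUNABLE.get(family, False):
        # Omit the field so the provider default applies.
        return {}
    return {"temperature": temperature}
```

Unknown families default to omitting the field, which is the conservative choice when a provider might reject the parameter.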

When to use a reasoning model vs a chat model

Reasoning models cost more and respond more slowly. Use them when the task warrants the bill.

Reach for a reasoning model when:

The task is multi-step with a verifiable final answer (math, code synthesis, theorem proving, structured planning).
The task has multiple interacting constraints (scheduling, optimization, complex SQL generation).
The chat-model output passes the eval set 60 to 80 percent of the time and the failures are reasoning errors, not formatting errors.
Latency tolerance is over five seconds.

Stay with a chat model when:

The task is single-step (extraction, classification, summarization, translation).
Latency must be under two seconds.
Cost per call matters more than ceiling quality.
The bottleneck is retrieval quality, not reasoning quality. See rag.

A hybrid pattern works well: use a chat model for the surrounding chain and call a reasoning model on the one step that needs it. See prompt-chaining.
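The checklist above can be encoded as a small router that runs once per step in a chain. This is a sketch under simplifying assumptions: the `Task` fields and `pick_model` are hypothetical, only the two-second and five-second latency thresholds come from the lists above, and the eval-pass-rate criterion is left out because it needs offline data.

```python
from dataclasses import dataclass


@dataclass
class Task:
    multi_step: bool              # multi-step or multi-constraint problem
    latency_budget_s: float       # how long the caller can wait
    retrieval_bound: bool = False  # bottleneck is retrieval, not reasoning
    cost_sensitive: bool = False   # cost per call outweighs ceiling quality


def pick_model(task: Task) -> str:
    """Return 'reasoning' or 'chat' for one step of a chain."""
    # Any "stay with a chat model" condition wins outright.
    if task.latency_budget_s < 2 or task.retrieval_bound or task.cost_sensitive:
        return "chat"
    # Reasoning models need a multi-step task and more than five seconds.
    if task.multi_step and task.latency_budget_s > 5:
        return "reasoning"
    return "chat"
```

In a hybrid chain, extraction and formatting steps would route to "chat" and only the planning or optimization step to "reasoning".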

Pitfalls

Stuffing the prompt with “think step by step” and chain-of-thought examples. The model wastes its budget mimicking the scaffolding.
Setting max_tokens to a chat-model ceiling (1024-4096). The reasoning budget exhausts the cap.
Tuning temperature on a model that ignores it.
Using a reasoning model for high-throughput extraction. Cost and latency both blow up; the chat model handles it.
