Overview
Reasoning models (Claude Sonnet/Opus thinking, OpenAI o-series, DeepSeek R1, Gemini Thinking) run an internal chain of thought before they answer. Prompting them like a chat model wastes their strength and degrades their output. Strip the scaffolding, sharpen the instruction, and ask for the final answer in a structured form.
Do not instruct the model to think step by step
Reasoning models already think step by step. Adding “think step by step” or “show your work” pushes that behavior into the visible output, where it crowds out the actual answer. The reasoning belongs in the hidden reasoning tokens; the visible response should be the conclusion.
Bad: Think step by step, then explain your reasoning, then give
the final answer.
Good: What is the optimal index strategy for this query plan?
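Hidden reasoning tokens are billed whether or not they are returned. OpenAI's API does not return the raw reasoning by default. Some providers expose it in a separate field or content block (Anthropic returns thinking blocks when extended thinking is enabled), which is useful for debugging. Do not parse it for application logic; it is not stable across model versions.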
Do not include chain-of-thought examples
Few-shot examples that walk through “Step 1… Step 2… Step 3… Answer:” add noise to a reasoning model. The model treats the example chain as a constraint on its own output format and emits a visible chain instead of a concise answer. See chain-of-thought for the contrast with chat models, where CoT examples help.
If you want a worked example, show only the input and the final answer. Skip the intermediate reasoning.
Bad: Input: 23 * 47
Step 1: 23 * 40 = 920
Step 2: 23 * 7 = 161
Step 3: 920 + 161 = 1081
Answer: 1081
Good: Input: 23 * 47
Answer: 1081
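If worked examples are stored together with their intermediate steps, one option is to drop the steps when the prompt is assembled. The following is a minimal sketch of that idea; the `Example` shape and the `render_examples` helper are hypothetical names for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """A stored worked example (hypothetical shape, for illustration only)."""
    input: str
    answer: str
    steps: list[str] = field(default_factory=list)  # intermediate reasoning, if any


def render_examples(examples: list[Example]) -> str:
    """Render examples as input/answer pairs, omitting intermediate steps."""
    blocks = []
    for ex in examples:
        # ex.steps is deliberately ignored so the prompt shows no visible chain.
        blocks.append(f"Input: {ex.input}\nAnswer: {ex.answer}")
    return "\n\n".join(blocks)
```

The steps stay in storage, so the same example set can still be rendered with its chain for a chat model.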
Lean on instruction clarity, not example count
Chat models benefit from three to five few-shot examples. Reasoning models do better with one example or none and a sharper instruction.
The rule: every example you remove from a reasoning-model prompt is one fewer constraint on the model’s internal reasoning. Replace example density with instruction precision.
Reply in this exact format:
{"answer": <integer>, "unit": "<currency_code>"}
Use ISO 4217 currency codes. Reply {"answer": null, "unit": null}
if the amount or currency is missing.
The schema and the escape hatch carry the contract that examples used to carry. See output-constraints.
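On the application side, a contract like this is only useful if the reply is checked against it. A minimal sketch of such a check follows, assuming the reply arrives as a plain string; `parse_amount_reply` is a hypothetical helper, and the regular expression only checks the three-letter shape of the code, not membership in the ISO 4217 list.

```python
import json
import re
from typing import Optional, Tuple

# Shape check only: three uppercase letters. Not a full ISO 4217 lookup.
CURRENCY_CODE = re.compile(r"[A-Z]{3}")


def parse_amount_reply(text: str) -> Optional[Tuple[int, str]]:
    """Return (answer, unit), or None for the escape-hatch reply.

    Raises ValueError when the reply breaks the contract.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"answer", "unit"}:
        raise ValueError("reply must be an object with exactly 'answer' and 'unit'")
    answer, unit = data["answer"], data["unit"]
    if answer is None and unit is None:
        return None  # the model reported a missing amount or currency
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(answer, int) or isinstance(answer, bool):
        raise ValueError("'answer' must be an integer")
    if not isinstance(unit, str) or not CURRENCY_CODE.fullmatch(unit):
        raise ValueError("'unit' must be a three-letter uppercase code")
    return answer, unit
```

Returning `None` for the escape hatch and raising for everything else keeps "the model could not answer" separate from "the model broke the format".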
Use a higher max_tokens
Reasoning models spend tokens on hidden reasoning before they write the visible answer, and those tokens count against the output cap. A max_tokens of 1024 may be entirely consumed by reasoning, leaving zero tokens for the visible response.
Practical ceilings as of 2026:
- Claude thinking models: budget 16k to 64k tokens for hard problems, 4k for easy ones. The thinking.budget_tokens parameter is separate from max_tokens.
- OpenAI o-series: max_completion_tokens covers both reasoning and visible output. Set to 16k or higher for non-trivial problems.
- DeepSeek R1 and Gemini Thinking: similar dynamics; consult the current docs.
Underspecifying the budget is the most common reasoning-model failure mode in production. The model silently truncates mid-thought.
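The budgeting advice above can be sketched as two small helpers. Both are hypothetical: `plan_token_limits` assumes an Anthropic-style request in which the output cap has to cover the reasoning budget plus the visible answer, and the 2048-token reserve is an illustrative default, not a recommendation from any provider. `looks_truncated` assumes the provider reports the stop condition as `max_tokens` or `length`; check the current docs for the exact field and values.

```python
def plan_token_limits(reasoning_budget: int, visible_reserve: int = 2048) -> dict:
    """Return request parameters that leave room for the visible answer.

    Assumption: the output cap must cover reasoning plus visible tokens,
    so max_tokens is set above the reasoning budget.
    """
    if reasoning_budget <= 0 or visible_reserve <= 0:
        raise ValueError("budgets must be positive")
    return {
        "max_tokens": reasoning_budget + visible_reserve,
        # Assumed request shape for the thinking.budget_tokens parameter.
        "thinking": {"type": "enabled", "budget_tokens": reasoning_budget},
    }


def looks_truncated(stop_reason: str, visible_text: str) -> bool:
    """Flag replies that hit the token cap or came back with no visible text."""
    hit_cap = stop_reason in {"max_tokens", "length"}  # assumed stop-reason values
    return hit_cap or not visible_text.strip()
```

Logging `looks_truncated` on every call turns a silent failure into a visible metric, which makes it easier to tell when the cap needs raising.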
Ask for the final answer in a structured form
The model’s reasoning is internal. The visible output should be the conclusion only, in a form the application can parse.
<instructions>
Solve the optimization problem in the user input. Reply with this
JSON object and nothing else:
{
  "decision": "<string>",
  "expected_value": <number>,
  "confidence": <number between 0 and 1>
}
</instructions>
For high-stakes outputs, pair the structured reply with provider-side structured output (JSON schema mode on OpenAI, tool use on Anthropic). See structured-output.
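Even with provider-side structured output, a range check such as "confidence between 0 and 1" is worth enforcing in application code. A minimal sketch, assuming the reply is the bare JSON object requested in the instructions; `Decision` and `parse_decision` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    decision: str
    expected_value: float
    confidence: float


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_decision(text: str) -> Decision:
    """Parse and validate the reply; raise ValueError on any contract violation."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    expected = {"decision", "expected_value", "confidence"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    if not isinstance(data["decision"], str):
        raise ValueError("'decision' must be a string")
    if not _is_number(data["expected_value"]):
        raise ValueError("'expected_value' must be a number")
    confidence = data["confidence"]
    if not _is_number(confidence) or not 0 <= confidence <= 1:
        raise ValueError("'confidence' must be a number between 0 and 1")
    return Decision(data["decision"], float(data["expected_value"]), float(confidence))
```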
Skip temperature tuning for most cases
Reasoning models are less sensitive to temperature than chat models because the heavy lifting happens inside the reasoning budget at a fixed sampling regime. Default to temperature 1.0 (or whatever the provider recommends) and tune only the prompt.
Some providers (OpenAI o-series) lock or ignore temperature. Do not waste cycles tuning a parameter the model ignores.
When to use a reasoning model vs a chat model
Reasoning models cost more and respond more slowly. Use them when the task warrants the bill.
Reach for a reasoning model when:
- The task is multi-step with a verifiable final answer (math, code synthesis, theorem proving, structured planning).
- The task has multiple interacting constraints (scheduling, optimization, complex SQL generation).
- The chat-model output passes the eval set 60 to 80 percent of the time and the failures are reasoning errors, not formatting errors.
- Latency tolerance is over five seconds.
Stay with a chat model when:
- The task is single-step (extraction, classification, summarization, translation).
- Latency must be under two seconds.
- Cost per call matters more than ceiling quality.
- The bottleneck is retrieval quality, not reasoning quality. See rag.
A hybrid pattern works well: use a chat model for the surrounding chain and call a reasoning model on the one step that needs it. See prompt-chaining.
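The checklist above can be sketched as a per-step routing function, which is also how the hybrid pattern decides where the reasoning model gets called. This is an illustration, not a definitive policy: the `Task` fields and `choose_model_class` are hypothetical, and letting the "stay with a chat model" conditions take precedence is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Coarse description of one step in a chain (hypothetical fields)."""
    multi_step: bool               # multi-step or multi-constraint work
    latency_budget_s: float        # how long the caller can wait, in seconds
    retrieval_bound: bool = False  # bottleneck is retrieval, not reasoning
    cost_sensitive: bool = False   # cost per call outweighs ceiling quality


def choose_model_class(task: Task) -> str:
    """Return "reasoning" or "chat" using the thresholds from the checklist."""
    # Assumed precedence: any reason to stay with a chat model wins.
    if task.latency_budget_s < 2 or task.retrieval_bound or task.cost_sensitive:
        return "chat"
    if task.multi_step and task.latency_budget_s > 5:
        return "reasoning"
    # Single-step work, or a latency budget in the 2 to 5 second gap.
    return "chat"
```

In a chain, each step is routed separately, so extraction and formatting steps stay on the chat model while the one planning or optimization step goes to the reasoning model.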
Pitfalls
- Stuffing the prompt with “think step by step” and chain-of-thought examples. The model wastes its budget mimicking the scaffolding.
- Setting max_tokens to a chat-model ceiling (1024-4096). The reasoning budget exhausts the cap.
- Tuning temperature on a model that ignores it.
- Using a reasoning model for high-throughput extraction. Cost and latency both blow up; the chat model handles it.