Overview

A prompt chain is a sequence of narrow LLM calls where each step has one job and passes a small, typed payload to the next step. Chain when one prompt would mix tasks, overload context, or hide the failure point. The chain itself is the state; the model stays stateless and replaceable.

Chain when a single prompt would do more than one thing

A prompt that asks the model to extract, classify, and rewrite in one shot can fail at any of the three tasks, and a bad output does not tell you which one failed. Split the work.

Symptoms that signal it is time to chain:

  • The instructions list more than one verb (extract and classify and summarize).
  • The output schema has unrelated fields glued together.
  • The eval fails at the same step every time, but the prompt change that fixes it breaks the other steps.
  • The context window is mostly retrieved documents and the answer is buried.

The fix is one prompt per verb. Each prompt has its own template and its own entry in the eval set.
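
A minimal sketch of "one template and one eval entry per verb", assuming three steps. The template text, the eval cases, and the helper names (`TEMPLATES`, `EVAL_SET`, `render`, `steps_without_evals`) are hypothetical and only illustrate the layout.

```python
from string import Template

# Hypothetical per-step templates: one verb per prompt.
TEMPLATES = {
    "extract": Template("Extract the entities from this message as a JSON list:\n$message"),
    "classify": Template("Classify the intent of each entity. Entities:\n$entities"),
    "summarize": Template("Summarize the message in one sentence:\n$message"),
}

# Hypothetical eval set: each step has its own cases, keyed by step name.
EVAL_SET = {
    "extract": [
        {"inputs": {"message": "Refund order 1234"}, "expected": ["order 1234"]},
    ],
    "classify": [
        {"inputs": {"entities": '["order 1234"]'},
         "expected": [{"entity": "order 1234", "intent": "refund"}]},
    ],
    "summarize": [
        {"inputs": {"message": "Refund order 1234"},
         "expected": "Customer wants a refund."},
    ],
}


def render(step: str, **fields: str) -> str:
    """Fill the template for a single step; raises KeyError on a missing field."""
    return TEMPLATES[step].substitute(**fields)


def steps_without_evals() -> list[str]:
    """Return the steps that have a template but no eval cases."""
    return [name for name in TEMPLATES if not EVAL_SET.get(name)]
```

Keying both tables by step name makes it cheap to check that no step ships without eval coverage.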

Give each step one job

A step that does one thing has a measurable success criterion and a swappable implementation. Two canonical patterns:

extract -> classify -> respond

Step 1: extract entities from the message. Output: JSON list.
Step 2: classify each entity's intent. Output: JSON list of {entity, intent}.
Step 3: draft a reply that addresses each intent. Output: string.

query -> retrieve -> ground -> answer

Step 1: rewrite the user query into a search query. Output: string.
Step 2: retrieve top-k documents. Output: list of {id, text} (not an LLM call).
Step 3: rank documents by relevance to the original question (the grounding step). Output: list of ids.
Step 4: answer the question using only the ranked documents. Output: string.

Step 2 in the second example is a database call, not a model call. Chains mix LLM and non-LLM steps freely. See rag.
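
The second pattern might be wired up as below. This is a sketch under stated assumptions: `llm` is any callable that takes a prompt and returns text, `CORPUS` is a toy in-memory store, and `retrieve` uses naive keyword overlap in place of a real search backend. The prompt wording is illustrative only.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns completion text

# Toy in-memory corpus standing in for a real document store.
CORPUS = [
    {"id": "d1", "text": "Refunds are issued within 5 business days."},
    {"id": "d2", "text": "Passwords can be reset from the login page."},
]


def retrieve(search_query: str, k: int = 2) -> list[dict]:
    """Step 2: a non-LLM step. Scores documents by naive keyword overlap."""
    terms = set(search_query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: -len(terms & set(doc["text"].lower().split())),
    )
    return ranked[:k]


def answer_question(question: str, llm: LLM) -> str:
    # Step 1: rewrite the user query into a search query (LLM).
    search_query = llm(f"Rewrite as a search query: {question}")

    # Step 2: retrieve top-k documents (plain function call, no model).
    docs = retrieve(search_query)

    # Step 3: rank documents against the ORIGINAL question (LLM).
    listing = "\n".join(f"{d['id']}: {d['text']}" for d in docs)
    ranked = llm(
        f"Question: {question}\n"
        f"Rank these ids by relevance, comma-separated:\n{listing}"
    )
    ranked_ids = [item.strip() for item in ranked.split(",")]

    # Step 4: answer using only the ranked documents (LLM).
    by_id = {d["id"]: d["text"] for d in docs}
    context = "\n".join(by_id[i] for i in ranked_ids if i in by_id)
    return llm(
        f"Answer using only these documents:\n{context}\n\nQuestion: {question}"
    )
```

Passing `llm` in as a parameter keeps each step testable with a stub and lets the model be swapped without touching the chain.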

Pass only the relevant subset of the previous step’s output

The default failure mode of chaining is shoveling the entire previous response into the next prompt. The next model now wades through noise that was not its job to produce.

Trim aggressively between steps. If step 1 returned {entities: [...], rationale: "...", debug: {...}}, step 2 receives only {entities: [...]}. The rationale and debug fields stay in the chain’s log for evals; they do not enter the next prompt.

A useful test: if you cannot write the input schema for step N+1 in three lines, the previous step is leaking too much.
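
The trimming described above can be sketched in a few lines. The helper name `trim_for_next_step`, the sample payload, and the prompt wording are assumptions made for illustration.

```python
import json


def trim_for_next_step(step_output: dict, keep: tuple[str, ...]) -> dict:
    """Return only the fields that the next step's input schema names."""
    return {key: step_output[key] for key in keep}


# Example step 1 output with extra fields the next step does not need.
step1_output = {
    "entities": ["order 1234", "refund"],
    "rationale": "The user mentions an order number and asks for money back.",
    "debug": {"tokens": 212},
}

# The full output stays in the chain's log for evals.
chain_log = [{"step": "extract", "output": step1_output}]

# Only the entities enter the next prompt.
step2_input = trim_for_next_step(step1_output, keep=("entities",))
step2_prompt = "Classify the intent of each entity:\n" + json.dumps(step2_input)
```

Declaring `keep` explicitly doubles as the input schema for the next step: if the tuple grows long, the previous step is probably leaking.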

Keep the router or dispatcher prompt thin

The first step in many chains is a router that picks which branch to run. The router should do one thing: classify the input into a fixed set of route labels.

<instructions>
Classify the user's message into one of:
- billing
- technical
- account
- other

Reply with the label and nothing else.
</instructions>

A thin router has three properties: a fixed output vocabulary, no business logic, and no attempt to also start solving the routed task. The branch prompt does the work. The router’s only job is to pick the branch.
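
One way to enforce the fixed vocabulary in code is to validate the router's reply before dispatching. In this sketch, `llm` is a hypothetical prompt-to-text callable, and falling back to `other` on an unexpected reply is an assumption, not something the router prompt itself guarantees.

```python
from typing import Callable

ROUTES = ("billing", "technical", "account", "other")

ROUTER_PROMPT = """<instructions>
Classify the user's message into one of:
- billing
- technical
- account
- other

Reply with the label and nothing else.
</instructions>

{message}"""


def route(message: str, llm: Callable[[str], str]) -> str:
    """Ask the router for a label; coerce anything off-vocabulary to 'other'."""
    label = llm(ROUTER_PROMPT.format(message=message)).strip().lower()
    return label if label in ROUTES else "other"


def dispatch(
    message: str,
    llm: Callable[[str], str],
    branches: dict[str, Callable[[str], str]],
) -> str:
    """Pick the branch with the router, then let the branch do the work."""
    return branches[route(message, llm)](message)
```

The router function contains no business logic: it only maps a message to a label, and every label has exactly one branch handler.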

The chain holds the state, the model does not

Stateful behavior across turns (user history, tool results, intermediate facts) lives in the chain’s orchestration code, not in the model’s context. The orchestrator decides what to put in the next prompt, what to keep in memory, and what to discard.

A practical implication: do not pass the full conversation history into every step. Pass the slice that is relevant to the current step. The user’s name and tier go into the billing step’s context; the embedded support article goes into the technical step’s context. Each step sees the minimum it needs.
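
A sketch of an orchestrator that holds the state and hands each step only its slice. The `ChainState` fields and the `STEP_CONTEXT` mapping are hypothetical names chosen to mirror the billing and technical examples above.

```python
from dataclasses import dataclass, field


@dataclass
class ChainState:
    """Everything the chain knows. Lives in the orchestrator, never sent wholesale."""
    user_name: str
    user_tier: str
    history: list[str] = field(default_factory=list)
    support_article: str = ""


# Hypothetical mapping: which state fields each step is allowed to see.
STEP_CONTEXT = {
    "billing": ("user_name", "user_tier"),
    "technical": ("support_article",),
}


def context_for(step: str, state: ChainState) -> dict:
    """Build the minimal context for one step from the full chain state."""
    return {name: getattr(state, name) for name in STEP_CONTEXT[step]}


state = ChainState(
    user_name="Ada",
    user_tier="pro",
    history=["hi", "my invoice is wrong"],
    support_article="Restart the router, then re-run setup.",
)
billing_context = context_for("billing", state)      # name and tier only
technical_context = context_for("technical", state)  # support article only
```

The conversation history stays in `state` and reaches a prompt only if a step's entry in `STEP_CONTEXT` asks for it.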

Use a reasoning model for one step, not the whole chain

Reasoning models are slow and expensive. Use them for the one step in the chain that benefits from deep reasoning (planning, hard classification, multi-constraint generation) and use a fast chat model for the surrounding steps. The chain is the natural place to mix model sizes.
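
Mixing models can be as simple as a per-step lookup. In this sketch the model names are placeholders rather than real model identifiers, and `call_model` is a hypothetical callable that takes a model name and a prompt.

```python
from typing import Callable

# Placeholder model names, not real model identifiers.
STEP_MODELS = {
    "extract": "fast-chat-model",
    "plan": "reasoning-model",   # the one step that benefits from deep reasoning
    "respond": "fast-chat-model",
}


def run_step(step: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Send the prompt to whichever model this step is configured to use."""
    return call_model(STEP_MODELS[step], prompt)
```

Keeping the mapping in one table makes the cost of the chain visible at a glance and makes it a one-line change to move a step to a different model.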

Pitfalls

  • Chaining for the sake of chaining. A two-step chain where one step would suffice doubles latency and cost for no gain. Chain only when the prompt has more than one verb.
  • Forwarding the entire previous response. The next model loses focus.
  • Putting business logic in the router prompt. The router classifies; the branch executes.
  • Losing the per-step log. Without a record of each step’s input and output, the chain cannot be evaluated. See prompt-evals.
  • Hand-rolling chain orchestration when a framework would do. Use the chain primitive in your agent framework (LangGraph nodes, Anthropic tool-use loops, OpenAI Assistants steps) so the per-step traces are free.
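
Whatever produces the traces, each record needs at least the step name, its input, and its output. The sketch below shows only that record shape; `logged_step` is a hypothetical helper, and a framework's built-in tracing would normally take its place.

```python
import json
import time
from typing import Callable


def logged_step(
    name: str,
    fn: Callable[[str], str],
    step_input: str,
    log: list[dict],
) -> str:
    """Run one step and append its input, output, and duration to the log."""
    started = time.perf_counter()
    output = fn(step_input)
    log.append({
        "step": name,
        "input": step_input,
        "output": output,
        "seconds": round(time.perf_counter() - started, 4),
    })
    return output


# Usage: str.upper stands in for a real step function.
log: list[dict] = []
upper = logged_step("shout", str.upper, "hello", log)
print(json.dumps(log, indent=2))
```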