Overview

Output constraints are how a prompt narrows the model from “any plausible response” to “the one response the application can consume.” State the constraint in the instruction, show it with an example, validate the result, and reject when validation fails. Length caps, format specs, schemas, exclusions, and escape hatches are the five levers.

State the format with an example, not just an instruction

A short example of the target output beats a paragraph of prose describing it. Models are generally better at pattern-matching one example than at parsing a structural rule.

Bad:  Return a JSON object with a label field (string) and a
      confidence field (float between 0 and 1).

Good: Reply in this exact format:
      {"label": "billing", "confidence": 0.87}

For multi-line or nested formats, ship the example inside the same delimiter block as the schema. See prompt-templates for the canonical skeleton and structured-prompt for why the example carries the structural contract.
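One way to keep the example and the caller's check in step is to generate both from the same object. The sketch below is illustrative only: the build_prompt and matches_example_shape helpers and the <ticket> delimiter are assumptions, not part of any library.

```python
import json

# Example payload; the field names follow the example above.
EXAMPLE = {"label": "billing", "confidence": 0.87}


def build_prompt(ticket_text: str) -> str:
    """Put the format example before the input so the model sees the pattern first."""
    example_line = json.dumps(EXAMPLE)
    return (
        "Classify the support ticket.\n"
        "Reply in this exact format:\n"
        f"{example_line}\n\n"
        "<ticket>\n"  # hypothetical delimiter for the user input
        f"{ticket_text}\n"
        "</ticket>"
    )


def matches_example_shape(reply: str) -> bool:
    """Check that the reply is a JSON object with exactly the example's keys."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == set(EXAMPLE)
```

Because the prompt example and the shape check both derive from EXAMPLE, renaming a field in one place cannot silently leave the other stale.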

Offer the model an explicit escape hatch

One of the highest-impact mitigations against fabrication in grounded Q&A is giving the model permission to refuse. Without an escape hatch, the model tends to invent a plausible answer; with one, it can return the sentinel instead.

Reply NOT_FOUND if the answer is not in the provided text.
Reply UNCLEAR if the question is ambiguous.
Reply OUT_OF_SCOPE if the question is not about Postgres.

Pick one sentinel string per condition, document it in the schema, and handle it in the caller. The sentinel should be uppercase, with no spaces and no punctuation other than underscores, so it survives downstream normalization. See best-practices.
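Handling the sentinel in the caller might look like the following minimal sketch. The Reply type and parse_reply helper are hypothetical names, and the light normalization (trimming whitespace and a trailing period, uppercasing) is an assumption about what downstream code tolerates.

```python
from dataclasses import dataclass
from typing import Optional

# Sentinel strings from the example above, one per condition.
SENTINELS = {
    "NOT_FOUND": "answer not in the provided text",
    "UNCLEAR": "question is ambiguous",
    "OUT_OF_SCOPE": "question is not about Postgres",
}


@dataclass
class Reply:
    answer: Optional[str]    # set when the model answered
    sentinel: Optional[str]  # set when the model took an escape hatch


def parse_reply(raw: str) -> Reply:
    """Separate sentinel replies from real answers."""
    # Normalize lightly: strip whitespace and a trailing period, then uppercase.
    token = raw.strip().rstrip(".").upper()
    if token in SENTINELS:
        return Reply(answer=None, sentinel=token)
    return Reply(answer=raw.strip(), sentinel=None)
```

The caller can then branch on reply.sentinel rather than searching the answer text for refusal phrasing.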

Set hard token limits as a safety ceiling, not a length instruction

The provider-side max_tokens parameter is a circuit breaker, not a length spec. Setting max_tokens: 200 does not produce a 200-token answer; it produces a truncated answer when the model wants to write more than 200 tokens.

Use both layers:

  • Instruction-level length spec: “Reply in no more than three sentences.” or “Return a JSON object under 500 characters.”
  • Provider-level max_tokens: set to roughly 1.5x the expected output so a runaway generation gets cut off.

The instruction governs shape; the parameter prevents bill shock. Never rely on one without the other.
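The two layers can be sketched as a pair of small helpers. Everything here is an assumption made for illustration: the helper names, the 1.5x default taken from the bullet above, and the set of finish-reason strings, which should be checked against the current documentation of whichever provider is in use.

```python
import math

# Finish-reason values that signal the token cap was hit. These strings are
# an assumption about provider responses; confirm them against current docs.
TRUNCATION_REASONS = {"length", "max_tokens", "MAX_TOKENS"}


def token_ceiling(expected_tokens: int, headroom: float = 1.5) -> int:
    """Provider-level cap: expected output size plus headroom, rounded up."""
    return math.ceil(expected_tokens * headroom)


def was_truncated(finish_reason: str) -> bool:
    """True when the provider stopped generation at the token cap."""
    return finish_reason in TRUNCATION_REASONS


def within_sentence_cap(text: str, max_sentences: int = 3) -> bool:
    """Rough instruction-level check: count sentence-ending punctuation."""
    endings = sum(text.count(mark) for mark in ".!?")
    return endings <= max_sentences
```

A truncated response is best treated as a failed call rather than a short answer, since the cut can land mid-sentence or mid-object. The sentence counter is deliberately crude (decimals and abbreviations inflate the count) and is only meant to show where the instruction-level check lives.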

Ship a schema for JSON output and require the model to match it

If the output is JSON, declare the schema inside the prompt and use the provider’s structured-output mode where available (response_format: { type: "json_schema", json_schema: { name: ..., schema: ... } } on OpenAI, a tool definition with a typed input_schema on Anthropic, responseSchema on Google). See structured-output.

<output_schema>
{
  "type": "object",
  "required": ["label", "confidence"],
  "properties": {
    "label": {"enum": ["billing", "technical", "account", "UNKNOWN"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "additionalProperties": false
}
</output_schema>

A schema plus structured-output mode is the most dependable way to get valid JSON at scale. Free-text JSON requests fail on roughly 1 to 3 percent of calls even at low temperature, which is enough to matter at production volume.
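Even with structured-output mode, it is reasonable to re-check the payload in the caller. The following is a hand-rolled sketch of a check for the schema above, using only the standard library; validate_payload is a hypothetical helper, and a real deployment would more likely use a JSON Schema validation library.

```python
import json
from typing import List

ALLOWED_LABELS = {"billing", "technical", "account", "UNKNOWN"}
REQUIRED_FIELDS = ("label", "confidence")


def validate_payload(raw: str) -> List[str]:
    """Return a list of violations; an empty list means the payload is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for key in REQUIRED_FIELDS:
        if key not in data:
            errors.append(f"missing required field: {key}")

    # Mirrors "additionalProperties": false in the schema.
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")

    if "label" in data:
        label = data["label"]
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            errors.append("label is not one of the allowed values")

    if "confidence" in data:
        conf = data["confidence"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not is_number or not 0 <= conf <= 1:
            errors.append("confidence is not a number between 0 and 1")
    return errors
```

Returning a list of violations rather than a boolean gives the caller something concrete to log when a response is rejected.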

Reject post-generation rather than coax retroactively

When the output fails the constraint, do not append “wait, that was wrong, try again” to the conversation and re-ask. The follow-up turn pollutes the context with the bad output and the model often anchors on it.

The right loop:

  1. Validate the output against the schema or length cap in the caller.
  2. If invalid, log the failure and either retry from scratch with the same prompt (raising temperature slightly so the retry does not simply reproduce the same output) or fail loudly.
  3. Track the failure rate as a deployment metric. A rising rate usually points to a prompt regression rather than a model regression, so check recent prompt changes first. See prompt-evals.
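The loop above can be sketched as follows. The call_model and validate callables are hypothetical stand-ins supplied by the caller (call_model takes a prompt and a temperature and returns raw text; validate returns a list of violations), and the attempt count and temperature step are arbitrary illustrative defaults.

```python
from typing import Callable, List

# Hypothetical signatures, not a real client API.
CallModel = Callable[[str, float], str]
Validate = Callable[[str], List[str]]


class FailureTracker:
    """Counts validation failures so the rate can be exported as a metric."""

    def __init__(self) -> None:
        self.calls = 0
        self.failures = 0

    def record(self, ok: bool) -> None:
        self.calls += 1
        if not ok:
            self.failures += 1

    @property
    def failure_rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0


def generate_validated(
    prompt: str,
    call_model: CallModel,
    validate: Validate,
    tracker: FailureTracker,
    max_attempts: int = 2,
    base_temperature: float = 0.0,
    temperature_step: float = 0.2,
) -> str:
    """Retry from scratch on invalid output; never append the bad output."""
    errors: List[str] = []
    for attempt in range(max_attempts):
        # Same prompt every time; only the temperature changes between attempts.
        temperature = base_temperature + attempt * temperature_step
        raw = call_model(prompt, temperature)
        errors = validate(raw)
        tracker.record(ok=not errors)
        if not errors:
            return raw
    # Fail loudly rather than hand unvalidated output to the application.
    raise ValueError(f"output failed validation after {max_attempts} attempts: {errors}")
```

Each attempt sends the original prompt unchanged, so the rejected output never enters the model's context.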

For high-stakes outputs, use a second LLM call as a validator with its own narrow prompt. The validator’s job is to return VALID or INVALID plus a one-line reason; do not let it rewrite the output inline.
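On the caller side, the validator's reply can be parsed strictly so that anything other than a bare verdict counts as a rejection. This is a sketch under assumptions: parse_verdict is a hypothetical helper, and the single-line "VERDICT: reason" wire format is one possible convention rather than something the text prescribes.

```python
from typing import Tuple


def parse_verdict(raw: str) -> Tuple[bool, str]:
    """Parse 'VALID' or 'INVALID: reason' from a validator call.

    Anything else, including a multi-line reply that may contain a rewritten
    output, is treated as invalid so it cannot pass as an approval.
    """
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    if len(lines) != 1:
        return False, "validator reply was not a single line"

    verdict, _, reason = lines[0].partition(":")
    verdict = verdict.strip().upper()
    if verdict == "VALID":
        return True, reason.strip()
    if verdict == "INVALID":
        return False, reason.strip()
    return False, "validator reply did not follow the VALID/INVALID format"
```

Defaulting to a rejection on any unexpected reply keeps a chatty validator from approving output by accident.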

State exclusions positively when possible

An affirmative instruction such as “Refer the user to /help/faq” tends to be followed more reliably than a negated one such as “do not ask for the password,” because the model is better at executing a stated action than at avoiding a negated one. When a true exclusion is unavoidable, state it once at the top of the instructions and again at the bottom.

Bad:  Do not include any personal opinions.
Good: Cite only facts from the provided text. Mark anything else
      as INFERRED.
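When an exclusion cannot be rephrased, the top-and-bottom placement can be handled by the code that assembles the instructions. A minimal sketch, assuming a hypothetical build_instructions helper and instructions kept as a list of lines:

```python
from typing import List


def build_instructions(rules: List[str], exclusion: str = "") -> str:
    """Join affirmative rules; repeat an unavoidable exclusion at both ends."""
    lines = list(rules)
    if exclusion:
        # State the exclusion once at the top and once at the bottom.
        lines = [exclusion] + lines + [exclusion]
    return "\n".join(lines)
```

Keeping the exclusion as a separate argument makes it easy to see how many negated rules a prompt carries and to replace them with affirmative ones over time.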

Pitfalls

  • Asking for “concise” or “brief” output without a unit. The model has no anchor; specify sentences, words, characters, or tokens.
  • Putting the schema after the user input. Recency bias works against you; put the schema and the example before the input.
  • Letting the model produce both prose and JSON in one response. Pick one. If you need both, use two calls or wrap the JSON in a tool call.
  • Coaxing in a follow-up turn. The bad output is now in context and the model anchors on it.
  • Treating max_tokens as a length instruction. It is a truncator.