Overview
Prompt engineering has its own vocabulary that crosses model APIs, decoding parameters, retrieval stacks, evaluation harnesses, safety tooling, and agent orchestration. This page defines every term once, links to the deep-dive where one exists, and groups terms by category so a developer can navigate by domain. Use this as the citation anchor for every other prompt engineering page in the vault.
Table of contents
- Core concepts
- Prompt components
- Prompt patterns
- Model parameters
- Tool use
- Retrieval and grounding
- Model families
- Evaluation
- Safety and security
- Performance and efficiency
- Agents and orchestration
Core concepts
What is a prompt?
A prompt is the full input sent to an LLM on a given turn, including system, user, and assistant messages plus any tool results in scope. The unit of work in prompt engineering; quality of output tracks quality of prompt. See best-practices.
What is an LLM (Large Language Model)?
An LLM (Large Language Model) is a transformer-based model trained on large text corpora to predict the next token given a context. Examples: Claude, GPT, Gemini, Llama, Qwen. The substrate prompt engineering targets.
What are tokens and tokenization?
Tokens are the discrete units an LLM reads and writes; tokenization is the process of splitting text into those units using a model-specific vocabulary (BPE, SentencePiece, tiktoken). A token is roughly 3 to 4 characters of English. See token.
What is a context window?
A context window is the maximum number of tokens a model can attend to in a single inference call, covering both input and output. Exceeding it truncates older content or errors. See context-window.
What is inference?
Inference is the forward pass that produces a completion from a prompt. Distinct from training; the only step prompt engineering controls.
What is a completion?
A completion is the model’s generated output for a given prompt. The text returned by the inference call. See completion.
What is generation?
Generation is the autoregressive process of producing a completion one token at a time, each token sampled from the next-token distribution. Synonymous with decoding in most contexts.
What is decoding?
Decoding is the algorithm that selects the next token from the model’s output distribution: greedy, temperature sampling, top-p, top-k, beam search. The decoder is the layer between raw logits and the visible completion.
Prompt components
What is a system message?
A system message is the stable, role-and-policy prompt sent on every turn that anchors persona, format, and constraints. Lives outside the user-visible chat and persists across turns. See system-message.
What is a user message?
A user message is the per-turn input from the human or calling application. Carries the variable content while the system message holds the invariant frame.
What is an assistant message?
An assistant message is a prior model response included in the conversation history. Used to give the model multi-turn context or to seed few-shot patterns.
What is an instruction?
An instruction is the imperative portion of a prompt that tells the model what to do. Put the instruction at the top for short prompts; repeat it at the end for long ones.
What is primary content?
Primary content is the data the instruction operates on: the article to summarize, the email to classify, the document to answer from. Separate it from the instruction with explicit delimiters.
What is supporting content?
Supporting content is auxiliary material that shapes the output without being the subject of the instruction: style examples, glossaries, prior decisions, retrieved passages. Lower priority than primary content.
What is a cue?
A cue is a short trailing fragment that primes the model to continue in a specific shape, like ending a prompt with JSON: or Answer:. Cheap way to steer format without adding examples.
What are few-shot examples?
Few-shot examples are one to five worked input-output pairs included in the prompt to demonstrate the target schema. The model infers format from the examples rather than from prose description. See few-shot-prompting.
Prompt patterns
What is zero-shot prompting?
Zero-shot prompting is asking the model to perform a task with no examples in the prompt. Works for tasks the model has seen often in training; fails on unusual formats or domain-specific schemas.
What is few-shot prompting?
Few-shot prompting is including a handful of worked examples in the prompt to demonstrate the target output. Beats zero-shot for structured output, classification, and unusual format. See few-shot-prompting.
What is chain-of-thought (CoT)?
Chain-of-thought (CoT) is a prompting pattern that asks the model to reason step-by-step before producing the final answer. Improves accuracy on math, logic, and multi-step extraction. See chain-of-thought and chain-of-thought.
What is tree-of-thoughts?
Tree-of-thoughts is a reasoning pattern that explores multiple candidate reasoning paths in parallel, then selects the best. Used for search-shaped problems where chain-of-thought alone gets stuck on a wrong branch.
What is ReAct?
ReAct is a prompting pattern that interleaves reasoning steps with tool-use actions. The model alternates Thought: and Action: lines, observing each tool result before the next step. The canonical agent loop scaffold.
What is role priming?
Role priming is assigning the model a persona at the head of the prompt to anchor tone, depth, and vocabulary. “As a tax accountant, explain…” See role-priming.
What is persona prompting?
Persona prompting is a synonym for role priming. Some teams use “persona” for richer character specifications (background, style, audience) and “role” for the bare profession or stance.
What is sequential prompting?
Sequential prompting is chaining related turns when one prompt would overload the model. First describe the process, then list variants, then compare them. Each turn builds on the previous output.
What is prompt chaining?
Prompt chaining is the production form of sequential prompting: a fixed pipeline of prompts where each step’s output feeds the next, often with code between steps. See prompt-chaining.
What is self-consistency?
Self-consistency is sampling the same chain-of-thought prompt N times at non-zero temperature and taking the majority vote of the final answers. Trades cost for accuracy on reasoning tasks.
What is reflexion?
Reflexion is a pattern where the model critiques its own output against the task, then revises. Useful when the failure mode is overconfidence; less useful when the model lacks the underlying knowledge.
What is a structured prompt?
A structured prompt is one with explicit sections (instruction, context, examples, format) separated by delimiters or headings. The shape that makes evals and caching tractable. See structured-prompt.
What are delimiters?
Delimiters are syntactic markers (---, ###, XML tags, triple backticks, markdown headings) that separate instructions from primary content. Prevent the model from confusing user-provided text with system instructions and reduce injection surface. See prompt-injection-defense.
Model parameters
What is temperature?
Temperature is a decoding parameter that controls randomness by scaling the next-token logits before sampling. 0 is greedy and deterministic; higher values flatten the distribution. Use 0 for extraction and Q&A; 0.7+ only for creative tasks. See temperature.
What is top-p (nucleus sampling)?
Top-p (nucleus sampling) is a decoding parameter that samples from the smallest set of tokens whose cumulative probability exceeds p. Adapts the candidate pool to the confidence of the distribution. See top-p.
What is top-k?
Top-k is a decoding parameter that samples only from the k most probable next tokens. Fixed-size pool, unlike top-p. Less common in modern APIs but still exposed by some.
What are max tokens?
Max tokens is the cap on completion length, counted in tokens not characters. Set it tight enough to fail fast on runaway generations but loose enough for the longest legitimate output.
What are stop sequences?
Stop sequences are strings that, when emitted, terminate generation immediately. Used to bound structured output or to halt at a delimiter. See stop-sequence.
What is a frequency penalty?
A frequency penalty is a decoding parameter that downweights tokens proportional to how often they have already appeared in the completion. Discourages literal repetition.
What is a presence penalty?
A presence penalty is a decoding parameter that downweights any token that has appeared at all in the completion, regardless of count. Encourages topic diversity rather than just reducing repetition.
What is a seed?
A seed is an integer that fixes the sampler’s random state so a given prompt produces the same completion across runs at a given temperature. Required for reproducible evals; supported by some APIs only.
Tool use
What is tool use?
Tool use is the pattern where the model emits a structured request for an external function and the application executes it, returning the result to the next inference call. The substrate of agent loops. See tool-use.
What is function calling?
Function calling is the API-level form of tool use, where tools are declared as JSON schemas and the model returns a structured call object the application dispatches. See function-calling.
What is a tool result?
A tool result is the output of an external function returned to the model in the next turn. Becomes part of the conversation history the model reasons over. See tool-result.
What is JSON mode?
JSON mode is an API setting that constrains the completion to valid JSON. Cheap form of structured output; pair with a schema for tighter guarantees.
What is structured output?
Structured output is any output shape the application can parse without prose handling: JSON, XML, tabular text, fixed-key markdown. See structured-output and structured-output.
What is schema-validated output?
Schema-validated output is structured output checked against a declared schema (JSON Schema, Pydantic, Zod) either at decode time (constrained decoding) or after the call. See schema-validated.
What are parallel tool calls?
Parallel tool calls are multiple tool invocations emitted in a single assistant turn, dispatched concurrently by the application. Reduces latency when the tools are independent.
Retrieval and grounding
What is RAG (Retrieval-Augmented Generation)?
RAG (Retrieval-Augmented Generation) is the pattern of fetching relevant documents from a corpus at query time and including them in the prompt as grounding context. The standard mitigation for stale or domain-specific knowledge gaps. See rag and rag.
What is an embedding?
An embedding is a fixed-size vector representation of a text chunk produced by an embedding model. The substrate of semantic search. See embedding.
What is a vector store?
A vector store is a database that indexes embeddings for nearest-neighbor search. Examples: pgvector, Pinecone, Qdrant, Weaviate, ChromaDB.
What is semantic search?
Semantic search is retrieval by embedding similarity rather than keyword match. Catches paraphrases and synonyms that lexical search misses; pair with keyword search for hybrid retrieval.
What is a reranker?
A reranker is a second-stage model that scores retrieved candidates against the query and reorders them. Improves top-k precision after a recall-oriented first-stage retrieval. See reranker.
What is chunking?
Chunking is splitting a corpus into passage-sized units before embedding. Chunk size and overlap directly affect retrieval quality; too small loses context, too large dilutes the embedding.
What is grounding?
Grounding is including authoritative source material in the prompt so the model answers from given context rather than parametric memory. The single largest hallucination reduction technique.
What is a citation?
A citation is a reference to a specific source span in the model’s output. Asking the model to cite its sources reduces fabrication because every fabricated claim now requires a fabricated citation that breaks under inspection.
Model families
What is a base model?
A base model is a pretrained model that has not been instruction-tuned. Predicts next tokens given any text but does not follow instructions cleanly. See base.
What is an instruct model?
An instruct model is a base model fine-tuned to follow instructions in a single-turn format. The precursor to chat models; still useful for batch transforms where multi-turn is unnecessary.
What is a chat model?
A chat model is an instruct model further tuned for multi-turn dialogue with structured role messages (system, user, assistant). The default shape for production prompting.
What is a reasoning model?
A reasoning model is a chat model trained to spend additional inference-time compute on internal reasoning before answering. Examples: Claude with extended thinking, OpenAI o-series. Prompt differently than chat models. See reasoning-model-prompting.
What is a multimodal model?
A multimodal model is one that accepts non-text inputs (images, audio, video, PDFs) alongside text. See multimodal.
What is a fine-tuned model?
A fine-tuned model is a base or chat model further trained on a task-specific dataset to lock in a format, voice, or domain. Last-resort optimization after prompting and RAG have been exhausted. See fine-tuning.
What is a distilled model?
A distilled model is a smaller model trained to mimic a larger teacher model’s outputs. Cheaper and faster at inference; loses some capability at the tail. See distillation.
Evaluation
What is an eval set?
An eval set is a fixed collection of inputs with known correct outputs used to regression-test a prompt on every change. The unit of evidence for promoting a prompt to production. See eval-set and prompt-evals.
What is a golden set?
A golden set is a smaller, hand-curated subset of the eval set with high-confidence ground truth. The benchmark for catching subtle regressions. See golden-set.
What is ground truth?
Ground truth is the verified correct output for an eval input. The reference the model’s completion is scored against. See ground-truth.
What is LLM-as-judge?
LLM-as-judge is using a separate LLM call to score completions against a rubric. Cheaper than human review; biased toward verbose outputs and toward its own family of models. See llm-as-judge.
What is an evaluation harness?
An evaluation harness is the framework that runs prompts against the eval set, scores results, and reports metrics. Examples: Inspect, promptfoo, Braintrust, in-house. See evaluation-harness.
What is hallucination rate?
Hallucination rate is the share of completions that contain unsupported or fabricated claims. The primary safety metric for grounded Q&A. See hallucination-rate.
What is a regression test?
A regression test is an eval case that previously failed and now passes. Pin it to the eval set so the fix does not silently revert. See prompt-evals.
Safety and security
What is prompt injection?
Prompt injection is adversarial input that overrides the system instruction by embedding new instructions in user input or retrieved content. The canonical LLM security vulnerability. See prompt-injection and prompt-injection-defense.
What is a jailbreak?
A jailbreak is a prompt designed to bypass the model’s safety training and elicit refused content. Overlaps with but is distinct from prompt injection. See jailbreak.
What is indirect prompt injection?
Indirect prompt injection is injection delivered through content the model retrieves rather than through direct user input: a webpage, an email, a tool result containing hidden instructions. The dominant agent-era attack surface.
What is a sentinel prompt?
A sentinel prompt is a known-text canary embedded in the system message; if the model echoes or reveals it, an injection has succeeded. A cheap injection-detection tripwire.
What is a guardrail?
A guardrail is a pre- or post-inference filter that blocks unsafe inputs or outputs. Examples: input classifiers, output policy checks, regex denylists, schema validators.
What is a refusal?
A refusal is a completion in which the model declines to perform the requested task, citing policy. Track refusal rate as an eval metric; high rates on benign inputs indicate over-tuned safety.
What is a safe completion?
A safe completion is an output that complies with policy without an outright refusal: redacted, partial, or redirected to a safer alternative. Generally preferable to a hard refusal for user experience.
Performance and efficiency
What is a prompt cache?
A prompt cache is an API feature that caches the model’s internal state for a stable prompt prefix, charging less and returning faster on subsequent calls. Structure prompts so the cached prefix is invariant across requests. See prompt-cache and prompt-caching-strategies.
What is a KV cache?
A KV cache is the per-token key-value tensors the model retains during generation so each new token attends to prior tokens in O(1). The mechanism prompt caching exposes to the API surface.
What is context window utilization?
Context window utilization is the fraction of the model’s context window consumed by a given request. Track it as a leading indicator; degraded accuracy and latency creep in well before the hard limit.
What is a token budget?
A token budget is the allocation of context tokens across system message, examples, retrieved context, and reserved output space. Set it explicitly so retrieval cannot starve the answer.
What is latency?
Latency is wall-clock time from request to final token. Decomposes into time-to-first-token (TTFT) and inter-token latency. Streaming reduces perceived latency without changing total.
What is cost per request?
Cost per request is the dollar cost of a single inference call, computed from input tokens, output tokens, cache hits, and tool-result tokens. The unit economic for any LLM feature.
What is streaming?
Streaming is delivering completion tokens to the client as they are generated rather than waiting for the full response. Reduces perceived latency and enables incremental UI updates.
Agents and orchestration
What is an agent?
An agent is an LLM-driven program that selects and invokes tools in a loop until a goal is reached. Distinguished from a single-shot prompt by the presence of state, tool use, and termination conditions.
What is an agent loop?
An agent loop is the cycle of plan, act, observe, repeat that an agent runs until it produces a final answer or hits a stop condition. See agent-loop.
What is a planner?
A planner is the agent component that decomposes a goal into ordered sub-tasks. May be a dedicated prompt, a separate model, or implicit in the main loop.
What is an executor?
An executor is the agent component that runs each sub-task, calling tools and producing intermediate results. Paired with a planner in the planner-executor pattern.
What is the planner-executor pattern?
The planner-executor pattern is an agent design where one prompt plans the full task and a second prompt executes each step. Separates strategic and tactical reasoning; easier to debug than a single monolithic loop. See planner-executor.
What is a subagent?
A subagent is an agent invoked by another agent for a bounded sub-task, with its own system prompt and tool set. Used to isolate context, scope permissions, and parallelize work.
What is MCP (Model Context Protocol)?
MCP (Model Context Protocol) is Anthropic’s open standard for connecting LLMs to external tools, resources, and prompts via a uniform server interface. See mcp and mcp-servers.
What is memory?
Memory is durable state an agent reads and writes across turns or sessions: scratchpad files, vector stores of past interactions, structured key-value stores. Distinct from the in-context conversation history.