About This Catalog

This is the fifteenth volume in a catalog of the working vocabulary of agentic AI, and the first one I would defensibly skip or fold into the prior volumes. The honest framing matters. Volumes 11—14 were “Not gaps” --- adjacent disciplines or perishable snapshots that earned their place despite my initially flagging them as non-essential. This volume is a weaker candidate than any of those. Much of what could be in a “prompting and context engineering” volume already lives elsewhere in the series: packaged instructions belong to Volume 2 (Skills); tool definitions to Volume 3 (Tools); persistent context state to Volume 6 (Memory); retrieved context to Volume 10 (Retrieval); safety prompting to Volume 8 (Evaluation & Guardrails); multi-agent prompting to Volume 9 (Multi-Agent Coordination). The residual --- the genuinely-not-elsewhere content this volume covers --- is real but smaller in scope than the prior volumes.

Why include it anyway? Because the residual is coherent enough to deserve explicit vocabulary, and because prompting as a discipline has accumulated enough practitioner knowledge through 2022—2026 that scattering the residual across other volumes loses the coherence of the discipline. The discipline’s emergence as a distinct concern --- the way teams have learned that few-shot examples behave differently from chain-of-thought triggers, that context ordering produces different model behavior, that prompt versioning is its own engineering practice --- is what this volume captures. Read it knowing the scope is narrower than prior volumes, and treat the fold-into-existing-volumes alternative as a legitimate choice some readers will make.

The volume’s defensive posture also reflects a real evolution in the field. Through 2022—2023, prompting was the dominant discipline for working with LLMs: most of the value came from clever phrasing, well-chosen examples, careful constraints. Through 2024—2026, the discipline shifted as models became more capable (less sensitive to small prompt variations), reasoning-trained models reduced the need for explicit chain-of-thought patterns, structured output features (JSON mode, tool calling) made format prompting less critical, and agent frameworks abstracted many prompting concerns. What remains is narrower than the original discipline but still real: system prompt design, context window engineering at production scale, prompt versioning, model-specific conventions, meta-prompting. This volume covers the residual.

Scope

Coverage:

System and user prompt design patterns: role, persona, constraints, instruction structure.
Few-shot and N-shot prompting: when examples help, when they hurt, ordering effects.
Chain-of-thought variants and reasoning-trained model implications.
Context window engineering: selection, ordering, compression at production scale.
Lost-in-the-middle effects and position-dependent recall.
Structured output prompting: JSON mode, schema-constrained generation, tool-calling for structure.
Prompt versioning, testing, and iteration as engineering practice.
Model-specific prompting conventions: Anthropic XML, OpenAI markdown, Google instructions.
Meta-prompting: AI-generated prompts and prompt optimization.

Out of scope (covered elsewhere):

Packaged agent instructions as reusable units. Volume 2 (Skills) covers this.
Tool definitions, schemas, and function-calling conventions. Volume 3 (Tools) covers this.
Memory persistence and conversation state. Volume 6 (Memory) covers this.
Retrieval-augmented generation patterns. Volume 10 (Retrieval & Knowledge Engineering) covers this.
Safety guardrails and content moderation prompting. Volume 8 (Evaluation & Guardrails) covers this.
Inter-agent communication prompts. Volume 9 (Multi-Agent Coordination) covers this.
HITL approval prompting. Volume 7 (Human-in-the-Loop) covers this.

How to read this catalog

Part 1 (“The Narratives”) is conceptual orientation: why this volume is a weaker candidate than the prior fourteen; the prompting-vs-context-engineering distinction that emerged through 2024—2026; the five prompting pattern families that compose most production prompts; context window engineering as a discipline beyond just fitting tokens; and the practitioner discipline of versioning, testing, and iteration. Four diagrams sit in Part 1 (one fewer than prior volumes, matching the scope reduction).

Part 2 (“The Substrates”) is reference material organized by section. Each section opens with a short essay on what its entries share. Representative substrates appear in the Fowler-style template, with the volume’s entry count smaller than prior volumes to reflect its narrower scope.

Part 1 — The Narratives

Five short essays frame why a weaker-candidate volume can still earn its place. The reference entries in Part 2 assume the orientation established here.

Chapter 1. Why This Volume Is a Weaker Candidate

The prior four “Not gaps” volumes (11—14) followed a consistent pattern: I flagged each as not a major missing piece; the volume earned its place anyway as either a complementary discipline (compliance, security, design) or a deliberately-perishable snapshot (products survey). This volume is different. It’s not a complementary discipline with its own distinct audience; the audience is the same agent developers reading the prior volumes. It’s not a perishable snapshot violating the structural-vocabulary principle; it’s ostensibly structural vocabulary. What it is, instead, is a volume whose content largely exists elsewhere in the series --- enough so that the honest framing acknowledges the fold-into-existing-volumes alternative as a legitimate choice.

The fold map makes this concrete. Packaged instructions --- the reusable units that bundle system prompts with associated context and tools --- are covered in Volume 2 (Skills). Tool definitions and function-calling schemas are covered in Volume 3 (Tools). Persistent context state across sessions and conversations is covered in Volume 6 (Memory). Retrieved context for RAG patterns is covered in Volume 10 (Retrieval & Knowledge Engineering). Safety prompt patterns and constitutional AI techniques are partially covered in Volume 8 (Evaluation & Guardrails). Inter-agent communication prompts are covered in Volume 9 (Multi-Agent Coordination). Each of these is a substantial chunk of what a comprehensive “prompting” volume would contain; each is already covered elsewhere with appropriate depth.

The residual --- what this volume actually covers --- is real but narrower. Few-shot and N-shot patterns as a category distinct from packaged Skills. Chain-of-thought variants and their evolution as reasoning-trained models emerged. Role and persona prompting that doesn’t fit cleanly under Skills. System vs. user prompt design as an architectural concern. Model-specific conventions where the same instruction phrased differently produces different results across vendors. Context window engineering at production scale where selection, ordering, and compression become explicit disciplines beyond just “does it fit.” Structured output prompting where JSON mode and schema constraints shape the prompt design. Prompt versioning and testing as engineering practice. Meta-prompting where AI generates prompts for other AI.

The volume earns its place if the reader believes these residual topics deserve explicit treatment in one place rather than scattered across the volumes they’d nominally fit. The honest case for explicit treatment: the practitioner discipline of prompting has accumulated enough working knowledge through 2022—2026 that the discipline’s coherence is worth preserving as a discrete reference. The honest case against: the residual could be appendices in the existing volumes; the fold-into-other-volumes alternative would reduce series complexity without losing essential content. The reader who finds the fold alternative more attractive should treat this volume as optional; the reader who values the consolidated treatment should read it knowing the scope is deliberately narrower than prior volumes.

Chapter 2. Prompting vs. Context Engineering

The discipline of working with LLMs has bifurcated through 2024—2026 into two related but distinct sub-disciplines. Prompting is about designing the instructions --- what you say to the model. Context engineering is about managing what the model sees --- what content is in the context window when the model is asked to produce its output. Both matter for production agents; the distinction is recent enough that practitioners often conflate them, but the conflation produces confusion when techniques that work for one don’t work for the other.

Prompting in the strict sense is about the instructions you give the model. The system prompt that establishes role and constraints. The user prompt that describes the specific task. The few-shot examples that demonstrate desired behavior. The chain-of-thought trigger that prompts explicit reasoning. The output format specification that produces parseable structure. The question prompting asks: how do I phrase this so the model does what I want? The techniques are about word choice, instruction structure, example selection, and prompt composition. Section A and Section B of Part 2 cover this.

Context engineering is about what the model has available when it produces its output. The conversation history --- how much of it to include, when to summarize, when to truncate. The retrieved documents --- which ones, in what order, how to present them. Memory contents --- what persistent state to inject, when to ignore. Tool definitions and prior tool results --- which tools to expose, how to format prior results. The question context engineering asks: what should the model see when it produces its output? The techniques are about selection, ordering, compression, and presentation. Section C of Part 2 covers this.

The two disciplines interact. A well-engineered context paired with a poorly-designed prompt underperforms. A well-designed prompt paired with a poorly-engineered context underperforms. Production agents need both done well; the discipline of prompting alone, dominant through 2023, gradually proved insufficient as production agents accumulated state, tools, and retrieved content. Context engineering emerged as the necessary complement once the typical production agent had multi-turn conversations, tool-call results from prior turns, retrieved RAG documents, persistent memory, and dynamic tool selections all competing for limited context window.

The distinction also explains why some prompting advice from 2022—2023 ages poorly. Advice like “include detailed instructions in the system prompt” was right when the prompt was the main thing the model saw. By 2026, the model also sees prior conversation turns, retrieved documents, tool results, and memory --- the prompt is a small fraction of what affects the model’s behavior, and context engineering matters at least as much. Modern prompting practice integrates both: the prompt design is part of a broader context strategy rather than the whole game.

Chapter 3. The Five Prompting Pattern Families

Most production prompts compose techniques from five pattern families. The families have evolved through 2022—2026 with shifting prominence as models matured: system prompting and output formatting became more critical as agents accumulated capability; few-shot prompting declined in relative importance as models improved zero-shot performance; chain-of-thought prompting shifted as reasoning-trained models internalized the pattern; model-specific conventions emerged as vendors developed distinct prompt structures. Understanding the families separately makes prompt design choices legible.

System prompting establishes the model’s role, persona, and constraints. The system prompt defines what the model is (a coding assistant, a customer service agent, a research analyst), what it does (the tasks it’s expected to perform), and what it won’t do (the constraints and refusals). Almost every production agent has a system prompt; the system prompt is usually the longest single piece of prompt material the model sees. Design considerations: clarity over cleverness, explicit constraints over implicit ones, concrete behavior descriptions over abstract characterizations. The practice has converged on patterns: most production system prompts have role definition first, capability description second, constraints third, output guidelines fourth.

Few-shot prompting includes 1 to N examples of desired input→output pairs in the prompt itself. The model uses the examples to infer the pattern. Few-shot was the dominant technique through 2022—2023 because zero-shot performance was unreliable for many tasks; including examples bridged the gap. By 2026 few-shot is more selective: modern models handle many tasks zero-shot adequately; few-shot remains valuable for tasks with specific format requirements, idiosyncratic conventions, or rare patterns the model hasn’t seen enough of. Ordering matters --- examples earlier in the prompt typically have stronger influence; difficulty progression (easy examples first, hard last) often helps. Number selection involves trade-offs: more examples improve pattern recognition but consume context window.

Chain-of-thought prompting asks the model to reason step-by-step before producing the final answer. The technique improved performance on complex reasoning tasks substantially when introduced in 2022. By 2026, the technique’s relative importance has shifted: reasoning-trained models (o-series at OpenAI, extended thinking at Anthropic, similar capabilities at other providers) internalize the pattern, reducing the need for explicit chain-of-thought triggers. The technique still helps for non-reasoning-trained models, for tasks the model hasn’t seen the pattern for in training, and for cases where the user wants visibility into the reasoning. Variants include “think step by step,” “think carefully,” “show your work,” and structured-thinking prompts that ask for specific reasoning phases.

Output formatting prompting specifies the exact format the model should produce. JSON schemas, XML tag conventions, markdown structures, custom delimiters. The pattern was critical when downstream parsers required predictable formats and models were inconsistent at producing them. By 2026 the pattern has been partially absorbed into model features: OpenAI’s structured outputs guarantee schema compliance; Anthropic’s tool use produces typed outputs; similar capabilities exist at other providers. Explicit output formatting prompting remains valuable when the structured output features aren’t available, when the format is complex enough to need explicit specification, or when the model needs guidance on format choices the schema doesn’t express.

Model-specific prompting reflects the conventions different vendors have developed. Anthropic’s Claude family handles XML-tag-delimited prompts particularly well (instructions wrapped in `<instructions>`, context in `<context>`, examples in `<examples>`); the convention is documented and reliable. OpenAI’s GPT family handles markdown-formatted prompts well and integrates instructions across the system/user/assistant turn structure differently than Claude. Google’s Gemini has its own conventions emphasizing explicit instruction blocks. Cross-model portability of prompts is partial; production deployments typically lock in conventions for their primary model with explicit knowledge that switching providers requires prompt rework.

Chapter 4. Context Window as Engineering Discipline

Context windows grew from 4K—8K tokens in 2022 to 200K—1M+ tokens by 2026. The growth changed the discipline rather than eliminating it. With 4K windows, the question was “what fits?” --- a constrained-optimization problem. With 1M windows, the question is “what helps?” --- a different problem with the same name. Too much context degrades performance: lost-in-the-middle effects, attention dilution, instruction-following degradation as context grows. Production agents that ignore context engineering pay the cost in degraded performance even when within the model’s nominal window.

Context engineering — Selection, ordering, and compression — the three problems that emerge at scale. Larger windows shifted the discipline from "what fits" to "what helps."

Selection is the first problem. What content goes in the context window? The naive answer --- everything that might be relevant --- fails at scale because relevance has gradients. A conversation with 200 prior turns may have only 5—20 turns that affect the current decision. A retrieved corpus may have 50 documents semantically related to the query but only 3—5 that contain decision-relevant information. The discipline of selection: include what’s decision-relevant; exclude what isn’t. The discipline is harder than it sounds because relevance isn’t always knowable in advance; production systems use heuristics (recency, semantic similarity, explicit tagging) and increasingly use LLM-based selection (a smaller cheaper model judges what to include before the main model runs).

Ordering is the second problem. Position within the context window affects model behavior substantially. The lost-in-the-middle phenomenon documented in research and confirmed in production: information at the beginning or end of long contexts is recalled reliably; information in the middle is recalled less reliably. The implication: place decision-critical information at the start or end of the context window, not buried in the middle. For RAG patterns, this means the most relevant retrieved documents should be at the start or end of the retrieved context, not in the middle. For conversation history, this means key prior content should be repeated or summarized at the start of subsequent turns rather than relying on the model to find it in the middle.

Compression is the third problem. When the decision-relevant content exceeds the available budget, compression becomes necessary. Several patterns: summarization (lossy; preserves overall content but loses specifics), extraction (precise; keeps key facts and discards prose), hierarchical retrieval (retrieve summaries first; retrieve details on demand), compression models (smaller LLMs trained to compress context for larger ones). Each has trade-offs. Summarization loses specifics that might matter for downstream decisions. Extraction loses context that might be needed for synthesis. Hierarchical retrieval adds latency through multiple round trips. Compression models add complexity and may introduce errors of their own. The discipline is choosing the right compression strategy for the specific use case.

Larger context windows have not eliminated these problems; they have shifted them. The 1M-token window that became commercially available through 2025—2026 lets agents include vastly more context than the 4K window of 2022, but the selection, ordering, and compression problems persist. A 1M-token context with poorly-organized content underperforms a 100K-token context with well-organized content. The discipline matters across context sizes; the specific techniques vary but the underlying problems are the same.

Chapter 5. The Practitioner Discipline

Beyond the patterns and techniques, prompting has accumulated practitioner discipline: versioning prompts as artifacts, testing them systematically, iterating with structured methodology. The discipline is the most under-documented aspect of prompting because it’s the most boring --- not novel techniques but the operational practice of treating prompts as engineering artifacts subject to the same discipline as any other production code.

Prompt versioning matters because prompts change over time and the changes have consequences. A team that ships a prompt change and discovers it broke downstream behavior needs to know what changed, who changed it, and how to roll back. The naive pattern --- prompts as strings embedded in application code, committed to version control alongside the code that uses them --- works for simple cases. At scale, prompts often need their own versioning infrastructure: prompt registries (LangChain Hub, PromptLayer, Anthropic Workbench, OpenAI Playground saved prompts) provide centralized versioning, A/B testing, and rollback. The choice between embedded prompts and registry-managed prompts depends on the deployment complexity; small teams typically start with embedded and graduate to registries as the prompt count and team size grow.

Prompt testing is the systematic verification that prompts produce expected behavior. The discipline borrows from software testing but adapts to LLM specifics. Unit-test-like patterns: specific input→output assertions where the prompt is treated like a function whose behavior can be tested. Eval-suite patterns: collections of test cases with grading rubrics, run periodically to detect regressions when prompts or models change. A/B testing patterns: two prompt variants run against the same traffic with outcomes compared statistically. Each has applicability and limitations; the LLM’s stochasticity makes pure assertion-based testing harder than for deterministic code, and the discipline has converged on combinations of techniques rather than any single approach. Section D of Part 2 covers the tooling.

Iteration discipline is the practice of changing prompts based on observed behavior rather than intuition. The naive pattern is “message the prompt until it seems to work better,” which produces non-reproducible improvements and missed regressions. The disciplined pattern: instrument the prompt with traces (Volume 7 covers the observability infrastructure), identify specific failure modes with example inputs, change the prompt to address those failure modes, re-run the eval suite, deploy with monitoring for regressions. The discipline isn’t new --- it’s software engineering practice applied to prompts --- but applying it requires recognizing that prompts are engineering artifacts subject to engineering rigor.

Meta-prompting is the practice of using AI to generate or optimize prompts for other AI. The pattern emerged through 2024—2026 as foundation models became capable enough to reason about prompt design. Patterns include: prompt-from-examples (the model generates a prompt that would produce the given examples), prompt optimization (the model proposes improvements to an existing prompt against an eval set), self-prompting (the model writes its own task-specific prompt based on the user’s description). The discipline is less mature than the other practices in this chapter; the long-term value of AI-generated prompts vs. human-authored ones is still being figured out. Production teams that use meta-prompting typically use it as a starting point that humans refine rather than as the final artifact.

Part 2 — The Substrates

Eight sections survey the residual substrate of prompting and context engineering as of mid-2026. The section count matches prior volumes; the entry count is smaller, reflecting the volume’s narrower scope.

Sections at a glance

Section A --- System and user prompt design
Section B --- Few-shot and chain-of-thought patterns
Section C --- Context window engineering
Section D --- Structured outputs
Section E --- Prompt versioning and testing
Section F --- Model-specific prompting conventions
Section G --- Meta-prompting and optimization
Section H --- Discovery and resources

Section A — System and user prompt design

Role, persona, constraints, instruction structure --- the prompt as architectural artifact

System prompts are the longest piece of prompt material most production agents use. The system prompt establishes role, persona, capability scope, output guidelines, and constraints; user prompts then describe specific tasks within the established frame. The design discipline involves deciding what goes in the system prompt vs. the user prompt, how to structure instructions for the model’s preferences, and how to balance specificity against generalization across the range of inputs the agent will receive.

System prompt design patterns

Source: Pattern documented across vendor documentation; specific patterns in Anthropic’s prompt engineering guide, OpenAI’s GPT best practices, Google’s Gemini prompting guide

Classification Patterns for the system prompt that establishes agent role and behavior.

Intent

Design system prompts that produce reliable agent behavior across the range of user inputs by establishing role, capabilities, constraints, and output guidelines in a structure the model handles well.

Motivating Problem

System prompts that are too brief produce inconsistent behavior; the model fills gaps with defaults that may not match the intended agent. System prompts that are too long dilute attention and may trigger lost-in-the-middle effects within the prompt itself. System prompts with conflicting or vague instructions produce unpredictable behavior. The discipline is finding the right amount of specificity --- enough to produce reliable behavior, not so much that the prompt becomes unwieldy.

How It Works

Standard structure: most production system prompts follow a pattern. Role definition first (“You are X, a specialized assistant for Y”), capability description second (“You can help with A, B, C”), constraints third (“You will not D, E; if asked about D, say F”), output guidelines fourth (“Respond in X format; use Y tone”). The ordering matters because models attend more to instructions placed early; constraints buried at the end of long prompts get less consistent compliance.

Explicit vs. implicit constraints: explicit constraints (“do not discuss competitor products”) work more reliably than implicit ones (“stay on topic”). The model’s interpretation of implicit constraints varies; explicit constraints leave less room for the model’s judgment to diverge from the intended behavior.

Concrete behavior descriptions: “When the user asks about X, do Y” works better than “Handle X questions appropriately.” The concrete pattern gives the model a specific behavior to imitate; the abstract pattern relies on the model’s interpretation of “appropriate.”

Persona vs. role: persona (“You are a friendly, knowledgeable cooking assistant”) shapes tone and approach but not specific behavior. Role (“You are a customer support agent for Acme Inc.”) shapes specific behavior through the role’s associated expectations. Production system prompts typically combine both: persona establishes tone, role establishes capability scope.

Length considerations: production system prompts range from 100 tokens for simple agents to 5000+ tokens for complex agents with many tools and constraints. The right length depends on the agent’s complexity; shorter is generally better when shorter suffices. The discipline is fitting the necessary instructions in the minimum space, not maximizing the prompt to cover every possible case.

When to Use It

Every production agent has at least a brief system prompt; the question is how much structure and specificity to include. Use the standard pattern (role, capability, constraints, output) as a starting point; expand specific sections as the agent’s requirements demand. Use explicit over implicit constraints; use concrete behavior descriptions over abstract ones.

Alternatives --- packaged Skills (Volume 2) for cases where the instructions are large enough to be reusable artifacts. Multi-turn priming where the first user turn establishes context that the system prompt alone wouldn’t convey.

Sources

docs.claude.com/en/docs/build-with-claude/prompt-engineering
platform.openai.com/docs/guides/prompt-engineering
ai.google.dev/gemini-api/docs/prompting-strategies

Example artifacts

Code.

<!-- Anthropic-style system prompt with XML structure -->

<role>

You are a research assistant specialized in technology industry
analysis. You help users

understand market dynamics, competitive positioning, and emerging
trends.

</role>

<capabilities>

- Summarize industry reports and analyst coverage

- Compare products and companies on specific dimensions

- Identify trends from time-series data

- Flag conflicting information from different sources

</capabilities>

<constraints>

- Always cite sources when making factual claims

- Acknowledge uncertainty when sources disagree or data is
incomplete

- Do not predict stock prices or make investment recommendations

- If asked about regulated activities (securities, healthcare),
recommend consulting

a qualified professional rather than answering directly

</constraints>

<output_guidelines>

- Use markdown for structure (headers, bullets, tables) when content
has structure

- Use plain prose for narrative explanations

- Include a 'Sources' section at the end listing references

- Keep tone professional but accessible; avoid jargon when simpler
words work

</output_guidelines>

Section B — Few-shot and chain-of-thought patterns

Examples and reasoning triggers --- the techniques that evolved as models matured

Few-shot prompting (including examples in the prompt) was the dominant technique through 2022—2023. Chain-of-thought prompting (asking the model to reason step-by-step) emerged in 2022 and shifted production through 2023—2024. Both techniques have evolved as models improved zero-shot performance and as reasoning-trained models internalized chain-of-thought behavior. The patterns remain relevant but with narrower applicability than they had at peak.

Few-shot and N-shot prompting patterns

Source: Brown et al., “Language Models are Few-Shot Learners” (GPT-3 paper, 2020); accumulated practitioner knowledge through 2022—2026

Classification Including examples in the prompt to demonstrate desired behavior.

Intent

Improve model performance on specific tasks by including 1—N examples of desired input→output pairs in the prompt, allowing the model to infer patterns from the examples rather than relying on instruction-following alone.

Motivating Problem

Some tasks have specific format requirements, idiosyncratic conventions, or rare patterns that the model handles inconsistently zero-shot. Few-shot prompting bridges the gap: showing examples of the desired pattern produces more consistent compliance than describing it. The technique was foundational through 2022—2023 when zero-shot performance was unreliable for many tasks. By 2026 the technique is more selective --- modern models handle many tasks zero-shot adequately --- but it remains valuable for specific use cases.

How It Works

Number selection: 1-shot (one example), few-shot (typically 2—5), N-shot (more, up to context window limits). More examples improve pattern recognition for tasks where the pattern is genuinely complex; fewer examples are sufficient for tasks where the pattern is simple enough that 1—2 examples suffice. Diminishing returns set in quickly; production deployments rarely use more than 5—10 examples.

Example selection: the examples should cover the range of cases the agent will encounter. Production patterns include manual curation (engineers select representative examples), retrieval-augmented few-shot (the most similar examples to the current input are selected dynamically from a larger pool), and diverse selection (examples chosen to cover edge cases as well as common cases).

Ordering effects: examples earlier in the prompt typically have stronger influence; this is the same lost-in-the-middle pattern that affects long contexts generally. Production patterns: put the most representative examples first; vary difficulty progression (some practitioners put easy examples first to establish the pattern, then harder ones; others put hard examples first to emphasize edge cases).

Format consistency: the examples should match the desired output format exactly. Inconsistent example formats confuse the model; the model may produce output that mixes the formats it saw. The discipline is treating the examples as a strict format specification, not just a pattern hint.

When few-shot stops helping: modern models handle many tasks zero-shot adequately. Few-shot for these tasks adds context-window cost without proportionate benefit. The discipline is testing whether few-shot actually improves performance vs. zero-shot for the specific task and model; defaulting to few-shot for everything wastes context budget.

When to Use It

Tasks with specific format requirements where consistency matters. Idiosyncratic conventions specific to the application that the model wouldn’t know. Rare patterns or domains where the model’s training data didn’t include enough examples. Cases where zero-shot testing shows specific failure modes that examples address.

Alternatives --- zero-shot prompting with detailed instructions for cases where modern models handle the task adequately without examples. Fine-tuning for cases where the pattern is consistent enough and the volume high enough to justify dedicated training. Retrieval-augmented few-shot for cases where the pattern varies by input and dynamic example selection helps.

Sources

Brown et al., “Language Models are Few-Shot Learners” (2020)
Vendor prompt engineering guides for current best practices

Chain-of-thought and reasoning patterns

Source: Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022); reasoning-trained model documentation

Classification Prompting patterns for step-by-step reasoning and the implications of reasoning-trained models.

Intent

Improve performance on complex reasoning tasks by prompting the model to think step-by-step before producing the final answer, and understand how reasoning-trained models have shifted the technique’s applicability.

Motivating Problem

Many reasoning tasks (math problems, multi-step logic, complex analysis) benefit from explicit reasoning rather than direct answer generation. Through 2022—2023, prompting the model to “think step by step” before answering produced substantial accuracy improvements on these tasks. The technique entered production widely. Through 2024—2026, reasoning-trained models (OpenAI’s o-series, Anthropic’s extended thinking, Google’s reasoning models, others) internalized the pattern; explicit chain-of-thought prompts became less necessary for these models because the model reasons internally before producing output.

How It Works

Classical chain-of-thought: include “think step by step” or similar in the prompt; the model produces explicit reasoning followed by the final answer. Variants include “think carefully,” “work through this step by step,” “show your work,” and structured-thinking patterns that ask for specific reasoning phases. The technique works on non-reasoning-trained models and improves accuracy on complex tasks substantially.

Zero-shot chain-of-thought: a particularly effective variant uses “Let’s think step by step” as a magic phrase that triggers chain-of-thought without examples. The pattern emerged from research showing this specific phrasing was unusually effective; the effect varies across models but the pattern remains useful.

Few-shot chain-of-thought: combines chain-of-thought with few-shot prompting. Examples include both the reasoning and the final answer; the model learns to imitate both. The pattern is more powerful than either alone for complex tasks but consumes more context window.

Reasoning-trained models: o-series at OpenAI (o1, o3, etc.), Anthropic extended thinking, Google reasoning models, others internalize chain-of-thought reasoning. The model reasons during a dedicated reasoning phase before producing user-visible output; the explicit chain-of-thought prompt is unnecessary because the model is already doing the reasoning. Prompts that ask for chain-of-thought on reasoning-trained models are redundant; the model was going to think anyway.

Hybrid use: production agents that use both reasoning-trained and non-reasoning-trained models for different subtasks need different prompting for each. The reasoning-trained model gets direct task descriptions; the non-reasoning-trained model gets explicit chain-of-thought prompts. The architectural pattern emerged through 2025—2026 as teams discovered that the right model for each subtask varies and the prompts need to vary with it.

Visibility considerations: reasoning-trained models typically produce extensive internal reasoning that may not be shown to users. The reasoning is available through APIs (extended thinking blocks in Anthropic’s API, reasoning tokens in OpenAI’s API) but production UIs often hide it by default with progressive disclosure (Volume 13 covers). Whether to show reasoning to users is a UX decision separate from the prompting decision.

When to Use It

Complex reasoning tasks on non-reasoning-trained models. Math problems, multi-step logic, careful analysis of conflicting evidence. Tasks where the user benefits from seeing the reasoning (educational use cases, transparency-required use cases). Cases where reasoning quality matters more than latency.

Skip for reasoning-trained models doing tasks they’re already reasoning about. Skip for simple tasks where chain-of-thought adds latency and tokens without benefit. Skip when the cost of additional output tokens (reasoning is verbose) exceeds the accuracy improvement.

Sources

Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022)
Kojima et al., “Large Language Models are Zero-Shot Reasoners” (2022)
Vendor documentation for reasoning models

Section C — Context window engineering

Selection, ordering, and compression --- the discipline beyond “does it fit”

Context windows in 2026 are large --- 200K tokens is common, 1M tokens is available, longer windows are coming. The growth changed the discipline. The question is no longer “what fits” but “what helps.” Selection (what goes in), ordering (in what position), and compression (when budget is exceeded) are the three working sub-disciplines. Production agents that ignore them underperform agents that engineer them deliberately, regardless of the nominal window size.

Context selection and ordering at production scale

Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023); accumulated practitioner knowledge through 2024—2026

Classification Engineering the contents and position of the context window for production performance.

Intent

Determine what content goes into the context window and in what order to maximize the model’s ability to use it effectively, recognizing that larger context windows have not eliminated the need for this engineering.

Motivating Problem

Large context windows let agents include more content than smaller windows allowed, but the model’s ability to use the content degrades as content grows. Lost-in-the-middle effects mean information buried in the middle of a long context is recalled less reliably than information at the start or end. Attention dilution means instructions in a context with much surrounding content compete with the surrounding content for attention. Instruction-following degrades subtly as context length grows. The naive approach --- include everything that might be relevant --- produces context that’s technically within the window but in which the model’s performance has degraded.

How It Works

Selection patterns: include what’s decision-relevant; exclude what isn’t. The challenge is determining relevance before the model runs. Patterns: recency-based selection (recent conversation turns matter more than old ones; recent tool results matter more than old ones), similarity-based selection (semantic similarity to the current query/task), explicit tagging (the agent’s logic marks specific content as required vs. optional), LLM-based pre-filtering (a smaller cheaper model judges relevance before the main model runs).

Ordering patterns: place decision-critical content at the start or end of the context window, exploiting the strong-recall positions. For RAG, this means the top-relevance retrieved documents should be at the start or end of the retrieved-context section, not buried in the middle. For conversation history, this means summarizing or repeating key prior content near the start of the current context window rather than relying on the model to find it in the middle of a long history.

Lost-in-the-middle phenomenon: documented in Liu et al. (2023) and confirmed across many models. Information at the beginning of a long context (“primacy” effect) and at the end (“recency” effect) is recalled reliably; information in the middle of long contexts is recalled less reliably. The effect varies by model and task but the general pattern persists across providers.

Trade-off considerations: context engineering adds engineering complexity. Production agents that don’t engineer context can still work; they just underperform what they could be. The discipline is justified when the agent’s value depends on careful use of available context (RAG-heavy agents, multi-turn agents with long histories, agents with many tools); it’s less critical for short-interaction agents where context engineering has less room to help.

Tools that help: agent frameworks (Volume 14 covers) provide primitives for context management. LangGraph’s state management, LlamaIndex’s indexing and retrieval, custom solutions built on agent frameworks. The tools don’t make context engineering automatic; they provide the primitives the team uses to implement the engineering discipline.

When to Use It

Production agents where context window approaches half of nominal capacity or more. RAG-heavy agents where retrieved content competes with other context for window space. Multi-turn agents with long conversation histories. Agents with many tools whose definitions consume significant context budget.

Alternatives --- architectural changes that reduce context needs (memory persistence via Volume 6 rather than including full history in context; specialized smaller agents for sub-tasks instead of one large agent handling everything). The right alternative depends on whether context engineering or architectural change addresses the root cause.

Sources

Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023)
Various papers on long-context model evaluation and limitations

Context compression patterns

Source: Various academic and practitioner papers on summarization and compression for LLM context

Classification Reducing context size when decision-relevant content exceeds the available budget.

Intent

Compress context content when the relevant material exceeds the available context budget, using summarization, extraction, hierarchical retrieval, or compression-specific models, with explicit awareness of the trade-offs each compression approach introduces.

Motivating Problem

Even with large context windows, some applications have more decision-relevant content than fits. Long-running conversations accumulate history. RAG corpora can be vast. Tool result chains can compound across many calls. When the relevant content exceeds the available budget, compression becomes necessary. The choice among compression strategies depends on what information needs to be preserved and what can be lost; each strategy has characteristic trade-offs.

How It Works

Summarization (lossy): replace prior content with model-generated summaries. The summary preserves overall content at higher information density but loses specifics. Useful for conversation history where the gist matters more than exact wording; less useful for cases where specific facts or numbers need to be preserved exactly.

Extraction (precise): pull specific facts, decisions, or key information from prior content into a structured form. Less lossy than summarization for the extracted facts but discards everything not extracted. Useful for cases with well-defined facts to track; less useful for cases where the narrative or context around facts matters.

Hierarchical retrieval: store full content in an external retrieval system; retrieve summaries first; retrieve details on demand when specific information is needed. The pattern works well for RAG corpora where most documents are not relevant to most queries; it adds latency through additional retrieval round-trips when details are needed.

Compression models: dedicated smaller LLMs trained to compress context for larger ones. The compression model takes a long context and produces a shorter one optimized for downstream model consumption. The approach is more sophisticated than naive summarization but adds complexity (another model to run, another point of failure, another evaluation challenge); production adoption is uneven.

Sliding window patterns: keep the most recent N turns at full fidelity; summarize older turns; further-summarize the oldest turns. The hierarchical aging pattern matches how human memory works and fits many production conversation patterns.

Selective preservation: identify content that must be preserved exactly (specific facts the user mentioned, specific tool results referenced) and content that can be compressed. Apply different compression strategies to different categories. The pattern is more complex to implement but produces better results than uniform compression.

When to Use It

Long-running conversations where full history exceeds context window. RAG systems where the relevant retrieved content exceeds the budget. Multi-step agent workflows where prior tool results accumulate. Production deployments where context budget management is a recurring concern.

Alternatives --- larger context windows where the issue is purely “doesn’t fit” rather than “doesn’t help.” Architectural changes that reduce context needs (Volume 6 memory patterns; Volume 10 retrieval patterns). The right alternative depends on whether the content is fundamentally too large or whether the architecture is making the content larger than it needs to be.

Sources

Academic literature on context compression and summarization
Practitioner documentation in LangChain, LlamaIndex, and related framework documentation

Section D — Structured outputs

JSON mode, schema-constrained generation, tool-calling as output structure

Structured output prompting was a substantial discipline through 2022—2023: getting reliable JSON, XML, or other structured output from models required careful prompt design with format specifications, validation prompts, and retry loops on parse failures. The discipline shifted through 2024—2026 as vendors shipped structured output features (OpenAI’s structured outputs with guaranteed schema compliance, Anthropic’s tool use producing typed outputs, similar at other providers). The prompting discipline narrowed but didn’t disappear: the structured output features handle the easy case; the prompting discipline still matters for cases the features don’t cover.

Structured output prompting and vendor features

Source: OpenAI structured outputs (Aug 2024); Anthropic tool use with typed schemas; Google Gemini function calling; Pydantic-based structured output libraries

Classification Producing reliable structured output from LLMs through vendor features or prompt engineering.

Intent

Get reliable, parseable structured output (JSON, XML, custom formats) from LLMs using vendor-provided structured output features where available, falling back to prompt engineering and validation patterns where they aren’t.

Motivating Problem

Many production agents need to produce structured output that downstream systems parse --- API calls, database operations, structured analyses, configuration generation. Through 2022—2023, achieving this reliably required prompt engineering: explicit format specifications, examples of valid output, validation prompts, retry loops on parse failures. Through 2024—2026, vendor features absorbed much of this work: OpenAI’s structured outputs (announced August 2024) guarantee 100% schema compliance for JSON outputs; Anthropic’s tool use produces typed outputs through the tool-calling mechanism; similar features exist at other providers. The prompt engineering work remaining is narrower but still real.

How It Works

Vendor structured output features: define a JSON schema or Pydantic model; the model is constrained to produce output that exactly matches the schema. OpenAI’s structured outputs use grammar-constrained decoding under the hood to guarantee compliance. Anthropic’s tool use treats tool definitions as schemas and produces typed tool inputs. The features work for cases where the schema is known and the structure fits the feature’s constraints.

When vendor features apply: most production structured output use cases. JSON outputs for downstream APIs. Function call arguments. Configuration generation. Structured analysis with predefined fields. The features work for these and should be the default choice when available.

When vendor features don’t apply: outputs that aren’t JSON (XML, custom delimited formats, prose with embedded structure). Outputs whose structure varies based on the input (different schemas for different cases). Outputs that need to be partially structured and partially free-form. Cases where the vendor feature has limitations (Anthropic’s tool use has some schema patterns it handles less well; OpenAI’s structured outputs have some recursive schema limitations).

Fallback prompting patterns: when vendor features don’t apply, the older prompting discipline applies. Explicit format specification in the prompt (“Respond with JSON matching this schema…”). Example outputs showing the expected structure. Validation prompts that check the output before downstream use. Retry loops on parse failures, with explicit error feedback to the model.

Pydantic-based libraries: Instructor, Outlines, and similar libraries provide schema-based structured output across providers. Some use vendor features under the hood; some use prompting plus validation. They provide cross-provider abstractions over the vendor-specific features.

XML and custom formats: for Anthropic Claude specifically, XML-tag-delimited outputs work reliably through prompt design alone (the model is trained well for this pattern). Other custom formats can work through prompt engineering but require more careful prompt design and validation.

When to Use It

Any production agent that needs to produce parseable output for downstream systems. Use vendor structured output features as the default when they apply; the reliability is much better than prompt engineering alone. Fall back to prompt engineering for cases the vendor features don’t cover.

Alternatives --- unstructured output with downstream parsing for cases where the parsing is robust enough to handle variability. Tool calling for cases where the structured output is naturally a function call. The boundary between structured output and tool calling is fuzzy; production deployments often use tool calling for what is logically structured output.

Sources

openai.com/index/introducing-structured-outputs-in-the-api/
docs.claude.com (Anthropic tool use documentation)
github.com/jxnl/instructor (Instructor library)
github.com/outlines-dev/outlines (Outlines library)

Sampling parameters — temperature, top-p, and top-k

Source: Anthropic Messages API parameter documentation; Holtzman et al., “The Curious Case of Neural Text Degeneration” (nucleus sampling, 2019); OpenAI and Google generation-parameter guides

Classification Controlling output randomness and diversity at generation time through decoding parameters.

Intent

Control the randomness and diversity of model output at generation time through the decoding parameters --- temperature, top-p (nucleus sampling), and top-k --- trading determinism against exploration to match the task, rather than accepting whatever the defaults produce.

Motivating Problem

Prompt design determines what the model is asked; the decoding parameters determine how it selects among candidate tokens once it has been asked. The same prompt produces different output depending on these settings, and teams that leave them at defaults hit two opposite failure modes: extraction and classification tasks come back inconsistent because the sampling is too random, and ideation tasks come back flat and repetitive because it is too constrained. The parameters are a small control surface --- three numbers --- that prompt engineering alone does not address, and getting them wrong undermines otherwise well-designed prompts. They also govern reproducibility: without deliberate settings, the same input is not guaranteed to give the same output across calls.

How It Works

Temperature: scales the logit distribution before the final softmax. At temperature 0 the model is greedy --- it always takes the highest-probability token, the closest the API offers to deterministic output. Raising the temperature toward 1.0 (and, on some APIs, beyond) flattens the distribution and raises the odds of lower-probability tokens being chosen. Use low temperature (roughly 0—0.3) for extraction, classification, and code, where one correct answer exists; use higher temperature (roughly 0.7—1.0) for brainstorming, drafting, and any task that benefits from variation.

Top-p (nucleus sampling): restricts sampling to the smallest set of tokens whose cumulative probability crosses the threshold p. The pool sizes itself dynamically --- a handful of tokens when the model is confident, many when it is uncertain. Typical production values sit around 0.9—0.95. Top-p caps the long tail of implausible tokens without fixing the pool size the way top-k does.

Top-k: restricts sampling to the k highest-probability tokens, a fixed-size pool regardless of the model’s confidence. A top-k of 1 is greedy decoding; larger values admit more diversity. Top-k is the bluntest of the three and the least commonly tuned.

One knob at a time: the three parameters interact, and their effects overlap. Anthropic’s guidance --- echoed by other providers --- is to adjust temperature first and leave top-p and top-k at their defaults, reaching for the others only with a specific reason. Tuning several at once makes behavior hard to reason about and hard to reproduce.

The determinism caveat: temperature 0 reduces variation but does not formally guarantee identical output. Floating-point behavior, hardware, and server-side batching introduce residual nondeterminism. Treat temperature 0 as “as reproducible as the API offers,” not as a hard guarantee, and pin model versions when reproducibility matters.

Provider coverage: not every provider exposes all three parameters, and their ranges differ. Claude exposes temperature, top_p, and top_k, with documentation steering most users toward temperature.

When to Use It

Every production call implicitly chooses these values; the discipline is choosing them deliberately. Set temperature low for structured extraction, verification, routing, and any deterministic pipeline --- this pairs directly with the structured-output and validation patterns in this section. Set it higher where diversity is the point, including the parallel-sampling patterns in Volume 1 (Parallelization --- Voting and Generator—Verifier both depend on temperature diversity to make repeated samples worthwhile). As a default, change only temperature; reach for top-p or top-k when you have a concrete reason the temperature knob cannot serve.

Alternatives --- structured output features and prefilling (below) constrain the shape of the output rather than its randomness; they compose with a low temperature rather than replacing it.

Sources

docs.claude.com/en/api/messages (temperature, top_p, top_k parameters)
arxiv.org/abs/1904.09751 (Holtzman et al., nucleus sampling)
platform.openai.com/docs/api-reference (sampling parameters)

Example artifacts

Code.

import anthropic

client = anthropic.Anthropic()

# Deterministic extraction: temperature 0, one right answer
extraction = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=512,
    temperature=0,
    messages=[{"role": "user", "content": "Extract the invoice total as JSON."}],
)

# Diverse ideation: higher temperature, top_p / top_k left at defaults
ideation = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=512,
    temperature=1.0,
    messages=[{"role": "user", "content": "Brainstorm ten taglines for a coffee brand."}],
)

Prefilling the assistant turn

Source: Anthropic prefill documentation (docs.claude.com); accumulated practitioner knowledge on output steering

Classification Steering output format and opening tokens by seeding the start of the assistant's response.

Intent

Steer the model’s output --- its format and its first tokens --- by seeding the beginning of the assistant’s reply in the request itself, rather than relying on instructions alone to produce the desired shape.

Motivating Problem

An instruction like “respond only with JSON” is probabilistic: the model usually complies, but it may still open with a preamble (“Here is the JSON you asked for:”) that breaks a downstream parser, and instruction-only formatting gets less reliable as the context grows. Some cases need a way to lock the opening of the response that does not depend on the model choosing to comply on that particular call. Prefilling supplies it by committing the first characters of the response deterministically.

How It Works

The mechanism: the Anthropic Messages API allows the final entry in the messages array to carry the role “assistant” with partial content. The model continues from that text instead of starting a fresh turn, so whatever is placed there is committed as the beginning of the response. This is a request-shape feature, not a prompt instruction, which is why it is more reliable than asking in words.

Format anchoring: prefill an opening brace to force a JSON object, an opening tag to force XML, a header to force a markdown document, or a language keyword to force a code block. Because the first token is fixed, the model cannot emit a preamble, and the format is anchored from token one.

Assembling the output: the response is the continuation and does not repeat the prefill, so the application prepends the prefill text when reconstructing the full value --- prefill an opening brace, and the returned text completes the object. Keep prefills short, and note that the API rejects trailing whitespace in the prefilled content.

Pairing with stop sequences: prefilling composes well with stop sequences. Seed an opening tag and stop generation on its closing tag to get exactly the delimited block, with no surrounding prose to strip.

Not a schema guarantee: prefilling fixes the start of the output, not its full structure. It is not a replacement for tool use or structured-output features, which constrain the entire schema. For field-level guarantees, combine prefilling with those features and with validation; the practical reliability ordering runs tool use, then structured-output mode, then prefilling and format anchoring, then plain format instructions.

A safety note: because prefilling seeds the assistant’s own words, it can be misused to push the model past a refusal by starting the disallowed answer for it. Anthropic documents this; the legitimate use is format control, and seeding content to bypass safety behavior is a misuse to avoid.

Portability: prefilling the assistant turn is an Anthropic feature (with partial analogues elsewhere) and is not universally available --- some providers do not accept a trailing assistant message. Deployments targeting multiple providers should treat it as a Claude-specific optimization.

When to Use It

Use prefilling when a specific output format or opening is required and instruction-only control has proven insufficient --- forcing JSON for extraction, enforcing a fixed template, or continuing a partial answer. Prefer tool use or structured-output features when full-schema guarantees are needed, and use prefilling either as the lightweight option for anchoring format or in combination with those features to pin the opening tokens they do not.

Alternatives --- structured output and tool calling (this section) for whole-schema constraint; few-shot examples (Section B) for demonstrating a format when the opening does not need to be hard-locked.

Sources

docs.claude.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response
docs.claude.com/en/docs/build-with-claude/tool-use

Example artifacts

Code.

import anthropic

client = anthropic.Anthropic()

# Prefill the assistant turn with an opening brace to force a JSON object
resp = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Extract name and email from: Jane Doe, jane@acme.com"},
        {"role": "assistant", "content": "{"},
    ],
)

# The response continues from the brace; prepend the prefill to rebuild the object
data = "{" + resp.content[0].text

Section E — Prompt versioning and testing

Prompts as engineering artifacts --- the operational discipline

Production prompts are engineering artifacts subject to the same operational discipline as production code: versioning, testing, deployment, rollback, monitoring. The discipline borrows from software engineering and adapts to LLM specifics (stochasticity makes pure assertion testing harder; A/B testing patterns matter more). Prompt registries (LangChain Hub, PromptLayer, Anthropic Workbench, OpenAI Playground prompts) provide infrastructure for the discipline; the practice of using them well is what distinguishes mature deployments from improvised ones.

Prompt registries and versioning (LangChain Hub, PromptLayer, vendor playgrounds)

Source: LangChain Hub (smith.langchain.com/hub); PromptLayer (promptlayer.com); Anthropic Workbench; OpenAI Playground saved prompts

Classification Infrastructure for prompt versioning, sharing, and lifecycle management.

Intent

Treat prompts as versioned artifacts with explicit history, rollback capability, and lifecycle management, using prompt registry infrastructure that separates prompts from application code while maintaining the connection between specific prompt versions and specific deployment versions.

Motivating Problem

Prompts embedded in application code work for simple cases but produce problems at scale. Changes to prompts mix with code changes in pull request reviews, making the prompt impact hard to evaluate. Rolling back a prompt change requires a code deployment. Sharing prompts across applications requires copying code. Comparing prompt variants across deployments is hard. Prompt registries solve these problems by separating prompts from code: prompts are versioned independently, applications reference specific prompt versions, changes can be deployed and rolled back without code deployments.

How It Works

LangChain Hub: hosted prompt registry by LangChain. Prompts are stored with versions, tags, and metadata; applications reference them by name + version. Integrates with LangChain framework natively but the prompts can be used outside LangChain. Free for public prompts; commercial for private/team.

PromptLayer: dedicated prompt registry with versioning, A/B testing, and observability. Integrates with multiple model providers; tracks which prompt versions are used in which requests. Commercial product with various tier options.

Vendor playgrounds: OpenAI Playground saved prompts, Anthropic Workbench, Google AI Studio. Each vendor has its own prompt management UI tied to their platform. The lock-in is real (prompts saved in one vendor’s playground don’t easily move to another); the convenience is high for single-vendor deployments.

Self-hosted patterns: some teams build their own prompt registry as part of their internal platform. The pattern is appropriate for teams with strong infrastructure capabilities and specific requirements existing registries don’t meet; for most teams, the existing registries are sufficient and self-hosting is unnecessary complexity.

Operational patterns: typical deployment workflow involves authoring prompts in the registry, testing them through registry tooling, deploying specific versions to specific environments (staging, production), monitoring performance, rolling back when needed. The pattern mirrors software deployment workflow with prompt-specific tooling.

Choice criteria: LangChain Hub for teams using LangChain ecosystem. PromptLayer for teams wanting framework-agnostic prompt management with strong observability. Vendor playgrounds for single-vendor deployments with simple needs. Self-hosted for teams with specific platform requirements.

When to Use It

Production deployments with non-trivial prompt counts. Teams where multiple people change prompts. Cases where prompt changes need rollback capability separate from code rollback. Deployments where prompt A/B testing is part of the development process.

Alternatives --- embedded prompts in code for simple deployments where the operational overhead of a registry isn’t justified. The transition point is typically when the team grows past 2—3 people changing prompts or when the prompt count grows past about 10—15.

Sources

smith.langchain.com/hub
promptlayer.com
Anthropic Workbench documentation

Prompt testing and evaluation patterns

Source: Various practitioner literature; vendor evaluation tools (Anthropic Evals, OpenAI Evals); third-party (Promptfoo, Braintrust)

Classification Testing prompts for behavior verification, regression detection, and quality comparison.

Intent

Apply systematic testing discipline to prompts: verify specific behaviors against test cases, detect regressions when prompts or models change, compare prompt variants quantitatively for A/B testing decisions.

Motivating Problem

Prompts have consequences. Changes to a system prompt can shift agent behavior in ways the team doesn’t anticipate; the shift can be improvement or regression. Without testing, the team learns about regressions through user complaints or visible failures, after the regression has affected users. With testing, regressions are caught before deployment, and prompt changes can be evaluated quantitatively rather than by impression. The testing discipline borrows from software testing but adapts to LLM stochasticity --- pure assertion testing doesn’t work because the same input can produce different outputs; statistical testing replaces deterministic testing.

How It Works

Test case construction: build a set of test cases that represent the agent’s expected inputs. Include common cases, edge cases, and known failure modes from production. Each test case has an input (what the user/system provides) and an expected output or assertion (what the agent should produce, possibly with tolerance for variation).

Grading strategies: exact-match grading for cases where the output is deterministic (structured outputs, specific format requirements). LLM-as-judge grading for cases where the output is open-ended and a stronger model can assess whether it meets quality criteria. Embedding-similarity grading for cases where semantic similarity to a reference matters more than exact wording. Multiple grading strategies typically apply to different parts of the test suite.

Run frequency: pre-deployment runs (test the new prompt against the suite before deploying); periodic runs (test the production prompt against the suite to catch drift); A/B testing runs (compare prompt variants statistically against the same test cases or against production traffic).

Tooling: vendor-native eval tools (Anthropic Evals, OpenAI Evals) integrate with their respective platforms. Third-party tools (Promptfoo, Braintrust, custom frameworks) provide cross-platform evaluation. The choice depends on the team’s platform commitments and feature requirements.

Integration with observability: production observability (Volume 7) provides the traces that show how the agent actually behaved on real inputs. Eval suites bridge observability and testing: real inputs from production traces can become test cases; production failures inform new test cases; eval results inform deployment decisions.

Limitations: LLM stochasticity means tests can pass on one run and fail on another. Statistical interpretation (run each test case multiple times; report success rate; track distributions) replaces binary pass/fail. The discipline is more nuanced than software testing; the LLM-specific patterns are still evolving.

When to Use It

Production agents where prompt changes are frequent enough that regressions matter. High-stakes agents where prompt-induced failures have costs. Teams with multiple people making prompt changes who need shared validation. Cases where A/B testing prompt variants would inform deployment decisions.

Alternatives --- manual testing for simple cases where the prompt count and change frequency don’t justify automated testing infrastructure. The transition point varies; teams typically discover the need for systematic testing after experiencing a few painful regressions.

Sources

anthropic.com/news/evals (Anthropic Evals)
platform.openai.com/docs/guides/evals (OpenAI Evals)
promptfoo.dev, braintrust.dev

Section F — Model-specific prompting conventions

Anthropic, OpenAI, Google --- vendor-specific patterns and cross-model portability

Different foundation models have developed different prompting conventions. The conventions reflect each vendor’s training data and recommendations: Anthropic emphasizes XML-tag-delimited structure; OpenAI handles markdown-formatted prompts and uses the system/user/assistant turn structure differently; Google has its own conventions emphasizing explicit instruction blocks. Cross-model portability of prompts is partial; production deployments typically lock in conventions for their primary model with explicit awareness that switching providers requires prompt rework.

Model-specific prompting conventions across major providers

Source: Anthropic prompt engineering guide; OpenAI prompt engineering best practices; Google Gemini prompting strategies; practitioner experience across providers

Classification Conventions for prompting specific foundation models effectively.

Intent

Use the prompting conventions each major foundation model handles best, recognizing that cross-model portability is partial and production deployments need to be aware of which conventions they’re using.

Motivating Problem

The same prompt instruction phrased differently produces different results across models. Anthropic’s Claude family handles XML-tagged content particularly well; the same prompt with markdown instead of XML works but produces less consistent behavior. OpenAI’s GPT family handles markdown well and treats the system/user/assistant turn structure as a meaningful part of the prompt; system prompts have different effective weight than user-turn instructions. Google’s Gemini has its own preferred conventions. Production prompts that work well on one model may underperform on another not because of capability differences but because of convention mismatches.

How It Works

Anthropic XML conventions: Claude models handle XML-tag-delimited content particularly reliably. Wrapping instructions in `<instructions>`, examples in `<examples>`, retrieved content in `<context>` or `<document>`, user input in `<user_input>` improves Claude’s ability to distinguish the parts of the prompt and produce structured output. Anthropic’s documentation explicitly recommends this convention; production deployments using Claude typically adopt it.

OpenAI markdown and turn structure: GPT models handle markdown formatting well and treat the system prompt differently from user-turn content. The system prompt sets persistent behavior; user-turn content sets the immediate task. Production deployments using GPT typically use markdown for structure within prompts and put role/persona/constraint instructions in the system prompt with task-specific instructions in user-turn content.

Google Gemini conventions: Gemini handles explicit instruction blocks and is comparatively flexible across formats. Google’s documentation recommends clear instruction structure but doesn’t emphasize a specific markup convention as strongly as Anthropic does for XML or OpenAI does for markdown.

Cross-model portability: prompts can typically be moved between models with some adaptation. The structural conventions need adjustment; the underlying instructions usually transfer. Production deployments that need cross-model portability typically develop their own intermediate representation that gets transformed into model-specific prompts at deployment time.

Tool calling and structured outputs: each provider has its own tool calling and structured output mechanisms (Volume 3 covers tool calling specifically). The prompt design implications differ: when tool calling is the primary output mechanism, the prompt design focuses on tool selection rather than output structure; when structured outputs are used, the prompt design references the schema rather than describing the structure verbally.

Multi-vendor deployments: agents that use different models for different subtasks need different prompt conventions for each model. The pattern emerged through 2025—2026 as teams discovered that the right model for each subtask varies and the prompts need to vary accordingly. Production deployments typically have prompt libraries organized by model.

When to Use It

Production deployments using a specific foundation model where the model-specific conventions produce measurable performance improvements. Cross-model deployments where prompts need to be adapted for each model. Cases where prompt-engineering effort justifies investment in model-specific conventions vs. lowest-common-denominator approaches.

Alternatives --- lowest-common-denominator prompts that work acceptably across models for cases where portability matters more than per-model optimization. Vendor-agnostic prompt libraries that use neutral conventions and accept some per-model performance loss.

Sources

docs.claude.com/en/docs/build-with-claude/prompt-engineering
platform.openai.com/docs/guides/prompt-engineering
ai.google.dev/gemini-api/docs/prompting-strategies

Section G — Meta-prompting and optimization

AI-generated prompts and the discipline that’s still maturing

Meta-prompting is the practice of using AI to generate or optimize prompts for other AI. The pattern emerged through 2024—2026 as foundation models became capable enough to reason about prompt design. The discipline is less mature than the other practices in this volume; the long-term value of AI-generated prompts vs. human-authored ones is still being figured out. Production teams that use meta-prompting typically use it as a starting point that humans refine rather than as the final artifact.

Meta-prompting and prompt optimization patterns

Source: Various practitioner reports through 2024—2026; Anthropic prompt generator in Workbench; OpenAI prompt enhancement features; research literature on automated prompt optimization

Classification Using AI to generate or improve prompts for other AI.

Intent

Use foundation models to generate, optimize, or refine prompts for other AI tasks, leveraging the model’s knowledge of effective prompt patterns to accelerate prompt development.

Motivating Problem

Prompt design is time-consuming. A new agent’s system prompt takes hours to days to develop well; refinement based on observed behavior takes more time. Meta-prompting aims to accelerate this work by having AI generate first-draft prompts, suggest improvements to existing prompts, or optimize prompts against evaluation criteria. The pattern is appealing because it scales prompt engineering effort; the limitation is that AI-generated prompts may not match what a thoughtful human author would produce.

How It Works

Prompt-from-description: the user describes what the agent should do in natural language; the model generates a system prompt that would implement that description. Anthropic’s prompt generator in Workbench implements this pattern; OpenAI has similar tools. The pattern produces reasonable first drafts that humans typically refine.

Prompt-from-examples: the user provides examples of desired input→output behavior; the model generates a prompt that would produce that behavior. The pattern works better than prompt-from-description for cases where the behavior is hard to describe but easy to demonstrate.

Prompt optimization against evals: given an existing prompt and an eval suite, the model proposes modifications and tests them against the evals. The optimization loop runs until improvements plateau. The pattern is more sophisticated than first-draft generation but requires significant infrastructure (eval suite, optimization loop, model calls for each variant).

Self-prompting: the model writes its own task-specific prompt at runtime based on the user’s description. The pattern lets the same agent handle varied tasks by generating appropriate prompts dynamically. The trade-off is increased latency (a model call to generate the prompt before the main task) and reduced reproducibility (the generated prompt may vary across runs).

Limitations of AI-generated prompts: the model’s knowledge of effective prompts comes from training; the model may not know what works best for the specific deployment context, the specific model the prompt will run on (Anthropic’s model generates prompts; whether those prompts run best on Anthropic’s model vs. others is unclear), or the specific edge cases the human author has encountered. Production deployments typically treat AI-generated prompts as drafts that human authors refine.

Research direction: academic literature on automated prompt optimization (DSPy, Promptbreeder, others) explores more systematic approaches. The pattern of automatically discovering optimal prompts through search and evaluation has theoretical appeal; production adoption is less consolidated than for the simpler meta-prompting patterns above.

When to Use It

First-draft prompt generation when starting a new agent; the AI-generated draft is a starting point rather than the final artifact. Prompt iteration where the human author wants suggestions for improvement. Cases where prompt design effort is the bottleneck and faster iteration matters more than peak optimization.

Alternatives --- human-authored prompts for cases where the prompt quality matters more than development speed. Prompt templates for cases where the pattern is well-known and parameterization captures the variation. Direct experimentation with prompt variants for cases where the team has the time and infrastructure to test systematically.

Sources

Anthropic Workbench prompt generator documentation
OpenAI prompt enhancement documentation
DSPy framework (dspy.ai); various research papers on automated prompt optimization

Section H — Discovery and resources

Where to track prompting discipline evolution as models continue to mature

Prompting is a rapidly evolving discipline. Techniques that were dominant in 2022 are less essential in 2026; techniques that emerged through 2024—2026 may evolve further. Staying current requires tracking vendor prompt engineering guides, practitioner literature, and research as foundation models continue to mature.

Resources for tracking prompting discipline

Source: Various vendor guides, practitioner publications, academic research

Classification Sources for staying current on prompting discipline as it evolves.

Intent

Provide pointers to the active sources of prompting discipline knowledge: vendor documentation that captures current recommendations, practitioner publications that capture working knowledge, and academic literature that captures emerging techniques.

Motivating Problem

Prompting techniques evolve with model capabilities. Vendor recommendations shift as models change; practitioner knowledge accumulates and gets superseded; academic research produces new techniques that may or may not transfer to production. Staying current requires continuous attention to multiple sources.

How It Works

Vendor prompt engineering guides: Anthropic’s prompt engineering documentation; OpenAI’s prompt engineering best practices; Google’s Gemini prompting strategies. The vendor guides are the most authoritative source for current recommendations on each vendor’s models; they update as the models and recommendations evolve.

Practitioner publications: Eugene Yan’s blog (eugeneyan.com), Hamel Husain’s writing on AI development, the Latent Space podcast and newsletter, various AI engineering newsletters. The practitioner literature captures working knowledge that vendor documentation may not include and that academic literature may not yet address.

Academic research: NeurIPS, ICLR, ACL conferences regularly publish prompt engineering research. The research is more rigorous than practitioner writing but slower to surface; useful for understanding the underlying mechanisms behind techniques that practitioners discover empirically.

Cookbooks and example collections: vendor-provided example collections (Anthropic’s prompt library, OpenAI’s cookbook), community-collected prompt collections (awesome-prompts on GitHub, various community libraries). The collections provide reference implementations of patterns; useful for seeing how others have approached similar problems.

Practical pattern: most teams develop their own working knowledge through experimentation, supplemented by tracking the sources above for new techniques and patterns. The discipline isn’t entirely public; significant working knowledge stays inside teams; the public sources provide enough to bootstrap teams that are starting fresh.

When to Use It

Teams building AI products who need to develop prompting expertise. Engineers transitioning to AI work from other domains. Continuous education as the discipline evolves. Reference when specific patterns prove insufficient in practice.

Alternatives --- outsourcing to specialized consultants for high-stakes cases. Internal pattern documentation for teams with mature practice. The combination of external tracking and internal knowledge is the working pattern for most production teams.

Sources

docs.claude.com, platform.openai.com, ai.google.dev (vendor guides)
eugeneyan.com, hamel.dev (practitioner writing)
Latent Space podcast and newsletter
GitHub anthropics/anthropic-cookbook, openai/openai-cookbook

Appendix A --- Pattern Reference Table

Cross-reference of the patterns covered in this volume with what each solves, when to use each, and the section that covers it in detail.

Pattern	Solves	When to use	Section
System prompt design	Establishing role and behavior	Most production agents	Section A
Few-shot / N-shot	Demonstrating desired patterns	Tasks needing specific formats	Section B
Chain-of-thought	Step-by-step reasoning	Complex tasks on non-reasoning models	Section B
Context selection	Choosing what model sees	Production agents with rich context	Section C
Context ordering	Position-aware placement	Long-context deployments	Section C
Context compression	Reducing to fit budget	When relevant content exceeds window	Section C
Structured output prompting	JSON/typed output guarantees	Downstream parsing requirements	Section D
Prompt registry / versioning	Operational lifecycle	Production deployments at scale	Section E
Prompt testing	Verification and regression detection	Frequent prompt changes	Section E
Model-specific conventions	Per-vendor optimization	Cross-model or vendor-optimized work	Section F
Meta-prompting	AI-generated prompts	Prompt development acceleration	Section G

Appendix B --- The Fifteen-Volume Series

This catalog joins the fourteen prior volumes to form a fifteen-layer vocabulary for agentic AI, with the explicit caveat that this volume is the weakest entry in the series.

Volume 1 --- Patterns of AI Agent Workflows --- the timing of agent runs.
Volume 2 --- The Claude Skills Catalog --- model instructions in packaged form.
Volume 3 --- The AI Agent Tools Catalog --- the function-calling primitives.
Volume 4 --- The AI Agent Events & Triggers Catalog --- the activation layer.
Volume 5 --- The AI Agent Fabric Catalog --- the infrastructure substrate.
Volume 6 --- The AI Agent Memory Catalog --- the state and context layer.
Volume 7 --- The Human-in-the-Loop Catalog --- HITL engineering.
Volume 8 --- The Evaluation & Guardrails Catalog --- LLM-internal safety.
Volume 9 --- The Multi-Agent Coordination Catalog --- agent-to-agent communication.
Volume 10 --- The Retrieval & Knowledge Engineering Catalog --- finding the right information.
Volume 11 --- The AI Compliance & Regulatory Catalog --- compliance-facing governance.
Volume 12 --- The AI Infrastructure Security Catalog --- security around the AI system.
Volume 13 --- The Agent UX Patterns Catalog --- design discipline for agent interaction.
Volume 14 --- The AI Agent Products Survey --- a snapshot, not structural vocabulary.
Volume 15 --- The Prompting and Context Engineering Catalog (this volume) --- defensibly skip or fold.

The series now has a clear taxonomy. Volumes 1—10 are the engineering substrate --- patterns and mechanisms that compose into agentic AI systems. Volumes 11—13 are complementary disciplines that consume the engineering substrate --- compliance, security, design --- each with its own audience and artifacts. Volume 14 is a deliberate-snapshot product survey. Volume 15 (this volume) is the weakest entry: the content is largely covered elsewhere in the series; the residual is real but narrower than prior volumes’ scope. Treating this volume as optional, with the fold-into-existing-volumes alternative as a legitimate choice, is honest framing the reader should consider.

Appendix C --- The Case for Skipping This Volume

This appendix makes the case for treating this volume as optional. The case is honest because the content largely exists elsewhere in the series; if a reader finds the consolidated treatment less valuable than the distributed treatment in the prior volumes, skipping this volume is a legitimate choice.

The strongest fold-into-existing-volumes argument is that prompting and context engineering are integrated practices, not separable disciplines. Real production agents use system prompts (could go in Volume 2 Skills), tool definitions (Volume 3 Tools), persistent context (Volume 6 Memory), retrieved content (Volume 10 Retrieval), safety patterns (Volume 8 Evaluation & Guardrails), and multi-agent prompts (Volume 9 Multi-Agent) all together. The integration is the point; separating prompting into its own volume artificially isolates a discipline that’s tightly coupled with the rest of the agent’s design.

The argument against fold-and-skip: the residual this volume covers (the system-prompt design patterns, the few-shot and chain-of-thought variants, the context-window engineering at production scale, the prompt versioning and testing discipline, the model-specific conventions, the meta-prompting patterns) is coherent enough that distributing it across the prior volumes would lose the coherence. The reader looking up “chain-of-thought” doesn’t want to find it discussed only in passing in Volume 8 as a safety technique; the reader wanting context-engineering guidance doesn’t want to find it scattered across Volumes 6 and 10 with the prompt-design angle implicit. Consolidating provides a discoverable reference; distributing optimizes for integration but loses discoverability.

The reader’s choice. If the reader values discoverable reference for prompting patterns, read this volume. If the reader values integration with the broader engineering substrate, treat this volume as appendix material that the prior volumes would have covered comprehensively if they’d been written with prompting-discipline awareness. Either choice is honest; the series accommodates both.

Appendix D --- Discovery and Standards

Resources for tracking prompting discipline evolution:

Anthropic prompt engineering guide (docs.claude.com)
OpenAI prompt engineering best practices (platform.openai.com)
Google Gemini prompting strategies (ai.google.dev)
Anthropic cookbook (github.com/anthropics/anthropic-cookbook)
OpenAI cookbook (github.com/openai/openai-cookbook)
Practitioner writing: Eugene Yan (eugeneyan.com), Hamel Husain (hamel.dev), Simon Willison (simonwillison.net)
Latent Space podcast and newsletter (latent.space)
Academic conferences: NeurIPS, ICLR, ACL for prompt engineering research
Prompt registries: LangChain Hub, PromptLayer, vendor playgrounds (as both pattern reference and infrastructure)
Eval tooling: Anthropic Evals, OpenAI Evals, Promptfoo, Braintrust (as both validation infrastructure and pattern reference)

Two practical recommendations. First, vendor documentation is the most reliable source for vendor-specific recommendations and updates as models evolve. Practitioner writing captures patterns not yet in vendor docs and provides commentary on what works in production. Academic literature provides the theoretical foundations and emerging techniques. The three categories complement each other. Second, the discipline accumulates faster through hands-on experimentation than through reading; this volume and the broader literature provide vocabulary and pattern recognition, but actual production prompts require iteration on the specific use case. Treat the reading as background; treat the experimentation as the main work.

Appendix E --- Omissions

This catalog covers about 12 substrates across 8 sections, a deliberately smaller count than prior volumes reflecting this volume’s narrower scope. The wider prompting and context engineering discipline includes content not covered here:

Packaged agent instructions (Volume 2 Skills covers).
Tool definitions and function calling schemas (Volume 3 Tools covers).
Memory persistence patterns (Volume 6 Memory covers).
RAG and retrieved context patterns (Volume 10 Retrieval & Knowledge Engineering covers).
Safety prompt patterns and constitutional AI (Volume 8 Evaluation & Guardrails covers).
Multi-agent prompting (Volume 9 Multi-Agent Coordination covers).
HITL approval prompting (Volume 7 Human-in-the-Loop covers).
Specific commercial prompt-optimization tools beyond brief mention. The category is small and changing; comprehensive coverage would age quickly.
Detailed comparison of prompt registries and eval tools. Volume 14 (Products Survey) covers some of these at product level.
Voice and multimodal prompting specifics. The patterns are emerging; the volume focuses on text where the discipline is more consolidated.
Fine-tuning as alternative to prompt engineering. Fine-tuning is its own discipline; this volume covers what prompts do, not what fine-tuning does instead.

Appendix F --- A Note on Weaker Candidates

This volume is the first “weaker candidate” in the series --- the first volume I would defensibly skip or fold into other volumes, distinct from the four “Not gaps” volumes (11—14) that earned their place despite my initial framing. The honest framing matters because the catalog’s value proposition depends on the volumes being worth their place; including volumes that don’t earn their place dilutes the series’ overall value. Acknowledging this volume’s weakness lets the reader make an informed choice about whether to engage with it.

Why include a weaker candidate at all? Because the discipline has accumulated working knowledge that’s worth consolidating somewhere, and consolidating it in one place is more useful than scattering it across the volumes it would nominally fit. The trade-off is real: scattering preserves integration; consolidating preserves discoverability. The series chose consolidation; readers who prefer scattering can treat this volume as optional and rely on the prior volumes’ distributed coverage.

The pattern of including weaker candidates has limits. Future volumes would need to earn their place more clearly than this one does; the series can’t sustain unlimited expansion without diluting the value proposition. Adjacent areas where comparable treatment might be valuable --- cost engineering for AI systems, model lifecycle management, enterprise integration patterns --- each face the same question: does standalone treatment add enough value over distributed treatment to justify the additional volume? The answer should be skeptical; the bar for additional volumes should rise as the series grows. This volume’s honest framing acknowledges that this volume’s answer is “defensible but contested”; future volumes should answer more clearly.

Fifteen volumes. Patterns, Skills, Tools, Events, Fabric, Memory, Human-in-the-Loop, Evaluation & Guardrails, Multi-Agent Coordination, Retrieval & Knowledge Engineering, AI Compliance & Regulatory, AI Infrastructure Security, Agent UX Patterns, AI Agent Products Survey, and now Prompting and Context Engineering with explicit acknowledgment that it’s the weakest entry. The first ten are the engineering substrate. The next three are complementary disciplines. The fourteenth is a perishable products snapshot. The fifteenth is a defensible-but-weaker consolidation of distributed content. The series at fifteen volumes covers what an architect could want to know about agentic AI in mid-2026, with each volume’s standing in the series honestly framed. The proposition still holds: the structural vocabulary in Volumes 1—13 should outlast the products in Volume 14 and the consolidations in Volume 15.

--- End of The Prompting and Context Engineering Catalog v0.1 ---