About This Catalog

This catalog organizes the recurring solutions that have emerged for building useful systems on top of large language models. It is consciously modeled on Martin Fowler’s Patterns of Enterprise Application Architecture (Addison-Wesley, 2002): each pattern gets a stable name, a one-line intent, a sketch, a description of the motivating problem, a How It Works section that discusses the solution and its variations, a When to Use It section, a Sources section naming the canonical origin of the pattern, Further Reading for deeper exposition, an Example, and a final Example artifacts section with concrete prompt templates, agent code, and --- where the pattern makes sense as a reusable unit --- a sketch of how it would package as an Anthropic-style SKILL.md.

The point, as Fowler put it, is vocabulary. If a colleague says “we’ll use Prompt Chaining for the ingestion path and an Orchestrator–Workers loop for the merge step, with a Human-in-the-Loop checkpoint before commit,” every architecturally important decision has been communicated in one sentence. That is what a pattern language is for.

What’s new in v0.3

Six new patterns added in a new Section G (Operational & Adaptive Patterns) --- Goal Setting and Monitoring, Prioritization, Exception Handling and Recovery, Resource-Aware Optimization, Learning and Adaptation, and Exploration and Discovery --- bringing the catalog to 29 patterns. These absorb the operational-discipline patterns named in Gulli’s Agentic Design Patterns (2025) that previously had no entry in this volume.
Chapter 5 (Cost Models) extended with rows for the six new patterns.
Appendix A extended with compositions involving the Section G overlays.

What’s new in v0.2

Three new patterns added --- Cascade/Fallback, Generator—Verifier, Map-Reduce over Corpus --- bringing the catalog to 23 patterns.
A new Chapter 5 (“Cost Models”) summarizing the LLM-call count, parallelism, and loop boundedness of every pattern in one table.
Every pattern now has a Sources section (canonical origin) distinct from Further Reading (deeper exposition).
Every pattern now has Example artifacts --- a prompt template, agent code (Python with the Claude SDK style), and where applicable, a SKILL.md sketch.
All diagrams have been redrawn using only straight lines, with double-arrows replacing pairs of single arrows where the relationship is bidirectional.

Scope

This catalog covers the control-flow and coordination patterns of agentic systems --- how LLM calls, tools, sub-agents, and humans are arranged in time and how information passes between them. It does not (yet) cover the orthogonal axes of:

Memory & state (working memory, episodic memory, vector stores, scratchpads)
Tool & ACI design (agent–computer interface patterns)
Retrieval (RAG variants, query rewriting, reranking)
Evaluation (offline evals, online evals, LLM-as-judge)
Safety & alignment (guardrails, constitutional patterns, sandboxing)

Each of these warrants its own catalog and will be drafted as a separate volume.

On Sources vs. Further Reading

The Sources section names the canonical origin of a pattern --- the paper that introduced it, the repository that ships the reference implementation, or the blog post that named it. Where the catalog itself coins or generalizes a pattern (Hybrid Pipeline, Cascade/Fallback, Map-Reduce over Corpus, Generator–Verifier as a named entry), the Sources section says so explicitly. Further Reading points to deeper exposition, related variants, and production write-ups.

Part 1 — The Narratives

The pattern entries in Part 2 are reference material. They are not meant to be read front-to-back. Part 1 is the orientation: short essays that frame the design space so the patterns in Part 2 make sense in context.

Chapter 1. Workflows vs. Agents

Anthropic draws an architectural distinction that is worth borrowing wholesale. A workflow is a system in which LLMs and tools are arranged in predefined code paths. An agent is a system in which the LLM dynamically directs its own process and tool use, choosing the next step at runtime. Both are “agentic systems,” but they have very different operational characteristics.

Workflows give you predictability, observability, and bounded cost. Agents give you flexibility and reach at the cost of higher variance, higher token spend, and the risk of compounding errors over long horizons.

The practical rule is the same one that governs all software design: start with the simplest thing that works. For many problems, a single well-prompted LLM call with retrieval is enough. When that fails, add a workflow. When the workflow’s branching becomes unmanageable, then --- and only then --- promote to an agent.

Chapter 2. The Shape of Control Flow

There are four primitive shapes that recur throughout this catalog:

Sequential --- Step B runs after step A and uses A’s output. Prompt Chaining is the canonical case. Latency is the sum of stages; total cost is the sum of stages.
Parallel --- N independent calls run at the same time; their results are merged. Parallelization (Sectioning and Voting) is the canonical case. Latency is the slowest branch; cost is the sum of branches.
Hybrid (Pipeline) --- Stages are sequential, but each stage may itself fan out in parallel. Most real systems end up here.
Looping --- The same step (or short cycle) runs repeatedly until a termination condition is met. Evaluator–Optimizer, Reflexion, ReAct, and Autoresearch are all variations of looping. Cost is unbounded unless you cap iterations or budget tokens.

Every pattern in Part 2 can be classified along these axes, and the classification is given at the top of each entry.
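The latency and cost rules for the sequential and parallel shapes can be sketched in a few lines. This is a minimal illustration, not a measurement: the `Call` record and every latency and cost figure below are hypothetical, and only the sum/max arithmetic comes from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One LLM call with an assumed latency (seconds) and cost (dollars)."""
    latency_s: float
    cost_usd: float


def sequential(calls):
    """Sequential shape: latency and cost both add up across stages."""
    return Call(sum(c.latency_s for c in calls), sum(c.cost_usd for c in calls))


def parallel(calls):
    """Parallel shape: latency is the slowest branch; cost still adds up."""
    return Call(max(c.latency_s for c in calls), sum(c.cost_usd for c in calls))


# Hypothetical figures, for illustration only.
draft = Call(latency_s=4.0, cost_usd=0.02)
translations = [Call(latency_s=3.0, cost_usd=0.01) for _ in range(5)]

# Hybrid pipeline: a sequential chain whose second stage fans out in parallel.
pipeline = sequential([draft, parallel(translations)])
```

With these made-up numbers the pipeline takes 7 seconds (4 + 3) and costs about $0.07: the fan-out buys latency, not savings.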

Chapter 3. The Augmented LLM

The foundational unit of every pattern that follows is not the raw model but the augmented LLM: a model with access to retrieval, tools, and memory. Almost nothing useful in production is built on a bare completion call. When this catalog refers to “an LLM call,” assume the augmented form unless explicitly stated.

Chapter 4. Choosing a Pattern

A rough decision procedure:

One call enough? Try a single augmented LLM call with good prompting and retrieval. If the evaluation passes, stop here.
Decomposable into fixed steps? Use Prompt Chaining.
Distinct input classes need distinct handling? Add Routing in front.
Same operation applied to many documents? Use Map-Reduce over Corpus.
Want to escalate to a stronger model only when needed? Use Cascade/Fallback.
Independent subtasks of one document/task? Use Parallelization --- Sectioning.
Want confidence through redundancy? Use Parallelization --- Voting.
Subtasks not knowable in advance? Use Orchestrator–Workers.
Need iterative refinement against a critic that gives feedback? Use Evaluator–Optimizer.
Need rejection sampling against a Pass/Fail check? Use Generator–Verifier.
Need the model to use tools in an open-ended loop? Use ReAct or, for more autonomy, the Autonomous Agent pattern.
Need to optimize a measurable metric overnight? Use Autoresearch.

Orthogonal to steps 1–12 are the operational patterns of Section G. Most of them are overlays rather than alternatives: a production system picks one or more control-flow patterns from the list above and then hardens the result with Goal Setting and Monitoring, Prioritization, Exception Handling and Recovery, Resource-Aware Optimization, and --- across deployments --- Learning and Adaptation. The exception is Exploration and Discovery, which is a genuine alternative top-level shape for work whose objective cannot be stated in advance.

The patterns compose. A production system is almost always a graph of these primitives, not a single one.

Chapter 5. Cost Models

The cost of running an agentic system is dominated by the number of LLM calls, and the wall-clock latency is dominated by the longest sequential path through those calls. These two quantities are first-order concerns at design time. The table below summarizes the cost profile of every pattern in this catalog along three axes:

LLM calls per task --- how many model invocations a single end-to-end execution incurs.
Parallelism --- whether and how the calls can run concurrently.
Loop bound --- what bounds the number of iterations (if any), or whether the cost is open-ended.

Reading the table from top to bottom is a useful exercise: it makes vivid why “start with the simplest thing that works” is the right rule. A single Augmented LLM call costs one invocation; a search or iterative pattern such as LATS or Reflexion can cost hundreds. The cost-per-unit-quality gradient is steep, and most production systems sit much closer to the cheap end of that range.

Pattern	LLM calls per task	Parallelism	Loop bound
Augmented LLM	1	None	None
Prompt Chaining	N (fixed stages)	None	None
Routing	1 router + 1 specialist	None	None
Sectioning	K sections + 1 aggregator	K-way	None
Voting	N + 1 aggregator	N-way	None
Orchestrator–Workers	2 + M workers (M dynamic)	M-way	Possible re-orchestration
Hybrid Pipeline	Sum of stages	Per stage	Per stage
Evaluator–Optimizer	2 × iter	None	max_iter
Reflexion	≥3 × iter (actor/critic/reflector)	None	max_iter
Autoresearch	1 × iter	None	Indefinite (overnight)
ReAct	1 × iter	None	max_iter / token budget
Plan-and-Execute	1 planner + N steps	Optional per step	Possible replanning
ReWOO	2 (planner + solver)	M-way over workers	None (single pass)
Tree-of-Thoughts	K × depth × branches	Per level	depth bound
LATS	Very high	Per branch	depth bound
Multi-Agent Debate	N × rounds + 1	N-way per round	rounds
Hierarchical Supervisor	Sum over levels	Per level	Depth of tree
Autonomous Agent	1 × iter	None	Step budget / token cap
HITL Checkpoint	(base pattern)	(base pattern)	(base pattern)
Deep Research	1 + N sub-agents + 1 synth	N-way	Per-sub-agent budget
Cascade/Fallback	1 to L levels	None	L levels
Generator–Verifier	2 × samples	Optional batch	max_samples
Map-Reduce over Corpus	N + 1	N-way	None
Goal Setting and Monitoring	(base pattern) + 1 per checkpoint	None	Review cadence
Prioritization	1 scoring call per (re)rank	Optional batch	Aging / queue empty
Exception Handling and Recovery	(base pattern) + retries	None	max_retries + circuit breaker
Resource-Aware Optimization	(policy layer; no extra calls)	N/A	Budget cap
Learning and Adaptation	Offline batch over episodes	Batch	Consolidation cadence
Exploration and Discovery	1 × iter, open-ended	Optional per probe	Novelty floor / budget

Notes. “LLM calls” counts model invocations, not tools. “Parallelism” refers to the maximum concurrent LLM calls during one execution --- actual concurrency depends on runtime and rate limits. “Loop bound” names the dominant termination control; in practice most looping patterns also have a token budget and a wall-clock cap as safety nets.
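The layered safety nets mentioned in the notes can be sketched as one loop with three independent stop conditions. This is a minimal sketch under stated assumptions: `run_bounded_loop`, its parameter names and defaults, and the `step` callable returning `(done, tokens_used)` are all hypothetical and not part of any SDK.

```python
import time


def run_bounded_loop(step, max_iter=10, token_budget=50_000, deadline_s=60.0):
    """Run `step` until it reports done or any safety net trips.

    `step` is a hypothetical callable taking the iteration number and
    returning (done, tokens_used). Returns (reason, iterations, tokens_spent).
    """
    start = time.monotonic()
    tokens_spent = 0
    for i in range(1, max_iter + 1):
        done, tokens_used = step(i)
        tokens_spent += tokens_used
        if done:
            return "done", i, tokens_spent
        if tokens_spent >= token_budget:
            return "token_budget", i, tokens_spent
        if time.monotonic() - start >= deadline_s:
            return "deadline", i, tokens_spent
    return "max_iter", max_iter, tokens_spent


# Stub step that never finishes and spends 4,000 tokens per iteration.
def never_done(i):
    return False, 4_000


result = run_bounded_loop(never_done, max_iter=5, token_budget=10_000)
```

Here the token budget trips on the third iteration, before the iteration cap is reached, so `result` is `("token_budget", 3, 12000)`. Returning the reason makes it possible to tell a converged run from one that was cut off.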

Part 2 — The Patterns

Section A — Foundational Building Blocks

Augmented LLM --- A model call enriched with retrieval, tools, and memory.

Section B — Control-Flow Workflows

Prompt Chaining --- Decompose a task into fixed sequential LLM steps.
Routing --- Classify the input, then dispatch to a specialized handler.
Cascade/Fallback --- Try cheaper first; escalate to stronger only on low confidence.
Parallelization --- Sectioning --- Split a task into independent subtasks run in parallel.
Parallelization --- Voting --- Run the same task N times and combine outputs.
Map-Reduce over Corpus --- One LLM call per document in parallel; a final reduce call synthesizes.
Orchestrator–Workers --- A central LLM dynamically decomposes the task and synthesizes the workers' results.
Hybrid Pipeline --- Compose sequential stages where individual stages fan out in parallel.

Section C — Iterative & Self-Improving Patterns

Evaluator–Optimizer --- Generate, critique, refine. Loop until the critic accepts.
Generator–Verifier --- Rejection sampling against a binary Pass/Fail check.
Reflexion --- Evaluator–Optimizer with verbal episodic memory across attempts.
Autoresearch (Ratchet Loop) --- Propose, experiment, keep only on improvement; otherwise roll back.

Section D — Reasoning & Search Patterns

ReAct --- Interleave reasoning with tool use in a single rolling context.
Plan-and-Execute --- Produce a full plan up front, then execute it step by step.
ReWOO --- Plan with explicit dependencies, execute observation-free, synthesize.
Tree-of-Thoughts --- Branch the reasoning into multiple paths, evaluate, expand, backtrack.
LATS --- Tree-of-Thoughts over a ReAct-style action space, with reflection.

Section E — Multi-Agent Coordination

Multi-Agent Debate --- Multiple solver agents exchange and critique each other’s answers.
Hierarchical Supervisor --- A supervisor agent delegates to specialized sub-agents.

Section F — Autonomy & Oversight

Autonomous Agent --- An LLM in an open-ended tool-use loop with no predefined step count.
Human-in-the-Loop Checkpoint --- A workflow pauses at defined points for human review.
Deep Research Agent --- A planner spawns parallel research sub-agents and synthesizes findings.

Section G — Operational & Adaptive Patterns

Goal Setting and Monitoring --- Give the agent explicit, measurable objectives and check the trajectory against them while it runs.
Prioritization --- Rank competing tasks by value, urgency, effort, and dependencies before spending agent capacity on them.
Exception Handling and Recovery --- Detect failures in calls, tools, and services; retry, fall back, degrade, and resume from checkpoints.
Resource-Aware Optimization --- Make cost, latency, and token budget explicit runtime decision variables rather than fixed overhead.
Learning and Adaptation --- Improve the agent across episodes by turning outcomes and corrections into durable memories, instructions, and skills.
Exploration and Discovery --- Direct the agent toward novelty: map an unknown space instead of executing a known task.

Augmented LLM

Classification Foundational building block

Intent

Wrap a base model with retrieval, tools, and memory before composing it into anything else.

Sketch

Diagram of Augmented LLM pattern — The augmented LLM: a model with retrieval, tools, and memory. Bidirectional channels shown as double arrows.

Motivating Problem

A bare LLM call is stateless, has a fixed knowledge cutoff, and cannot affect the world. Any non-trivial application needs at least one of: fresh information, the ability to take actions, or continuity across calls. Re-implementing those wiring concerns for every pattern in this catalog would be exhausting and inconsistent.

How It Works

Three augmentations attach to a model call:

Retrieval brings in external information at query time. Vector store similarity, hybrid lexical+semantic search, multi-hop retrieval, reranking.
Tools are typed functions the model can invoke. Schema in the prompt; structured tool-use blocks in the output; runtime executes and returns results.
Memory is anything outside the current context window --- scratchpad, key-value store, episodic log, persistent profile.

The augmentations are capabilities, not control flow. Every other pattern in this catalog assumes its LLM calls are augmented unless explicitly noted.

The Model Context Protocol (MCP) is one concrete way to expose tools and data sources; the augmentations themselves are older than MCP and do not require it.
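The "runtime executes and returns results" step for tools can be sketched without any SDK. This is a hedged illustration: the `tool_use` and `tool_result` dictionary shapes are assumed to mirror the structured blocks of the Anthropic Messages API, while `lookup_order`, the `TOOLS` registry, and `execute_tool_use` are hypothetical names invented for this example.

```python
import json


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: returns a canned order record."""
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names (as advertised to the model) to Python callables.
TOOLS = {"lookup_order": lookup_order}


def execute_tool_use(block: dict) -> dict:
    """Run one tool-use block and wrap the outcome as a tool-result block."""
    fn = TOOLS.get(block["name"])
    if fn is None:
        content, is_error = f"unknown tool: {block['name']}", True
    else:
        try:
            content, is_error = json.dumps(fn(**block["input"])), False
        except Exception as exc:  # surface tool failures to the model
            content, is_error = f"{type(exc).__name__}: {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": content,
        "is_error": is_error,
    }


# A tool-use block as the model might emit it (values are made up).
block = {
    "type": "tool_use",
    "id": "toolu_01",
    "name": "lookup_order",
    "input": {"order_id": "A-1001"},
}
result = execute_tool_use(block)
```

Errors are returned to the model as results rather than raised, so the model has a chance to retry with corrected arguments or explain the failure to the user.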

When to Use It

Always. There is essentially no production application where the bare model is preferable to the augmented model. The interesting design question is which augmentations to add, not whether to add any.

Sources

Anthropic, Building Effective Agents (Schluntz & Zhang, December 2024) --- names the pattern.

Example

A customer support assistant has retrieval over the help center, tools for issuing refunds and looking up orders, and a memory store keyed on the conversation ID that records what has been tried so far. Every other pattern in the support stack --- routing, evaluation, escalation --- builds on this base.

Example artifacts

Prompt.

# System prompt

You are a customer support assistant.

When you need information from our help center, call the retrieve_docs tool.

When you need to take an action (lookup_order, issue_refund), call the appropriate tool. The memory store contains prior context for this user; it is appended to your context automatically.

Be concise. Cite docs by their canonical URL when referenced.

Agent.

# anthropic-python: augmented LLM call
# SYSTEM_PROMPT, prior_context_for, conversation_id, and user_message are defined elsewhere.

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=SYSTEM_PROMPT,
    tools=[
        {"name": "retrieve_docs", "input_schema": {...}},
        {"name": "lookup_order", "input_schema": {...}},
        {"name": "issue_refund", "input_schema": {...}},
    ],
    messages=[
        # memory store contents prepended here
        *prior_context_for(conversation_id),
        {"role": "user", "content": user_message},
    ],
)

Prompt Chaining

Classification Sequential

Intent

Decompose a task into a fixed sequence of LLM steps, where each step consumes the output of the previous one.

Sketch

Diagram of Prompt Chaining pattern — Sequential stages with optional programmatic gates between them.

Motivating Problem

A single LLM call asked to do a complex task --- outline, draft, fact-check, and translate a report --- usually does each subtask worse than a call dedicated to that subtask alone. Quality drops with prompt surface area. At the same time, splitting the task into truly independent pieces is impossible because each step depends on the previous one’s output.

How It Works

Hard-code the decomposition. Each stage gets its own prompt and its own model call. The output of stage n becomes part of the input to stage n+1. Between stages, place programmatic gates --- deterministic checks (length, schema validity, classifier scores) that decide whether to proceed, retry the previous stage, or abort.

Variations:

Linear chain. Stage n feeds stage n+1.
Chain with gates. Programmatic checks between stages.
Conditional chain. A gate may branch to alternative downstream stages.
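The "proceed, retry, or abort" behavior of a gate can be sketched as follows. This is a minimal illustration with stub stages standing in for LLM calls; `run_stage`, `run_chain`, the retry count, and the stub functions are hypothetical names chosen for this example.

```python
class GateFailure(Exception):
    """Raised when a stage's output still fails its gate after all retries."""


def run_stage(stage, gate, payload, max_retries=2):
    """Run one stage; retry while the gate rejects, abort when retries run out."""
    attempts = 1 + max_retries
    for _ in range(attempts):
        output = stage(payload)
        if gate(output):
            return output  # gate passed: proceed to the next stage
    raise GateFailure(f"{stage.__name__} failed its gate after {attempts} attempts")


def run_chain(stages, payload):
    """Linear chain: the output of stage n is the input to stage n+1."""
    for stage, gate in stages:
        payload = run_stage(stage, gate, payload)
    return payload


# Stub stages standing in for LLM calls (hypothetical).
def draft(topic):
    return f"A short blurb about {topic}."


def shout(text):
    return text.upper()


stages = [
    (draft, lambda out: 3 <= len(out.split()) <= 50),  # word-count gate
    (shout, lambda out: out.isupper()),                # format gate
]
final = run_chain(stages, "kettles")
```

Retrying only pays off when the stage is non-deterministic (sampling temperature above zero, or a retry prompt that includes the gate's complaint); with the deterministic stubs here a failing gate would simply fail every attempt.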

When to Use It

When the task decomposes cleanly into a fixed pipeline whose stages are known in advance. If the decomposition depends on the input, prefer Orchestrator–Workers. If the stages are independent, prefer Parallelization --- Sectioning.

Sources

Anthropic, Building Effective Agents --- names the pattern.

Example

Generate marketing copy, then translate it into five languages: stage 1 drafts the copy; a gate checks word count and reading-grade level; stage 2 (run five times in parallel) translates the approved copy.

Example artifacts

Prompt.

# Stage 1 --- Draft

"Write a 200-word marketing blurb for {product_name}, targeting {audience}. Output only the blurb, no commentary."

# Stage 2 --- Translate

"Translate the following blurb into {language}, preserving tone and brand voice. Output only the translation.

Blurb:

{stage1_output}"

Agent.

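class GateFailure(Exception):
    """Raised when a stage's output fails a programmatic gate."""

def chain(product, audience, language):
    # llm, stage1_prompt, and stage2_prompt are defined elsewhere.
    blurb = llm(stage1_prompt.format(product_name=product, audience=audience))

    # Gate: enforce word count (a reading-grade check would slot in here too)
    wc = len(blurb.split())
    if not (150 <= wc <= 250):
        raise GateFailure(f"blurb word count {wc} out of range")

    return llm(stage2_prompt.format(language=language, stage1_output=blurb))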

Routing

Classification Sequential (classifier then specialist)

Intent

Classify the input and dispatch it to a specialized downstream handler.

Sketch

Diagram of Routing pattern — A classifier selects which specialist handles each input.

Motivating Problem

One prompt that tries to handle every kind of input ends up bad at all of them. Adding instructions for refund handling degrades the technical-support behavior; tuning for technical support breaks the small-talk path. Specialization wants separation of concerns, but the caller does not know the category in advance.

How It Works

A first call --- either an LLM classifier or a traditional classification model --- labels the input. The label selects a downstream prompt, model, or sub-workflow. Each downstream handler is independently prompted, evaluated, and versioned.

Routing has a useful second use beyond specialization: cost routing. Easy or common queries go to a smaller, cheaper model (Claude Haiku 4.5); hard or unusual queries go to a more capable one (Claude Opus 4.7). The router itself is usually small and cheap.

The pattern fails when the classifier is wrong. Calibrate it: keep a held-out evaluation set of labeled inputs, measure routing accuracy continuously, and watch the downstream metrics broken out per route.
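Measuring routing accuracy on a held-out set can be sketched as below. This is an illustration under stated assumptions: the keyword router is a stub standing in for the LLM classifier, and the four labeled examples and the `routing_report` helper are invented for this sketch.

```python
from collections import defaultdict


def routing_report(examples, classify):
    """Return overall accuracy and per-route accuracy on a labeled set.

    `examples` is a list of (text, expected_label); `classify` is the router.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, expected in examples:
        totals[expected] += 1
        if classify(text) == expected:
            hits[expected] += 1
    per_route = {label: hits[label] / totals[label] for label in totals}
    overall = sum(hits.values()) / len(examples)
    return overall, per_route


# Stub keyword router standing in for the LLM classifier (hypothetical).
def keyword_router(text):
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    if "error" in lowered or "crash" in lowered:
        return "tech_support"
    return "general"


held_out = [
    ("I need a refund for my last invoice", "billing"),
    ("The app shows an error on login", "tech_support"),
    ("Why was I charged twice?", "billing"),
    ("What are your opening hours?", "general"),
]
overall, per_route = routing_report(held_out, keyword_router)
```

The stub misroutes "Why was I charged twice?" to `general`, so overall accuracy is 0.75 while the `billing` route alone scores 0.5. The per-route breakdown is what exposes the weak spot; the overall number hides it.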

When to Use It

When inputs fall into distinct categories that benefit from distinct handling, and when categorization is accurate enough that misrouting is rare or recoverable. If categories overlap or shift, prefer a single capable handler with retrieval. If the categorization itself requires multi-step reasoning, the decision belongs inside an Orchestrator–Workers pattern instead. For escalation based on confidence rather than category, prefer Cascade/Fallback.

Sources

Anthropic, Building Effective Agents --- names the pattern.

Example

A customer service deployment routes (a) FAQ-style questions to a Haiku model with retrieval, (b) account changes to a Sonnet model with account tools, and (c) anything flagged “complaint” or “escalation” to an Opus model with a human-in-the-loop checkpoint.

Example artifacts

Prompt.

# Router system prompt

You are a customer query router. Classify the user message into exactly one category and emit JSON, no prose:

{
  "category": "billing" | "tech_support" | "complaint" | "general",
  "confidence": 0.0-1.0
}

Use "complaint" for any expression of dissatisfaction or anger.

Use "general" as the fallback.

Agent.

import json

# (model, prompt, tools) per category; llm, the prompts, and the tool lists are defined elsewhere.
HANDLERS = {
    "billing": ("claude-haiku-4-5", billing_prompt, billing_tools),
    "tech_support": ("claude-sonnet-4-6", tech_prompt, tech_tools),
    "complaint": ("claude-opus-4-7", complaint_prompt, complaint_tools),
    "general": ("claude-haiku-4-5", general_prompt, []),
}

def route_and_dispatch(user_msg):
    raw = llm(router_prompt, user_msg, model="claude-haiku-4-5")
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        decision = {}  # malformed router output -> safe fallback
    category = decision.get("category")
    # Low confidence or an unknown label both fall back to "general".
    if category not in HANDLERS or decision.get("confidence", 0.0) < 0.6:
        category = "general"
    model, prompt, tools = HANDLERS[category]
    return llm(prompt, user_msg, model=model, tools=tools)

Cascade / Fallback

Classification Sequential with early-exit gates

Intent

Try a cheaper or faster model first; escalate to a stronger one only when the cheaper attempt is unconfident or fails verification.

Sketch

Diagram of Cascade / Fallback pattern — Each level handles what it can; only failures escalate to the next-stronger model.

Motivating Problem

For workloads with a long tail of difficulty, sending every request to the largest model wastes money on the 80% of requests a smaller model would handle correctly. Routing-by-classification is one answer, but it requires the categorization itself to be reliable. Cascade is the alternative: try cheap first, ask the model (or a verifier) how confident it is, and escalate only when confidence is low.

How It Works

Order the available handlers from cheapest to most capable: Haiku, Sonnet, Opus. For each request, try the cheapest handler first. After the response, check confidence --- either by asking the model to self-report (“give me confidence 0–1”), by running a separate verifier on the output, or by inspecting tool-call success.

If confidence is above a threshold, return. Otherwise, escalate to the next handler. The escalated handler may receive the original input only, or the original input plus the cheaper handler’s attempt as additional context.

Cost analysis: if 70% of requests are handled at level 1, 20% at level 2, and 10% at level 3, then every request pays for level 1, the 30% that escalate also pay for level 2, and the 10% that escalate again also pay for level 3. The effective per-request cost is roughly 1.0·c₁ + 0.3·c₂ + 0.1·c₃ (failed levels still incur their cost) --- typically a large win over always-Opus for tasks with a comfortable success rate at the lower tiers.
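The expected cost of a cascade can be worked through in a few lines. The key point is that a request resolved at level i has already paid for every level before it, so each level's cost is weighted by the share of requests that reach it, not the share that resolve there. The function name and the relative costs (1, 5, and 25 units) are hypothetical figures chosen for illustration, not real prices.

```python
def expected_cascade_cost(resolve_fractions, level_costs):
    """Expected per-request cost of a cascade.

    resolve_fractions[i] is the share of requests resolved at level i
    (the shares must sum to 1); level_costs[i] is the cost of one call there.
    """
    if abs(sum(resolve_fractions) - 1.0) > 1e-9:
        raise ValueError("resolve_fractions must sum to 1")
    expected = 0.0
    still_unresolved = 1.0  # share of requests that reach the current level
    for fraction, cost in zip(resolve_fractions, level_costs):
        expected += still_unresolved * cost  # everyone who reaches this level pays
        still_unresolved -= fraction
    return expected


# Hypothetical relative costs: level 1 = 1 unit, level 2 = 5, level 3 = 25.
cost = expected_cascade_cost([0.7, 0.2, 0.1], [1.0, 5.0, 25.0])
always_top = 25.0
```

With a 70/20/10 split the levels are reached by 100%, 30%, and 10% of requests, giving 1.0·1 + 0.3·5 + 0.1·25 = 5 units per request, against 25 units for always using the top tier.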

Calibrate the threshold offline. Self-reported confidence is famously poorly calibrated; a small verifier model or a structural check (“did the JSON validate?”, “did the SQL parse?”) often produces a more reliable signal.

When to Use It

When the difficulty distribution is long-tailed and a reliable signal exists for “this answer is probably right.” When latency-sensitive paths can afford the occasional escalation. Not for tasks where a wrong answer at the cheap tier is silently shipped --- escalation requires a real check, not a guess.

Sources

Catalog name. The pattern is widely deployed in production but does not have a single canonical paper.
Closest formal treatment: Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (2023).

Example

A code-explanation feature first tries Haiku on the snippet. The response ends with a self-reported CONFIDENCE: HIGH | MEDIUM | LOW line; on anything other than HIGH, the request escalates to Sonnet. If Sonnet also reports less than HIGH confidence, the request goes to Opus. In production, 78% of requests resolve at Haiku, 17% at Sonnet, 5% at Opus.

Example artifacts

Prompt.

# Per-level prompt (all models)

Explain the code snippet below in 2-3 sentences.

After your explanation, on a new line, output exactly:

CONFIDENCE: HIGH | MEDIUM | LOW

Code:

{snippet}

Agent.

MODELS = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-7"]

def cascade(snippet):
    # llm, explain_prompt, and strip_confidence_line are defined elsewhere.
    for i, model in enumerate(MODELS):
        response = llm(explain_prompt.format(snippet=snippet), model=model)
        # Accept a confident answer, or whatever the last (strongest) model says.
        if "CONFIDENCE: HIGH" in response or i == len(MODELS) - 1:
            return strip_confidence_line(response), model
        # else loop and escalate

Parallelization — Sectioning

Classification Parallel

Intent

Split a task into independent subtasks, run them concurrently, and aggregate.

Sketch

Diagram of Parallelization — Sectioning pattern — Independent subtasks run in parallel; an aggregator merges their outputs.

Motivating Problem

A task with multiple independent facets --- a code review for correctness, performance, and security; a long document where independent sections can be drafted in any order --- wastes wall-clock time and model attention when handled in a single call. The model’s attention is finite; concentrating it on one facet at a time produces better output.

How It Works

Determine the sections in advance (this is what distinguishes sectioning from Orchestrator–Workers, where the decomposition is dynamic). Fire one LLM call per section concurrently. Then run an aggregation step --- often itself an LLM call --- that merges, de-duplicates, and reconciles the outputs.

The aggregator can be:

Concatenation for genuinely independent sections (different chapters of a report).
A merging LLM call when sections may overlap or contradict.
A deterministic combiner when the outputs are structured (JSON merges, set unions).
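The deterministic combiner option can be sketched as a set union over structured findings. This is a minimal illustration: the `file`/`line`/`issue` schema, the `merge_findings` helper, and the sample findings are assumptions made for this example.

```python
def merge_findings(*section_outputs):
    """Union structured findings from independent sections, dropping exact duplicates.

    Each section output is a list of dicts with "file", "line", and "issue"
    keys (a hypothetical schema chosen for this sketch).
    """
    seen = set()
    merged = []
    for findings in section_outputs:
        for finding in findings:
            key = (finding["file"], finding["line"], finding["issue"])
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    # Stable, reader-friendly order: by file, then by line.
    return sorted(merged, key=lambda f: (f["file"], f["line"]))


correctness = [{"file": "app.py", "line": 42, "issue": "off-by-one in loop bound"}]
security = [
    {"file": "app.py", "line": 10, "issue": "SQL built by string concatenation"},
    {"file": "app.py", "line": 42, "issue": "off-by-one in loop bound"},  # duplicate
]
report = merge_findings(correctness, security)
```

A combiner like this costs no LLM call and is fully reproducible, but it only removes exact duplicates. Two sections describing the same bug in different words still need a merging LLM call to reconcile.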

When to Use It

When the subtasks are genuinely independent (no subtask needs another’s output), and when running them concurrently reduces latency or improves quality through focused attention. If dependencies exist, prefer Prompt Chaining or ReWOO. If the same subtask is run many times for confidence rather than coverage, prefer Voting.

Sources

Anthropic, Building Effective Agents --- names the parallelization workflow and distinguishes sectioning from voting.

Example

A guardrail system runs two parallel LLM calls on every user message: one generates the response, one screens the message for policy violations. The aggregator returns the response only if the screen passes; otherwise it substitutes a refusal. Separating the two responsibilities produces better behavior on both than handling them in a single prompt.

Example artifacts

Prompt.

# Three facet prompts (one per section)

correctness_prompt = "Review the code below for correctness bugs only. List each bug as: file:line --- description."

performance_prompt = "Review the code below for performance issues only. List each issue as: file:line --- description."

security_prompt = "Review the code below for security vulnerabilities only. List each as: file:line --- severity --- description."

# Aggregator prompt

aggregator_prompt = "You are given three independent reviews of the same code (correctness, performance, security). Combine them into a single report grouped by file, with one section per concern type. Remove exact duplicates."

Agent.

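import asyncio

async def review_code(code):
    # llm_async and the four prompts are defined elsewhere.
    # Fan out: one focused call per facet, run concurrently.
    correctness, perf, security = await asyncio.gather(
        llm_async(correctness_prompt + "\n" + code),
        llm_async(performance_prompt + "\n" + code),
        llm_async(security_prompt + "\n" + code),
    )
    # Fan in: a single aggregation call over the three reviews
    # (awaited, so the event loop is not blocked by a synchronous call).
    reviews = "\n\n".join([correctness, perf, security])
    return await llm_async(aggregator_prompt + "\n\n" + reviews)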

Parallelization — Voting

Classification Parallel

Intent

Run the same task N times and combine the results to get a more confident answer.

Sketch

Diagram of Parallelization — Voting pattern — N parallel calls (diverse prompts or models) feed an aggregation rule.

Motivating Problem

A single LLM call on a high-stakes judgment --- is this code vulnerable? Is this content harmful? Which of these answers is correct? --- has nonzero variance. The cost of being wrong sometimes exceeds the marginal cost of running the call more than once.

How It Works

Run N independent calls on the same input. Combine the outputs by:

Majority vote for discrete labels (self-consistency in the chain-of-thought literature).
Threshold rules --- flag as harmful if any of N flags it; auto-approve only if all of N pass.
Aggregating LLM call for free-text outputs, reconciling the N candidates into a single answer.

Diversity helps. Different prompts, different temperatures, or different models. Identical calls at temperature zero give you N copies of the same answer and waste tokens.
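The first two aggregation rules can be sketched with the standard library alone. This is an illustrative sketch: the function names and the tie-handling policy (return `None` so the caller can escalate) are assumptions of this example, not something the pattern prescribes.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None on a tie for first place."""
    if not labels:
        raise ValueError("no votes to aggregate")
    ranked = Counter(labels).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: let the caller escalate or re-sample
    return ranked[0][0]


def threshold_decision(flags):
    """Asymmetric threshold rule over boolean 'flagged' votes.

    Approve only if no voter flags, reject only if every voter flags,
    and send anything in between to review.
    """
    if not flags:
        raise ValueError("no votes to aggregate")
    if not any(flags):
        return "approve"
    if all(flags):
        return "reject"
    return "review"


winner = majority_vote(["B", "A", "B"])
split = majority_vote(["A", "B"])
decision = threshold_decision([False, True, False])
```

Here `winner` is `"B"`, `split` is `None`, and `decision` is `"review"`. An odd N avoids ties for binary labels but not for three or more classes, which is why the tie case is handled explicitly.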

When to Use It

When the cost of an individual wrong answer is high relative to the cost of N calls; when the task admits a defensible aggregation rule; and when the calls are genuinely diverse.

Sources

Anthropic, Building Effective Agents --- names voting as a parallelization workflow.

Example

A content moderation pipeline runs three differently-prompted classifiers on every post. The post is auto-approved only if all three pass; auto-rejected only if all three flag; otherwise routed to a human moderator.

Example artifacts

Agent.

# Each prompt is paired with the answer that means "flag this post".
# (For the second prompt, "yes" means safe, so the flagging answer is "no".)
PROMPTS = [
    ("Classify this post: harmful or safe. Output one word.", "harmful"),
    ("Is this post safe to publish on a family-friendly site? Yes or no.", "no"),
    ("Does this post violate our content policy (hate, harassment, sexual, violent)? Yes or no.", "yes"),
]

def screen(post):
    # llm is defined elsewhere.
    votes = [(llm(p + "\n" + post, temperature=0.0), flag) for p, flag in PROMPTS]
    flagged = sum(1 for answer, flag in votes if answer.strip().lower().startswith(flag))
    if flagged == 0:
        return "auto_approve"
    if flagged == len(votes):
        return "auto_reject"
    return "human_review"  # disagreement -> escalate

Map-Reduce over Corpus

Classification Parallel (per-document fan-out, sequential reduce)

Intent

Apply the same LLM operation to each document in a corpus in parallel, then combine the results in a final reduce call.

Sketch

Diagram of Map-Reduce over Corpus pattern — One LLM call per document in parallel; a final reduce call synthesizes.

Motivating Problem

When a question touches information spread across many documents, naively concatenating the corpus into one prompt blows the context window, dilutes attention, and pays for tokens that have nothing to do with the answer. The classical map-reduce shape --- do the same thing to each item, then combine --- carries over almost unchanged.

How It Works

The pattern decomposes a corpus query into three stages:

Map: for each document, run the same LLM call --- extract the relevant facts, summarize what is in scope, or score relevance. The map prompt is identical for every document.
Shuffle: optionally filter out null results (“not relevant”) before reducing, to keep the reduce call’s context manageable.
Reduce: a single LLM call (or a small number of them in a tree-reduce, if the surviving outputs are still too many) combines the per-document outputs into the final answer.

This is Parallelization --- Sectioning’s document-level cousin. Where Sectioning takes one input and splits it into facets, Map-Reduce takes one query and broadcasts it across many inputs. The pattern is worth its own name because the corpus-level shape recurs constantly: search over documentation, evidence aggregation, due diligence, log triage.

Hierarchical reduce. When N is in the thousands and the reduce prompt cannot fit all summaries, run reduce in a tree: combine groups of, say, 16 summaries each, then combine the group summaries. The cost is N + N/16 + N/256 + … ≈ (16/15)·N --- still linear in N.
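A small sketch of the tree reduce and its call count, assuming a fan-in of 16 as in the paragraph above. `tree_reduce` and `fake_combine` are illustrative names; `fake_combine` stands in for the reduce LLM call.

```python
from typing import Callable, List, Tuple

def tree_reduce(items: List[str],
                combine: Callable[[List[str]], str],
                fan_in: int = 16) -> Tuple[str, int]:
    """Combine items in groups of fan_in until one remains.
    Returns (final result, number of combine calls made)."""
    if not items:
        raise ValueError("nothing to reduce")
    calls = 0
    level = list(items)
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [combine(group) for group in groups]
        calls += len(groups)
    return level[0], calls

def fake_combine(group: List[str]) -> str:
    # Hypothetical stand-in for one reduce LLM call.
    return "+".join(group)

n = 256
result, reduce_calls = tree_reduce([f"s{i}" for i in range(n)], fake_combine)
total_calls = n + reduce_calls  # map calls plus reduce calls
print(reduce_calls, total_calls)
```

For N = 256 the reduce tree needs 16 + 1 = 17 calls, so the total is 273, in line with (16/15)·256 ≈ 273.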

When to Use It

When the question naturally decomposes per-document, when the corpus is bounded (hundreds to low thousands of documents), and when the per-document operation is uniform. For unbounded retrieval over a much larger corpus, use RAG to narrow to a relevant subset first, then Map-Reduce over the subset.

Sources

Catalog name. The map-reduce shape is the LLM-era specialization of Dean & Ghemawat’s 2004 paper (MapReduce: Simplified Data Processing on Large Clusters) for corpus-level LLM operations.
LangChain documentation names the same shape under “map_reduce” chains.

Example

A due-diligence pipeline receives a 300-document data room and the question “are there any unusual indemnification clauses?” Map: each document is summarized with respect to indemnification, with “NONE” returned for irrelevant documents. Filter: documents returning NONE are dropped. Reduce: a final call ranks the surviving clauses by unusualness and produces a memo.

Example artifacts

Prompt.

# Map prompt (run per document)

"Extract any clauses related to {topic} from the document below. If
none, output exactly: NONE. Otherwise output: <clause text> //
source: <page or section>.

Document:

{document}"

# Reduce prompt (run once over the surviving outputs)

"You are given per-document extracts about {topic}. Combine them
into a memo: group similar clauses, rank by unusualness, cite each by
source."

Agent.

import asyncio

async def map_reduce(corpus, topic):
    # Map in parallel: one call per document
    extracts = await asyncio.gather(*[
        llm_async(map_prompt.format(topic=topic, document=d))
        for d in corpus
    ])
    # Shuffle (filter): drop documents that reported nothing relevant
    relevant = [e for e in extracts if e.strip() != "NONE"]
    if not relevant:
        return "No relevant clauses found."
    # Reduce: a single call over the surviving extracts
    return llm(reduce_prompt.format(topic=topic), "\n---\n".join(relevant))

Skill (SKILL.md sketch).

---
name: corpus-map-reduce
description: Apply the same extraction question to every document in a corpus
  and synthesize a single memo. Use when the user asks about something that
  may span many files in a folder, codebase, or data room.
---

# Map-Reduce over Corpus

## When to invoke

- "review these 50 contracts for X"

- "find all places in the codebase that do Y"

- "summarize what each of these documents says about Z"

## Recipe

1. Read every file in the target set in parallel.

2. For each file, run the map prompt (topic from the user).

3. Drop files whose output is "NONE".

4. Run the reduce prompt over the survivors.

5. Return the memo with source citations.

Orchestrator–Workers

Classification Hybrid (sequential orchestration of parallel workers, possibly looping)

Intent

A central LLM dynamically decomposes the task into subtasks, delegates each to a worker LLM, and synthesizes their outputs.

Sketch

Diagram of Orchestrator–Workers pattern — The orchestrator decomposes dynamically and synthesizes after workers return.

Motivating Problem

Some tasks decompose into subtasks whose number and shape are not knowable in advance. Editing a code change across an unknown number of files is the canonical case. Sectioning fails here because you cannot list the sections up front. Prompt Chaining fails because the decomposition is the problem.

How It Works

An orchestrator LLM call inspects the input and produces a list of subtasks. Each subtask is dispatched to a worker --- usually a fresh LLM call with a worker-specific prompt and a focused scope. Workers can run in parallel when their subtasks are independent. When they return, the orchestrator synthesizes their outputs into a final result, possibly issuing further subtasks if it judges the work incomplete.

The pattern is topologically similar to Sectioning, but the decomposition is decided at run time rather than fixed in advance. The orchestrator must be capable enough to produce a sensible decomposition, and the synthesis step has to handle a variable-size collection of worker outputs.

When to Use It

When the structure of the task depends on the input; when subtasks are roughly independent once identified; and when you can afford the latency of the orchestration + synthesis round-trips. If you know the subtasks in advance, prefer Sectioning. If subtasks have dependencies, you are reaching for ReWOO or Hierarchical Supervisor.

Sources

Anthropic, Building Effective Agents --- names the pattern.

Example

A coding agent receives “rename the User class to Account everywhere.” The orchestrator inspects the codebase, identifies files that need changes (including imports, tests, migrations), and spawns a worker for each file. Each worker proposes its edit. The orchestrator reviews the diff set and either commits or issues additional subtasks.

Example artifacts

Prompt.

# Orchestrator system prompt

You are a task orchestrator. Given the user goal, output a JSON

list of subtasks, each shaped:

{

"id": "t1",

"instruction": "<focused, self-contained instruction>",

"tool": "<which tool/worker prompt to use>",

"depends_on": [] // ids of subtasks that must complete first

}

After all subtasks return, you will be re-invoked with their

outputs to synthesize the final answer.

Agent.

import json

def orchestrate(goal):
    plan_json = llm(orchestrator_prompt, goal, model="claude-opus-4-7")
    plan = json.loads(plan_json)  # list of subtask dicts
    results = {}
    while plan:
        # A subtask is ready once everything it depends on has a result.
        ready = [t for t in plan if all(d in results for d in t["depends_on"])]
        if not ready:
            raise RuntimeError("dependency cycle or unknown dependency id")
        batch = run_workers_parallel(ready)  # one LLM call per subtask; returns {id: output}
        results.update(batch)
        plan = [t for t in plan if t["id"] not in results]
    return llm(synthesis_prompt, goal, results, model="claude-opus-4-7")
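The agent code leaves `run_workers_parallel` undefined. One possible sketch uses a thread pool, on the assumption that each worker call is a blocking network request. Unlike the call in the listing, this version takes the worker function as an explicit argument; `fake_worker` is a hypothetical stand-in for a worker LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workers_parallel(subtasks, worker, max_workers=8):
    """Run worker(subtask) for each subtask concurrently.
    Returns {subtask id: worker output}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(worker, subtasks))  # map preserves input order
    return {task["id"]: out for task, out in zip(subtasks, outputs)}

def fake_worker(subtask):
    # Hypothetical stand-in for one worker LLM call.
    return f"done: {subtask['instruction']}"

batch = run_workers_parallel(
    [
        {"id": "t1", "instruction": "rename in models.py", "depends_on": []},
        {"id": "t2", "instruction": "rename in tests.py", "depends_on": []},
    ],
    fake_worker,
)
print(batch)
```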

Hybrid Pipeline

Classification Sequential of parallel stages

Intent

Compose a workflow as a sequence of stages, where one or more stages internally fan out in parallel.

Sketch

Diagram of Hybrid Pipeline pattern — Sequential stages with internal parallel fan-out — the shape of most real workflows.

Motivating Problem

Real workflows are almost never purely sequential or purely parallel. A research pipeline reads a corpus (sequential), summarizes each document (parallel), then synthesizes the summaries into a report (sequential). Treating the whole thing as one pattern loses information about where the parallelism actually is and where the dependencies actually are.

How It Works

Hybrid Pipeline is a composition, not a primitive. It is the explicit recognition that the patterns in this catalog combine, and the act of naming the composition gives you a vocabulary for the shape of a workflow rather than just its parts. A Hybrid Pipeline is fully described by listing its stages, marking which are sequential and which fan out, and naming the sub-pattern used at each stage (Prompt Chaining, Sectioning, Map-Reduce, ReAct, etc.).

When to Use It

Whenever a workflow has more than one stage and at least one of those stages parallelizes. Which is almost always.

Sources

Catalog name. The composition has no single canonical reference because it is the implicit shape of nearly every production system.

Example

A legal-document review pipeline: (1) ingest and chunk documents (sequential), (2) classify each chunk by clause type (Map-Reduce over Corpus), (3) for each clause type, run a specialist reviewer (Routing inside Sectioning), (4) aggregate flags and produce a single review memo (sequential, with an Evaluator–Optimizer loop on the memo).

Example artifacts

Agent.

# Hybrid pipelines are best read as their constituent patterns composed.
# Pseudo-DAG:
stage_1_ingest = sequential(chunk_documents)
stage_2_classify = map_reduce(per_chunk=classify_clause_type, reduce=group_by_type)
stage_3_review = sectioning_with_routing(per_type=specialist_reviewer)
stage_4_memo = evaluator_optimizer(generator=draft_memo, evaluator=memo_checker, max_iter=3)
pipeline = compose(stage_1_ingest, stage_2_classify, stage_3_review, stage_4_memo)
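The `compose` helper in the pseudo-DAG is not defined in this catalog. A minimal sketch, assuming each stage is a plain function from the previous stage's output to the next stage's input; the three toy stages are hypothetical and only illustrate the data flow.

```python
from functools import reduce

def compose(*stages):
    """Return a function that feeds its input through each stage in order."""
    def pipeline(value):
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return pipeline

# Toy stages standing in for ingest, classify, and aggregate.
def chunk(text):
    return text.split(". ")

def classify(chunks):
    return {c: ("indemnity" if "indemn" in c else "other") for c in chunks}

def count_by_type(labels):
    counts = {}
    for label in labels.values():
        counts[label] = counts.get(label, 0) + 1
    return counts

review = compose(chunk, classify, count_by_type)
summary = review("Seller shall indemnify Buyer. Notices go by mail. Buyer indemnifies Seller")
print(summary)
```

Each real stage would hide its own fan-out behind the same one-argument interface, which keeps the sequential spine of the pipeline readable.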

Evaluator–Optimizer

Classification Looping

Intent

One LLM generates a candidate; a second LLM evaluates it and produces feedback; the generator revises. Loop until the evaluator accepts or a budget is exhausted.

Sketch

Diagram of Evaluator–Optimizer pattern — Generator/Evaluator loop — feedback drives revision until the critic accepts.

Motivating Problem

Some outputs improve dramatically when a human gives the model targeted feedback (“the second paragraph contradicts the first,” “this translation lost the pun”). The first pass is rarely the best pass. But waiting for a human reviewer between every iteration does not scale.

How It Works

Stand up two LLM roles. The generator produces a candidate output from the input and any prior feedback. The evaluator scores the candidate against explicit criteria --- accuracy, completeness, style, schema validity --- and either accepts or emits feedback. On a reject, the generator runs again with the feedback in its prompt. The loop terminates on accept, on max iterations, or when feedback stops changing meaningfully.

Two preconditions matter: (1) the evaluator must be able to articulate useful feedback (“bad” doesn’t help the generator), and (2) feedback must demonstrably improve the output (test offline). When both hold, the pattern produces outputs significantly better than a single call. When they do not --- a sign that criteria are not legible to the model --- the loop wastes tokens without converging.
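The third termination condition, feedback that has stopped changing meaningfully, needs a concrete test. One possible heuristic compares consecutive critiques with a string-similarity ratio; the function name and the 0.9 threshold are assumptions made for illustration and would need tuning offline.

```python
from difflib import SequenceMatcher

def feedback_stalled(previous: str, current: str, threshold: float = 0.9) -> bool:
    """True when two consecutive critiques are near-identical."""
    if not previous:
        return False  # nothing to compare against on the first round
    return SequenceMatcher(None, previous, current).ratio() >= threshold

round_1 = "The pun in line 2 is lost; restore the wordplay."
round_2 = "The pun in line 2 is lost; restore the wordplay!"
round_3 = "Tone is too formal for dialogue."

print(feedback_stalled(round_1, round_2), feedback_stalled(round_2, round_3))
```

A loop would call this after each evaluator verdict and stop early when it returns True, since a repeated critique suggests the generator cannot act on it.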

When to Use It

When you have clear evaluation criteria and iterative refinement provides measurable value; classically, translation, long-form drafting, code generation against tests, and complex search.

Sources

Anthropic, Building Effective Agents --- names the pattern.

Example

A literary translation pipeline: the generator translates a paragraph; the evaluator (prompted with the source, the translation, and a checklist of common failure modes) emits a structured critique; the generator re-translates incorporating the critique. Three rounds usually suffice; the loop caps at five.

Example artifacts

Prompt.

# Generator prompt

"Translate from {src_lang} to {tgt_lang}. Preserve tone and any
wordplay.

Source:

{source_text}

Prior critique to address (may be empty):

{feedback}"

# Evaluator prompt (literal JSON braces are doubled because the agent fills this template with str.format)

"You will score a translation 0-10 on (a) fidelity, (b) naturalness,
(c) preserved wordplay. Output JSON:

{{ \"score\": <float>, \"feedback\": \"<targeted critique, max 3 bullets>\" }}

Source: {source_text}

Translation: {translation}"

Agent.

import json

def evaluator_optimizer(source, src_lang, tgt_lang, max_iter=5, threshold=8.5):
    feedback = ""
    best, best_score = None, float("-inf")
    for i in range(max_iter):
        candidate = llm(generator_prompt.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source_text=source, feedback=feedback))
        verdict = json.loads(llm(evaluator_prompt.format(
            source_text=source, translation=candidate)))
        if verdict["score"] >= threshold:
            return candidate
        if verdict["score"] > best_score:  # a revision can score lower than its predecessor
            best, best_score = candidate, verdict["score"]
        feedback = verdict["feedback"]
    return best  # best-effort on budget exhaustion

Generator–Verifier

Classification Looping (rejection sampling)

Intent

Generate a candidate; check it with a binary Pass/Fail verifier; on Fail, sample again. The verifier emits no feedback; the generator simply produces a fresh sample.

Sketch

Diagram of Generator–Verifier pattern — Differs from Evaluator–Optimizer: the verifier has no feedback channel; the generator just samples again.

Motivating Problem

Sometimes the right success signal is binary --- the SQL parses, the JSON validates, the unit tests pass, the regex extracts a match --- and articulating useful natural-language feedback to the generator is either impossible or unhelpful. Asking a Pass/Fail verifier to also produce feedback adds prompt complexity and can introduce hallucinated critiques. Rejection sampling against a clean verifier is often simpler and faster.

How It Works

The verifier is a function (a deterministic checker, a classifier, or an LLM with a strict Pass/Fail output) that takes the candidate and returns one of two labels. The generator samples up to k candidates, in parallel or serially, and the first one that passes is returned. If none passes within the budget (max_samples in the agent code), the pattern reports failure.

The pattern is the agent-era version of rejection sampling. Diversity in the generator (temperature, prompt variations) is the engine --- if the generator always produces the same output, k samples cost k× the tokens and gain nothing.

Why this is not Evaluator–Optimizer:

Evaluator–Optimizer’s critic produces feedback the generator incorporates. The generator’s prompt grows over the loop.
Generator–Verifier’s verifier produces Pass/Fail only. The generator’s prompt is constant; only the random seed (or temperature) differs.
When the failure mode is “close but not quite right,” feedback helps and Evaluator–Optimizer is appropriate. When the failure mode is “invalid in an objective way,” rejection sampling is appropriate.

Parallel batch variant. Run k generator calls concurrently, then pick the first to pass. Trades cost for latency --- useful when most candidates fail and the verifier is fast.
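A sketch of the parallel batch variant using asyncio. `first_passing`, `fake_generate`, and `fake_verify` are hypothetical names; the stubs simulate a generator whose first two samples are invalid and whose later samples take longer to return.

```python
import asyncio

async def first_passing(generate, verify, k=5):
    """Launch k generator calls concurrently and return the first
    candidate that verifies, or None if all k fail."""
    tasks = [asyncio.create_task(generate(i)) for i in range(k)]
    try:
        for finished in asyncio.as_completed(tasks):
            candidate = await finished
            if verify(candidate):
                return candidate
        return None
    finally:
        for task in tasks:
            task.cancel()  # stop waiting on samples that are no longer needed

async def fake_generate(i):
    # Hypothetical generator: later samples finish later; the first two are invalid.
    await asyncio.sleep(0.01 * i)
    return "SELECT 1" if i >= 2 else "SELEC 1"

def fake_verify(sql):
    # Hypothetical verifier: a trivial stand-in for a real parse check.
    return sql.startswith("SELECT ")

winner = asyncio.run(first_passing(fake_generate, fake_verify, k=4))
print(winner)
```

Cancelling a task stops the client from waiting, but whether the provider still bills for an in-flight request depends on the API, so the cost side of the trade-off should be measured rather than assumed.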

When to Use It

When validity is objective and binary; when feedback would not help the generator more than fresh sampling would; when the verifier is fast relative to the generator. Code generation against a test suite, structured output generation against a schema, query generation against a database parser.

Sources

Catalog name. The technique is classical rejection sampling (von Neumann, 1951) applied to LLM outputs.
The text-to-SQL literature has independently used this shape under names like “self-debugging” and “execution-guided decoding.”

Example

A natural-language-to-SQL feature: the generator produces a SQL candidate; the verifier runs EXPLAIN against the database (no execution) and accepts on a clean parse. On reject, the generator samples again at higher temperature. Cap: 5 samples. Below-cap success rate observed: ~92%.

Example artifacts

Prompt.

# Generator prompt

"Translate the user question into a SELECT query against the schema
below. Output only SQL.

Schema:

{schema}

Question:

{question}"

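# Verifier (programmatic, not an LLM)

def verifier(sql):
    try:
        db.execute("EXPLAIN " + sql)  # plans the query without running it
        return "PASS"
    except Exception:  # database drivers raise their own error types (e.g. sqlite3.Error), not Python's SyntaxError
        return "FAIL"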

Agent.

def generator_verifier(question, schema, max_samples=5):
    for i in range(max_samples):
        candidate = llm(
            generator_prompt.format(schema=schema, question=question),
            temperature=0.0 if i == 0 else 0.7,  # greedy first, then diverse resamples
        )
        if verifier(candidate) == "PASS":
            return candidate
    raise SamplingExhausted(f"no valid SQL in {max_samples} tries")

Reflexion

Classification Looping with episodic memory

Intent

Evaluator–Optimizer extended with persistent verbal lessons: the agent reflects on each failure in writing, stores the reflection, and consults the store on the next attempt.

Sketch

Diagram of Reflexion pattern — Failures generate written reflections; the actor consults the store on the next attempt.

Motivating Problem

Evaluator–Optimizer carries feedback only within a single task. If the agent fails the same way on a hundred different tasks, it learns nothing across tasks. Fine-tuning is the heavyweight answer; Reflexion is the lightweight one: keep the lessons in natural language, in a memory store the next attempt can read.

How It Works

Three roles: the actor attempts the task; the critic judges the attempt; the reflector, run only on failures, writes a short natural-language post-mortem that names what went wrong and what to try next. The post-mortem is stored. On future attempts (same task or related), the actor’s prompt includes relevant reflections retrieved from the store.

The pattern is essentially verbal reinforcement learning: the reflection is the gradient, and the actor’s context window is the parameter update. It is cheaper and more interpretable than fine-tuning, and it composes with the rest of the patterns in this catalog.

When to Use It

When failures recur in legible ways and the task has a clean success signal (unit tests pass, retrieval recall hits a threshold, evaluator accepts). Reflexion is a poor fit when success is ambiguous or when failures are idiosyncratic.

Sources

Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023) --- names and characterizes the pattern.

Example

A SWE-bench-style coding agent attempts a fix; the test suite runs as the critic; on failure, the reflector writes “the patch broke test_user_serialization because the migration order matters.” The reflection is stored. On the next attempt at this issue (or a similar one), the reflection is retrieved into the actor’s prompt.

Example artifacts

Prompt.

# Reflector prompt

"The previous attempt at this task failed. In 2-3 sentences, write a
lesson to apply on future attempts. Focus on the root cause, not the
surface symptom. Begin with: 'Lesson: '

Task: {task}

Attempt: {attempt}

Critic verdict: {critic_output}"

Agent.

def reflexion(task, max_iter=5):
    lessons = retrieve_lessons(task)  # vector search over store
    for i in range(max_iter):
        attempt = llm(actor_prompt, task=task, lessons=lessons)
        verdict = critic(attempt, task)
        if verdict.passed:
            return attempt
        lesson = llm(reflector_prompt, task=task,
                     attempt=attempt, critic_output=verdict.explanation)
        store_lesson(task, lesson)
        lessons.append(lesson)  # available to next iteration
    return attempt
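The agent code assumes `retrieve_lessons` and `store_lesson` exist. A toy in-memory version is sketched here; it ranks lessons by word overlap with the task, which is a deliberate simplification swapped in for the vector search the listing mentions. The `LessonStore` class is a hypothetical name.

```python
class LessonStore:
    """In-memory lesson store; retrieval ranks by word overlap with the task."""

    def __init__(self):
        self._lessons = []  # list of (task text, lesson text)

    def store_lesson(self, task, lesson):
        self._lessons.append((task, lesson))

    def retrieve_lessons(self, task, k=3):
        query = set(task.lower().split())
        scored = [
            (len(query & set(stored_task.lower().split())), lesson)
            for stored_task, lesson in self._lessons
        ]
        # Keep only lessons sharing at least one word; highest overlap first.
        ranked = sorted((s for s in scored if s[0] > 0),
                        key=lambda s: s[0], reverse=True)
        return [lesson for _, lesson in ranked[:k]]

store = LessonStore()
store.store_lesson("fix user serialization migration",
                   "Lesson: apply migrations in dependency order.")
store.store_lesson("speed up image resize",
                   "Lesson: cache decoded images.")
hits = store.retrieve_lessons("fix failing user migration test")
print(hits)
```

A production store would persist lessons and retrieve by embedding similarity, but the interface seen by the actor stays the same.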

Autoresearch (Ratchet Loop)

Classification Looping with monotonic state

Intent

Propose a change, run an experiment, measure a metric, keep the change only if the metric improved; otherwise roll back. Repeat indefinitely.

Sketch

Diagram of Autoresearch (Ratchet Loop) pattern — The ratchet: commit on improvement, rollback on regression. State on disk is always the current best.

Motivating Problem

Many optimization problems --- tuning an ML training loop, optimizing a prompt, improving SEO copy, refining an ad --- share a structure: there is a clear measurable metric, individual experiments take minutes to hours, the space of possible changes is large, and most changes don’t help. A human researcher spends most of their time waiting for the experiment, not thinking about the next one. The work is parallelizable in time: an agent can run experiments overnight while the human sleeps.

How It Works

Three preconditions: (1) a measurable scalar metric, (2) the ability to apply and roll back a change atomically, and (3) an experiment cycle that fits inside a reasonable wall-clock budget. With those, the loop is small:

An agent reads the current code/prompt/config and the log of prior experiments.
It proposes a single, targeted change, grounded in its knowledge of the domain.
The change is applied; the experiment runs.
The metric is compared against the best-so-far.
If improved, the change is committed (git commit) and the new best is recorded. If not, the change is rolled back (git reset --hard).
The loop repeats.

The git-based “ratchet” is what distinguishes the pattern from a generic optimizer. The state on disk is always the current best; failed experiments leave no residue. The agent can be naive about exploration strategy because the ratchet provides the floor.

Karpathy’s autoresearch repository (March 2026) is the canonical implementation: an AI coding agent (Claude Code, Codex, or equivalent) is pointed at a small LLM training setup and a Markdown instruction file describing the research agenda, and the loop runs unsupervised. Results from extended runs surfaced changes a careful human would eventually have found --- a missing QK-norm scaler, value-embedding regularization, AdamW beta tuning --- caught by the agent grinding through experiments overnight.

When to Use It

When the three preconditions hold and the experiment cycle is short enough to run a hundred trials overnight. If the metric is fuzzy, the rollback is messy, or each experiment takes a day, the pattern degrades into something closer to Evaluator–Optimizer with very expensive iterations.

Sources

Karpathy, autoresearch GitHub repository (March 2026) --- the reference implementation and the source of the name.

Example

Karpathy’s original run: nanochat-style LLM training on a single GPU, ~5-minute training jobs, validation bits-per-byte as the metric. Two days unattended, ~700 experiments, ~20 genuine improvements committed, ~11% reduction in training time. Tobi Lütke’s adaptation pointed the same loop at a 0.8B query-expansion model overnight: 37 experiments, 19% quality improvement.

Example artifacts

Agent.

# Pseudocode for the ratchet loop
import time

def autoresearch(workdir, max_iter=1000, time_budget_hours=8):
    # workdir is a pathlib.Path pointing at a git repository
    deadline = time.time() + time_budget_hours * 3600
    best_metric = current_metric(workdir)
    for i in range(max_iter):
        if time.time() > deadline:
            break
        history = read_log(workdir / "results.tsv")
        change_plan = llm(agent_prompt,
                          history=history,
                          state=read_files(workdir),
                          model="claude-opus-4-7")
        apply_change(workdir, change_plan)
        try:
            new_metric = run_experiment(workdir)
        except Exception:
            git_reset(workdir)  # a crashed experiment leaves no residue
            continue
        # Assumes higher is better; flip the comparison for loss-like
        # metrics such as validation bits-per-byte.
        if new_metric > best_metric:
            git_commit(workdir, f"iter {i}: {change_plan.summary}")
            best_metric = new_metric
            append_log(workdir, i, change_plan, new_metric, "committed")
        else:
            git_reset(workdir)
            append_log(workdir, i, change_plan, new_metric, "rolled_back")

Skill (SKILL.md sketch).

---
name: autoresearch
description: Run an unattended optimization loop on a measurable metric.
  Use when the user wants to "explore overnight," "optimize automatically,"
  or improve a metric they can measure programmatically.
---

# Autoresearch (Ratchet Loop)

## Preconditions to verify before starting

- There is ONE scalar metric to optimize, measurable by a script.

- A single experiment cycle completes in under ~10 minutes.

- The workdir is a git repo; the agent can commit and reset.

## Required files in the workdir

- agenda.md --- research goal, constraints, ideas to try

- run_experiment.sh --- exits 0 and prints METRIC: <float> on stdout

- results.tsv --- appended to each iteration

## Loop (do not deviate)

1. Read results.tsv for history of prior experiments.

2. Propose ONE targeted change, grounded in agenda.md.

3. Apply the change.

4. Run run_experiment.sh; capture METRIC.

5. If METRIC improved over best-so-far: git commit; update best.

6. Else: git reset --hard.

7. Append a row to results.tsv. Go to 1.
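Steps 4 and 5 of the loop depend on reading the METRIC line reliably and on knowing which direction counts as improvement. A minimal sketch, assuming the script prints lines of the form `METRIC: <float>` as the skill requires; `parse_metric`, `is_improvement`, and the `lower_is_better` flag are hypothetical names introduced here.

```python
import re

# Matches a whole line such as "METRIC: 0.987" or "METRIC: -1.2e-3".
METRIC_LINE = re.compile(
    r"^METRIC:[ \t]*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)[ \t]*$",
    re.MULTILINE,
)

def parse_metric(stdout: str) -> float:
    """Return the last METRIC value printed by the experiment script."""
    matches = METRIC_LINE.findall(stdout)
    if not matches:
        raise ValueError("experiment printed no METRIC line")
    return float(matches[-1])

def is_improvement(new: float, best: float, lower_is_better: bool = False) -> bool:
    """Strict improvement test; the direction depends on the metric."""
    return new < best if lower_is_better else new > best

log = "step 100 loss 1.23\nMETRIC: 0.987\n"
value = parse_metric(log)
print(value, is_improvement(value, 0.991, lower_is_better=True))
```

Making the direction explicit matters for loss-like metrics, where a smaller number is the improvement that should be committed.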

ReAct

Classification Looping (single agent, single context)

Intent

Interleave the model’s reasoning (“Thought”) with tool calls (“Action”) and tool results (“Observation”) in a single rolling context until the task is done.

Sketch

Diagram of ReAct pattern — Three iterations of Thought → Action → Observation on a single rolling context.

Motivating Problem

Many real tasks require both reasoning and grounded actions --- the model must think about what to do, do it, see the result, and reason about the next step. Pure Chain-of-Thought reasons but cannot ground its conclusions in external state. Pure tool use acts but cannot plan across steps. ReAct interleaves the two.

How It Works

A single prompt instructs the model to emit a Thought, then an Action (a tool call), then to wait for an Observation, then to emit the next Thought. The runtime parses each Action, executes the tool, returns the Observation, and appends everything to the rolling context. The loop continues until the model emits a terminal Action (“Final Answer”) or a stop condition triggers (max iterations, token budget, error).

ReAct is the dominant pattern for production agents today. Most “agent frameworks” are implementations of ReAct with conveniences around tool definition, parsing, and observability.

Failure modes: the model gets stuck repeating the same Action, loses track in long contexts, or hallucinates Observations it never actually received. Guardrails --- iteration caps, repetition detectors, structured tool outputs the model cannot fabricate --- mitigate them.
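One of the guardrails named above, a repetition detector, can be sketched in a few lines. The class name, window size, and limit are illustrative assumptions; a runtime would call `check` once per parsed Action and abort or nudge the model when it returns True.

```python
from collections import deque

class RepetitionGuard:
    """Flags a loop when the same (tool, args) Action occurs
    `limit` times within the last `window` steps."""

    def __init__(self, window: int = 6, limit: int = 3):
        self.recent = deque(maxlen=window)  # oldest Actions fall off automatically
        self.limit = limit

    def check(self, tool: str, args: str) -> bool:
        """Record an Action; return True if the agent appears stuck."""
        self.recent.append((tool, args))
        return self.recent.count((tool, args)) >= self.limit

guard = RepetitionGuard(window=4, limit=3)
flags = [guard.check("web_search", "Jannik Sinner hometown") for _ in range(3)]
print(flags)
```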

When to Use It

When the task requires tool use, the steps are not knowable in advance, and the horizon is bounded (tens of steps, not thousands). For longer horizons, layer ReAct inside an outer pattern --- Plan-and-Execute for structure, Hierarchical Supervisor for delegation, Reflexion for cross-task learning.

Sources

Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 2023) --- the paper that named the pattern.

Example

A research assistant asked “Who won the 2024 Australian Open and what is their hometown?” emits: Thought: I should look up the 2024 Australian Open winner. Action: web_search(“2024 Australian Open winner”). Observation: Jannik Sinner. Thought: Now I need his hometown. Action: web_search(“Jannik Sinner hometown”). Observation: Innichen, Italy. Action: final_answer(“Jannik Sinner, from Innichen, Italy.”).

Example artifacts

Prompt.

# ReAct system prompt

You have access to these tools: web_search(q), fetch(url),
final_answer(text).

At each step, respond in EXACTLY this format:

Thought: <your reasoning about what to do next>

Action: <tool_name>(<args>)

Wait for the Observation to be returned. Then continue with the next
Thought.

When you have the answer, call: Action: final_answer(<answer>)

Agent.

def react_loop(question, tools, max_iter=12):
    context = [{"role": "user", "content": question}]
    for i in range(max_iter):
        response = llm(react_system_prompt, context, tools=tools)
        context.append({"role": "assistant", "content": response})
        action = parse_action(response)
        if action.name == "final_answer":
            return action.args["text"]
        observation = tools[action.name](**action.args)
        context.append({"role": "user",
                        "content": f"Observation: {observation}"})
    raise IterationCapReached()
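The loop relies on a `parse_action` helper that is not shown. A minimal sketch for the text format in the system prompt, assuming each tool takes exactly one string argument whose name comes from the prompt's tool list (`q`, `url`, `text`); the regular expression and the `PARAM_NAMES` table are assumptions, and a production runtime would more likely use the model API's structured tool-calling output.

```python
import re
from dataclasses import dataclass

# Parameter name for each single-argument tool listed in the system prompt.
PARAM_NAMES = {"web_search": "q", "fetch": "url", "final_answer": "text"}

ACTION_RE = re.compile(r"^Action:[ \t]*(\w+)\((.*)\)[ \t]*$", re.MULTILINE)

@dataclass
class Action:
    name: str
    args: dict

def parse_action(response: str) -> Action:
    """Extract the first 'Action: tool(args)' line from a model response."""
    match = ACTION_RE.search(response)
    if match is None:
        raise ValueError("no Action line found in model response")
    name, raw = match.group(1), match.group(2).strip()
    if name not in PARAM_NAMES:
        raise ValueError(f"unknown tool: {name}")
    value = raw.strip("\"'")  # drop surrounding quotes if present
    return Action(name=name, args={PARAM_NAMES[name]: value})

step = parse_action(
    'Thought: I should look up the winner.\n'
    'Action: web_search("2024 Australian Open winner")'
)
print(step)
```

Taking only the first Action line means any Observation text the model invents after it is ignored, which addresses one of the failure modes described earlier.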

Plan-and-Execute

Classification Sequential (plan stage, then execution stages)

Intent

Produce a full step-by-step plan up front; then execute each step, often with a smaller and cheaper model.

Sketch

Diagram of Plan-and-Execute pattern — Strong planner produces the plan once; cheaper executor runs each step.

Motivating Problem

ReAct decides each next step at runtime, with all the context-window cost and variance that implies. For tasks whose structure is predictable --- automated software deployments, structured data extraction, well-bounded multi-step workflows --- re-deciding every step is waste. A plan produced once by a capable model is cheaper to execute than a sequence of full reasoning passes.

How It Works

A planner (typically a strong model) reads the goal and emits a structured plan: an ordered list of steps, each with the inputs and expected outputs named. An executor (potentially a smaller, cheaper model) runs each step in sequence. The executor’s prompt is constrained --- it sees the plan, the current step, the outputs of prior steps, and the tools available.

Variations:

Static plan. Plan fixed at planning time. Simple but brittle.
Replanning. After each step, the planner is invoked again with the latest observations and can revise the remaining plan.
Plan-and-Solve (Wang et al., 2023). A specific prompting variant separating “devise a plan” from “carry out the plan” in a single LLM call.

When to Use It

When the task structure is predictable, when separating planning from execution reduces cost meaningfully, and when the planner can produce a plan you would trust to execute without re-checking every step. If the environment is unpredictable, prefer ReAct or replanning.

Sources

Wang et al., Plan-and-Solve Prompting (ACL 2023) --- the prompting variant that names the pattern.

Example

An automated deployment agent receives “deploy v2.3 to staging.” The planner emits: (1) run pre-flight checks, (2) tag the release, (3) push to staging registry, (4) update the staging cluster, (5) run smoke tests. A smaller executor model runs each step. If smoke tests fail, the planner is re-invoked with the failure context to decide whether to roll back or to patch forward.

Example artifacts

Prompt.

# Planner prompt (run once, strong model; literal JSON braces are doubled because the agent fills this template with str.format)

"Goal: {goal}

Output a JSON plan: a list of steps. Each step has:

{{ \"id\": \"s1\", \"instruction\": \"...\", \"tool\": \"deploy|test|...\", \"depends_on\": [] }}"

# Executor prompt (run per step, cheap model)

"You are executing step {step_id}: {instruction}

Prior step outputs (JSON):

{prior_outputs}

Use the tool {tool}. Return its result verbatim."

Agent.

import json

def plan_and_execute(goal, max_replans=3):
    plan = json.loads(llm(planner_prompt.format(goal=goal),
                          model="claude-opus-4-7"))
    outputs = {}
    replans = 0
    while plan:
        step = plan.pop(0)  # steps are assumed to arrive in dependency order
        try:
            outputs[step["id"]] = llm(executor_prompt.format(
                step_id=step["id"],
                instruction=step["instruction"],
                tool=step["tool"],
                prior_outputs=json.dumps(outputs),
            ), model="claude-haiku-4-5", tools=[step["tool"]])
        except StepFailure as e:
            # Replan on failure: the new plan replaces the remaining steps.
            # (Reassigning the list inside a for loop would not change the iteration.)
            replans += 1
            if replans > max_replans:
                raise
            plan = json.loads(llm(replanner_prompt.format(
                goal=goal, completed=outputs, failure=str(e)),
                model="claude-opus-4-7"))
    return outputs

Skill (SKILL.md sketch).

---
name: plan-and-execute
description: Decompose a multi-step goal into a structured plan and run each
  step. Use when the user describes a goal that decomposes into 3+ named
  steps with clear dependencies (deployments, multi-stage data processing,
  structured extraction).
---

# Plan-and-Execute

## When to invoke

- "Deploy X to Y", "set up the pipeline that does A then B then C"

- Goals with verbs that imply staged work

## Recipe

1. Call the planner (strong model) ONCE on the goal.

2. Validate the plan: each step has id, instruction, tool, depends_on.

3. Execute steps in topological order; smaller model is fine per step.

4. On step failure: re-invoke the planner with completed steps + error.

5. Return all outputs keyed by step id.
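Recipe steps 2 and 3 (validate the plan, then execute in topological order) can be sketched with the standard library's graphlib module, available from Python 3.9. The function name `execution_order` and the sample plan are illustrative; the required keys follow the step shape in the planner prompt.

```python
from graphlib import CycleError, TopologicalSorter

REQUIRED_KEYS = {"id", "instruction", "tool", "depends_on"}

def execution_order(plan):
    """Validate step shape and return step ids in a dependency-respecting order."""
    ids = set()
    for step in plan:
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            raise ValueError(f"step {step.get('id', '?')} is missing {sorted(missing)}")
        ids.add(step["id"])
    for step in plan:
        unknown = set(step["depends_on"]) - ids
        if unknown:
            raise ValueError(f"step {step['id']} depends on unknown steps {sorted(unknown)}")
    # Map each step to its predecessors; static_order raises CycleError on a cycle.
    graph = {step["id"]: set(step["depends_on"]) for step in plan}
    return list(TopologicalSorter(graph).static_order())

plan = [
    {"id": "s3", "instruction": "run smoke tests", "tool": "test", "depends_on": ["s2"]},
    {"id": "s1", "instruction": "tag the release", "tool": "deploy", "depends_on": []},
    {"id": "s2", "instruction": "push to staging", "tool": "deploy", "depends_on": ["s1"]},
]
order = execution_order(plan)
print(order)
```

Validating before execution means a malformed plan fails fast, before any step with side effects has run.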

ReWOO

Classification Sequential plan; parallel execution; sequential synthesis

Intent

Produce a plan with explicit data dependencies, execute the independent steps in parallel without intermediate model invocations, and combine the results in a final synthesis step.

Sketch

Diagram of ReWOO pattern — Plan once, execute as a DAG of pure tool calls, synthesize once — only two LLM invocations total.

Motivating Problem

ReAct re-invokes the LLM after every Observation, which is expensive in tokens and slow in latency when most of the Observations are routine. If the agent knew the structure of the work up front, it could fire many tool calls in parallel without consulting the model between them.

How It Works

The planner decomposes the task into a directed acyclic graph of steps, where each step is labeled (E1, E2, …) and may declare dependencies on the outputs of earlier steps via variable substitution. Workers execute the steps; independent steps run concurrently; dependent steps wait for their inputs. The model is not re-invoked between worker steps. Once all steps complete, the solver receives the original question, the plan, and all the observations, and produces the final answer.

ReWOO (“Reasoning WithOut Observation”) is named for the central insight: the planner reasons without observations, the workers act without reasoning, and the solver reasons after all observations have arrived. The LLM is invoked twice (planner + solver) instead of N times (ReAct).

When to Use It

When the task’s structure can be planned in advance, when independent steps exist that can run in parallel, and when intermediate LLM reasoning between steps adds little. If the next step depends on reasoning over the previous step’s content, prefer ReAct.

Sources

Xu et al., ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models (2023) --- the paper that names the pattern.

Example

“What is the population of the capital of the country that won the most gold medals at the 2024 Olympics?” Planner produces: #E1 = lookup[2024 Olympics gold medal leader]; #E2 = lookup[capital of #E1]; #E3 = lookup[population of #E2]. #E1 must finish before #E2; #E2 before #E3. Solver synthesizes a final answer from #E1, #E2, #E3. This particular chain is fully sequential, so the saving comes from skipping the model calls between steps rather than from parallel execution.

Example artifacts

Prompt.

# Planner prompt

"Decompose the question into worker steps using variable
substitution. Use #E1, #E2, ... to reference prior step outputs.
Output exactly:

Plan: <one-paragraph reasoning>

#E1 = <tool>[<args>]

#E2 = <tool>[<args, possibly referencing #E1>]

...

Question: {question}"

# Solver prompt

"Question: {question}

Plan and observations:

{plan_with_observations}

Synthesize the final answer."

Agent.

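import functools

def rewoo(question, tools):
    # LLM call 1 of 2: the planner.
    plan_text = llm(planner_prompt.format(question=question))
    steps = parse_steps(plan_text)  # each step has .id, .tool, .args, .deps
    observations = {}
    pending = list(steps)
    while pending:
        # A step is ready once every step it depends on has an observation.
        ready = [s for s in pending if all(d in observations for d in s.deps)]
        if not ready:
            raise ValueError("plan has cyclic or unresolvable dependencies")
        # Hand over callables so run_in_parallel does the calling (assumed to
        # take (id, callable) pairs and return (id, result) pairs); calling the
        # tool inside the list would run the steps one after another.
        results = run_in_parallel([
            (s.id, functools.partial(tools[s.tool],
                                     **substitute(s.args, observations)))
            for s in ready
        ])
        observations.update(results)
        pending = [s for s in pending if s.id not in observations]
    # LLM call 2 of 2: the solver.
    return llm(solver_prompt.format(
        question=question,
        plan_with_observations=annotate(plan_text, observations)))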

Tree-of-Thoughts

Classification Branching search over LLM reasoning

Intent

Generate multiple reasoning branches at each step, evaluate the partial branches, expand the most promising, backtrack when a branch fails.

Sketch

Diagram of Tree-of-Thoughts pattern — Search tree of partial reasoning states with scoring; pruned branches in pink.

Motivating Problem

Chain-of-Thought commits to one line of reasoning. If the first step is wrong, every subsequent step inherits the error. For tasks where the search space of reasoning paths is large and only a few paths lead to the right answer --- combinatorial puzzles, creative planning, mathematical proofs --- committing early is the failure mode.

How It Works

Treat reasoning as a search problem. At each node, the model proposes K candidate “thoughts” (next reasoning steps). Each candidate is scored --- by the model itself acting as a value function, by an external evaluator, or by partial execution. The search proceeds via BFS, DFS, or beam search through the tree, expanding promising nodes and pruning poor ones. Backtracking is cheap because the tree is preserved.

The pattern subsumes Chain-of-Thought (the K=1, no-backtrack degenerate case) and self-consistency voting (which samples N independent linear chains without sharing structure).
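The search loop itself is ordinary beam search; only the proposer and the scorer involve a model. The sketch below makes that separation explicit by replacing both with deterministic toy functions (reach 24 from 3 using +1 or ×2). The function names and the toy task are illustrative assumptions, not taken from the paper.

```python
def beam_search(initial, propose, score, is_goal, beam=2, max_depth=4):
    """Beam search over thought sequences; returns (found_goal, path)."""
    frontier = [(initial, [])]              # (state, path of states so far)
    for _ in range(max_depth):
        candidates = []
        for state, path in frontier:
            for nxt in propose(state):      # candidate next states
                new_path = path + [nxt]
                if is_goal(nxt):
                    return True, new_path
                candidates.append((score(nxt), nxt, new_path))
        if not candidates:
            break                           # every branch was pruned
        # Sort on the score only, so ties never fall through to comparing states.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [(s, p) for _, s, p in candidates[:beam]]
    best_path = frontier[0][1] if frontier else []
    return False, best_path

# Toy stand-ins for the two model calls.
def propose(n):
    return [n + 1, n * 2]

def score(n):
    return -abs(24 - n)                     # closer to 24 scores higher

found, path = beam_search(3, propose, score, lambda n: n == 24)
print(found, path)                          # True [6, 12, 24]
```

Swapping `propose` and `score` for LLM calls turns this into the pattern proper; setting `beam=1` with a single proposal per state reduces it to a linear chain.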

When to Use It

When the task has a large reasoning search space, when partial reasoning states are evaluable, and when the cost of expanding many branches is justified by the value of finding the right one. Token costs scale with branch count.
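How quickly cost scales can be estimated before running anything. The helper below is a back-of-the-envelope sketch that assumes one propose call per frontier state and one score call per proposed thought, with nothing pruned early; `tot_max_llm_calls` is a name introduced here for illustration.

```python
def tot_max_llm_calls(max_depth, k, beam):
    """Upper bound on LLM calls for a beam search that makes one propose call
    per frontier state and one score call per proposed thought."""
    calls, frontier = 0, 1
    for _ in range(max_depth):
        calls += frontier            # one propose call per frontier state
        calls += frontier * k        # one score call per candidate thought
        frontier = min(beam, frontier * k)
    return calls

print(tot_max_llm_calls(max_depth=4, k=3, beam=2))   # 28
```

With depth 4, three proposals per state, and a beam of two, the bound is 28 calls where a single Chain-of-Thought pass would make one.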

Sources

Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models (NeurIPS 2023) --- names the pattern.

Example

Solving the Game of 24 puzzle (combine four numbers with arithmetic to reach 24). At each step, the model proposes candidate next operations; each partial state is scored as “promising” / “uncertain” / “impossible”; the search expands only promising states. ToT solves problems CoT fails on, at multiplicative token cost.

Example artifacts

Prompt.

# Propose prompt

"Given the current partial reasoning state, propose K=3 distinct
next thoughts. For each, also self-evaluate as: SURE | MAYBE |
IMPOSSIBLE.

State: {state}

Output JSON: [{ \"thought\": \"...\", \"eval\":
\"SURE|MAYBE|IMPOSSIBLE\" }]"

# Score prompt (optional, separate value model)

"Rate the partial reasoning below on a 1-10 scale for likelihood of
reaching the goal.

Trajectory: {trajectory}

Output: { \"score\": <float>, \"reason\": \"...\" }"

Agent.

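import json

def tot_search(initial_state, max_depth=4, k=3, beam=2):
    frontier = [(initial_state, [])]  # (current state, path of thoughts)
    for depth in range(max_depth):
        candidates = []
        for state, path in frontier:
            # str.replace rather than str.format: the templates contain
            # literal JSON braces that format() would try to interpret.
            proposals = json.loads(llm(
                propose_prompt.replace("{state}", str(state))))
            for p in proposals[:k]:
                if p["eval"] == "IMPOSSIBLE":
                    continue  # prune on the model's self-evaluation
                if is_goal(p["thought"]):
                    return path + [p]
                score = json.loads(llm(score_prompt.replace(
                    "{trajectory}", json.dumps(path + [p]))))["score"]
                candidates.append((score, p, path + [p]))
        if not candidates:
            break  # every branch was pruned; keep the previous frontier
        # Sort on the score alone: comparing whole tuples would fall through
        # to comparing dicts on a tie and raise TypeError.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [(c[1]["thought"], c[2]) for c in candidates[:beam]]
    return frontier[0][1]  # best path so far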

LATS

Classification Tree search over a ReAct-style action space, with reflection

Intent

Combine Tree-of-Thoughts (branching search), ReAct (tool use), and Reflexion (verbal lessons) into a unified deliberation loop.

Sketch

Diagram of LATS pattern — Each tree node is a ReAct trajectory; dead ends produce reflections that steer remaining branches.

Motivating Problem

ReAct commits to whichever action it samples; a wrong commitment far down a long trajectory is expensive to recover from. Tree-of-Thoughts handles the branching but only over reasoning steps, not over tool actions. Real tasks need all three: branching deliberation, grounding in tool results, and the ability to learn from failed branches.

How It Works

Each node of the search tree is a state in a ReAct trajectory. At each node, multiple candidate Actions are generated and evaluated; the search expands the most promising. When a branch reaches a dead end or a failure, a reflection is generated and stored, and the search backtracks to a sibling. The reflections inform later expansions.

LATS = Tree Search + ReAct + Reflection. The compute cost is significant --- multiple ReAct branches running concurrently, each generating reflections --- but the success rate on hard tasks goes up correspondingly.
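Deciding which node is "most promising" is commonly done with the UCT rule from Monte Carlo tree search, which balances a node's average value against how rarely it has been visited. The sketch below shows the standard formula only; the dictionary shape for children and the exploration constant `c=1.0` are assumptions made for illustration.

```python
import math

def uct(value_sum, visits, parent_visits, c=1.0):
    """Upper Confidence bound for Trees: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf                     # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """children: list of dicts with 'value_sum' and 'visits' (hypothetical shape)."""
    return max(children,
               key=lambda ch: uct(ch["value_sum"], ch["visits"], parent_visits))

kids = [{"value_sum": 3.0, "visits": 4}, {"value_sum": 1.0, "visits": 1}]
print(select_child(kids, parent_visits=5))
```

A larger `c` spends more of the budget on under-explored branches; a smaller one concentrates it on branches that already look good.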

When to Use It

Hard tasks where ReAct alone is unreliable, the search space of actions is large, and the cost of expanded compute is acceptable. Web navigation, complex reasoning over external tools, and multi-step problem solving where dead ends are common.

Sources

Zhou et al., Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models (ICML 2024) --- names the pattern.

Example

A web-shopping agent searching for a product matching multiple constraints (size, price, in-stock, free shipping). Multiple ReAct branches explore different starting search queries; failed branches (out-of-stock results, mismatched specifications) trigger reflections that steer remaining branches away from similar dead ends.

Example artifacts

Agent.

# LATS = ToT structure + ReAct nodes + Reflexion on dead ends.

# In pseudocode (each tree node runs a short ReAct loop):

def lats(goal, max_depth=4, branching=3):
    root = Node(state=initial_state(goal))
    reflections = []
    for depth in range(max_depth):
        node = select_best(root)  # UCT or score-based
        actions = propose_actions(node, k=branching, reflections=reflections)
        children = [Node(execute_react_step(node, a)) for a in actions]
        node.children = children
        for c in children:
            if c.is_goal:
                return c.path
            if c.dead_end:
                reflections.append(reflect(c.path, c.dead_end_reason))
    return select_best(root).path

Multi-Agent Debate

Classification Parallel solvers, sequential aggregation, possibly looping for rounds

Intent

Multiple solver agents independently answer a question; they then exchange and critique each other’s answers across one or more rounds; an aggregator synthesizes the final answer.

Sketch

Diagram of Multi-Agent Debate pattern — Two rounds: independent answers, then revision in light of peers, then aggregation.

Motivating Problem

Self-consistency voting samples N answers from one model but cannot resolve cases where the majority is confidently wrong. Multi-Agent Debate forces the models to articulate their reasoning to each other, which surfaces inconsistencies that pure voting misses.

How It Works

N solver agents --- distinct prompts, distinct models, or distinct roles --- each produce an initial answer with reasoning. In each subsequent round, every solver sees the other solvers’ answers and reasoning, and produces a revised answer. After K rounds (commonly 2 or 3), an aggregator reads the full transcript and emits the consensus answer.

The pattern’s effectiveness comes from the legibility of disagreement. Without exchange, the disagreement is invisible.

Variations include sparse communication topologies (each solver sees only some peers) and asymmetric debates (one solver is a “checker” prompted to find flaws in the others’ answers).
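A communication topology reduces to one question: whose answers does solver i see next round? The sketch below contrasts full exchange with a ring, one simple sparse layout; `peers_for` and the topology names are hypothetical, introduced only to make the idea concrete.

```python
def peers_for(i, n, topology="full"):
    """Indices of the solvers whose answers solver i sees in the next round."""
    if topology == "full":
        return [j for j in range(n) if j != i]
    if topology == "ring":
        # Each solver sees only its two neighbours; the set removes
        # duplicates when n is small.
        return sorted({(i - 1) % n, (i + 1) % n} - {i})
    raise ValueError(f"unknown topology: {topology}")

print(peers_for(0, 5, "full"))   # [1, 2, 3, 4]
print(peers_for(0, 5, "ring"))   # [1, 4]
```

With full exchange each solver reads N-1 peer answers per round; with a ring it reads at most two, so prompt size stops growing with the number of solvers.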

When to Use It

When the task admits explicit reasoning, when wrong answers tend to have legible flaws, and when the cost of N×K solver calls is justified. Strong on reasoning-heavy benchmarks (GSM8K, complex QA). Weaker when the task is mostly retrieval or when “reasoning” is post-hoc rationalization.

Sources

Du et al., Improving Factuality and Reasoning in Language Models through Multiagent Debate (ICML 2024) --- names the pattern.

Example

Three solver agents tackle a math word problem. After their initial answers diverge (one says 42, two say 56), the second round shows each its peers’ reasoning. The single dissenter spots the arithmetic slip in its own work and revises; the aggregator returns 56.

Example artifacts

Prompt.

# Round 1 solver prompt

"Solve the problem below. Show your reasoning step by step, then
give the final numeric answer on the last line as: ANSWER: <value>

Problem: {question}"

# Round 2+ solver prompt

"Here is your prior answer and the answers of two peers. Reconsider
your answer in light of their reasoning; agree or disagree
explicitly.

Your prior: {your_prior}

Peer 1: {peer1}

Peer 2: {peer2}

Then give the final answer on the last line as: ANSWER: <value>"

# Aggregator prompt

"Solvers' final answers: {final_answers}. Output the consensus (or
majority) and confidence."

Agent.

def debate(question, n=3, rounds=2):
    # The round-2 prompt has slots for exactly two peers, so this sketch
    # assumes n == 3; widen the template before changing n.
    assert n == 3, "round2_prompt expects exactly two peers"
    solvers = [llm_with_persona(i) for i in range(n)]
    answers = [s(round1_prompt.format(question=question)) for s in solvers]
    for r in range(rounds - 1):
        new_answers = []
        for i, s in enumerate(solvers):
            peers = [a for j, a in enumerate(answers) if j != i]
            new_answers.append(s(round2_prompt.format(
                your_prior=answers[i], peer1=peers[0], peer2=peers[1])))
        answers = new_answers  # swap only after every solver has revised
    return llm(aggregator_prompt.format(final_answers=answers))
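When solvers follow the `ANSWER: <value>` convention from the solver prompts, the majority can also be counted deterministically, either as a replacement for the aggregator call or as a cross-check on it. This is a minimal sketch under that assumption; `majority_answer` is a hypothetical helper, and it compares answers as raw strings without normalizing them.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"^ANSWER:\s*(.+?)\s*$", re.MULTILINE)

def majority_answer(transcripts):
    """Return (most common final answer, share of solvers that gave it)."""
    finals = []
    for text in transcripts:
        matches = ANSWER_RE.findall(text)
        if matches:
            finals.append(matches[-1])      # the last ANSWER line wins
    if not finals:
        return None, 0.0
    value, count = Counter(finals).most_common(1)[0]
    return value, count / len(transcripts)

print(majority_answer([
    "7 * 8 = 56\nANSWER: 56",
    "8 * 7 = 56\nANSWER: 56",
    "6 * 7 = 42\nANSWER: 42",
]))
```

The share doubles as a rough confidence signal: a bare majority after the final round is a reasonable trigger for another round or for human review.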

Hierarchical Supervisor

Classification Multi-level orchestration

Intent

A top-level supervisor agent delegates to mid-level supervisors, which in turn delegate to specialist workers, forming a tree of responsibility.

Sketch

Diagram of Hierarchical Supervisor pattern — A tree of delegation: the top supervisor coordinates, never executes.

Motivating Problem

A single Orchestrator—Workers layer becomes a bottleneck and a context-window problem when the task is large enough that the orchestrator cannot fit all subtask outputs in its context. The fix is the standard one in software architecture: layer the orchestration.

How It Works

Each level of the hierarchy is itself an Orchestrator—Workers pattern, but the “workers” at one level are themselves supervisors at the next. Typed contracts between levels --- explicit schemas for what a subordinate is asked to do and what it must return --- keep the interfaces clean. Context windows are managed by aggressively summarizing subordinate outputs before they reach a supervisor.
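A typed contract between levels can be as small as two record types and a validator. The sketch below is one possible shape, not a schema from any framework: `TaskRequest`, `TaskResult`, `validate_result`, and the character cap on summaries are all assumptions chosen to show how the contract enforces summarization on the way up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskRequest:
    task_id: str
    instruction: str
    max_summary_chars: int = 500    # cap on what may flow back up

@dataclass(frozen=True)
class TaskResult:
    task_id: str
    status: str                     # "ok" | "failed"
    summary: str
    artifact_refs: tuple = ()       # pointers to full outputs, kept out of context

def validate_result(request, result):
    """Reject results that break the contract before they reach the supervisor."""
    if result.task_id != request.task_id:
        raise ValueError("result does not match request")
    if result.status not in ("ok", "failed"):
        raise ValueError(f"unknown status: {result.status}")
    if len(result.summary) > request.max_summary_chars:
        raise ValueError("summary exceeds the agreed budget")
    return result

req = TaskRequest("ch1", "Draft chapter 1", max_summary_chars=200)
res = TaskResult("ch1", "ok", "Chapter drafted; two claims need sources.",
                 artifact_refs=("drafts/ch1.md",))
print(validate_result(req, res).status)
```

Keeping full outputs behind `artifact_refs` means a supervisor's context holds only summaries, while the assembled document can still be built from the referenced artifacts.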

The pattern maps cleanly onto org-chart intuitions: a CEO does not read every line of code; they read summaries from VPs, who read summaries from directors. The same structural argument applies here.

When to Use It

When tasks are large enough that single-level orchestration overflows context windows or exceeds the orchestrator’s ability to reason over the whole. When the work has natural functional sub-domains (research, writing, review). When some sub-trees may run in parallel without contention.

The supervisor itself can become a bottleneck. Designs that route work peer-to-peer when possible, reserving the supervisor for coordination decisions only, scale better.

Sources

LangGraph multi-agent hierarchical patterns documentation --- reference implementation.
Roush et al., DeepDebater (2025) --- hierarchical multi-agent architecture for competitive policy debate.

Example

A long-form research report. The top supervisor decomposes the report into chapters and assigns each to a chapter lead. Each chapter lead decomposes its chapter into sections and assigns workers to draft, fact-check, and edit each section. Workers return finished sections; chapter leads return finished chapters; the top supervisor assembles the final document.

Example artifacts

Agent.

# A two-level hierarchy as nested Orchestrator-Workers

def top_supervisor(goal):
    chapter_plan = llm(top_prompt, goal=goal)  # decompose into chapters
    chapters = run_in_parallel([
        chapter_lead(ch.title, ch.outline) for ch in chapter_plan
    ])
    return llm(top_synth_prompt, goal=goal,
               chapters=summarize_each(chapters))

def chapter_lead(title, outline):
    section_plan = llm(chapter_prompt, title=title, outline=outline)
    sections = run_in_parallel([
        worker(s.instruction) for s in section_plan
    ])
    return llm(chapter_synth_prompt, title=title, sections=sections)

Autonomous Agent

Classification Looping, open-ended

Intent

An LLM operates a tool-use loop with no predetermined number of steps, deciding its own actions, observing the environment, and terminating when it judges the task complete.

Sketch

Diagram of Autonomous Agent pattern — Open-ended loop against a real environment; stop conditions, not step counts, end it.

Motivating Problem

Some tasks have no fixed step count and no a-priori decomposition. The number of files to edit is unknown; the number of pages to read is unknown; the number of corrections to make is unknown. Any pattern that requires the number of steps to be known up front fails on these tasks.

How It Works

The Autonomous Agent is structurally a ReAct loop, but with three features that distinguish it from the smaller ReAct patterns used inside other workflows:

No fixed horizon. The loop runs until the agent itself signals completion or until a stop condition triggers (max iterations, token budget, time limit, human halt).
Ground truth from the environment. Every action produces an observation from the real environment (filesystem, browser, API), not a model-generated stand-in.
Optional human checkpoints. The agent may pause for human review at predefined points; see Human-in-the-Loop Checkpoint.

Cost profile is unbounded by construction. Cost controls --- iteration caps, token budgets, watchdog timers --- are not optional. Errors compound: a wrong action early in a long trajectory can corrupt all subsequent reasoning. Sandboxed execution and rollback-capable environments are standard.
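The three cost controls named above fit in one small guard object that the loop charges on every iteration. This is a sketch under assumed names (`RunBudget`, `BudgetExceeded`); the injectable clock exists only so the time limit can be tested without waiting.

```python
import time

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_iter, max_tokens, max_seconds, clock=time.monotonic):
        self.max_iter = max_iter
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.iterations = 0
        self.tokens = 0

    def charge(self, tokens):
        """Record one loop iteration; raise as soon as any limit is crossed."""
        self.iterations += 1
        self.tokens += tokens
        if self.iterations > self.max_iter:
            raise BudgetExceeded("iteration cap")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget")
        if self.clock() - self.started > self.max_seconds:
            raise BudgetExceeded("time limit")

budget = RunBudget(max_iter=200, max_tokens=500_000, max_seconds=3600)
budget.charge(1_200)    # call once per loop iteration with that turn's usage
print(budget.iterations, budget.tokens)
```

Raising an exception rather than returning a flag makes the limit hard to ignore: the agent loop has to catch it deliberately, which is the natural place to write a checkpoint and a summary.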

When to Use It

Open-ended tasks where the step count cannot be predicted, the environment is trusted or sandboxed, and the model’s decision-making has proven reliable on comparable tasks. Coding agents on multi-file refactors, computer-use agents, deep research agents. Not for high-stakes one-shot actions where a single wrong action is unrecoverable.

Sources

Anthropic, Building Effective Agents --- names the pattern as the open-ended endpoint of the agentic spectrum.

Example

A coding agent given a GitHub issue. It explores the repository, identifies relevant files, drafts a patch, runs the test suite, observes failures, revises the patch, repeats until tests pass or a budget is exhausted. The number of iterations and the set of files touched are both determined at runtime.

Example artifacts

Prompt.

# Autonomous-agent system prompt

You are an autonomous coding agent operating in a sandboxed repo.

Tools available: {tools}

Goal: {goal}

Operate the loop yourself: think, act, observe, repeat. Continue
until you have achieved the goal or determined it is impossible.
You may pause and call the request_human() tool only if genuinely
blocked. You have a budget of {max_iter} iterations and {token_budget}
tokens. Track your remaining budget in your scratchpad.

Terminate by calling done(<summary>) when the goal is achieved.

Agent.

def autonomous(goal, tools, max_iter=200, token_budget=500_000):
    tokens_used = 0
    context = [{"role": "user", "content": goal}]
    for i in range(max_iter):
        if tokens_used > token_budget:
            return abort("token budget")
        response = llm(autonomous_system_prompt, context, tools=tools)
        tokens_used += response.usage.total_tokens
        action = parse_action(response)
        if action.name == "done":
            return action.args["summary"]
        if action.name == "request_human":
            return await_human(action.args)
        if action.name not in tools:
            # Feed the mistake back as an observation instead of crashing.
            observation = f"error: unknown tool {action.name!r}"
        else:
            observation = tools[action.name](**action.args)
        context.extend(append_turn(response, observation))
    return abort("iteration cap")

Human-in-the-Loop Checkpoint

Classification Synchronization barrier

Intent

Pause an otherwise-autonomous workflow at defined points to obtain human approval, correction, or judgment before proceeding.

Sketch

Diagram of Human-in-the-Loop Checkpoint pattern — The workflow halts at a checkpoint; the human approves, edits, or rejects.

Motivating Problem

Pure autonomy has unbounded downside: a confidently wrong agent can do real damage before anyone notices. Pure manual operation has unbounded latency. The middle ground is autonomy with explicit checkpoints --- the workflow runs unattended through the routine stages and pauses at the high-stakes ones.

How It Works

Identify the points where (a) the cost of a wrong autonomous decision is high, (b) human judgment adds substantial value, or (c) a regulatory or policy requirement mandates human review. At each such point, the workflow halts, presents a structured summary to a human, and resumes only after explicit approval. The human’s response is stored in the workflow’s audit trail.

Modern orchestration frameworks (LangGraph, Temporal) implement this as durable, pausable execution: the workflow’s full state is checkpointed; a human can review hours or days later; resumption picks up exactly where the pause occurred.
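The essence of durable, pausable execution is that the paused state lives outside the process. The sketch below uses a JSON file as a stand-in for a framework's persistence layer; it is not how LangGraph or Temporal implement it, and `save_checkpoint`, `resume`, and the state fields are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Persist workflow state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)           # atomic rename on the same filesystem

def resume(path, decision):
    """Reload the paused state and attach the human's decision to its audit trail."""
    with open(path) as f:
        state = json.load(f)
    state.setdefault("audit", []).append(decision)
    state["status"] = "aborted" if decision["action"] == "reject" else "resumed"
    return state

with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "run.json")
    save_checkpoint(checkpoint_path, {"step": "deploy", "status": "paused"})
    # ...hours or days later, possibly in a different process...
    state = resume(checkpoint_path, {"action": "approve", "by": "reviewer"})
    print(state["status"])          # resumed
```

Because the process that resumes need not be the one that paused, the reviewer's delay costs nothing in compute, and the appended decision gives the audit trail for free.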

Three common checkpoint placements:

Before irreversible actions (deletes, deployments, financial transactions).
At decision branches with high uncertainty.
At end-of-stage gates before passing work to a downstream consumer.

When to Use It

Whenever the workflow’s actions cross trust, regulatory, or reversibility boundaries; whenever the cost of an unattended wrong decision exceeds the cost of human review time; whenever audit trails matter.

Sources

LangGraph “human-in-the-loop” documentation --- the reference for durable pausable workflows in the agent ecosystem.
Temporal documentation on durable execution and signals --- the general distributed-systems primitive HITL builds on.

Example

A legal-document drafting workflow runs autonomously through ingestion, clause classification, draft generation, and self-review. Before the draft is sent to the client, the workflow halts at a checkpoint: a paralegal reviews the draft, edits inline if needed, and approves. Only then does the workflow send.

Example artifacts

Agent.

# LangGraph-style decorator pseudocode; checkpoint, Checkpoint,
# human_response, and execute are illustrative names, not a real API.
@checkpoint(approval_required=True, risk="high")
async def deploy_to_prod(plan):
    payload = Checkpoint(
        summary=f"Deploy v{plan.version} to prod?",
        diff=plan.diff,
        affected_services=plan.services,
        rollback_command=plan.rollback,
    )
    # Workflow durably suspended here. State persists for hours/days.
    # decision.action is one of 'approve' | 'reject' | 'edit'.
    decision = await human_response(payload)
    if decision.action == "approve":
        return await execute(plan)
    if decision.action == "edit":
        return await execute(decision.edited_plan)
    raise RejectedByHuman(decision.reason)

Skill (SKILL.md sketch).

---
name: hitl-checkpoint
description: Insert a human approval pause into an otherwise-autonomous workflow. Use before any irreversible action (deletes, deployments, outbound communication, financial transactions).
---

# Human-in-the-Loop Checkpoint

## Checkpoint payload (always include)

- summary: one sentence describing what will happen
- risk: low | medium | high
- diff or plan: what changes, structured
- reversible: true | false
- rollback: command or steps to undo, if reversible

## Resume semantics

- approve → proceed as planned
- edit → proceed with the human's edits
- reject → record reason and abort downstream stages

Always persist the checkpoint payload AND the response to the audit log.

Deep Research Agent

Classification Hierarchical / hybrid

Intent

Decompose an open-ended research question into parallel sub-investigations; each sub-investigation runs its own ReAct-style loop over external sources; a synthesizer combines the findings into a long-form report.

Sketch

Diagram of Deep Research Agent pattern — A planner spawns multiple ReAct sub-agents; the synthesizer assembles a cited report.

Motivating Problem

A single ReAct agent doing open-ended research hits two limits: its context window fills with search results before it has covered the topic, and it tends to commit early to one line of inquiry. Splitting the research across sub-agents --- each with its own clean context --- solves both problems.

How It Works

A planner reads the research question and emits a structured set of sub-questions. Each sub-question is delegated to a sub-agent running its own ReAct loop with search, fetch, and read tools. The sub-agent operates in its own context window, producing a focused intermediate summary with sources. When all sub-agents complete (or a budget is exhausted), a synthesizer reads the intermediate summaries and the original question, and produces the final report with citations.

The pattern is essentially Orchestrator—Workers + ReAct + Hierarchical, with the synthesis stage doing real intellectual work --- resolving contradictions across sub-agents, organizing material, deciding what is and isn’t in scope.

Production implementations add a feedback loop: the synthesizer may identify gaps and spawn additional sub-agents to fill them. This is Evaluator—Optimizer wrapped around the whole assembly.
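The gap-filling loop can be sketched as a bounded outer loop around investigate-then-review. Everything model-driven is injected as a callable so the control flow is visible on its own; `research_with_gap_filling` and its parameters are hypothetical, and sub-questions run sequentially here for clarity where a real system would run them in parallel.

```python
def research_with_gap_filling(question, plan, investigate, find_gaps,
                              synthesize, max_rounds=2):
    """plan: list of sub-questions. investigate, find_gaps, and synthesize are
    stand-ins for sub-agent runs and LLM calls."""
    findings = {}
    pending = list(plan)
    for _ in range(max_rounds):
        for sub_question in pending:
            findings[sub_question] = investigate(sub_question)
        # Drop gaps that were already investigated so every round makes progress.
        pending = [g for g in find_gaps(question, findings) if g not in findings]
        if not pending:
            break
    return synthesize(question, findings)

# Toy stand-ins: the reviewer notices that Japan was never covered.
report = research_with_gap_filling(
    "Compare AI regulation",
    plan=["EU", "US"],
    investigate=lambda sq: f"summary of {sq}",
    find_gaps=lambda q, f: [] if "Japan" in f else ["Japan"],
    synthesize=lambda q, f: sorted(f),
)
print(report)   # ['EU', 'Japan', 'US']
```

The `max_rounds` cap matters: without it, a reviewer that always finds one more gap turns a bounded research run into an open-ended one.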

When to Use It

Long-form research tasks where the question is too broad for a single agent’s context, where parallel investigation is feasible, and where the user expects a structured, cited report rather than a conversational answer.

Sources

Anthropic, Claude’s Research feature --- the canonical productized version of the pattern (technical write-ups, 2024-2025).
OpenAI, Deep Research product (2025) --- independent implementation.

Example

“Compare how the EU, US, and Japan are regulating AI in 2026.” The planner emits sub-questions per jurisdiction and per regulatory dimension (training data, model registration, deployment). A dozen sub-agents run in parallel; each returns a sourced summary; the synthesizer organizes the material by dimension and produces a comparative report.

Example artifacts

Prompt.

# Planner prompt

"Decompose this research question into 5-10 independent
sub-questions. Each sub-question should be answerable in 30 minutes
of focused web search, and have a single clear scope. Avoid overlaps.

Output JSON: [{ \"id\": \"sq1\", \"question\":
\"...\", \"scope\": \"...\" }]

Question: {question}"

# Sub-agent system prompt

"You research one sub-question. Use search, fetch, and read tools.
Cite every claim by URL. When you have a complete answer, call
done(summary, citations). Budget: 20 tool calls."

# Synthesizer prompt

"Original question: {question}

Sub-question findings (with citations): {findings}

Write a structured, cited report. Resolve contradictions explicitly.
Flag gaps."

Agent.

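import asyncio
import json

async def deep_research(question, sub_agent_budget=20):
    plan = json.loads(await llm(planner_prompt, question=question,
                                model="claude-opus-4-7"))
    # return_exceptions=True keeps one failed sub-agent from discarding
    # the findings of all the others.
    results = await asyncio.gather(*[
        run_react_subagent(sq, max_iter=sub_agent_budget) for sq in plan
    ], return_exceptions=True)
    findings = [r for r in results if not isinstance(r, Exception)]
    return await llm(synthesizer_prompt,
                     question=question, findings=findings,
                     model="claude-opus-4-7")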

Skill (SKILL.md sketch).

---
name: deep-research
description: Produce a long-form, cited research report on an open-ended question. Use when the user asks "compare," "what are the latest developments in," or otherwise expects a structured multi-page answer with citations.
---

# Deep Research Agent

## Phases (do not skip any)

1. PLAN: decompose the question into 5-10 independent sub-questions.
   Show the plan to the user; allow them to edit before launch.
2. INVESTIGATE: launch a sub-agent per sub-question. Each uses search,
   fetch, and read tools. Each writes its findings with URL citations.
3. SYNTHESIZE: assemble a structured report. Resolve contradictions
   explicitly. Flag remaining gaps.

## Budgets

- Per sub-agent: 20 tool calls, 50k tokens
- Whole run: 12 sub-agents max, 600k tokens

## Output format

Markdown with: executive summary, sections per major theme, citations as footnotes with URLs.

Goal Setting and Monitoring

Classification Cross-cutting (supervisory loop over a base pattern)

Intent

Give an agent explicit, measurable objectives and a monitoring loop that scores progress against them while the task runs, so drift is corrected mid-flight instead of discovered at the end.

Sketch

Diagram of Goal Setting and Monitoring pattern — A monitor checks the agent's trajectory against explicit criteria at each checkpoint: continue, replan, or escalate.

Motivating Problem

A task prompt states what the agent should do; nothing at runtime checks that it is actually being achieved. Long-horizon autonomous work fails silently: the agent keeps producing plausible steps while diverging from the objective, burning budget on a trajectory nobody wants. By the time a human reads the final output, the cheapest moment to intervene is hours behind. The same gap appears in operational deployments --- an agent bound by an SLA, a token budget, or a quality bar has no mechanism that notices when those constraints are about to be violated.

How It Works

Three parts: specify, instrument, act.

Specify. Decompose the objective into success criteria that can be checked --- a metric and a threshold, a deadline, a budget. “Refactor the module” becomes “all 214 tests pass, public API unchanged, under 2M tokens.” Criteria that cannot be checked mechanically get an LLM-as-judge rubric instead, but every criterion gets an explicit form. Vague goals produce vague monitoring.

Instrument. At a fixed cadence --- every N steps, every checkpoint, every K minutes --- a monitor evaluates the trajectory against the criteria. Use deterministic checks wherever possible (test runners, linters, token counters, diff inspectors); reserve LLM judgment for criteria that resist mechanization. The monitor sees a compressed summary of progress, not the full transcript.

Act. The monitor’s verdict feeds a small decision rule: on track (continue), drifting (inject a correction or trigger replanning), or off the rails (halt and escalate to a human --- see Human-in-the-Loop Checkpoint). The verdict and its evidence are logged either way; the trail is what makes overnight autonomy auditable.

The distinction from Evaluator—Optimizer matters: that pattern critiques an artifact inside one generate-critique iteration. Goal monitoring judges the trajectory of a whole task across many steps, and its verdict controls whether the task continues at all. The classic failure mode is Goodhart’s law --- an agent optimizing the metric rather than the objective --- so prefer several coarse criteria over one precise proxy, and keep a human check on the criteria themselves.
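The decision rule in the Act step can be made fully deterministic when every criterion has a mechanical check. The sketch below assumes one extra piece of structure not stated above: each criterion is marked hard or soft, with a violated hard criterion halting the run and a violated soft one triggering a replan. `Criterion` and `verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[dict], bool]   # deterministic check over a progress snapshot
    hard: bool = False              # hard criteria halt the run when violated

def verdict(criteria, snapshot):
    """Return (decision, violated_names): ON_TRACK, REPLAN, or ABORT."""
    violated = [c for c in criteria if not c.check(snapshot)]
    names = [c.name for c in violated]
    if any(c.hard for c in violated):
        return "ABORT", names
    if violated:
        return "REPLAN", names
    return "ON_TRACK", names

criteria = [
    Criterion("api_stable", lambda s: not s["api_changed"]),
    Criterion("under_budget", lambda s: s["tokens"] < 2_000_000, hard=True),
]
print(verdict(criteria, {"api_changed": True, "tokens": 400_000}))
# ('REPLAN', ['api_stable'])
```

Returning the violated names alongside the decision gives the log entry its evidence and gives a replanning step the exact criterion to quote back to the agent.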

When to Use It

Any time an agent runs longer than a human is watching: overnight jobs, budgeted autonomous runs, SLA-bound operations. The overhead is one cheap call per checkpoint --- negligible against the cost of a multi-hour run that quietly went wrong at minute ten. Skip it for short interactive tasks, where the human in the conversation is the monitor.

Sources

Gulli, Agentic Design Patterns: A Hands-On Guide to Building Intelligent Systems (2025) --- names the pattern.

Example

An overnight dependency-upgrade agent is given three criteria: the full test suite passes, no public API signature changes, total spend under 2M tokens. A monitor runs every ten steps. At step 40 it notices the agent has begun rewriting call sites instead of adapting the wrapper --- an API-change violation in progress --- and triggers a replan with the violated criterion quoted. At step 90 the budget criterion trips and the run halts with a checkpoint and a summary for morning review.

Example artifacts

Prompt.

# Monitor prompt (run at every checkpoint)

You are the progress monitor for a long-running agent task.

Objective: {objective}

Success criteria: {criteria_json}

Progress summary: {state_summary}

Budget consumed: {tokens_used} of {token_budget} tokens

Judge whether the work is on track to satisfy every criterion.

Answer with exactly one of:

ON_TRACK | REPLAN: <what to change> | ABORT: <why>

Agent.

import json

TOKEN_BUDGET = 2_000_000

GOALS = {
    "tests_pass": lambda s: s.test_failures == 0,
    "api_stable": lambda s: not s.public_api_changed,
    "under_budget": lambda s: s.tokens_used < TOKEN_BUDGET,
}

def monitored_run(agent, task, check_every=10):
    state = agent.start(task)
    while not state.done:
        state = agent.step(state)
        if state.step_count % check_every:
            continue  # not a checkpoint step
        failed = [g for g, ok in GOALS.items() if not ok(state)]
        if not failed:
            continue  # all deterministic checks pass; skip the LLM call
        # Fill every placeholder the monitor prompt declares.
        verdict = llm(monitor_prompt, objective=task,
                      criteria_json=json.dumps(failed),
                      state_summary=state.summary(),
                      tokens_used=state.tokens_used,
                      token_budget=TOKEN_BUDGET,
                      model="claude-haiku-4-5")
        log_verdict(state.step_count, failed, verdict)
        if verdict.startswith("REPLAN"):
            state = agent.replan(state, reason=verdict)
        elif verdict.startswith("ABORT"):
            return checkpoint_and_escalate(state, failed)
    return state.result

Prioritization

Classification Sequential (queue discipline in front of any executor)

Intent

Rank competing tasks by value, urgency, effort, and dependency-readiness before committing agent capacity, and re-rank as conditions change.

Sketch

Diagram of Prioritization pattern — Tasks are scored on arrival; a priority queue with aging decides what the executor works on next.

Motivating Problem

An agent with a task queue and no queue discipline serves the trivial and the critical alike, in arrival order. The problem compounds when agents generate their own subtasks --- an Orchestrator—Workers system or a BabyAGI-style loop can flood its queue faster than it drains it, and without a ranking the flood decides what gets done. Tokens, tool budgets, and wall-clock are finite; spending them in FIFO order is a decision, just an unexamined one.

How It Works

Three parts: score, order, re-rank.

Score. Each task gets a priority from a scoring function over a few explicit factors: business value, urgency (cost of delay), estimated effort, and whether its dependencies are satisfied. Score deterministically where the factors are known (SLA class, deadline distance, customer tier); use a cheap LLM call where value judgment is genuinely fuzzy. Keep the factors visible --- a disputed priority should be arguable from its inputs.

Order. A priority queue serves the highest score among ready tasks (dependencies satisfied). Add an aging term so low-priority tasks rise with waiting time instead of starving --- the classic queue-discipline failure.

Re-rank. Priorities are not static. New arrivals, deadline shifts, and completed dependencies all change the picture; re-score on those events rather than on every dequeue. Preemption --- suspending a running task for a higher-priority arrival --- is possible but expensive for agents, since resuming means restoring context; reserve it for genuine emergencies and checkpoint before switching (see Exception Handling and Recovery).

The LLM can sit in two seats: scorer (rate this task’s value) or scheduler (given the queue, pick next). The scorer seat composes better --- deterministic code owns the ordering, and the model’s judgment is confined to the one factor that needs it.
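
The scorer seat can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the SLA_VALUE table, the urgency bands, the Task fields, and the fuzzy_value callable (a stand-in for the cheap LLM call) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical lookup table; real factors would come from account records.
SLA_VALUE = {"premium": 9, "standard": 5}

@dataclass
class Task:
    description: str
    sla_class: str            # may be absent from SLA_VALUE
    hours_to_deadline: float
    effort: int               # 1-10, e.g. from historical resolution times

def urgency_from_deadline(hours: float) -> int:
    """Deterministic urgency bands on a 1-10 scale (illustrative cutoffs)."""
    if hours <= 4:
        return 10
    if hours <= 24:
        return 7
    if hours <= 72:
        return 4
    return 1

def score_task(task: Task,
               fuzzy_value: Optional[Callable[[str], int]] = None) -> dict:
    """Score from known factors; consult the model only when value is unknown."""
    if task.sla_class in SLA_VALUE:
        value, source = SLA_VALUE[task.sla_class], "sla_table"
    elif fuzzy_value is not None:
        value, source = fuzzy_value(task.description), "model"
    else:
        raise ValueError("no deterministic value and no scorer supplied")
    return {"value": value,
            "urgency": urgency_from_deadline(task.hours_to_deadline),
            "effort": task.effort,
            "value_source": source}   # keeps a disputed score arguable

# Usage: a stub lambda stands in for the cheap LLM call.
known = score_task(Task("Checkout outage", "premium", 2, 3))
fuzzy = score_task(Task("Vague feature idea", "unknown", 100, 5),
                   fuzzy_value=lambda text: 6)
print(known)
print(fuzzy)
```

Recording value_source alongside the factors is one way to keep the inputs visible: anyone disputing a rank can see whether a table or the model produced the number.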

When to Use It

Whenever there are more tasks than capacity and the tasks differ in consequence: triage queues, multi-tenant agents, task-generating orchestrators. Skip it for single linear workflows, where order is dictated by the work itself.

Sources

Gulli, Agentic Design Patterns (2025) --- names the pattern.
Nakajima, BabyAGI (2023) --- the canonical early implementation: a task-creation agent paired with a task-prioritization agent over a shared queue.

Example

A support automation agent receives tickets continuously. Each ticket is scored on arrival: SLA class and revenue at risk from the account record, urgency from a Haiku sentiment-and-severity call, effort from historical resolution times for the category. The queue serves by score with an aging bonus of half a point per hour waited. When a premium-tier outage arrives, it lands at the top of the queue; a two-day-old low-tier question eventually outranks fresh routine tickets and gets handled rather than starving.
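
The aging arithmetic in this example can be checked directly. In the sketch below only the half-point-per-hour rate comes from the example; the two base scores are hypothetical values chosen for illustration.

```python
AGING_PER_HOUR = 0.5  # from the example: half a point per hour waited

def effective_priority(base, hours_waited, rate=AGING_PER_HOUR):
    """Base score plus a linear aging bonus."""
    return base + rate * hours_waited

def hours_until_tie(low_base, high_base, rate=AGING_PER_HOUR):
    """Waiting time at which an old task ties a fresh task scored high_base."""
    return max(0.0, (high_base - low_base) / rate)

# Hypothetical base scores, chosen only for illustration.
low_tier_question = 2.0
fresh_routine_ticket = 20.0

crossover = hours_until_tie(low_tier_question, fresh_routine_ticket)
two_days_old = effective_priority(low_tier_question, 48)
print(crossover)      # 36.0
print(two_days_old)   # 26.0
```

Under these assumed scores the low-tier question overtakes fresh routine tickets after 36 hours, so at two days old it is served ahead of them, while a fresh task with a much higher base score would still go first.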

Example artifacts

Prompt.

# Scoring prompt (one cheap call per new task; batchable)

Score this task for an automated work queue. Emit JSON only:

{
  "value": 1-10,      // impact if completed
  "urgency": 1-10,    // cost of delay
  "effort": 1-10,     // estimated size
  "blocked_by": []    // ids of tasks that must finish first
}

Task: {task_description}

Queue context: {queue_context}

Agent.

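import time

class DeadlockError(RuntimeError):
    """Raised when tasks remain but every one of them is blocked."""

def priority(score, enqueued_at, now):
    # Value and urgency raise priority; effort lowers it (all scored 1-10).
    base = (score["value"] * score["urgency"]) / max(score["effort"], 1)
    age_hours = (now - enqueued_at) / 3600
    return base + 0.5 * age_hours   # aging prevents starvation

def run_queue(queue, agent):
    # Assumes each task has .id, .score, .enqueued_at, and
    # .blocked_by as a set of task ids (convert the scorer's JSON list).
    while queue:
        now = time.time()
        ready = [t for t in queue if not t.blocked_by]
        if not ready:
            raise DeadlockError(f"{len(queue)} tasks left, all blocked")
        task = max(ready, key=lambda t:
                   priority(t.score, t.enqueued_at, now))
        queue.remove(task)
        agent.run(task)
        for t in queue:                 # unblock dependents
            t.blocked_by.discard(task.id)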

Exception Handling and Recovery

Classification Cross-cutting (failure-path control around any pattern)

Intent

Detect failures in LLM calls, tools, and downstream services; retry what is transient, repair what is malformed, fall back when a dependency is down, and checkpoint state so long tasks resume instead of restarting.

Sketch

Diagram of Exception Handling and Recovery pattern — Failures are classified: transient errors retry with backoff, semantic errors are repaired by the model, persistent failures open the breaker and fall back; checkpoints make recovery a resume.

Motivating Problem

Demos run the happy path. Production runs into tool timeouts, rate limits, malformed JSON, hallucinated tool arguments, and APIs that were up an hour ago. An agent loop with no failure discipline does one of two bad things: it treats the error text as ordinary context and spirals --- retrying the same broken call, apologizing to a stack trace --- or it crashes and throws away forty minutes of accumulated work. Failure is not the exceptional case for a tool-using agent; over a long enough task it is a certainty, and the architecture has to price it in.

How It Works

Start from a failure taxonomy, because the right response depends on the class:

Transient (timeouts, rate limits, 5xx): retry with exponential backoff and jitter, capped by max_retries. Never retry in a tight loop --- coordinated retries are how one outage becomes two.
Persistent (a dependency that keeps failing): a circuit breaker trips after K consecutive failures and short-circuits further calls to a fallback --- an alternative tool, cached data, or an honestly degraded answer --- until a probe succeeds. This is Nygard’s pattern imported wholesale from distributed systems, and it transfers cleanly.
Semantic (the call succeeded, the content is wrong: unparseable JSON, a hallucinated argument, a tool used against its contract): feed the error back to the model with an explicit repair instruction. Models are good at fixing their own malformed calls when told precisely what broke; they are bad at noticing unprompted.
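
The circuit breaker in the Persistent case can be sketched in a few lines. This is a simplified, single-threaded illustration rather than Nygard's full design: the class name, the threshold and cooldown defaults, and the injectable clock are assumptions made here, and the half-open state is approximated by reporting the breaker as closed once the cooldown has elapsed so that the next call acts as the probe.

```python
import time

class CircuitBreaker:
    """Per-dependency breaker: opens after `threshold` consecutive failures,
    then lets a probe call through once `cooldown` seconds have passed."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock        # injectable so tests need not sleep
        self.failures = {}        # name -> consecutive failure count
        self.opened_at = {}       # name -> time the breaker last opened

    def is_open(self, name):
        opened = self.opened_at.get(name)
        if opened is None:
            return False
        # Once the cooldown has elapsed, report closed so the caller probes.
        return self.clock() - opened < self.cooldown

    def record_success(self, name):
        self.failures[name] = 0
        self.opened_at.pop(name, None)    # a successful probe closes it

    def record_failure(self, name):
        self.failures[name] = self.failures.get(name, 0) + 1
        if self.failures[name] >= self.threshold:
            self.opened_at[name] = self.clock()   # open, or restart cooldown

# Usage with a fake clock standing in for real time.
now = [0.0]
breaker = CircuitBreaker(threshold=3, cooldown=10.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure("crm_api")
print(breaker.is_open("crm_api"))   # True
```

A failed probe restarts the cooldown because the failure count is still at or above the threshold; a successful one resets both the count and the open state.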

Beneath all three: checkpointing. Persist task state after each completed step, so recovery means resuming from step 41, not restarting from zero. Durable-execution runtimes (Temporal, Restate, Inngest) and framework checkpointers (LangGraph) provide this off the shelf. And when the budget of retries, repairs, and fallbacks is exhausted, fail loudly to a human with the checkpoint attached --- see Human-in-the-Loop Checkpoint.
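
For readers not using a durable-execution runtime, the resume behaviour can be approximated with a small file-based checkpoint. The sketch below is an assumption-laden stand-in, not how Temporal or LangGraph implement it: run_with_checkpoints and the JSON file layout are hypothetical, the state must be JSON-serializable, and a step interrupted mid-run is executed again on resume, so steps should be safe to repeat.

```python
import json
import os

def run_with_checkpoints(steps, path):
    """Run `steps` (functions taking and returning a state dict) in order,
    persisting progress after each so a restart resumes mid-task."""
    done, state = 0, {}
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        done, state = saved["done"], saved["state"]
    for i in range(done, len(steps)):
        state = steps[i](state)
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"done": i + 1, "state": state}, f)
        os.replace(tmp, path)
    return state
```

On a second invocation with the same path, completed steps are skipped and execution continues from the first step that has no checkpoint.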

The boundary with Cascade/Fallback: cascade escalates to a stronger model on low confidence, a cost optimization on the happy path. This pattern handles faults. They compose --- each cascade level wants its own retry-and-repair discipline.

When to Use It

Every production deployment; the only question is how much. Interactive tools need retries and semantic repair. Long-running autonomous agents additionally need checkpoints and circuit breakers --- for them this pattern is the difference between “resumed after the rate-limit window” and “lost the night’s work.”

Sources

Gulli, Agentic Design Patterns (2025) --- names the pattern for agentic systems.
Nygard, Release It! (2007) --- circuit breaker and bulkhead; the failure-path canon this pattern imports.

Example

A data-integration agent syncs records through a third-party API. Rate-limit responses trigger backoff-with-jitter retries; after five consecutive failures the breaker opens and the agent switches to yesterday’s cached schema, flagging records for re-validation. When the model emits a filter expression the API rejects, the error body is fed back with a repair prompt and the corrected call succeeds. State checkpoints after every batch; when the job is killed mid-run, it resumes at batch 17 of 60.

Example artifacts

Prompt.

# Repair prompt (fed back to the model on a semantic failure)

Your last tool call failed.

Call: {tool_name}({arguments})

Error: {error_message}

Attempt {attempt} of {max_retries}.

Diagnose the cause. If the arguments were wrong, emit a corrected call. If the tool cannot do this at all, choose an alternative from your remaining tools, or reply IMPASSE: <what you still need>.

Agent.

import random, time

# Assumes breaker, fallback_for, llm, repair_prompt, parse_tool_call,
# checkpoint_and_escalate, and the two error classes come from the runtime.
def call_tool(tool, args, max_retries=4):
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        if breaker.is_open(tool.name):
            return fallback_for(tool.name, args)  # cached/degraded
        try:
            result = tool(**args)
            breaker.record_success(tool.name)
            return result
        except TransientError:
            breaker.record_failure(tool.name)     # counts toward tripping
            if attempt < max_retries:             # no sleep after last try
                time.sleep(delay + random.uniform(0, delay))
                delay *= 2
        except DeterministicError as e:           # semantic failure: repair
            reply = llm(repair_prompt, tool_name=tool.name,
                        arguments=args, error_message=str(e),
                        attempt=attempt, max_retries=max_retries)
            if reply.startswith("IMPASSE"):
                break
            args = parse_tool_call(reply).arguments
    # Retries and repairs exhausted: hand off with state attached.
    return checkpoint_and_escalate(tool, args)

Resource-Aware Optimization

Classification Cross-cutting (runtime cost and latency policy)

Intent

Make cost, latency, and token budget explicit decision variables at runtime --- choosing model, cache, batch, and context size per request --- instead of paying the peak price for every call.

Sketch

Diagram of Resource-Aware Optimization pattern — A cached, triaged front door sends each request down the cheapest adequate path; budgets and cost metrics govern the whole.

Motivating Problem

Chapter 5’s table is the warning: agentic patterns multiply LLM calls, and several of them multiply without bound. A system that sends every one of those calls to the most capable model at interactive latency has its unit economics decided by its worst case. Meanwhile most of the calls did not need the best model, did not need a fresh computation, and did not need an answer in two seconds. Budgets --- per task, per tenant, per month --- are organizational constraints, and a runtime that cannot see them cannot honor them.

How It Works

This pattern is a discipline wired into the runtime rather than a single mechanism. The toolbox, roughly in order of leverage:

Model selection. Match model tier to task difficulty --- Routing does it by classification, Cascade/Fallback by observed confidence. Both are special cases of this pattern.
Caching. Prompt/prefix caching for the shared system prompt and stable context; response caching keyed on a request fingerprint for repeated questions. Often the single largest saving in high-volume systems.
Batching. Anything non-interactive goes through a batch API at a steep discount --- typically half price --- in exchange for latency measured in minutes or hours.
Token budgets. Per-task and per-tenant ceilings enforced by the runtime, not suggested in the prompt. A budget the model can overrun is a wish.
Context discipline. Compression, truncation, and selective retrieval keep the paid-per-token context to what the call needs (Volume 15 treats this in depth).
Tiering. Spend follows value: premium tenants and high-stakes tasks get the strong model and the fast path; the long tail gets the efficient one.
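
A runtime-enforced token budget, as opposed to one suggested in the prompt, might look like the sketch below. The TokenBudget and BudgetExceeded names, the charge_or_raise signature, and the 0.8 alert fraction are assumptions for illustration; a production version would also need persistence and period resets.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push usage past the ceiling."""

class TokenBudget:
    """Hard token ceiling for one tenant or task, enforced in code."""

    def __init__(self, ceiling, alert_fraction=0.8):
        self.ceiling = ceiling
        self.alert_fraction = alert_fraction
        self.used = 0

    def charge_or_raise(self, tokens):
        """Reserve `tokens` before the call is made.

        Returns True when usage has reached the alert threshold.
        Raises BudgetExceeded, charging nothing, if the ceiling would be passed.
        """
        if self.used + tokens > self.ceiling:
            raise BudgetExceeded(
                f"{self.used} used + {tokens} requested > {self.ceiling}")
        self.used += tokens
        return self.used >= self.alert_fraction * self.ceiling

# Usage: the third charge is refused and leaves the balance untouched.
budget = TokenBudget(ceiling=1000)
print(budget.charge_or_raise(700))   # False
print(budget.charge_or_raise(100))   # True
try:
    budget.charge_or_raise(300)
except BudgetExceeded as err:
    print("refused:", err)
```

Charging before the call, from an estimate, is the design choice that makes the ceiling hard: the request that would overrun it is never sent.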

The connective tissue is measurement: cost per task, tracked as a first-class production metric with the same seriousness as error rate. Without it every threshold above is a guess. The failure mode is over-rotation --- degrading quality invisibly to save visible dollars --- so pair every downgrade decision with a quality signal (an eval sample, a verifier, user feedback) that can veto it.
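
One way to give a quality signal veto power over a downgrade is to compare sampled eval scores for the two tiers before switching. The function below is a hypothetical sketch: choose_tier, the 0.02 tolerance, and the 30-sample minimum are illustrative assumptions, and a real system would likely use a proper statistical test rather than a raw difference of means.

```python
from statistics import mean

def choose_tier(cheap_scores, strong_scores, max_drop=0.02, min_samples=30):
    """Return 'cheap' only if sampled quality stays within tolerance.

    Scores are per-item eval results in [0, 1] for the same sample of
    requests, run once on each tier.
    """
    if len(cheap_scores) < min_samples or len(strong_scores) < min_samples:
        return "strong"   # not enough evidence to justify a downgrade
    drop = mean(strong_scores) - mean(cheap_scores)
    return "cheap" if drop <= max_drop else "strong"

# Usage: one miss in thirty is a drop of about 0.033, which vetoes the switch.
print(choose_tier([1] * 29 + [0], [1] * 30))   # strong
print(choose_tier([1] * 30, [1] * 30))         # cheap
```

Defaulting to the strong tier when samples are thin keeps the failure direction visible in dollars rather than invisible in quality.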

When to Use It

Any deployment at scale, and any deployment with a budget --- which is to say, nearly all of them. The machinery pays for itself once call volume makes cost-per-task a number someone reviews. For a low-volume internal tool, a fixed sensible model choice is fine and this pattern is premature.

Sources

Gulli, Agentic Design Patterns (2025) --- names the pattern.
Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (2023) --- the formal treatment of LLM cost-quality tradeoffs.

Example

A document-processing SaaS triages every request with a cached Haiku call. Routine extraction runs on Haiku; documents the verifier flags as ambiguous escalate to Opus. The shared 20k-token system prompt is prompt-cached; overnight re-processing runs through the batch API; each tenant has a monthly token ceiling with alerts at 80%. Cost per document drops an order of magnitude against the original always-Opus implementation, with the verifier holding extraction accuracy flat.

Example artifacts

Prompt.

# Triage prompt (cheap model, cached system prefix)

Estimate the difficulty of the request below for routing.

Emit JSON only:

{"difficulty": "low" | "medium" | "high",
 "interactive": true | false}

Request: {request}

Agent.

import json

POLICY = {
    "low":    dict(model="claude-haiku-4-5",  max_tokens=1024),
    "medium": dict(model="claude-sonnet-4-6", max_tokens=4096),
    "high":   dict(model="claude-opus-4-7",   max_tokens=8192),
}

def handle(request, tenant):
    if (hit := response_cache.get(request.fingerprint)):
        return hit
    triage = json.loads(llm(triage_prompt, request=request.text,
                            model="claude-haiku-4-5"))
    cfg = POLICY[triage["difficulty"]]
    # Charge before either path so batch work cannot bypass the ceiling.
    tenant.budget.charge_or_raise(estimate_cost(request, cfg))
    if not triage["interactive"]:
        return batch_queue.submit(request, **cfg)
    reply = llm(request.text, system=SHARED_PREFIX, **cfg)
    response_cache.put(request.fingerprint, reply)
    metrics.record("cost_per_task", cost_of(reply), tenant.id)
    return reply

Learning and Adaptation

Classification Looping across episodes (offline consolidation)

Intent

Improve an agent across episodes --- not within one task --- by turning outcomes, corrections, and feedback into durable updates: memories, instructions, skills, and only rarely weights.

Sketch

Diagram of Learning and Adaptation pattern — Feedback consolidates offline into memories, instructions, and skills — adopted only after the eval gate passes.

Motivating Problem

A stateless agent makes the same mistake every day. The user corrects the same formatting choice for the fortieth time; the support bot re-attempts the resolution that has failed on this account before; the Reflexion memory that fixed a mistake mid-task evaporates when the session ends. Every deployed agent generates a stream of signal about its own failures --- human takeovers, edited outputs, thumbs-down, abandoned sessions --- and by default that signal is discarded. Improvement then happens only when an engineer hand-edits the prompt, which does not scale past the first few lessons.

How It Works

A capture step, a consolidation loop, and a graduated set of storage targets.

Capture. Log feedback with its episode: explicit signals (ratings, corrections, approvals) and implicit ones (human takeover, edit distance between the agent’s draft and what was actually used, task success or failure). The implicit signals are noisier but arrive for free on every episode.
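
A capture record with one implicit signal computed from the draft and the final text could look like this. It is a sketch under assumptions: the Episode fields are hypothetical, and difflib's similarity ratio is used as an inexpensive stand-in for a true edit distance.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

def edit_fraction(draft: str, final: str) -> float:
    """0.0 when the draft was used verbatim, 1.0 when nothing survived."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()

@dataclass
class Episode:
    episode_id: str
    draft: str                      # what the agent produced
    final: str                      # what was actually used
    rating: Optional[int] = None    # explicit signal, often absent
    human_takeover: bool = False    # implicit signal

    def signals(self) -> dict:
        """Flatten explicit and implicit feedback into one log record."""
        return {"episode_id": self.episode_id,
                "rating": self.rating,
                "human_takeover": self.human_takeover,
                "edit_fraction": round(edit_fraction(self.draft, self.final), 3)}

# Usage: an untouched draft scores 0.0 with no explicit rating needed.
accepted = Episode("e1", "Refund issued.", "Refund issued.")
print(accepted.signals())
```

The implicit fields are filled on every episode, which is what lets the consolidation job see patterns even when nobody clicks a rating.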

Consolidate. A periodic batch job --- nightly is typical --- reads the recent episodes and distills recurring failure modes into proposed changes. Require multiple supporting episodes per proposal; a single complaint is an anecdote, not a lesson.

Store, in increasing order of cost and risk:

Memory entries --- durable facts and rules in the agent’s memory layer (Mem0, Letta, Zep; Volume 06). Cheapest, most reversible.
Instruction updates --- the system prompt as a living document, amended the way a team amends its CLAUDE.md.
Skill library growth --- procedures that worked get packaged as reusable skills, the move Voyager made canonical (Volume 02 covers the skill ecosystem).
Fine-tuning --- last resort, when the lesson volume is large and stable; it is the most expensive target and the hardest to reverse.

The non-negotiable guard: every learned change passes an eval suite before adoption (Volume 08). A learning loop without a regression gate is a drift machine --- and an attack surface, since whoever can inject feedback can steer the agent. Keep an audit trail from each adopted change back to the episodes that justified it, and watch for the feedback distribution itself shifting over time.

When to Use It

Recurring task distributions with real feedback volume: support desks, coding assistants inside one team’s conventions, anything personalized per user or tenant. Not for one-off tasks, and not where the feedback signal is too sparse or too noisy to clear the multiple-episodes bar --- learning from noise is just drift with a schedule.

Sources

Gulli, Agentic Design Patterns (2025) --- names the pattern.
Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023) --- the within-episode precursor; this pattern is Reflexion’s memory made persistent and gated.
Wang et al., Voyager: An Open-Ended Embodied Agent with Large Language Models (2023) --- the skill-library target.

Example

A support agent’s nightly job mines the day’s human-takeover episodes. It finds nine episodes where human agents took over because the bot offered refunds on a product line that requires manager approval. The job proposes a memory rule scoped to that product line plus a one-line system-prompt amendment, cites the nine episodes, and stages both. The eval suite shows no regression; the changes promote automatically. Takeover rate on that category drops from 14% to 3% over the following month, and the audit log can show exactly why the behavior changed.

Example artifacts

Prompt.

# Consolidation prompt (nightly batch job)

Below are {n} episodes from the last 24 hours where a human corrected the agent or took over.

For each RECURRING failure mode, propose exactly one of:

- MEMORY: a durable rule or fact, with its scope

- INSTRUCTION: a minimal diff to the system prompt

- SKILL: a reusable procedure worth packaging

Only propose changes supported by 3+ episodes. Every proposal must cite the episode ids that justify it.

Episodes: {episodes_json}

Agent.

import json

def nightly_consolidation(episodes, eval_suite):
    # Assumes episodes is a JSON-serializable list of episode records.
    raw = llm(consolidation_prompt, n=len(episodes),
              episodes_json=json.dumps(episodes),
              model="claude-opus-4-7")
    for p in parse_proposals(raw):
        if len(p.supporting_episodes) < 3:
            continue                      # anecdote, not lesson
        candidate = stage(p)              # memory/prompt/skill
        scores = eval_suite.run(candidate)
        if scores.regressed():
            log_rejection(p, scores)      # never adopt on faith
            continue
        promote(candidate)
        audit_log.record(p, scores, p.supporting_episodes)
Exploration and Discovery

Classification Looping, open-ended (novelty-driven)

Intent

Direct an agent to map an unknown space --- generating hypotheses, probing, and accumulating a picture of what exists --- rather than executing a known task toward a known end state.

Sketch

Diagram of Exploration and Discovery pattern — The loop probes where the map is thinnest; findings update the map and new questions refill the frontier.

Motivating Problem

Every pattern so far assumes the task can be stated. Some work cannot state its task: scouting a technology landscape for opportunities nobody has named, understanding an undocumented legacy system, generating research hypotheses worth testing. Deep Research answers a question you asked; exploration’s deliverable includes the questions you did not know to ask. Pointed at open-ended work, a task-execution loop either narrows immediately to the first concrete objective it can find or wanders without accumulating anything --- both failures of the same missing ingredient, an explicit representation of what is not yet known.

How It Works

The loop maintains two artifacts a task executor does not have: a map (what has been learned) and a frontier (what is known to be unknown --- open questions, untested hypotheses, unvisited regions).

Select. Choose the next probe from the frontier by expected information gain --- prefer the region the map says least about over refining what is already well understood. This is the exploration/exploitation dial, and it is the step where an LLM’s judgment earns its cost.
Probe. Run it: a search, an experiment, a file read, a piece of code. Probes are budgeted individually so no single rabbit hole eats the run.
Integrate. Deduplicate findings against the map, record what is genuinely new, and --- the step that makes it exploration --- extract the new questions the findings raise, which join the frontier.
Stop. On budget exhaustion or diminishing novelty: N consecutive probes adding nothing new means the accessible space is mapped at this resolution.
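
The Integrate and Stop steps depend on a novelty check. The sketch below uses token-set Jaccard similarity, which is an assumption chosen for brevity: it catches near-duplicates but not paraphrases, so a real system would more plausibly compare embeddings. The Atlas class, its 0.6 threshold, and the helper names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class Atlas:
    """The map: keeps only findings that differ enough from what is known."""

    def __init__(self, threshold=0.6):
        self.entries = []
        self.threshold = threshold

    def integrate(self, findings):
        """Add genuinely new findings; return the list that was added."""
        novel = []
        for finding in findings:
            t = tokens(finding)
            if all(jaccard(t, tokens(e)) < self.threshold
                   for e in self.entries):
                self.entries.append(finding)
                novel.append(finding)
        return novel

# Usage: the dry counter rises only when a probe adds nothing new.
atlas, dry = Atlas(), 0
for batch in (["Team A uses sparse attention for long contexts"],
              ["team a uses sparse attention for long contexts!"],
              ["Lab B applies diffusion to protein design"]):
    dry = 0 if atlas.integrate(batch) else dry + 1
print(len(atlas.entries), dry)   # 2 0
```

Whatever measure is used, the stop rule is only as honest as this comparison: a check that is too loose lets rephrased facts count as progress, and one that is too strict ends the run early.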

Two named variants. The curriculum variant (Voyager) has the agent propose its own next task just beyond its current competence, banking each success as a reusable skill --- exploration of a capability space. The hypothesis-driven variant (AI Scientist, AI Co-Scientist) structures the loop as generate hypotheses, rank by interestingness and testability, test, and feed results back into generation --- exploration of an idea space.

When to Use It

When the deliverable is a map, not an artifact: research scoping, opportunity scanning, codebase archaeology, hypothesis generation. It is the wrong tool when the objective is statable --- use Deep Research or Plan-and-Execute and finish sooner --- and it is expensive by construction, an open-ended loop of probes. It also needs an honest novelty measure; without one the loop rediscovers the same facts in new phrasing and calls it progress.

Sources

Gulli, Agentic Design Patterns (2025) --- names the pattern.
Wang et al., Voyager: An Open-Ended Embodied Agent with Large Language Models (2023) --- the automatic-curriculum exploration loop.

Example

A technology-scouting agent runs weekly over patent filings and arXiv. It maintains an atlas of themes and a frontier of open questions (“who else is working on X?”, “is technique Y migrating across domains?”). Each run it selects the twenty highest-information-gain probes, integrates findings, and flags clusters that did not exist last month. After three runs it surfaces a convergence between two previously unrelated subfields --- a connection no stated research question would have asked for, and exactly the kind of finding the pattern exists to produce.

Example artifacts

Prompt.

# Frontier-selection prompt (each exploration round)

You are exploring: {domain}

Map so far (compressed): {knowledge_summary}

Open frontier: {frontier_items}

Recent probes and what they added: {recent_gains}

Choose the next probe that maximizes expected NEW information.

Prefer regions the map says least about over refining what is already well understood. Emit JSON:

{"probe": "...",
 "kind": "search" | "experiment" | "read",
 "expected_gain": "..."}

Agent.

import json

def explore(domain, budget, dry_limit=3):
    frontier = seed_frontier(domain)
    atlas, dry = KnowledgeBase(), 0
    while budget.remaining() > 0 and dry < dry_limit:
        choice = json.loads(llm(frontier_prompt, domain=domain,
                                knowledge_summary=atlas.summary(),
                                frontier_items=frontier.top(20),
                                recent_gains=atlas.recent()))
        findings = run_probe(choice, budget)  # search/code/read
        novel = atlas.integrate(findings)     # dedup vs known
        dry = 0 if novel else dry + 1
        frontier.update(atlas.new_questions(findings))
    return atlas.report()

Appendix A --- Pattern Composition

The patterns above almost never appear alone. The table below names some common compositions seen in production systems:

Composition	Description
Router + Specialist Chain	A Routing front door dispatches to one of several Prompt Chains.
Orchestrator over ReAct workers	The orchestrator decomposes; each worker runs a small ReAct loop on its subtask.
Evaluator–Optimizer wrapping any generator	Any synthesis step can be wrapped in a critic loop.
Cascade + Generator–Verifier	Each cascade level is itself a generator–verifier; only verified outputs propagate.
Hierarchical Supervisor of Deep Research Agents	Used in long, multi-domain reports where each domain itself decomposes.
Autonomous Agent with HITL checkpoints	The default shape of “agent in production.”
Plan-and-Execute with ReAct executors	Plan once at the top, run each plan step as a small ReAct.
Map-Reduce + Evaluator–Optimizer on the reduce step	Per-document map calls feed a reduce that loops against a critic until the memo is acceptable.
Goal Monitor over an Autonomous Agent	A supervisory loop scores progress against explicit goals and halts, replans, or escalates on drift.
Exception Handling around any tool-using pattern	Retries, circuit breakers, and checkpoints wrap ReAct or Orchestrator–Workers; the difference between a demo and a production system.
Prioritization in front of Orchestrator–Workers	The orchestrator generates subtasks; a scored queue decides which of them deserve workers first.

Appendix B --- A Note on Pattern Drift

Pattern names in this space drift. Agent, workflow, chain, loop, and orchestration are used inconsistently across the literature and the marketing material of the major frameworks. This catalog has tried to use the Anthropic taxonomy where it is clearest and to disambiguate where it is not. Where a pattern in this catalog overlaps substantially with a named concept elsewhere (LangGraph “subgraphs,” AutoGen “GroupChat,” CrewAI “Crews”), the Further Reading of the relevant entry says so.

The catalog is also, of course, a snapshot. The deeper agent patterns --- long-horizon planning, multi-modal grounding, agent-to-agent protocols --- are moving fast. Treat v0.3 as a starting structure, not a final word.

--- End of Patterns of AI Agent Workflows v0.3 ---