Catalog · Workflow & Primitives
Volume 01: Patterns of AI Agent Workflows
Volume 01 of the Agentic AI Series
About This Catalog
This catalog organizes the recurring solutions that have emerged for building useful systems on top of large language models. It is consciously modeled on Martin Fowler’s Patterns of Enterprise Application Architecture (Addison-Wesley, 2002): each pattern gets a stable name, a one-line intent, a sketch, a description of the motivating problem, a How It Works section that discusses the solution and its variations, a When to Use It section, a Sources section naming the canonical origin of the pattern, Further Reading for deeper exposition, an Example, and a final Example artifacts section with concrete prompt templates, agent code, and --- where the pattern makes sense as a reusable unit --- a sketch of how it would package as an Anthropic-style SKILL.md.
The point, as Fowler put it, is vocabulary. If a colleague says “we’ll use Prompt Chaining for the ingestion path and an Orchestrator–Workers loop for the merge step, with a Human-in-the-Loop checkpoint before commit,” every architecturally important decision has been communicated in one sentence. That is what a pattern language is for.
What’s new in v0.2
- Three new patterns added --- Cascade/Fallback, Generator–Verifier, Map-Reduce over Corpus --- bringing the catalog to 23 patterns.
- A new Chapter 5 (“Cost Models”) summarizing the LLM-call count, parallelism, and loop boundedness of every pattern in one table.
- Every pattern now has a Sources section (canonical origin) distinct from Further Reading (deeper exposition).
- Every pattern now has Example artifacts --- a prompt template, agent code (Python with the Claude SDK style), and where applicable, a SKILL.md sketch.
- All diagrams have been redrawn using only straight lines, with double-arrows replacing pairs of single arrows where the relationship is bidirectional.
Scope
This catalog covers the control-flow and coordination patterns of agentic systems --- how LLM calls, tools, sub-agents, and humans are arranged in time and how information passes between them. It does not (yet) cover the orthogonal axes of:
- Memory & state (working memory, episodic memory, vector stores, scratchpads)
- Tool & ACI design (agent–computer interface patterns)
- Retrieval (RAG variants, query rewriting, reranking)
- Evaluation (offline evals, online evals, LLM-as-judge)
- Safety & alignment (guardrails, constitutional patterns, sandboxing)
Each of these warrants its own catalog and will be drafted as a separate volume.
On Sources vs. Further Reading
The Sources section names the canonical origin of a pattern --- the paper that introduced it, the repository that ships the reference implementation, or the blog post that named it. Where the catalog itself coins or generalizes a pattern (Hybrid Pipeline, Cascade/Fallback, Map-Reduce over Corpus, Generator–Verifier as a named entry), the source is acknowledged as such. Further Reading points to deeper exposition, related variants, and production write-ups.
Part 1 — The Narratives
The pattern entries in Part 2 are reference material. They are not meant to be read front-to-back. Part 1 is the orientation: short essays that frame the design space so the patterns in Part 2 make sense in context.
Chapter 1. Workflows vs. Agents
Anthropic draws an architectural distinction that is worth borrowing wholesale. A workflow is a system in which LLMs and tools are arranged in predefined code paths. An agent is a system in which the LLM dynamically directs its own process and tool use, choosing the next step at runtime. Both are “agentic systems,” but they have very different operational characteristics.
Workflows give you predictability, observability, and bounded cost. Agents give you flexibility and reach at the cost of higher variance, higher token spend, and the risk of compounding errors over long horizons.
The practical rule is the same one that governs all software design: start with the simplest thing that works. For many problems, a single well-prompted LLM call with retrieval is enough. When that fails, add a workflow. When the workflow’s branching becomes unmanageable, then --- and only then --- promote to an agent.
Chapter 2. The Shape of Control Flow
There are four primitive shapes that recur throughout this catalog:
- Sequential --- Step B runs after step A and uses A’s output. Prompt Chaining is the canonical case. Latency is the sum of stages; total cost is the sum of stages.
- Parallel --- N independent calls run at the same time; their results are merged. Parallelization (Sectioning and Voting) is the canonical case. Latency is the slowest branch; cost is the sum of branches.
- Hybrid (Pipeline) --- Stages are sequential, but each stage may itself fan out in parallel. Most real systems end up here.
- Looping --- The same step (or short cycle) runs repeatedly until a termination condition is met. Evaluator–Optimizer, Reflexion, ReAct, and Autoresearch are all variations of looping. Cost is unbounded unless you cap iterations or budget tokens.
Every pattern in Part 2 can be classified along these axes, and the classification is given at the top of each entry.
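The latency and cost rules for the four shapes can be written down as a small calculator. This is only a sketch: the `Call` type, the helper names, and all of the numbers are illustrative assumptions, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One LLM call with an estimated latency (seconds) and cost (cents)."""
    latency: float
    cost: float


def sequential(calls):
    # Steps run one after another: latency and cost both add up.
    return Call(sum(c.latency for c in calls), sum(c.cost for c in calls))


def parallel(calls):
    # Branches run concurrently: latency is the slowest branch, cost still adds up.
    return Call(max(c.latency for c in calls), sum(c.cost for c in calls))


def looping(step, max_iter):
    # Worst case for a capped loop: the step runs max_iter times back to back.
    return Call(step.latency * max_iter, step.cost * max_iter)


# Hybrid pipeline: a sequential chain whose second stage fans out in parallel.
draft = Call(latency=4.0, cost=2.0)
reviews = [Call(2.0, 1.0), Call(3.0, 1.0), Call(2.5, 1.0)]
hybrid = sequential([draft, parallel(reviews)])
print(hybrid)  # Call(latency=7.0, cost=5.0)
```

The looping helper only gives an upper bound; without a `max_iter` (or a token budget) there is no bound to compute, which is the point of the warning above.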
Chapter 3. The Augmented LLM
The foundational unit of every pattern that follows is not the raw model but the augmented LLM: a model with access to retrieval, tools, and memory. Almost nothing useful in production is built on a bare completion call. When this catalog refers to “an LLM call,” assume the augmented form unless explicitly stated.
Chapter 4. Choosing a Pattern
A rough decision procedure:
- One call enough? Try a single augmented LLM call with good prompting and retrieval. If the evaluation passes, stop here.
- Decomposable into fixed steps? Use Prompt Chaining.
- Distinct input classes need distinct handling? Add Routing in front.
- Same task applied to many documents? Use Map-Reduce over Corpus.
- Want to escalate to a stronger model only when needed? Use Cascade/Fallback.
- Independent subtasks of one document/task? Use Parallelization --- Sectioning.
- Want confidence through redundancy? Use Parallelization --- Voting.
- Subtasks not knowable in advance? Use Orchestrator–Workers.
- Need iterative refinement against a critic that gives feedback? Use Evaluator–Optimizer.
- Need rejection sampling against a Pass/Fail check? Use Generator–Verifier.
- Need the model to use tools in an open-ended loop? Use ReAct or, for more autonomy, the Autonomous Agent pattern.
- Need to optimize a measurable metric overnight? Use Autoresearch.
The patterns compose. A production system is almost always a graph of these primitives, not a single one.
Chapter 5. Cost Models
The cost of running an agentic system is dominated by the number of LLM calls, and the wall-clock latency is dominated by the longest sequential path through those calls. These two quantities are first-order concerns at design time. The table below summarizes the cost profile of every pattern in this catalog along three axes:
- LLM calls per task --- how many model invocations a single end-to-end execution incurs.
- Parallelism --- whether and how the calls can run concurrently.
- Loop bound --- what bounds the number of iterations (if any), or whether the cost is open-ended.
Reading the table from top to bottom is a useful exercise: it shows why “start with the simplest thing that works” is the right rule. A single Augmented LLM call costs one invocation; a search or iterative pattern such as LATS or Reflexion can cost hundreds. The cost-per-unit-quality gradient is steep, and most production systems sit near the cheap end of that range.
| Pattern | LLM calls per task | Parallelism | Loop bound |
|---|---|---|---|
| Augmented LLM | 1 | None | None |
| Prompt Chaining | N (fixed stages) | None | None |
| Routing | 1 router + 1 specialist | None | None |
| Sectioning | K sections + 1 aggregator | K-way | None |
| Voting | N + 1 aggregator | N-way | None |
| Orchestrator–Workers | 2 + M workers (M dynamic) | M-way | Possible re-orchestration |
| Hybrid Pipeline | Sum of stages | Per stage | Per stage |
| Evaluator–Optimizer | 2 × iter | None | max_iter |
| Reflexion | ≥3 × iter (actor/critic/reflector) | None | max_iter |
| Autoresearch | 1 × iter | None | Indefinite (overnight) |
| ReAct | 1 × iter | None | max_iter / token budget |
| Plan-and-Execute | 1 planner + N steps | Optional per step | Possible replanning |
| ReWOO | 2 (planner + solver) | M-way over workers | None (single pass) |
| Tree-of-Thoughts | K × depth × branches | Per level | depth bound |
| LATS | Very high | Per branch | depth bound |
| Multi-Agent Debate | N × rounds + 1 | N-way per round | rounds |
| Hierarchical Supervisor | Sum over levels | Per level | Depth of tree |
| Autonomous Agent | 1 × iter | None | Step budget / token cap |
| HITL Checkpoint | (base pattern) | (base pattern) | (base pattern) |
| Deep Research | 1 + N sub-agents + 1 synth | N-way | Per-sub-agent budget |
| Cascade/Fallback | 1 to L levels | None | L levels |
| Generator–Verifier | 2 × samples | Optional batch | max_samples |
| Map-Reduce over Corpus | N + 1 | N-way | None |
Notes. “LLM calls” counts model invocations, not tools. “Parallelism” refers to the maximum concurrent LLM calls during one execution --- actual concurrency depends on runtime and rate limits. “Loop bound” names the dominant termination control; in practice most looping patterns also have a token budget and a wall-clock cap as safety nets.
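For design-time estimates, a few rows of the table can be turned into a lookup of call counts. The sketch below covers only rows whose count is a closed-form expression; the dictionary keys, parameter names, and the flat per-call cost are assumptions made for illustration, and for looping patterns the result is an upper bound reached only if the loop runs to its cap.

```python
# LLM calls per task for selected rows of the table, as functions of their parameters.
CALLS_PER_TASK = {
    "augmented_llm": lambda: 1,
    "prompt_chaining": lambda stages: stages,           # N fixed stages
    "routing": lambda: 2,                               # 1 router + 1 specialist
    "sectioning": lambda k: k + 1,                      # K sections + 1 aggregator
    "voting": lambda n: n + 1,                          # N voters + 1 aggregator
    "evaluator_optimizer": lambda iters: 2 * iters,     # generate + critique per iteration
    "generator_verifier": lambda samples: 2 * samples,  # generate + verify per sample
    "map_reduce": lambda n: n + 1,                      # N map calls + 1 reduce
}


def estimate_calls(pattern, **params):
    """Number of model invocations for one end-to-end execution."""
    return CALLS_PER_TASK[pattern](**params)


def estimate_cost(pattern, cost_per_call, **params):
    """Rough cost, assuming every call costs the same (a simplification)."""
    return estimate_calls(pattern, **params) * cost_per_call


print(estimate_calls("sectioning", k=3))                     # 4
print(estimate_cost("evaluator_optimizer", 2, iters=5))      # 20
```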
Part 2 — The Patterns
Section A — Foundational Building Blocks
- Augmented LLM --- A model call enriched with retrieval, tools, and memory.
Section B — Control-Flow Workflows
- Prompt Chaining --- Decompose a task into fixed sequential LLM steps.
- Routing --- Classify the input, then dispatch to a specialized handler.
- Cascade/Fallback --- Try cheaper first; escalate to stronger only on low confidence.
- Parallelization --- Sectioning --- Split a task into independent subtasks run in parallel.
- Parallelization --- Voting --- Run the same task N times and combine outputs.
- Map-Reduce over Corpus --- One LLM call per document in parallel; a final reduce call synthesizes.
- Orchestrator–Workers --- A central LLM dynamically decomposes and synthesizes.
- Hybrid Pipeline --- Compose sequential stages where individual stages fan out in parallel.
Section C — Iterative & Self-Improving Patterns
- Evaluator–Optimizer --- Generate, critique, refine. Loop until the critic accepts.
- Generator–Verifier --- Rejection sampling against a binary Pass/Fail check.
- Reflexion --- Evaluator–Optimizer with verbal episodic memory across attempts.
- Autoresearch (Ratchet Loop) --- Propose, experiment, keep only on improvement; otherwise roll back.
Section D — Reasoning & Search Patterns
- ReAct --- Interleave reasoning with tool use in a single rolling context.
- Plan-and-Execute --- Produce a full plan up front, then execute it step by step.
- ReWOO --- Plan with explicit dependencies, execute observation-free, synthesize.
- Tree-of-Thoughts --- Branch the reasoning into multiple paths, evaluate, expand, backtrack.
- LATS --- Tree-of-Thoughts over a ReAct-style action space, with reflection.
Section E — Multi-Agent Coordination
- Multi-Agent Debate --- Multiple solver agents exchange and critique each other’s answers.
- Hierarchical Supervisor --- A supervisor agent delegates to specialized sub-agents.
Section F — Autonomy & Oversight
- Autonomous Agent --- An LLM in an open-ended tool-use loop with no predefined step count.
- Human-in-the-Loop Checkpoint --- A workflow pauses at defined points for human review.
- Deep Research Agent --- A planner spawns parallel research sub-agents and synthesizes findings.
Augmented LLM
Classification Foundational building block
Intent
Wrap a base model with retrieval, tools, and memory before composing it into anything else.
Sketch
Motivating Problem
A bare LLM call is stateless, has a fixed knowledge cutoff, and cannot affect the world. Any non-trivial application needs at least one of: fresh information, the ability to take actions, or continuity across calls. Re-implementing those wiring concerns for every pattern in this catalog would be exhausting and inconsistent.
How It Works
Three augmentations attach to a model call:
- Retrieval brings in external information at query time. Vector store similarity, hybrid lexical+semantic search, multi-hop retrieval, reranking.
- Tools are typed functions the model can invoke. Schema in the prompt; structured tool-use blocks in the output; runtime executes and returns results.
- Memory is anything outside the current context window --- scratchpad, key-value store, episodic log, persistent profile.
The augmentations are capabilities, not control flow. Every other pattern in this catalog assumes its LLM calls are augmented unless explicitly noted.
The Model Context Protocol (MCP) is one concrete way to expose tools and data sources; the augmentations themselves are older than MCP and do not require it.
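The runtime side of the tool augmentation ("runtime executes and returns results") can be sketched without any model in the loop. The dictionary shapes below are loosely modeled on tool-use request and result blocks; the field names, the `TOOLS` registry, and the stubbed `lookup_order` are assumptions for illustration rather than a specific SDK's API.

```python
import json


def lookup_order(order_id: str) -> dict:
    # Hypothetical tool implementation, stubbed so the example is self-contained.
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names (as advertised to the model) to Python callables.
TOOLS = {"lookup_order": lookup_order}


def run_tool(tool_use: dict) -> dict:
    """Execute one tool-use request and package the outcome for the model."""
    name, args = tool_use["name"], tool_use["input"]
    if name not in TOOLS:
        return {"tool_use_id": tool_use["id"], "is_error": True,
                "content": f"unknown tool: {name}"}
    try:
        result = TOOLS[name](**args)
    except Exception as exc:
        # Report the failure to the model instead of crashing the runtime.
        return {"tool_use_id": tool_use["id"], "is_error": True, "content": str(exc)}
    return {"tool_use_id": tool_use["id"], "is_error": False,
            "content": json.dumps(result)}


request = {"id": "t1", "name": "lookup_order", "input": {"order_id": "A17"}}
print(run_tool(request)["content"])
```

Returning errors as data rather than raising keeps the decision about what to do next with the model, which is usually where it belongs in an augmented call.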
When to Use It
Always. There is essentially no production application where the bare model is preferable to the augmented model. The interesting design questions are which augmentations, not whether to use any.
Sources
- Anthropic, Building Effective Agents (Schluntz & Zhang, December 2024) --- names the pattern.
Further Reading
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020)
- Schick et al., Toolformer (NeurIPS 2023)
- Anthropic, Model Context Protocol specification (2024)
- Anthropic, function calling / tool use reference
Example
A customer support assistant has retrieval over the help center, tools for issuing refunds and looking up orders, and a memory store keyed on the conversation ID that records what has been tried so far. Every other pattern in the support stack --- routing, evaluation, escalation --- builds on this base.
Example artifacts
Prompt.
# System prompt
You are a customer support assistant.
When you need information from our help center, call the retrieve_docs tool.
When you need to take an action (lookup_order, issue_refund), call the appropriate tool.
The memory store contains prior context for this user; it is added to your context automatically.
Be concise. Cite docs by their canonical URL when referenced.
Agent.
# anthropic-python: augmented LLM call
# SYSTEM_PROMPT, prior_context_for, conversation_id, and user_message are
# assumed to be defined by the surrounding application.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=SYSTEM_PROMPT,
    tools=[
        {"name": "retrieve_docs", "input_schema": {...}},
        {"name": "lookup_order", "input_schema": {...}},
        {"name": "issue_refund", "input_schema": {...}},
    ],
    messages=[
        # memory store contents prepended here
        *prior_context_for(conversation_id),
        {"role": "user", "content": user_message},
    ],
)
Prompt Chaining
Classification Sequential
Intent
Decompose a task into a fixed sequence of LLM steps, where each step consumes the output of the previous one.
Sketch
Motivating Problem
A single LLM call asked to do a complex task --- outline, draft, fact-check, and translate a report --- usually does each subtask worse than a call dedicated to that subtask alone. Quality drops with prompt surface area. At the same time, splitting the task into truly independent pieces is impossible because each step depends on the previous one’s output.
How It Works
Hard-code the decomposition. Each stage gets its own prompt and its own model call. The output of stage n becomes part of the input to stage n+1. Between stages, place programmatic gates --- deterministic checks (length, schema validity, classifier scores) that decide whether to proceed, retry the previous stage, or abort.
Variations:
- Linear chain. Stage n feeds stage n+1.
- Chain with gates. Programmatic checks between stages.
- Conditional chain. A gate may branch to alternative downstream stages.
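The "chain with gates" variation, including the retry-or-abort decision, might look like the following. It is a sketch under stated assumptions: `run_stage`, `run_chain`, the retry limit, and the stand-in stage functions are hypothetical, and a real chain would make an LLM call inside each stage.

```python
class GateFailure(Exception):
    """Raised when a stage's output still fails its gate after all retries."""


def run_stage(stage, gate, stage_input, max_retries=2):
    """Run one stage, re-running it while the gate rejects its output."""
    for _ in range(max_retries + 1):
        output = stage(stage_input)
        if gate(output):
            return output  # gate passed: proceed to the next stage
    name = getattr(stage, "__name__", "stage")
    raise GateFailure(f"gate rejected {name} after {max_retries + 1} attempts")


def run_chain(stages, initial_input):
    """stages is a list of (stage, gate) pairs; each output feeds the next stage."""
    value = initial_input
    for stage, gate in stages:
        value = run_stage(stage, gate, value)
    return value


# Stand-in stages: plain functions where a real chain would call a model.
def draft(topic):
    return f"A short blurb about {topic}."


def shout(text):
    return text.upper()


result = run_chain(
    [(draft, lambda out: len(out.split()) <= 50), (shout, lambda out: True)],
    "solar kettles",
)
print(result)  # A SHORT BLURB ABOUT SOLAR KETTLES.
```

Retrying only helps when the stage is non-deterministic (a sampled LLM call); for a deterministic stage the gate should abort immediately.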
When to Use It
When the task decomposes cleanly into a fixed pipeline whose stages are known in advance. If the decomposition depends on the input, prefer Orchestrator–Workers. If the stages are independent, prefer Parallelization --- Sectioning.
Sources
- Anthropic, Building Effective Agents --- names the pattern.
Further Reading
- Wu et al., AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts (CHI 2022)
- LangGraph linear graph patterns
Example
Generate marketing copy, then translate it into five languages: stage 1 drafts the copy; a gate checks word count and reading-grade level; stage 2 (run five times in parallel) translates the approved copy.
Example artifacts
Prompt.
# Stage 1 --- Draft
"Write a 200-word marketing blurb for {product_name}, targeting
{audience}. Output only the blurb, no commentary."
# Stage 2 --- Translate
"Translate the following blurb into {language}, preserving tone
and brand voice. Output only the translation.
Blurb:
{stage1_output}"
Agent.
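class GateFailure(Exception):
    """Raised when a stage's output fails its programmatic gate."""

def chain(product, audience, language):
    # llm, stage1_prompt, and stage2_prompt are assumed to be defined elsewhere.
    blurb = llm(stage1_prompt.format(product_name=product, audience=audience))
    # Gate: enforce word count (a reading-grade check would slot in here too)
    wc = len(blurb.split())
    if not (150 <= wc <= 250):
        raise GateFailure(f"blurb word count {wc} out of range")
    return llm(stage2_prompt.format(language=language, stage1_output=blurb))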
Routing
Classification Sequential (classifier then specialist)
Intent
Classify the input and dispatch it to a specialized downstream handler.
Sketch
Motivating Problem
One prompt that tries to handle every kind of input ends up bad at all of them. Adding instructions for refund handling degrades the technical-support behavior; tuning for technical support breaks the small-talk path. Specialization wants separation of concerns, but the caller does not know the category in advance.
How It Works
A first call --- either an LLM classifier or a traditional classification model --- labels the input. The label selects a downstream prompt, model, or sub-workflow. Each downstream handler is independently prompted, evaluated, and versioned.
Routing has a useful second use beyond specialization: cost routing. Easy or common queries go to a smaller, cheaper model (Claude Haiku 4.5); hard or unusual queries go to a more capable one (Claude Opus 4.7). The router itself is usually small and cheap.
The pattern fails when the classifier is wrong. Calibrate it: keep a held-out evaluation set of labeled inputs, measure routing accuracy continuously, and watch the downstream metrics broken out per route.
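Measuring routing accuracy on a held-out set, broken out per route, needs very little machinery. In the sketch below `routing_report`, the keyword-based `keyword_router` stand-in, and the four labeled examples are all hypothetical; in practice the router would be the LLM or classifier under test.

```python
from collections import defaultdict


def routing_report(examples, router):
    """Return (overall accuracy, accuracy per expected route).

    examples is an iterable of (text, expected_label) pairs.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for text, expected in examples:
        totals[expected] += 1
        if router(text) == expected:
            hits[expected] += 1
    per_route = {label: hits[label] / totals[label] for label in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_route


def keyword_router(text):
    # Stand-in for the real classifier, just to make the example runnable.
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "error" in lowered:
        return "tech_support"
    return "general"


held_out = [
    ("I want a refund", "billing"),
    ("Charged twice last month", "billing"),
    ("App shows an error", "tech_support"),
    ("Hello there", "general"),
]
overall, per_route = routing_report(held_out, keyword_router)
print(overall)    # 0.75
print(per_route)  # {'billing': 0.5, 'tech_support': 1.0, 'general': 1.0}
```

The per-route breakdown is what exposes a weak category: the overall number looks acceptable while half of the billing traffic is misrouted.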
When to Use It
When inputs fall into distinct categories that benefit from distinct handling, and when categorization is accurate enough that misrouting is rare or recoverable. If categories overlap or shift, prefer a single capable handler with retrieval. If the categorization itself requires multi-step reasoning, the decision belongs inside an Orchestrator–Workers pattern instead. For escalation based on confidence rather than category, prefer Cascade/Fallback.
Sources
- Anthropic, Building Effective Agents --- names the pattern.
Further Reading
- Ong et al., RouteLLM: Learning to Route LLMs with Preference Data (2024)
- LangGraph conditional edges documentation
Example
A customer service deployment routes (a) FAQ-style questions to a Haiku model with retrieval, (b) account changes to a Sonnet model with account tools, and (c) anything flagged “complaint” or “escalation” to an Opus model with a human-in-the-loop checkpoint.
Example artifacts
Prompt.
# Router system prompt
You are a customer query router. Classify the user message into exactly one category and emit JSON, no prose:
{
  "category": "billing" | "tech_support" | "complaint" | "general",
  "confidence": 0.0-1.0
}
Use "complaint" for any expression of dissatisfaction or anger.
Use "general" as the fallback.
Agent.
import json

# llm, the *_prompt strings, and the *_tools lists are assumed to be defined elsewhere.
HANDLERS = {
    "billing": ("claude-haiku-4-5", billing_prompt, billing_tools),
    "tech_support": ("claude-sonnet-4-6", tech_prompt, tech_tools),
    "complaint": ("claude-opus-4-7", complaint_prompt, complaint_tools),
    "general": ("claude-haiku-4-5", general_prompt, []),
}

def route_and_dispatch(user_msg):
    raw = llm(router_prompt, user_msg, model="claude-haiku-4-5")
    decision = json.loads(raw)
    category = decision.get("category")
    # Safe fallback on low confidence or an unexpected label
    if decision.get("confidence", 0.0) < 0.6 or category not in HANDLERS:
        category = "general"
    model, prompt, tools = HANDLERS[category]
    return llm(prompt, user_msg, model=model, tools=tools)
Cascade / Fallback
Classification Sequential with early-exit gates
Intent
Try a cheaper or faster model first; escalate to a stronger one only when the cheaper attempt is unconfident or fails verification.
Sketch
Motivating Problem
For workloads with a long tail of difficulty, sending every request to the largest model wastes money on the 80% of requests a smaller model would handle correctly. Routing-by-classification is one answer, but it requires the categorization itself to be reliable. Cascade is the alternative: try cheap first, ask the model (or a verifier) how confident it is, and escalate only when confidence is low.
How It Works
Order the available handlers from cheapest to most capable: Haiku, Sonnet, Opus. For each request, try the cheapest handler first. After the response, check confidence --- either by asking the model to self-report (“give me confidence 0–1”), by running a separate verifier on the output, or by inspecting tool-call success.
If confidence is above a threshold, return. Otherwise, escalate to the next handler. The escalated handler may receive the original input only, or the original input plus the cheaper handler’s attempt as additional context.
Cost analysis: if 70% of requests are handled at level 1, 20% at level 2, and 10% at level 3, the effective per-request cost is roughly 1.0·c₁ + 0.3·c₂ + 0.1·c₃. Every request pays for level 1, the 30% that escalate also pay for level 2, and the 10% that escalate again also pay for level 3, since failed levels still incurred their cost. This is typically a large win over always-Opus for tasks with a comfortable success rate at the lower tiers.
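The arithmetic is easy to get wrong, so a small worked calculation helps. Each level's cost is weighted by the share of requests that reach it, not the share that resolve there. The function name and the relative costs (1, 4, 20) are assumptions chosen only to make the numbers concrete.

```python
def expected_cascade_cost(resolve_fractions, level_costs):
    """Expected per-request cost of a cascade.

    resolve_fractions[i] is the share of requests that stop at level i
    (the shares sum to 1); level_costs[i] is the cost of one call at level i.
    """
    expected, reach = 0.0, 1.0
    for fraction, cost in zip(resolve_fractions, level_costs):
        expected += reach * cost  # everyone who reaches this level pays for it
        reach -= fraction         # those who resolve here do not go further
    return expected


# 70% / 20% / 10% resolution with illustrative relative costs 1, 4, 20:
# 1.0 * 1 + 0.3 * 4 + 0.1 * 20
cost = expected_cascade_cost([0.7, 0.2, 0.1], [1, 4, 20])
print(round(cost, 2))  # 4.2, versus 20 for always using the top tier
```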
Calibrate the threshold offline. Self-reported confidence is famously poorly calibrated; a small verifier model or a structural check (“did the JSON validate?”, “did the SQL parse?”) often produces a more reliable signal.
When to Use It
When the difficulty distribution is long-tailed and a reliable signal exists for “this answer is probably right.” When latency-sensitive paths can afford the occasional escalation. Not for tasks where a wrong answer at the cheap tier is silently shipped --- escalation requires a real check, not a guess.
Sources
- Catalog name. The pattern is widely deployed in production but does not have a single canonical paper.
- Closest formal treatment: Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost (2023).
Further Reading
- Yue et al., Large Language Model Cascades with Mixture of Thought Representations (ICLR 2024)
- Compare to Routing: Routing decides at classification time; Cascade decides at runtime based on the actual attempt.
Example
A code-explanation feature first tries Haiku on the snippet. The response includes a self-reported “confidence_low: true|false” field; on low confidence, the request escalates to Sonnet. If Sonnet also reports low confidence, the request goes to Opus. In production, 78% of requests resolve at Haiku, 17% at Sonnet, 5% at Opus.
Example artifacts
Prompt.
# Per-level prompt (all models)
Explain the code snippet below in 2-3 sentences.
After your explanation, on a new line, output exactly:
CONFIDENCE: HIGH | MEDIUM | LOW
Code:
{snippet}
Agent.
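MODELS = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-7"]

def cascade(snippet):
    # llm, explain_prompt, and strip_confidence_line are assumed to be defined elsewhere.
    for i, model in enumerate(MODELS):
        response = llm(explain_prompt.format(snippet=snippet), model=model)
        # Accept a confident answer, or whatever the last (strongest) model says
        if "CONFIDENCE: HIGH" in response or i == len(MODELS) - 1:
            return strip_confidence_line(response), model
        # otherwise fall through and escalate to the next model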
Parallelization — Sectioning
Classification Parallel
Intent
Split a task into independent subtasks, run them concurrently, and aggregate.
Sketch
Motivating Problem
A task with multiple independent facets --- a code review for correctness, performance, and security; a long document where independent sections can be drafted in any order --- wastes wall-clock time and model attention when handled in a single call. The model’s attention is finite; concentrating it on one facet at a time produces better output.
How It Works
Determine the sections in advance (this is what distinguishes sectioning from Orchestrator–Workers, where the decomposition is dynamic). Fire one LLM call per section concurrently. Then run an aggregation step --- often itself an LLM call --- that merges, de-duplicates, and reconciles the outputs.
The aggregator can be:
- Concatenation for genuinely independent sections (different chapters of a report).
- A merging LLM call when sections may overlap or contradict.
- A deterministic combiner when the outputs are structured (JSON merges, set unions).
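When section outputs are structured, the deterministic combiner needs no model call at all. The sketch below assumes each section returns a list of finding dictionaries with `file`, `line`, and `description` keys; that schema and the function name are hypothetical.

```python
def combine_findings(*section_outputs):
    """Merge structured findings from several sections.

    Exact duplicates (same file, line, and description) are dropped, and the
    result is grouped by file in first-seen order.
    """
    seen, merged = set(), {}
    for findings in section_outputs:
        for finding in findings:
            key = (finding["file"], finding["line"], finding["description"])
            if key in seen:
                continue
            seen.add(key)
            merged.setdefault(finding["file"], []).append(finding)
    return merged


correctness = [{"file": "a.py", "line": 3, "description": "off-by-one"}]
security = [
    {"file": "a.py", "line": 3, "description": "off-by-one"},      # duplicate
    {"file": "b.py", "line": 9, "description": "unsanitized input"},
]
report = combine_findings(correctness, security)
print({path: len(items) for path, items in report.items()})  # {'a.py': 1, 'b.py': 1}
```

A combiner like this only removes exact duplicates; near-duplicates or contradictions between sections are the case for a merging LLM call.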
When to Use It
When the subtasks are genuinely independent (no subtask needs another’s output), and when running them concurrently reduces latency or improves quality through focused attention. If dependencies exist, prefer Prompt Chaining or ReWOO. If the same subtask is run many times for confidence rather than coverage, prefer Voting.
Sources
- Anthropic, Building Effective Agents --- names the parallelization workflow and distinguishes sectioning from voting.
Further Reading
- Khot et al., Decomposed Prompting: A Modular Approach for Solving Complex Tasks (ICLR 2023)
Example
A guardrail system runs two parallel LLM calls on every user message: one generates the response, one screens the message for policy violations. The aggregator returns the response only if the screen passes; otherwise it substitutes a refusal. Separating the two responsibilities produces better behavior on both than handling them in a single prompt.
Example artifacts
Prompt.
# Three facet prompts (one per section)
correctness_prompt = (
    "Review the code below for correctness bugs only. "
    "List each bug as: file:line --- description."
)
performance_prompt = (
    "Review the code below for performance issues only. "
    "List each issue as: file:line --- description."
)
security_prompt = (
    "Review the code below for security vulnerabilities only. "
    "List each as: file:line --- severity --- description."
)
# Aggregator prompt
aggregator_prompt = (
    "You are given three independent reviews of the same code "
    "(correctness, performance, security). Combine them into a single "
    "report grouped by file, with one section per concern type. Remove "
    "exact duplicates."
)
Agent.
import asyncio

async def review_code(code):
    # llm and llm_async are assumed helpers (sync and async model calls).
    correctness, perf, security = await asyncio.gather(
        llm_async(correctness_prompt + "\n" + code),
        llm_async(performance_prompt + "\n" + code),
        llm_async(security_prompt + "\n" + code),
    )
    return llm(aggregator_prompt, [correctness, perf, security])
Parallelization — Voting
Classification Parallel
Intent
Run the same task N times and combine the results to get a more confident answer.
Sketch
Motivating Problem
A single LLM call on a high-stakes judgment --- is this code vulnerable? Is this content harmful? Which of these answers is correct? --- has nonzero variance. The cost of being wrong sometimes exceeds the marginal cost of running the call more than once.
How It Works
Run N independent calls on the same input. Combine the outputs by:
- Majority vote for discrete labels (self-consistency in the chain-of-thought literature).
- Threshold rules --- flag as harmful if any of N flags it; auto-approve only if all of N pass.
- Aggregating LLM call for free-text outputs, reconciling the N candidates into a single answer.
Diversity helps. Different prompts, different temperatures, or different models. Identical calls at temperature zero give you N copies of the same answer and waste tokens.
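The two deterministic aggregation rules can be sketched in a few lines. The function names are illustrative, and the inputs are assumed to be already-parsed votes (labels, or booleans where `True` means "flagged").

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label and its share of the votes.

    On a tie, Counter.most_common keeps first-seen order, so callers that
    care about ties should check the returned share themselves.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def flag_if_any(flags):
    """Conservative threshold: one flag out of N is enough to flag."""
    return any(flags)


def approve_only_if_all_pass(flags):
    """Strict threshold: auto-approve only when no voter flagged."""
    return not any(flags)


print(majority_vote(["safe", "harmful", "safe"]))      # ('safe', 0.666...)
print(flag_if_any([False, True, False]))               # True
print(approve_only_if_all_pass([False, False, False])) # True
```

Returning the vote share alongside the label lets the caller treat a narrow majority differently from a unanimous one, for example by routing it to review.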
When to Use It
When the cost of an individual wrong answer is high relative to the cost of N calls; when the task admits a defensible aggregation rule; and when the calls are genuinely diverse.
Sources
- Anthropic, Building Effective Agents --- names voting as a parallelization workflow.
Further Reading
- Wang et al., Self-Consistency Improves Chain-of-Thought Reasoning in Language Models (ICLR 2023)
Example
A content moderation pipeline runs three differently-prompted classifiers on every post. The post is auto-approved only if all three pass; auto-rejected only if all three flag; otherwise routed to a human moderator.
Example artifacts
Agent.
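# Each prompt is paired with the one-word answer that means "flag this post".
# In the second prompt a "yes" means the post is safe, so "no" is the flag.
PROMPTS = [
    ("Classify this post: harmful or safe. Output one word.", "harmful"),
    ("Is this post safe to publish on a family-friendly site? Yes or no.", "no"),
    ("Does this post violate our content policy (hate, harassment, "
     "sexual, violent)? Yes or no.", "yes"),
]

def first_word(text):
    # Normalize the reply to its first word, without trailing punctuation
    words = text.strip().lower().split()
    return words[0].strip(".,!\"'") if words else ""

def screen(post):
    # llm is an assumed helper; the three calls could also run concurrently.
    flagged = sum(
        1 for prompt, flag_word in PROMPTS
        if first_word(llm(prompt + "\n" + post, temperature=0.0)) == flag_word
    )
    if flagged == 0:
        return "auto_approve"
    if flagged == len(PROMPTS):
        return "auto_reject"
    return "human_review"  # disagreement -> escalate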
Map-Reduce over Corpus
Classification Parallel (per-document fan-out, sequential reduce)
Intent
Apply the same LLM operation to each document in a corpus in parallel, then combine the results in a final reduce call.
Sketch
Motivating Problem
When a question touches information spread across many documents, naively concatenating the corpus into one prompt blows the context window, dilutes attention, and pays for tokens that have nothing to do with the answer. The classical map-reduce shape --- do the same thing to each item, then combine --- carries over almost unchanged.
How It Works
The pattern decomposes a corpus query into three stages:
- Map: for each document, run the same LLM call --- extract the relevant facts, summarize what is in scope, or score relevance. The map prompt is identical for every document.
- Shuffle: optionally filter out null results (“not relevant”) before reducing, to keep the reduce call’s context manageable.
- Reduce: a single LLM call (or a small number of them in a tree-reduce, if the surviving outputs are still too many) combines the per-document outputs into the final answer.
This is Parallelization --- Sectioning’s document-level cousin. Where Sectioning takes one input and splits it into facets, Map-Reduce takes one query and broadcasts it across many inputs. The pattern is worth its own name because the corpus-level shape recurs constantly: search over documentation, evidence aggregation, due diligence, log triage.
Hierarchical reduce. When N is in the thousands and the reduce prompt cannot fit all summaries, run reduce in a tree: combine groups of, say, 16 summaries each, then combine the group summaries. The cost is N + N/16 + N/256 + … ≈ (16/15)·N --- still linear in N.
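A tree reduce with a fixed fan-in might be sketched as follows. The `tree_reduce` name and the use of `sum` as a stand-in for the reduce LLM call are assumptions; the function also counts combine calls so the cost claim can be checked on a concrete N.

```python
def tree_reduce(items, combine, fan_in=16):
    """Combine items in groups of fan_in, level by level, until one remains.

    Returns (result, number_of_combine_calls). combine takes a list of items
    and returns one item; in a real pipeline it would be a reduce LLM call.
    """
    if not items:
        raise ValueError("nothing to reduce")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level, calls = list(items), 0
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), fan_in):
            next_level.append(combine(level[start:start + fan_in]))
            calls += 1
        level = next_level
    return level[0], calls


# 256 per-document summaries: 16 first-level calls, then 1 final call.
result, reduce_calls = tree_reduce(list(range(256)), sum, fan_in=16)
print(reduce_calls)        # 17
print(256 + reduce_calls)  # 273 total calls including the map stage
```

For N = 256 the total of 273 calls matches (16/15)·N ≈ 273.07, so the overhead of the tree over a single reduce call stays under 7% of the map cost.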
When to Use It
When the question naturally decomposes per-document, when the corpus is bounded (hundreds to low thousands of documents), and when the per-document operation is uniform. For unbounded retrieval over a much larger corpus, use RAG to narrow to a relevant subset first, then Map-Reduce over the subset.
Sources
- Catalog name. The map-reduce shape is the LLM-era specialization of Dean & Ghemawat’s 2004 paper (MapReduce: Simplified Data Processing on Large Clusters) for corpus-level LLM operations.
- LangChain documentation names the same shape under “map_reduce” chains.
Further Reading
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts (TACL 2024) --- motivates per-document calls over single long-context calls.
- Compare to Sectioning (one input, many facets) and Voting (one input, many attempts).
Example
A due-diligence pipeline receives a 300-document data room and the question “are there any unusual indemnification clauses?” Map: each document is summarized with respect to indemnification, with “NONE” returned for irrelevant documents. Filter: documents returning NONE are dropped. Reduce: a final call ranks the surviving clauses by unusualness and produces a memo.
Example artifacts
Prompt.
# Map prompt (run per document)
"Extract any clauses related to {topic} from the document below. If
none, output exactly: NONE. Otherwise output: <clause text> //
source: <page or section>.
Document:
{document}"
# Reduce prompt (run once over the surviving outputs)
"You are given per-document extracts about {topic}. Combine them
into a memo: group similar clauses, rank by unusualness, cite each by
source."
Agent.
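import asyncio

async def map_reduce(corpus, topic):
    # Map: one LLM call per document, issued concurrently
    extracts = await asyncio.gather(*[
        llm_async(map_prompt.format(topic=topic, document=d))
        for d in corpus
    ])
    # Shuffle (filter): drop documents the map step marked irrelevant
    relevant = [e for e in extracts if e.strip() != "NONE"]
    if not relevant:
        return "No relevant clauses found."
    # Reduce: a single call over the surviving extracts
    # (awaited, so the event loop is not blocked by a synchronous call)
    return await llm_async(reduce_prompt.format(topic=topic),
                           "\n---\n".join(relevant))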
Skill (SKILL.md sketch).
---
name: corpus-map-reduce
description: Apply the same extraction question to every document in a
  corpus and synthesize a single memo. Use when the user asks about
  something that may span many files in a folder, codebase, or data room.
---
# Map-Reduce over Corpus
## When to invoke
- "review these 50 contracts for X"
- "find all places in the codebase that do Y"
- "summarize what each of these documents says about Z"
## Recipe
1. Read every file in the target set in parallel.
2. For each file, run the map prompt (topic from the user).
3. Drop files whose output is "NONE".
4. Run the reduce prompt over the survivors.
5. Return the memo with source citations.
Orchestrator–Workers
Classification Hybrid (sequential orchestration of parallel workers, possibly looping)
Intent
A central LLM dynamically decomposes the task into subtasks, delegates each to a worker LLM, and synthesizes their outputs.
Sketch
Motivating Problem
Some tasks decompose into subtasks whose number and shape are not knowable in advance. Editing a code change across an unknown number of files is the canonical case. Sectioning fails here because you cannot list the sections up front. Prompt Chaining fails because the decomposition is the problem.
How It Works
An orchestrator LLM call inspects the input and produces a list of subtasks. Each subtask is dispatched to a worker --- usually a fresh LLM call with a worker-specific prompt and a focused scope. Workers can run in parallel when their subtasks are independent. When they return, the orchestrator synthesizes their outputs into a final result, possibly issuing further subtasks if it judges the work incomplete.
The topology resembles Sectioning, but the decomposition is decided at runtime rather than fixed in advance. The orchestrator must be capable enough to produce a sensible decomposition, and the synthesis step has to handle a variable-size collection of worker outputs.
When to Use It
When the structure of the task depends on the input; when subtasks are roughly independent once identified; and when you can afford the latency of the orchestration + synthesis round-trips. If you know the subtasks in advance, prefer Sectioning. If subtasks have dependencies, you are reaching for ReWOO or Hierarchical Supervisor.
Sources
- Anthropic, Building Effective Agents --- names the pattern.
Further Reading
- LangGraph multi-agent supervisor documentation
- Microsoft AutoGen “GroupChatManager” as a similar orchestration shape
Example
A coding agent receives “rename the User class to Account everywhere.” The orchestrator inspects the codebase, identifies files that need changes (including imports, tests, migrations), and spawns a worker for each file. Each worker proposes its edit. The orchestrator reviews the diff set and either commits or issues additional subtasks.
Example artifacts
Prompt.
# Orchestrator system prompt
You are a task orchestrator. Given the user goal, output a JSON
list of subtasks, each shaped:
{
"id": "t1",
"instruction": "<focused, self-contained instruction>",
"tool": "<which tool/worker prompt to use>",
"depends_on": [] // ids of subtasks that must complete first
}
After all subtasks return, you will be re-invoked with their
outputs to synthesize the final answer.
Agent.
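import json

def orchestrate(goal):
    # Orchestrator call: decompose the goal into a JSON list of subtasks
    plan_json = llm(orchestrator_prompt, goal, model="claude-opus-4-7")
    plan = json.loads(plan_json)
    results = {}
    while plan:
        # A subtask is ready once every subtask it depends on has a result
        ready = [t for t in plan
                 if all(d in results for d in t["depends_on"])]
        if not ready:
            raise RuntimeError("dependency cycle or unknown subtask id")
        # One LLM call per subtask; assumed to return {subtask id: output}
        batch = run_workers_parallel(ready)
        results.update(batch)
        plan = [t for t in plan if t["id"] not in results]
    # Synthesis call: combine the worker outputs into the final answer
    return llm(synthesis_prompt, goal, results,
               model="claude-opus-4-7")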
Hybrid Pipeline
Classification Sequential of parallel stages
Intent
Compose a workflow as a sequence of stages, where one or more stages internally fan out in parallel.
Sketch
Motivating Problem
Real workflows are almost never purely sequential or purely parallel. A research pipeline reads a corpus (sequential), summarizes each document (parallel), then synthesizes the summaries into a report (sequential). Treating the whole thing as one pattern loses information about where the parallelism actually is and where the dependencies actually are.
How It Works
Hybrid Pipeline is a composition, not a primitive. It is the explicit recognition that the patterns in this catalog combine, and the act of naming the composition gives you a vocabulary for the shape of a workflow rather than just its parts. A Hybrid Pipeline is fully described by listing its stages, marking which are sequential and which fan out, and naming the sub-pattern used at each stage (Prompt Chaining, Sectioning, Map-Reduce, ReAct, etc.).
When to Use It
Whenever a workflow has more than one stage and at least one of those stages parallelizes. Which is almost always.
Sources
- Catalog name. The composition has no single canonical reference because it is the implicit shape of nearly every production system.
Further Reading
- Read the orchestration graphs in any LangGraph or Temporal workflow case study to see Hybrid Pipelines in practice.
Example
A legal-document review pipeline: (1) ingest and chunk documents (sequential), (2) classify each chunk by clause type (Map-Reduce over Corpus), (3) for each clause type, run a specialist reviewer (Routing inside Sectioning), (4) aggregate flags and produce a single review memo (sequential, with an Evaluator–Optimizer loop on the memo).
Example artifacts
Agent.
# Hybrid pipelines are best read as their constituent patterns composed.
# Pseudo-DAG:
stage_1_ingest = sequential(chunk_documents)
stage_2_classify = map_reduce(per_chunk=classify_clause_type,
                              reduce=group_by_type)
stage_3_review = sectioning_with_routing(per_type=specialist_reviewer)
stage_4_memo = evaluator_optimizer(generator=draft_memo,
                                   evaluator=memo_checker, max_iter=3)
pipeline = compose(stage_1_ingest, stage_2_classify, stage_3_review,
                   stage_4_memo)
Evaluator–Optimizer
Classification Looping
Intent
One LLM generates a candidate; a second LLM evaluates it and produces feedback; the generator revises. Loop until the evaluator accepts or a budget is exhausted.
Sketch
Motivating Problem
Some outputs improve dramatically when a human gives the model targeted feedback (“the second paragraph contradicts the first,” “this translation lost the pun”). The first pass is rarely the best pass. But waiting for a human reviewer between every iteration does not scale.
How It Works
Stand up two LLM roles. The generator produces a candidate output from the input and any prior feedback. The evaluator scores the candidate against explicit criteria --- accuracy, completeness, style, schema validity --- and either accepts or emits feedback. On a reject, the generator runs again with the feedback in its prompt. The loop terminates on accept, on max iterations, or when feedback stops changing meaningfully.
Two preconditions matter: (1) the evaluator must be able to articulate useful feedback (“bad” doesn’t help the generator), and (2) feedback must demonstrably improve the output (test offline). When both hold, the pattern produces outputs significantly better than a single call. When they do not --- a sign that criteria are not legible to the model --- the loop wastes tokens without converging.
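The three stop conditions named above (accept, iteration cap, feedback that stops changing) can be sketched as follows. This is an illustration under assumptions: `generate` and `evaluate` are hypothetical callables standing in for the two LLM roles, and "stops changing meaningfully" is approximated by string similarity from the standard library's `difflib`, with an arbitrary 0.9 threshold.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def refine(generate: Callable[[str], str],
           evaluate: Callable[[str], Tuple[bool, str]],
           max_iter: int = 5,
           stall_similarity: float = 0.9) -> Tuple[str, str]:
    """Run the generate/evaluate loop; return (candidate, stop_reason)."""
    feedback = ""
    candidate = ""
    for _ in range(max_iter):
        candidate = generate(feedback)            # generator sees prior feedback
        accepted, new_feedback = evaluate(candidate)
        if accepted:
            return candidate, "accepted"
        # Stall check: the evaluator is repeating itself, so another round
        # is unlikely to help.
        if feedback and SequenceMatcher(
                None, feedback, new_feedback).ratio() >= stall_similarity:
            return candidate, "stalled"
        feedback = new_feedback
    return candidate, "budget"
```

Returning the stop reason alongside the candidate makes the offline test of precondition (2) easier: a high share of "stalled" or "budget" exits suggests the feedback is not improving the output.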
When to Use It
When you have clear evaluation criteria and iterative refinement provides measurable value; classically, translation, long-form drafting, code generation against tests, and complex search.
Sources
- Anthropic, Building Effective Agents --- names the pattern.
Further Reading
- Madaan et al., Self-Refine: Iterative Refinement with Self-Feedback (NeurIPS 2023)
- Pan et al., Automatically Correcting Large Language Models: Surveying the Landscape (TACL 2024)
Example
A literary translation pipeline: the generator translates a paragraph; the evaluator (prompted with the source, the translation, and a checklist of common failure modes) emits a structured critique; the generator re-translates incorporating the critique. Three rounds usually suffice; the loop caps at five.
Example artifacts
Prompt.
# Generator prompt
"Translate from {src_lang} to {tgt_lang}. Preserve tone and any
wordplay.
Source:
{source_text}
Prior critique to address (may be empty):
{feedback}"
# Evaluator prompt (literal JSON braces are doubled because the
# template is filled with str.format)
"You will score a translation 0-10 on (a) fidelity, (b) naturalness,
(c) preserved wordplay. Output JSON:
{{ \"score\": <float>, \"feedback\": \"<targeted critique, max 3 bullets>\" }}
Source: {source_text}
Translation: {translation}"
Agent.
import json

def evaluator_optimizer(source, src_lang, tgt_lang, max_iter=5,
                        threshold=8.5):
    feedback = ""
    best, best_score = None, float("-inf")
    for i in range(max_iter):
        candidate = llm(generator_prompt.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source_text=source, feedback=feedback))
        verdict = json.loads(llm(evaluator_prompt.format(
            source_text=source, translation=candidate)))
        if verdict["score"] >= threshold:
            return candidate  # evaluator accepts
        # Keep the highest-scoring candidate; a revision can score lower
        if verdict["score"] > best_score:
            best, best_score = candidate, verdict["score"]
        feedback = verdict["feedback"]
    return best  # best-effort on budget exhaustion
Generator–Verifier
Classification Looping (rejection sampling)
Intent
Generate a candidate; check it with a binary Pass/Fail verifier; on Fail, sample again. The verifier emits no feedback; the generator simply produces a fresh sample.
Sketch
Motivating Problem
Sometimes the right success signal is binary --- the SQL parses, the JSON validates, the unit tests pass, the regex extracts a match --- and articulating useful natural-language feedback to the generator is either impossible or unhelpful. Asking a Pass/Fail verifier to also produce feedback adds prompt complexity and can introduce hallucinated critiques. Rejection sampling against a clean verifier is often simpler and faster.
How It Works
The verifier is a function --- deterministic checker, classifier, or LLM with a strict Pass/Fail output --- that takes the candidate and returns one of two labels. The generator draws up to k candidates (in parallel or serially) and returns the first one that passes; if none of the k passes, the loop reports failure. The example code calls this cap max_samples.
The pattern is the agent-era version of rejection sampling. Diversity in the generator (temperature, prompt variations) is the engine --- if the generator always produces the same output, k samples cost k× the tokens and gain nothing.
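To make the role of diversity concrete: if each sample passes independently with probability p, the chance that at least one of k samples passes is 1 - (1 - p)^k. Independence is an idealization, since samples drawn from the same prompt are usually correlated, so real gains tend to be smaller. A small sketch (the function name is illustrative):

```python
def pass_within_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples passes,
    when each sample passes with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# Independent samples: a 40% per-sample pass rate compounds quickly.
print(round(pass_within_k(0.4, 5), 3))  # 0.922
```

A fully deterministic generator is the opposite extreme: every sample is identical, so k samples pass exactly as often as one sample does.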
Why this is not Evaluator–Optimizer:
- Evaluator–Optimizer’s critic produces feedback the generator incorporates. The generator’s prompt changes on every round.
- Generator–Verifier’s verifier produces Pass/Fail only. The generator’s prompt is constant; only the random seed (or temperature) differs.
- When the failure mode is “close but not quite right,” feedback helps and Evaluator–Optimizer is appropriate. When the failure mode is “invalid in an objective way,” rejection sampling is appropriate.
Parallel batch variant. Run k generator calls concurrently and return the first candidate that passes. This spends up to k× the generation tokens to cut wall-clock latency --- useful when most candidates fail and the verifier is fast.
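A minimal asyncio sketch of the parallel batch variant, under stated assumptions: `generate` is a hypothetical async callable that produces one candidate per sample index, and `verify` is a fast synchronous Pass/Fail check. Neither is a real SDK function.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def first_passing(generate: Callable[[int], Awaitable[str]],
                        verify: Callable[[str], bool],
                        k: int = 5) -> Optional[str]:
    """Start k generator calls at once; return the first candidate that
    passes verification, or None if all k fail."""
    tasks = [asyncio.create_task(generate(i)) for i in range(k)]
    try:
        # as_completed yields results in completion order, not start order.
        for finished in asyncio.as_completed(tasks):
            candidate = await finished
            if verify(candidate):
                return candidate
        return None
    finally:
        # Stop paying for samples that are still in flight.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

Cancelling the remaining tasks only saves tokens if the underlying client actually aborts the request on cancellation; whether it does depends on the client in use.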
When to Use It
When validity is objective and binary; when feedback would not help the generator more than fresh sampling would; when the verifier is fast relative to the generator. Code generation against a test suite, structured output generation against a schema, query generation against a database parser.
Sources
- Catalog name. The technique is classical rejection sampling (von Neumann, 1951) applied to LLM outputs.
- The text-to-SQL literature has independently used this shape under names like “self-debugging” and “execution-guided decoding.”
Further Reading
- Chen et al., Teaching Large Language Models to Self-Debug (ICLR 2024) --- a generator-verifier loop guided by program execution.
- Anthropic, function calling structured outputs documentation --- the JSON-schema verifier is built in.
Example
A natural-language-to-SQL feature: the generator produces a SQL candidate; the verifier runs EXPLAIN against the database (no execution) and accepts on a clean parse. On reject, the generator samples again at higher temperature. Cap: 5 samples. Below-cap success rate observed: ~92%.
Example artifacts
Prompt.
# Generator prompt
"Translate the user question into a SELECT query against the schema
below. Output only SQL.
Schema:
{schema}
Question:
{question}"
# Verifier (programmatic, not an LLM)
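def verifier(sql):
    try:
        db.execute("EXPLAIN " + sql)  # plans the query without running it
        return "PASS"
    # Database drivers raise their own error types (e.g. sqlite3.Error),
    # never Python's SyntaxError, so catch broadly here.
    except Exception:
        return "FAIL"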
Agent.
def generator_verifier(question, schema, max_samples=5):
    for i in range(max_samples):
        candidate = llm(
            generator_prompt.format(schema=schema, question=question),
            # Greedy first try; later tries sample hotter for diversity
            temperature=0.0 if i == 0 else 0.7,
        )
        if verifier(candidate) == "PASS":
            return candidate
    raise SamplingExhausted(f"no valid SQL in {max_samples} tries")
Reflexion
Classification Looping with episodic memory
Intent
Evaluator–Optimizer extended with persistent verbal lessons: the agent reflects on each failure in writing, stores the reflection, and consults the store on the next attempt.
Sketch
Motivating Problem
Evaluator–Optimizer carries feedback only within a single task. If the agent fails the same way on a hundred different tasks, it learns nothing across tasks. Fine-tuning is the heavyweight answer; Reflexion is the lightweight one --- keep the lessons in natural language, in a memory store the next attempt can read.
How It Works
Three roles: the actor attempts the task; the critic judges the attempt; the reflector, run only on failures, writes a short natural-language post-mortem that names what went wrong and what to try next. The post-mortem is stored. On future attempts (same task or related), the actor’s prompt includes relevant reflections retrieved from the store.
The pattern is essentially verbal reinforcement learning: the reflection plays the role of the gradient, and adding it to the actor’s context plays the role of the parameter update. No weights change. It is cheaper and more interpretable than fine-tuning, and it composes with the rest of the patterns in this catalog.
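A minimal sketch of the reflection store, assuming word overlap between task descriptions as a stand-in for the vector search a real system would more likely use. `LessonStore` and its methods are illustrative names, not an existing API.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude proxy for an embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


@dataclass
class LessonStore:
    lessons: List[Tuple[str, str]] = field(default_factory=list)  # (task, lesson)

    def store(self, task: str, lesson: str) -> None:
        self.lessons.append((task, lesson))

    def retrieve(self, task: str, top_k: int = 3) -> List[str]:
        """Return up to top_k lessons from the most similar past tasks."""
        query = _tokens(task)
        scored = []
        for past_task, lesson in self.lessons:
            past = _tokens(past_task)
            union = query | past
            score = len(query & past) / len(union) if union else 0.0  # Jaccard
            if score > 0:
                scored.append((score, lesson))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [lesson for _, lesson in scored[:top_k]]
```

Filtering out zero-overlap lessons matters: an irrelevant lesson in the actor's prompt costs tokens and can mislead the next attempt.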
When to Use It
When failures recur in legible ways and the task has a clean success signal (unit tests pass, retrieval recall hits a threshold, evaluator accepts). Reflexion is a poor fit when success is ambiguous or when failures are idiosyncratic.
Sources
- Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023) --- names and characterizes the pattern.
Further Reading
- Park et al., Generative Agents (UIST 2023) --- a richer episodic-memory architecture that subsumes verbal reflection.
- Compare to Evaluator–Optimizer (single-task) and to fine-tuning (parametric memory).
Example
A SWE-bench-style coding agent attempts a fix; the test suite runs as the critic; on failure, the reflector writes “the patch broke test_user_serialization because the migration order matters.” The reflection is stored. On the next attempt at this issue (or a similar one), the reflection is retrieved into the actor’s prompt.
Example artifacts
Prompt.
# Reflector prompt
"The previous attempt at this task failed. In 2-3 sentences, write a
lesson to apply on future attempts. Focus on the root cause, not the
surface symptom. Begin with: 'Lesson: '
Task: {task}
Attempt: {attempt}
Critic verdict: {critic_output}"
Agent.
def reflexion(task, max_iter=5):
    lessons = retrieve_lessons(task)  # vector search over store
    for i in range(max_iter):
        attempt = llm(actor_prompt, task=task, lessons=lessons)
        verdict = critic(attempt, task)
        if verdict.passed:
            return attempt
        # The reflector runs only on failures
        lesson = llm(reflector_prompt, task=task,
                     attempt=attempt, critic_output=verdict.explanation)
        store_lesson(task, lesson)  # persists for future tasks
        lessons.append(lesson)  # available to next iteration
    return attempt  # last attempt; the critic never accepted
Autoresearch (Ratchet Loop)
Classification Looping with monotonic state
Intent
Propose a change, run an experiment, measure a metric, keep the change only if the metric improved; otherwise roll back. Repeat indefinitely.
Sketch
Motivating Problem
Many optimization problems --- tuning an ML training loop, optimizing a prompt, improving SEO copy, refining an ad --- share a structure: there is a clear measurable metric, individual experiments take minutes to hours, the space of possible changes is large, and most changes don’t help. A human researcher spends most of their time waiting for the experiment, not thinking about the next one. The work is parallelizable in time: an agent can run experiments overnight while the human sleeps.
How It Works
Three preconditions: (1) a measurable scalar metric, (2) the ability to apply and roll back a change atomically, and (3) an experiment cycle that fits inside a reasonable wall-clock budget. With those, the loop is small:
- An agent reads the current code/prompt/config and the log of prior experiments.
- It proposes a single, targeted change, grounded in its knowledge of the domain.
- The change is applied; the experiment runs.
- The metric is compared against the best-so-far.
- If improved, the change is committed (git commit) and the new best is recorded. If not, the change is rolled back (git reset --hard).
- The loop repeats.
The git-based “ratchet” is what distinguishes the pattern from a generic optimizer. The state on disk is always the current best; failed experiments leave no residue. The agent can be naive about exploration strategy because the ratchet provides the floor.
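The ratchet property can be shown without git. The sketch below is an in-memory analogue under stated assumptions: `propose` and `measure` are hypothetical callables (an LLM-proposed change and an experiment run in the real pattern), "commit" is an assignment, and "rollback" is simply discarding the candidate. It also assumes a higher metric is better.

```python
import random
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def ratchet(state: S,
            propose: Callable[[S, random.Random], S],
            measure: Callable[[S], float],
            iterations: int,
            seed: int = 0) -> Tuple[S, float, List[float]]:
    """Keep a change only if it improves the metric; otherwise discard it.
    Returns (best_state, best_metric, best-so-far trace)."""
    rng = random.Random(seed)
    best_state, best_metric = state, measure(state)
    trace = [best_metric]
    for _ in range(iterations):
        candidate = propose(best_state, rng)  # always branch from the best
        metric = measure(candidate)
        if metric > best_metric:
            best_state, best_metric = candidate, metric  # "commit"
        # else: the candidate is dropped, leaving no residue ("rollback")
        trace.append(best_metric)
    return best_state, best_metric, trace
```

The trace can never decrease, which is the floor the text describes: even a naive proposal strategy cannot make the kept state worse.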
Karpathy’s autoresearch repository (March 2026) is the canonical implementation: an AI coding agent (Claude Code, Codex, or equivalent) is pointed at a small LLM training setup and a Markdown instruction file describing the research agenda, and the loop runs unsupervised. Results from extended runs surfaced changes a careful human would eventually have found --- a missing QK-norm scaler, value-embedding regularization, AdamW beta tuning --- caught by the agent grinding through experiments overnight.
When to Use It
When the three preconditions hold and the experiment cycle is short enough to run a hundred trials overnight. If the metric is fuzzy, the rollback is messy, or each experiment takes a day, the pattern degrades into something closer to Evaluator–Optimizer with very expensive iterations.
Sources
- Karpathy, autoresearch GitHub repository (March 2026) --- the reference implementation and the source of the name.
Further Reading
- Coverage in Fortune and DataCamp, March–April 2026
- Tobi Lütke’s published adaptation of the autoresearch loop to query-expansion model tuning (Shopify, 2026)
- Compare to classical AutoML (Bayesian optimization, hyperparameter sweeps), which has convergence guarantees the Ratchet Loop deliberately gives up in exchange for using LLM domain knowledge to propose changes.
Example
Karpathy’s original run: nanochat-style LLM training on a single GPU, ~5-minute training jobs, validation bits-per-byte as the metric. Two days unattended, ~700 experiments, ~20 genuine improvements committed, ~11% reduction in training time. Tobi Lütke’s adaptation pointed the same loop at a 0.8B query-expansion model overnight: 37 experiments, 19% quality improvement.
Example artifacts
Agent.
# Pseudocode for the ratchet loop
import time
from pathlib import Path

def autoresearch(workdir, max_iter=1000, time_budget_hours=8):
    workdir = Path(workdir)
    deadline = time.time() + time_budget_hours * 3600
    best_metric = current_metric(workdir)
    for i in range(max_iter):
        if time.time() > deadline:
            break
        history = read_log(workdir / "results.tsv")
        change_plan = llm(agent_prompt,
                          history=history,
                          state=read_files(workdir),
                          model="claude-opus-4-7")
        apply_change(workdir, change_plan)
        try:
            new_metric = run_experiment(workdir)
        except Exception:
            git_reset(workdir)  # a crashed experiment leaves no residue
            continue
        # Assumes higher is better; flip the comparison for loss-like
        # metrics such as validation bits-per-byte.
        if new_metric > best_metric:
            git_commit(workdir, f"iter {i}: {change_plan.summary}")
            best_metric = new_metric
            append_log(workdir, i, change_plan, new_metric, "committed")
        else:
            git_reset(workdir)
            append_log(workdir, i, change_plan, new_metric, "rolled_back")
    return best_metric
Skill (SKILL.md sketch).
---
name: autoresearch
description: Run an unattended optimization loop on a measurable metric.
  Use when the user wants to "explore overnight," "optimize automatically,"
  or improve a metric they can measure programmatically.
---
# Autoresearch (Ratchet Loop)
## Preconditions to verify before starting
- There is ONE scalar metric to optimize, measurable by a script.
- A single experiment cycle completes in under ~10 minutes.
- The workdir is a git repo; the agent can commit and reset.
## Required files in the workdir
- agenda.md --- research goal, constraints, ideas to try
- run_experiment.sh --- exits 0 and prints METRIC: <float> on stdout
- results.tsv --- appended to each iteration
## Loop (do not deviate)
1. Read results.tsv for history of prior experiments.
2. Propose ONE targeted change, grounded in agenda.md.
3. Apply the change.
4. Run run_experiment.sh; capture METRIC.
5. If METRIC improved over best-so-far: git commit; update best.
6. Else: git reset --hard.
7. Append a row to results.tsv. Go to 1.
ReAct
Classification Looping (single agent, single context)
Intent
Interleave the model’s reasoning (“Thought”) with tool calls (“Action”) and tool results (“Observation”) in a single rolling context until the task is done.
Sketch
Motivating Problem
Many real tasks require both reasoning and grounded actions --- the model must think about what to do, do it, see the result, and reason about the next step. Pure Chain-of-Thought reasons but cannot ground its conclusions in external state. Pure tool use acts but cannot plan across steps. ReAct interleaves the two.
How It Works
A single prompt instructs the model to emit a Thought, then an Action (a tool call), then to wait for an Observation, then to emit the next Thought. The runtime parses each Action, executes the tool, returns the Observation, and appends everything to the rolling context. The loop continues until the model emits a terminal Action (“Final Answer”) or a stop condition triggers (max iterations, token budget, error).
ReAct is the dominant pattern for production agents today. Most “agent frameworks” are implementations of ReAct with conveniences around tool definition, parsing, and observability.
Failure modes: the model gets stuck repeating the same Action, loses track in long contexts, or hallucinates Observations it never actually received. Guardrails --- iteration caps, repetition detectors, structured tool outputs the model cannot fabricate --- mitigate them.
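Two of the runtime pieces mentioned above can be sketched briefly: parsing the Action line and detecting a repeated Action. Both are illustrative assumptions rather than any framework's implementation. The parser takes only the first Action in a response, so anything the model invents after it (including a fabricated Observation) is ignored, and it leaves the arguments as a raw string.

```python
import re
from collections import deque
from typing import Deque, NamedTuple, Optional

# Matches a line such as: Action: web_search("some query")
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)


class Action(NamedTuple):
    name: str
    raw_args: str


def parse_action(response: str) -> Optional[Action]:
    """Return the first Action in the response, or None if there is none."""
    match = ACTION_RE.search(response)
    if match is None:
        return None
    return Action(match.group(1), match.group(2).strip())


class RepetitionGuard:
    """Flags the loop as stuck when the last `window` actions are identical."""

    def __init__(self, window: int = 3) -> None:
        self.recent: Deque[Action] = deque(maxlen=window)

    def is_stuck(self, action: Action) -> bool:
        self.recent.append(action)
        return (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1)
```

A runtime would typically call `is_stuck` before executing each tool and, when it fires, either stop or inject an Observation telling the model that the action was already tried.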
When to Use It
When the task requires tool use, the steps are not knowable in advance, and the horizon is bounded (tens of steps, not thousands). For longer horizons, layer ReAct inside an outer pattern --- Plan-and-Execute for structure, Hierarchical Supervisor for delegation, Reflexion for cross-task learning.
Sources
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 2023) --- the paper that named the pattern.
Further Reading
- The interleaved Thought/Action/Observation format is load-bearing infrastructure across LangGraph, AutoGen, the OpenAI Agents SDK, and the Claude Agent SDK.
- Anthropic, tool use documentation --- the production implementation of ReAct’s Action/Observation contract.
Example
A research assistant asked “Who won the men’s singles title at the 2024 Australian Open, and where was he born?” emits: Thought: I should look up the 2024 Australian Open men’s singles winner. Action: web_search(“2024 Australian Open men’s singles winner”). Observation: Jannik Sinner. Thought: Now I need his birthplace. Action: web_search(“Jannik Sinner birthplace”). Observation: Innichen, Italy. Action: final_answer(“Jannik Sinner, born in Innichen, Italy.”).
Example artifacts
Prompt.
# ReAct system prompt
You have access to these tools: web_search(q), fetch(url),
final_answer(text).
At each step, respond in EXACTLY this format:
Thought: <your reasoning about what to do next>
Action: <tool_name>(<args>)
Wait for the Observation to be returned. Then continue with the next
Thought.
When you have the answer, call: Action: final_answer(<answer>)
Agent.
def react_loop(question, tools, max_iter=12):
    context = [{"role": "user", "content": question}]
    for i in range(max_iter):
        response = llm(react_system_prompt, context, tools=tools)
        context.append({"role": "assistant", "content": response})
        action = parse_action(response)
        if action.name == "final_answer":
            return action.args["text"]
        # Run the tool and feed the real result back as the Observation
        observation = tools[action.name](**action.args)
        context.append({"role": "user",
                        "content": f"Observation: {observation}"})
    raise IterationCapReached()
Plan-and-Execute
Classification Sequential (plan stage, then execution stages)
Intent
Produce a full step-by-step plan up front; then execute each step, often with a smaller and cheaper model.
Sketch
Motivating Problem
ReAct decides each next step at runtime, with all the context-window cost and variance that implies. For tasks whose structure is predictable --- automated software deployments, structured data extraction, well-bounded multi-step workflows --- re-deciding every step is waste. A plan produced once by a capable model is cheaper to execute than a sequence of full reasoning passes.
How It Works
A planner (typically a strong model) reads the goal and emits a structured plan: an ordered list of steps, each with the inputs and expected outputs named. An executor (potentially a smaller, cheaper model) runs each step in sequence. The executor’s prompt is constrained --- it sees the plan, the current step, the outputs of prior steps, and the tools available.
Variations:
- Static plan. Plan fixed at planning time. Simple but brittle.
- Replanning. After each step, the planner is invoked again with the latest observations and can revise the remaining plan.
- Plan-and-Solve (Wang et al., 2023). A specific prompting variant separating “devise a plan” from “carry out the plan” in a single LLM call.
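Whichever variation is used, the executor is only as reliable as the plan it receives, so it helps to validate the planner's output before running anything. A minimal sketch, assuming each step is a dict with the keys `id`, `instruction`, `tool`, and `depends_on` (the shape used in this pattern's example prompt); `execution_order` is an illustrative name. It relies on `graphlib` from the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List

REQUIRED_KEYS = {"id", "instruction", "tool", "depends_on"}


def execution_order(plan: List[Dict]) -> List[str]:
    """Validate a plan and return step ids in a dependency-respecting order."""
    ids = set()
    for step in plan:
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            raise ValueError(
                f"step {step.get('id', '?')} is missing {sorted(missing)}")
        if step["id"] in ids:
            raise ValueError(f"duplicate step id {step['id']}")
        ids.add(step["id"])
    for step in plan:
        unknown = set(step["depends_on"]) - ids
        if unknown:
            raise ValueError(
                f"step {step['id']} depends on unknown {sorted(unknown)}")
    # TopologicalSorter takes a mapping of node -> predecessors.
    graph = {step["id"]: set(step["depends_on"]) for step in plan}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"plan has a dependency cycle: {exc.args[1]}") from exc
```

Rejecting a malformed plan at this point is cheaper than discovering the problem halfway through execution, and the error message can be fed back to the planner for a retry.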
When to Use It
When the task structure is predictable, when separating planning from execution reduces cost meaningfully, and when the planner can produce a plan you would trust to execute without re-checking every step. If the environment is unpredictable, prefer ReAct or replanning.
Sources
- Wang et al., Plan-and-Solve Prompting (ACL 2023) --- the prompting variant that names the pattern.
Further Reading
- LangGraph plan-and-execute notebook
- BabyAGI (2023) --- an early production-style implementation of the planning loop.
Example
An automated deployment agent receives “deploy v2.3 to staging.” The planner emits: (1) run pre-flight checks, (2) tag the release, (3) push to staging registry, (4) update the staging cluster, (5) run smoke tests. A smaller executor model runs each step. If smoke tests fail, the planner is re-invoked with the failure context to decide whether to roll back or to patch forward.
Example artifacts
Prompt.
# Planner prompt (run once, strong model; literal JSON braces are
# doubled because the template is filled with str.format)
"Goal: {goal}
Output a JSON plan: a list of steps. Each step has:
{{ \"id\": \"s1\", \"instruction\": \"...\",
\"tool\": \"deploy|test|...\", \"depends_on\": []
}}"
# Executor prompt (run per step, cheap model)
"You are executing step {step_id}: {instruction}
Prior step outputs (JSON):
{prior_outputs}
Use the tool {tool}. Return its result verbatim."
Agent.
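import json

def plan_and_execute(goal, max_replans=2):
    plan = json.loads(llm(planner_prompt.format(goal=goal),
                          model="claude-opus-4-7"))
    outputs = {}
    replans = 0
    # A work queue rather than "for step in plan": reassigning plan
    # inside a for loop would never run the revised steps.
    while plan:
        step = plan.pop(0)
        try:
            outputs[step["id"]] = llm(executor_prompt.format(
                step_id=step["id"],
                instruction=step["instruction"],
                tool=step["tool"],
                prior_outputs=json.dumps(outputs),
            ), model="claude-haiku-4-5", tools=[step["tool"]])
        except StepFailure as e:
            if replans == max_replans:
                raise
            replans += 1
            # Replan on failure: the revised plan replaces the remaining steps
            plan = json.loads(llm(replanner_prompt.format(
                goal=goal, completed=json.dumps(outputs), failure=str(e)),
                model="claude-opus-4-7"))
    return outputs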
Skill (SKILL.md sketch).
---
name: plan-and-execute
description: Decompose a multi-step goal into a structured plan and run
  each step. Use when the user describes a goal that decomposes into 3+
  named steps with clear dependencies (deployments, multi-stage data
  processing, structured extraction).
---
# Plan-and-Execute
## When to invoke
- "Deploy X to Y", "set up the pipeline that does A then B then C"
- Goals with verbs that imply staged work
## Recipe
1. Call the planner (strong model) ONCE on the goal.
2. Validate the plan: each step has id, instruction, tool, depends_on.
3. Execute steps in topological order; smaller model is fine per step.
4. On step failure: re-invoke the planner with completed steps + error.
5. Return all outputs keyed by step id.
ReWOO
Classification Sequential plan; parallel execution; sequential synthesis
Intent
Produce a plan with explicit data dependencies, execute the independent steps in parallel without intermediate model invocations, and combine the results in a final synthesis step.
Sketch
Motivating Problem
ReAct re-invokes the LLM after every Observation, which is expensive in tokens and slow in latency when most of the Observations are routine. If the agent knew the structure of the work up front, it could fire many tool calls in parallel without consulting the model between them.
How It Works
The planner decomposes the task into a directed acyclic graph of steps, where each step is labeled (E1, E2, …) and may declare dependencies on the outputs of earlier steps via variable substitution. Workers execute the steps; independent steps run concurrently; dependent steps wait for their inputs. The model is not re-invoked between worker steps. Once all steps complete, the solver receives the original question, the plan, and all the observations, and produces the final answer.
ReWOO (“Reasoning WithOut Observation”) is named for the central insight: the planner reasons without observations, the workers act without reasoning, and the solver reasons after all observations have arrived. The LLM is invoked twice (planner + solver) instead of N times (ReAct).
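The variable-substitution mechanism can be sketched as a small parser. This is an illustration under assumptions: plan lines follow the `#E1 = tool[args]` shape, arguments are kept as a raw string, and `parse_steps` and `substitute` are hypothetical helpers, not code from the ReWOO paper.

```python
import re
from typing import Dict, List, NamedTuple

# Matches a plan line such as: #E2 = lookup[capital of #E1]
STEP_RE = re.compile(r"^#(E\d+)\s*=\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
REF_RE = re.compile(r"#(E\d+)")


class Step(NamedTuple):
    id: str
    tool: str
    args: str
    deps: List[str]


def parse_steps(plan_text: str) -> List[Step]:
    """Extract steps and infer each step's dependencies from its #E refs."""
    steps = []
    for step_id, tool, args in STEP_RE.findall(plan_text):
        deps = sorted(set(REF_RE.findall(args)))
        steps.append(Step(step_id, tool, args.strip(), deps))
    return steps


def substitute(args: str, observations: Dict[str, str]) -> str:
    """Replace each #En reference with the observation recorded for En."""
    return REF_RE.sub(lambda m: str(observations[m.group(1)]), args)
```

Because dependencies are read directly from the references, the planner never has to state them separately: any step whose arguments mention no `#E` variable is free to run immediately, in parallel with the others.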
When to Use It
When the task’s structure can be planned in advance, when independent steps exist that can run in parallel, and when intermediate LLM reasoning between steps adds little. If the next step depends on reasoning over the previous step’s content, prefer ReAct.
Sources
- Xu et al., ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models (2023) --- the paper that names the pattern.
Further Reading
- LangGraph documentation on parallel branching execution
- Compare to Plan-and-Execute (sequential steps) and to Orchestrator–Workers (LLM re-invoked between rounds).
Example
“What is the population of the capital of the country that won the most gold medals at the 2024 Olympics?” Planner produces: E1 = lookup(“2024 Olympics gold medal leader”); E2 = lookup(“capital of #E1”); E3 = lookup(“population of #E2”). E1 must finish before E2; E2 before E3. Solver synthesizes a final answer from #E1, #E2, #E3.
Example artifacts
Prompt.
# Planner prompt
"Decompose the question into worker steps using variable
substitution. Use #E1, #E2, ... to reference prior step outputs.
Output exactly:
Plan: <one-paragraph reasoning>
#E1 = <tool>[<args>]
#E2 = <tool>[<args, possibly referencing #E1>]
...
Question: {question}"
# Solver prompt
"Question: {question}
Plan and observations:
{plan_with_observations}
Synthesize the final answer."
Agent.
def rewoo(question, tools):
    plan_text = llm(planner_prompt.format(question=question))
    steps = parse_steps(plan_text)  # list of steps with .id, .tool, .args, .deps
    observations = {}
    pending = list(steps)
    while pending:
        # A step is ready once every step it references has produced an observation.
        ready = [s for s in pending if all(d in observations for d in s.deps)]
        if not ready:
            # Without this guard a circular or dangling #E reference loops forever.
            raise ValueError("plan has unresolved or circular dependencies")
        results = run_in_parallel([
            (s.id, tools[s.tool](**substitute(s.args, observations)))
            for s in ready
        ])
        observations.update(results)
        pending = [s for s in pending if s.id not in observations]
    return llm(solver_prompt.format(
        question=question,
        plan_with_observations=annotate(plan_text, observations)))
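The agent above relies on `parse_steps` and `substitute` helpers that are not shown. The following is a minimal sketch of what they might look like for the `#E1 = <tool>[<args>]` plan format; the `Step` dataclass and both regular expressions are illustrative assumptions, and the arguments are kept as a single string rather than the keyword arguments the agent code passes.

```python
import re
from dataclasses import dataclass

# Hypothetical helpers for the "#E1 = tool[args]" plan format.
STEP_RE = re.compile(r"^#(E\d+)\s*=\s*(\w+)\[(.*)\]\s*$")
REF_RE = re.compile(r"#(E\d+)")

@dataclass
class Step:
    id: str     # e.g. "E1"
    tool: str   # tool name, e.g. "lookup"
    args: str   # raw argument text, may contain #E references
    deps: list  # ids of earlier steps that this step references

def parse_steps(plan_text):
    """Extract '#E1 = tool[args]' lines; the free-text 'Plan:' line is skipped."""
    steps = []
    for line in plan_text.splitlines():
        m = STEP_RE.match(line.strip())
        if m:
            step_id, tool, args = m.groups()
            steps.append(Step(step_id, tool, args, REF_RE.findall(args)))
    return steps

def substitute(args, observations):
    """Replace each #E reference with the text of its recorded observation."""
    return REF_RE.sub(lambda m: str(observations[m.group(1)]), args)

plan = """Plan: find the winner, then its capital, then the population.
#E1 = lookup[2024 Olympics gold medal leader]
#E2 = lookup[capital of #E1]
#E3 = lookup[population of #E2]"""
steps = parse_steps(plan)
```

Because each step here references the one before it, the dependency lists force three sequential waves; a plan with independent steps would yield several ready steps in the same wave.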
Tree-of-Thoughts
Classification Branching search over LLM reasoning
Intent
Generate multiple reasoning branches at each step, evaluate the partial branches, expand the most promising, backtrack when a branch fails.
Sketch
Motivating Problem
Chain-of-Thought commits to one line of reasoning. If the first step is wrong, every subsequent step inherits the error. For tasks where the search space of reasoning paths is large and only a few paths lead to the right answer --- combinatorial puzzles, creative planning, mathematical proofs --- committing early is the failure mode.
How It Works
Treat reasoning as a search problem. At each node, the model proposes K candidate “thoughts” (next reasoning steps). Each candidate is scored --- by the model itself acting as a value function, by an external evaluator, or by partial execution. The search proceeds via BFS, DFS, or beam search through the tree, expanding promising nodes and pruning poor ones. Backtracking is cheap because the tree is preserved.
The pattern subsumes Chain-of-Thought (the K=1, no-backtrack degenerate case) and self-consistency voting (which samples N independent linear chains without sharing structure).
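The search loop can be exercised without a model by swapping the LLM calls for deterministic stand-ins. The sketch below is a toy beam search under that assumption: `propose`, `score`, and `is_goal` are hypothetical callables that would be LLM-backed in a real system, and the arithmetic task is invented purely for illustration.

```python
def beam_search(initial, propose, score, is_goal, max_depth=4, beam=2):
    """Level-by-level search that keeps only the `beam` best states per level."""
    frontier = [(initial, [])]  # (state, path of thoughts taken so far)
    for _ in range(max_depth):
        candidates = []
        for state, path in frontier:
            for thought, next_state in propose(state):
                new_path = path + [thought]
                if is_goal(next_state):
                    return new_path
                candidates.append((score(next_state), next_state, new_path))
        if not candidates:
            return None  # every branch was pruned
        # Sort on the score only, so tied scores never compare states or paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [(s, p) for _, s, p in candidates[:beam]]
    return None  # depth budget exhausted without reaching the goal

# Toy stand-ins for the LLM calls: reach 24 from 3 using "+5" and "*2".
TARGET = 24

def propose(n):
    return [("+5", n + 5), ("*2", n * 2)]

def score(n):
    return -abs(TARGET - n)  # closer to the target scores higher

def is_goal(n):
    return n == TARGET

wide = beam_search(3, propose, score, is_goal, beam=3)
narrow = beam_search(3, propose, score, is_goal, beam=2)
```

With `beam=3` the search finds the path `*2, *2, *2`; with `beam=2` the state 12 is pruned at the second level and the search returns `None`. The same trade-off applies to LLM-scored beams: a narrow beam saves tokens but can discard the only branch that leads to the answer.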
When to Use It
When the task has a large reasoning search space, when partial reasoning states are evaluable, and when the cost of expanding many branches is justified by the value of finding the right one. Token costs scale with branch count.
Sources
- Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models (NeurIPS 2023) --- names the pattern.
Further Reading
- Besta et al., Graph of Thoughts (AAAI 2024) --- generalizes the tree to an arbitrary graph of thoughts.
- Long, Large Language Model Guided Tree-of-Thought (2023)
Example
Solving the Game of 24 puzzle (combine four numbers with arithmetic to reach 24). At each step, the model proposes candidate next operations; each partial state is scored as “promising” / “uncertain” / “impossible”; the search expands only promising states. ToT solves problems CoT fails on, at multiplicative token cost.
Example artifacts
Prompt.
# Propose prompt (literal JSON braces are doubled because the template is filled with str.format)
"Given the current partial reasoning state, propose K=3 distinct
next thoughts. For each, also self-evaluate as: SURE | MAYBE |
IMPOSSIBLE.
State: {state}
Output JSON: [{{ \"thought\": \"...\", \"eval\": \"SURE|MAYBE|IMPOSSIBLE\" }}]"
# Score prompt (optional, separate value model)
"Rate the partial reasoning below on a 1-10 scale for likelihood of
reaching the goal.
Trajectory: {trajectory}
Output: {{ \"score\": <float>, \"reason\": \"...\" }}"
Agent.
import json

def tot_search(initial_state, max_depth=4, k=3, beam=2):
    frontier = [(initial_state, [])]  # (state, path of thoughts so far)
    for depth in range(max_depth):
        candidates = []
        for state, path in frontier:
            proposals = json.loads(llm(propose_prompt.format(state=state)))
            for p in proposals[:k]:
                if p["eval"] == "IMPOSSIBLE":
                    continue  # prune before paying for a score call
                if is_goal(p["thought"]):
                    return path + [p]
                score = json.loads(llm(score_prompt.format(
                    trajectory=path + [p])))["score"]
                candidates.append((score, p, path + [p]))
        if not candidates:
            break  # every branch was pruned; fall back to the previous frontier
        # Sort on the score only; tied tuples would otherwise compare dicts and raise.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [(c[1]["thought"], c[2]) for c in candidates[:beam]]
    return frontier[0][1]  # best path so far
LATS
Classification Tree search over a ReAct-style action space, with reflection
Intent
Combine Tree-of-Thoughts (branching search), ReAct (tool use), and Reflexion (verbal lessons) into a unified deliberation loop.
Sketch
Motivating Problem
ReAct commits to whichever action it samples; a wrong commitment far down a long trajectory is expensive to recover from. Tree-of-Thoughts handles the branching but only over reasoning steps, not over tool actions. Real tasks need all three: branching deliberation, grounding in tool results, and the ability to learn from failed branches.
How It Works
Each node of the search tree is a state in a ReAct trajectory. At each node, multiple candidate Actions are generated and evaluated; the search expands the most promising. When a branch reaches a dead end or a failure, a reflection is generated and stored, and the search backtracks to a sibling. The reflections inform later expansions.
LATS = Tree Search + ReAct + Reflection. The compute cost is significant: several ReAct branches are expanded at every step, and each failed branch also costs a reflection call. In exchange, the original paper reports higher success rates on hard tasks than ReAct or Reflexion alone.
When to Use It
Hard tasks where ReAct alone is unreliable, the search space of actions is large, and the cost of expanded compute is acceptable. Web navigation, complex reasoning over external tools, and multi-step problem solving where dead ends are common.
Sources
- Zhou et al., Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models (ICML 2024) --- names the pattern.
Further Reading
- Compare to ReAct (linear), Tree-of-Thoughts (reasoning-only branching), and Reflexion (linear with memory).
Example
A web-shopping agent searching for a product matching multiple constraints (size, price, in-stock, free shipping). Multiple ReAct branches explore different starting search queries; failed branches (out-of-stock results, mismatched specifications) trigger reflections that steer remaining branches away from similar dead ends.
Example artifacts
Agent.
# LATS = ToT structure + ReAct nodes + Reflexion on dead ends.
# In pseudocode (each tree node runs a short ReAct loop):
def lats(goal, max_depth=4, branching=3):
    root = Node(state=initial_state(goal))
    reflections = []
    for depth in range(max_depth):
        node = select_best(root)  # UCT or score-based
        actions = propose_actions(node, k=branching, reflections=reflections)
        children = [Node(execute_react_step(node, a)) for a in actions]
        node.children = children
        for c in children:
            if c.is_goal:
                return c.path
            if c.dead_end:
                reflections.append(reflect(c.path, c.dead_end_reason))
    return select_best(root).path
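The pseudocode leaves `select_best` abstract. One common choice is UCT (Upper Confidence bounds applied to Trees), sketched below. The `Node` fields, the exploration constant `c=1.4`, and the `backpropagate` helper are assumptions made for this illustration and are simpler than the `Node` used in the pseudocode above.

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0       # how many rewards have been backed up through this node
        self.value_sum = 0.0  # sum of those rewards
        if parent is not None:
            parent.children.append(self)

def uct(node, c=1.4):
    """Mean value plus an exploration bonus that shrinks as visits grow."""
    if node.visits == 0:
        return math.inf  # always try an unvisited child first
    exploit = node.value_sum / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def select_best(root):
    """Walk from the root to a leaf, following the highest-UCT child each time."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node

def backpropagate(node, reward):
    """Credit a reward to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

The exploration term is what lets the search return to a sibling after a branch fails: a child with a low mean value but few visits can still outrank a well-explored one.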
Multi-Agent Debate
Classification Parallel solvers, sequential aggregation, possibly looping for rounds
Intent
Multiple solver agents independently answer a question; they then exchange and critique each other’s answers across one or more rounds; an aggregator synthesizes the final answer.
Sketch
Motivating Problem
Self-consistency voting samples N answers from one model but cannot resolve cases where the majority is confidently wrong. Multi-Agent Debate forces the models to articulate their reasoning to each other, which surfaces inconsistencies that pure voting misses.
How It Works
N solver agents (distinct prompts, distinct models, or distinct roles) each produce an initial answer with reasoning. In each subsequent round, every solver sees the other solvers’ answers and reasoning, and produces a revised answer. After K rounds (commonly 2–3), an aggregator reads the full transcript and emits the consensus answer.
The pattern’s effectiveness comes from the legibility of disagreement. Without exchange, the disagreement is invisible.
Variations include sparse communication topologies (each solver sees only some peers) and asymmetric debates (one solver is a “checker” prompted to find flaws in the others’ answers).
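The difference between a fully connected debate and a sparse one comes down to which peers each solver is shown. A minimal sketch, assuming a ring as the sparse topology (the function name and the two topology labels are illustrative, not taken from any framework):

```python
def visible_peers(n, topology="full"):
    """Map each solver index to the peer indices whose answers it is shown."""
    if topology == "full":
        return {i: [j for j in range(n) if j != i] for i in range(n)}
    if topology == "ring":
        # Each solver sees only its two neighbours on a ring.
        return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}
    raise ValueError(f"unknown topology: {topology}")

def peer_answers_read(n, topology):
    """Total number of peer answers placed into prompts in one round."""
    return sum(len(peers) for peers in visible_peers(n, topology).values())
```

For four solvers, the full topology puts 12 peer answers into prompts per round and the ring puts 8; the gap widens as N grows, which is the main reason to consider sparse communication.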
When to Use It
When the task admits explicit reasoning, when wrong answers tend to have legible flaws, and when the cost of N×K solver calls is justified. Strong on reasoning-heavy benchmarks (GSM8K, complex QA). Weaker when the task is mostly retrieval or when “reasoning” is post-hoc rationalization.
Sources
- Du et al., Improving Factuality and Reasoning in Language Models through Multiagent Debate (ICML 2024) --- names the pattern.
Further Reading
- Li et al., Improving Multi-Agent Debate with Sparse Communication Topology (2024)
- AutoGen’s multi_agent_debate.ipynb
Example
Three solver agents tackle a math word problem. After their initial answers diverge (one says 42, two say 56), the second round shows each its peers’ reasoning. The single dissenter spots the arithmetic slip in its own work and revises; the aggregator returns 56.
Example artifacts
Prompt.
# Round 1 solver prompt
"Solve the problem below. Show your reasoning step by step, then
give the final numeric answer on the last line as: ANSWER: <value>
Problem: {question}"
# Round 2+ solver prompt
"Here is your prior answer and the answers of two peers. Reconsider
your answer in light of their reasoning; agree or disagree
explicitly.
Your prior: {your_prior}
Peer 1: {peer1}
Peer 2: {peer2}
Then give the final answer on the last line as: ANSWER: <value>"
# Aggregator prompt
"Solvers' final answers: {final_answers}. Output the consensus (or
majority) and confidence."
Agent.
def debate(question, n=3, rounds=2):
    # round2_prompt has exactly two peer slots, so this sketch assumes n == 3.
    solvers = [llm_with_persona(i) for i in range(n)]
    answers = [s(round1_prompt.format(question=question)) for s in solvers]
    for _ in range(rounds - 1):
        new_answers = []
        for i, s in enumerate(solvers):
            peers = [a for j, a in enumerate(answers) if j != i]
            new_answers.append(s(round2_prompt.format(
                your_prior=answers[i], peer1=peers[0], peer2=peers[1])))
        # Swap only after every solver has answered, so all see the same round.
        answers = new_answers
    return llm(aggregator_prompt.format(final_answers=answers))
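When the solvers follow the `ANSWER: <value>` convention from the prompts above, the aggregation step does not have to be an LLM call. The following is a sketch of a deterministic majority vote; `extract_answer` and `majority` are hypothetical helper names.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"ANSWER:\s*(.+?)\s*$", re.MULTILINE)

def extract_answer(text):
    """Return the value on the last 'ANSWER: <value>' line, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None

def majority(answers):
    """Return (winning value, share of all solvers that gave it)."""
    values = [v for v in map(extract_answer, answers) if v is not None]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(answers)

transcripts = [
    "7 * 8 = 56\nANSWER: 56",
    "I agree with peer 1.\nANSWER: 56",
    "6 * 7 = 42\nANSWER: 42",
]
winner, share = majority(transcripts)
```

A vote like this is cheap and auditable, but it discards the reasoning; an LLM aggregator remains the better fit when answers are free-form or when the transcript itself should influence the outcome.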
Hierarchical Supervisor
Classification Multi-level orchestration
Intent
A top-level supervisor agent delegates to mid-level supervisors, which in turn delegate to specialist workers, forming a tree of responsibility.
Sketch
Motivating Problem
A single Orchestrator—Workers layer becomes a bottleneck and a context-window problem when the task is large enough that the orchestrator cannot fit all subtask outputs in its context. The fix is the standard one in software architecture: layer the orchestration.
How It Works
Each level of the hierarchy is itself an Orchestrator—Workers pattern, but the “workers” at one level are themselves supervisors at the next. Typed contracts between levels --- explicit schemas for what a subordinate is asked to do and what it must return --- keep the interfaces clean. Context windows are managed by aggressively summarizing subordinate outputs before they reach a supervisor.
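A typed contract between levels can be as small as two record types plus a rule about what is allowed to flow upward. The sketch below is one possible shape; the field names, the `artifact://` reference scheme, and the use of truncation in place of a real summarization call are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskRequest:
    task_id: str
    instruction: str
    max_summary_chars: int = 500  # cap on what may flow back up

@dataclass(frozen=True)
class TaskResult:
    task_id: str
    status: str             # "ok" | "failed"
    summary: str            # the only text the supervisor reads
    artifact_ref: str = ""  # pointer to the full output, stored elsewhere

def make_result(request, full_output, store):
    """Store the full output and hand the supervisor only a bounded summary."""
    ref = f"artifact://{request.task_id}"
    store[ref] = full_output
    # Truncation stands in for an LLM summarization call in this sketch.
    summary = full_output[: request.max_summary_chars]
    return TaskResult(request.task_id, "ok", summary, ref)

store = {}
request = TaskRequest("sec-1", "Draft section 1", max_summary_chars=10)
result = make_result(request, "x" * 100, store)
```

Keeping the full output behind a reference means a supervisor's context grows with the number of subordinates, not with the size of their work.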
The pattern maps cleanly onto org-chart intuitions: a CEO does not read every line of code; they read summaries from VPs, who read summaries from directors. The same structural argument applies here.
When to Use It
When tasks are large enough that single-level orchestration overflows context windows or exceeds the orchestrator’s ability to reason over the whole. When the work has natural functional sub-domains (research, writing, review). When some sub-trees may run in parallel without contention.
The supervisor itself can become a bottleneck. Designs that route work peer-to-peer when possible, reserving the supervisor for coordination decisions only, scale better.
Sources
- LangGraph multi-agent hierarchical patterns documentation --- reference implementation.
- Roush et al., DeepDebater (2025) --- hierarchical multi-agent architecture for competitive policy debate.
Further Reading
- Production write-ups on hierarchical delegation models (Imran, Medium, 2026)
- Compare to flat Orchestrator–Workers and to Deep Research Agent (a specific 2-level hierarchical pattern).
Example
A long-form research report. The top supervisor decomposes the report into chapters and assigns each to a chapter lead. Each chapter lead decomposes its chapter into sections and assigns workers to draft, fact-check, and edit each section. Workers return finished sections; chapter leads return finished chapters; the top supervisor assembles the final document.
Example artifacts
Agent.
# A two-level hierarchy as nested Orchestrator-Workers
def top_supervisor(goal):
    chapter_plan = llm(top_prompt, goal=goal)  # decompose into chapters
    chapters = run_in_parallel([
        chapter_lead(ch.title, ch.outline) for ch in chapter_plan
    ])
    return llm(top_synth_prompt, goal=goal,
               chapters=summarize_each(chapters))

def chapter_lead(title, outline):
    section_plan = llm(chapter_prompt, title=title, outline=outline)
    sections = run_in_parallel([
        worker(s.instruction) for s in section_plan
    ])
    return llm(chapter_synth_prompt, title=title, sections=sections)
Autonomous Agent
Classification Looping, open-ended
Intent
An LLM operates a tool-use loop with no predetermined number of steps, deciding its own actions, observing the environment, and terminating when it judges the task complete.
Sketch
Motivating Problem
Some tasks have no fixed step count and no a-priori decomposition. The number of files to edit is unknown; the number of pages to read is unknown; the number of corrections to make is unknown. Any pattern that requires the number of steps to be known up front fails on these tasks.
How It Works
The Autonomous Agent is structurally a ReAct loop, but with three features that distinguish it from the smaller ReAct patterns used inside other workflows:
- No fixed horizon. The loop runs until the agent itself signals completion or until a stop condition triggers (max iterations, token budget, time limit, human halt).
- Ground truth from the environment. Every action produces an observation from the real environment (filesystem, browser, API), not a model-generated stand-in.
- Optional human checkpoints. The agent may pause for human review at predefined points; see Human-in-the-Loop Checkpoint.
Cost profile is unbounded by construction. Cost controls --- iteration caps, token budgets, watchdog timers --- are not optional. Errors compound: a wrong action early in a long trajectory can corrupt all subsequent reasoning. Sandboxed execution and rollback-capable environments are standard.
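The three cost controls named above can be bundled into one object that the loop charges on every iteration. This is a sketch under stated assumptions: the `Budget` class, its method names, and the injectable `clock` parameter are illustrative and not part of any SDK.

```python
import time

class BudgetExceeded(Exception):
    pass

class Budget:
    """Tracks three independent stop conditions for an agent loop."""

    def __init__(self, max_iter, max_tokens, max_seconds, clock=time.monotonic):
        self.max_iter = max_iter
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.iterations = 0
        self.tokens = 0
        self._clock = clock
        self._start = clock()

    def charge(self, tokens):
        """Record one loop iteration and raise once any limit is crossed."""
        self.iterations += 1
        self.tokens += tokens
        if self.iterations > self.max_iter:
            raise BudgetExceeded("iteration cap")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget")
        if self._clock() - self._start > self.max_seconds:
            raise BudgetExceeded("time limit")
```

Raising an exception rather than returning a flag makes it hard for a caller to ignore the limit by accident; the loop's outer handler decides whether to abort, summarize, or escalate to a human.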
When to Use It
Open-ended tasks where the step count cannot be predicted, the environment is trusted or sandboxed, and the model’s decision-making has proven reliable on comparable tasks. Coding agents on multi-file refactors, computer-use agents, deep research agents. Not for high-stakes one-shot actions where a single wrong action is unrecoverable.
Sources
- Anthropic, Building Effective Agents --- names the pattern as the open-ended endpoint of the agentic spectrum.
Further Reading
- Anthropic, Claude on SWE-bench (technical write-ups, 2024)
- Anthropic, Computer use reference implementation
- AutoGPT (2023) and BabyAGI (2023) --- early influential autonomous-agent designs.
Example
A coding agent given a GitHub issue. It explores the repository, identifies relevant files, drafts a patch, runs the test suite, observes failures, revises the patch, repeats until tests pass or a budget is exhausted. The number of iterations and the set of files touched are both determined at runtime.
Example artifacts
Prompt.
# Autonomous-agent system prompt
You are an autonomous coding agent operating in a sandboxed repo.
Tools available: {tools}
Goal: {goal}
Operate the loop yourself: think, act, observe, repeat. Continue
until you have achieved the goal or determined it is impossible.
You may pause and call the request_human() tool only if genuinely
blocked. You have a budget of {max_iter} iterations and
{token_budget} tokens. Track your remaining budget in your scratchpad.
Terminate by calling done(<summary>) when the goal is achieved.
Agent.
def autonomous(goal, tools, max_iter=200, token_budget=500_000):
    tokens_used = 0
    context = [{"role": "user", "content": goal}]
    for i in range(max_iter):
        if tokens_used >= token_budget:
            return abort("token budget")
        response = llm(autonomous_system_prompt, context, tools=tools)
        tokens_used += response.usage.total_tokens
        action = parse_action(response)
        if action.name == "done":
            return action.args["summary"]
        if action.name == "request_human":
            return await_human(action.args)
        try:
            observation = tools[action.name](**action.args)
        except Exception as exc:
            # Feed tool failures (including unknown tool names) back as
            # observations so the agent can recover instead of crashing.
            observation = f"ERROR: {exc!r}"
        context.extend(append_turn(response, observation))
    return abort("iteration cap")
Human-in-the-Loop Checkpoint
Classification Synchronization barrier
Intent
Pause an otherwise-autonomous workflow at defined points to obtain human approval, correction, or judgment before proceeding.
Sketch
Motivating Problem
Pure autonomy has unbounded downside: a confidently wrong agent can do real damage before anyone notices. Pure manual operation has unbounded latency. The middle ground is autonomy with explicit checkpoints --- the workflow runs unattended through the routine stages and pauses at the high-stakes ones.
How It Works
Identify the points where (a) the cost of a wrong autonomous decision is high, (b) human judgment adds substantial value, or (c) a regulatory or policy requirement mandates human review. At each such point, the workflow halts, presents a structured summary to a human, and resumes only after explicit approval. The human’s response is stored in the workflow’s audit trail.
Modern orchestration frameworks (LangGraph, Temporal) implement this as durable, pausable execution: the workflow’s full state is checkpointed; a human can review hours or days later; resumption picks up exactly where the pause occurred.
Three common checkpoint placements:
- Before irreversible actions (deletes, deployments, financial transactions).
- At decision branches with high uncertainty.
- At end-of-stage gates before passing work to a downstream consumer.
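The durable-pause idea can be shown without any framework: persist the workflow state and the checkpoint payload, let a separate process record the human's decision, and resume only from what was persisted. The sketch below uses a JSON file as the store; the function names and record layout are assumptions for illustration, and a production system would use a database or a framework's own checkpointer.

```python
import json
import os
import tempfile

def suspend(path, state, payload):
    """Persist workflow state and the checkpoint payload, then stop running."""
    record = {"state": state, "payload": payload, "decision": None}
    with open(path, "w") as f:
        json.dump(record, f)

def record_decision(path, action, reviewer, reason=""):
    """Called from the review UI, possibly days later and in another process."""
    with open(path) as f:
        record = json.load(f)
    record["decision"] = {"action": action, "reviewer": reviewer, "reason": reason}
    with open(path, "w") as f:
        json.dump(record, f)

def resume(path):
    """Reload state; return it only if a human approved, otherwise raise."""
    with open(path) as f:
        record = json.load(f)
    decision = record["decision"]
    if decision is None:
        raise RuntimeError("still awaiting human review")
    if decision["action"] != "approve":
        raise PermissionError(f"rejected: {decision['reason']}")
    return record["state"]

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
suspend(path, {"stage": "send_to_client", "draft_id": "d-17"},
        {"summary": "Send draft to client?", "risk": "high"})
record_decision(path, "approve", reviewer="paralegal")
state = resume(path)
```

Because the payload and the decision live in the same record, the file doubles as an audit-trail entry for the checkpoint.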
When to Use It
Whenever the workflow’s actions cross trust, regulatory, or reversibility boundaries; whenever the cost of an unattended wrong decision exceeds the cost of human review time; whenever audit trails matter.
Sources
- LangGraph “human-in-the-loop” documentation --- the reference for durable pausable workflows in the agent ecosystem.
- Temporal documentation on durable execution and signals --- the general distributed-systems primitive HITL builds on.
Further Reading
- Anthropic, computer-use safety documentation --- HITL placement around irreversible actions.
- Production write-ups on approval workflows in autonomous coding (Cognition, Devin; Cursor agent mode).
Example
A legal-document drafting workflow runs autonomously through ingestion, clause classification, draft generation, and self-review. Before the draft is sent to the client, the workflow halts at a checkpoint: a paralegal reviews the draft, edits inline if needed, and approves. Only then does the workflow send.
Example artifacts
Agent.
# LangGraph-style decorator pseudocode; checkpoint, Checkpoint, human_response,
# and execute are illustrative names, not a real library API.
@checkpoint(approval_required=True, risk="high")
async def deploy_to_prod(plan):
    payload = Checkpoint(
        summary=f"Deploy v{plan.version} to prod?",
        diff=plan.diff,
        affected_services=plan.services,
        rollback_command=plan.rollback,
    )
    # Workflow durably suspended here. State persists for hours/days.
    decision = await human_response(payload)  # .action: 'approve' | 'reject' | 'edit'
    if decision.action == "approve":
        return await execute(plan)
    if decision.action == "edit":
        return await execute(decision.edited_plan)
    raise RejectedByHuman(decision.reason)
Skill (SKILL.md sketch).
---
name: hitl-checkpoint
description: Insert a human approval pause into an otherwise-autonomous
  workflow. Use before any irreversible action (deletes, deployments,
  outbound communication, financial transactions).
---
# Human-in-the-Loop Checkpoint
## Checkpoint payload (always include)
- summary --- one sentence describing what will happen
- risk --- low | medium | high
- diff or plan --- what changes, structured
- reversible --- true | false
- rollback --- command or steps to undo, if reversible
## Resume semantics
- approve → proceed as planned
- edit → proceed with the human's edits
- reject → record reason and abort downstream stages
Always persist the checkpoint payload AND the response to the audit log.
Deep Research Agent
Classification Hierarchical / hybrid
Intent
Decompose an open-ended research question into parallel sub-investigations; each sub-investigation runs its own ReAct-style loop over external sources; a synthesizer combines the findings into a long-form report.
Sketch
Motivating Problem
A single ReAct agent doing open-ended research hits two limits: its context window fills with search results before it has covered the topic, and it tends to commit early to one line of inquiry. Splitting the research across sub-agents --- each with its own clean context --- solves both problems.
How It Works
A planner reads the research question and emits a structured set of sub-questions. Each sub-question is delegated to a sub-agent running its own ReAct loop with search, fetch, and read tools. The sub-agent operates in its own context window, producing a focused intermediate summary with sources. When all sub-agents complete (or a budget is exhausted), a synthesizer reads the intermediate summaries and the original question, and produces the final report with citations.
The pattern is essentially Orchestrator—Workers + ReAct + Hierarchical, with the synthesis stage doing real intellectual work --- resolving contradictions across sub-agents, organizing material, deciding what is and isn’t in scope.
Production implementations add a feedback loop: the synthesizer may identify gaps and spawn additional sub-agents to fill them. This is Evaluator—Optimizer wrapped around the whole assembly.
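The gap-filling feedback loop needs a bound, or it inherits the unbounded cost profile of an autonomous agent. A minimal sketch of a bounded version follows; `investigate` and `find_gaps` are hypothetical stand-ins for the sub-agent and for the synthesizer's gap check, stubbed here so the control flow runs without a model.

```python
import asyncio

async def research_with_gap_filling(question, plan, investigate, find_gaps,
                                    max_rounds=2):
    """Run sub-agents, then follow up on reported gaps for a bounded number of rounds."""
    findings = {}
    pending = list(plan)
    for _ in range(max_rounds):
        if not pending:
            break
        results = await asyncio.gather(*(investigate(sq) for sq in pending))
        findings.update(zip(pending, results))
        # Only follow up on gaps that have not been investigated already.
        gaps = await find_gaps(question, findings)
        pending = [g for g in gaps if g not in findings]
    return findings

# Stub sub-agent and gap finder (assumptions for the demo, not real model calls).
async def investigate(sub_question):
    return f"summary of {sub_question}"

async def find_gaps(question, findings):
    return [] if "enforcement" in findings else ["enforcement"]

findings = asyncio.run(research_with_gap_filling(
    "How is AI regulated?", ["training data", "deployment"],
    investigate, find_gaps))
```

In this run the first round covers the two planned sub-questions, the gap check adds "enforcement", and the second round fills it. With `max_rounds=1` the gap would be reported but not pursued, which is the right behaviour when the budget is tight.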
When to Use It
Long-form research tasks where the question is too broad for a single agent’s context, where parallel investigation is feasible, and where the user expects a structured, cited report rather than a conversational answer.
Sources
- Anthropic, Claude’s Research feature --- the canonical productized version of the pattern (technical write-ups, 2024–2025).
- OpenAI, Deep Research product (2025) --- independent implementation.
Further Reading
- Roush et al., DeepDebater (2025) --- hierarchical research applied to competitive policy debate.
- Compare to Hierarchical Supervisor (the general pattern) and to Map-Reduce over Corpus (when the source set is known and bounded).
Example
“Compare how the EU, US, and Japan are regulating AI in 2026.” The planner emits sub-questions per jurisdiction and per regulatory dimension (training data, model registration, deployment). A dozen sub-agents run in parallel; each returns a sourced summary; the synthesizer organizes the material by dimension and produces a comparative report.
Example artifacts
Prompt.
# Planner prompt
"Decompose this research question into 5-10 independent
sub-questions. Each sub-question should be answerable in 30 minutes
of focused web search, and have a single clear scope. Avoid overlaps.
Output JSON: [{ \"id\": \"sq1\", \"question\": \"...\", \"scope\": \"...\" }]
Question: {question}"
# Sub-agent system prompt
"You research one sub-question. Use search, fetch, and read tools.
Cite every claim by URL. When you have a complete answer, call
done(summary, citations). Budget: 20 tool calls."
# Synthesizer prompt
"Original question: {question}
Sub-question findings (with citations): {findings}
Write a structured, cited report. Resolve contradictions explicitly.
Flag gaps."
Agent.
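import asyncio
import json

async def deep_research(question, sub_agent_budget=20):
    plan = json.loads(await llm(planner_prompt, question=question,
                                model="claude-opus-4-7"))
    # Each sub-agent gets its own context; gather runs them concurrently.
    results = await asyncio.gather(*[
        run_react_subagent(sq, max_iter=sub_agent_budget) for sq in plan
    ], return_exceptions=True)
    # Drop sub-agents that raised so one failure does not sink the whole report.
    findings = [r for r in results if not isinstance(r, Exception)]
    return await llm(synthesizer_prompt,
                     question=question, findings=findings,
                     model="claude-opus-4-7")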
Skill (SKILL.md sketch).
---
name: deep-research
description: Produce a long-form, cited research report on an open-ended
  question. Use when the user asks "compare," "what are the latest
  developments in," or otherwise expects a structured multi-page answer
  with citations.
---
# Deep Research Agent
## Phases (do not skip any)
1. PLAN: decompose the question into 5-10 independent sub-questions.
   Show the plan to the user; allow them to edit before launch.
2. INVESTIGATE: launch a sub-agent per sub-question. Each uses search,
   fetch, and read tools. Each writes its findings with URL citations.
3. SYNTHESIZE: assemble a structured report. Resolve contradictions
   explicitly. Flag remaining gaps.
## Budgets
- Per sub-agent: 20 tool calls, 50k tokens
- Whole run: 12 sub-agents max, 600k tokens
## Output format
Markdown with: executive summary, sections per major theme, citations as
footnotes with URLs.
Appendix A --- Pattern Composition
The patterns above almost never appear alone. The table below names some common compositions seen in production systems:
| Composition | Description |
|---|---|
| Router + Specialist Chain | A Routing front door dispatches to one of several Prompt Chains. |
| Orchestrator over ReAct workers | The orchestrator decomposes; each worker runs a small ReAct loop on its subtask. |
| Evaluator–Optimizer wrapping any generator | Any synthesis step can be wrapped in a critic loop. |
| Cascade + Generator–Verifier | Each cascade level is itself a generator–verifier; only verified outputs propagate. |
| Hierarchical Supervisor of Deep Research Agents | Used in long, multi-domain reports where each domain itself decomposes. |
| Autonomous Agent with HITL checkpoints | The default shape of “agent in production.” |
| Plan-and-Execute with ReAct executors | Plan once at the top, run each plan step as a small ReAct. |
| Map-Reduce + Evaluator–Optimizer on the reduce step | Per-document map calls feed a reduce that loops against a critic until the memo is acceptable. |
Appendix B --- A Note on Pattern Drift
Pattern names in this space drift. Agent, workflow, chain, loop, and orchestration are used inconsistently across the literature and the marketing material of the major frameworks. This catalog has tried to use the Anthropic taxonomy where it is clearest and to disambiguate where it is not. Where a pattern in this catalog overlaps substantially with a named concept elsewhere (LangGraph “subgraphs,” AutoGen “GroupChat,” CrewAI “Crews”), the Further Reading of the relevant entry says so.
The catalog is also, of course, a snapshot. The deeper agent patterns --- long-horizon planning, multi-modal grounding, agent-to-agent protocols --- are moving fast. Treat v0.2 as a starting structure, not a final word.
--- End of Patterns of AI Agent Workflows v0.2 ---