Volume 07

The Human-in-the-Loop Catalog

Volume 07 of the Agentic AI Series

17 patterns draft-v0.1 2026-05 Infrastructure

A Catalog of Approval, Observation, and Interaction

Draft v0.1

May 2026

About This Catalog

This is the seventh volume in a catalog of the working vocabulary of agentic AI. The six prior volumes covered patterns (the timing of agent runs), skills (model instructions in packaged form), tools (the function-calling primitives), events and triggers (what activates the agent), fabric (the substrate beneath orchestration), and memory (state, context, and recall). This seventh volume covers the layer that, in production, often determines whether an agent system succeeds or fails: how humans approve, observe, and interact with running agents.

“Human in the loop” is a phrase that does too much work. In practice it names three distinct engineering concerns that get conflated under one label. The first is approval --- the agent pauses, a human decides whether some action proceeds, the agent resumes. The second is observation --- the agent runs to completion while every step is captured for later audit, with no human in the synchronous path. The third is interaction --- a human steers the agent mid-flight, editing state, redirecting attention, providing feedback that becomes input to the next step. Auditors care about approval. Engineers care about observation. Users care about interaction. A serious HITL design addresses all three; most teams build one well and assume it covers the others.

The category matured rapidly between 2024 and 2026. LangGraph’s interrupt mechanism made approval gates a first-class framework primitive. Temporal’s Signal mechanism, productized for the AI agent use case, established durable approval (pause for hours or days, survive crashes) as the engineering gold standard. The observability platforms --- LangSmith, Arize Phoenix, Langfuse --- converged on a shared mental model (the trace tree) and increasingly on shared standards (OpenInference, the OpenTelemetry GenAI semantic conventions). The UI side saw Chainlit emerge as the default chat-style framework and CopilotKit as the default for embedding agents in existing React applications. Generative UI --- the agent rendering its own widgets mid-conversation --- became a viable third pattern thanks to Anthropic Artifacts and the Vercel AI SDK’s UI streaming primitives. Seven volumes in, the category is no longer young.

Scope

Coverage:

  • Approval gates in frameworks: LangGraph interrupts (both static interrupt_before and dynamic interrupt()), CrewAI human-input loops.

  • Durable approval substrates: Temporal Signals, AWS Step Functions wait-for-callback patterns.

  • Observability and tracing platforms: LangSmith, Arize Phoenix, Langfuse.

  • Open trace standards: OpenInference, OpenTelemetry GenAI semantic conventions.

  • Chat-style human-agent UIs: Chainlit, Streamlit-as-agent-UI.

  • Embedded agent UIs: CopilotKit, Vercel AI SDK + @ai-sdk/react.

  • Generative UI: Anthropic Artifacts as a delivery surface, Vercel AI SDK streaming UI components.

  • Native LLM-API HITL primitives: the tool_use ↔ tool_result handshake as the foundational pause-and-resume mechanism.

Out of scope:

  • General-purpose workflow engines (Airflow, Prefect, Dagster) when not specifically used as HITL substrates.

  • Closed enterprise approval products (ServiceNow flows, Salesforce approval processes) when not specifically integrated with agent workflows.

  • Conventional BI and observability tools (Datadog, Grafana, Splunk) when not adapted with LLM-specific semantics.

  • Frontend frameworks in general (React, Vue, Svelte). They appear here only as substrates for specific HITL libraries.

How to read this catalog

Part 1 (“The Narratives”) is conceptual orientation: the three faces of HITL, the five-point approval spectrum, why durable approval matters, the trace tree as the auditable record, and the three UI topologies. Five diagrams sit in Part 1; everything in Part 2 is text and code.

Part 2 (“The Substrates”) is reference material organized by section. Each section opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates appear in the Fowler-style template established by the prior six volumes. The entries are not meant to be read front-to-back; jump in via the table of contents to whatever matches the task at hand.

Part 1 — The Narratives

Five short essays frame the design space for human-in-the-loop. The reference entries in Part 2 assume the vocabulary established here.

Chapter 1. The Three Faces of Human in the Loop

Three different problems get filed under “human in the loop.” They use the same words and the same products and the same diagrams, but they are not the same problem, and a HITL design that handles one of them does not necessarily handle the others. Naming the three explicitly is the first move.

Figure: Three faces of human in the loop. Approval is synchronous and blocking. Observation is asynchronous and non-blocking. Interaction is bidirectional and mid-flight.

Approval is the case where a human decision sits in the synchronous path of the agent’s execution. The agent reaches a point --- a high-risk tool call, a financial transaction, an irreversible action --- and stops. A human gets a notification, reviews the proposed action, either approves or rejects, and the agent then resumes. The engineering requirement is that the agent’s state survives the wait. The human can take 30 seconds or 30 days; the agent has to be ready to pick up exactly where it left off when the decision arrives. This is what auditors care about: a chain of explicit human sign-offs at the points where the system commits to consequential actions.

Observation is the case where humans watch the agent rather than gate it. The agent runs to completion without pause; every step, every LLM call, every tool result is captured into a structured trace; engineers and analysts review the traces afterward to debug failures, monitor drift, audit decisions, or improve the agent. There is no human in the synchronous path. The engineering requirement is comprehensive instrumentation: anything that happens during a run must be reconstructable from the trace, and the trace UI must make it possible to find the interesting cases among thousands of mundane ones. This is what engineers care about: the trace is the primary debugging artifact for agent systems, the way the stack trace was for procedural code.

Interaction is the case where a human and the agent collaborate in real time. The human gives the agent a task, watches it work, intervenes when the direction looks wrong, edits intermediate state, provides feedback that the agent incorporates into its next move. The agent isn’t blocked by human input the way it is in the approval case --- it can keep working in the background --- but the human and the agent are in continuous bidirectional communication throughout the run. The engineering requirement is a UI that exposes the agent’s state in a way the human can usefully edit. This is what users care about: a productive partnership, not a button to click.

The three concerns require different engineering, and conflating them produces designs that handle none well. An approval gate is not an observability solution --- humans approving don’t have time to read full traces. An observability platform is not an interaction surface --- a trace UI shows what happened, not how to steer what’s happening. An interactive UI is not an approval workflow --- the human is not the bottleneck in the synchronous path. Picking the right product, or the right combination, starts with knowing which face of HITL the application actually needs.

Chapter 2. The Approval Spectrum

Most teams talk about approval as if it were a binary: either the agent runs autonomously, or every action requires a human OK. In practice the choice is a five-point spectrum, and the most common mistake is choosing one global setting rather than routing different actions to different points on the spectrum.

Figure: The approval spectrum. Five points from YOLO to Co-pilot. Match the approval mode to each action's reversibility, risk, and trust level.

YOLO is the case with no approval and no observation --- the agent runs, the work happens, nobody is watching. This is correct for one-off scripts, demos, and personal experiments where the consequences of failure are negligible. It is wrong for almost any production deployment, including ones that look like personal experiments at first --- “it’s just a script” is how production incidents start.

Post-hoc audit only is one step up: the agent still runs without synchronous gates, but every action is logged and reviewable. This is the right answer for low-risk batch workloads where the cost of slowing the agent down with approvals exceeds the cost of catching errors after they happen.

Risk-tiered approval is the production default for most serious agent deployments. The agent runs freely for the actions it can safely take --- reading data, drafting outputs, calling idempotent APIs --- and pauses for human approval on the actions that matter. Sending an email, executing a transaction, modifying a record, calling an external paid service: these become the explicit gates. The classification of which actions need approval and which don’t is the design work; the framework primitives (LangGraph interrupts, Temporal Signals) are the easy part. The hardest part of risk-tiering is keeping the gates consistent as the agent surface evolves --- new tools get added, the risk profile shifts, the gate definitions get out of date.

Each-action approval has the human OK every tool call. This is regulated-workflow territory --- healthcare prescription drafting, financial trade execution, legal document filing --- where the regulatory framework or the contractual obligations require explicit human accountability for every step. The cost is real: every interaction has human latency, often measured in minutes rather than seconds, which caps the throughput of the system at the rate at which the human can review. The benefit is the strongest possible audit posture and the clearest accountability chain.

Co-pilot mode inverts the usual relationship: the human drives, the agent suggests. The human makes the active decisions; the agent provides options, drafts, recommendations, or analyses. The agent is doing significant work but not committing the actions --- the human commits each action explicitly. This is the right pattern for the highest-stakes work, where the cost of an agent error is catastrophic and the cost of agent latency is acceptable. Most enterprise “AI copilots” actually live here, regardless of how their marketing frames them: the agent suggests, the human chooses.

Three axes determine where on the spectrum a particular action should sit: reversibility (can the action be undone if it’s wrong), risk (how bad is it if it’s wrong), and trust (how thoroughly this specific agent has been tested at this specific task). Production systems route different action classes to different points: a customer-support agent can YOLO its way through reading the customer’s history and drafting responses, escalate to risk-tiered approval for issuing credits or refunds up to $500, and require co-pilot mode for refunds over $500 or escalations to legal. A single global approval setting fails this routing problem; a thoughtful design embraces it.
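The routing idea can be sketched as a small policy function. This is a minimal illustration, not a reference design: the action names, the Mode enum, the route() helper, and the $500 threshold are all assumptions made for the example. Unknown actions fall through to the strictest synchronous gate, so a newly added tool is gated until someone classifies it.

```python
from enum import Enum


class Mode(Enum):
    YOLO = "yolo"
    POST_HOC_AUDIT = "post_hoc_audit"
    RISK_TIERED = "risk_tiered"
    EACH_ACTION = "each_action"
    CO_PILOT = "co_pilot"


# Hypothetical action classes for a customer-support agent.
READ_ONLY = {"read_history", "draft_response"}
MONEY_MOVING = {"issue_credit", "issue_refund"}


def route(action: str, amount_usd: float = 0.0,
          copilot_above_usd: float = 500.0) -> Mode:
    """Pick an approval mode for one action (illustrative policy only)."""
    if action in READ_ONLY:
        return Mode.YOLO                  # reversible, low risk
    if action == "escalate_to_legal":
        return Mode.CO_PILOT              # the human commits the action
    if action in MONEY_MOVING:
        if amount_usd > copilot_above_usd:
            return Mode.CO_PILOT
        return Mode.RISK_TIERED           # pause for a human approval
    return Mode.EACH_ACTION               # unclassified tool: fail closed


small_refund = route("issue_refund", amount_usd=40)
large_refund = route("issue_refund", amount_usd=750)
new_tool = route("delete_account")
```

Keeping the policy in one function (or one table) makes the gate definitions reviewable in a single place as the tool surface grows.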

Chapter 3. Durable Approval: Surviving the Wait

The hardest engineering problem in approval-style HITL is not building the pause itself --- every framework provides some mechanism for that. The hardest problem is making the pause durable: surviving system restarts, process crashes, deploys, and weekends, while the agent’s state waits for human input that may take hours or days to arrive.

Figure: Durable approval: surviving the wait. The naive approach loses state when the process restarts. Durable execution checkpoints state to an external store and survives indefinite waits.

The naive approach is to hold the agent’s state in the running process: the agent function is paused at an await, the local variables sit in memory, when the human approval arrives a callback resolves the await and execution continues. This works for the demo case where the human is sitting at the screen and approves within 30 seconds. It fails the moment the process restarts --- a deploy, a crash, a routine container rotation --- because the in-memory state is gone and the approval event arrives at a process that no longer exists. In production, processes restart all the time; this isn’t edge-case engineering, it’s the default case.

The durable-execution approach checkpoints the agent’s state to an external store after every significant step. When the agent reaches an approval gate, the current state is fully persisted; the process can crash, the cluster can restart, the human can take three days to respond, and when the approval finally arrives, the framework reconstructs the agent’s state from the checkpoint and resumes execution exactly where it stopped. The wait duration is irrelevant --- 30 seconds or 30 days, the engineering is the same.
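A minimal sketch of the checkpoint-and-resume idea, using only the Python standard library. The table layout, the run_until_gate() and resume_with_decision() helpers, and the state fields are assumptions for illustration; real frameworks persist far richer state. The point is that the two functions share nothing but the database file, so they could run in different processes days apart.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) the external checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_checkpoint(conn, run_id, state):
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
        (run_id, json.dumps(state)),
    )
    conn.commit()


def load_checkpoint(conn, run_id):
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def run_until_gate(path, run_id):
    """Agent side: do the work, persist state at the gate, then exit."""
    conn = open_store(path)
    state = {"step": "awaiting_approval", "proposed_refund": 750}
    save_checkpoint(conn, run_id, state)
    conn.close()  # nothing needed for resumption lives only in memory


def resume_with_decision(path, run_id, approved):
    """Any later process: rebuild state from the store and continue."""
    conn = open_store(path)
    state = load_checkpoint(conn, run_id)
    if state is None or state["step"] != "awaiting_approval":
        conn.close()
        raise LookupError(f"no run waiting for approval: {run_id}")
    state["step"] = "refunded" if approved else "denied"
    save_checkpoint(conn, run_id, state)
    conn.close()
    return state


db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
run_until_gate(db_path, "run-1")
final_state = resume_with_decision(db_path, "run-1", approved=True)
```

The length of the wait never appears in the code, which is the property the paragraph above describes.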

Temporal is the gold standard for this pattern. Its core abstraction --- the workflow function --- is replayable: every step’s inputs and outputs are recorded; on restart, the workflow replays its history and arrives at its current execution point; Signal events from outside the system (“the human approved”) get delivered to the running workflow regardless of how many process restarts have happened since the workflow started. For high-stakes applications --- financial transactions, healthcare workflows, anything where the wait could be days and the cost of losing state is unacceptable --- Temporal is the most-trodden production path.

LangGraph plus a Postgres checkpointer is the lightweight alternative for agent-shaped workflows. Less powerful than Temporal at the orchestration layer but with the same essential property: state persists across the wait, approvals survive restarts, the agent resumes correctly. For most agent-shaped HITL designs that don’t need Temporal’s full power (long-running workflows with complex retries and timeouts), LangGraph + Postgres is the right answer; for the cases that do (mission-critical, multi-day, multi-actor coordinations), Temporal is.

The pattern’s name in the literature is durable execution. The same pattern handles the related cases of long-running tool calls (an LLM agent calls an API that takes 20 minutes to respond), event waits (“wait for the user’s email confirmation, however long that takes”), and timer-based triggers (“if the user hasn’t responded in 24 hours, escalate to manager”). Durable execution is one of the few abstractions in agent infrastructure that genuinely matters --- a system that has it is qualitatively different from one that doesn’t. Internalize the difference once and the architectural choices become much clearer.

Chapter 4. The Trace Tree as Auditable Record

Observability is the second face of HITL, and it operates on a different timescale and serves a different audience than approval does. The platforms in this category --- LangSmith, Arize Phoenix, Langfuse --- share a mental model and increasingly share a set of standards. The mental model is the trace tree: an agent run is captured as a hierarchical structure of nested operations, each with its inputs, outputs, metadata, and timing, all queryable and replayable after the fact.

Figure: The trace tree. Every agent run produces a tree: thread > run > step > LLM call / tool call. Each level is queryable, filterable, and replayable.

The hierarchy has four levels in canonical form, with two kinds of leaf at the bottom. A Thread is the user session --- the unit of conversational continuity, typically tied to a user_id and a session identifier. A Run is one task within the thread: a single agent invocation, with its inputs, outputs, status, and end-to-end duration. A Step is one node in the agent graph: a single trip through the planner or the executor or the critic, with its own start and end times. Beneath each step sit the leaves. An LLM call is one model invocation: the prompt, the response, the token counts, the latency, the model name, the temperature. A Tool call is one function invocation: the tool name, the arguments, the result, the duration.

Production observability platforms expose this tree through filterable list views and drill-down tree views. The list view answers questions like “show me all runs in the last hour where the agent called the refund tool” or “show me the slowest 1% of runs from yesterday.” The tree view answers questions like “why did this particular run produce that wrong output --- what did the planner decide, what tools did it call, what came back from each.” Both views matter; serious debugging needs both.
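As a rough sketch, the tree and a list-view style query over it might look like the following. The Span class, its field names, and the runs_calling_tool() helper are illustrative assumptions, not the data model of any particular platform.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    kind: str            # "thread", "run", "step", "llm_call", or "tool_call"
    name: str
    duration_ms: float = 0.0
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Depth-first traversal: the basis of the drill-down tree view."""
    yield span
    for child in span.children:
        yield from walk(child)


def runs_calling_tool(thread: Span, tool_name: str) -> List[Span]:
    """List-view query: every run in the thread that called a given tool."""
    return [
        run for run in thread.children
        if run.kind == "run"
        and any(s.kind == "tool_call" and s.name == tool_name
                for s in walk(run))
    ]


thread = Span("thread", "alice-session-42", children=[
    Span("run", "run-1", 1800.0, [
        Span("step", "planner", 600.0, [Span("llm_call", "plan", 580.0)]),
        Span("step", "executor", 1200.0, [Span("tool_call", "refund", 900.0)]),
    ]),
    Span("run", "run-2", 400.0, [
        Span("step", "planner", 400.0, [Span("llm_call", "plan", 390.0)]),
    ]),
])

refund_runs = runs_calling_tool(thread, "refund")
slowest_run = max(thread.children, key=lambda run: run.duration_ms)
```

Here refund_runs contains only run-1, which is also the slowest run; the tree view for that run is simply walk(refund_runs[0]).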

The trace is increasingly the audit artifact, not just the debug artifact. For regulated industries, the trace tree is the documentation of why the agent did what it did --- the evidence that the system behaved within policy. The legibility of the trace becomes a compliance concern: can a non-engineer auditor read it and understand what happened. The newer observability platforms (Langfuse particularly) lean into this with audit-friendly UIs, explicit feedback capture (thumbs-up / thumbs-down logged against specific spans), and the ability to attach reviewer comments and resolution notes to individual runs.

Two standards matter for cross-platform portability. OpenInference is the trace-format standard maintained by Arize, used as the native format by Phoenix and supported as an export format by LangSmith and Langfuse. OpenTelemetry’s GenAI semantic conventions are the broader cross-vendor standard for LLM-specific telemetry, with first-class support for LLM call attributes (model name, token counts, prompt and completion contents) that conventional OTel doesn’t handle natively. As of mid-2026 both standards are converging; an agent application instrumented with either can export to multiple observability backends with limited rework.
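To make the idea of LLM-specific attributes concrete, here is a sketch that maps one LLM call record onto attribute keys in the style of the OpenTelemetry GenAI semantic conventions. The LLMCall class and to_genai_attributes() helper are hypothetical, and the attribute names reflect the conventions as understood at the time of writing; the conventions are still evolving, so check the current specification before relying on any key.

```python
from dataclasses import dataclass
from typing import Dict, Union

AttributeValue = Union[str, int, float]


@dataclass
class LLMCall:
    model: str
    temperature: float
    input_tokens: int
    output_tokens: int


def to_genai_attributes(call: LLMCall) -> Dict[str, AttributeValue]:
    """Flatten one LLM call into span attributes (names are assumptions)."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": call.model,
        "gen_ai.request.temperature": call.temperature,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
    }


attrs = to_genai_attributes(LLMCall("example-model", 0.2, 1200, 340))
total_tokens = (attrs["gen_ai.usage.input_tokens"]
                + attrs["gen_ai.usage.output_tokens"])
```

Because the keys are plain strings on a span, any backend that understands the convention can filter and aggregate on them without knowing which framework produced the trace.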

Chapter 5. The UI is the API

The third face of HITL is interaction --- humans collaborating with agents in real time --- and unlike the first two, it doesn’t reduce to a primitive that the framework provides. The product surface where the human meets the agent has to be designed; the question is where it lives.

Figure: Three UI topologies. Separate chat surface (Chainlit, Streamlit). Embedded agent in existing product (CopilotKit, Vercel AI SDK). Generative UI (Artifacts, AI SDK 5).

The separate-chat topology is the dominant default. The agent gets its own URL, its own surface, its own chat-window UI. ChatGPT and Claude.ai are the canonical examples for vendor products; Chainlit and Streamlit are the canonical frameworks for teams building their own. The pattern’s strength is its simplicity --- the agent doesn’t need to fit into existing application UX, the interaction model is the well-understood chat metaphor, the framework provides nearly everything. The pattern’s weakness is that a separate surface is a separate destination, and users who don’t already think “the agent is where I solve this” won’t go there.

The embedded-agent topology puts the agent inside an existing product as a sidebar, a panel, or a contextual overlay. CopilotKit is the canonical React framework for this; the Vercel AI SDK with @ai-sdk/react is the canonical alternative. The agent has access to the application’s state (the document the user is editing, the row they’re viewing, the project they’re working in), and the human interacts with the agent in the context of the work they’re already doing. The pattern’s strength is that it augments existing UX rather than competing with it. The pattern’s weakness is the engineering cost --- deep integration with an existing product’s state means design choices that the framework can’t make for you.

The generative-UI topology has the agent render its own custom UI per turn rather than working within a fixed chat window. The user asks a question; the agent decides the answer is best presented as a chart, a form, an interactive table, or a draft document with edit controls; the agent emits the corresponding UI element, which renders inline alongside the conversation. Anthropic Artifacts is the canonical user-facing surface for this pattern (Claude renders an Artifact when the question warrants one); Vercel AI SDK 5 productizes it for general use with React Server Components streamed from the agent. The pattern’s strength is its expressive range: any UI the application can render, the agent can render too. The pattern’s weakness is the design discipline required to choose well --- a generative UI used carelessly produces a chaotic interface with no consistent vocabulary.

Most production applications combine two topologies. The embedded-agent pattern carries the routine work --- the agent helps in the sidebar with most tasks. Generative UI kicks in for the moments where a custom widget is genuinely the right interaction (a complex form to fill out, a multi-step plan to review and edit, a chart to inspect). The separate-chat surface remains useful as the destination for unstructured exploration and as the fallback for queries that don’t fit the embedded pattern. A serious HITL design picks two topologies and integrates them deliberately rather than defaulting to one and ignoring the others.

The deeper observation about HITL UI: the UI is the API. Whatever the agent can do, the human interaction with the agent is mediated through the UI surface; whatever shape the UI imposes, the human collaboration takes that shape. A chat-only UI imposes a turn-based, text-mediated, low-bandwidth collaboration. An embedded-with-generative-UI design admits richer, faster, more visual collaboration. The choice of UI topology is therefore not a presentation decision; it’s an architecture decision that determines the upper bound of how productively human and agent can work together. Treat it as such.

Part 2 — The Substrates

Eight sections follow. Each opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates are presented in the same Fowler-style template used by the prior six catalogs.

Sections at a glance

  • Section A --- Approval gates in frameworks

  • Section B --- Durable approval substrates

  • Section C --- Observability and tracing platforms

  • Section D --- Open trace standards

  • Section E --- Chat-style human-agent UIs

  • Section F --- Embedded agent UIs

  • Section G --- Generative UI

  • Section H --- Native LLM-API HITL primitives

Section A — Approval gates in frameworks

How agent frameworks let you pause for human approval and resume

Every major agent framework now ships some primitive for the pause-resume pattern. LangGraph’s interrupt mechanism is the most-cited reference --- both the static interrupt_before configuration (“stop before this node and wait”) and the dynamic interrupt() function call (“stop from inside a node based on runtime conditions”) cover the common cases. CrewAI provides human_input as an Agent or Task option, with a simpler model: at marked points, the framework prompts the human and feeds the response into the agent’s next turn. AutoGen’s human_input_mode parameter exposes three modes on its UserProxyAgent (ALWAYS, NEVER, TERMINATE) that route human-approval behavior at the message level.

The framework primitives are the easy part. The hard parts --- making the pause durable across crashes (Section B), capturing the approval event into the trace (Section C), and surfacing the approval request to the right human via the right UI (Sections E through G) --- live in the surrounding infrastructure. Picking a framework with a good approval primitive is necessary but not sufficient for a serious HITL design.

LangGraph interrupts (static and dynamic)

Source: github.com/langchain-ai/langgraph (MIT; Python and TypeScript)

Classification: Pause-and-resume primitives in the LangGraph state-machine framework.

Intent

Let the agent author mark places in the graph where execution should pause for human input, and let the runtime persist the paused state for resumption when input arrives.

Motivating Problem

LangGraph models agent execution as a graph of nodes operating on typed state. Within that model, the natural place for an approval gate is between specific node transitions: pause before invoking the tool-execution node, pause before committing to a destructive action, pause whenever the agent’s next move would cross a defined risk threshold. LangGraph exposes two ways to mark these gates.

How It Works

The static interrupt_before option on graph.compile() lists nodes before which execution should pause unconditionally. The runtime persists the state via the configured checkpointer (in-memory, SQLite, Postgres, LangGraph Cloud), surfaces an interrupt event to the caller, and waits. Any edits the human makes are written to the stored thread with graph.update_state(). When the caller then invokes graph.stream(None, config=thread_config) with the same thread_id, execution resumes from the persisted checkpoint with those modifications applied.

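The dynamic interrupt() function (added in late 2024) is callable from inside any node. When invoked, it pauses execution at that point and surfaces the value passed in as the payload to the caller. To resume, the caller invokes the graph with Command(resume=value) on the same thread_id; the node then re-runs from its beginning, and this time interrupt() returns the resume value instead of pausing, so any side effects placed before the interrupt() call should be idempotent. Dynamic interrupts let the pause condition depend on runtime state --- “pause only if the transaction amount exceeds $1,000,” “pause only if the agent’s confidence is below 0.7” --- in ways the static configuration can’t express.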

The durability comes from the checkpointer choice (Section B touches on durable backends; LangGraph itself defers to the persistence layer). For development the in-memory checkpointer is enough; for production the Postgres checkpointer is the recommended default, with LangGraph Cloud as the managed alternative.

When to Use It

Any LangGraph agent that needs approval gates. The static interrupt_before pattern when the gate location is fixed at design time; the dynamic interrupt() pattern when the gate condition depends on runtime state.

Alternatives --- Temporal Signals (Section B) when the wait may be measured in days and the application benefits from Temporal’s broader durable-execution surface. AutoGen human_input_mode for the conversational pattern. CrewAI human_input for the simpler check-the-box-on-a-task pattern.

Sources

  • langchain-ai.github.io/langgraph/concepts/human_in_the_loop/

  • langchain-ai.github.io/langgraph/how-tos/human_in_the_loop/breakpoints/

Example artifacts

Code.

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt

# AgentState, planner_node, tools_node, and DB_URL are defined elsewhere.

# Static: pause before any execution of the 'tools' node
builder = (
    StateGraph(AgentState)
    .add_node("planner", planner_node)
    .add_node("tools", tools_node)
    .add_edge(START, "planner")  # the graph needs an entry point
    .add_edge("planner", "tools")
    .add_edge("tools", END)
)

# Dynamic: pause from inside a node, only when a condition holds
# (register it with add_node like the nodes above to use it)
def refund_node(state: AgentState):
    if state["refund_amount"] > 500:
        approval = interrupt({
            "question": "Refund exceeds threshold. Approve?",
            "amount": state["refund_amount"],
            "customer": state["customer_id"],
        })
        if not approval["approved"]:
            return {"status": "denied", "reason": approval.get("reason")}
    # ... proceed with refund ...
    return {"status": "refunded"}

with PostgresSaver.from_conn_string(DB_URL) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first use
    graph = builder.compile(
        checkpointer=checkpointer,
        interrupt_before=["tools"],  # pause every time, unconditionally
    )

    # Resume a dynamic interrupt after human approval.
    # (A static breakpoint resumes with graph.stream(None, config=...).)
    thread_config = {"configurable": {"thread_id": "alice-session-42"}}
    for event in graph.stream(
        Command(resume={"approved": True}),
        config=thread_config,
    ):
        print(event)

CrewAI human_input

Source: github.com/crewAIInc/crewai (Python, MIT)

Classification: Per-agent or per-task human-input gate.

Intent

Let the crew author mark specific agents or tasks as requiring human review of the output before the crew advances.

Motivating Problem

CrewAI’s model is task-shaped rather than graph-shaped: a crew is a list of tasks, each with an assigned agent. The natural granularity for approval in this model is the task: when this specific task completes, pause for a human to review the output and either accept it or send the task back with revision instructions.

How It Works

On any Task or Agent, set human_input=True. When the task completes, CrewAI prints the task output to the configured I/O surface and waits for the human to respond with either an acceptance or revision instructions. Revision instructions become input to the agent for a re-run of the task; acceptance allows the crew to advance to the next task.

The pattern works well for the regulated-workflow case where each task represents a discrete review-and-approve unit: a legal document drafted, a financial recommendation prepared, a customer email composed. It works less well for fine-grained tool-level approvals --- those need LangGraph’s interrupt model or the framework-agnostic approaches in Section H.

When to Use It

CrewAI deployments where the natural approval grain is the task. Co-pilot-style workflows where the agent prepares deliverables and the human accepts or revises.

Alternatives --- LangGraph interrupts when the approval grain is finer than the task. Manual orchestration when the workflow doesn’t fit CrewAI’s task model.

Sources

  • docs.crewai.com/concepts/tasks

  • docs.crewai.com/concepts/agents

Example artifacts

Code.

from crewai import Agent, Task, Crew

# contract_text is assumed to be loaded elsewhere (the contract as a string).

legal_reviewer = Agent(
    role="Legal Reviewer",
    goal="Review contract drafts for compliance issues",
    backstory="Senior corporate lawyer with 15 years of experience",
)

review_task = Task(
    # {contract} is filled in from the inputs passed to kickoff()
    description="Review the following contract for compliance issues:\n{contract}",
    expected_output="A bulleted list of issues with severity ratings.",
    agent=legal_reviewer,
    human_input=True,  # pause after this task for human review
)

crew = Crew(agents=[legal_reviewer], tasks=[review_task])
result = crew.kickoff(inputs={"contract": contract_text})

AutoGen human_input_mode

Source: github.com/microsoft/autogen (Python, MIT)

Classification: Per-agent human-input policy in the conversational AutoGen model.

Intent

Configure each AutoGen agent’s human-input behavior through one of three modes: ALWAYS (every message requires human review), NEVER (fully autonomous), TERMINATE (human input only when the conversation reaches a terminating state).

Motivating Problem

AutoGen’s agents communicate via messages. The natural granularity for human input in this model is the message: when an agent is about to send a message, does the human get to review or modify it first? The three modes on UserProxyAgent answer this question with different defaults for different deployment styles.

How It Works

On the UserProxyAgent (or a custom subclass), set human_input_mode to one of ALWAYS, NEVER, or TERMINATE. In ALWAYS mode, every agent message pauses for human input --- the agent’s response becomes a draft the human can edit before sending. In NEVER mode, the agent runs fully autonomously. In TERMINATE mode, the agent runs autonomously until it produces a message matching the termination condition, at which point the human gets to either confirm termination or override and continue.

TERMINATE mode is the production default for most AutoGen deployments: the agent works through a task autonomously, the human reviews the final output, and either ends the run or redirects.
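The three modes can be modeled as a small decision function. This is a simplified sketch of the behavior described above, not AutoGen's implementation or API: next_reply(), its parameters, and the convention that an empty human answer confirms termination are all assumptions made for the example.

```python
from typing import Callable, Optional

MODES = {"ALWAYS", "NEVER", "TERMINATE"}


def next_reply(mode: str,
               is_termination: bool,
               ask_human: Callable[[], str],
               auto_reply: Callable[[], str]) -> Optional[str]:
    """Return the next message to send, or None to end the conversation."""
    if mode not in MODES:
        raise ValueError(f"unknown human_input_mode: {mode}")
    if mode == "ALWAYS":
        return ask_human()                # the human sees every turn
    if mode == "NEVER":
        return None if is_termination else auto_reply()
    # TERMINATE: autonomous until a terminating message arrives
    if not is_termination:
        return auto_reply()
    answer = ask_human()                  # confirm the end, or redirect
    return answer or None                 # empty answer confirms termination


def auto():
    return "auto-generated reply"


mid_run = next_reply("TERMINATE", False, lambda: "unused", auto)
confirmed_end = next_reply("TERMINATE", True, lambda: "", auto)
redirected = next_reply("TERMINATE", True, lambda: "also check clause 4", auto)
```

In this model mid_run is the automatic reply, confirmed_end is None, and redirected carries the human's new instruction back into the conversation.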

When to Use It

AutoGen-based agent systems. The TERMINATE mode for the typical “run-to-completion-with-final-review” pattern; ALWAYS for high-stakes work where every message gets reviewed; NEVER for trusted autonomous workflows.

Alternatives --- LangGraph for fine-grained tool-level gates beyond AutoGen’s message-level granularity. Temporal Signals when the wait may be long enough to require durable execution.

Sources

  • microsoft.github.io/autogen/docs/topics/human-in-the-loop

Section B — Durable approval substrates

Pause-and-resume that survives crashes, deploys, and multi-day waits

When the approval wait is measured in seconds, any framework’s in-process pause works. When the wait is measured in hours or days, the in-process pause stops being safe --- process restarts during the wait become normal events that the system must handle, not exceptions. Durable execution substrates persist the agent’s state to an external store on every step, so a process restart loses nothing and a multi-day wait is the same engineering as a 30-second wait.

Two substrates dominate: Temporal (the gold standard, especially for financial and healthcare workflows where the wait can be days and the cost of state loss is unacceptable) and AWS Step Functions wait-for-callback (the cloud-native answer for AWS-shaped deployments). Both implement the same pattern --- state persisted, external event resumes execution --- with different operational profiles and feature surfaces.

Temporal (Signals for HITL)

Source: github.com/temporalio/temporal (MIT; Go server with SDKs in many languages)

Classification Durable workflow orchestrator; Signal mechanism is the HITL primitive.

Intent

Provide durable execution for long-running workflows, with Signals as the mechanism by which external events (including human approvals) deliver into a running workflow regardless of how many process restarts have happened.

Motivating Problem

Production HITL for high-stakes workflows must tolerate every kind of interruption: process crashes, deploys, AWS region failures, the human approver going home for the weekend. The naive approach --- keep the workflow state in memory and wait --- fails the moment the process exits. Temporal’s answer is replayable workflows: every step’s inputs and outputs are recorded in an external history; on restart, the framework replays the history to reconstruct the workflow’s current execution point; external events (Signals) get delivered to the live workflow at whichever location it’s currently waiting.

How It Works

A workflow function is written in code (Python, TypeScript, Java, Go) using the Temporal SDK. The function looks like ordinary code with awaits and conditions; the framework transforms its execution into a record of inputs and outputs persisted to the Temporal server. When the workflow awaits workflow.wait_condition(), execution pauses until the condition becomes true, typically because a handler declared with @workflow.signal has updated workflow state. The framework remembers exactly where the workflow paused, so the worker process can crash and restart without losing any progress.

Signals are typed messages an external system sends to a specific workflow run. For HITL: when an agent reaches an approval gate, the workflow calls await workflow.wait_condition(lambda: self.approval_received) and pauses. When the human clicks Approve, the approval UI sends a Signal to the workflow (in the Python SDK, await client.get_workflow_handle(workflow_id).signal("approve_refund", payload)). The Signal handler sets self.approval_received = True; the wait_condition resolves; execution continues.
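The replay idea can be shown without the Temporal SDK. The following is a toy standard-library model, not Temporal’s actual mechanism: the history is a plain list standing in for the external store, and Paused, refund_workflow, and run are hypothetical names. It illustrates why a restarted worker neither repeats completed work nor loses its place at the approval gate.

```python
class Paused(Exception):
    """Raised when the workflow reaches a wait and no signal has arrived yet."""


def refund_workflow(history: list, side_effects: list) -> str:
    """Deterministic workflow body; every step consults the history first."""
    # Step 1: an "activity". On replay the recorded result is reused, not re-run.
    prepared = next((e["result"] for e in history if e["type"] == "activity"), None)
    if prepared is None:
        side_effects.append("prepare_refund")  # the real work happens exactly once
        prepared = "refund-ready"
        history.append({"type": "activity", "result": prepared})

    # Step 2: approval gate. Pause unless a signal is already in the history.
    signal = next((e for e in history if e["type"] == "signal"), None)
    if signal is None:
        raise Paused()
    return "executed" if signal["approved"] else "denied"


def run(history: list, side_effects: list) -> str:
    """One worker attempt: run (or replay) the workflow against the history."""
    try:
        return refund_workflow(history, side_effects)
    except Paused:
        return "waiting"


durable_history: list = []  # stands in for the external history store
effects: list = []          # records which real side effects were performed

first = run(durable_history, effects)          # worker 1 runs, then "crashes"
after_restart = run(durable_history, effects)  # worker 2 replays the history

# The approval UI delivers a signal by appending to the durable history.
durable_history.append({"type": "signal", "approved": True})
final = run(durable_history, effects)
```

The side-effect list ends up with a single entry even though the workflow body ran three times, which is the property that makes a multi-day wait safe across restarts.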

Operational maturity is the differentiator. Temporal runs in production at major financial services firms, healthcare companies, and large-scale tech platforms with workflows that span days to weeks. Built-in retry, timeout, and compensation primitives cover the workflow patterns that pure agent frameworks handle clumsily.

When to Use It

High-stakes HITL workflows where the wait may be measured in days and the cost of losing state is unacceptable. Multi-step workflows with complex retry, timeout, and compensation requirements. Workflows that span multiple actors (multiple humans, multiple systems) coordinating over extended periods.

Alternatives --- LangGraph + Postgres for lighter-weight agent-shaped workflows where Temporal’s broader power isn’t needed. AWS Step Functions for AWS-native deployments where the integration with other AWS services matters more than cross-cloud portability.

Sources

  • github.com/temporalio/temporal

  • docs.temporal.io/encyclopedia/human-in-the-loop

  • temporal.io/blog/human-in-the-loop-agentic-systems

Example

A high-value wire transfer workflow at a financial services firm. The agent prepares the transfer; the workflow calls await workflow.wait_condition(lambda: self.approved) and pauses. Two human approvers receive notifications. The first approver signs in, reviews the transfer, and sends an approve_first Signal. The second approver reviews it a few hours later and sends a second approval Signal. Once both approvals are recorded, the Signal handlers set self.approved, the wait_condition becomes true, and the workflow proceeds to execute the transfer via the bank’s API. If the worker process crashes at any time during the multi-hour wait, the workflow resumes from the persisted state on a new worker process with no loss of progress.
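The two-approver condition in this example needs slightly more state than a single boolean. A minimal sketch of that state follows, written as plain Python rather than as a Temporal workflow; DualApprovalGate and its fields are hypothetical names, and in a real workflow the record method would be called from Signal handlers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DualApprovalGate:
    """Tracks approvals until the required number of distinct approvers is met."""

    required: int = 2
    approvers: set = field(default_factory=set)
    rejected_by: Optional[str] = None

    def record(self, approver: str, approved: bool) -> None:
        if approved:
            self.approvers.add(approver)  # a set ignores duplicate approvals
        else:
            self.rejected_by = approver   # any rejection blocks the transfer

    @property
    def decided(self) -> bool:
        """True once the workflow has enough information to stop waiting."""
        return self.rejected_by is not None or len(self.approvers) >= self.required

    @property
    def approved(self) -> bool:
        return self.rejected_by is None and len(self.approvers) >= self.required


gate = DualApprovalGate()
gate.record("alice", True)
after_first = gate.approved
gate.record("alice", True)   # the same person approving twice does not count
after_duplicate = gate.approved
gate.record("bob", True)
after_second = gate.approved
```

Waiting on decided rather than approved lets the workflow wake up on a rejection as well as on the second approval.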

Example artifacts

Code.

# Python workflow (Temporal SDK)
# RefundRequest, RefundResult, prepare_refund, and execute_refund are
# assumed to be defined elsewhere in the application.

from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client


@dataclass
class ApprovalInput:
    approver: str
    approved: bool
    reason: str = ""


@workflow.defn
class RefundWorkflow:
    def __init__(self) -> None:
        self.approval: ApprovalInput | None = None

    @workflow.signal
    def submit_approval(self, payload: ApprovalInput) -> None:
        # Signal handler: records the human decision delivered from outside
        self.approval = payload

    @workflow.run
    async def run(self, refund_request: RefundRequest) -> RefundResult:
        # Step 1: agent prepares the refund (an Activity, retried by Temporal)
        prepared = await workflow.execute_activity(
            prepare_refund, refund_request,
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Step 2: pause for human approval - may wait hours or days
        await workflow.wait_condition(lambda: self.approval is not None)
        if not self.approval.approved:
            return RefundResult(status="denied", reason=self.approval.reason)

        # Step 3: execute the refund
        result = await workflow.execute_activity(
            execute_refund, prepared,
            start_to_close_timeout=timedelta(minutes=10),
        )
        return result


# From the approval UI, when the human clicks Approve:
async def on_approve_clicked(workflow_id: str, approver: str):
    client = await Client.connect("temporal:7233")
    handle = client.get_workflow_handle(workflow_id)
    await handle.signal(
        RefundWorkflow.submit_approval,
        ApprovalInput(approver=approver, approved=True),
    )

AWS Step Functions wait-for-callback

Source: aws.amazon.com/step-functions (managed; SDKs in all AWS-supported languages)

Classification Managed workflow orchestrator; .waitForTaskToken pattern is the HITL primitive.

Intent

Provide AWS-native durable workflow orchestration, with the task-token callback pattern as the mechanism for pausing a workflow until an external event (including human approval) triggers resumption.

Motivating Problem

For AWS-native deployments, Step Functions is the path of least resistance for durable workflow orchestration. The wait-for-callback pattern (.waitForTaskToken in the Step Functions integration syntax) is the equivalent of Temporal’s Signal: the workflow pauses, an external system holds a token, when the external system invokes SendTaskSuccess with the token, the workflow resumes.

How It Works

A state in the state machine is defined with the .waitForTaskToken modifier (e.g., “Resource”: “arn:aws:states:::sqs:sendMessage.waitForTaskToken”). When the workflow reaches this state, Step Functions invokes the underlying service (in this example, SQS) and the workflow pauses. The state’s Parameters must pass the task token, read from the context object as $$.Task.Token, into the outgoing payload so the receiver can hold on to it. The external system (the approval UI, an SQS consumer, or any code path) calls SendTaskSuccess(taskToken, output) when ready. The workflow resumes with the provided output as the state’s result.

Common HITL pattern: a state of type Task with the wait-for-callback variant sends a notification (via SNS, SES, or Lambda) to the approver, including the task token in the notification payload. The approval UI calls SendTaskSuccess when the human approves or SendTaskFailure when they reject. Step Functions handles the persistence, so the human can take days. The wait is bounded by the state’s TimeoutSeconds if one is set and otherwise by the execution’s maximum duration, which is one year for Standard workflows. Express workflows do not support the callback pattern.
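As a concrete illustration, the sketch below builds a minimal state machine definition as a Python dictionary and shows the callback side with boto3. The queue URL, state names, the three-day timeout, and the resolve_approval helper are assumptions made for the example; the Resource string, the $$.Task.Token path, and the send_task_success and send_task_failure calls are the parts taken from Step Functions itself.

```python
import json

# Hypothetical queue that the approval service consumes.
APPROVAL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/approvals"

state_machine = {
    "StartAt": "WaitForApproval",
    "States": {
        "WaitForApproval": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                "QueueUrl": APPROVAL_QUEUE_URL,
                "MessageBody": {
                    "Request.$": "$",                # the state's input
                    "TaskToken.$": "$$.Task.Token",  # token from the context object
                },
            },
            "TimeoutSeconds": 259200,  # give up after 3 days (assumed policy)
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotApproved"}],
            "Next": "ExecuteAction",
        },
        "ExecuteAction": {"Type": "Succeed"},
        "NotApproved": {"Type": "Fail", "Error": "ApprovalDenied"},
    },
}

definition_json = json.dumps(state_machine, indent=2)


def resolve_approval(task_token: str, approved: bool, approver: str) -> None:
    """Called by the approval UI backend once the human has decided."""
    import boto3  # imported lazily so the definition can be built without AWS deps

    sfn = boto3.client("stepfunctions")
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approver": approver}),
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error="Rejected",
            cause=f"Rejected by {approver}",
        )
```

The Catch clause routes both an explicit rejection and a timeout to the same failure state; a production definition would likely distinguish the two.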

When to Use It

AWS-native deployments where Step Functions is already the chosen workflow orchestrator. Workflows that integrate with AWS services (Lambda, SQS, SNS, Bedrock) and want the managed-orchestration story. Cases where cross-cloud portability isn’t a requirement.

Alternatives --- Temporal when cross-cloud portability or feature breadth matters. LangGraph + Postgres for lighter-weight agent-shaped workflows.

Sources

  • docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html

Section C — Observability and tracing platforms

Where the trace tree lives: LangSmith, Phoenix, Langfuse

Three platforms dominate agent observability as of mid-2026. LangSmith is the LangChain-native commercial offering, with the deepest integration with LangChain and LangGraph. Arize Phoenix is the open-source competitor, with OpenTelemetry as its trace transport and self-hosted-or-cloud as the deployment options. Langfuse is the second open-source competitor, particularly popular in European deployments, with a strong UI and explicit human-feedback capture.

All three converge on the same mental model (the trace tree) and increasingly on the same standards (OpenInference, OpenTelemetry GenAI semantic conventions). The differentiation is in the UI, the deployment model, and the ecosystem fit. For new agent projects, the right question is not which platform is best in the abstract but which one fits the stack the team is already on --- LangChain users default to LangSmith, OpenTelemetry-instrumented stacks default to Phoenix, European or audit-conscious deployments often default to Langfuse.

LangSmith

Source: smith.langchain.com (LangChain Inc.; commercial SaaS plus self-hosted option)

Classification Tracing and evaluation platform for LangChain and LangGraph agents.

Intent

Provide the observability story for LangChain and LangGraph applications: granular nested traces of every agent run, dataset-based evaluation, prompt versioning, human-feedback capture, and the audit surface for production agent systems.

Motivating Problem

LangChain and LangGraph are among the most widely used agent frameworks; LangSmith is the first-party observability platform from the same company. The integration is the deepest of any tracing platform: every LangChain or LangGraph call is auto-traced; the trace tree mirrors the framework’s execution model exactly; tools for prompt versioning, evaluation, and human review of runs are first-class. For teams already on LangChain or LangGraph, LangSmith is the path of least friction; for teams not on those frameworks, it works but doesn’t differentiate.

How It Works

Set LANGSMITH_API_KEY and LANGSMITH_TRACING=true; every LangChain or LangGraph operation auto-traces to the configured project. The UI shows the trace tree (thread → run → step → LLM call → tool call), filterable by time, status, latency, or token count. Drill-down on any run shows full inputs, outputs, intermediate state, and timing.

Beyond raw tracing: dataset features let teams capture sets of input/output pairs and run them through new agent versions for regression testing. Evaluators (built-in and custom) score outputs against criteria. Annotation queues route specific runs to human reviewers who tag them with thumbs-up or thumbs-down feedback. Prompt versioning tracks every iteration of a prompt with diff views and the ability to roll back.
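The trace tree that all of these features operate on is a simple recursive structure. The sketch below is a standard-library model of it, not LangSmith’s data model or SDK; Span, walk, total_tokens, slow_spans, and the sample run are hypothetical and exist only to show what filtering by latency or summing token counts over a run amounts to.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str          # e.g. "run", "llm", "tool"
    latency_ms: int
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)


def walk(span: Span):
    """Yield the span and all of its descendants in pre-order."""
    yield span
    for child in span.children:
        yield from walk(child)


def total_tokens(root: Span) -> int:
    return sum(s.tokens for s in walk(root))


def slow_spans(root: Span, threshold_ms: int) -> list:
    """Names of spans at or above the latency threshold, in pre-order."""
    return [s.name for s in walk(root) if s.latency_ms >= threshold_ms]


# A hypothetical run with one planning call, one tool call, one summary call.
trace = Span(
    "refund-run", "run", 2300,
    children=[
        Span("planner", "llm", 900, tokens=412),
        Span("lookup_order", "tool", 1200),
        Span("summarize", "llm", 200, tokens=150),
    ],
)

run_tokens = total_tokens(trace)
slow = slow_spans(trace, 1000)
```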

When to Use It

Production LangChain and LangGraph deployments. Teams that want first-party tooling from the framework vendor. Workflows where prompt iteration and evaluation are heavy and the integrated platform is worth the commercial cost.

Alternatives --- Arize Phoenix for the open-source option with OpenTelemetry support. Langfuse for the open-source option with stronger feedback-capture features. Both work fine with LangChain via instrumentation.

Sources

  • docs.smith.langchain.com

Arize Phoenix

Source: github.com/Arize-ai/phoenix (Elastic License 2.0; Python; runs locally or self-hosted)

Classification Open-source observability platform with OpenTelemetry transport.

Intent

Provide LangSmith-class observability features as open-source software, with OpenTelemetry as the trace transport and the option to run entirely on-premises or in a local notebook.

Motivating Problem

Many teams cannot send their agent traces to a third-party SaaS for compliance, cost, or principle reasons. Phoenix is the open-source answer: the same trace-tree mental model, the same drill-down and filtering UI, the same evaluation and feedback features, but running on the team’s own infrastructure. The use of OpenTelemetry as the transport means any OTel-instrumented agent framework can send traces to Phoenix --- not just LangChain.

How It Works

Install arize-phoenix with pip, then either run px.launch_app() in a notebook for local mode or deploy the server with docker-compose for self-hosted mode. Configure the agent framework with the OpenInference instrumentation library for that framework (openinference-instrumentation-langchain, -openai, -llama-index, -autogen, and many more); traces flow into Phoenix over OpenTelemetry (OTLP).

The UI has the same shape as LangSmith’s: trace tree views, filterable list views, drill-down into spans with full input/output capture. The evaluations surface lets teams run LLM-as-judge or rule-based evaluators against captured traces. The notebook integration is unusual: Phoenix runs inside Jupyter or Colab for the prototype workflow, which makes the on-ramp shorter than with competing platforms.

Operational: data stays in the deployer’s infrastructure (local disk for the notebook mode, configurable backend for self-hosted). No external SaaS dependency. Arize the company sells managed Phoenix and other AI observability products on top, but the open-source Phoenix is the same code.

When to Use It

Open-source-only or self-hosted requirements. OpenTelemetry-instrumented agent stacks where Phoenix’s OTel-native architecture matches. Local development workflows where launching the trace UI in a notebook beats managing a separate service. Teams that need the audit posture of “the traces never leave our infrastructure.”

Alternatives --- LangSmith for the LangChain-first managed option. Langfuse for the open-source competitor with stronger UI features for feedback capture.

Sources

  • github.com/Arize-ai/phoenix

  • docs.arize.com/phoenix

Langfuse

Source: github.com/langfuse/langfuse (MIT; TypeScript and Python SDKs; self-hosted or cloud)

Classification Open-source LLM engineering platform with strong UI and feedback capture.

Intent

Provide an open-source observability and evaluation platform with first-class human-feedback capture (thumbs up/down, custom scores), prompt management, and dataset-based evaluation, deployable as self-hosted or as Langfuse Cloud.

Motivating Problem

Where Phoenix’s position emphasizes OpenTelemetry-native integration and notebook-first development, Langfuse’s position emphasizes UI quality and explicit human-feedback workflows. The dashboards are visually polished; the feedback capture is built around the production case of “users thumb up or down on agent responses” and the audit case of “a reviewer scores a sample of runs each day”; the prompt-management features include first-class versioning with diff views.

How It Works

Install langfuse SDK; instrument the application with explicit Langfuse calls (langfuse.trace, langfuse.span, langfuse.generation) or use the OpenTelemetry integration; traces flow to the Langfuse server. The UI shows traces, scores, prompt versions, dataset runs, and dashboards. Score capture is a first-class operation: an application can call langfuse.score() with arbitrary metric names and values, which surface in the dashboards as filterable metrics.

The feedback workflow is unusually well-developed. Production agents can log user feedback (thumbs-up/down clicks, ratings, custom comments) directly to Langfuse against specific traces. Internal reviewers can work through queues of traces in a structured review UI. Both feed into datasets used for evaluation and prompt iteration.
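The two feedback paths, user clicks and reviewer scores, can be modeled as scores attached to trace IDs. The sketch below is a standard-library model of that idea and does not use the Langfuse SDK; record_score, mean_score, review_queue, and the sample values are all hypothetical.

```python
from collections import defaultdict

# trace_id -> list of score records; stands in for the platform's score store
scores = defaultdict(list)


def record_score(trace_id: str, name: str, value: float, source: str) -> None:
    """Attach a named score to a trace; source is 'user' or 'reviewer'."""
    scores[trace_id].append({"name": name, "value": value, "source": source})


def mean_score(name: str):
    """Average of one metric across all traces, or None if never recorded."""
    values = [
        s["value"]
        for entries in scores.values()
        for s in entries
        if s["name"] == name
    ]
    return sum(values) / len(values) if values else None


def review_queue(trace_ids: list) -> list:
    """Traces that no reviewer has scored yet, in the given order."""
    return [
        t for t in trace_ids
        if not any(s["source"] == "reviewer" for s in scores.get(t, []))
    ]


# Production path: users thumb up (1) or down (0) on agent responses.
record_score("t1", "user_feedback", 1, "user")
record_score("t2", "user_feedback", 0, "user")
# Audit path: a reviewer scores a sampled run.
record_score("t1", "accuracy", 0.9, "reviewer")

thumbs_up_rate = mean_score("user_feedback")
pending_review = review_queue(["t1", "t2", "t3"])
```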

Operational: self-hosted via docker-compose, Langfuse Cloud for the managed offering. The open-source license (MIT) covers the full product; the cloud is the convenience tier.

When to Use It

Teams that prioritize feedback capture and audit-friendly UI over framework integration depth. European deployments where self-hosting requirements lean toward EU-based companies. Workflows where user feedback and reviewer scores are the primary quality signal.

Alternatives --- LangSmith for the LangChain-native managed option. Phoenix for the OpenTelemetry-native open-source option. The three options serve the same need with different operational and feature emphases.

Sources

  • github.com/langfuse/langfuse

  • langfuse.com/docs

Section D — Open trace standards

OpenInference and the OpenTelemetry GenAI semantic conventions

The observability platforms in Section C are increasingly interoperable thanks to two converging standards. OpenInference is the trace-format standard maintained by Arize, used natively by Phoenix and supported by LangSmith, Langfuse, and many smaller tools as an export format. The OpenTelemetry GenAI semantic conventions are the broader cross-vendor effort to standardize LLM-specific attributes (model name, token counts, prompt and completion contents) on top of OTel’s base tracing model.

Why this matters: agent stacks that emit traces in either standard format can switch observability backends with limited rework. Multi-backend deployments (Phoenix for engineers, Langfuse for the compliance team, a custom dashboard for executive reporting) become feasible because the upstream instrumentation is portable.

OpenInference

Source: github.com/Arize-ai/openinference (Apache-2)

Classification Open specification for LLM and agent trace formats.

Intent

Define a vendor-neutral standard for representing LLM calls, tool calls, agent steps, and retrieval operations as structured trace spans, with attribute conventions that make the resulting traces portable across observability backends.

Motivating Problem

Before OpenInference, every observability product had its own trace schema; instrumenting an agent for LangSmith required different code than instrumenting it for Phoenix, and switching backends meant re-instrumenting. OpenInference defines a single schema with named span kinds (LLM, CHAIN, TOOL, RETRIEVER, AGENT, EMBEDDING) and standardized attributes (input.value, output.value, llm.model_name, llm.token_count.prompt and .completion); instrumentation libraries that emit OpenInference-conformant spans work with any compliant backend.

How It Works

The OpenInference specification is a documented schema. The instrumentation libraries (openinference-instrumentation-langchain, -openai, -llama-index, -autogen, -crewai, and many more) wrap the major agent frameworks and emit conformant spans. The spans flow via OpenTelemetry transport to any compatible backend: Phoenix natively, LangSmith via the OpenInference import path, Langfuse via the OpenInference integration, custom backends via direct OTel collection.

For agent applications: install the appropriate openinference-instrumentation-* package, configure the OpenTelemetry exporter pointing at the chosen backend, and the agent traces flow automatically. Multi-backend setups configure multiple exporters and the same spans go everywhere.
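To make the schema concrete, the sketch below builds the attribute set for one LLM span using the attribute names listed above, then fans the same span out to two exporters. It is a standard-library model: llm_span_attributes, ListExporter, and export_to_all are hypothetical stand-ins for what an instrumentation library and OpenTelemetry exporters do, and the model name and token counts are made up.

```python
def llm_span_attributes(model, prompt, completion, prompt_tokens, completion_tokens):
    """Attributes for one LLM call, keyed by OpenInference attribute names."""
    return {
        "openinference.span.kind": "LLM",
        "input.value": prompt,
        "output.value": completion,
        "llm.model_name": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
    }


class ListExporter:
    """Toy exporter that keeps spans in memory; stands in for a real backend."""

    def __init__(self, backend_name: str):
        self.backend_name = backend_name
        self.received = []

    def export(self, span: dict) -> None:
        self.received.append(span)


def export_to_all(span: dict, exporters: list) -> None:
    """Multi-backend setup: the same span goes to every configured exporter."""
    for exporter in exporters:
        exporter.export(span)


engineering = ListExporter("phoenix")
compliance = ListExporter("langfuse")

span = llm_span_attributes(
    model="example-model",
    prompt="Summarize the refund policy.",
    completion="Refunds are issued within 14 days.",
    prompt_tokens=12,
    completion_tokens=9,
)
export_to_all(span, [engineering, compliance])
```

Because both backends receive the identical attribute dictionary, switching or adding a backend changes the exporter list and nothing upstream.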

When to Use It

Any agent stack where observability backend portability matters. The default starting point for new instrumentation --- OpenInference is the broadest-supported standard as of mid-2026 and the path of least lock-in.

Alternatives --- vendor-specific instrumentation when the backend is fixed and the additional features (deeper integration, native sampling) are worth the lock-in. The OpenTelemetry GenAI semantic conventions for the broader cross-vendor approach.

Sources

  • github.com/Arize-ai/openinference

OpenTelemetry GenAI semantic conventions

Source: opentelemetry.io/docs/specs/semconv/gen-ai/ (CNCF; vendor-neutral)

Classification Cross-vendor standard for LLM and agent telemetry within OpenTelemetry.

Intent

Extend OpenTelemetry’s semantic conventions with LLM-specific attributes (model name, token counts, prompt and completion contents, agent step types) so that agent telemetry flows through the existing OpenTelemetry infrastructure used for general application observability.

Motivating Problem

Many enterprises have substantial existing investment in OpenTelemetry: their applications, databases, and infrastructure all emit OTel traces and metrics; their backends (Datadog, Grafana Tempo, Honeycomb, etc.) consume OTel-shaped data. Adding agent telemetry to this stack should not require a separate parallel observability infrastructure. The OpenTelemetry GenAI semantic conventions define the attribute names and span shapes that LLM-aware applications should emit so that the existing OTel pipeline handles agent telemetry alongside everything else.

How It Works

The GenAI semantic conventions specify attribute names: gen_ai.system (the model provider), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.prompt and gen_ai.completion (the prompt and response content, which instrumentation records only when content capture is enabled), and many more. Attribute names have shifted between releases of the conventions, so pin instrumentation to the release the backend expects. Agent operations are spans; LLM calls have a defined span kind; tool calls are spans with tool-specific attributes; the full agent trace is a tree of these spans with parent-child relationships standard in OpenTelemetry.

Adoption status as of mid-2026: stable for the core LLM call attributes; experimental for the agent-level attributes (agent steps, tool calls, multi-agent collaboration spans). The OpenInference spec and the OTel GenAI conventions are actively cross-pollinating; expect convergence to a unified standard over 2026-2027.
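A small sketch of what emitting these attributes looks like in practice follows. It is a plain-Python model rather than OpenTelemetry SDK code: to_gen_ai_attributes and the call record are hypothetical, the attribute names are the ones listed in this section, and the capture_content flag is an assumed stand-in for whatever content-capture policy the instrumentation applies.

```python
def to_gen_ai_attributes(provider: str, call: dict, capture_content: bool = False) -> dict:
    """Map one LLM call record to gen_ai.* span attributes."""
    attrs = {
        "gen_ai.system": provider,
        "gen_ai.request.model": call["model"],
        "gen_ai.usage.input_tokens": call["input_tokens"],
        "gen_ai.usage.output_tokens": call["output_tokens"],
    }
    # Prompt and completion text can be sensitive, so it is attached only
    # when the deployment has explicitly opted in.
    if capture_content:
        attrs["gen_ai.prompt"] = call["prompt"]
        attrs["gen_ai.completion"] = call["completion"]
    return attrs


# Hypothetical call record as an agent framework might hold it internally.
call_record = {
    "model": "example-model",
    "input_tokens": 120,
    "output_tokens": 48,
    "prompt": "Classify this ticket.",
    "completion": "billing",
}

default_attrs = to_gen_ai_attributes("example-provider", call_record)
full_attrs = to_gen_ai_attributes("example-provider", call_record, capture_content=True)
```

Keeping token counts and model names always on while gating content is one way to let agent telemetry share a pipeline with general application traces without leaking prompt text into it.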

When to Use It

Enterprises with existing OpenTelemetry infrastructure that want agent telemetry to flow through the same pipeline. Cases where vendor neutrality matters more than the depth of any specific backend integration. Future-proof instrumentation that can adapt as the standards mature.

Alternatives --- OpenInference for the more agent-specific spec with broader instrumentation library support today. Vendor-specific instrumentation for maximum integration depth at the cost of portability.

Sources

  • opentelemetry.io/docs/specs/semconv/gen-ai/

  • opentelemetry.io/blog/2024/agentic-ai/

Section E — Chat-style human-agent UIs

The separate-chat topology: Chainlit, Streamlit

When the agent gets its own dedicated UI --- a separate URL, a chat window, no host application to fit into --- two Python frameworks dominate. Chainlit is purpose-built for the agent-chat case: it natively understands the concept of agent steps, displays intermediate “thought” output inline with the conversation, and includes built-in components for tool-call approval UIs. Streamlit is the general-purpose Python web-app framework that, with the addition of streamlit-chat or its newer chat-message primitives, serves as an effective chat-style agent UI for teams that want a single framework for AI-and-non-AI internal tools.

Both target the same case (dedicated chat surface) with the same Python audience. The dimensions of choice: Chainlit’s agent-specific features (step display, native HITL approval components) versus Streamlit’s broader application surface (multi-page apps, dashboards, non-chat UI) and its dominance for general Python-data-app use cases.

Chainlit

Source: github.com/Chainlit/chainlit (Apache-2; Python)

Classification Python framework for agent-chat UIs with native step display and HITL approval components.

Intent

Make it trivial to build a ChatGPT-style web UI for any Python agent, with first-class support for displaying the agent’s internal step-by-step thought process and for human approval of tool calls mid-conversation.

Motivating Problem

For Python agent developers, the gap between “the agent runs in a notebook” and “the agent has a web UI users can talk to” is wide. Building a full React frontend is significant work; Streamlit is general-purpose and doesn’t natively understand agent concepts. Chainlit fills the gap with a Python-first framework purpose-built for the agent-chat case: agent steps display as collapsible nested sections inline with the conversation; tool-call approval is a built-in UI component; sessions, file uploads, and authentication are framework primitives.

How It Works

Install chainlit; write a Python file with @cl.on_message and @cl.on_chat_start decorators marking the message-handler and session-init functions; run chainlit run app.py. The framework serves a ChatGPT-like UI at localhost:8000, routing user messages to the handler and rendering the handler’s outputs (text, files, images, custom elements) back to the conversation.

Agent-specific features: cl.Step is a context manager that wraps an agent step and displays it as a collapsible “thought” in the conversation; users can drill into the step to see inputs, outputs, and nested sub-steps. cl.AskUserMessage and cl.AskActionMessage pause execution and prompt the user for input --- the framework’s native HITL primitive. The pattern is the canonical chat-UI implementation of approval-style HITL.

Integration with the major agent frameworks is built in. Callback handlers and instrumentation helpers (cl.LangchainCallbackHandler, for example) connect LangChain, LangGraph, LlamaIndex, AutoGen, and several others, so steps and tool calls render in the UI with little explicit framework wiring.

When to Use It

Python agent developers who want a dedicated chat UI with minimal frontend work. Internal tools and demos where the chat surface is the primary product. Cases where displaying the agent’s thought process inline (collapsible steps) is part of the user value.

Alternatives --- Streamlit for the general-purpose Python web-app framework when the agent is one component among many. CopilotKit (Section F) when the agent should embed in an existing React product. Vercel AI SDK for TypeScript-centric stacks.

Sources

  • github.com/Chainlit/chainlit

  • docs.chainlit.io

Example artifacts

Code.

import chainlit as cl

# agent_graph and execute_tool are application-defined; execute_tool is
# assumed here to live alongside the graph in my_agent.
from my_agent import agent_graph, execute_tool


@cl.on_chat_start
async def on_start():
    await cl.Message(content="Hi! What would you like help with today?").send()


@cl.on_message
async def on_message(message: cl.Message):
    # Run the agent graph, streaming steps to the UI
    async with cl.Step(name="planner", type="llm") as planner_step:
        plan = await agent_graph.planner.ainvoke({"input": message.content})
        planner_step.output = plan

    # Tool-call approval: pause for human OK before executing
    # (Chainlit 2.x carries action data in payload; 1.x used value=...)
    res = await cl.AskActionMessage(
        content=f"Agent wants to run: {plan['tool_name']}({plan['args']})",
        actions=[
            cl.Action(name="approve", payload={"value": "yes"}, label="Approve"),
            cl.Action(name="reject", payload={"value": "no"}, label="Reject"),
        ],
    ).send()

    # res is None if the prompt times out; treat that as a rejection
    if not res or res.get("payload", {}).get("value") != "yes":
        await cl.Message(content="Tool call rejected.").send()
        return

    async with cl.Step(name=plan["tool_name"], type="tool") as tool_step:
        result = await execute_tool(plan)
        tool_step.output = result

    await cl.Message(content=f"Result: {result}").send()

Streamlit as agent UI

Source: github.com/streamlit/streamlit (Apache-2; Python)

Classification General-purpose Python web framework with chat primitives.

Intent

Serve as the agent UI when the broader Streamlit application surface (dashboards, multi-page apps, non-chat UI) is valuable and the chat-component subset is sufficient for the agent interaction.

Motivating Problem

Many teams already use Streamlit for internal Python tools --- dashboards, data exploration, simple workflow apps. Adding an agent to the existing Streamlit application is significantly less work than introducing a second framework. The native st.chat_message and st.chat_input primitives (added 2023) provide enough surface for a serviceable chat UI, and the rest of the Streamlit application can surround the chat with the additional context the user needs.

How It Works

Install streamlit; use st.chat_input for the user input box, st.chat_message for displaying messages, and st.session_state for conversation history. The chat surface integrates with the rest of the Streamlit application (sidebar controls, multi-page navigation, data displays, file uploaders) as ordinary Streamlit components.

Compared with Chainlit: Streamlit doesn’t natively understand agent concepts. Step display, tool-call approval, and other agent-specific features have to be built using ordinary Streamlit primitives --- st.expander for collapsible step views, st.button + st.session_state for approval gates. The trade-off: more flexibility and a broader application surface in exchange for more wiring.
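The extra wiring is mostly state management, because Streamlit reruns the script on every interaction and the gate must survive those reruns in st.session_state. The sketch below shows one way to structure it, as an assumption-laden example rather than a recommended design: next_gate_state, render_app, and the propose_tool_call and run_tool callables are hypothetical, while the st.* calls are standard Streamlit primitives.

```python
def next_gate_state(state: str, event: str) -> str:
    """Pure transition function for the approval gate."""
    transitions = {
        ("idle", "tool_proposed"): "pending",
        ("pending", "approve"): "approved",
        ("pending", "reject"): "rejected",
        ("approved", "done"): "idle",
        ("rejected", "done"): "idle",
    }
    return transitions.get((state, event), state)  # unknown events are ignored


def render_app(propose_tool_call, run_tool):
    """Streamlit wiring; call this at module level in the app script."""
    import streamlit as st  # imported here so the gate logic has no UI dependency

    ss = st.session_state
    if "gate" not in ss:
        ss["gate"] = "idle"
        ss["history"] = []

    for role, text in ss["history"]:
        with st.chat_message(role):
            st.write(text)

    prompt = st.chat_input("Ask the agent")
    if prompt and ss["gate"] == "idle":
        ss["history"].append(("user", prompt))
        ss["proposal"] = propose_tool_call(prompt)
        ss["gate"] = next_gate_state(ss["gate"], "tool_proposed")
        st.rerun()

    if ss["gate"] == "pending":
        with st.expander("Proposed tool call", expanded=True):
            st.write(ss["proposal"])
        if st.button("Approve"):
            ss["gate"] = next_gate_state(ss["gate"], "approve")
            st.rerun()
        if st.button("Reject"):
            ss["gate"] = next_gate_state(ss["gate"], "reject")
            st.rerun()

    if ss["gate"] == "approved":
        ss["history"].append(("assistant", f"Result: {run_tool(ss['proposal'])}"))
        ss["gate"] = next_gate_state(ss["gate"], "done")
        st.rerun()

    if ss["gate"] == "rejected":
        ss["history"].append(("assistant", "Tool call rejected."))
        ss["gate"] = next_gate_state(ss["gate"], "done")
        st.rerun()
```

Keeping the transitions in a pure function means the approval logic can be unit tested without a running Streamlit server; only render_app touches the UI.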

When to Use It

Teams already on Streamlit who want to add an agent without introducing a second framework. Applications where the chat is one component of a broader app surface (dashboards plus agent, data exploration plus agent, internal tool plus agent).

Alternatives --- Chainlit for the agent-chat-as-primary-product case. Gradio for the demos-and-prototypes case. The dedicated chat frameworks are usually a better fit when the agent is the product; Streamlit shines when the agent is a feature of something larger.

Sources

  • docs.streamlit.io

Section F — Embedded agent UIs

Putting the agent inside an existing React product: CopilotKit, Vercel AI SDK

The embedded-agent topology is the answer when the agent should augment an existing product rather than live in a separate destination. Two React frameworks dominate. CopilotKit positions itself as the headless React framework for agent integration: components for chat panels, mid-conversation forms, and approval buttons that integrate with the existing application’s state. The Vercel AI SDK is the broader TypeScript SDK from Vercel, with @ai-sdk/react providing hooks and components for chat UIs that integrate with the AI SDK’s streaming infrastructure.

The two frameworks overlap considerably. CopilotKit emphasizes agent-specific features (read-the-application-state, in-context actions, headless components for full design control). Vercel AI SDK emphasizes the broader streaming and provider-abstraction story (multiple LLM providers, type-safe tool calls, structured output streaming). Most React-shaped agent embeddings end up using one or the other; the choice often follows the team’s existing investment in either ecosystem.

CopilotKit

Source: github.com/CopilotKit/CopilotKit (MIT; TypeScript)

Classification React framework for embedding AI agents in existing applications.

Intent

Provide headless React components and hooks that let an agent be added to an existing application as a chat panel, an inline action, or a mid-conversation form, with access to the host application’s state.

Motivating Problem

Adding an agent to an existing React product means more than rendering a chat window. The agent needs to read the application’s state (which document is open, which row is selected, which view the user is on); the agent needs to take actions in the application (insert text, modify state, navigate); the agent should render contextual UI for moments where a custom form is the right interaction. CopilotKit provides these capabilities as React-first primitives, with explicit design for the case of embedding agents in production applications rather than building dedicated agent surfaces.

How It Works

Wrap the application in <CopilotKit publicApiKey={…}>; add a <CopilotPopup /> or <CopilotSidebar /> for the chat surface. The agent gets access to application state through the useCopilotReadable hook, which makes any React state visible to the agent’s context. The agent can take actions through useCopilotAction, which registers typed actions the agent can invoke with arguments and a handler.

Mid-conversation forms and approval gates use the useCopilotAction pattern: when the agent decides to take an action requiring confirmation, it invokes an action whose renderAndWaitForResponse callback renders a custom form or approval UI. The user fills out the form or clicks approve; the component passes the result to respond, which returns it to the agent; the conversation continues. This is the canonical pattern for tool-call approval in an embedded React agent.

Backend integration: CopilotKit ships an adapter pattern with built-in support for LangGraph, OpenAI, Anthropic, and others. The backend agent runs on the team’s existing infrastructure; the CopilotKit runtime handles message routing, action invocation, and streaming between the React frontend and the backend agent.

When to Use It

React-based products where an agent should be embedded in the existing UX rather than living separately. Cases where the agent needs to read the application’s state and take actions in it. Co-pilot-style interactions (the agent assists; the user drives).

Alternatives --- Vercel AI SDK for the streaming-focused alternative. Build-your-own with raw fetch + EventSource when the framework abstractions don’t fit. The dedicated chat frameworks (Section E) for the case where the agent is the product, not a feature.

Sources

  • github.com/CopilotKit/CopilotKit

  • docs.copilotkit.ai

Example artifacts

Code.

import { useState } from "react";
import { CopilotKit, useCopilotReadable, useCopilotAction } from "@copilotkit/react-core";
import { CopilotSidebar } from "@copilotkit/react-ui";

// Row, YourExistingApp, and deleteRowFromBackend are application-specific
// and assumed to be defined elsewhere.

// Copilot hooks must run inside the <CopilotKit> provider, so they live in a
// child component rather than in the component that renders the provider.
function RowManager() {
  const [selectedRowId, setSelectedRowId] = useState<string | null>(null);
  const [rows, setRows] = useState<Row[]>([]);

  // The agent can read application state
  useCopilotReadable({
    description: "Currently selected row",
    value: rows.find(r => r.id === selectedRowId),
  });

  // The agent can take actions; this one requires human approval inline
  useCopilotAction({
    name: "deleteRow",
    description: "Delete a row by ID. Requires human confirmation.",
    parameters: [{ name: "rowId", type: "string" }],
    // The value passed to respond() becomes the action's result, so the
    // deletion itself runs in the approval click handler, not in a separate handler.
    renderAndWaitForResponse: ({ args, respond, status }) => (
      <div>
        <p>Agent wants to delete row {args.rowId}. Confirm?</p>
        <button
          disabled={status !== "executing" || !args.rowId}
          onClick={async () => {
            await deleteRowFromBackend(args.rowId!);
            respond?.("deleted");
          }}
        >
          Yes
        </button>
        <button disabled={status !== "executing"} onClick={() => respond?.("denied")}>
          No
        </button>
      </div>
    ),
  });

  return <YourExistingApp />;
}

function MyApp() {
  return (
    <CopilotKit publicApiKey="...">
      <RowManager />
      <CopilotSidebar instructions="Help the user manage their data." />
    </CopilotKit>
  );
}

Vercel AI SDK + @ai-sdk/react

Source: github.com/vercel/ai (Apache-2; TypeScript)

Classification TypeScript SDK for AI applications with React hooks for chat UIs.

Intent

Provide a TypeScript SDK that abstracts over LLM providers and supports streaming, type-safe tool calls, and structured outputs, with React hooks (@ai-sdk/react) for building chat UIs against the SDK’s backend primitives.

Motivating Problem

For React-and-Next-shaped applications, the Vercel AI SDK has been a dominant choice for LLM integration since its 2023 launch, a position strengthened by AI SDK v5 (2025), which added first-class agent abstractions and generative-UI streaming. The HITL story sits within the broader package: useChat for the chat UI, tool-call streaming with renderable UI components, structured outputs with typed React rendering, and the v5 agent abstractions for agent-shaped applications.

How It Works

On the backend (Next.js API route or any TypeScript backend), use streamText, streamObject, or generateText to call any provider (OpenAI, Anthropic, Google, Mistral, etc.) with a unified API. On the frontend, the useChat hook from @ai-sdk/react provides messages, input handling, and streaming-to-UI out of the box. Tool calls stream as structured events that can render custom UI components inline with the conversation.

For HITL specifically: a tool can be defined so that its execution waits for human approval, for example by omitting the server-side execute function so that the call is forwarded to the client instead of running automatically. When the agent invokes such a tool, the streamed UI shows the tool call with a custom-rendered approval UI; the human clicks approve or reject; the decision is returned to the agent as the tool result, and the agent continues with the chosen branch. The pattern is the AI SDK’s native generative-UI primitive applied to the approval-gate use case.

When to Use It

Next.js or React-shaped agent applications. Multi-provider deployments where the SDK’s provider abstraction is valuable. TypeScript-centric stacks. Cases where the SDK’s broader feature surface (streaming, structured outputs, embeddings) is worth the framework adoption.

Alternatives --- CopilotKit for deeper embedded-agent integration with the host application’s state. Direct provider SDKs (Anthropic SDK, OpenAI SDK) when the AI SDK’s abstraction is unnecessary. Chainlit / Streamlit for Python-centric stacks.

Sources

  • sdk.vercel.ai/docs

  • github.com/vercel/ai

Section G — Generative UI

When the agent renders its own widgets mid-conversation

Generative UI is the youngest of the three topologies but the most expressive. Rather than the agent communicating through pre-defined UI elements (text messages, fixed forms, standard charts), the agent chooses or generates the UI element that best fits each turn: a form when a structured input is needed, a chart when data is the answer, a table with row-level actions when the response is best presented as multiple items, a draft document with inline edit controls when the user needs to refine an output. Two surfaces dominate as of mid-2026: Anthropic Artifacts (the Claude-native rendering of generated UI for end users) and the Vercel AI SDK’s streaming generative UI components (the build-your-own primitive for React applications).

The pattern’s power is its expressive range. The pattern’s discipline is choosing well: a generative UI used carelessly produces a chaotic interface with no consistent vocabulary, while a generative UI used thoughtfully turns each turn of the conversation into the optimally-shaped interaction.

Anthropic Artifacts (as HITL surface)

Source: docs.anthropic.com (Anthropic-native rendering in Claude.ai; reproducible in custom applications built on the API)

Classification Anthropic-rendered generated UI for end users.

Intent

Render generated UI artifacts --- HTML, React components, SVG, Mermaid diagrams, documents --- inline in the Claude.ai conversation as interactive panels separate from the chat stream.

Motivating Problem

For agent applications that ship on Claude.ai (or use Claude’s API surface in their own product), Artifacts are the natural delivery surface for generative UI: rendered interactive panels that sit alongside the conversation, edited iteratively as the conversation progresses, with the user able to interact with the artifact directly. The HITL angle is that the artifact is often the moment where the human and the agent collaborate --- reviewing a draft document, refining a UI mock, exploring a chart, validating a structured output.

How It Works

In Claude.ai, Claude decides when an artifact is the right response (the question warrants something interactive or substantial enough to want a dedicated surface). The artifact renders in a panel beside the conversation; the user can read, scroll, interact, or ask Claude to modify it. The conversation drives iteration; each modification updates the artifact in place.

For applications building on the Claude API: the same pattern is reproducible. The agent emits structured output describing the artifact (HTML, React component code, SVG, etc.); the host application renders it in a panel separate from the chat stream; user interactions and conversation drive iteration. The Claude API doesn’t hand the team an Artifacts UI; it provides the model and the structured-output capability, and the team builds the rendering surface.
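
The API-side pattern above can be sketched as a small routing step. This is a minimal illustration under stated assumptions: the JSON envelope (kind, title, content) and the names Artifact and route_model_output are hypothetical, not an Anthropic format; the team defines whatever structured output it asks the model to emit.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical envelope the application asks the model to emit; not an Anthropic format.
RENDERABLE_KINDS = {"html", "react", "svg", "mermaid", "document"}


@dataclass
class Artifact:
    kind: str
    title: str
    content: str


def route_model_output(text: str) -> tuple[str, Optional[Artifact]]:
    """Return (chat_text, artifact). Non-JSON or unrecognized output stays in the chat stream."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return text, None
    if not isinstance(payload, dict) or payload.get("kind") not in RENDERABLE_KINDS:
        return text, None
    artifact = Artifact(
        kind=payload["kind"],
        title=payload.get("title", "Untitled"),
        content=payload.get("content", ""),
    )
    # The chat stream gets a short pointer; the side panel gets the artifact itself.
    return f"[artifact: {artifact.title}]", artifact


chat_text, artifact = route_model_output('{"kind": "svg", "title": "Logo", "content": "<svg/>"}')
```

The host application would hand the returned Artifact to its own rendering panel (sandboxed as appropriate for HTML or component code) and append only chat_text to the conversation view.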

When to Use It

Applications targeting Claude.ai users directly --- Artifacts are the default surface there. Custom applications building on the Claude API where a dedicated rendering panel separate from the chat is the right interaction shape (long documents, complex visualizations, interactive components).

Alternatives --- Vercel AI SDK’s streaming UI for the React-application pattern. Inline UI rendering in Chainlit or CopilotKit for the cases where the generative UI should live in the conversation flow rather than a separate panel.

Sources

  • docs.anthropic.com

Vercel AI SDK streaming UI

Source: sdk.vercel.ai/docs/ai-sdk-ui/generative-ui (Apache-2)

Classification React Server Components-based generative UI streaming.

Intent

Stream React Server Components from the agent to the React client, so the agent can render arbitrary custom UI components per turn rather than communicating only through text or pre-defined elements.

Motivating Problem

For React-shaped agent applications, the generative-UI pattern needs a delivery mechanism. The Vercel AI SDK’s streaming UI primitive uses React Server Components: the agent (running in a Server Action or API route) decides which UI component to render and streams it to the client; the client renders the component inline with the conversation; the component is a normal React component with full access to client-side React features (state, event handlers, animations).

How It Works

On the backend, use streamUI from the SDK’s React Server Components module (available since the AI SDK 3.x line), or stream structured data with streamObject and map it to components on the client. With streamUI, the agent’s response can include tool calls whose generate function returns a React component. When the agent calls a tool like showFlightOptions, the result is a React component that the AI SDK streams to the client. The client renders the component in the chat stream.

For HITL specifically: the components rendered can be interactive forms, approval buttons, draggable elements, anything React supports. The agent emits the appropriate component for each turn; the user interacts with it; the resulting state flows back to the agent through tool-call results or message submissions. The result is conversations where each turn produces the optimally-shaped UI for the work at hand, rather than every turn fitting through the chat-message bottleneck.

When to Use It

React-shaped applications where generative UI is part of the agent’s value proposition. Cases where the conversation should produce richly interactive components rather than text-with-occasional-forms. Multi-modal applications where the right rendering varies dramatically by turn.

Alternatives --- CopilotKit’s renderAndWaitForResponse pattern for embedded agents with simpler generative needs. Anthropic Artifacts for Claude.ai-targeted applications. Static UI with conditional rendering when the variation between turns is small.

Sources

  • sdk.vercel.ai/docs/ai-sdk-ui/generative-ui

Section H — Native LLM-API HITL primitives

The tool_use ↔ tool_result handshake as foundational pause-and-resume

Beneath every framework primitive in Section A is a more fundamental pattern: the tool_use / tool_result handshake provided by the LLM APIs themselves. When the model emits a tool_use block, the API call returns and control passes to the application; because the API is stateless, nothing on the model side is left waiting. The application can execute the tool immediately, defer execution while waiting for a human, or refuse the tool call entirely; whenever the application sends the conversation back with a tool_result, the model continues from exactly that point. This is, mechanically, the foundational HITL primitive that every framework abstracts.

Recognizing this matters for two reasons. First, it means HITL doesn’t require a framework --- a thin application calling the Anthropic API directly has the same pause-and-resume primitive available, just less polished. Second, it means the framework-level HITL primitives in Section A are not magic; they’re abstractions over the tool_use loop with various conveniences (persistence, typed state, UI integration) layered on top.

Anthropic API tool_use ↔ tool_result loop

Source: docs.anthropic.com/en/docs/build-with-claude/tool-use

Classification Native LLM-API HITL primitive.

Intent

Use the tool_use / tool_result handshake as the foundational pause-and-resume mechanism: when the model emits a tool_use, the application has full control over when and how to return a tool_result, including indefinite delays for human approval.

Motivating Problem

Most discussions of HITL focus on framework primitives, which obscures that the underlying capability is in the LLM API itself. The model emits a tool_use block; the API call returns; the application is in control. There is no requirement that the application execute the tool immediately. The application can wait an arbitrary time, store the tool_use to a queue, surface it to a human, wait for the human decision, and then return a tool_result that carries either the tool’s output or an error explaining that the call was refused. The application’s next API call continues from the conversation state as if no pause occurred.

How It Works

The Anthropic API’s messages.create() returns a message that may include one or more tool_use blocks. The application inspects the tool_use, decides whether and how to execute the tool, and adds a tool_result block to the conversation history. The next messages.create() call passes the updated history; the model continues from where it paused, now with the tool_result in context.

For HITL: when the application receives a tool_use that requires approval, the application persists the entire conversation state (including the pending tool_use) to durable storage, notifies the appropriate human via the appropriate channel, and waits. When the human responds, the application constructs the tool_result from the tool’s output (if approved) or constructs a tool_result marked is_error that explains the denial (if rejected), and resumes the conversation. The API expects every tool_use to be answered by a tool_result in the next user message, so a rejection is reported through the tool_result rather than by skipping it; any redirecting text can accompany it. The model has no way to tell whether the tool was executed in 50 milliseconds or 50 hours; the persistence and resumption are entirely the application’s concern.
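
The persist-and-wait step can be sketched with nothing more than the standard library. This is a minimal illustration, assuming the conversation has already been converted to plain JSON-serializable dicts; the table layout and the names open_store, save_pending, record_decision, and load_pending are hypothetical, and a production system would add approver identity, timeouts, and concurrency control.

```python
import json
import sqlite3

# Hypothetical schema: one row per paused run, keyed by the pending tool_use id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pending_approvals (
    tool_use_id  TEXT PRIMARY KEY,
    conversation TEXT NOT NULL,
    decision     TEXT
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store. Pass a file path for real durability; ':memory:' is for demos only."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_pending(conn: sqlite3.Connection, tool_use_id: str, conversation: list) -> None:
    """Persist the full conversation so the run survives a restart while waiting."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pending_approvals (tool_use_id, conversation) VALUES (?, ?)",
            (tool_use_id, json.dumps(conversation)),
        )


def record_decision(conn: sqlite3.Connection, tool_use_id: str, approved: bool) -> None:
    """Store the human's decision next to the paused conversation."""
    with conn:
        conn.execute(
            "UPDATE pending_approvals SET decision = ? WHERE tool_use_id = ?",
            ("approved" if approved else "denied", tool_use_id),
        )


def load_pending(conn: sqlite3.Connection, tool_use_id: str):
    """Return (conversation, decision); decision is None until the human responds."""
    row = conn.execute(
        "SELECT conversation, decision FROM pending_approvals WHERE tool_use_id = ?",
        (tool_use_id,),
    ).fetchone()
    if row is None:
        raise KeyError(tool_use_id)
    return json.loads(row[0]), row[1]
```

A worker that picks the run back up would call load_pending, build the tool_result from the stored decision, and issue the next API call with the restored conversation.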

The same pattern works across all major LLM API providers: OpenAI’s function calling, Google’s Gemini tool use, Mistral’s tool support. The mechanics differ slightly in syntax; the underlying pattern is identical.
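
To make the "differs slightly in syntax" point concrete, here is a minimal sketch of how a human denial might be reported back in two providers' message formats. The function names are hypothetical; the message shapes follow the Anthropic Messages API and the OpenAI Chat Completions API as commonly documented, and should be checked against the current provider references before use.

```python
def anthropic_denial(tool_use_id: str, reason: str) -> dict:
    """Anthropic Messages API: tool results travel as content blocks inside a user message."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": reason,
            "is_error": True,
        }],
    }


def openai_denial(tool_call_id: str, reason: str) -> dict:
    """OpenAI Chat Completions: each tool result is its own message with role 'tool'."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": reason}


reason = "Tool call denied by human reviewer."
print(anthropic_denial("toolu_123", reason))
print(openai_denial("call_123", reason))
```

In both cases the identifier ties the result back to the specific call the model made, which is what lets the application answer hours later without the model noticing the gap.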

When to Use It

Applications that want HITL without framework adoption. Custom frameworks that need to implement HITL as a primitive. Cases where the framework abstractions don’t fit the application’s exact persistence or approval routing needs. Teaching examples where understanding the foundational mechanism matters.

Alternatives --- the framework primitives in Section A when they fit the application’s shape. The foundational pattern is worth understanding regardless, because it clarifies what the frameworks are doing and where their abstractions leak.

Sources

  • docs.anthropic.com/en/docs/build-with-claude/tool-use

Example artifacts

Code.

import anthropic

client = anthropic.Anthropic()

# requires_approval, execute_tool, save_to_durable_store, and notify_approver
# are application-specific helpers assumed to be defined elsewhere.
# execute_tool is assumed to return a string.


def tool_result(tool_use, content: str, is_error: bool = False) -> dict:
    """Build the tool_result block that answers one tool_use block."""
    return {"type": "tool_result", "tool_use_id": tool_use.id,
            "content": content, "is_error": is_error}


def run_with_hitl(conversation: list, tools: list) -> dict:
    """Run the agent until it produces a non-tool-use response or requires approval."""
    while True:
        msg = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=conversation,
        )
        conversation.append({"role": "assistant", "content": msg.content})

        # Find any tool_use blocks
        tool_uses = [b for b in msg.content if b.type == "tool_use"]
        if not tool_uses:
            return {"status": "complete", "final_message": msg,
                    "conversation": conversation}

        # If any call needs approval, pause the whole turn: every tool_use must be
        # answered together in the next user message, so nothing runs until the human responds.
        pending = [tu for tu in tool_uses if requires_approval(tu)]
        if pending:
            # Persist conversation state, notify human, return for resumption later
            for tu in pending:
                save_to_durable_store(conversation, pending_tool_use=tu)
                notify_approver(tu)
            return {"status": "pending_approval", "tool_uses": pending,
                    "conversation": conversation}

        # No approval needed: execute every tool and feed the results back
        results = [tool_result(tu, execute_tool(tu.name, tu.input)) for tu in tool_uses]
        conversation.append({"role": "user", "content": results})


def resume_after_approval(conversation: list, tools: list, decisions: dict) -> dict:
    """Resume the agent run after the human approves or rejects.

    decisions maps each pending tool_use id to True (approved) or False (rejected).
    """
    results = []
    for tu in conversation[-1]["content"]:
        if tu.type != "tool_use":
            continue
        if not requires_approval(tu) or decisions.get(tu.id, False):
            results.append(tool_result(tu, execute_tool(tu.name, tu.input)))
        else:
            results.append(tool_result(tu, "Tool call denied by human reviewer.",
                                       is_error=True))
    conversation.append({"role": "user", "content": results})
    return run_with_hitl(conversation, tools)

Appendix A --- HITL Face Reference Table

Cross-reference between the three faces of HITL (Chapter 1) and their representative substrates.

Face of HITL | Characteristic | Representative substrates
Approval | Synchronous, blocking | LangGraph interrupts, Temporal Signals, AWS Step Functions wait-for-callback
Observation | Asynchronous, non-blocking | LangSmith, Arize Phoenix, Langfuse
Interaction | Bidirectional, mid-flight | Chainlit, CopilotKit, Vercel AI SDK, generative UI

Appendix B --- The Seven-Volume Series

This catalog joins the six prior volumes to form a seven-layer vocabulary for agentic AI. The volumes are independent and the reading order is flexible; the cross-references make whichever path the reader chooses coherent.

  • Volume 1 --- Patterns of AI Agent Workflows --- the timing of agent runs.

  • Volume 2 --- The Claude Skills Catalog --- model instructions in packaged form.

  • Volume 3 --- The AI Agent Tools Catalog --- the function-calling primitives.

  • Volume 4 --- The AI Agent Events & Triggers Catalog --- the activation layer.

  • Volume 5 --- The AI Agent Fabric Catalog --- the infrastructure substrate.

  • Volume 6 --- The AI Agent Memory Catalog --- the state and context layer.

  • Volume 7 --- The Human-in-the-Loop Catalog (this volume) --- the human-agent interaction layer.

Seven layers. Read top-down for the agent designer’s sequence: how runs compose, what model instructions are loaded, which tools the agent calls, what events trigger the agent, what fabric it runs on, what state and memory it operates against, and how humans approve, observe, and interact with it. Read bottom-up for the operator’s sequence: how humans engage with the system, where state lives, what fabric supports it, what events drive it, what tools it uses, what skills ship with it, what patterns its runs follow.

HITL is the layer that most often determines whether an agent system is acceptable to its eventual users --- not whether it works in the lab. Auditors, regulators, end users, and engineers each have specific needs at this layer; meeting them requires deliberate design across the three faces (approval, observation, interaction). A working architect makes choices at all seven layers; the choices interact; the interactions are what make a production agent system actually viable.

Appendix C --- The HITL Anti-Patterns

Seven recurring mistakes that distinguish thoughtful HITL designs from improvised ones. Avoiding these is most of the practical wisdom in the field:

  1. Treating HITL as binary. The choice is not on or off; it’s a five-point spectrum (Chapter 2) and a single application typically uses different points for different actions. A global setting is almost always wrong.

  2. Conflating the three faces. Approval, observation, and interaction need different products and different designs. A platform that solves one of the three doesn’t automatically solve the others.

  3. In-process pause for approval. The agent’s state lives in the running process; a deploy or a crash during the wait loses the in-flight task. Durable execution (Chapter 3) is the engineering answer; skipping it produces a system that works in demo and fails in production.

  4. Trace volume without trace quality. Capturing every event without thought produces traces that are technically complete and operationally useless. Sampling, naming conventions, and selective input/output capture are what make traces searchable; the trace is the audit artifact, not the firehose.

  5. Approval gates as the only safety mechanism. Approval gates depend on humans paying attention to specific actions at specific moments. Defense in depth (action-level validation, policy enforcement, sandbox isolation, observability) provides safety properties that human attention cannot. Approval is one layer of a stack, not a substitute for the stack.

  6. Chat-only UI for high-stakes interaction. The chat metaphor caps the bandwidth of human-agent collaboration. For applications where the human is making consequential decisions (medical, legal, financial), embedded or generative UI is often the better fit; chat is the default, not the right default for every case.

  7. Building observability last. Tracing instrumentation added after the system is already in production is significantly more expensive than instrumentation included from the start. The trace UI is also the primary debug surface; building it last makes debugging during early development harder than it needs to be.
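
Anti-pattern 5 can be illustrated with a small layered check. This is a minimal sketch with hypothetical names and values throughout (ProposedAction, ALLOWED_TOOLS, MAX_REFUND, evaluate); the point is only that automated layers reject or pass actions before, and independently of, the human gate.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedAction:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical policy values for illustration only.
ALLOWED_TOOLS = {"read_record", "issue_refund"}
MAX_REFUND = 500.0


def valid_amount(value) -> bool:
    """Accept only positive numbers within the hard policy limit."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0 < value <= MAX_REFUND


def evaluate(action: ProposedAction) -> str:
    """Return 'reject', 'needs_approval', or 'auto_execute'."""
    # Layer 1: allow-list. Unknown tools never reach a human reviewer.
    if action.tool not in ALLOWED_TOOLS:
        return "reject"
    if action.tool == "issue_refund":
        # Layer 2: argument validation and hard limits, enforced without human attention.
        if not valid_amount(action.args.get("amount")):
            return "reject"
        # Layer 3: the human gate, reached only by actions that passed the layers above.
        return "needs_approval"
    return "auto_execute"


print(evaluate(ProposedAction("issue_refund", {"amount": 50})))
```

Because the reviewer only ever sees actions that already passed the automated layers, a lapse in human attention is bounded by the policy limits rather than by whatever the agent proposed.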

Appendix D --- Discovery and Standards

Resources for tracking the HITL ecosystem as it evolves:

  • OpenInference (github.com/Arize-ai/openinference) --- the open trace-format standard; the right starting point for new instrumentation.

  • OpenTelemetry GenAI semantic conventions (opentelemetry.io/docs/specs/semconv/gen-ai/) --- the broader cross-vendor standard converging with OpenInference.

  • Temporal’s HITL documentation (docs.temporal.io/encyclopedia/human-in-the-loop) --- the field’s most rigorous treatment of durable approval engineering.

  • LangGraph’s human-in-the-loop guide (langchain-ai.github.io/langgraph/concepts/human_in_the_loop/) --- the conceptual reference for framework-level approval design.

  • Microsoft Agent Framework tool-approval documentation (learn.microsoft.com/en-us/agent-framework/agents/tools/tool-approval) --- the Azure-side reference design for approval gates.

  • Awesome lists for ecosystem discovery: awesome-llm-observability, awesome-ai-agents, and similar curated GitHub indexes.

Two pragmatic recommendations. First, instrument for portability: OpenInference or OpenTelemetry GenAI conventions both insulate against vendor lock-in at the observability layer, and the cost of switching backends is dramatically lower when the upstream traces are standards-compliant. Second, design for durability from day one: adding durable execution to a working agent system is significantly harder than building it in from the start, because the assumptions about persistence and state flow are different at every step.
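
The portability recommendation can be sketched as a helper that names span attributes according to the OpenTelemetry GenAI semantic conventions instead of a vendor-specific scheme. This is a minimal illustration: the helper name and the app.hitl.* key are hypothetical, the gen_ai.* keys are taken from the conventions as published at the time of writing, and the conventions are still evolving, so the names should be checked against the current specification.

```python
from typing import Optional


def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    operation: str = "chat",
    approval_status: Optional[str] = None,
) -> dict:
    """Build span attributes keyed by GenAI semantic-convention names."""
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if approval_status is not None:
        # Hypothetical application-specific key, kept outside the gen_ai.* namespace.
        attributes["app.hitl.approval_status"] = approval_status
    return attributes


attrs = genai_span_attributes("example-model", 1200, 340, approval_status="approved")
```

A dict like this can be attached to a span through whichever tracing SDK the team uses; keeping application-specific fields in their own namespace leaves the standard keys clean for any backend that understands the conventions.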

Appendix E --- Omissions

This catalog covers about 16 substrates across 8 sections. The wider ecosystem is significantly larger; a non-exhaustive list of what isn’t here:

  • Conventional workflow engines (Airflow, Prefect, Dagster, Argo Workflows) when not specifically used for HITL. They can serve the durable-execution role with effort; the agent-specific tooling lives elsewhere.

  • Enterprise approval products (ServiceNow, Salesforce Process Builder, Workday) when used outside the agent context. Production agent deployments in enterprises often integrate with these existing approval systems.

  • Conventional observability platforms (Datadog, Grafana, New Relic, Honeycomb) when used without LLM-specific instrumentation. The OpenTelemetry GenAI conventions are the bridge that makes these usable for agent telemetry.

  • Build-your-own chat UIs (custom React, Vue, or vanilla JS implementations). Many production agent UIs are bespoke rather than framework-based; the patterns in Sections E–G transfer but the implementation is application-specific.

  • Voice and multimodal HITL surfaces (voice-call review, video annotation, screenshot-based review). These follow the same three-faces taxonomy but with different product categories.

  • Specialized HITL products for narrow verticals (medical chart review, legal document approval, financial trade authorization) that haven’t consolidated into general-purpose categories.

Appendix F --- A Note on the Moving Target

LangGraph’s dynamic interrupt() function shipped in late 2024. Temporal’s HITL documentation became a first-class section of the docs in 2025. Langfuse passed LangSmith on some open-source observability metrics in 2025. The AI SDK v5 (2025) made generative UI a production-grade primitive. CopilotKit pivoted toward headless components in 2024. OpenInference and OpenTelemetry GenAI conventions are actively merging as of mid-2026. The field is young enough that products and standards move quarterly; this catalog captures mid-2026.

The deepest structural facts remain stable: the three faces of HITL (approval, observation, interaction), the five-point approval spectrum, the requirement for durable execution, the trace tree as auditable record, the three UI topologies. These are conceptual claims about the shape of the problem, and they don’t shift the way product names do. An architect who internalizes the conceptual structure can map any new product onto it in minutes; an architect who learns only the products has to relearn the field every twelve months.

Seven volumes complete. Patterns, Skills, Tools, Events, Fabric, Memory, Human-in-the-Loop. A working architect now has the vocabulary for the entire stack, top to bottom.

--- End of The Human-in-the-Loop Catalog v0.1 ---

— The Seven-Volume Series —