About This Catalog

This is the eighth volume in a catalog of the working vocabulary of agentic AI. The seven prior volumes covered patterns (the timing of agent runs), skills (model instructions in packaged form), tools (the function-calling primitives), events and triggers (what activates the agent), fabric (the substrate beneath orchestration), memory (state, context, and recall), and human-in-the-loop (approval, observation, and interaction). This eighth volume covers the governance layer that sits across all of them: the testing and defense mechanisms that determine whether an agent system is safe, reliable, and accountable enough to deploy.

Evaluation and guardrails are different things, often conflated. Evaluation is what happens before deployment: rubric-based testing, regression detection, dataset-driven benchmarks running in CI/CD pipelines that catch quality and behavior failures before they reach users. Guardrails are what happens during deployment: input filtering, output validation, classifier-based safety checks running synchronously on every request, blocking adversarial inputs and unsafe outputs in real time. The categories share tooling vocabulary (“LLM-as-judge,” for example, shows up in both), share evaluation logic (the rubric that decides if an output is unsafe at test time often becomes the runtime classifier), and share standards (OWASP’s LLM Top 10 frames both threat surfaces). They are not the same thing, and a serious governance design uses both.

The category matured significantly across 2024 and 2025. OWASP’s 2025 revision of the LLM Top 10 added System Prompt Leakage and Vector and Embedding Weaknesses to the list, reflecting how agent architectures have evolved beyond the chatbot pattern; NIST’s AI Risk Management Framework formalized the governance vocabulary for enterprise compliance; MITRE’s ATLAS framework cataloged the adversarial tactics. Tooling caught up. DeepEval reached parity with pytest as the testing-style eval framework; Ragas became the standard for RAG-specific evaluation; Promptfoo established itself as the CLI-first option for rapid iteration; NeMo Guardrails formalized Colang as a domain-specific language for conversational safety policies; Guardrails AI built the validator-marketplace pattern through Guardrails Hub; Meta’s Llama Guard productized fast classification-based safety as a deployable model; NVIDIA’s Garak and Microsoft’s PyRIT covered the red-teaming side. As of mid-2026, the building blocks exist; the assembly is the work.

Scope

Coverage:

Input/output guardrails frameworks: NeMo Guardrails (with Colang), Guardrails AI, the Guardrails Hub validator marketplace.
Classifier-style guardrails: Llama Guard, and related fast classification safety models.
Offline evaluation frameworks: DeepEval, Ragas (RAG-specific), Promptfoo.
Reference evaluation suites: OpenAI Evals, HELM (Holistic Evaluation of Language Models).
Observability-integrated evaluation: LangSmith Evaluators, Phoenix Evals.
Adversarial / red-team frameworks: NVIDIA Garak, Microsoft PyRIT.
Standards and threat models: OWASP Top 10 for LLM Applications (2025), NIST AI Risk Management Framework, MITRE ATLAS.

Out of scope:

General software-engineering test frameworks (pytest, unittest, Jest, JUnit) when not specifically adapted for LLM evaluation.
Traditional WAF and API gateway products when not specifically extended with LLM-aware filtering.
Compliance management tools (GRC platforms) when used outside the AI-specific governance context.
Closed enterprise security products (Splunk, Crowdstrike, Sentinel) when not adapted with LLM-aware detection.
Model alignment and RLHF infrastructure --- these shape model behavior at training time, distinct from the deployment-time evaluation and guardrails this volume covers.

How to read this catalog

Part 1 (“The Narratives”) is conceptual orientation: the distinction between evals and guardrails, the failure mode taxonomy that organizes what can go wrong, the defense-in-depth model for layered protection, the LLM-as-judge pattern and its biases, and the OWASP Top 10 for LLMs as the canonical threat model. Five diagrams sit in Part 1; everything in Part 2 is text and code.

Part 2 (“The Substrates”) is reference material organized by section. Each section opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates appear in the Fowler-style template established by the prior seven volumes. The entries are not meant to be read front-to-back; jump in via the table of contents.

Part 1 — The Narratives

Five short essays frame the design space for evaluation and guardrails. The reference entries in Part 2 assume the vocabulary established here.

Chapter 1. Evals and Guardrails: Two Different Defenses

The first move is naming the distinction. Evaluation and guardrails are both called “safety” and both called “quality” and both spoken of in the same breath, but they answer different questions, run at different times, and use different tools. Conflating them produces designs where one of the two is missing and the absence isn’t noticed until production.

Evals vs guardrails — Evals run test-time, offline, in CI/CD. Guardrails run online, synchronously, on every production request. Both are necessary.

Evaluation runs before deployment. The team has a candidate version of an agent --- a new prompt, a new tool, a new model upgrade --- and the question is whether it’s ready to ship. The eval suite runs against a reference dataset (curated inputs, known-good outputs, edge cases captured from production), produces structured scores against rubrics, and either passes or fails the CI/CD gate. The failure mode being prevented is regression: shipping an agent worse than the one already in production, where the degradation is silent because the change looked harmless. DeepEval, Ragas, Promptfoo, and OpenAI Evals are the canonical tools; the operational shape is the unit-test pattern from software engineering, adapted for the case where the assertions are about quality rather than correctness.

Guardrails run during deployment. The agent is in production; a real user is making a real request; the question is whether this specific request is safe to process and whether the agent’s specific response is safe to return. Input guardrails run before the model sees the request, blocking obvious prompt injections, off-topic content, and adversarial inputs. Output guardrails run after the model produces a response, blocking PII leakage, toxic content, hallucinated facts, and unsafe tool calls. The failure mode being prevented is incident: a user being harmed, a system being attacked, a regulatory violation occurring. NeMo Guardrails, Guardrails AI, Llama Guard are the canonical tools; the operational shape is the firewall pattern from network security, adapted for the case where the protocol being inspected is natural language.

The two timelines matter. Evals tolerate latency --- a CI run can take ten minutes without any production impact --- and reward thoroughness; a comprehensive eval that catches a regression is worth its cost. Guardrails cannot tolerate latency --- every millisecond added to a request happens on the user’s critical path --- and reward selectivity; a guardrail that blocks legitimate requests is worse than no guardrail. The two timelines produce different design pressures. Evals are LLM-heavy: use the best model for the judging because cost is amortized. Guardrails are classifier-heavy: use small, fast, specialized models because cost is per-request.

The two failure populations matter. Evals catch the regressions a careful engineer would notice in review: the agent forgot how to handle X, the new prompt broke case Y. Guardrails catch the attacks an engineer wouldn’t imagine: the prompt injection embedded in a webpage the agent reads, the jailbreak the team didn’t test against, the off-policy tool call the agent reasoned itself into. The two failure populations barely overlap. A team that builds excellent evals and skips guardrails ships a regression-free agent that gets jailbroken in week two; a team that builds excellent guardrails and skips evals ships a hardened agent whose quality silently degrades with every prompt edit. Both omissions are common; both are correctable; both must be corrected.

Chapter 2. The Failure Mode Taxonomy

Before choosing tools, name the failures the tools are meant to detect. The agent failure space splits naturally into four categories, each with its characteristic shape and its characteristic detection mechanism. A taxonomy isn’t a substitute for thinking about specific risks in specific deployments, but it organizes the thinking.

Quality failures are when the agent produces wrong, irrelevant, or unhelpful output --- but not unsafe output. Hallucinations are the canonical example: the agent confidently states a fact that isn’t true. Faithfulness gaps are the RAG-specific version: the retrieved context contains the truth, but the agent’s answer doesn’t reflect it. Off-topic responses are when the agent answers a question the user didn’t ask. Reasoning errors are arithmetic mistakes, logical contradictions, multi-step plans that don’t add up. These failures harm trust without harming users directly; they’re what evals catch. Detection mechanisms: rubric-based scoring (“does the answer cite the source correctly”), reference-based metrics (BLEU, ROUGE, BERTScore for the cases where reference output exists), LLM-as-judge for the open-ended cases, RAG-specific metrics (Ragas’s faithfulness, answer relevancy, context precision).

Safety failures are when the agent produces harmful content. Toxicity and hate speech are the canonical examples. Bias and discrimination are the systemic versions: the agent treats one demographic group worse than another. PII leakage is when the agent surfaces personal information that should have been protected. Self-harm content is when the agent’s response could harm a vulnerable user. These failures harm users directly; they’re what guardrails block. Detection mechanisms: classifier models trained on harmful-content datasets (Llama Guard, Perspective API, Detoxify), regex and named-entity filters for PII, dedicated bias-evaluation rubrics, content-policy enforcement (Azure Content Safety, OpenAI Moderation API).

Security failures are when an adversary attacks the agent. Direct prompt injection is the user typing “ignore previous instructions”. Indirect prompt injection is the more dangerous case: the agent reads a webpage, an email, or a document containing hidden instructions that the model treats as legitimate. Jailbreaks are crafted prompts that circumvent the model’s safety training. System prompt leakage --- newly added to OWASP’s 2025 LLM Top 10 --- is when an adversary extracts the agent’s system prompt by manipulation. These failures harm the system’s integrity; they require dedicated red-team testing to find and dedicated runtime defenses to block. Detection mechanisms: input filters for known attack patterns, prompt-shield classifier models, red-team frameworks (Garak, PyRIT) that generate attacks systematically, structured prompt design that segregates instruction from data.

Behavior failures are when the agent chooses the wrong action. Tool misuse is when the agent calls the right tool with the wrong arguments, or the wrong tool altogether. Excessive agency --- also expanded in OWASP’s 2025 list --- is when the agent does more than it should: takes consequential actions without approval, calls tools beyond its task scope, uses permissions it didn’t need. Off-policy actions are when the agent takes an action the operating procedure forbids. Resource exhaustion is when the agent loops, retries, or consumes computation without bound. These failures harm operational reliability; they need tool-call evals and policy enforcement. Detection mechanisms: tool-selection accuracy rubrics, scope-restriction policies, per-tool rate limits, action audit logs, and the layered HITL gates from Volume 7.

The taxonomy isn’t a clean partition. Hallucination can be quality (the agent invents a fact) or security (an adversary planted the fact via context injection). PII leakage can be safety (the model regurgitated training data) or security (an adversary extracted it via attack). The categories overlap at the edges; the value of the taxonomy is organizing the detection-tool selection, not classifying each failure perfectly. A serious governance design covers all four categories with appropriate tools and accepts the overlap.

Chapter 3. Defense in Depth: No Single Layer Works

The classical security principle of defense in depth --- multiple independent layers, no single layer trusted to prevent all failures --- applies with unusual force to LLM systems. The reason is that the systems are stochastic. A traditional firewall rule either matches or doesn’t; a content classifier on an LLM output is right most of the time and wrong some of the time, and “most” is rarely good enough to ship alone.

The Swiss cheese model from accident analysis describes the architecture: each defensive layer is a slice of Swiss cheese with holes. A request that reaches the user must pass through every layer; a failure happens only when the holes line up across all layers. Because the layers are independent --- input filters miss different things than classifier guards, which miss different things than output validators --- the probability of holes lining up is dramatically lower than any single layer’s false-negative rate. Eight layers each with a 5% false-negative rate produce a system with roughly a 10⁻¹¹ chance of a clean breach. The math is theoretical, but the engineering intuition is sound: stack the layers.

The canonical agent defense stack: input filter (cheap regex and length checks), classifier guard (Llama Guard or equivalent fast model), prompt structure (system prompts, XML tagging, role-based access for tool exposure), tool whitelist (only the tools this agent should have are available), output validator (schema check, PII redaction, content classification on the response), HITL gate (human approval for high-risk actions, per Volume 7), eval suite (the offline regression checks running in CI/CD), and audit log (the forensic record, per Volume 7’s Chapter 4). The order matters: cheap layers first to filter at the edge, expensive layers last for the requests that survived earlier filters.

Three practical observations about the stack. First, input-side layers are cheap and output-side layers are expensive. A regex filter costs microseconds; a Llama Guard inference costs tens of milliseconds; a fact-checking output validator that calls an LLM costs hundreds of milliseconds. Put cheap filters first so expensive layers only see the requests that survived. Second, approval gates and audit logs are the audit story; everything else is the prevention story. Auditors examining the system after an incident look at the gates and the logs to reconstruct what happened; the prevention layers are what kept most incidents from happening at all. Third, system prompts are not security controls. The OWASP 2025 LLM Top 10 emphasizes the point: LLMs are stochastic, not deterministic; instructions in the system prompt are preferences the model usually follows, not guarantees the model can’t violate. Defense in depth means not relying on system-prompt instructions to enforce hard requirements; if it has to hold, it has to be enforced outside the model.

The stack varies by application. A low-risk customer-support chatbot might run input filter, classifier guard, output validator, audit log --- four layers, lightweight, sub-second. A high-stakes financial agent might run all eight layers plus additional domain-specific checks (fraud detection on transaction amounts, regulatory compliance verification on financial advice), accepting the latency cost as the price of operating in a regulated context. The right stack depth is the one that matches the risk profile; the wrong stack depth is the one that ships with whatever the default tutorial showed.

Chapter 4. LLM-as-Judge: The Pattern That Pervades Everything

Across both evals and guardrails, the LLM-as-judge pattern is everywhere. The pattern uses an LLM to score the output of another LLM (or the same LLM on a different turn) against a rubric, producing a numerical or categorical judgment that drives downstream decisions --- pass/fail in CI, allow/block at runtime, route to human review when uncertain. The pattern works well enough to dominate; it fails in well-documented ways that any production deployment must address.

LLM-as-judge with biases — Cheap, flexible, semantic. Also positionally biased, length-biased, self-preferring, and prone to verbosity capture. Validation is non-negotiable.

The reasons LLM-as-judge dominates are economic and operational. Human annotation costs dollars per minute; LLM judgments cost cents per request, scaling with token count rather than human time. Rubrics for human reviewers are operationally heavy --- reviewer training, calibration sessions, inter-rater agreement studies; rubrics for LLM judges are prompts, iterable in minutes. Semantic understanding lets the LLM judge handle paraphrases, partial credit, and contextual quality that string-match or BLEU-style metrics can’t. The combination is so productive that the field shifted decisively to LLM-as-judge for open-ended evaluation between 2023 and 2024.

The biases are documented and consistent. Positional bias: in pairwise comparisons, judges favor whichever output appears first (or in some configurations, second) regardless of quality. Length bias: judges prefer longer, more elaborate responses, even when the longer version is wrong or padded. Self-preference: a judge using GPT-4 prefers GPT-4 outputs over Claude outputs of equal human-rated quality; the bias goes both directions for different model families. Verbosity capture: an output that explains its reasoning convincingly often scores higher than a curt correct one, even when the explained reasoning is wrong. None of these biases is fatal; all of them can be mitigated; an unmitigated judge produces optimistic noise rather than useful signal.

The mitigations are well-understood. Randomize positions in pairwise comparisons so positional bias averages out. Cap or normalize length so length bias can’t reward verbosity. Cross-check with a second judge model from a different family to detect self-preference. Calibrate the judge against a small human-labeled set periodically --- if the judge’s scores diverge from human scores on the calibration set, the judge has drifted and needs adjustment. Use multiple judges with majority voting for high-stakes evaluations. Treat low-confidence judgments as routes to human review rather than confident decisions. These are not optional refinements; they are the difference between an eval that works and an eval that produces a misleading green light.

The deepest practical observation: the judge model and the model being judged should differ. If the same LLM produces the output and scores the output, self-preference contaminates the score; the model’s blind spots become the judge’s blind spots; the eval becomes a confirmation that the model agrees with itself. Production deployments use a different model family for the judge, often a more capable model than the one being evaluated, and treat the judge’s scores as one signal among several rather than the ground truth.

Chapter 5. The OWASP LLM Top 10 (2025) as Threat Model

The OWASP Top 10 for LLM Applications is the closest thing the field has to a standard threat model. The 2025 edition (released late 2024) consolidates the lessons of two years of production LLM deployments, adds two new categories (System Prompt Leakage and Vector and Embedding Weaknesses) reflecting the agent-architecture shift, and expands Excessive Agency to reflect the autonomous-tool-use surface that agent frameworks introduced. Reading the list once and using it as a checklist for any production agent design is the single highest-leverage governance investment available.

LLM01:2025 Prompt Injection

Top of the list for the second consecutive edition. LLMs process instructions and data in the same channel without architectural separation --- an attacker who can put text into the agent’s context can attempt to override its instructions, and there is no general defense in the model itself. Direct injection happens through user input; indirect injection happens through the documents, webpages, emails, and tool results the agent reads. The mitigations: constrain model behavior via system prompts (recognizing they’re not security controls), segregate untrusted content using prompt structure (XML tags, tool-result wrappers, named delimiters), validate tool calls against expected formats, never grant the agent more authority than it absolutely needs.

LLM02:2025 Sensitive Information Disclosure

Jumped from sixth to second place in the 2025 edition, reflecting the field’s growing awareness of training-data extraction and configuration disclosure. LLMs can memorize and reproduce fragments of training data including PII; attackers have demonstrated extraction techniques. Mitigations: data sanitization in training pipelines, output filtering for PII, differential privacy for sensitive datasets, prompt hygiene that never includes secrets the model could reproduce.

LLM03:2025 Supply Chain

LLM applications increasingly depend on third-party model weights, training datasets, fine-tuning libraries, and plugin ecosystems; each is a potential supply-chain attack vector. Mitigations: model provenance verification, SBOM (software bill of materials) practices extended to AI artifacts, controlled model upgrade processes.

LLM04:2025 Data and Model Poisoning

Tampered training data or fine-tuning data can implant model behaviors that surface only on specific triggers. Mitigations: training data provenance, anomaly detection on training corpora, evaluation against attack patterns before model release.

LLM05:2025 Improper Output Handling

Treating LLM outputs as trustworthy when they’re passed downstream --- to a code executor, a database, an email sender --- produces classic injection vulnerabilities at the LLM/system boundary. Mitigations: output validation matching expected schemas, treating LLM outputs as untrusted user input by default, sandboxing any execution of LLM-generated code.

LLM06:2025 Excessive Agency

Significantly expanded in the 2025 edition, with three named sub-causes: excessive functionality (the agent has tools beyond its task scope), excessive permissions (those tools operate with broader privileges than needed), excessive autonomy (high-impact actions proceed without human approval). The category specifically targets the agent architecture pattern --- it didn’t exist as a separate concern before agents existed. Mitigations: tool whitelisting, least-privilege credentials per tool, human-in-the-loop gates for irreversible actions (the entire Volume 7 catalog), audit logging of every consequential call.

LLM07:2025 System Prompt Leakage (NEW)

New for 2025. An adversary extracts the agent’s system prompt through manipulation, learning the agent’s rules, credentials embedded in the prompt, role structure, or operational logic. OWASP’s guidance is direct: “System prompts are not security controls.” Anything in the system prompt is reachable; therefore nothing security-critical belongs there. Mitigations: never embed credentials, API keys, or sensitive operational logic in prompts; enforce security boundaries outside the model in deterministic auditable code; treat the system prompt as content the user might eventually see.

LLM08:2025 Vector and Embedding Weaknesses (NEW)

New for 2025, targeting RAG and agent memory architectures. Three attack surfaces: embedding poisoning (malicious vectors injected into the store influence retrieval), similarity attacks (crafted queries retrieve unintended content), embedding inversion (reconstructing source text from vectors). The category is OWASP’s recognition that RAG isn’t a safety feature --- it’s a new attack surface. Mitigations: access controls on vector stores per Volume 6’s Section A; integrity validation of indexed content; tenant isolation enforced at the database layer (Supabase RLS, Qdrant payload filters).

LLM09:2025 Misinformation

Renamed from “Overreliance” in the 2025 edition to focus on the model itself generating false information rather than just users trusting it. Hallucinations propagate when users treat outputs as authoritative. Mitigations: faithfulness evals (Ragas), citation requirements, confidence-aware UX that signals uncertainty, human review for high-stakes claims.

LLM10:2025 Unbounded Consumption

Resource exhaustion through expensive operations --- long-running tool calls, recursive agent loops, unbounded retrieval, runaway token consumption. Mitigations: per-request token budgets, tool-call rate limits, loop detection in agent frameworks, cost monitoring with circuit breakers.

Reading the list with a specific agent design in mind produces a checklist. For each of the ten categories, the question is: how does this agent address this risk? An answer of “we don’t” is a real risk identified; an answer of “we rely on the system prompt” is an OWASP violation; an answer with named layers from the defense-in-depth stack (Chapter 3) is the working pattern.

Part 2 — The Substrates

Eight sections follow. Each opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates are presented in the same Fowler-style template used by the prior seven catalogs.

Sections at a glance

Section A --- Input/output guardrails frameworks
Section B --- Classifier-style guardrails
Section C --- Offline evaluation frameworks
Section D --- Reference evaluation suites
Section E --- Observability-integrated evaluation
Section F --- Adversarial and red-team frameworks
Section G --- Standards and threat models
Section H --- Confidence, calibration, and human review
Section I --- Discovery and curation

Section A — Input/output guardrails frameworks

NeMo Guardrails, Guardrails AI, and the Guardrails Hub validator marketplace

Three frameworks dominate the runtime-guardrail category. NeMo Guardrails (NVIDIA) takes a conversation-design approach: Colang scripts define canonical dialog flows, acceptable topics, and forbidden behaviors, with an execution engine that enforces them. Guardrails AI takes a validator-composition approach: reusable validators (PII redaction, hallucination detection, profanity, schema conformance) compose into a Guard object that intercepts inputs and outputs. Guardrails Hub is the marketplace of community-contributed validators that the Guardrails AI framework can install and run.

The two frameworks reflect two different design instincts: NeMo treats conversational safety as a DSL-driven discipline (Colang as the language of safe dialog), while Guardrails AI treats it as a library of testable components. Both work; teams choose by preference for declarative dialog flows versus composable code. Most production deployments combine elements of both.

NVIDIA NeMo Guardrails

Source: github.com/NVIDIA/NeMo-Guardrails (Apache-2; Python)

Classification Programmable guardrail framework with Colang DSL for dialog flow control.

Intent

Provide a programmable execution engine that enforces hard boundaries around conversation topics, prevents prompt injection, blocks toxic content, and steers conversations through canonical dialog flows defined in a domain-specific language called Colang.

Motivating Problem

Conversational AI applications need rules: topics the assistant must refuse to discuss, paths the conversation must follow for compliance reasons, content the model must never produce regardless of how the request is phrased. Encoding those rules in prompts is unreliable because models are stochastic; encoding them in application code produces a tangled mess of conditional logic that’s hard to audit. NeMo Guardrails’ approach is a domain-specific language --- Colang --- in which you define canonical user intents, canonical bot responses, and the flows that connect them, with the framework’s runtime intercepting every turn to enforce the defined behavior.

How It Works

Define the agent’s rules in Colang: a syntax that looks like structured natural language with named user intents (“define user ask about competitors”), bot responses (“define bot decline to discuss competitors”), and flows linking them (“define flow when user ask about competitors bot decline to discuss competitors”). The Colang file is loaded by NeMo’s runtime alongside the underlying LLM.

On each user message, the runtime classifies the message against the defined user intents (using the LLM itself, with the Colang definitions as the classification basis). If a matching intent fires, the runtime routes through the defined flow, executing actions and producing the appropriate bot response. If no intent matches, the message falls through to the underlying LLM with no Colang interference. Input rails (run before the model) can block adversarial inputs; output rails (run after the model) can block unsafe responses; dialog rails (the flows themselves) steer the conversation.

Built-in rail types cover the common cases: topical rails (refuse off-topic), fact-checking rails (verify claims against a source), hallucination rails (detect ungrounded outputs), moderation rails (block unsafe content). Custom rails compose for application-specific rules. Integration with classifier models (Llama Guard, AlignScore) and external safety services is first-class.

When to Use It

Conversational agents with explicit compliance or topic requirements (financial advice avoidance, medical disclaimer enforcement, brand-voice consistency). Cases where a declarative DSL for dialog flows is the right abstraction. Applications that benefit from auditable conversation logic separated from prompt engineering.

Alternatives --- Guardrails AI for the library-of-validators approach when conversation flow logic isn’t the dominant concern. Custom application code when the rules are simple enough not to justify a framework. Llama Guard alone for pure-classification safety without conversation flow.

Sources

github.com/NVIDIA/NeMo-Guardrails
docs.nvidia.com/nemo/guardrails/

Example artifacts

Schema / config.

# config/rails.co --- Colang definitions

define user ask about competitor pricing

"How much does CompetitorX charge?"

"What does CompetitorY cost?"

"Tell me about the pricing of our competitors"

define bot decline to discuss competitors

"I'm not able to discuss specifics about competitor pricing."

"Can I help you understand our pricing instead?"

define flow handle competitor questions

user ask about competitor pricing

bot decline to discuss competitors

bot offer to help with our pricing

Setup.

# config/config.yml

models:

- type: main

engine: anthropic

model: claude-opus-4-7

rails:

input:

flows:

- self check input # block obvious jailbreaks

- llama_guard_check_input # classifier-based safety

output:

flows:

- self check output

- fact_check_output # verify against retrieved sources

Code.

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")

rails = LLMRails(config)

response = rails.generate(

messages=[{"role": "user", "content": "How much does
CompetitorX charge?"}]

)

# response will follow the defined flow:

# "I'm not able to discuss specifics about competitor pricing.

# Can I help you understand our pricing instead?"

Guardrails AI

Source: github.com/guardrails-ai/guardrails (Apache-2; Python)

Classification Validator-composition framework for LLM input/output validation.

Intent

Provide a Python framework where input and output validation is composed from reusable validators (Pydantic-style schema checks, PII detection, hallucination detection, profanity filtering, structural constraints), executed against LLM I/O at runtime.

Motivating Problem

For LLM applications that need to validate model outputs against constraints --- “the response must be valid JSON,” “the answer must cite at least one source,” “the output must not contain phone numbers” --- the natural unit of work is the validator: a small composable check that takes a value, returns valid/invalid plus a reason, and can be triggered to retry or modify the LLM call when validation fails. Guardrails AI builds the framework around this primitive, providing a library of common validators and the runtime to compose and execute them.

How It Works

Define a Guard with the validators that should run on inputs and outputs. The Guard wraps the LLM call; on each request, input validators run first (rejecting or transforming the input), the LLM is called, output validators run on the response (rejecting, retrying, or transforming the response). Validators have configurable on-fail behaviors: exception, reask the LLM with the validator’s feedback, retry, no-op with logging.

The validator library is extensive. Built-in: ValidJson, ValidChoices, RegexMatch, ToxicLanguage, DetectPII, ProfanityFree, GuardrailsHallucination (LLM-based fact-check), OnTopic (LLM-based topic check), and many more. Custom validators are normal Python classes implementing a validate() method. The composition is declarative; the execution is the framework’s concern.

For structured outputs specifically, Guardrails AI integrates with Pydantic: define the expected schema as a Pydantic class, attach validators to fields, and the framework enforces both the structure and the field-level constraints. This is the canonical pattern for tool-call validation and structured-output generation.

When to Use It

Applications with concrete validation requirements (schema conformance, PII handling, specific content rules) where each requirement maps cleanly to a validator. Structured-output use cases (function calling, JSON generation) where the schema validation is the dominant concern. Teams that prefer a code-composition approach over a DSL-driven approach.

Alternatives --- NeMo Guardrails for conversational-flow-heavy applications. Custom application code for simple cases. Llama Guard alone for pure classification safety.

Sources

github.com/guardrails-ai/guardrails
www.guardrailsai.com/docs

Example artifacts

Code.

from guardrails import Guard

from guardrails.hub import ToxicLanguage, DetectPII, ValidJson

from pydantic import BaseModel, Field

# Structured-output validation with field-level rules

class CustomerReply(BaseModel):

response: str = Field(

..., description="The reply to the customer",

validators=[ToxicLanguage(threshold=0.5, on_fail="reask")],

)

contains_pii: bool = Field(...)

guard = Guard.for_pydantic(output_class=CustomerReply)

# Wraps an LLM call; runs validators automatically

result = guard(

llm_api=anthropic_messages_create,

model="claude-opus-4-7",

messages=[{"role": "user", "content": customer_message}],

)

print(result.validated_output) # ToxicLanguage-checked, schema-valid
output

Guardrails Hub

Source: hub.guardrailsai.com (community marketplace; ties to guardrails-ai/guardrails)

Classification Validator marketplace and registry.

Intent

Provide a community-curated marketplace of plug-and-play validators (PII masking, hallucination detection, profanity filtering, topic adherence, jailbreak detection, and many more) that Guardrails AI applications can install and compose by name.

Motivating Problem

Building validators from scratch for every common requirement --- PII detection, profanity, hallucination, jailbreak detection --- duplicates work across teams. Guardrails Hub centralizes the validator population: each validator is a package with its own implementation, documentation, performance characteristics, and license, installable into a Guardrails AI deployment with one CLI command.

How It Works

Browse hub.guardrailsai.com for a validator that matches a requirement; install via the CLI (guardrails hub install hub://guardrails/detect_pii); import and use in code as a normal validator within a Guard. Validators are versioned, licensed individually, and rated by community usage.

Categories cover the common ground: content safety (toxic language, profanity, hate speech), security (jailbreak detection, prompt injection patterns), quality (hallucination, off-topic, fact verification), compliance (PII detection, financial advice, medical advice disclaimers), and structural (JSON validity, regex matches, schema conformance). New validators land regularly; the marketplace pattern reduces the cost of adding a new check to roughly the cost of choosing one.

When to Use It

Any Guardrails AI deployment that needs validators beyond the built-in set. As a starting point for discovering which validators exist for a given concern. As a publication channel for teams that build novel validators and want broader reuse.

Alternatives --- implement validators in-house when the requirement is highly specialized or proprietary. Use built-in validators for the cases the core library covers.

Sources

hub.guardrailsai.com

Section B — Classifier-style guardrails

Specialized small models trained to classify safety, deployed in front of the main agent

When the goal is fast, focused safety classification --- “is this input adversarial,” “is this output toxic,” “does this prompt try to jailbreak” --- a dedicated small classifier model is usually a better answer than running the question through a large general-purpose LLM. Specialized classifiers are an order of magnitude faster, an order of magnitude cheaper, and often more accurate within their specialty than a general model used for the same task. They sit naturally in the input-side and output-side layers of the defense-in-depth stack (Chapter 3) where low latency matters most.

Llama Guard is the most-cited example: Meta’s purpose-built safety classification model, available as a downloadable model weight, deployable in-process or behind an inference endpoint. Other entries in the same category include Azure AI Content Safety (Microsoft’s managed equivalent), AWS Bedrock Guardrails (AWS-native), and Anthropic’s own classifier-style safety endpoints. The pattern is the same across vendors; the choice is operational.

Meta Llama Guard

Source: github.com/meta-llama/llama-recipes ; huggingface.co/meta-llama/Llama-Guard (Llama Community License)

Classification Specialized safety-classification model deployable as a guardrail.

Intent

Provide a fast, classification-focused safety model that classifies inputs and outputs as safe or unsafe across a defined taxonomy of harm categories, deployable in front of a main agent LLM at sub-100ms latency.

Motivating Problem

For runtime safety classification, calling a large general-purpose LLM (GPT-4, Claude Opus, Llama 3.1 405B) as a judge produces accurate but slow and expensive judgments --- 500ms+ per call, dollars per thousand requests. Llama Guard’s answer is a smaller dedicated model (Llama Guard 3 8B and 1B variants as of 2024-2026) trained specifically on harm classification, achieving production-acceptable latency (tens of milliseconds at the 1B size) at a fraction of the cost, with accuracy that rivals or exceeds general-LLM judges on the specific safety task.

How It Works

Llama Guard takes a conversation (or a single utterance) plus a taxonomy of harm categories (the MLCommons AILuminate taxonomy is the default, covering categories like violent crimes, hate, sexual content, child exploitation, suicide-and-self-harm, and others). The model outputs “safe” or “unsafe” plus the matched categories. The classification is fast enough to run on every request without significant latency cost.

Deployment options: download the weights and serve via vLLM or TGI for in-house inference; use the Hugging Face inference endpoint; integrate via NeMo Guardrails (which has a first-class Llama Guard rail) or Guardrails AI. The model is open-weights under the Llama Community License, with commercial-use terms that fit most deployments.

The taxonomy is configurable: an application can use the default categories, add custom categories, or remove categories that don’t apply. Custom categories are described in natural language at inference time, which means the model adapts to application-specific safety definitions without retraining.

When to Use It

Production agent deployments needing fast, focused safety classification on every request. Input filtering for adversarial-prompt patterns. Output filtering for unsafe responses. Cases where general-LLM-as-judge latency is unacceptable. Self-hosted deployments where the model weights need to stay on-premises.

Alternatives --- Azure AI Content Safety or AWS Bedrock Guardrails for managed equivalents. Anthropic’s native moderation endpoint for Claude-based deployments. OpenAI Moderation API for the OpenAI-centric stack. General-LLM-as-judge when latency is not a constraint and the application benefits from richer reasoning over simpler classification.

Sources

huggingface.co/meta-llama/Llama-Guard-3-8B
ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard/

Example artifacts

Code.

# Llama Guard 3 via Hugging Face transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

import torch

model_id = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(

model_id, torch_dtype=torch.bfloat16, device_map="auto"

)

def classify(messages: list) -> dict:

input_ids = tokenizer.apply_chat_template(

messages, return_tensors="pt"

).to(model.device)

output = model.generate(input_ids=input_ids, max_new_tokens=100,
do_sample=False)

response = tokenizer.decode(output[0][input_ids.shape[-1]:],
skip_special_tokens=True)

# Response format: "safe" or "unsafe\nS1,S5" (where S1=Violent
Crimes, S5=Defamation, etc)

lines = response.strip().split("\n")

return {

"safe": lines[0] == "safe",

"categories": lines[1].split(",") if len(lines) > 1 else [],

}

# Use as input guardrail

result = classify([

{"role": "user", "content": user_message},

])

if not result["safe"]:

return {"error": "Request blocked", "reason":
result["categories"]}

Managed classifier guardrails (Azure, AWS, Anthropic)

Source: Azure AI Content Safety / AWS Bedrock Guardrails / Anthropic Moderation

Classification Cloud-vendor managed safety classification endpoints.

Intent

Provide the Llama Guard pattern as a managed service: an API endpoint that classifies content against a vendor-defined safety taxonomy, with no model weights to download or infrastructure to run.

Motivating Problem

For teams that prefer to consume safety classification as a managed service rather than self-host model weights, each major cloud vendor offers an equivalent endpoint. The trade-off is the usual one: lower operational burden in exchange for vendor lock-in, recurring per-call cost, and dependence on the vendor’s taxonomy and update cadence.

How It Works

Azure AI Content Safety: REST API classifying text and images against categories (Hate, Sexual, Violence, SelfHarm) with severity scores; bundled with Azure AI Foundry. AWS Bedrock Guardrails: configurable content policies (denied topics, sensitive information, harmful categories) applied to Bedrock model invocations; declarative configuration in the Bedrock console. Anthropic moderation: classifier-style endpoints that pair with Claude API calls; integrated into Anthropic’s safety stack with policy categories that align with the model’s training.

All three follow the same API shape: send text, receive structured classification scores, take action based on thresholds. Integration with the LLM call is typically a wrapper pattern: pre-call classification on the input, post-call classification on the output, with policies determining when to block versus redact versus log.

When to Use It

Cloud-native deployments where the vendor’s broader stack is already in use. Cases where the operational burden of self-hosting a classification model isn’t justified by the volume or by data-residency constraints. Multi-tenant applications where the vendor’s SLAs and compliance certifications matter.

Alternatives --- Llama Guard for the self-hosted open-weights path. Multiple managed classifiers in parallel for defense in depth across vendors.

Sources

learn.microsoft.com/azure/ai-services/content-safety/
docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html

Section C — Offline evaluation frameworks

DeepEval, Ragas, and Promptfoo --- the CI/CD-shaped eval tools

Three offline evaluation frameworks dominate the open-source category as of mid-2026, each with a clear positioning. DeepEval is the pytest-style framework: write evaluations as test functions, assert against metric thresholds, run in CI/CD pipelines, fail the build on regression. Ragas is the RAG-specialist: it provides the canonical metrics for retrieval-augmented generation systems (faithfulness, answer relevancy, context precision, context recall) and the harness to compute them. Promptfoo is the CLI-first option: rapid side-by-side comparison of prompts, models, and configurations from the command line, optimized for fast iteration over careful CI integration.

The three tools overlap in scope but differ in shape. DeepEval is for the team that wants LLM evaluation to feel like unit testing. Ragas is for the team whose primary product is a RAG application and who needs the RAG-specific metrics done right. Promptfoo is for the team iterating on prompts daily and needing comparison results in seconds. Most serious teams use two of the three for different purposes.

DeepEval

Source: github.com/confident-ai/deepeval (Apache-2; Python)

Classification Pytest-style LLM evaluation framework.

Intent

Provide a Python evaluation framework that feels like pytest, with assertions over LLM-specific metrics (hallucination, faithfulness, contextual relevancy, tool correctness, bias, toxicity), runnable in CI/CD pipelines, with first-class support for tracking metrics over time and detecting regressions.

Motivating Problem

LLM evaluation has historically been ad-hoc --- spreadsheets of test cases, notebooks of one-off comparisons, scattered Jupyter outputs. The discipline of unit testing in software engineering offers a clear model: small composable tests, clear pass/fail assertions, integration with CI/CD, automatic regression detection. DeepEval ports that discipline to LLM evaluation, deliberately echoing pytest in its API surface to make adoption easy for engineers already comfortable with the testing-as-code idiom.

How It Works

Write evaluations as test functions with @pytest.mark.parametrize for test cases. Use deepeval.assert_test with metrics like HallucinationMetric, FaithfulnessMetric, AnswerRelevancyMetric, ToolCorrectnessMetric. Each metric is computed against the test case and asserts against a threshold; the test passes if the metric exceeds the threshold, fails otherwise. Standard pytest CLI runs the suite; CI/CD integration uses the same green-checkmark workflow as any pytest project.

Metrics are LLM-based: the framework uses an evaluator LLM (configurable, defaults to GPT-4 family) to compute scores against rubrics. The rubrics are well-specified for the common metrics: FaithfulnessMetric measures the alignment between retrieved context and the generated answer; HallucinationMetric measures whether the answer contains claims unsupported by the context. Custom metrics are normal Python classes implementing measure() and is_successful().

DeepEval’s confident-ai.com is the optional managed dashboard for tracking metrics over time, organizing test suites, and producing the reports that justify investment in evaluation. The framework itself is fully usable without the dashboard; the dashboard is the convenience for teams that want centralized eval visibility.

When to Use It

Production LLM applications adding CI/CD-integrated evaluation. Teams that already use pytest for software testing and want LLM evaluation in the same idiom. Regression detection across prompt changes, model upgrades, and pipeline modifications.

Alternatives --- Promptfoo for CLI-driven rapid iteration. Ragas for RAG-specific evaluation. LangSmith Evaluators for the LangChain-integrated path. OpenAI Evals for the reference-suite case.

Sources

github.com/confident-ai/deepeval
docs.confident-ai.com

Example artifacts

Code.

# tests/test_agent.py

from deepeval import assert_test

from deepeval.test_case import LLMTestCase

from deepeval.metrics import (

FaithfulnessMetric,

AnswerRelevancyMetric,

HallucinationMetric,

)

import pytest

TEST_CASES = [

LLMTestCase(

input="What was the warranty period?",

actual_output=my_agent("What was the warranty period?"),

retrieval_context=[doc1, doc2],

expected_output="3 years",

),

# ... dozens more cases drawn from production traces ...

]

\@pytest.mark.parametrize("test_case", TEST_CASES)

def test_faithfulness(test_case):

assert_test(test_case, [FaithfulnessMetric(threshold=0.8)])

\@pytest.mark.parametrize("test_case", TEST_CASES)

def test_relevancy(test_case):

assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

# Run in CI/CD: pytest tests/ --- fails the build if metrics drop
below threshold

Ragas

Source: github.com/explodinggradients/ragas (Apache-2; Python)

Classification RAG-specific evaluation framework.

Intent

Provide the canonical evaluation framework for retrieval-augmented generation systems, with RAG-specific metrics (faithfulness, answer relevancy, context precision, context recall, answer correctness) computed correctly and with thoughtful handling of the multi-component nature of RAG pipelines.

Motivating Problem

RAG systems have two failure surfaces that pure-generation metrics don’t capture: retrieval quality (did the retriever find the relevant documents) and generation faithfulness (does the answer reflect the retrieved documents). General-purpose evaluation frameworks treat the RAG output as a black box; Ragas treats the RAG pipeline as a composite of retriever and generator with separable metrics for each component, plus end-to-end metrics for the combination.

How It Works

Provide Ragas with the question, the retrieved context, the generated answer, and (optionally) a ground-truth answer. Ragas computes metrics: Faithfulness (fraction of answer claims supported by the context), Answer Relevancy (how well the answer addresses the question), Context Precision (whether the retrieved context contains the relevant information at high ranks), Context Recall (whether the retrieved context covers the ground-truth answer’s information), Answer Correctness (semantic similarity to ground truth).

The metrics decompose evaluation correctly. A low Faithfulness with high Context Precision means the generator is hallucinating despite having the right context (a generation problem). A low Context Recall means the retriever didn’t find the right documents (a retrieval problem). The diagnostic value is that the metrics tell the team where to focus improvement effort.

Ragas integrates with LangChain, LlamaIndex, and Haystack natively, and works with any pipeline that produces the question/context/answer tuples. Synthetic test-set generation (Ragas can generate evaluation questions from a corpus) is a first-class capability that addresses the cold-start problem for new RAG deployments.

When to Use It

Production RAG deployments needing rigorous evaluation. Agent memory systems (Volume 6) that retrieve from vector stores. Document-Q&A applications. Any system where the retrieval-and-generation pipeline’s components need separate quality measurement.

Alternatives --- DeepEval for the broader LLM evaluation case (DeepEval also includes RAG metrics but Ragas is the specialist). Custom evaluation when the RAG pipeline shape doesn’t match Ragas’s assumptions.

Sources

github.com/explodinggradients/ragas
docs.ragas.io

Example artifacts

Code.

from ragas import evaluate

from ragas.metrics import (

faithfulness,

answer_relevancy,

context_precision,

context_recall,

)

from datasets import Dataset

# Production RAG traces, converted to Ragas dataset format

dataset = Dataset.from_dict({

"question": questions,

"answer": generated_answers,

"contexts": [retrieved_chunks_per_question],

"ground_truth": reference_answers,

})

result = evaluate(

dataset,

metrics=[faithfulness, answer_relevancy, context_precision,
context_recall],

)

# Component-wise diagnostics:

# - Low faithfulness + high context_precision => generator
hallucinating

# - Low context_recall => retriever missing relevant docs

print(result)

Promptfoo

Source: github.com/promptfoo/promptfoo (MIT; TypeScript)

Classification CLI-first LLM evaluation and red-teaming tool.

Intent

Provide a CLI tool and library for evaluating LLM outputs through side-by-side comparison of prompts, models, and parameters, with red-teaming capabilities for security testing and a web UI for inspecting results.

Motivating Problem

For the team iterating on prompts daily --- trying variant A vs variant B, comparing model upgrades, testing parameter sensitivities --- the friction of writing pytest-style tests is too high. The desired workflow is closer to the BLAST-style hyperparameter sweep: define inputs and configurations in a YAML file, run a single command, get a side-by-side comparison in seconds. Promptfoo is built around this workflow with the YAML-config-plus-CLI pattern at the center.

How It Works

Write a promptfooconfig.yaml describing the prompts, the providers (which LLM), the test cases (inputs), and the assertions (what counts as passing). Run promptfoo eval. The tool executes the matrix of prompts × providers × tests in parallel, applies assertions to each output, and produces a structured report. The web UI (promptfoo view) renders the matrix for inspection.

Assertions are flexible: string matching, regex, LLM-as-judge rubrics, semantic similarity, custom JavaScript. The matrix view makes it easy to see which prompts work for which cases and which configurations regress. Caching is aggressive (cached responses for unchanged inputs); the iteration loop is fast --- modify a prompt, rerun, get results in seconds.

Promptfoo also covers the red-teaming surface: promptfoo redteam generates adversarial test cases (prompt injection patterns, jailbreak attempts, bias-probing inputs) and runs them through the configured pipeline, surfacing which cases produced unsafe outputs. This is unusual integration; most eval tools and most red-team tools are separate.

When to Use It

Daily prompt iteration where the friction of writing pytest-style tests is too high. Side-by-side model comparison (the same prompt across Claude, GPT, Gemini, Llama). Quick red-teaming passes during development. Teams that prefer YAML configuration over Python.

Alternatives --- DeepEval for pytest-style CI integration with metrics tracking over time. Ragas for RAG-specific evaluation. PyRIT or Garak for the dedicated red-teaming case.

Sources

github.com/promptfoo/promptfoo
www.promptfoo.dev

Example artifacts

Schema / config.

# promptfooconfig.yaml

description: "Compare two prompt variants for customer support"

providers:

- anthropic:claude-opus-4-7

- openai:gpt-5

prompts:

- file://prompts/support_v1.txt

- file://prompts/support_v2.txt

tests:

- vars:

customer_question: "How do I return an item?"

assert:

- type: contains

value: "return policy"

- type: llm-rubric

value: "The response is polite and accurate."

- vars:

customer_question: "Cancel my account immediately!"

assert:

- type: contains-any

value: ["sorry", "understand"]

- type: not-contains

value: "unable"

# Run: promptfoo eval

# View results: promptfoo view

Section D — Reference evaluation suites

OpenAI Evals and HELM: pre-built benchmarks for measuring agent capability

Where Section C’s frameworks evaluate your specific agent against your specific test cases, the reference suites evaluate models and agents against published benchmarks: standardized inputs, agreed-upon scoring methodologies, comparable results across teams. OpenAI Evals (the GitHub registry) is the open-source framework with hundreds of contributed evaluations. HELM (Holistic Evaluation of Language Models, Stanford CRFM) is the academic-rigor reference for cross-model comparison across dozens of scenarios and metrics. Both are useful for the question “is my agent in the right ballpark on benchmark X” rather than “does my agent handle my specific use case.”

OpenAI Evals

Source: github.com/openai/evals (MIT; Python)

Classification Open-source registry of LLM evaluations with a standardized framework.

Intent

Provide an open framework for creating, sharing, and running LLM evaluations, with a registry of community-contributed evaluations covering benchmarks, behavioral checks, and capability tests.

Motivating Problem

Many useful LLM evaluations exist as one-off scripts in research papers and product blog posts; the result is duplicated effort and inconsistent reporting. OpenAI Evals standardizes the evaluation format --- a registry of named evaluations, each with defined test cases and scoring methodology --- so an evaluation defined once can be run by anyone against any compatible model. The library covers traditional benchmarks (MMLU, HumanEval, BBH) plus behavioral checks specific to assistant deployments.

How It Works

Evaluations are defined as YAML registry entries plus Python implementation classes. The framework runs the registered evaluation against a configured model (any provider with a compatible API), produces standardized output, and supports comparison across runs. The registry on GitHub holds hundreds of community-contributed evaluations.

Two evaluation styles dominate. Match-style: the evaluation has a reference answer; the model’s output is compared (exact match, regex, fuzzy) against the reference. Model-graded: a stronger model judges whether the candidate model’s output meets criteria; this is the LLM-as-judge pattern (Chapter 4) applied within the eval framework. Custom evaluation types are first-class.

When to Use It

Comparing a model or agent against published benchmarks. Reproducing published evaluations to verify claimed capabilities. Contributing new evaluations to the community registry. Cases where the standardization itself is valuable for cross-team or cross-vendor comparison.

Alternatives --- DeepEval, Ragas, Promptfoo for application-specific evaluation. HELM for the academic-rigor reference. lm-evaluation-harness (EleutherAI) for the academic benchmarking community’s preferred framework.

Sources

github.com/openai/evals

HELM (Holistic Evaluation of Language Models)

Source: github.com/stanford-crfm/helm (Apache-2; Python)

Classification Academic-rigor reference framework for cross-model evaluation.

Intent

Provide a comprehensive, methodologically rigorous evaluation framework that compares language models across dozens of scenarios and metrics, with public leaderboards tracking model performance over time.

Motivating Problem

The model-comparison space is noisy: vendors report numbers favorable to their models, benchmarks are gamed, evaluation methodologies vary. HELM’s response, from the Stanford Center for Research on Foundation Models, is methodological rigor: a defined taxonomy of scenarios (multi-domain Q&A, reasoning, summarization, multilinguality, etc.), defined metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency), explicit evaluation protocols, and public release of all results.

How It Works

Scenarios are defined formally: each captures a capability with specific datasets, prompt formats, and scoring rules. Metrics are defined formally: accuracy alone is insufficient; HELM also measures calibration (does confidence track correctness), robustness (do small input perturbations change outputs), fairness (do performance differences across demographic groups exist). Models under evaluation are run against the full matrix; results are published on the HELM leaderboards.

The HELM-Lite variant covers a smaller scenarios-and-metrics matrix for the practical case where the full suite is overkill. HELM-Instruct evaluates instruction-following capabilities specifically. HELM-Safety adds adversarial scenarios for the safety dimension.

When to Use It

Cross-model comparison requiring academic-grade methodology. Evaluating a new model against the published leaderboards. Research and analysis where the multi-dimensional view (accuracy plus calibration plus fairness plus efficiency) matters more than a single bottom-line number.

Alternatives --- OpenAI Evals for community-contributed coverage. lm-evaluation-harness for the EleutherAI-community-preferred framework. Application-specific evaluation (Sections C) when the published-benchmark question isn’t the question.

Sources

github.com/stanford-crfm/helm
crfm.stanford.edu/helm/

Section E — Observability-integrated evaluation

When traces become test data: LangSmith Evaluators and Phoenix Evals

The observability platforms in Volume 7’s Section C --- LangSmith, Phoenix, Langfuse --- ship with evaluation features built on top of their trace data. The pattern is conceptually clean: production traces are the natural source of evaluation inputs (real questions users actually asked); the trace UI surfaces failures (the runs to investigate); evaluators score traces against rubrics; the scored traces become the dataset for the next prompt iteration. Closing the loop between observability and evaluation is what these platforms’ eval features add.

LangSmith Evaluators

Source: smith.langchain.com (LangChain; commercial SaaS + self-host)

Classification Trace-integrated evaluation surface within the LangSmith platform.

Intent

Provide built-in evaluation capabilities tied to LangSmith traces: capture production runs into datasets, run evaluators against them, track metric trends over time, and integrate with the same UI that surfaces traces and prompt versions.

Motivating Problem

Teams already running LangSmith for tracing (Volume 7) often duplicate work setting up a separate evaluation framework. LangSmith Evaluators removes the duplication: the same traces that power debugging also become the dataset for evaluation; evaluators run against the dataset; metrics surface in the same dashboards as the traces themselves; iteration on prompts produces side-by-side comparisons with the previous version using the same eval suite.

How It Works

Capture production runs into a Dataset (a curated collection of inputs and expected outputs, drawn from real traces). Define Evaluators --- built-in (correctness, conciseness, harmfulness) or custom (Python functions that score a run). Trigger eval runs from the UI or programmatically; results appear alongside the trace data with comparison views showing how the new prompt version performed against the dataset compared to the previous version.

The integration with prompt versioning matters: when a developer ships a new prompt, LangSmith automatically reruns the evaluation suite against the new version and surfaces any regressions before the prompt is promoted to production. The workflow approximates pre-merge testing for prompts in the same way unit tests function for code.

When to Use It

Teams already using LangSmith for tracing who want evaluation in the same platform. Workflows where production traces are the primary source of evaluation cases. Iterative prompt development where regression detection against a fixed dataset accelerates safe iteration.

Alternatives --- DeepEval or Promptfoo for standalone CI-style evaluation. Phoenix Evals for the open-source equivalent. Custom Python code when the platform abstraction doesn’t fit.

Sources

docs.smith.langchain.com/evaluation

Phoenix Evals (Arize)

Source: github.com/Arize-ai/phoenix (Apache-2; Python)

Classification Open-source trace-integrated evaluation built on OpenInference traces.

Intent

Provide LLM evaluation capabilities tightly integrated with Phoenix’s trace UI, with built-in evaluators (hallucination, relevance, toxicity, code QA) and the workflow of capturing traces, running evals against them, and inspecting failures in the same notebook or self-hosted UI.

Motivating Problem

For teams using Phoenix for observability, Phoenix Evals provides the equivalent of LangSmith Evaluators in the open-source camp: trace-derived datasets, batch evaluation runs, metric tracking, and the option to deploy entirely on the team’s infrastructure. The OpenInference-based trace format means evaluators can run against traces from any OpenInference-instrumented agent framework, not just LangChain.

How It Works

Run Phoenix locally (notebook mode) or as a self-hosted server. Captured traces are queryable as DataFrames; evaluators --- built-in classes for the common rubrics or custom callables --- run against the trace data; results are stored back into Phoenix and displayed in the UI. The evaluators use LLM-as-judge under the hood (with all the caveats from Chapter 4), with configurable judge models.

Built-in evaluators cover hallucination, relevance, Q&A correctness, retrieval evaluation (faithfulness, context relevance), and toxicity. The list grows as the open-source community contributes; the pattern of defining custom evaluators is straightforward Python.

When to Use It

Teams already using Phoenix for tracing, especially in self-hosted deployments. Workflows where the data-residency or audit posture requires the eval pipeline to remain on the team’s infrastructure. OpenInference-instrumented agent stacks that want a single tool for tracing and evaluation.

Alternatives --- LangSmith Evaluators for the LangChain-managed equivalent. Langfuse evaluations for the third open-source option. DeepEval / Ragas / Promptfoo for standalone evaluation frameworks outside the observability platform.

Sources

docs.arize.com/phoenix/evaluation/evals

Section F — Adversarial and red-team frameworks

Garak and PyRIT: systematically generating attacks to find what breaks

Where Section C’s evaluation frameworks check whether the agent does the right thing on intended inputs, the red-team frameworks systematically generate adversarial inputs to find what the agent does on inputs designed to break it. Two frameworks dominate the open-source space. NVIDIA Garak is the vulnerability scanner pattern: a catalog of probes that test for specific failure modes (prompt injection, encoding attacks, jailbreaks, hallucination amplification, data leakage), runnable as a CLI against any model endpoint. Microsoft PyRIT (Python Risk Identification Tool) is the orchestrator pattern: a framework for composing attacks, with built-in attack strategies and the ability to chain multiple attack stages.

Both frameworks address OWASP LLM01 (Prompt Injection) and LLM06 (Excessive Agency) directly, and both produce structured reports identifying which attack patterns succeeded against the target. The practical use is twofold: pre-launch red-teaming to find vulnerabilities before users do, and ongoing red-teaming as part of model and prompt iteration so the agent doesn’t silently develop new vulnerabilities.

NVIDIA Garak

Source: github.com/NVIDIA/garak (Apache-2; Python)

Classification LLM vulnerability scanner.

Intent

Provide a vulnerability scanner for LLM endpoints, with a catalog of probes that test for specific failure modes (prompt injection, jailbreaks, data leakage, encoding attacks, toxicity, malicious code generation), runnable from the CLI with structured reports.

Motivating Problem

LLM vulnerability discovery has been ad-hoc --- papers describe specific attacks, but applying them systematically to a deployed model requires reimplementing each one. Garak’s answer is a structured catalog of probes: each probe targets a specific failure mode (DAN-style jailbreaks, ASCII smuggling, encoding-based bypasses, hallucination amplification, malicious code generation, profanity, PII leakage), implemented once, runnable as a set against any compatible LLM endpoint. The output is a structured report showing which probes succeeded --- a vulnerability scan of the model.

How It Works

Configure Garak to point at an LLM endpoint (any HTTP API, including OpenAI-compatible, Anthropic, Hugging Face, REST). Select probes from the catalog or run the default suite. Garak fires each probe (often with many variations per probe) against the endpoint; detectors classify the responses; the report summarizes which probes broke through and which didn’t.

The probe catalog is extensive and growing: dozens of probe families covering prompt injection variants, multilingual attacks, encoding tricks (base64, ROT13, Unicode tricks), refusal-bypass jailbreaks, content-policy probes (toxicity, harmful instructions), data-leakage probes (training-data extraction, PII recall), and more. Detectors range from simple regex to LLM-based classification.

When to Use It

Pre-launch security testing of any LLM application. Periodic red-teaming as part of model upgrades or prompt iteration. Comparison testing across multiple models or providers (“which model resists which attacks”). Compliance evidence (the structured Garak report becomes part of the security audit artifact).

Alternatives --- PyRIT for the orchestrator pattern that composes multi-stage attacks. Promptfoo redteam for the CLI-integrated single-tool case. Custom red-team scripts when the vulnerabilities being tested don’t match existing probes.

Sources

github.com/NVIDIA/garak

Example artifacts

Setup.

pip install garak

# Set provider keys, e.g. ANTHROPIC_API_KEY

Code.

# Run a default scan against an Anthropic endpoint:

garak --model_type anthropic --model_name claude-opus-4-7

# Run specific probes:

garak --model_type anthropic --model_name claude-opus-4-7 \

--probes dan,promptinject,malwaregen,leakreplay

# Produce a JSON report for CI/CD integration:

garak --model_type anthropic --model_name claude-opus-4-7 \

--probes promptinject --report_prefix ./scan-2026-05

Microsoft PyRIT

Source: github.com/Azure/PyRIT (MIT; Python)

Classification Red-team orchestration framework for multi-stage attack composition.

Intent

Provide a Python framework for composing and running adversarial attacks against LLM systems, with built-in attack strategies (jailbreaks, prompt injections, prompt converters), pluggable target endpoints, and scoring mechanisms to identify successful attacks.

Motivating Problem

Where Garak is the vulnerability scanner pattern (run a fixed catalog of probes), PyRIT is the orchestrator pattern (compose attacks dynamically, chain stages, adapt based on responses). The model fits the case where the attack itself is a sequence --- a jailbreak that requires several turns of conversation to set up, an indirect prompt injection that requires planting content in a document the agent reads, a multi-stage attack that builds context across calls. PyRIT’s primitives --- Orchestrator, AttackStrategy, PromptTarget, Scorer --- compose to express these multi-stage attacks.

How It Works

Define a PromptTarget (the LLM under test). Define one or more AttackStrategy implementations (DAN, jailbreak templates, prompt-injection patterns, role-play attacks). Configure Scorer implementations that determine whether an attack succeeded (LLM-based judges, regex matchers, classifier models). Use an Orchestrator to run the attack --- single-turn, multi-turn, or with adaptive feedback based on the target’s responses.

PyRIT’s strengths over Garak: multi-turn attack composition (a conversation that progressively jailbreaks the target), prompt converters (encode the attack payload through base64, ROT13, language translation), adaptive attacks (modify the next attack based on the target’s response to the previous one). The integration with Azure AI Content Safety provides a built-in scoring path for Microsoft-deployed targets.

Operational: Microsoft maintains PyRIT as part of its responsible-AI tooling; the framework is used by Microsoft’s own red teams against deployed AI products. The integration with Azure is first-class but not required --- the framework runs against any HTTP endpoint with appropriate adapter code.

When to Use It

Multi-stage red-teaming where attacks require composition. Adaptive attacks that adjust based on target responses. Azure-deployed AI systems where the integrated Content Safety scoring matters. Research and tool development for new attack patterns.

Alternatives --- Garak for the simpler scanner pattern. Custom code when the attack vocabulary doesn’t match either framework. Combine both: PyRIT for novel composed attacks, Garak for the standard probe coverage.

Sources

github.com/Azure/PyRIT
azure.github.io/PyRIT/

Section G — Standards and threat models

OWASP Top 10 for LLMs, NIST AI RMF, and MITRE ATLAS

Three published standards shape how teams talk about LLM risk. OWASP Top 10 for LLM Applications (covered in Part 1 Chapter 5) is the canonical threat catalog, useful as a design checklist and as a shared vocabulary across security, product, and engineering teams. NIST AI Risk Management Framework (AI RMF) is the higher-level governance framework, mapping to AI risk management activities (Govern, Map, Measure, Manage) and producing the documentation enterprise compliance functions expect. MITRE ATLAS catalogs the adversarial tactics and techniques applied to AI systems, in the same MITRE ATT&CK style used for general cybersecurity.

The three frameworks are complementary, not competing. OWASP is the developer-facing checklist; NIST is the executive-facing governance framework; MITRE ATLAS is the security-team-facing threat intelligence resource. A serious governance program references all three.

OWASP Top 10 for LLM Applications (2025)

Source: genai.owasp.org/llm-top-10/ (OWASP Foundation; open access)

Classification Developer-facing threat catalog and design checklist.

Intent

Identify the ten most critical security risks in LLM applications, with descriptions, example attack scenarios, and prevention strategies, updated annually based on community input and real-world incidents.

Motivating Problem

Production LLM applications have failed in characteristic ways since 2023: prompt injection of various kinds, training-data extraction, excessive agency in agent architectures, RAG attacks against vector stores, and others. The OWASP project consolidates these failure modes into a single ten-item list that’s short enough to use as a design checklist and detailed enough to drive specific mitigations. The 2025 edition reflects two years of accumulated lessons, including the new categories (System Prompt Leakage, Vector and Embedding Weaknesses) that the agent shift made unavoidable.

How It Works

Each entry in the Top 10 has a description, common examples of vulnerability, prevention and mitigation strategies, example attack scenarios, and references to related frameworks. The format is deliberately security-developer-friendly: read once, use as a checklist forever.

The 2025 list (covered in Part 1 Chapter 5): LLM01 Prompt Injection; LLM02 Sensitive Information Disclosure; LLM03 Supply Chain; LLM04 Data and Model Poisoning; LLM05 Improper Output Handling; LLM06 Excessive Agency (significantly expanded with three sub-causes); LLM07 System Prompt Leakage (new); LLM08 Vector and Embedding Weaknesses (new); LLM09 Misinformation (renamed from Overreliance); LLM10 Unbounded Consumption.

Integration with red-team tools is increasingly first-class. DeepEval’s DeepTeam product, for example, ships an OWASPTop10 attack-set configuration that generates adversarial inputs aligned with each Top 10 category, producing structured reports against the standard.

When to Use It

Every production LLM deployment. The right starting point for the security review of any new agent system. As shared vocabulary across security, product, and engineering teams when discussing AI risk.

Alternatives --- none for the role OWASP plays. NIST AI RMF and MITRE ATLAS complement OWASP rather than replace it.

Sources

genai.owasp.org/llm-top-10/
owasp.org/www-project-top-10-for-large-language-model-applications/

NIST AI Risk Management Framework

Source: nist.gov/itl/ai-risk-management-framework (US NIST; public)

Classification Executive-facing AI governance framework.

Intent

Provide a framework for organizations to manage AI risk throughout the AI lifecycle, organized around four core functions (Govern, Map, Measure, Manage) with detailed profiles for sectors (Generative AI Profile, July 2024) and use cases.

Motivating Problem

Enterprise compliance and governance functions need a framework that maps to their existing risk management vocabulary --- risk identification, measurement, mitigation, documentation, accountability. OWASP’s checklist is the right artifact for developers; NIST AI RMF is the right artifact for governance teams, regulators, and auditors who need a recognizable framework with documented profiles.

How It Works

The framework defines four core functions. Govern: establish organizational structures for AI risk management, including roles, policies, and accountability. Map: identify context, intended uses, and risks of AI systems. Measure: analyze risks using metrics, evaluations, and assessments. Manage: prioritize and treat risks, with documented mitigation strategies.

The Generative AI Profile (NIST AI 600-1, July 2024) extends the framework specifically for generative AI systems, with risk categories aligned to GenAI-specific failure modes. The Companion Resource (NIST AI 100-1) provides specific actions for each function-and-risk-category combination. Together they form the documentation backbone for enterprise GenAI governance programs.

When to Use It

Enterprise AI deployments requiring governance documentation. Regulated industries where AI risk management is a compliance requirement. Organizations building AI risk management programs that need a recognized framework. Procurement processes evaluating AI vendor risk posture.

Alternatives --- ISO/IEC 42001 (AI management system standard) for organizations that prefer the ISO framework family. EU AI Act for European deployments where regulatory compliance is the driver. Domain-specific frameworks (FDA AI/ML guidance for medical, FINRA for financial services) for vertical-specific requirements.

Sources

nist.gov/itl/ai-risk-management-framework
nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

MITRE ATLAS

Source: atlas.mitre.org (MITRE Corporation; open access)

Classification Security-team-facing threat intelligence for AI systems.

Intent

Catalog the tactics and techniques adversaries use against AI systems, in the same MITRE ATT&CK style used for general cybersecurity, providing security teams with structured threat intelligence specific to AI.

Motivating Problem

Security operations teams already use MITRE ATT&CK for general cybersecurity threat modeling --- a structured catalog of adversary tactics (the categories of attacker behavior) and techniques (the specific methods within each tactic). Applying the same vocabulary to AI-specific attacks brings AI security into the existing security operations workflow. MITRE ATLAS provides the AI-specific equivalent, with tactics (Reconnaissance, Resource Development, Initial Access, ML Model Access, Execution, Persistence, Defense Evasion, Discovery, Collection, ML Attack Staging, Exfiltration, Impact) and techniques within each, plus documented case studies of real-world AI attacks.

How It Works

Browse the ATLAS matrix at atlas.mitre.org; navigate tactics and techniques. Each technique has a description, examples (often real incidents), and mitigations. The case studies (curated AI incident reports) ground the abstract techniques in concrete events.

Integration with security tooling: ATLAS technique IDs can be tagged in security incident reports, threat intelligence feeds, and detection rules, creating the same shared vocabulary that ATT&CK provides for general cybersecurity. PyRIT’s attacks are mapped to ATLAS techniques; Garak’s probes increasingly are as well.

When to Use It

Security operations teams adding AI to their threat modeling vocabulary. Incident response for AI-related events where the team wants a shared classification. Threat intelligence consumption focused on AI-specific adversary behavior. Red-team activities that need to map their attacks to a published taxonomy.

Alternatives --- OWASP Top 10 for the developer-facing equivalent (the two frameworks complement each other). MITRE ATT&CK for non-AI parts of the threat surface. Vendor-specific threat intelligence feeds for tactical, time-sensitive information.

Sources

atlas.mitre.org

Section H — Confidence, calibration, and human review

Deciding what to trust and what to route to a human --- the quality-control discipline behind the frameworks

Sections A through G catalog the frameworks that score outputs and block bad ones. This section is about the discipline that uses them: eliciting confidence from the model, checking that the confidence means what it claims, and sampling the output stream so human reviewers see the cases that matter. These are methodology patterns, not products --- they sit on top of the eval and guardrail substrates and turn their scores into review decisions. LLM-as-judge (Chapter 4) produces the scores; these patterns decide what to do with them.

Per-field confidence and selective routing

Source: Practitioner conventions for extraction quality control; Anthropic and OpenAI structured-output guidance

Classification Eliciting confidence per output field and routing only the weak fields to review.

Intent

Have the model emit a confidence score for each field it produces, then auto-accept the confident fields and route only the low-confidence ones to human review, rather than accepting or reviewing a whole record on a single coarse score.

Motivating Problem

A structured extraction usually gets most fields right and a few wrong. A single record-level confidence score hides which fields are weak: a 0.85 overall can mean every field is a solid 0.85, or that four fields are 0.98 and one is 0.30. Routing on the record score either sends good fields to review alongside the bad one --- wasting reviewer time --- or accepts the bad field along with the good ones and ships an error. The weakness is per-field, so the confidence and the routing should be too.

How It Works

Elicit confidence per field: design the output schema so each extracted value is paired with a confidence, rather than one score for the whole record. A field becomes an object of value and confidence, or the schema carries a parallel confidence map. The model reports how sure it is of each value it wrote.

Read the scores at the field level: high-confidence fields flow through, low-confidence fields are flagged, and a borderline band in between is tuned to taste. Because the scores are per field, a single record can be partly auto-accepted and partly routed.

Route partially: a mixed-confidence record is split, not sent whole. The confident fields are accepted; the reviewer sees only the fields that need a second look, with the source and the model’s value beside them. This cuts review volume sharply versus routing whole records.

Set the policy explicitly: decide per-field versus per-record routing, the thresholds, and what the borderline band does, and state it as policy rather than an ad-hoc cutoff. The thresholds are only trustworthy if the confidence is calibrated (next pattern) --- an uncalibrated score routes the wrong fields.

When to Use It

Structured extraction and classification at volume, where records have many fields and only some are hard: invoices, forms, entity extraction, data enrichment. Most valuable when reviewer time is the bottleneck and blanket review is too expensive. Depends on the model exposing usable per-field confidence, and on that confidence being calibrated.

Alternatives --- record-level routing (Volume 1, Cascade/Fallback) when outputs are single-valued rather than multi-field; full review when the domain is high-stakes enough that no field should ship unchecked.

Sources

docs.claude.com/en/docs/build-with-claude/tool-use
platform.openai.com/docs/guides/structured-outputs

Example artifacts

Code.

{
  "invoice_number": { "value": "INV-2213", "confidence": 0.98 },
  "vendor_name": { "value": "Acme Industrial", "confidence": 0.94 },
  "total_cents": { "value": 812900, "confidence": 0.41 }
}

Confidence calibration

Source: Guo et al., “On Calibration of Modern Neural Networks” (2017); practitioner conventions for reliability diagrams

Classification Checking that a model's confidence matches its empirical accuracy, and tuning thresholds to it.

Intent

Verify that a model’s stated confidence matches how often it is actually right --- and set routing thresholds from the measured relationship --- so that a “0.9” means roughly ninety-percent accuracy rather than an arbitrary number the model emitted.

Motivating Problem

Confidence-based routing assumes the confidence means something. Models are frequently miscalibrated, and usually over-confident --- reporting 0.9 on cases they get right 70 percent of the time. When the score and the accuracy diverge, every threshold set against it routes the wrong cases: the review queue fills with fields that were fine, or errors sail through at a confidence that looked safe. Calibration is the check that makes a confidence score usable, and it has to be re-run whenever the model or prompt changes.

How It Works

Build a validation set: a representative, labeled, held-out sample of inputs with known-correct outputs. This is the ground truth calibration is measured against; it must resemble production traffic, not a convenience sample.

Bucket and measure: group predictions by their confidence score (0.0—0.1, 0.1—0.2, and so on) and, for each bucket, measure the actual accuracy of the predictions in it. A well-calibrated model gets the 0.9 bucket right about 90 percent of the time.

Read the reliability diagram: plot confidence on the x-axis against measured accuracy on the y-axis. The diagonal is perfect calibration; points below the line are over-confidence (the common case), points above are under-confidence. The shape shows where the model can be trusted and where its confidence is inflated.

Set thresholds from the curve, not from intuition: if the model is right 95 percent of the time only above a stated confidence of 0.8, then 0.8 --- not 0.9 --- is the auto-accept cutoff. The diagram turns a vague “high confidence” into a defensible threshold.

Re-calibrate on change: a new model version, a changed prompt, or drift in the inputs can move the curve. Calibration is not a one-time measurement; re-run it whenever the pipeline changes, or the thresholds silently go stale.

When to Use It

Any pipeline that routes, filters, or auto-accepts on a confidence score --- extraction review queues, cascade escalation, guardrail thresholds. Essential wherever a confidence cutoff decides whether a human sees an output. Requires a labeled validation set, which is the main cost; without one, confidence numbers are decoration.

Alternatives --- LLM-as-judge scores (Chapter 4) as an alternative signal to raw model confidence, itself calibrated the same way; conservative blanket review when a validation set cannot be built and the stakes forbid trusting an unverified score.

Sources

arxiv.org/abs/1706.04599 (Guo et al., calibration of modern neural networks)
Chapter 4 (LLM-as-Judge) --- the scoring layer calibration is applied to

Example artifacts

Code.

# Bucket predictions by confidence, then measure accuracy per bucket
from collections import defaultdict

buckets = defaultdict(lambda: [0, 0])  # confidence bin -> [correct, total]
for pred in validation_set:
    b = round(pred.confidence, 1)
    buckets[b][0] += int(pred.value == pred.gold)
    buckets[b][1] += 1

for b in sorted(buckets):
    correct, total = buckets[b]
    print(f"confidence ~{b}: accuracy {correct / total:.2f} over {total}")

Stratified sampling for human review

Source: Survey-sampling methodology adapted to ML quality assurance; practitioner conventions for review queues

Classification Sampling the output stream for review by risk stratum instead of at random.

Intent

Choose which outputs a human reviews by stratifying the stream --- oversampling low-confidence, rare-category, and high-stakes outputs --- rather than sampling at random, so a fixed reviewer budget catches the most errors and the most consequential ones.

Motivating Problem

At production volume, humans cannot review everything, so a sample is reviewed. Random sampling spends the reviewer budget in proportion to volume, which means the common, easy, low-stakes cases dominate the queue and the rare failures --- the ones worth catching --- are almost never seen. A one-percent random sample of a stream that is 99 percent easy is almost entirely easy cases. The sampling strategy, not the reviewers, decides whether errors get caught.

How It Works

Stratify by confidence: oversample the low-confidence band, where errors concentrate, while still drawing a baseline from the confident majority to catch the over-confident mistakes calibration warns about. A floor --- say, everything below 0.7 is reviewed --- plus a small random rate above it is the common shape.

Stratify by category: when outputs fall into categories, a common category can drown a rare one in a random sample. Set per-category quotas so rare classes get scrutiny proportionate to their importance, not their frequency.

Stratify by stakes: weight the sample toward outputs where an error is expensive --- large refunds, medical fields, legal clauses --- so consequence, not just volume, drives review. Combine the dimensions (confidence times stakes, confidence times category) when more than one matters.

Watch the reviewer economics: the point of stratifying is error yield per reviewer-hour. Oversample too aggressively and reviewers see only hard cases and burn out; sample too flat and rare errors slip through. Tune the ratios against the measured catch rate, and log what was sampled so the strata can be re-tuned.

When to Use It

Any human-in-the-loop review queue operating on a stream too large to review in full --- extraction QA, content moderation, agent-output spot-checks. Most valuable when failures are rare and unevenly distributed, which is exactly when random sampling misses them. Pairs with per-field confidence, which supplies the strata, and calibration, which makes the strata trustworthy.

Alternatives --- full review for streams small enough, or stakes high enough, to justify it; pure random sampling when the goal is an unbiased quality estimate rather than catching the most errors (the two goals differ, and a blended sample can serve both).

Sources

Cochran, “Sampling Techniques” (classical reference on stratified sampling)
Chapter 4 and Section C --- the judge and eval layers that produce the scores stratified on

Example artifacts

Code.

# Confidence-stratified review sample: all low-confidence, plus a 2% baseline
review_queue = []
for out in outputs:
    if out.confidence < 0.70:
        review_queue.append(out)         # oversample the risky band
    elif hash(out.id) % 50 == 0:
        review_queue.append(out)         # 2% baseline over the confident majority

Section I — Discovery and curation

Where to track the evaluation and guardrails ecosystem as it evolves

The evaluation and guardrails category moves fast enough that printed catalogs go partially stale within months. Two kinds of resources help: the published benchmark leaderboards (which track model and method performance against standardized tasks) and the community-maintained awesome-X lists (which track which products exist).

Benchmark leaderboards and awesome lists

Source: Various: HELM, MTEB, BIG-bench, awesome-llm-evaluation, awesome-llm-security

Classification Discovery and tracking resources.

Intent

Provide pointers to the current state of LLM evaluation, model performance, and the security ecosystem, updated by communities of practice rather than by single vendors.

Motivating Problem

No printed list of products is current six months after publication. The community resources that track the ecosystem in real time --- leaderboards for benchmarks, awesome-X lists for products, vendor blogs for technique evolution --- are the right places to look for what’s current. The catalog identifies them; the reader checks them when current state matters.

How It Works

Benchmark leaderboards: HELM (crfm.stanford.edu/helm/) for cross-model evaluation, MTEB (huggingface.co/spaces/mteb/leaderboard) for embedding models, Chatbot Arena (chat.lmsys.org) for crowd-sourced human-preference scores. Each is maintained by an academic or community group; each captures different dimensions of model performance.

Awesome lists: awesome-llm-evaluation (curated frameworks and resources), awesome-llm-security (security tools and research), awesome-prompt-engineering (prompt patterns), and many more. The standard awesome-X GitHub-list format, browsable for discovery.

Vendor-maintained resources: Anthropic’s alignment research publications, OpenAI’s evals registry, Microsoft’s Responsible AI documentation. Vendor-specific but often the most operationally current.

When to Use It

Periodic surveys of the current state. Discovery when surveying products for a specific need. Cross-checking that this catalog’s recommendations match the current consensus.

Alternatives --- vendor-specific blogs and documentation. Conference proceedings (NeurIPS, ICLR, ACL) for academic state-of-the-art. Direct evaluation against your own data, which always beats any leaderboard for your specific case.

Sources

crfm.stanford.edu/helm/
huggingface.co/spaces/mteb/leaderboard
github.com/topics/awesome-llm

Appendix A --- Failure Category Reference Table

Cross-reference between the four failure categories (Chapter 2) and their representative defenses.

Category	What it includes	Representative substrates
Quality failures	Hallucination, faithfulness gaps, off-topic, reasoning errors	Evals: DeepEval, Ragas; rubrics; LLM-as-judge
Safety failures	Toxicity, bias, PII leakage, self-harm content	Guardrails: Llama Guard, Guardrails AI; classifier endpoints
Security failures	Prompt injection (direct/indirect), jailbreaks, system prompt leakage	Input filters; red-team: Garak, PyRIT; structured prompts
Behavior failures	Tool misuse, excessive agency, off-policy, resource exhaustion	Tool-call evals; policy enforcers; HITL gates (Vol 7)

Appendix B --- The Eight-Volume Series

This catalog joins the seven prior volumes to form an eight-layer vocabulary for agentic AI. The volumes are independent and the reading order is flexible; the cross-references make whichever path the reader chooses coherent.

Volume 1 --- Patterns of AI Agent Workflows --- the timing of agent runs.
Volume 2 --- The Claude Skills Catalog --- model instructions in packaged form.
Volume 3 --- The AI Agent Tools Catalog --- the function-calling primitives.
Volume 4 --- The AI Agent Events & Triggers Catalog --- the activation layer.
Volume 5 --- The AI Agent Fabric Catalog --- the infrastructure substrate.
Volume 6 --- The AI Agent Memory Catalog --- the state and context layer.
Volume 7 --- The Human-in-the-Loop Catalog --- the human-agent interaction layer.
Volume 8 --- The Evaluation & Guardrails Catalog (this volume) --- the governance layer.

Seven of the eight volumes describe components of an agent system: patterns, skills, tools, events, fabric, memory, human interaction. The eighth volume is different in shape --- it wraps around the other seven rather than slotting beneath them. Evaluation and guardrails apply to every layer: evaluate the pattern’s correctness, validate the skill’s outputs, guard the tool’s usage, audit the event’s authorization, monitor the fabric’s behavior, test the memory’s integrity, score the human-agent interaction’s quality. The governance layer is the cross-cutting concern that makes the rest of the system trustworthy.

Eight layers, eight volumes. Read top-down for the agent designer’s sequence; read bottom-up for the operator’s sequence; read the governance layer (this volume) across all of them at any time. The framework is intended to be useful from multiple directions for different practitioners; the cross-references support each path. A working architect makes choices at all eight layers; the choices interact; the interactions are what makes a production agent system both functional and responsible.

Appendix C --- The Eight Eval & Guardrails Anti-Patterns

Eight recurring mistakes that distinguish thoughtful governance designs from improvised ones. Avoiding these is most of the practical wisdom in the field:

Conflating evals and guardrails. They run at different times, catch different failures, use different tools, and require different teams to own them. A design that has one and skips the other ships with a known gap.
Treating system prompts as security controls. OWASP 2025 is explicit: “System prompts are not security controls.” Anything in the prompt is reachable by an attacker; anything that must hold must be enforced outside the model in deterministic code.
Single-layer defense. Any single layer has a non-trivial false-negative rate. Defense in depth is the only working approach. Designs that rely on one layer (“we have a content filter”) ship with adversarial vulnerabilities the operator never anticipates.
LLM-as-judge without validation. Unvalidated judges are confident liars. Position bias, length bias, self-preference, and verbosity capture are documented and correctable; an eval that doesn’t correct them produces optimistic noise rather than useful signal.
Skipping red-team testing. Evals catch the regressions a careful engineer would notice; red teams catch the attacks an engineer wouldn’t imagine. A system that ships without red-team testing has not been tested against the threats it will actually face.
Building evaluation last. Evaluation built after a system is in production is significantly more expensive than evaluation built into development from the start. The system’s shape adapts to its testing story; retrofitting tests against a shape that didn’t anticipate them produces poor coverage and high friction.
Treating the safety taxonomy as fixed. New attack patterns and failure modes emerge regularly. A governance program that captures “the attacks we know” and stops looking will miss the attacks coming next. Continuous red-teaming and engagement with the OWASP and ATLAS communities is the maintenance cost.
Optimizing for benchmark scores at the expense of production behavior. Standardized benchmarks (HELM, OpenAI Evals registry, leaderboards) are useful starting points and dangerous targets. A model trained or tuned to maximize benchmark scores often regresses in production behavior the benchmarks don’t measure. The right evaluation is against the team’s actual workload.

Appendix D --- Discovery and Maturity

Resources for tracking the evaluation and guardrails ecosystem as it evolves:

OWASP GenAI Security Project (genai.owasp.org) --- the canonical threat catalog, updated annually.
NIST AI Risk Management Framework (nist.gov/itl/ai-risk-management-framework) --- the governance documentation backbone.
MITRE ATLAS (atlas.mitre.org) --- the adversarial threat taxonomy with case studies.
HELM leaderboard (crfm.stanford.edu/helm/) --- cross-model evaluation with academic rigor.
MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) --- embedding model evaluation.
Mem0 memory-benchmarks (github.com/mem0ai/memory-benchmarks) --- memory-layer evaluation (cross-reference to Volume 6).
Vendor responsible-AI publications: Anthropic’s alignment research, OpenAI’s safety publications, Microsoft’s Responsible AI Standard, NVIDIA’s NeMo Guardrails documentation.

Two pragmatic recommendations. First, build for governance maturity, not governance perfection. A program that starts with three working defenses and adds more over time outperforms a program that plans ten defenses and ships none. Defense in depth requires multiple layers, but the right number to start with is the number you can actually run today. Second, treat governance as a continuous discipline, not a launch checklist. The threats evolve, the products evolve, the team’s deployment shape evolves. The governance program that worked last year is not necessarily the governance program that works this year; periodic review and update is the maintenance cost of running AI systems responsibly.

Appendix E --- Omissions

This catalog covers about 18 substrates across 8 sections. The wider ecosystem is significantly larger; a non-exhaustive list of what isn’t here:

General software testing frameworks (pytest, JUnit, Jest) when not specifically adapted for LLM evaluation. They’re foundational but covered elsewhere.
Cloud-vendor managed evaluation products beyond brief mention (AWS Bedrock evaluation, Azure AI Foundry evaluation). They follow the patterns this catalog covers.
Domain-specific compliance frameworks (HIPAA, GDPR, PCI-DSS) when used outside the AI-specific governance context. Important but covered in their own literatures.
Closed enterprise security products (Splunk, Crowdstrike, SentinelOne) when not extended with LLM-aware capabilities.
Model alignment and RLHF infrastructure --- these shape model behavior at training time, distinct from the deployment-time evaluation and guardrails covered here.
Academic benchmarks beyond HELM (BIG-bench, MMLU, BBH, HumanEval, and many more). They’re relevant but live in their own resources; OpenAI Evals and HELM are the entry points for finding them.

Appendix F --- A Note on the Moving Target and the Series Close

OWASP’s 2025 LLM Top 10 added two new categories and significantly expanded Excessive Agency, reflecting how agent architectures have evolved beyond the chatbot pattern. NIST’s Generative AI Profile (NIST AI 600-1) landed in July 2024 as the enterprise governance framework for the GenAI era. MITRE ATLAS continues to absorb new adversarial techniques as they’re documented. The tooling --- NeMo Guardrails, Guardrails AI, DeepEval, Ragas, Llama Guard, Garak, PyRIT --- evolves on quarterly cycles. The category is mature enough that the working patterns are clear and young enough that the products will look different in twelve months.

The deepest structural facts are stable. Evaluation and guardrails are different defenses. The failure space has four categories --- quality, safety, security, behavior --- each needing different detection tools. Defense in depth is the only working approach; no single layer is sufficient. LLM-as-judge pervades both evaluation and guardrails, with documented biases that require validation. The OWASP Top 10 is the canonical threat model and the right starting checklist. An architect who internalizes these structural facts can map any new product onto the framework in minutes; an architect who learns only the products has to relearn the field every year.

Eight volumes complete. Patterns, Skills, Tools, Events, Fabric, Memory, Human-in-the-Loop, Evaluation & Guardrails. A working architect now has the vocabulary for the entire stack: how runs compose, what model instructions are loaded, which tools the agent calls, what events trigger the agent, what fabric it runs on, what state and memory it operates against, how humans approve and observe and interact with it, and how the whole system is tested and defended and governed. The vocabulary is intended to outlive any specific product; the products implement the vocabulary; the products change; the vocabulary doesn’t. Internalize the conceptual structure and the technical landscape becomes legible regardless of which products dominate this quarter.

--- End of The Evaluation & Guardrails Catalog v0.1 ---