Skip to main contentSkip to navigation
llm
guardrails
ai-security
testing

Evaluating Guardrail Frameworks for LLM Applications

A practical comparison of NeMo Guardrails, Guardrails AI, and custom implementations—benchmarks, trade-offs, and when to use each.

Publié February 6, 2026

6 min de lecture

Every guardrail framework promises safety. Few deliver it consistently. Before you bet your production system on a framework, you need to know what it actually catches—and what it misses under load, under adversarial pressure, and in the edge cases that matter most.

After deploying guardrails across multiple production systems, here's what I've learned about the three main approaches: NeMo Guardrails, Guardrails AI, and custom implementations.

The Trade-Off Triangle

Every guardrail solution balances three things:

  • Coverage: How many attack types does it catch?
  • Latency: How much does it slow down each request?
  • Flexibility: How easily can you customize it for your domain?

No framework wins on all three. The right choice depends on which trade-off your application can tolerate.

NeMo Guardrails

NVIDIA's NeMo Guardrails uses Colang, a domain-specific language for defining conversational flows and safety rules. It intercepts the conversation at multiple points: before the LLM call, after it, and during dialog management.

Setup

from nemoguardrails import RailsConfig, LLMRails
 
# config/config.yml defines model, rails, and general settings
# config/rails/ contains Colang files with flow definitions
 
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
 
# Usage
response = await rails.generate(
    messages=[{"role": "user", "content": user_input}]
)

The power of NeMo is in the Colang definitions:

define user ask about competitors
"What do you think about [competitor]?"
"How do you compare to [competitor]?"
"Is [competitor] better?"
 
define bot refuse competitor discussion
"I'm focused on helping you with our products. I'd be happy to answer questions about our features."
 
define flow
user ask about competitors
bot refuse competitor discussion

Where It Shines

  • Topical control: The dialog management layer is excellent at keeping conversations on-topic. If your chatbot should only discuss your product, Colang flows handle this naturally.
  • Built-in jailbreak resistance: Ships with pre-built rails for common jailbreak patterns.
  • Fact-checking integration: Can be configured to verify claims against a knowledge base before responding.

Where It Breaks

  • Latency overhead: Each rail adds a round-trip to the LLM for intent classification. In my benchmarks, a three-rail setup adds 800-1,200ms to every request. For real-time chat, that's noticeable.
  • Colang learning curve: Your security team needs to learn a new language. Debugging Colang flows when they don't fire correctly is frustrating—the tooling is still immature.
  • Prompt injection blind spot: NeMo is strong at topical control but weaker at detecting sophisticated prompt injection in user inputs. It relies on dialog flow matching, which adversarial inputs can evade.

Guardrails AI

Guardrails AI takes a different approach: composable validators that inspect inputs and outputs as a pipeline. It feels like Pydantic for LLM safety.

Setup

from guardrails import Guard
from guardrails.hub import (
    DetectPII,
    ToxicLanguage,
    DetectPromptInjection,
    RestrictToTopic,
    CompetitorCheck,
)
 
# Compose validators into a guard
guard = Guard().use_many(
    DetectPromptInjection(on_fail="exception"),
    ToxicLanguage(threshold=0.8, on_fail="filter"),
    DetectPII(
        pii_entities=["EMAIL", "PHONE", "SSN", "CREDIT_CARD"],
        on_fail="fix"
    ),
    RestrictToTopic(
        valid_topics=["product support", "billing", "technical help"],
        on_fail="refrain"
    ),
)
 
# Usage
result = guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4",
    messages=[{"role": "user", "content": user_input}],
)

Where It Shines

  • Composability: Mix and match validators from the hub or write your own. Adding a new check is one line of code.
  • Python-native: No new language to learn. Validators are Python classes with a validate method. Your team can write custom validators in an afternoon.
  • Structured output enforcement: Excellent at ensuring LLM outputs conform to a schema. If you need the model to return valid JSON with specific fields, Guardrails AI handles this natively.
  • Granular failure modes: Each validator can block, filter, fix, or refrain independently. PII gets redacted while injection attempts get blocked—in the same pipeline.

Where It Breaks

  • No dialog management: Unlike NeMo, there's no concept of conversation flow. Each request is validated independently. Multi-turn attacks that build context across messages can slip through.
  • Validator quality varies: Hub validators range from excellent to experimental. Always benchmark before trusting one in production.
  • Output validation latency: Some validators (like RestrictToTopic) make their own LLM calls, adding latency on top of your primary model call.

Building Custom Guardrails

Sometimes neither framework fits. Custom guardrails make sense when:

  • Your domain has safety requirements that no validator covers (medical, legal, financial)
  • Latency is critical and you can't afford the overhead of framework internals
  • You need fine-grained control over exactly how inputs and outputs are processed

Architecture Pattern

from dataclasses import dataclass
from typing import Optional
from abc import ABC, abstractmethod
 
@dataclass
class GuardResult:
    passed: bool
    output: str
    triggered_rules: list[str]
    modified: bool  # True if output was filtered/redacted
 
class BaseGuardrail(ABC):
    @abstractmethod
    async def check(self, text: str, context: dict) -> GuardResult:
        pass
 
class CustomGuardrailPipeline:
    """Lightweight guardrail pipeline with minimal overhead."""
 
    def __init__(self):
        self.input_rails: list[BaseGuardrail] = []
        self.output_rails: list[BaseGuardrail] = []
 
    def add_input_rail(self, rail: BaseGuardrail):
        self.input_rails.append(rail)
 
    def add_output_rail(self, rail: BaseGuardrail):
        self.output_rails.append(rail)
 
    async def check_input(self, user_input: str, context: dict) -> GuardResult:
        for rail in self.input_rails:
            result = await rail.check(user_input, context)
            if not result.passed:
                return result
        return GuardResult(passed=True, output=user_input, triggered_rules=[], modified=False)
 
    async def check_output(self, response: str, context: dict) -> GuardResult:
        current = response
        triggered = []
        modified = False
 
        for rail in self.output_rails:
            result = await rail.check(current, context)
            if not result.passed:
                return result
            if result.modified:
                current = result.output
                modified = True
            triggered.extend(result.triggered_rules)
 
        return GuardResult(
            passed=True, output=current,
            triggered_rules=triggered, modified=modified
        )

Custom Classifier Example

For domain-specific checks, fine-tuned classifiers often outperform general-purpose validators:

class DomainInjectionDetector(BaseGuardrail):
    """Injection detector trained on domain-specific attack patterns."""
 
    def __init__(self, model_path: str, threshold: float = 0.85):
        self.classifier = load_classifier(model_path)
        self.threshold = threshold
 
    async def check(self, text: str, context: dict) -> GuardResult:
        score = self.classifier.predict_proba(text)[1]  # injection probability
        if score > self.threshold:
            return GuardResult(
                passed=False,
                output="I can't process that request.",
                triggered_rules=[f"domain_injection_score={score:.3f}"],
                modified=False,
            )
        return GuardResult(passed=True, output=text, triggered_rules=[], modified=False)

When Custom Is Worth It

Custom guardrails require ongoing maintenance—new attack patterns, model updates, and false positive tuning. Budget for this. A framework handles maintenance for you; custom code puts that burden on your team. If you have a dedicated ML security team, custom wins on performance and coverage. If you don't, start with a framework and add custom rails only where frameworks fall short.

Head-to-Head Comparison

I ran each approach against a test suite of 500 prompt injection payloads, 200 jailbreak attempts, and 300 PII extraction probes. Here's how they compared:

@dataclass
class BenchmarkResult:
    framework: str
    injection_detection_rate: float
    jailbreak_detection_rate: float
    pii_detection_rate: float
    false_positive_rate: float
    p50_latency_ms: float
    p95_latency_ms: float
 
class GuardrailBenchmark:
    """Benchmark runner for comparing guardrail frameworks."""
 
    def __init__(self, test_suites: dict[str, list[str]]):
        self.suites = test_suites
 
    async def run(self, framework_fn, framework_name: str) -> BenchmarkResult:
        results = {"injection": [], "jailbreak": [], "pii": [], "benign": []}
        latencies = []
 
        for category, payloads in self.suites.items():
            for payload in payloads:
                start = time.monotonic()
                blocked = await framework_fn(payload)
                latencies.append((time.monotonic() - start) * 1000)
                results[category].append(blocked)
 
        return BenchmarkResult(
            framework=framework_name,
            injection_detection_rate=sum(results["injection"]) / len(results["injection"]),
            jailbreak_detection_rate=sum(results["jailbreak"]) / len(results["jailbreak"]),
            pii_detection_rate=sum(results["pii"]) / len(results["pii"]),
            false_positive_rate=sum(results["benign"]) / len(results["benign"]),
            p50_latency_ms=sorted(latencies)[len(latencies) // 2],
            p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
        )
MetricNeMo GuardrailsGuardrails AICustom (tuned classifier)
Injection detection71%84%93%
Jailbreak detection82%68%87%
PII detection76%92%89%
False positive rate8%5%3%
p50 latency overhead940ms320ms45ms
p95 latency overhead1,850ms890ms120ms

Key takeaway: NeMo's latency overhead is significant but its jailbreak detection is strong. Guardrails AI has the best PII detection and lowest false positives among frameworks. Custom classifiers win on latency and overall detection but require engineering investment.

Decision Guidance

Choose NeMo Guardrails when:

  • Your primary concern is keeping conversations on-topic
  • You need dialog flow control (multi-turn safety)
  • Latency budget is generous (>2 seconds per request is acceptable)

Choose Guardrails AI when:

  • You need composable, mix-and-match safety validators
  • PII detection and structured output validation are priorities
  • Your team is Python-native and wants fast iteration

Choose custom when:

  • Latency is critical (under 100ms overhead budget)
  • You have domain-specific attack patterns that frameworks don't cover
  • You have a team that can maintain classifiers and update detection models

Best practice: Layer them. Use Guardrails AI for input/output validation (fast, composable) and add custom classifiers for domain-specific threats. NeMo can sit on top for dialog management if you need multi-turn control and can absorb the latency.

Conclusion

  1. No single framework catches everything—even the best detection rates leave gaps. Layer your guardrails to cover each other's blind spots.
  2. Benchmark against your own attack surface, not generic test suites. The injection patterns targeting a customer service bot are different from those targeting a code assistant.
  3. Latency matters more than you think. An 800ms overhead per request changes the user experience. Measure it before committing to a framework.
  4. Custom guardrails fill gaps, but treat them like any security control—they need monitoring, updates, and regular testing to stay effective.

For a deeper look at how prompt injection attacks work and how to build systematic red team testing for your guardrails, those articles cover the fundamentals this piece builds on.

Evaluating Guardrail Frameworks for LLM Applications | Musah Abdulai