Evaluating Guardrail Frameworks for LLM Applications
A practical comparison of NeMo Guardrails, Guardrails AI, and custom implementations—benchmarks, trade-offs, and when to use each.
Every guardrail framework promises safety. Few deliver it consistently. Before you bet your production system on a framework, you need to know what it actually catches—and what it misses under load, under adversarial pressure, and in the edge cases that matter most.
After deploying guardrails across multiple production systems, here's what I've learned about the three main approaches: NeMo Guardrails, Guardrails AI, and custom implementations.
The Trade-Off Triangle
Every guardrail solution balances three things:
- Coverage: How many attack types does it catch?
- Latency: How much does it slow down each request?
- Flexibility: How easily can you customize it for your domain?
No framework wins on all three. The right choice depends on which trade-off your application can tolerate.
NeMo Guardrails
NVIDIA's NeMo Guardrails uses Colang, a domain-specific language for defining conversational flows and safety rules. It intercepts the conversation at multiple points: before the LLM call, after it, and during dialog management.
Setup

    from nemoguardrails import RailsConfig, LLMRails

    # config/config.yml defines model, rails, and general settings
    # config/rails/ contains Colang files with flow definitions
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    # Usage (inside an async function): generate_async is the awaitable
    # counterpart of the synchronous rails.generate
    response = await rails.generate_async(
        messages=[{"role": "user", "content": user_input}]
    )

The power of NeMo is in the Colang definitions:

    define user ask about competitors
      "What do you think about [competitor]?"
      "How do you compare to [competitor]?"
      "Is [competitor] better?"

    define bot refuse competitor discussion
      "I'm focused on helping you with our products. I'd be happy to answer questions about our features."

    define flow
      user ask about competitors
      bot refuse competitor discussion

Where It Shines
- Topical control: The dialog management layer is excellent at keeping conversations on-topic. If your chatbot should only discuss your product, Colang flows handle this naturally.
- Built-in jailbreak resistance: Ships with pre-built rails for common jailbreak patterns.
- Fact-checking integration: Can be configured to verify claims against a knowledge base before responding.
Where It Breaks
- Latency overhead: Each rail adds a round-trip to the LLM for intent classification. In my benchmarks, a three-rail setup adds 800-1,200ms to every request. For real-time chat, that's noticeable.
- Colang learning curve: Your security team needs to learn a new language. Debugging Colang flows when they don't fire correctly is frustrating—the tooling is still immature.
- Prompt injection blind spot: NeMo is strong at topical control but weaker at detecting sophisticated prompt injection in user inputs. It relies on dialog flow matching, which adversarial inputs can evade.
Guardrails AI
Guardrails AI takes a different approach: composable validators that inspect inputs and outputs as a pipeline. It feels like Pydantic for LLM safety.
Setup

    import openai
    from guardrails import Guard
    from guardrails.hub import (
        DetectPII,
        ToxicLanguage,
        DetectPromptInjection,
        RestrictToTopic,
        CompetitorCheck,
    )

    # Compose validators into a guard
    guard = Guard().use_many(
        DetectPromptInjection(on_fail="exception"),
        ToxicLanguage(threshold=0.8, on_fail="filter"),
        DetectPII(
            # Entity names assume the validator's Presidio-based entity types
            pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"],
            on_fail="fix"
        ),
        RestrictToTopic(
            valid_topics=["product support", "billing", "technical help"],
            on_fail="refrain"
        ),
    )

    # Usage
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-4",
        messages=[{"role": "user", "content": user_input}],
    )

Where It Shines
- Composability: Mix and match validators from the hub or write your own. Adding a new check is one line of code.
- Python-native: No new language to learn. Validators are Python classes with a validate method. Your team can write custom validators in an afternoon.
- Structured output enforcement: Excellent at ensuring LLM outputs conform to a schema. If you need the model to return valid JSON with specific fields, Guardrails AI handles this natively.
- Granular failure modes: Each validator can block, filter, fix, or refrain independently. PII gets redacted while injection attempts get blocked—in the same pipeline.
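The per-validator failure modes can be pictured with a small, framework-free sketch. This is not Guardrails AI's implementation or API: the `Check` class, the `run_checks` function, and the toy detectors are hypothetical stand-ins that only mirror the idea of each check choosing its own reaction when it fails.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical illustration only: this mimics the *idea* of per-validator
# on_fail policies and does not use or reproduce the Guardrails AI API.

EMAIL_PATTERN = r"[\w.+-]+@[\w-]+\.[\w.-]+"


class Blocked(Exception):
    """Raised when a check with on_fail="exception" fails."""


@dataclass
class Check:
    name: str
    # Returns (failed, repaired_text); repaired_text is only used for "fix".
    detect: Callable[[str], tuple[bool, str]]
    on_fail: str  # "exception" (block the request) or "fix" (repair and continue)


def redact_email(text: str) -> tuple[bool, str]:
    redacted = re.sub(EMAIL_PATTERN, "<EMAIL>", text)
    return redacted != text, redacted


def detect_override(text: str) -> tuple[bool, str]:
    # Toy heuristic; a real injection detector would be a trained model.
    return "ignore previous instructions" in text.lower(), text


def run_checks(text: str, checks: list[Check]) -> str:
    for check in checks:
        failed, repaired = check.detect(text)
        if not failed:
            continue
        if check.on_fail == "exception":
            raise Blocked(check.name)
        elif check.on_fail == "fix":
            text = repaired  # keep going with the repaired text
        else:
            raise ValueError(f"unknown on_fail policy: {check.on_fail}")
    return text


checks = [
    Check("injection", detect_override, on_fail="exception"),
    Check("pii", redact_email, on_fail="fix"),
]
```

With these two checks, a message containing an email address comes back redacted, while a message containing the override phrase raises `Blocked` before any later check runs.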
Where It Breaks
- No dialog management: Unlike NeMo, there's no concept of conversation flow. Each request is validated independently. Multi-turn attacks that build context across messages can slip through.
- Validator quality varies: Hub validators range from excellent to experimental. Always benchmark before trusting one in production.
- Output validation latency: Some validators (like RestrictToTopic) make their own LLM calls, adding latency on top of your primary model call.
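The multi-turn gap comes from validating each message in isolation. One possible mitigation, sketched here under stated assumptions, is to keep a running risk score per conversation. `ConversationRiskTracker`, its decay and threshold values, and the per-message scores are all hypothetical; the score for each message would come from whatever validator you already run.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: per-message scores are assumed to come from an
# existing validator; the decay and threshold values are illustrative.


@dataclass
class ConversationRiskTracker:
    threshold: float = 1.0  # block once accumulated risk reaches this
    decay: float = 0.8      # older turns count less than recent ones
    _risk: dict[str, float] = field(default_factory=dict)

    def observe(self, conversation_id: str, score: float) -> bool:
        """Record one message's risk score and return True if the
        conversation as a whole should now be blocked."""
        previous = self._risk.get(conversation_id, 0.0)
        current = previous * self.decay + score
        self._risk[conversation_id] = current
        return current >= self.threshold


tracker = ConversationRiskTracker()
# Every message scores 0.4, which looks harmless on its own, but the
# accumulated score crosses the conversation threshold on the fourth turn.
decisions = [tracker.observe("conv-1", 0.4) for _ in range(4)]
```

The decay keeps a single borderline message from haunting a long, otherwise benign conversation, while a steady drip of borderline messages still adds up.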
Building Custom Guardrails
Sometimes neither framework fits. Custom guardrails make sense when:
- Your domain has safety requirements that no validator covers (medical, legal, financial)
- Latency is critical and you can't afford the overhead of framework internals
- You need fine-grained control over exactly how inputs and outputs are processed
Architecture Pattern

    from dataclasses import dataclass
    from abc import ABC, abstractmethod


    @dataclass
    class GuardResult:
        passed: bool
        output: str
        triggered_rules: list[str]
        modified: bool  # True if output was filtered/redacted


    class BaseGuardrail(ABC):
        @abstractmethod
        async def check(self, text: str, context: dict) -> GuardResult:
            pass


    class CustomGuardrailPipeline:
        """Lightweight guardrail pipeline with minimal overhead."""

        def __init__(self):
            self.input_rails: list[BaseGuardrail] = []
            self.output_rails: list[BaseGuardrail] = []

        def add_input_rail(self, rail: BaseGuardrail):
            self.input_rails.append(rail)

        def add_output_rail(self, rail: BaseGuardrail):
            self.output_rails.append(rail)

        async def check_input(self, user_input: str, context: dict) -> GuardResult:
            # Stop at the first input rail that rejects the message
            for rail in self.input_rails:
                result = await rail.check(user_input, context)
                if not result.passed:
                    return result
            return GuardResult(passed=True, output=user_input, triggered_rules=[], modified=False)

        async def check_output(self, response: str, context: dict) -> GuardResult:
            current = response
            triggered = []
            modified = False
            for rail in self.output_rails:
                result = await rail.check(current, context)
                if not result.passed:
                    return result
                # Carry redactions forward so later rails see the filtered text
                if result.modified:
                    current = result.output
                    modified = True
                triggered.extend(result.triggered_rules)
            return GuardResult(
                passed=True, output=current,
                triggered_rules=triggered, modified=modified
            )

Custom Classifier Example
For domain-specific checks, fine-tuned classifiers often outperform general-purpose validators:

    class DomainInjectionDetector(BaseGuardrail):
        """Injection detector trained on domain-specific attack patterns."""

        def __init__(self, model_path: str, threshold: float = 0.85):
            # load_classifier is a placeholder for your own model-loading helper
            self.classifier = load_classifier(model_path)
            self.threshold = threshold

        async def check(self, text: str, context: dict) -> GuardResult:
            # Assumes predict_proba accepts a single string and returns
            # [p_benign, p_injection]; with a scikit-learn-style batch API
            # this would be predict_proba([text])[0][1] instead
            score = self.classifier.predict_proba(text)[1]  # injection probability
            if score > self.threshold:
                return GuardResult(
                    passed=False,
                    output="I can't process that request.",
                    triggered_rules=[f"domain_injection_score={score:.3f}"],
                    modified=False,
                )
            return GuardResult(passed=True, output=text, triggered_rules=[], modified=False)

When Custom Is Worth It
Custom guardrails require ongoing maintenance—new attack patterns, model updates, and false positive tuning. Budget for this. A framework handles maintenance for you; custom code puts that burden on your team. If you have a dedicated ML security team, custom wins on performance and coverage. If you don't, start with a framework and add custom rails only where frameworks fall short.
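False positive tuning mostly comes down to choosing the classifier threshold. A minimal sketch of that sweep follows, assuming you have classifier scores for a labelled set of attack and benign prompts. The `sweep_thresholds` helper and the sample scores are made up for illustration and are not results from any benchmark in this article.

```python
def sweep_thresholds(attack_scores, benign_scores, thresholds):
    """For each threshold, return (threshold, detection_rate, false_positive_rate).

    A prompt is flagged when its score is strictly greater than the
    threshold, the same comparison the detector's check method uses.
    """
    rows = []
    for t in thresholds:
        detected = sum(s > t for s in attack_scores) / len(attack_scores)
        false_pos = sum(s > t for s in benign_scores) / len(benign_scores)
        rows.append((t, detected, false_pos))
    return rows


# Illustrative scores only, not real benchmark data.
attack_scores = [0.99, 0.95, 0.90, 0.70]
benign_scores = [0.05, 0.10, 0.60, 0.88]

for t, detected, false_pos in sweep_thresholds(
    attack_scores, benign_scores, [0.5, 0.85, 0.97]
):
    print(f"threshold={t:.2f} detection={detected:.0%} false_positives={false_pos:.0%}")
```

Raising the threshold trades detection for fewer false positives, so rerunning a sweep like this after every model update is part of the maintenance budget.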
Head-to-Head Comparison
I ran each approach against a test suite of 500 prompt injection payloads, 200 jailbreak attempts, and 300 PII extraction probes. Here's how they compared:

    import time
    from dataclasses import dataclass


    @dataclass
    class BenchmarkResult:
        framework: str
        injection_detection_rate: float
        jailbreak_detection_rate: float
        pii_detection_rate: float
        false_positive_rate: float
        p50_latency_ms: float
        p95_latency_ms: float


    class GuardrailBenchmark:
        """Benchmark runner for comparing guardrail frameworks."""

        def __init__(self, test_suites: dict[str, list[str]]):
            # Expects the keys "injection", "jailbreak", "pii", and "benign"
            self.suites = test_suites

        async def run(self, framework_fn, framework_name: str) -> BenchmarkResult:
            results = {"injection": [], "jailbreak": [], "pii": [], "benign": []}
            latencies = []
            for category, payloads in self.suites.items():
                for payload in payloads:
                    start = time.monotonic()
                    # framework_fn returns True when the payload was blocked
                    blocked = await framework_fn(payload)
                    latencies.append((time.monotonic() - start) * 1000)
                    results[category].append(blocked)
            return BenchmarkResult(
                framework=framework_name,
                injection_detection_rate=sum(results["injection"]) / len(results["injection"]),
                jailbreak_detection_rate=sum(results["jailbreak"]) / len(results["jailbreak"]),
                pii_detection_rate=sum(results["pii"]) / len(results["pii"]),
                # A blocked benign prompt counts as a false positive
                false_positive_rate=sum(results["benign"]) / len(results["benign"]),
                p50_latency_ms=sorted(latencies)[len(latencies) // 2],
                p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
            )

| Metric | NeMo Guardrails | Guardrails AI | Custom (tuned classifier) |
|---|---|---|---|
| Injection detection | 71% | 84% | 93% |
| Jailbreak detection | 82% | 68% | 87% |
| PII detection | 76% | 92% | 89% |
| False positive rate | 8% | 5% | 3% |
| p50 latency overhead | 940ms | 320ms | 45ms |
| p95 latency overhead | 1,850ms | 890ms | 120ms |
Key takeaway: NeMo's latency overhead is significant, but its jailbreak detection is strong. Guardrails AI has the best PII detection and the lower false positive rate of the two frameworks. Custom classifiers win on latency, injection detection, and jailbreak detection, but require engineering investment.
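One way to read the table as a single number is to weight each category by the size of its test suite (500, 200, and 300 payloads). The short calculation below does that with the table's percentages. The `overall_detection` helper is only an illustration, and other weightings would be equally defensible.

```python
# Suite sizes from the benchmark description and detection rates (in percent)
# copied from the comparison table.
SUITE_SIZES = {"injection": 500, "jailbreak": 200, "pii": 300}

DETECTION_PCT = {
    "NeMo Guardrails": {"injection": 71, "jailbreak": 82, "pii": 76},
    "Guardrails AI": {"injection": 84, "jailbreak": 68, "pii": 92},
    "Custom (tuned classifier)": {"injection": 93, "jailbreak": 87, "pii": 89},
}


def overall_detection(rates_pct: dict[str, int]) -> float:
    """Payload-weighted average detection rate, in percent."""
    total = sum(SUITE_SIZES.values())
    caught = sum(rates_pct[cat] * size for cat, size in SUITE_SIZES.items())
    return caught / total


for name, rates in DETECTION_PCT.items():
    print(f"{name}: {overall_detection(rates):.1f}%")
# NeMo Guardrails: 74.7%
# Guardrails AI: 83.2%
# Custom (tuned classifier): 90.6%
```

Because the injection suite is the largest, this weighting favors whichever approach handles injection best, which is worth keeping in mind if your own traffic has a different mix.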
Decision Guidance
Choose NeMo Guardrails when:
- Your primary concern is keeping conversations on-topic
- You need dialog flow control (multi-turn safety)
- Latency budget is generous (>2 seconds per request is acceptable)
Choose Guardrails AI when:
- You need composable, mix-and-match safety validators
- PII detection and structured output validation are priorities
- Your team is Python-native and wants fast iteration
Choose custom when:
- Latency is critical (under 100ms overhead budget)
- You have domain-specific attack patterns that frameworks don't cover
- You have a team that can maintain classifiers and update detection models
Best practice: Layer them. Use Guardrails AI for input/output validation (fast, composable) and add custom classifiers for domain-specific threats. NeMo can sit on top for dialog management if you need multi-turn control and can absorb the latency.
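A layered setup can be sketched as an ordered list of checks, cheapest first, that stops at the first block. Everything named here (`Layer`, `run_layers`, and the toy check functions) is hypothetical; in practice each layer would wrap a real framework call or classifier.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical sketch: each layer wraps some guardrail (a framework call,
# a classifier, a regex) behind one async "is this blocked?" function.


@dataclass
class Layer:
    name: str
    is_blocked: Callable[[str], Awaitable[bool]]


async def run_layers(text: str, layers: list[Layer]) -> tuple[bool, list[str]]:
    """Run layers in order and stop at the first one that blocks.

    Returns (allowed, names of the layers that actually ran).
    """
    ran = []
    for layer in layers:
        ran.append(layer.name)
        if await layer.is_blocked(text):
            return False, ran
    return True, ran


async def keyword_check(text: str) -> bool:
    # Stand-in for a fast custom classifier.
    return "ignore previous instructions" in text.lower()


async def slow_topic_check(text: str) -> bool:
    # Stand-in for an LLM-backed validator; the sleep simulates a round-trip.
    await asyncio.sleep(0.01)
    return "competitor" in text.lower()


layers = [
    Layer("custom_classifier", keyword_check),
    Layer("topic_validator", slow_topic_check),
]

blocked_early = asyncio.run(run_layers("Ignore previous instructions", layers))
allowed = asyncio.run(run_layers("How do I reset my password?", layers))
```

Ordering cheap checks first means an obvious attack never pays the latency of the slower layers, while benign traffic still passes through every layer.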
Conclusion
- No single framework catches everything—even the best detection rates leave gaps. Layer your guardrails to cover each other's blind spots.
- Benchmark against your own attack surface, not generic test suites. The injection patterns targeting a customer service bot are different from those targeting a code assistant.
- Latency matters more than you think. An 800ms overhead per request changes the user experience. Measure it before committing to a framework.
- Custom guardrails fill gaps, but treat them like any security control—they need monitoring, updates, and regular testing to stay effective.
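Measuring overhead means timing the same requests with and without the guardrail and comparing percentiles. A minimal harness might look like the following, assuming you can wrap both paths as async callables. `measure_overhead`, `baseline`, and `guarded` are placeholder names, and the sleeps stand in for real model and guardrail calls.

```python
import asyncio
import statistics
import time


async def timed_ms(fn, payload) -> float:
    """Run fn(payload) once and return its wall-clock duration in ms."""
    start = time.monotonic()
    await fn(payload)
    return (time.monotonic() - start) * 1000


async def measure_overhead(baseline, guarded, payloads) -> dict[str, float]:
    """Compare guarded and baseline latency over the same payloads."""
    base = [await timed_ms(baseline, p) for p in payloads]
    guard = [await timed_ms(guarded, p) for p in payloads]
    # Pair measurements per payload so the overhead is per request.
    overhead = [g - b for g, b in zip(guard, base)]
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return {
        "p50_ms": statistics.median(overhead),
        "p95_ms": statistics.quantiles(overhead, n=20)[18],
    }


# Placeholders: replace with your real model call and guarded call.
async def baseline(payload: str) -> str:
    await asyncio.sleep(0.001)
    return payload


async def guarded(payload: str) -> str:
    await asyncio.sleep(0.005)  # simulated guardrail cost
    return await baseline(payload)


report = asyncio.run(measure_overhead(baseline, guarded, ["hi"] * 20))
```

Run it against a sample of real traffic rather than a handful of synthetic prompts, since guardrail latency often depends on input length.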
For a deeper look at how prompt injection attacks work and how to build systematic red team testing for your guardrails, see the companion articles on those topics, which cover the fundamentals this piece builds on.