Publicado February 6, 2026
6 min de lectura
Every guardrail framework promises safety. Few deliver it consistently. Before you bet your production system on a framework, you need to know what it actually catches—and what it misses under load, under adversarial pressure, and in the edge cases that matter most.
After deploying guardrails across multiple production systems, here's what I've learned about the three main approaches: NeMo Guardrails, Guardrails AI, and custom implementations.
Every guardrail solution balances three things:
No framework wins on all three. The right choice depends on which trade-off your application can tolerate.
NVIDIA's NeMo Guardrails uses Colang, a domain-specific language for defining conversational flows and safety rules. It intercepts the conversation at multiple points: before the LLM call, after it, and during dialog management.
from nemoguardrails import RailsConfig, LLMRails
# config/config.yml defines model, rails, and general settings
# config/rails/ contains Colang files with flow definitions
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
# Usage
response = await rails.generate(
messages=[{"role": "user", "content": user_input}]
)The power of NeMo is in the Colang definitions:
define user ask about competitors
"What do you think about [competitor]?"
"How do you compare to [competitor]?"
"Is [competitor] better?"
define bot refuse competitor discussion
"I'm focused on helping you with our products. I'd be happy to answer questions about our features."
define flow
user ask about competitors
bot refuse competitor discussionGuardrails AI takes a different approach: composable validators that inspect inputs and outputs as a pipeline. It feels like Pydantic for LLM safety.
from guardrails import Guard
from guardrails.hub import (
DetectPII,
ToxicLanguage,
DetectPromptInjection,
RestrictToTopic,
CompetitorCheck,
)
# Compose validators into a guard
guard = Guard().use_many(
DetectPromptInjection(on_fail="exception"),
ToxicLanguage(threshold=0.8, on_fail="filter"),
DetectPII(
pii_entities=["EMAIL", "PHONE", "SSN", "CREDIT_CARD"],
on_fail="fix"
),
RestrictToTopic(
valid_topics=["product support", "billing", "technical help"],
on_fail="refrain"
),
)
# Usage
result = guard(
llm_api=openai.chat.completions.create,
model="gpt-4",
messages=[{"role": "user", "content": user_input}],
)validate method. Your team can write custom validators in an afternoon.RestrictToTopic) make their own LLM calls, adding latency on top of your primary model call.Sometimes neither framework fits. Custom guardrails make sense when:
from dataclasses import dataclass
from typing import Optional
from abc import ABC, abstractmethod
@dataclass
class GuardResult:
passed: bool
output: str
triggered_rules: list[str]
modified: bool # True if output was filtered/redacted
class BaseGuardrail(ABC):
@abstractmethod
async def check(self, text: str, context: dict) -> GuardResult:
pass
class CustomGuardrailPipeline:
"""Lightweight guardrail pipeline with minimal overhead."""
def __init__(self):
self.input_rails: list[BaseGuardrail] = []
self.output_rails: list[BaseGuardrail] = []
def add_input_rail(self, rail: BaseGuardrail):
self.input_rails.append(rail)
def add_output_rail(self, rail: BaseGuardrail):
self.output_rails.append(rail)
async def check_input(self, user_input: str, context: dict) -> GuardResult:
for rail in self.input_rails:
result = await rail.check(user_input, context)
if not result.passed:
return result
return GuardResult(passed=True, output=user_input, triggered_rules=[], modified=False)
async def check_output(self, response: str, context: dict) -> GuardResult:
current = response
triggered = []
modified = False
for rail in self.output_rails:
result = await rail.check(current, context)
if not result.passed:
return result
if result.modified:
current = result.output
modified = True
triggered.extend(result.triggered_rules)
return GuardResult(
passed=True, output=current,
triggered_rules=triggered, modified=modified
)For domain-specific checks, fine-tuned classifiers often outperform general-purpose validators:
class DomainInjectionDetector(BaseGuardrail):
"""Injection detector trained on domain-specific attack patterns."""
def __init__(self, model_path: str, threshold: float = 0.85):
self.classifier = load_classifier(model_path)
self.threshold = threshold
async def check(self, text: str, context: dict) -> GuardResult:
score = self.classifier.predict_proba(text)[1] # injection probability
if score > self.threshold:
return GuardResult(
passed=False,
output="I can't process that request.",
triggered_rules=[f"domain_injection_score={score:.3f}"],
modified=False,
)
return GuardResult(passed=True, output=text, triggered_rules=[], modified=False)Custom guardrails require ongoing maintenance—new attack patterns, model updates, and false positive tuning. Budget for this. A framework handles maintenance for you; custom code puts that burden on your team. If you have a dedicated ML security team, custom wins on performance and coverage. If you don't, start with a framework and add custom rails only where frameworks fall short.
I ran each approach against a test suite of 500 prompt injection payloads, 200 jailbreak attempts, and 300 PII extraction probes. Here's how they compared:
@dataclass
class BenchmarkResult:
framework: str
injection_detection_rate: float
jailbreak_detection_rate: float
pii_detection_rate: float
false_positive_rate: float
p50_latency_ms: float
p95_latency_ms: float
class GuardrailBenchmark:
"""Benchmark runner for comparing guardrail frameworks."""
def __init__(self, test_suites: dict[str, list[str]]):
self.suites = test_suites
async def run(self, framework_fn, framework_name: str) -> BenchmarkResult:
results = {"injection": [], "jailbreak": [], "pii": [], "benign": []}
latencies = []
for category, payloads in self.suites.items():
for payload in payloads:
start = time.monotonic()
blocked = await framework_fn(payload)
latencies.append((time.monotonic() - start) * 1000)
results[category].append(blocked)
return BenchmarkResult(
framework=framework_name,
injection_detection_rate=sum(results["injection"]) / len(results["injection"]),
jailbreak_detection_rate=sum(results["jailbreak"]) / len(results["jailbreak"]),
pii_detection_rate=sum(results["pii"]) / len(results["pii"]),
false_positive_rate=sum(results["benign"]) / len(results["benign"]),
p50_latency_ms=sorted(latencies)[len(latencies) // 2],
p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
)| Metric | NeMo Guardrails | Guardrails AI | Custom (tuned classifier) |
|---|---|---|---|
| Injection detection | 71% | 84% | 93% |
| Jailbreak detection | 82% | 68% | 87% |
| PII detection | 76% | 92% | 89% |
| False positive rate | 8% | 5% | 3% |
| p50 latency overhead | 940ms | 320ms | 45ms |
| p95 latency overhead | 1,850ms | 890ms | 120ms |
Key takeaway: NeMo's latency overhead is significant but its jailbreak detection is strong. Guardrails AI has the best PII detection and lowest false positives among frameworks. Custom classifiers win on latency and overall detection but require engineering investment.
Choose NeMo Guardrails when:
Choose Guardrails AI when:
Choose custom when:
Best practice: Layer them. Use Guardrails AI for input/output validation (fast, composable) and add custom classifiers for domain-specific threats. NeMo can sit on top for dialog management if you need multi-turn control and can absorb the latency.
For a deeper look at how prompt injection attacks work and how to build systematic red team testing for your guardrails, those articles cover the fundamentals this piece builds on.