Guardrails
Configure content moderation and safety policies for input and output
Guardrails evaluate content against safety policies before requests are sent to LLM providers (input guardrails) and after responses are received (output guardrails). You can block, warn, log, or redact content based on configurable rules.
Overview
The guardrails system supports:
- Multiple providers - OpenAI Moderation, AWS Bedrock, Azure Content Safety, or custom HTTP webhooks
- Built-in rules - Blocklist patterns, PII regex detection, content limits
- Flexible actions - Block, warn, log, or redact violations
- Execution modes - Blocking, concurrent, or streaming evaluation
- Audit logging - Track all evaluations and violations
Quick Start
Enable OpenAI Moderation (free) for input guardrails:
[features.guardrails]
enabled = true
[features.guardrails.input]
enabled = true
mode = "blocking"
[features.guardrails.input.provider]
type = "openai_moderation"Guardrail Providers
OpenAI Moderation
Free moderation API from OpenAI. Fast and effective for general content safety.
[features.guardrails.input.provider]
type = "openai_moderation"
api_key = "${OPENAI_API_KEY}" # Optional, uses default OpenAI key
model = "omni-moderation-latest" # or "text-moderation-latest"
# Custom thresholds per category (0.0-1.0, default varies by category)
[features.guardrails.input.provider.thresholds]
hate = 0.7
harassment = 0.7
self_harm = 0.5
sexual = 0.8
violence = 0.7
Detected categories: Hate, Harassment, Self-Harm, Sexual, Violence
OpenAI Moderation is free to use and doesn't count against your API quota.
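As an illustration of the threshold behavior, the sketch below flags a category whenever its moderation score meets or exceeds the configured value. The dictionary and helper function are hypothetical, not gateway code; the name normalization reflects that the config uses underscores (self_harm) while the moderation API reports hyphenated category names (self-harm).

```python
# Illustrative only -- how per-category thresholds are typically applied to
# OpenAI moderation scores; the real evaluation happens inside the gateway.
thresholds = {"hate": 0.7, "harassment": 0.7, "self_harm": 0.5, "sexual": 0.8, "violence": 0.7}

def violations_from_scores(category_scores: dict[str, float]) -> list[dict]:
    violations = []
    for category, score in category_scores.items():
        key = category.replace("-", "_").replace("/", "_")   # self-harm -> self_harm
        threshold = thresholds.get(key)
        if threshold is not None and score >= threshold:
            violations.append({"category": key, "confidence": score})
    return violations

# A hate score of 0.82 exceeds the 0.7 threshold, so it is flagged.
print(violations_from_scores({"hate": 0.82, "sexual": 0.10, "violence": 0.05}))
```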
AWS Bedrock Guardrails
Enterprise-grade guardrails with PII detection, topic filters, and custom word lists.
[features.guardrails.input.provider]
type = "bedrock"
guardrail_id = "abc123def456"
guardrail_version = "1"
region = "us-east-1"
# Optional: Override default AWS credentials
access_key_id = "${AWS_ACCESS_KEY_ID}"
secret_access_key = "${AWS_SECRET_ACCESS_KEY}"
# Enable trace for debugging
trace_enabled = true
Capabilities:
| Feature | Description |
|---|---|
| Content filters | Hate, insults, sexual, violence, misconduct, prompt attacks |
| Word filters | Custom word lists and AWS managed lists |
| Topic filters | Block off-topic conversations |
| PII detection | Email, phone, SSN, credit card, address, name |
| Confidence levels | None, Low, Medium, High per category |
Configure guardrail policies in the AWS Console. The gateway references them by ID.
Azure AI Content Safety
Microsoft's content moderation service with configurable severity thresholds.
[features.guardrails.input.provider]
type = "azure_content_safety"
endpoint = "https://myservice.cognitiveservices.azure.com"
api_key = "${AZURE_CONTENT_SAFETY_KEY}"
api_version = "2024-09-01"
# Severity thresholds (0-6 scale, block at threshold or above)
[features.guardrails.input.provider.thresholds]
hate = 2 # Block severity 2+
violence = 4 # Block severity 4+
self_harm = 2
sexual = 4
Severity scale:
| Level | Description | Severity Score |
|---|---|---|
| Info | Safe | 0 |
| Low | Minor concerns | 1-2 |
| Medium | Moderate concerns | 3-4 |
| High | Significant concerns | 5 |
| Critical | Severe | 6 |
Custom HTTP Provider
Send content to your own moderation service via HTTP webhook.
[features.guardrails.input.provider]
type = "custom"
url = "https://my-guardrails.example.com/evaluate"
api_key = "${CUSTOM_GUARDRAILS_KEY}"
timeout_ms = 3000
retry_enabled = true
max_retries = 2
[features.guardrails.input.provider.headers]
X-Custom-Header = "value"
Request format:
{
"input": "text to evaluate",
"source": "user_input",
"request_id": "req_abc123",
"user_id": "user_456",
"context": {}
}
Response format:
{
"passed": false,
"violations": [
{
"category": "hate",
"severity": "high",
"confidence": 0.95,
"message": "Hate speech detected"
}
]
}
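To make the contract concrete, here is a minimal webhook sketch that accepts the request format above and returns the expected response shape. Flask, the /evaluate route, and the blocked-term list are illustrative assumptions, not requirements of the gateway.

```python
# Minimal sketch of a custom guardrails webhook, assuming the request/response
# contract documented above. The policy (a blocked-term set) is a placeholder.
from flask import Flask, request, jsonify

app = Flask(__name__)
BLOCKED_TERMS = {"internal-codename"}  # hypothetical policy

@app.post("/evaluate")
def evaluate():
    payload = request.get_json(force=True)
    text = payload.get("input", "").lower()
    violations = [
        {
            "category": "confidential",
            "severity": "high",
            "confidence": 1.0,
            "message": f"Blocked term detected: {term}",
        }
        for term in BLOCKED_TERMS
        if term in text
    ]
    return jsonify({"passed": not violations, "violations": violations})

if __name__ == "__main__":
    app.run(port=8080)
```

Point the provider's url setting at wherever this service is deployed.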
Blocklist (Built-in)
Local pattern matching with literal strings or regex. No external API calls.
[features.guardrails.input.provider]
type = "blocklist"
case_insensitive = true
[[features.guardrails.input.provider.patterns]]
pattern = "competitor_name"
is_regex = false
category = "competitor_mention"
severity = "medium"
message = "Competitor name mentioned"
[[features.guardrails.input.provider.patterns]]
pattern = "(?i)\\b(password|secret|api.?key)\\s*[:=]"
is_regex = true
category = "confidential"
severity = "high"
message = "Potential secret detected"PII Regex (Built-in)
PII Regex (Built-in)
Detect common PII patterns without external API calls.
[features.guardrails.input.provider]
type = "pii_regex"
email = true
phone = true
ssn = true
credit_card = true
ip_address = true
date_of_birth = true
Detected patterns:
| Type | Example Pattern |
|---|---|
| Email | user@example.com |
| Phone | (555) 123-4567, +1-555-123-4567 |
| SSN | 123-45-6789 |
| Credit Card | 4111-1111-1111-1111 |
| IP Address | 192.168.1.1 |
| Date of Birth | 01/15/1990, 1990-01-15 |
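The gateway's built-in PII regexes are not published verbatim; the sketch below uses simplified stand-in patterns for two of the types to show how a redact action with a replacement token behaves.

```python
# Illustrative stand-in patterns only -- the gateway's built-in PII regexes
# may differ. Shows how a "redact" action with a replacement token behaves.
import re

PII_PATTERNS = {
    "pii_email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "pii_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str, replacement: str = "[PII]") -> str:
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact user@example.com, SSN 123-45-6789"))
# -> Contact [PII], SSN [PII]
```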
Content Limits (Built-in)
Enforce size constraints on input content.
[features.guardrails.input.provider]
type = "content_limits"
max_characters = 100000
max_words = 20000
max_lines = 1000
Execution Modes
Input Guardrails
Blocking Mode (Default)
Evaluate guardrails before sending the request to the LLM. The safest option, but it adds latency to every request.
[features.guardrails.input]
mode = "blocking"
timeout_ms = 5000
on_timeout = "block" # or "allow"Request → Guardrails → (pass) → LLM → Response
                     → (fail) → Error
Concurrent Mode
Evaluate guardrails and call the LLM simultaneously. Lower latency for requests that pass.
[features.guardrails.input]
mode = "concurrent"
timeout_ms = 1000
on_timeout = "block"Request → ┬→ Guardrails ─┬→ (pass) → Wait for LLM → Response
          └→ LLM ────────┘→ (fail) → Cancel LLM → Error
Behavior:
- If guardrails fail before LLM responds: cancel LLM request, return error
- If LLM responds first: wait for guardrails result before returning
- If guardrails time out: the action is determined by the on_timeout setting (see the sketch below)
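A rough sketch of the concurrent race described above, with placeholder guardrail and LLM calls rather than gateway internals:

```python
# Rough illustration of concurrent-mode semantics (placeholder functions, not
# gateway internals): run guardrails and the LLM call at the same time, cancel
# the LLM call if guardrails fail, otherwise wait for both results.
import asyncio

async def evaluate_guardrails(prompt: str) -> bool:
    await asyncio.sleep(0.05)          # stand-in for a provider call
    return "forbidden" not in prompt

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.2)           # stand-in for the upstream LLM
    return f"response to: {prompt}"

async def handle(prompt: str) -> str:
    llm_task = asyncio.create_task(call_llm(prompt))
    guard_task = asyncio.create_task(evaluate_guardrails(prompt))
    if not await guard_task:           # guardrails result is always awaited
        llm_task.cancel()              # fail before the LLM finishes: cancel it
        raise RuntimeError("guardrails_blocked")
    return await llm_task              # pass: wait for the LLM response

print(asyncio.run(handle("hello")))
```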
Output Guardrails
Output guardrails evaluate LLM responses before they are returned to the client.
[features.guardrails.output]
enabled = true
timeout_ms = 5000
[features.guardrails.output.provider]
type = "openai_moderation"Streaming Evaluation Modes
For streaming responses, choose how to evaluate content:
[features.guardrails.output]
streaming_mode = "final_only" # Default| Mode | Behavior | Trade-off |
|---|---|---|
| final_only | Evaluate after streaming completes | Lowest latency, harmful content may stream |
| buffered | Evaluate after N tokens accumulate | Balance of latency and safety |
| per_chunk | Evaluate each SSE chunk | Highest safety, significant latency |
Buffered mode configuration:
[features.guardrails.output]
streaming_mode = "buffered"
[features.guardrails.output.streaming_mode.buffered]
buffer_tokens = 100
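To illustrate the buffered idea, the sketch below accumulates streamed chunks until roughly buffer_tokens worth of text is collected, evaluates the buffer with a hypothetical evaluate() callback, and only then releases the chunks. This is illustrative, not how the gateway is implemented.

```python
# Illustration of buffered streaming evaluation (hypothetical evaluate()
# callback and a crude word-based token count): release chunks only after
# each accumulated buffer passes evaluation.
def stream_with_buffered_guardrails(chunks, evaluate, buffer_tokens=100):
    pending = []
    for chunk in chunks:
        pending.append(chunk)
        text = "".join(pending)
        if len(text.split()) >= buffer_tokens:   # crude token count
            if not evaluate(text):               # evaluate accumulated buffer
                raise RuntimeError("guardrails_blocked")
            yield from pending                   # release evaluated chunks
            pending = []
    if pending:                                  # evaluate the final partial buffer
        if not evaluate("".join(pending)):
            raise RuntimeError("guardrails_blocked")
        yield from pending

chunks = ["Hello ", "world, ", "this is a streamed reply."]
safe = lambda text: "forbidden" not in text      # stand-in evaluator
print("".join(stream_with_buffered_guardrails(chunks, safe, buffer_tokens=4)))
```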
Actions
Configure what happens when violations are detected:
[features.guardrails.input]
default_action = "block"
[features.guardrails.input.actions]
hate = "block"
harassment = "warn"
competitor_mention = "redact"
off_topic = "log"| Action | Behavior |
|---|---|
| block | Reject request with error response |
| warn | Allow request, add warning headers |
| log | Allow request silently, log violation |
| redact | Remove or mask violating content |
Redaction configuration:
[features.guardrails.input.actions.redact]
replacement = "[REDACTED]"Violation Categories
Standard categories across all providers:
Content Safety
| Category | Description |
|---|---|
| hate | Hate speech, discrimination, slurs |
| harassment | Bullying, threats against individuals |
| self_harm | Self-harm instructions or glorification |
| sexual | Sexual content |
| violence | Gore, graphic violence |
| dangerous | Illegal or dangerous activities |
Security
| Category | Description |
|---|---|
| prompt_attack | Jailbreak attempts, prompt injection |
| prompt_leakage | Attempts to extract system prompts |
| malicious_code | Malware or malicious code |
PII
| Category | Description |
|---|---|
| pii_email | Email addresses |
| pii_phone | Phone numbers |
| pii_ssn | Social security numbers |
| pii_credit_card | Credit card numbers |
| pii_address | Physical addresses |
| pii_name | Personal names |
Business Policy
| Category | Description |
|---|---|
| off_topic | Topic filter violations |
| competitor_mention | Competitor names |
| confidential | Confidential information |
Severity Levels
Violations include a severity level:
| Level | Value | Description |
|---|---|---|
| info | 0 | Informational only |
| low | 1 | May warrant logging |
| medium | 2 | May warrant warning |
| high | 3 | Typically requires action |
| critical | 4 | Immediate action required |
Error Handling
Timeout Handling
[features.guardrails.input]
timeout_ms = 5000
on_timeout = "block" # Fail-closed (default)
# on_timeout = "allow" # Fail-open (higher availability)Provider Error Handling
[features.guardrails.input]
on_error = "block" # Fail-closed (default)
# on_error = "allow" # Fail-open
# on_error = "log_and_allow" # Log error but allow requestAudit Logging
Track all guardrail evaluations:
[features.guardrails.audit]
enabled = true
log_blocked = true # Log blocked requests
log_violations = true # Log all violations (even if not blocked)
log_redacted = true # Log redaction events
log_all_evaluations = false # Log every evaluation (verbose)
Audit logs integrate with OpenTelemetry tracing when enabled.
Complete Configuration Example
[features.guardrails]
enabled = true
# Input guardrails (pre-request)
[features.guardrails.input]
enabled = true
mode = "concurrent"
timeout_ms = 1000
on_timeout = "block"
on_error = "log_and_allow"
default_action = "block"
[features.guardrails.input.provider]
type = "openai_moderation"
model = "omni-moderation-latest"
[features.guardrails.input.actions]
hate = "block"
harassment = "block"
violence = "block"
sexual = "warn"
# Output guardrails (post-response)
[features.guardrails.output]
enabled = true
timeout_ms = 5000
on_error = "block"
default_action = "warn"
streaming_mode = "buffered"
[features.guardrails.output.provider]
type = "bedrock"
guardrail_id = "abc123"
guardrail_version = "1"
region = "us-east-1"
[features.guardrails.output.streaming_mode.buffered]
buffer_tokens = 100
# Built-in PII detection (in addition to provider)
[features.guardrails.pii]
enabled = true
types = ["EMAIL", "PHONE", "SSN", "CREDIT_CARD"]
action = "redact"
replacement = "[PII]"
apply_to = "both" # "input", "output", or "both"
# Audit logging
[features.guardrails.audit]
enabled = true
log_blocked = true
log_violations = true
Request Flow
Request
│
▼
Input Guardrails (if enabled)
├─ Blocking: Wait → Evaluate → Pass/Fail
└─ Concurrent: Evaluate ║ LLM Call (race)
│
▼
LLM Provider
│
▼
Output Guardrails (if enabled)
├─ Non-streaming: Buffer → Evaluate → Action
└─ Streaming:
├─ FinalOnly: Stream → Evaluate at end
├─ Buffered: Accumulate → Periodic evaluation
└─ PerChunk: Evaluate each chunk
│
▼
Response (or Error)
Error Responses
When guardrails block a request:
{
"error": {
"type": "guardrails_blocked",
"message": "Request blocked by content policy",
"violations": [
{
"category": "hate",
"severity": "high",
"confidence": 0.95,
"message": "Hate speech detected"
}
]
}
}
HTTP status: 400 Bad Request for input violations, 500 Internal Server Error for output violations.
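On the client side, a blocked input surfaces as an HTTP 400 with the body shown above. A hedged example using the requests library (the gateway URL, path, and payload shape are placeholders for whatever your deployment normally accepts):

```python
# Hedged client-side handling of a guardrails block. The URL, model name, and
# payload are placeholders; only the error body shape follows the docs above.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]},
    timeout=30,
)
if resp.status_code == 400:
    error = resp.json().get("error", {})
    if error.get("type") == "guardrails_blocked":
        for v in error.get("violations", []):
            print(f"blocked: {v['category']} ({v['severity']}, confidence {v['confidence']})")
else:
    resp.raise_for_status()
    print(resp.json())
```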