Hadrian is experimental alpha software. Do not use in production.

Guardrails

Configure content moderation and safety policies for input and output

Guardrails evaluate content against safety policies before requests are sent to LLM providers (input guardrails) and after responses are received (output guardrails). You can block, warn, log, or redact content based on configurable rules.

Overview

The guardrails system supports:

  • Multiple providers - OpenAI Moderation, AWS Bedrock, Azure Content Safety, or custom HTTP webhooks
  • Built-in rules - Blocklist patterns, PII regex detection, content limits
  • Flexible actions - Block, warn, log, or redact violations
  • Execution modes - Blocking, concurrent, or streaming evaluation
  • Audit logging - Track all evaluations and violations

Quick Start

Enable OpenAI Moderation (free) for input guardrails:

[features.guardrails]
enabled = true

[features.guardrails.input]
enabled = true
mode = "blocking"

[features.guardrails.input.provider]
type = "openai_moderation"

Guardrail Providers

OpenAI Moderation

Free moderation API from OpenAI. Fast and effective for general content safety.

[features.guardrails.input.provider]
type = "openai_moderation"
api_key = "${OPENAI_API_KEY}"  # Optional; falls back to the default OpenAI key
model = "omni-moderation-latest"  # or "text-moderation-latest"

# Custom thresholds per category (0.0-1.0, default varies by category)
[features.guardrails.input.provider.thresholds]
hate = 0.7
harassment = 0.7
self_harm = 0.5
sexual = 0.8
violence = 0.7

Detected categories: Hate, Harassment, Self-Harm, Sexual, Violence

OpenAI Moderation is free to use and doesn't count against your API quota.
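
The thresholds compare against the per-category scores returned by the moderation API. A minimal sketch of the likely comparison, in Python; whether the boundary is inclusive is an assumption:

# Example moderation scores; a category violates when its score reaches
# the configured threshold (inclusive boundary is an assumption).
scores = {"hate": 0.82, "violence": 0.10}
thresholds = {"hate": 0.7, "violence": 0.7}

violations = [cat for cat, score in scores.items() if score >= thresholds[cat]]
print(violations)  # ['hate']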

AWS Bedrock Guardrails

Enterprise-grade guardrails with PII detection, topic filters, and custom word lists.

[features.guardrails.input.provider]
type = "bedrock"
guardrail_id = "abc123def456"
guardrail_version = "1"
region = "us-east-1"

# Optional: Override default AWS credentials
access_key_id = "${AWS_ACCESS_KEY_ID}"
secret_access_key = "${AWS_SECRET_ACCESS_KEY}"

# Enable trace for debugging
trace_enabled = true

Capabilities:

Feature            Description
Content filters    Hate, insults, sexual, violence, misconduct, prompt attacks
Word filters       Custom word lists and AWS managed lists
Topic filters      Block off-topic conversations
PII detection      Email, phone, SSN, credit card, address, name
Confidence levels  None, Low, Medium, High per category

Configure guardrail policies in the AWS Console. The gateway references them by ID.

Azure AI Content Safety

Microsoft's content moderation with configurable severity thresholds.

[features.guardrails.input.provider]
type = "azure_content_safety"
endpoint = "https://myservice.cognitiveservices.azure.com"
api_key = "${AZURE_CONTENT_SAFETY_KEY}"
api_version = "2024-09-01"

# Severity thresholds (0-6 scale, block at threshold or above)
[features.guardrails.input.provider.thresholds]
hate = 2       # Block severity 2+
violence = 4   # Block severity 4+
self_harm = 2
sexual = 4

Severity scale:

Level     Severity              Score
Info      Safe                  0
Low       Minor concerns        1-2
Medium    Moderate concerns     3-4
High      Significant concerns  5
Critical  Severe                6

Custom HTTP Provider

Send content to your own moderation service via HTTP webhook.

[features.guardrails.input.provider]
type = "custom"
url = "https://my-guardrails.example.com/evaluate"
api_key = "${CUSTOM_GUARDRAILS_KEY}"
timeout_ms = 3000
retry_enabled = true
max_retries = 2

[features.guardrails.input.provider.headers]
X-Custom-Header = "value"

Request format:

{
  "input": "text to evaluate",
  "source": "user_input",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "context": {}
}

Response format:

{
  "passed": false,
  "violations": [
    {
      "category": "hate",
      "severity": "high",
      "confidence": 0.95,
      "message": "Hate speech detected"
    }
  ]
}
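
For reference, here is a minimal sketch of a compatible endpoint. Flask and the keyword-based rule are illustrative assumptions; any HTTP service that speaks the request/response format above will work:

from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative rule set: term -> (category, severity).
BANNED_TERMS = {"example_banned_term": ("confidential", "high")}

@app.post("/evaluate")
def evaluate():
    payload = request.get_json(force=True)
    text = payload.get("input", "").lower()
    violations = [
        {
            "category": category,
            "severity": severity,
            "confidence": 1.0,
            "message": f"Matched banned term: {term}",
        }
        for term, (category, severity) in BANNED_TERMS.items()
        if term in text
    ]
    return jsonify({"passed": not violations, "violations": violations})

if __name__ == "__main__":
    app.run(port=8080)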

Blocklist (Built-in)

Local pattern matching with literal strings or regex. No external API calls.

[features.guardrails.input.provider]
type = "blocklist"
case_insensitive = true

[[features.guardrails.input.provider.patterns]]
pattern = "competitor_name"
is_regex = false
category = "competitor_mention"
severity = "medium"
message = "Competitor name mentioned"

[[features.guardrails.input.provider.patterns]]
pattern = "(?i)\\b(password|secret|api.?key)\\s*[:=]"
is_regex = true
category = "confidential"
severity = "high"
message = "Potential secret detected"
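
To check what the secret-detection pattern above catches before deploying it, you can test it directly in Python (TOML's "\\b" arrives as a single backslash, so this is the compiled form):

import re

pattern = re.compile(r"(?i)\b(password|secret|api.?key)\s*[:=]")

for sample in ["password: hunter2", "API_KEY=abc123", "the secret is safe"]:
    # The first two match; the third lacks a ':' or '=' after the keyword.
    print(f"{sample!r}: {'flagged' if pattern.search(sample) else 'ok'}")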

PII Regex (Built-in)

Detect common PII patterns without external API calls.

[features.guardrails.input.provider]
type = "pii_regex"
email = true
phone = true
ssn = true
credit_card = true
ip_address = true
date_of_birth = true

Detected patterns:

Type           Example Pattern
Email          user@example.com
Phone          (555) 123-4567, +1-555-123-4567
SSN            123-45-6789
Credit Card    4111-1111-1111-1111
IP Address     192.168.1.1
Date of Birth  01/15/1990, 1990-01-15
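
The exact expressions the gateway uses are not documented here; the sketch below uses simplified regexes of the same general shape to show what each type matches. These patterns are assumptions, not Hadrian's own:

import re

# Simplified stand-ins for the pattern types above, not Hadrian's regexes.
PII_PATTERNS = {
    "pii_email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "pii_phone": re.compile(r"\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "pii_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii_credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

text = "Contact user@example.com or (555) 123-4567."
for category, pattern in PII_PATTERNS.items():
    for match in pattern.finditer(text):
        print(category, match.group())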

Content Limits (Built-in)

Enforce size constraints on input content.

[features.guardrails.input.provider]
type = "content_limits"
max_characters = 100000
max_words = 20000
max_lines = 1000
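
A sketch of what such a check amounts to; the counting rules (raw character length, whitespace-delimited words, newline-delimited lines) are assumptions about how the limits are measured:

def within_limits(text, max_characters=100_000, max_words=20_000, max_lines=1_000):
    # Counting rules here are assumptions, not Hadrian's documented behavior.
    return (
        len(text) <= max_characters
        and len(text.split()) <= max_words
        and text.count("\n") + 1 <= max_lines
    )

print(within_limits("hello world"))  # True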

Execution Modes

Input Guardrails

Blocking Mode (Default)

Evaluate guardrails before sending to the LLM. Safest option but adds latency.

[features.guardrails.input]
mode = "blocking"
timeout_ms = 5000
on_timeout = "block"  # or "allow"

Request → Guardrails ─┬→ (pass) → LLM → Response
                      └→ (fail) → Error

Concurrent Mode

Evaluate guardrails and call LLM simultaneously. Lower latency for passing requests.

[features.guardrails.input]
mode = "concurrent"
timeout_ms = 1000
on_timeout = "block"

Request → ┬→ Guardrails ─┬→ (pass) → Wait for LLM → Response
          └→ LLM ────────┘→ (fail) → Cancel LLM → Error

Behavior (see the sketch after this list):

  • If guardrails fail before the LLM responds: cancel the LLM request, return an error
  • If the LLM responds first: wait for the guardrails result before returning
  • If guardrails time out: action depends on the on_timeout setting
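
A minimal asyncio sketch of this race. evaluate_guardrails and call_llm are illustrative stand-ins for the provider and upstream calls, and timeout handling is omitted:

import asyncio

async def evaluate_guardrails(text: str) -> bool:
    await asyncio.sleep(0.1)            # stand-in for the provider call
    return "banned" not in text

async def call_llm(text: str) -> str:
    await asyncio.sleep(0.5)            # stand-in for the upstream request
    return f"response to {text!r}"

async def handle(text: str) -> str:
    llm_task = asyncio.create_task(call_llm(text))  # start both at once
    if not await evaluate_guardrails(text):
        llm_task.cancel()               # guardrails failed first: cancel LLM
        raise RuntimeError("blocked by guardrails")
    return await llm_task               # otherwise wait out the LLM call

print(asyncio.run(handle("hello")))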

Output Guardrails

Output guardrails evaluate LLM responses before returning to the client.

[features.guardrails.output]
enabled = true
timeout_ms = 5000

[features.guardrails.output.provider]
type = "openai_moderation"

Streaming Evaluation Modes

For streaming responses, choose how to evaluate content:

[features.guardrails.output]
streaming_mode = "final_only"  # Default

Mode        Behavior                            Trade-off
final_only  Evaluate after streaming completes  Lowest latency, harmful content may stream
buffered    Evaluate after N tokens accumulate  Balance of latency and safety
per_chunk   Evaluate each SSE chunk             Highest safety, significant latency

Buffered mode configuration (the buffered table selects the mode and carries its parameters; also assigning streaming_mode = "buffered" as a string would redefine the key and fail TOML parsing):

[features.guardrails.output.streaming_mode.buffered]
buffer_tokens = 100
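
A sketch of the buffered strategy; counting tokens as whitespace-separated words and the evaluate stand-in are simplifying assumptions:

def evaluate(text: str) -> bool:
    return "banned" not in text         # stand-in for a provider call

def stream_with_buffering(chunks, buffer_tokens=100):
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        # Approximate token count as whitespace-separated words (assumption).
        if sum(len(c.split()) for c in buffer) >= buffer_tokens:
            joined = "".join(buffer)
            if not evaluate(joined):
                raise RuntimeError("blocked by output guardrails")
            yield joined                # release only evaluated content
            buffer = []
    remainder = "".join(buffer)
    if remainder:
        if not evaluate(remainder):
            raise RuntimeError("blocked by output guardrails")
        yield remainder

for part in stream_with_buffering(["All ", "good ", "here."], buffer_tokens=2):
    print(part)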

Actions

Configure what happens when violations are detected:

[features.guardrails.input]
default_action = "block"

[features.guardrails.input.actions]
hate = "block"
harassment = "warn"
competitor_mention = "redact"
off_topic = "log"

Action  Behavior
block   Reject request with error response
warn    Allow request, add warning headers
log     Allow request silently, log violation
redact  Remove or mask violating content

Redaction configuration:

[features.guardrails.input.actions.redact]
replacement = "[REDACTED]"
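
Redaction replaces the violating span rather than rejecting the request. A minimal sketch with an illustrative email pattern:

import re

def redact(text: str, pattern: re.Pattern, replacement: str = "[REDACTED]") -> str:
    # Replace every matched span with the configured replacement string.
    return pattern.sub(replacement, text)

email = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")   # illustrative pattern
print(redact("Reach me at user@example.com", email))
# Reach me at [REDACTED]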

Violation Categories

Standard categories across all providers:

Content Safety

Category    Description
hate        Hate speech, discrimination, slurs
harassment  Bullying, threats against individuals
self_harm   Self-harm instructions or glorification
sexual      Sexual content
violence    Gore, graphic violence
dangerous   Illegal or dangerous activities

Security

Category        Description
prompt_attack   Jailbreak attempts, prompt injection
prompt_leakage  Attempts to extract system prompts
malicious_code  Malware or malicious code

PII

Category         Description
pii_email        Email addresses
pii_phone        Phone numbers
pii_ssn          Social security numbers
pii_credit_card  Credit card numbers
pii_address      Physical addresses
pii_name         Personal names

Business Policy

Category            Description
off_topic           Topic filter violations
competitor_mention  Competitor names
confidential        Confidential information

Severity Levels

Violations include a severity level:

Level     Value  Description
info      0      Informational only
low       1      May warrant logging
medium    2      May warrant warning
high      3      Typically requires action
critical  4      Immediate action required

Error Handling

Timeout Handling

[features.guardrails.input]
timeout_ms = 5000
on_timeout = "block"  # Fail-closed (default)
# on_timeout = "allow"  # Fail-open (higher availability)

Provider Error Handling

[features.guardrails.input]
on_error = "block"  # Fail-closed (default)
# on_error = "allow"  # Fail-open
# on_error = "log_and_allow"  # Log error but allow request
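
Both settings decide whether a failed evaluation fails open or closed. A sketch of that policy logic; the function names are illustrative stand-ins, not Hadrian internals:

import asyncio

async def evaluate(text: str) -> bool:
    await asyncio.sleep(0.01)           # stand-in for the provider call
    return True

async def guarded(text, timeout_ms=5000, on_timeout="block", on_error="block"):
    try:
        return await asyncio.wait_for(evaluate(text), timeout_ms / 1000)
    except asyncio.TimeoutError:
        return on_timeout == "allow"    # fail open only when configured
    except Exception as err:
        if on_error == "log_and_allow":
            print(f"guardrails error (allowed): {err}")
            return True
        return on_error == "allow"

print(asyncio.run(guarded("hello")))    # True: evaluation passed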

Audit Logging

Track all guardrail evaluations:

[features.guardrails.audit]
enabled = true
log_blocked = true       # Log blocked requests
log_violations = true    # Log all violations (even if not blocked)
log_redacted = true      # Log redaction events
log_all_evaluations = false  # Log every evaluation (verbose)

Audit logs integrate with OpenTelemetry tracing when enabled.

Complete Configuration Example

[features.guardrails]
enabled = true

# Input guardrails (pre-request)
[features.guardrails.input]
enabled = true
mode = "concurrent"
timeout_ms = 1000
on_timeout = "block"
on_error = "log_and_allow"
default_action = "block"

[features.guardrails.input.provider]
type = "openai_moderation"
model = "omni-moderation-latest"

[features.guardrails.input.actions]
hate = "block"
harassment = "block"
violence = "block"
sexual = "warn"

# Output guardrails (post-response)
[features.guardrails.output]
enabled = true
timeout_ms = 5000
on_error = "block"
default_action = "warn"
# streaming_mode is set to "buffered" by the streaming_mode.buffered table below

[features.guardrails.output.provider]
type = "bedrock"
guardrail_id = "abc123"
guardrail_version = "1"
region = "us-east-1"

[features.guardrails.output.streaming_mode.buffered]
buffer_tokens = 100

# Built-in PII detection (in addition to provider)
[features.guardrails.pii]
enabled = true
types = ["EMAIL", "PHONE", "SSN", "CREDIT_CARD"]
action = "redact"
replacement = "[PII]"
apply_to = "both"  # "input", "output", or "both"

# Audit logging
[features.guardrails.audit]
enabled = true
log_blocked = true
log_violations = true

Request Flow

Request
  ↓
Input Guardrails (if enabled)
  ├─ Blocking: Wait → Evaluate → Pass/Fail
  └─ Concurrent: Evaluate ║ LLM Call (race)
  ↓
LLM Provider
  ↓
Output Guardrails (if enabled)
  ├─ Non-streaming: Buffer → Evaluate → Action
  └─ Streaming:
     ├─ FinalOnly: Stream → Evaluate at end
     ├─ Buffered: Accumulate → Periodic evaluation
     └─ PerChunk: Evaluate each chunk
  ↓
Response (or Error)

Error Responses

When guardrails block a request:

{
  "error": {
    "type": "guardrails_blocked",
    "message": "Request blocked by content policy",
    "violations": [
      {
        "category": "hate",
        "severity": "high",
        "confidence": 0.95,
        "message": "Hate speech detected"
      }
    ]
  }
}

HTTP status: 400 Bad Request for input violations, 500 Internal Server Error for output violations.
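
A client can branch on the error type to surface violations to the user. A sketch using Python requests; the gateway address and OpenAI-style path are assumptions about your deployment:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # assumed gateway address
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]},
)
if resp.status_code == 400:
    error = resp.json().get("error", {})
    if error.get("type") == "guardrails_blocked":
        for v in error.get("violations", []):
            print(f"{v['category']} ({v['severity']}): {v['message']}")
else:
    resp.raise_for_status()
    print(resp.json())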
