> **Warning:** Hadrian is experimental alpha software. Do not use in production.

# Response Caching

Configure exact-match and semantic response caching

The `[features.response_caching]` section configures gateway-level response caching. Hadrian supports both exact-match caching (SHA-256 hash) and semantic caching (embedding similarity).

## Configuration Reference

### Main Settings

```toml
[features.response_caching]
enabled = true
ttl_secs = 3600
only_deterministic = true
max_size_bytes = 1048576
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | boolean | `true` | Enable response caching |
| `ttl_secs` | integer | `3600` | Cache TTL in seconds (1 hour) |
| `only_deterministic` | boolean | `true` | Only cache when `temperature = 0` |
| `max_size_bytes` | integer | `1048576` | Maximum response size to cache (1 MB) |

### Cache Key Components

Control what's included in the cache key with `[features.response_caching.key_components]`:

```toml
[features.response_caching.key_components]
model = true
temperature = true
system_prompt = true
tools = true
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model` | boolean | `true` | Include model name in cache key |
| `temperature` | boolean | `true` | Include temperature in cache key |
| `system_prompt` | boolean | `true` | Include system prompt in cache key |
| `tools` | boolean | `true` | Include tools in cache key |
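Hadrian's exact key derivation is internal, but conceptually the enabled components are serialized canonically and folded into a single SHA-256 digest. Here is a minimal Python sketch of that idea; the field names, the always-hashed `messages` list, and the canonical-JSON step are illustrative assumptions, not Hadrian's actual schema:

```python
import hashlib
import json

def cache_key(request: dict, components: dict) -> str:
    """Sketch of an exact-match cache key over the enabled components.

    `components` mirrors [features.response_caching.key_components];
    `request` is an OpenAI-style chat completion body (assumed shape).
    """
    # The non-system messages are assumed to always be part of the key.
    material = {
        "messages": [m for m in request["messages"] if m["role"] != "system"],
    }
    if components.get("model", True):
        material["model"] = request.get("model")
    if components.get("temperature", True):
        material["temperature"] = request.get("temperature")
    if components.get("system_prompt", True):
        material["system_prompt"] = [
            m["content"] for m in request["messages"] if m["role"] == "system"
        ]
    if components.get("tools", True):
        material["tools"] = request.get("tools")
    # Canonical serialization so dict ordering cannot change the hash.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under a scheme like this, setting `system_prompt = false` means two requests that differ only in their system prompt hash to the same key and share a cached response, which is the practical effect of disabling a component.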

## Semantic Caching

Enable similarity-based cache matching with `[features.response_caching.semantic]`:

```toml
[features.response_caching.semantic]
enabled = true
similarity_threshold = 0.95
top_k = 1

[features.response_caching.semantic.embedding]
provider = "openai"
model = "text-embedding-3-small"
dimensions = 1536

[features.response_caching.semantic.vector_backend]
type = "pgvector"
table_name = "semantic_cache_embeddings"
index_type = "ivf_flat"
distance_metric = "cosine"
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | boolean | `false` | Enable semantic caching |
| `similarity_threshold` | float | `0.95` | Minimum cosine similarity for a cache hit |
| `top_k` | integer | `1` | Number of similar results to consider |

### Embedding Configuration

```toml
[features.response_caching.semantic.embedding]
provider = "openai"
model = "text-embedding-3-small"
dimensions = 1536
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `provider` | string | `"openai"` | Embedding provider |
| `model` | string | `"text-embedding-3-small"` | Embedding model |
| `dimensions` | integer | `1536` | Embedding dimensions |
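Hadrian issues this embedding call internally; for intuition, the equivalent request with the official `openai` Python client looks roughly like this (the input string is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is the capital of France?",
    dimensions=1536,  # must match the vector backend's configured size
)
vector = resp.data[0].embedding  # list of 1536 floats
```

Whichever provider you use, `dimensions` has to agree with the vector backend: a pgvector column or Qdrant collection sized for 1536-dimensional vectors will not accept 3072-dimensional ones.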

### Vector Backend

#### pgvector

```toml
[features.response_caching.semantic.vector_backend]
type = "pgvector"
table_name = "semantic_cache_embeddings"
index_type = "ivf_flat"
distance_metric = "cosine"
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `table_name` | string | `"semantic_cache_embeddings"` | Table for cache embeddings |
| `index_type` | string | `"ivf_flat"` | `ivf_flat` or `hnsw` |
| `distance_metric` | string | `"cosine"` | Distance metric |
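As a rule of thumb for pgvector, `hnsw` gives better recall and query latency at the cost of slower index builds and higher memory use, while `ivf_flat` builds quickly but trains its clusters on existing rows, which makes it a weaker fit for a cache table that starts out empty.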

#### Qdrant

```toml
[features.response_caching.semantic.vector_backend]
type = "qdrant"
url = "http://localhost:6333"
api_key = "${QDRANT_API_KEY}"
collection_name = "semantic_cache"
distance_metric = "cosine"
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `url` | string | required | Qdrant server URL |
| `api_key` | string | none | API key for authentication |
| `collection_name` | string | `"semantic_cache"` | Collection name |
| `distance_metric` | string | `"cosine"` | Distance metric |

## Prompt Caching

Provider-level prompt caching (Anthropic) is configured separately:

```toml
[features.prompt_caching]
enabled = true
min_tokens = 1024
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | boolean | `true` | Enable prompt caching |
| `min_tokens` | integer | `1024` | Minimum prompt length to cache |

Prompt caching is provider-side (Anthropic handles it). Response caching is gateway-side (Hadrian handles it). Both can be enabled simultaneously.
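For example, to run both side by side, a sketch combining the defaults shown on this page:

```toml
[features.prompt_caching]
enabled = true
min_tokens = 1024

[features.response_caching]
enabled = true
ttl_secs = 3600
only_deterministic = true
```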

## Complete Examples

### Exact-Match Only

```toml
[features.response_caching]
enabled = true
ttl_secs = 3600
only_deterministic = true
max_size_bytes = 1048576

[features.response_caching.key_components]
model = true
temperature = true
system_prompt = true
tools = true
```

### With Semantic Caching (pgvector)

```toml
[features.response_caching]
enabled = true
ttl_secs = 7200
only_deterministic = true
max_size_bytes = 2097152

[features.response_caching.semantic]
enabled = true
similarity_threshold = 0.95
top_k = 1

[features.response_caching.semantic.embedding]
provider = "openai"
model = "text-embedding-3-small"
dimensions = 1536

[features.response_caching.semantic.vector_backend]
type = "pgvector"
table_name = "semantic_cache_embeddings"
index_type = "hnsw"
distance_metric = "cosine"
```

### With Semantic Caching (Qdrant)

```toml
[features.response_caching]
enabled = true
ttl_secs = 3600
only_deterministic = true

[features.response_caching.semantic]
enabled = true
similarity_threshold = 0.92
top_k = 3

[features.response_caching.semantic.embedding]
provider = "openai"
model = "text-embedding-3-large"
dimensions = 3072

[features.response_caching.semantic.vector_backend]
type = "qdrant"
url = "http://qdrant:6333"
api_key = "${QDRANT_API_KEY}"
collection_name = "semantic_cache"
distance_metric = "cosine"
```

## Similarity Threshold Guidelines

| Threshold | Use Case |
|-----------|----------|
| 0.98-1.0 | Very strict, nearly identical queries |
| 0.95-0.97 | Recommended for most use cases |
| 0.92-0.94 | More permissive, catches rephrased queries |
| < 0.92 | May return semantically different results |

Lower thresholds increase cache hit rates but may return cached responses for semantically different queries. Start with 0.95 and adjust based on your use case.

## Cache Behavior

1. **Exact match**: the request hash is checked first (fastest).
2. **Semantic match**: if there is no exact hit and semantic caching is enabled, an embedding similarity search runs.
3. **Cache miss**: the request is forwarded to the LLM and the response is stored with its hash and embedding. The full sequence is sketched below.
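In illustrative Python (this is not Hadrian's implementation; the stores and callables are stand-ins, and the `only_deterministic` and `max_size_bytes` checks are omitted for brevity; `cache_key` is the hashing sketch from earlier on this page):

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
    return dot / norm

def lookup(request, config, embed, complete, exact_store, semantic_store):
    """Cache lookup in the documented order.

    exact_store:    dict of SHA-256 key -> cached response
    semantic_store: list of (embedding, response) pairs
    embed/complete: callables standing in for the embedding provider
                    and the upstream LLM
    """
    key = cache_key(request, config["key_components"])  # SHA-256, as above

    # 1. Exact match: cheapest check, always first.
    if key in exact_store:
        return exact_store[key]

    # 2. Semantic match: embedding similarity search, if enabled.
    sem = config["semantic"]
    query_vec = embed(request) if sem["enabled"] else None
    if sem["enabled"]:
        best = sorted(
            ((cosine(query_vec, vec), resp) for vec, resp in semantic_store),
            key=lambda pair: pair[0],
            reverse=True,
        )[: sem["top_k"]]
        for score, resp in best:
            if score >= sem["similarity_threshold"]:
                return resp

    # 3. Cache miss: forward to the LLM, store hash and embedding.
    response = complete(request)
    exact_store[key] = response
    if sem["enabled"]:
        semantic_store.append((query_vec, response))
    return response
```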

**Force refresh**: clients can bypass the cache with a `Cache-Control: no-cache` header.
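For example, with the `requests` library (the gateway URL, endpoint path, and model are placeholders; adjust them to your deployment):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder gateway URL
    headers={
        "Authorization": "Bearer <your-hadrian-api-key>",  # placeholder
        "Cache-Control": "no-cache",  # skip the cache, always hit the LLM
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "What time is it?"}],
    },
)
print(resp.json())
```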
