Hadrian is experimental alpha software. Do not use in production.

File Processing

Configure document processing for RAG (chunking, OCR, virus scanning)

The [features.file_processing] section controls how uploaded files are processed when added to vector stores. This includes text extraction, chunking, and embedding generation.

Configuration Reference

Main Settings

[features.file_processing]
mode = "inline"
max_file_size_mb = 10
max_concurrent_tasks = 4
default_max_chunk_tokens = 800
default_overlap_tokens = 200
stale_processing_timeout_secs = 1800

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | "inline" | Processing mode: inline or queue |
| max_file_size_mb | integer | 10 | Maximum file size in MB |
| max_concurrent_tasks | integer | 4 | Concurrent processing tasks (inline mode) |
| default_max_chunk_tokens | integer | 800 | Default chunk size in tokens |
| default_overlap_tokens | integer | 200 | Token overlap between chunks |
| stale_processing_timeout_secs | integer | 1800 | Timeout for detecting stuck files (30 min) |
| callback_url | string | none | Completion callback URL (queue mode) |

Processing Modes

Inline Mode

Process files synchronously within the gateway process:

[features.file_processing]
mode = "inline"
max_concurrent_tasks = 4

Best for: Small deployments, files under 10MB, low volume.

Queue Mode

Publish processing jobs to an external queue for worker processes:

[features.file_processing]
mode = "queue"
callback_url = "http://localhost:8080/api/internal/file-callback"

[features.file_processing.queue]
backend = "redis"
url = "redis://localhost:6379"
queue_name = "hadrian_file_processing"
consumer_group = "hadrian_workers"

Best for: Production deployments, large files, high volume.

Queue Configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| backend | string | required | redis, rabbitmq, sqs, or pubsub |
| url | string | required | Queue connection URL |
| queue_name | string | "hadrian_file_processing" | Queue/topic name |
| consumer_group | string | "hadrian_workers" | Consumer group (Redis Streams) |
| region | string | none | AWS region (SQS only) |
| project_id | string | none | GCP project ID (Pub/Sub only) |

Backend-specific URL formats:

# Redis
url = "redis://localhost:6379"

# RabbitMQ
url = "amqp://guest:guest@localhost:5672"

# AWS SQS
url = "https://sqs.us-east-1.amazonaws.com/123456789/queue-name"
region = "us-east-1"

# Google Cloud Pub/Sub
url = "projects/my-project/topics/file-processing"
project_id = "my-project"
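Putting these pieces together, a queue-mode configuration for AWS SQS might look like the sketch below. It uses only the keys documented above; the queue URL, account number, and callback host are placeholders rather than values from a real deployment.

```toml
[features.file_processing]
mode = "queue"
# Placeholder host; point this at your gateway's callback endpoint.
callback_url = "http://gateway:8080/api/internal/file-callback"

[features.file_processing.queue]
backend = "sqs"
# Placeholder queue URL; substitute your own account and queue name.
url = "https://sqs.us-east-1.amazonaws.com/123456789/queue-name"
region = "us-east-1"
```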

Document Extraction

Configure OCR and PDF processing with [features.file_processing.document_extraction]:

[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"
force_ocr = false
pdf_extract_images = true
pdf_image_dpi = 300

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enable_ocr | boolean | false | Enable OCR for scanned documents |
| force_ocr | boolean | false | OCR even when text layer exists |
| ocr_language | string | "eng" | Tesseract language code (usually ISO 639-3; some, such as chi_sim, are Tesseract-specific) |
| pdf_extract_images | boolean | false | Extract and OCR images from PDFs |
| pdf_image_dpi | integer | 300 | DPI for PDF image extraction |

Common OCR language codes:

| Code | Language |
| --- | --- |
| eng | English |
| fra | French |
| deu | German |
| spa | Spanish |
| chi_sim | Simplified Chinese |
| jpn | Japanese |
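As an illustration, a deployment that mostly ingests scanned French documents could combine these settings as follows. This is a sketch using only the keys documented above; it assumes the matching Tesseract language data (for example the tesseract-ocr-fra package on Debian-based systems) is installed.

```toml
[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "fra"        # French, from the language code list
force_ocr = false           # do not force OCR on files that already have a text layer
pdf_extract_images = true   # also extract and OCR images embedded in PDFs
```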

OCR requires Tesseract installed on the system:

- Linux: apt install tesseract-ocr tesseract-ocr-eng
- macOS: brew install tesseract
- Windows: Install from UB-Mannheim/tesseract

Virus Scanning

Configure ClamAV virus scanning with [features.file_processing.virus_scan]:

[features.file_processing.virus_scan]
enabled = true
backend = "clamav"

[features.file_processing.virus_scan.clamav]
host = "localhost"
port = 3310
timeout_ms = 30000
max_file_size_mb = 25
# socket_path = "/var/run/clamav/clamd.sock"  # Alternative to TCP

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable virus scanning |
| backend | string | "clamav" | Only clamav is supported |

ClamAV settings:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| host | string | "localhost" | clamd host |
| port | integer | 3310 | clamd port |
| timeout_ms | integer | 30000 | Scan timeout in milliseconds |
| max_file_size_mb | integer | 25 | Maximum scannable file size |
| socket_path | string | none | Unix socket path (alternative to TCP) |
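If clamd runs on the same host as the gateway, the Unix socket can be used instead of TCP. The following is a minimal sketch, assuming host and port may be omitted when socket_path is set (the reference only states that socket_path is an alternative to TCP). The socket path is the one shown in the commented-out line of the example above.

```toml
[features.file_processing.virus_scan]
enabled = true
backend = "clamav"

[features.file_processing.virus_scan.clamav]
# Assumption: socket_path replaces host/port rather than supplementing them.
socket_path = "/var/run/clamav/clamd.sock"
timeout_ms = 30000
max_file_size_mb = 25
```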

Retry Configuration

[features.file_processing.retry]
enabled = true
max_retries = 3
initial_delay_ms = 100
max_delay_ms = 10000
backoff_multiplier = 2.0
jitter = 0.1

Circuit Breaker

[features.file_processing.circuit_breaker]
enabled = true
failure_threshold = 5
failure_window_secs = 60
recovery_timeout_secs = 30

Complete Examples

Development Setup

[features.file_processing]
mode = "inline"
max_file_size_mb = 10
max_concurrent_tasks = 2
default_max_chunk_tokens = 800
default_overlap_tokens = 200

Production with OCR

[features.file_processing]
mode = "inline"
max_file_size_mb = 50
max_concurrent_tasks = 8
default_max_chunk_tokens = 800
default_overlap_tokens = 200
stale_processing_timeout_secs = 3600

[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"
force_ocr = false
pdf_extract_images = true
pdf_image_dpi = 300

[features.file_processing.virus_scan]
enabled = true
backend = "clamav"

[features.file_processing.virus_scan.clamav]
host = "clamav"
port = 3310
timeout_ms = 60000
max_file_size_mb = 50

[features.file_processing.retry]
enabled = true
max_retries = 5
initial_delay_ms = 200
max_delay_ms = 30000
backoff_multiplier = 2.0

[features.file_processing.circuit_breaker]
enabled = true
failure_threshold = 10
failure_window_secs = 120
recovery_timeout_secs = 60

Queue-Based Processing (Redis)

[features.file_processing]
mode = "queue"
max_file_size_mb = 100
callback_url = "http://gateway:8080/api/internal/file-callback"

[features.file_processing.queue]
backend = "redis"
url = "redis://redis:6379"
queue_name = "hadrian_file_processing"
consumer_group = "hadrian_workers"

[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"

Stale File Detection

Files stuck in in_progress status longer than stale_processing_timeout_secs are considered stale. Re-adding the file to a vector store will reset it for re-processing.

Set stale_processing_timeout_secs to 0 to disable stale detection (not recommended).
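For example, to treat files as stale after 15 minutes instead of the default 30, a sketch using the key from the main settings:

```toml
[features.file_processing]
# 900 seconds = 15 minutes; the default is 1800 (30 minutes).
stale_processing_timeout_secs = 900
```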
