File Processing
Configure document processing for RAG (chunking, OCR, virus scanning)
The [features.file_processing] section controls how uploaded files are processed when added to vector stores. This includes text extraction, chunking, and embedding generation.
Configuration Reference
Main Settings
[features.file_processing]
mode = "inline"
max_file_size_mb = 10
max_concurrent_tasks = 4
default_max_chunk_tokens = 800
default_overlap_tokens = 200
stale_processing_timeout_secs = 1800
| Key | Type | Default | Description |
|---|---|---|---|
| mode | string | "inline" | Processing mode: inline or queue |
| max_file_size_mb | integer | 10 | Maximum file size in MB |
| max_concurrent_tasks | integer | 4 | Concurrent processing tasks (inline mode) |
| default_max_chunk_tokens | integer | 800 | Default chunk size in tokens |
| default_overlap_tokens | integer | 200 | Token overlap between chunks |
| stale_processing_timeout_secs | integer | 1800 | Timeout for detecting stuck files (30 min) |
| callback_url | string | none | Completion callback URL (queue mode) |
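To make the interaction between default_max_chunk_tokens and default_overlap_tokens concrete, the sketch below computes token ranges for a plain sliding window. It is only an illustration under that assumption: the function name chunk_spans is hypothetical, and the gateway's actual chunker may split differently (for example, on sentence or paragraph boundaries).

```python
def chunk_spans(n_tokens, max_chunk_tokens=800, overlap_tokens=200):
    """Return (start, end) token ranges for a simple sliding-window chunker.

    Assumption: each chunk starts (max_chunk_tokens - overlap_tokens) tokens
    after the previous one, so consecutive chunks share overlap_tokens tokens.
    """
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    if not 0 <= overlap_tokens < max_chunk_tokens:
        raise ValueError("overlap_tokens must be in [0, max_chunk_tokens)")

    stride = max_chunk_tokens - overlap_tokens
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_chunk_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk reached the end of the document
        start += stride
    return spans
```

With the defaults, a 2000-token document would be covered by three chunks starting at tokens 0, 600, and 1200. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.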
Processing Modes
Inline Mode
Process files synchronously within the gateway process:
[features.file_processing]
mode = "inline"
max_concurrent_tasks = 4
Best for: Small deployments, files under 10 MB, low volume.
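The effect of max_concurrent_tasks can be pictured as a bounded worker pool. The following is a minimal sketch using an asyncio semaphore; it is not the gateway's implementation, and process_all, process_one, and demo are hypothetical names introduced for illustration.

```python
import asyncio


async def process_all(files, process_one, max_concurrent_tasks=4):
    """Run process_one(file) for every file, at most max_concurrent_tasks at once."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(file):
        # Wait for a free slot before starting work on this file.
        async with semaphore:
            return await process_one(file)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(f) for f in files))


async def demo():
    """Process ten fake files and record the peak number running at once."""
    running = 0
    peak = 0

    async def fake_process(name):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for extraction and embedding work
        running -= 1
        return name.upper()

    names = [f"file{i}.pdf" for i in range(10)]
    results = await process_all(names, fake_process)
    return results, peak
```

Running asyncio.run(demo()) processes all ten files while never exceeding the configured limit, which is the trade-off this setting controls: higher values raise throughput but also memory and CPU pressure inside the gateway process.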
Queue Mode
Publish processing jobs to an external queue for worker processes:
[features.file_processing]
mode = "queue"
callback_url = "http://localhost:8080/api/internal/file-callback"
[features.file_processing.queue]
backend = "redis"
url = "redis://localhost:6379"
queue_name = "hadrian_file_processing"
consumer_group = "hadrian_workers"
Best for: Production deployments, large files, high volume.
Queue Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| backend | string | required | redis, rabbitmq, sqs, or pubsub |
| url | string | required | Queue connection URL |
| queue_name | string | "hadrian_file_processing" | Queue/topic name |
| consumer_group | string | "hadrian_workers" | Consumer group (Redis Streams) |
| region | string | none | AWS region (SQS only) |
| project_id | string | none | GCP project ID (Pub/Sub only) |
Backend-specific URL formats:
# Redis
url = "redis://localhost:6379"
# RabbitMQ
url = "amqp://guest:guest@localhost:5672"
# AWS SQS
url = "https://sqs.us-east-1.amazonaws.com/123456789/queue-name"
region = "us-east-1"
# Google Cloud Pub/Sub
url = "projects/my-project/topics/file-processing"
project_id = "my-project"
Document Extraction
Configure OCR and PDF processing with [features.file_processing.document_extraction]:
[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"
force_ocr = false
pdf_extract_images = true
pdf_image_dpi = 300
| Key | Type | Default | Description |
|---|---|---|---|
| enable_ocr | boolean | false | Enable OCR for scanned documents |
| force_ocr | boolean | false | OCR even when text layer exists |
| ocr_language | string | "eng" | Tesseract language code (ISO 639-3) |
| pdf_extract_images | boolean | false | Extract and OCR images from PDFs |
| pdf_image_dpi | integer | 300 | DPI for PDF image extraction |
Common OCR language codes:
| Code | Language |
|---|---|
| eng | English |
| fra | French |
| deu | German |
| spa | Spanish |
| chi_sim | Simplified Chinese |
| jpn | Japanese |
OCR requires Tesseract installed on the system:
- Linux: apt install tesseract-ocr tesseract-ocr-eng
- macOS: brew install tesseract
- Windows: Install from UB-Mannheim/tesseract
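Before setting ocr_language, it can help to confirm that Tesseract is on the PATH and that the language pack is installed. The sketch below shells out to tesseract --list-langs and parses the result; the helper names are hypothetical and the script is a standalone check, not part of the gateway.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def available_ocr_languages():
    """Return installed Tesseract language codes, or [] if tesseract is unavailable."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    try:
        proc = subprocess.run(
            [exe, "--list-langs"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return []
    # Some Tesseract releases write the list to stderr instead of stdout.
    return parse_list_langs(proc.stdout or proc.stderr)
```

For example, `"fra" in available_ocr_languages()` indicates whether ocr_language = "fra" is likely to work on the current host.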
Virus Scanning
Configure ClamAV virus scanning with [features.file_processing.virus_scan]:
[features.file_processing.virus_scan]
enabled = true
backend = "clamav"
[features.file_processing.virus_scan.clamav]
host = "localhost"
port = 3310
timeout_ms = 30000
max_file_size_mb = 25
# socket_path = "/var/run/clamav/clamd.sock" # Alternative to TCP
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable virus scanning |
| backend | string | "clamav" | Only clamav is supported |
ClamAV settings:
| Key | Type | Default | Description |
|---|---|---|---|
| host | string | "localhost" | clamd host |
| port | integer | 3310 | clamd port |
| timeout_ms | integer | 30000 | Scan timeout in milliseconds |
| max_file_size_mb | integer | 25 | Maximum scannable file size |
| socket_path | string | none | Unix socket path (alternative to TCP) |
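A quick way to check that the host and port values reach a running clamd is to send its PING command and expect PONG. The helper below is a standalone connectivity check written for illustration (clamd_ping is a hypothetical name); it covers TCP only and does not reflect how the gateway itself talks to ClamAV.

```python
import socket


def clamd_ping(host="localhost", port=3310, timeout_ms=30000):
    """Return True if a clamd daemon at host:port answers PING with PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout_ms / 1000) as sock:
            # The 'z' prefix marks a null-terminated clamd command.
            sock.sendall(b"zPING\0")
            reply = sock.recv(16)
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return reply.rstrip(b"\0") == b"PONG"
```

A False result from `clamd_ping("clamav", 3310)` usually points to a network or service problem rather than a scanning problem, which narrows down where to look when uploads fail with virus scanning enabled.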
Retry Configuration
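The keys in the block below read like a conventional exponential backoff with jitter. The sketch here shows one plausible interpretation, not confirmed behavior: the delay before retry n is initial_delay_ms multiplied by backoff_multiplier n times, capped at max_delay_ms, then scaled by a random factor within plus or minus the jitter fraction. The function name and the exact jitter semantics are assumptions.

```python
import random


def backoff_delays_ms(max_retries=3, initial_delay_ms=100, max_delay_ms=10000,
                      backoff_multiplier=2.0, jitter=0.1, rng=random.random):
    """Return the delay in milliseconds before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped at max_delay_ms.
        base = min(initial_delay_ms * backoff_multiplier ** attempt, max_delay_ms)
        # Assumed jitter semantics: random factor in [1 - jitter, 1 + jitter].
        factor = 1 + jitter * (2 * rng() - 1)
        delays.append(base * factor)
    return delays
```

Under this reading, the values below would give base delays of 100 ms, 200 ms, and 400 ms, each varied by up to 10%. Jitter spreads out retries so that many failed tasks do not hit a recovering dependency at the same instant.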
[features.file_processing.retry]
enabled = true
max_retries = 3
initial_delay_ms = 100
max_delay_ms = 10000
backoff_multiplier = 2.0
jitter = 0.1
Circuit Breaker
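The three thresholds in the block below match the usual circuit breaker pattern. As a rough sketch under that assumption: the breaker opens once failure_threshold failures occur within failure_window_secs, rejects work while open, and allows a trial request after recovery_timeout_secs. The class and its method names are hypothetical, and the gateway's real state machine may differ.

```python
import time


class CircuitBreaker:
    """Minimal sliding-window circuit breaker (illustrative only)."""

    def __init__(self, failure_threshold=5, failure_window_secs=60,
                 recovery_timeout_secs=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_secs = failure_window_secs
        self.recovery_timeout_secs = recovery_timeout_secs
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the recovery timeout has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.recovery_timeout_secs

    def record_success(self):
        # A success closes the circuit and forgets earlier failures.
        self.failures.clear()
        self.opened_at = None

    def record_failure(self):
        now = self.clock()
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures
                         if now - t < self.failure_window_secs]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now
```

The injectable clock argument exists only to make the sketch easy to test without sleeping.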
[features.file_processing.circuit_breaker]
enabled = true
failure_threshold = 5
failure_window_secs = 60
recovery_timeout_secs = 30
Complete Examples
Development Setup
[features.file_processing]
mode = "inline"
max_file_size_mb = 10
max_concurrent_tasks = 2
default_max_chunk_tokens = 800
default_overlap_tokens = 200
Production with OCR
[features.file_processing]
mode = "inline"
max_file_size_mb = 50
max_concurrent_tasks = 8
default_max_chunk_tokens = 800
default_overlap_tokens = 200
stale_processing_timeout_secs = 3600
[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"
force_ocr = false
pdf_extract_images = true
pdf_image_dpi = 300
[features.file_processing.virus_scan]
enabled = true
backend = "clamav"
[features.file_processing.virus_scan.clamav]
host = "clamav"
port = 3310
timeout_ms = 60000
max_file_size_mb = 50
[features.file_processing.retry]
enabled = true
max_retries = 5
initial_delay_ms = 200
max_delay_ms = 30000
backoff_multiplier = 2.0
[features.file_processing.circuit_breaker]
enabled = true
failure_threshold = 10
failure_window_secs = 120
recovery_timeout_secs = 60
Queue-Based Processing (Redis)
[features.file_processing]
mode = "queue"
max_file_size_mb = 100
callback_url = "http://gateway:8080/api/internal/file-callback"
[features.file_processing.queue]
backend = "redis"
url = "redis://redis:6379"
queue_name = "hadrian_file_processing"
consumer_group = "hadrian_workers"
[features.file_processing.document_extraction]
enable_ocr = true
ocr_language = "eng"
Stale File Detection
Files that remain in in_progress status longer than stale_processing_timeout_secs are considered stale. Re-adding a stale file to a vector store resets it for re-processing.
Set stale_processing_timeout_secs to 0 to disable stale detection (not recommended).
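The rule above can be expressed as a small predicate. This is a sketch only: the function name is_stale and the started_at timestamp are assumptions, since the page does not say which timestamp the gateway compares against.

```python
from datetime import datetime, timedelta, timezone


def is_stale(status, started_at, now, stale_processing_timeout_secs=1800):
    """Return True if a file has been in_progress longer than the timeout."""
    if stale_processing_timeout_secs == 0:
        return False  # 0 disables stale detection
    if status != "in_progress":
        return False  # only files still being processed can go stale
    return now - started_at > timedelta(seconds=stale_processing_timeout_secs)
```

For instance, with the default of 1800 seconds, a file that entered in_progress at 12:00 UTC would be treated as stale from 12:30 onward, so `is_stale("in_progress", started_at, datetime.now(timezone.utc))` can drive a periodic cleanup job.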
See Also
- File Search Configuration - Vector search settings
- Knowledge Bases Guide - Conceptual overview