Error Handling
AuroraSOC implements multiple layers of error handling to ensure resilience in a security-critical environment.
When to Use This Page
Use this page when implementing API clients, worker consumers, or integration services that must remain robust during dependency failures and traffic spikes.
Prerequisites
- Familiarity with HTTP status code classes
- Basic async retry/backoff patterns
- Understanding of AuroraSOC auth and permission model
Error Handling Philosophy
Fail fast on client errors, retry transient faults with bounded backoff, and degrade gracefully when dependencies are unavailable rather than taking the whole service down.
API Error Response Format
All API errors return a consistent JSON structure:
```json
{
  "detail": "Human-readable error message"
}
```
For validation errors (Pydantic), FastAPI returns:
```json
{
  "detail": [
    {
      "loc": ["body", "severity"],
      "msg": "String should match pattern '^(critical|high|medium|low|info)$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
```
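A client-side helper can normalize both response shapes into displayable messages. The sketch below is illustrative, not part of any AuroraSOC SDK:

```python
def summarize_error(payload: dict) -> list[str]:
    """Normalize both AuroraSOC error shapes into human-readable strings."""
    detail = payload.get("detail", "Unknown error")
    if isinstance(detail, str):
        # Simple errors: {"detail": "..."}
        return [detail]
    # Validation errors: {"detail": [{"loc": [...], "msg": ..., "type": ...}, ...]}
    return [
        f"{'.'.join(str(part) for part in item.get('loc', []))}: {item.get('msg', '')}"
        for item in detail
    ]
```

This keeps display logic in one place regardless of which endpoint produced the error.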
Error Codes Reference
| Code | Meaning | When It Occurs |
|---|---|---|
| 400 | Bad Request | Invalid input, validation failure |
| 401 | Unauthorized | Missing/expired JWT, invalid API key |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource (e.g., duplicate IOC type+value) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Unhandled exception |
| 503 | Service Unavailable | Database down, degraded mode |
Retry Decision Matrix
| Error Class | Retry? | Backoff Strategy | Notes |
|---|---|---|---|
| Validation (400) | No | None | Fix the request contract first |
| Authentication (401) | Conditional | Immediate token refresh + 1 retry | Prevent endless token loops |
| Authorization (403) | No | None | Permission issue, not transient |
| Not Found (404) | Conditional | Short retry only for eventual-consistency paths | Most paths should not retry |
| Conflict (409) | Conditional | None or short retry | Often indicates an idempotency conflict |
| Rate Limit (429) | Yes | Exponential + jitter | Respect server guidance when available |
| Service Unavailable (503) | Yes | Exponential + capped retry | Mark operation as deferred if persistent |
| Internal (500) | Yes (bounded) | Exponential + telemetry | Escalate after retry budget exhausted |
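The matrix can be encoded as a small classifier that clients consult before retrying. This is a hedged sketch with illustrative names, not AuroraSOC's client library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    reason: str


def classify(status: int, *, refreshed_token: bool = False) -> RetryDecision:
    """Map an HTTP status to a retry decision, following the matrix above."""
    if status == 401:
        # Retry exactly once after a token refresh; never loop on auth failures.
        return RetryDecision(not refreshed_token, "refresh token, then retry once")
    if status in (429, 503):
        return RetryDecision(True, "transient; exponential backoff with jitter")
    if status == 500:
        return RetryDecision(True, "bounded retries, then escalate")
    # 400/403 are never retryable; 404/409 are treated as non-retryable
    # by default, with opt-in retries for eventual-consistency paths.
    return RetryDecision(False, "not transient; fix the request or permissions")
```

Centralizing the decision keeps every caller consistent with the matrix instead of scattering ad hoc status checks.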
Database Unavailable Handling
The most critical error scenario is PostgreSQL being down. AuroraSOC handles this gracefully:
```python
class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""
    pass


@app.exception_handler(DatabaseUnavailable)
async def _handle_db_unavailable(request, exc):
    return JSONResponse(
        status_code=503,
        content={"detail": "Database unavailable — running in degraded mode"},
    )
```
In degraded mode, the API serves data from in-memory stores populated with demo data. This ensures the dashboard remains functional for monitoring even during database maintenance.
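A repository-level sketch of this fallback, with hypothetical names (AuroraSOC's actual data layer may differ):

```python
class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""


# Illustrative demo data; the real in-memory store is richer.
DEMO_CASES = [{"id": "demo-1", "severity": "info", "title": "Demo case"}]


class CaseRepository:
    def __init__(self, db=None):
        self.db = db  # None simulates an unreachable database

    def list_cases(self) -> list[dict]:
        try:
            if self.db is None:
                raise DatabaseUnavailable("postgres unreachable")
            return self.db.fetch_cases()
        except DatabaseUnavailable:
            # Degraded mode: serve demo data so the dashboard stays usable.
            return list(DEMO_CASES)
```

Returning a copy of the demo data keeps callers from mutating the shared fallback store.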
Circuit Breaker Pattern
The orchestrator uses a circuit breaker when dispatching to agent services:
```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed | open | half-open
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitBreakerOpen("Agent service unavailable")
        try:
            result = await func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
```
Retry with Exponential Backoff
Agent dispatch includes automatic retry:
```python
async def dispatch_with_retry(self, agent_type, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await self.circuit_breaker.call(
                self._dispatch, agent_type, prompt
            )
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, then 2s (the final attempt raises instead of sleeping)
            await asyncio.sleep(wait)
```
Production-safe retry wrapper example
```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def retry_with_jitter(
    operation: Callable[[], Awaitable[dict]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> dict:
    """Retry transient failures with bounded exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Jitter reduces synchronized retry storms across many clients.
            raw_delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            delay = raw_delay * (0.7 + random.random() * 0.6)
            await asyncio.sleep(delay)
    raise RuntimeError("retry loop exited unexpectedly")
```
Rate Limiting Errors
When rate limits are exceeded:
```json
{
  "detail": "Rate limit exceeded. Try again in 42 seconds."
}
```
Rate limits use Redis sliding window counters:
| Category | Limit | Window |
|---|---|---|
| General API | 100 requests | 1 minute |
| Investigations | 10 requests | 1 minute |
| Playbook execution | 5 requests | 1 minute |
Structured Logging for Errors
All errors are logged via structlog with OpenTelemetry trace context:
```python
logger.error(
    "agent_dispatch_failed",
    agent_type="threat_hunter",
    error=str(e),
    case_id=case_id,
    attempt=attempt,
    # Format the OpenTelemetry trace ID as hex so it matches Jaeger's display.
    trace_id=format(get_current_span().get_span_context().trace_id, "032x"),
)
```
This produces structured JSON logs that can be correlated with traces in Jaeger:
```json
{
  "event": "agent_dispatch_failed",
  "agent_type": "threat_hunter",
  "error": "Connection refused",
  "case_id": "550e8400-...",
  "attempt": 2,
  "trace_id": "abc123...",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error"
}
```
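Without pulling in structlog, the shape of such an event can be sketched as a plain dict that a JSON renderer would then serialize; `make_error_event` is an illustrative stand-in, not the structlog API:

```python
from datetime import datetime, timezone


def make_error_event(event: str, **fields) -> dict:
    """Build a structured error event in the shape shown above."""
    return {
        "event": event,
        **fields,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
    }
```

Because every field is a top-level key rather than text interpolated into a message, log aggregators can filter on `agent_type`, `case_id`, or `trace_id` directly.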
WebSocket Error Handling
WebSocket connections handle errors silently to avoid disrupting other clients:
```python
async def broadcast(self, data: dict) -> None:
    dead: list[str] = []
    for cid, ws in self.active.items():
        try:
            await ws.send_json(data)
        except Exception:
            dead.append(cid)  # Mark for removal
    for cid in dead:
        self.active.pop(cid, None)  # Clean up dead connections
```
Failed sends do not raise; the dead connection is silently removed and broadcasting continues to the healthy clients.
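The cleanup behavior can be exercised in isolation with stand-in sockets. `FlakyWS` and `Broadcaster` here are illustrative test doubles, not AuroraSOC classes:

```python
import asyncio


class FlakyWS:
    """Stand-in WebSocket that either records sends or always fails."""

    def __init__(self, healthy: bool):
        self.healthy = healthy
        self.sent: list[dict] = []

    async def send_json(self, data: dict) -> None:
        if not self.healthy:
            raise ConnectionError("client went away")
        self.sent.append(data)


class Broadcaster:
    def __init__(self):
        self.active: dict[str, FlakyWS] = {}

    async def broadcast(self, data: dict) -> None:
        dead: list[str] = []
        for cid, ws in self.active.items():
            try:
                await ws.send_json(data)
            except Exception:
                dead.append(cid)  # mark, don't mutate while iterating
        for cid in dead:
            self.active.pop(cid, None)


b = Broadcaster()
b.active = {"a": FlakyWS(healthy=True), "b": FlakyWS(healthy=False)}
asyncio.run(b.broadcast({"event": "ping"}))
```

After the broadcast, the failed connection is gone from `active` while the healthy one received the message; deferring removal until after the loop avoids mutating the dict during iteration.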
Event Bus Error Handling
Redis Streams consumers use acknowledgment-based error handling:
```python
async for msg_id, data in consumer.consume():
    try:
        await handler(data)
        await consumer.ack(msg_id)  # Success: acknowledge
    except Exception:
        pass  # No ack: message stays in the PEL and will be re-delivered
```
Unacknowledged messages remain in the Redis Stream's pending entries list (PEL). Redis Streams has no automatic visibility timeout; pending messages are re-delivered when a consumer reclaims them (e.g., via XAUTOCLAIM) once their idle time exceeds a configured threshold.
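The at-least-once consequence of this design can be modeled with a toy pending list; this is a conceptual sketch, not the Redis API. Because re-delivery means a message can be processed more than once, handlers must be idempotent:

```python
class PendingList:
    """Toy model of at-least-once delivery: unacked messages get re-delivered."""

    def __init__(self, messages: list[tuple[str, dict]]):
        self.pending = dict(messages)
        self.acked: set[str] = set()

    def deliver(self) -> list[tuple[str, dict]]:
        # Everything not yet acknowledged is delivered again.
        return [(mid, d) for mid, d in self.pending.items() if mid not in self.acked]

    def ack(self, msg_id: str) -> None:
        self.acked.add(msg_id)


pel = PendingList([("1-0", {"a": 1}), ("1-1", {"a": 2})])
first = pel.deliver()   # both messages delivered
pel.ack("1-0")
second = pel.deliver()  # only the unacked message comes back
```

A consumer that crashes after `handler(data)` but before `ack` will see the same message again on restart, which is exactly the duplicate-processing edge case listed in this page.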
Edge Cases to Handle Explicitly
- Token refresh races across concurrent requests
- Retry storms after upstream recovery
- Duplicate processing when consumers restart before acknowledgment
- Mixed success in batch operations (partial commit semantics)
- Long-running operations that outlive client-side timeouts
Do not log secrets, raw credentials, or full bearer tokens in error messages. Log references, hashes, and trace IDs instead.
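For example, a credential can be reduced to a stable fingerprint before logging; `token_fingerprint` is an illustrative helper, not an AuroraSOC utility:

```python
import hashlib


def token_fingerprint(token: str) -> str:
    """Return a loggable reference to a secret without revealing it."""
    # A truncated SHA-256 digest is stable per token, so repeated failures
    # for the same credential can be correlated without logging the secret.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]
    return f"sha256:{digest}"
```

Logging `token_fingerprint(token)` instead of the raw bearer token keeps error records correlatable while keeping secrets out of log storage.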