# Error Handling
AuroraSOC implements multiple layers of error handling to ensure resilience in a security-critical environment.
## API Error Response Format
All API errors return a consistent JSON structure:
```json
{
  "detail": "Human-readable error message"
}
```
For validation errors (Pydantic), FastAPI returns:
```json
{
  "detail": [
    {
      "loc": ["body", "severity"],
      "msg": "String should match pattern '^(critical|high|medium|low|info)$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
```
## Error Codes Reference
| Code | Meaning | When It Occurs |
|---|---|---|
| 400 | Bad Request | Invalid input, validation failure |
| 401 | Unauthorized | Missing/expired JWT, invalid API key |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource (e.g., duplicate IOC type+value) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Unhandled exception |
| 503 | Service Unavailable | Database down, degraded mode |
## Database Unavailable Handling
The most critical error scenario is PostgreSQL being down. AuroraSOC handles this gracefully:
```python
from fastapi.responses import JSONResponse

class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""

@app.exception_handler(DatabaseUnavailable)
async def _handle_db_unavailable(request, exc):
    return JSONResponse(
        status_code=503,
        content={"detail": "Database unavailable — running in degraded mode"},
    )
```
In degraded mode, the API serves data from in-memory stores populated with demo data. This ensures the dashboard remains functional for monitoring even during database maintenance.
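A minimal sketch of that fallback, with hypothetical names (`CaseRepository`, `list_cases`) and `db=None` standing in for an unreachable connection pool:

```python
class CaseRepository:
    """Serves cases from PostgreSQL, falling back to an in-memory demo
    store when the database is unreachable (degraded mode)."""

    def __init__(self, db=None, demo_cases=None):
        self._db = db  # e.g. an asyncpg pool; None simulates an outage
        self._demo = demo_cases or [{"id": "demo-1", "severity": "info"}]
        self.degraded = False

    async def list_cases(self):
        if self._db is None:
            self.degraded = True     # flag degraded mode for callers
            return list(self._demo)  # serve demo data instead of failing reads
        return await self._db.fetch("SELECT * FROM cases")
```

Reads stay available from the demo store; writes would still raise `DatabaseUnavailable` and surface as 503s.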
## Circuit Breaker Pattern
The orchestrator uses a circuit breaker when dispatching to agent services:
```python
import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed | open | half-open
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # let one probe request through
            else:
                raise CircuitBreakerOpen("Agent service unavailable")
        try:
            result = await func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"  # probe succeeded: resume normal operation
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"  # stop calling the failing service
            raise
```
## Retry with Exponential Backoff
Agent dispatch includes automatic retry:
```python
import asyncio

async def dispatch_with_retry(self, agent_type, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await self.circuit_breaker.call(
                self._dispatch, agent_type, prompt
            )
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last failure
            wait = 2 ** attempt  # exponential backoff: 1s, then 2s
            await asyncio.sleep(wait)
```
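A fixed 1s/2s schedule can synchronize retries across many workers hitting the same recovering service. A common refinement (not shown in AuroraSOC's code above) is "full jitter", where each worker sleeps a random fraction of the capped exponential delay:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay between 0 and
    min(cap, base * 2**attempt). Randomizing the wait spreads
    retries out and avoids thundering-herd effects."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

The retry loop would then call `await asyncio.sleep(backoff_delay(attempt))` instead of sleeping a fixed `2 ** attempt` seconds.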
## Rate Limiting Errors
When rate limits are exceeded:
```json
{
  "detail": "Rate limit exceeded. Try again in 42 seconds."
}
```
Rate limits use Redis sliding window counters:
| Category | Limit | Window |
|---|---|---|
| General API | 100 requests | 1 minute |
| Investigations | 10 requests | 1 minute |
| Playbook execution | 5 requests | 1 minute |
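The sliding-window check can be illustrated with an in-memory equivalent (production uses Redis; the class and method names here are hypothetical):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """In-memory stand-in for the Redis sliding-window counter: allow at
    most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._hits = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits.setdefault(key, deque())
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # evict timestamps outside the window
        if len(hits) >= self.limit:
            return False  # caller should return 429 with a retry hint
        hits.append(now)
        return True
```

The Redis version keeps the same timestamps in a sorted set per key, so the window slides continuously rather than resetting at fixed boundaries.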
## Structured Logging for Errors
All errors are logged via structlog with OpenTelemetry trace context:
```python
logger.error(
    "agent_dispatch_failed",
    agent_type="threat_hunter",
    error=str(e),
    case_id=case_id,
    attempt=attempt,
    # trace_id is an int; format it as the 32-char hex string Jaeger shows
    trace_id=format(get_current_span().get_span_context().trace_id, "032x"),
)
```
This produces structured JSON logs that can be correlated with traces in Jaeger:
```json
{
  "event": "agent_dispatch_failed",
  "agent_type": "threat_hunter",
  "error": "Connection refused",
  "case_id": "550e8400-...",
  "attempt": 2,
  "trace_id": "abc123...",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error"
}
```
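The shape of that log line can be reproduced with the standard library alone (structlog's JSON renderer does the real work; `log_error` here is a hypothetical sketch, not AuroraSOC's logger):

```python
import json
import datetime

def log_error(event, **fields):
    """Emit one structured JSON log line: the event name first, then the
    caller's key-value context, then timestamp and level."""
    record = {
        "event": event,
        **fields,
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": "error",
    }
    return json.dumps(record)
```

Because every field is a top-level JSON key, log aggregators can filter on `agent_type` or join on `trace_id` without parsing message strings.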
## WebSocket Error Handling
WebSocket connections handle errors silently to avoid disrupting other clients:
```python
async def broadcast(self, data: dict) -> None:
    dead: list[str] = []
    for cid, ws in self.active.items():
        try:
            await ws.send_json(data)
        except Exception:
            dead.append(cid)  # mark for removal
    for cid in dead:
        self.active.pop(cid, None)  # clean up dead connections
```
Failed sends don't raise — they silently remove the dead connection and continue broadcasting to healthy clients.
## Event Bus Error Handling
Redis Streams consumers use acknowledgment-based error handling:
```python
async for msg_id, data in consumer.consume():
    try:
        await handler(data)
        await consumer.ack(msg_id)  # success: acknowledge
    except Exception:
        pass  # message stays pending and will be re-delivered
```
Unacknowledged messages remain in the Redis Stream's pending entries list (PEL). They are re-delivered when a consumer re-reads its own pending entries, or when another consumer claims entries whose idle time exceeds a threshold via `XCLAIM`/`XAUTOCLAIM`.
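The PEL semantics can be modeled in a few lines. This is a toy model of the behavior, not the redis-py API:

```python
class PendingEntriesList:
    """Tiny model of a consumer group's PEL: messages that were delivered
    but not yet acknowledged, reclaimable once they have sat idle too long."""

    def __init__(self):
        self._pending = {}  # msg_id -> time of last delivery

    def deliver(self, msg_id, now):
        self._pending[msg_id] = now  # XREADGROUP adds the entry to the PEL

    def ack(self, msg_id):
        self._pending.pop(msg_id, None)  # XACK removes it from the PEL

    def reclaimable(self, min_idle, now):
        # Mirrors XAUTOCLAIM: entries idle for at least min_idle seconds
        # are eligible to be claimed and re-processed by another consumer.
        return [m for m, t in self._pending.items() if now - t >= min_idle]
```

Acknowledged messages disappear immediately; everything else eventually shows up in `reclaimable`, which is why the consumer loop above can safely swallow handler exceptions.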