
Error Handling

AuroraSOC implements multiple layers of error handling to ensure resilience in a security-critical environment.

Error Handling Philosophy

AuroraSOC prefers degradation over failure: the API keeps serving demo data when PostgreSQL is down, circuit breakers stop cascading agent failures, retries absorb transient errors, and dead WebSocket clients are dropped without disturbing the rest. The sections below cover each layer.

API Error Response Format

All API errors return a consistent JSON structure:

```json
{
  "detail": "Human-readable error message"
}
```

For validation errors (Pydantic), FastAPI returns:

```json
{
  "detail": [
    {
      "loc": ["body", "severity"],
      "msg": "String should match pattern '^(critical|high|medium|low|info)$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
```
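The severity pattern quoted in that message can be checked locally with Python's `re` module. This is a sketch for client-side pre-validation; the authoritative check is the Pydantic field in the AuroraSOC models:

```python
import re

# The same pattern quoted in the validation error above.
SEVERITY_PATTERN = re.compile(r"^(critical|high|medium|low|info)$")

def is_valid_severity(value: str) -> bool:
    """Return True if the value would pass the API's severity validation."""
    return SEVERITY_PATTERN.fullmatch(value) is not None

print(is_valid_severity("high"))    # True
print(is_valid_severity("urgent"))  # False
```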

Error Codes Reference

| Code | Meaning | When It Occurs |
| --- | --- | --- |
| 400 | Bad Request | Invalid input, validation failure |
| 401 | Unauthorized | Missing/expired JWT, invalid API key |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource (e.g., duplicate IOC type+value) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Unhandled exception |
| 503 | Service Unavailable | Database down, degraded mode |
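When writing clients against this API, only some of these codes are worth retrying. A hypothetical client-side helper (not part of the AuroraSOC API) might encode that rule of thumb:

```python
# Codes from the table above. 429 and 503 are transient by design,
# 500 may be transient; the 4xx client errors will not succeed on retry.
RETRYABLE_CODES = {429, 500, 503}

def should_retry(status_code: int) -> bool:
    """Hypothetical helper: True if re-sending the request may succeed."""
    return status_code in RETRYABLE_CODES

print(should_retry(429))  # True
print(should_retry(404))  # False
```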

Database Unavailable Handling

The most critical error scenario is PostgreSQL being down. AuroraSOC handles this gracefully:

```python
from fastapi.responses import JSONResponse

class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""

@app.exception_handler(DatabaseUnavailable)
async def _handle_db_unavailable(request, exc):
    return JSONResponse(
        status_code=503,
        content={"detail": "Database unavailable — running in degraded mode"},
    )
```

In degraded mode, the API serves data from in-memory stores populated with demo data. This ensures the dashboard remains functional for monitoring even during database maintenance.
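A minimal sketch of that fallback pattern (`DemoStore` and `fetch_case` are illustrative names, not AuroraSOC's actual classes; `db is None` stands in for a real connectivity check):

```python
class DemoStore:
    """In-memory fallback populated with demo data for degraded mode."""
    def __init__(self):
        self.cases = {
            "demo-1": {"id": "demo-1", "severity": "high", "title": "Demo case"},
        }

    def get_case(self, case_id):
        return self.cases.get(case_id)

_demo = DemoStore()

def fetch_case(case_id, db=None):
    """Serve from PostgreSQL when reachable, otherwise from demo data."""
    if db is None:  # stands in for a failed connectivity check
        return _demo.get_case(case_id)
    return db.get_case(case_id)

print(fetch_case("demo-1"))  # served from the in-memory demo store
```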

Circuit Breaker Pattern

The orchestrator uses a circuit breaker when dispatching to agent services:

```python
import time

class CircuitBreakerOpen(Exception):
    """Raised when the breaker rejects a call while open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed | open | half-open
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # allow one probe request through
            else:
                raise CircuitBreakerOpen("Agent service unavailable")

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        else:
            if self.state == "half-open":
                self.state = "closed"
            self.failure_count = 0  # any success resets the count
            return result
```
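To see the state machine in action, the following demo drives a condensed copy of the breaker (reproduced so the snippet runs standalone, with a deliberately short reset timeout) through closed, open, and back to closed:

```python
import asyncio
import time

# Condensed copy of the CircuitBreaker above so this demo is self-contained.
class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.last_failure_time = 0.0

    async def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitBreakerOpen("Agent service unavailable")
        try:
            result = await func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        self.state = "closed"
        self.failure_count = 0
        return result

async def main():
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=0.05)
    transitions = []

    async def failing():
        raise ConnectionError("agent down")

    async def healthy():
        return "ok"

    for _ in range(2):  # two failures trip the breaker
        try:
            await breaker.call(failing)
        except ConnectionError:
            pass
    transitions.append(breaker.state)

    try:  # while open, calls are rejected without reaching the agent
        await breaker.call(healthy)
    except CircuitBreakerOpen:
        transitions.append("rejected")

    await asyncio.sleep(0.06)  # wait out the reset timeout
    await breaker.call(healthy)  # the half-open probe succeeds
    transitions.append(breaker.state)
    return transitions

transitions = asyncio.run(main())
print(transitions)  # ['open', 'rejected', 'closed']
```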

Retry with Exponential Backoff

Agent dispatch includes automatic retry:

```python
import asyncio

async def dispatch_with_retry(self, agent_type, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await self.circuit_breaker.call(
                self._dispatch, agent_type, prompt
            )
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error
            wait = 2 ** attempt  # 1s after the first failure, 2s after the second
            await asyncio.sleep(wait)
```
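The `2 ** attempt` schedule can be computed up front. Adding jitter is a common refinement not shown in the code above; it keeps many clients that failed at the same moment from retrying in lockstep:

```python
import random

def backoff_schedule(max_retries: int = 3, jitter: bool = False) -> list[float]:
    """Delay before each retry attempt: 1s, 2s, 4s for the defaults."""
    delays = [float(2 ** attempt) for attempt in range(max_retries)]
    if jitter:
        # "Full jitter": sleep a random amount up to the computed delay.
        delays = [random.uniform(0, d) for d in delays]
    return delays

print(backoff_schedule())  # [1.0, 2.0, 4.0]
```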

Rate Limiting Errors

When rate limits are exceeded:

```json
{
  "detail": "Rate limit exceeded. Try again in 42 seconds."
}
```

Rate limits use Redis sliding window counters:

| Category | Limit | Window |
| --- | --- | --- |
| General API | 100 requests | 1 minute |
| Investigations | 10 requests | 1 minute |
| Playbook execution | 5 requests | 1 minute |
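The production counters live in Redis, but the sliding-window idea can be sketched in memory with a deque of timestamps (illustrative only; the real implementation shares state across API workers via Redis):

```python
from collections import deque

class SlidingWindowLimiter:
    """In-memory sketch of a sliding-window rate counter."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()  # drop hits that slid out of the window
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=5, window_seconds=60)
results = [limiter.allow(now=0.0) for _ in range(6)]
print(results)  # five True, then False once the limit is hit
```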

Structured Logging for Errors

All errors are logged via structlog with OpenTelemetry trace context:

```python
from opentelemetry.trace import get_current_span

trace_id = get_current_span().get_span_context().trace_id
logger.error(
    "agent_dispatch_failed",
    agent_type="threat_hunter",
    error=str(e),
    case_id=case_id,
    attempt=attempt,
    trace_id=format(trace_id, "032x"),  # int -> 32-char hex, as Jaeger shows it
)
```

This produces structured JSON logs that can be correlated with traces in Jaeger:

```json
{
  "event": "agent_dispatch_failed",
  "agent_type": "threat_hunter",
  "error": "Connection refused",
  "case_id": "550e8400-...",
  "attempt": 2,
  "trace_id": "abc123...",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error"
}
```

WebSocket Error Handling

WebSocket connections handle errors silently to avoid disrupting other clients:

```python
async def broadcast(self, data: dict) -> None:
    dead: list[str] = []
    for cid, ws in self.active.items():
        try:
            await ws.send_json(data)
        except Exception:
            dead.append(cid)  # mark for removal
    for cid in dead:
        self.active.pop(cid, None)  # clean up dead connections
```

Failed sends don't raise — they silently remove the dead connection and continue broadcasting to healthy clients.
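A runnable sketch of that cleanup behavior, using stub objects in place of real WebSockets (all names here are illustrative):

```python
import asyncio

class StubSocket:
    """Stands in for a WebSocket; alive=False simulates a dead peer."""
    def __init__(self, alive: bool = True):
        self.alive = alive
        self.sent: list[dict] = []

    async def send_json(self, data: dict) -> None:
        if not self.alive:
            raise ConnectionError("peer gone")
        self.sent.append(data)

class Broadcaster:
    def __init__(self):
        self.active: dict[str, StubSocket] = {}

    async def broadcast(self, data: dict) -> None:
        dead: list[str] = []
        for cid, ws in self.active.items():
            try:
                await ws.send_json(data)
            except Exception:
                dead.append(cid)
        for cid in dead:
            self.active.pop(cid, None)

async def main():
    b = Broadcaster()
    b.active = {"a": StubSocket(), "b": StubSocket(alive=False), "c": StubSocket()}
    await b.broadcast({"event": "alert"})
    return sorted(b.active)

remaining = asyncio.run(main())
print(remaining)  # ['a', 'c']: the dead connection was removed silently
```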

Event Bus Error Handling

Redis Streams consumers use acknowledgment-based error handling:

```python
async for msg_id, data in consumer.consume():
    try:
        await handler(data)
        await consumer.ack(msg_id)  # success: acknowledge
    except Exception:
        pass  # message stays in pending and will be re-delivered
```

Unacknowledged messages remain in the Redis Stream's pending entries list (PEL) and are re-delivered once their idle time exceeds the consumer group's claim threshold (the min-idle-time passed to XAUTOCLAIM or XCLAIM).
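The acknowledgment semantics can be sketched without Redis. In this toy model (illustrative names only), messages leave pending only when acked, and unacked ones become claimable once idle long enough:

```python
class PendingList:
    """Toy model of a Redis Stream consumer group's pending entries list (PEL)."""
    def __init__(self, min_idle: float):
        self.min_idle = min_idle
        self.pending: dict[str, float] = {}  # msg_id -> last delivery time

    def deliver(self, msg_id: str, now: float) -> None:
        self.pending[msg_id] = now

    def ack(self, msg_id: str) -> None:
        self.pending.pop(msg_id, None)  # XACK removes the entry from the PEL

    def claimable(self, now: float) -> list[str]:
        """Messages idle long enough to be re-delivered to another consumer."""
        return [m for m, t in self.pending.items() if now - t >= self.min_idle]

pel = PendingList(min_idle=30.0)
pel.deliver("1-0", now=0.0)
pel.deliver("1-1", now=0.0)
pel.ack("1-0")                   # handler succeeded
stuck = pel.claimable(now=31.0)  # handler for 1-1 raised; it stays pending
print(stuck)  # ['1-1']
```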