Error Handling

AuroraSOC implements multiple layers of error handling to ensure resilience in a security-critical environment.

When to Use This Page

Use this page when implementing API clients, worker consumers, or integration services that must remain robust during dependency failures and traffic spikes.

Prerequisites

  • Familiarity with HTTP status code classes
  • Basic async retry/backoff patterns
  • Understanding of AuroraSOC auth and permission model

Error Handling Philosophy

API Error Response Format

All API errors return a consistent JSON structure:

{
  "detail": "Human-readable error message"
}

For validation errors (Pydantic), FastAPI returns:

{
  "detail": [
    {
      "loc": ["body", "severity"],
      "msg": "String should match pattern '^(critical|high|medium|low|info)$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
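
Because `detail` can be either a string or a list of validation errors, clients should normalize it before logging or displaying it. A minimal sketch (the helper name and output format are illustrative, not part of the API):

def summarize_error(payload: dict) -> str:
    """Collapse an AuroraSOC error body into one log-friendly string."""
    detail = payload.get("detail", "Unknown error")

    # Validation failures arrive as a list of {loc, msg, type} objects.
    if isinstance(detail, list):
        parts = []
        for err in detail:
            loc = ".".join(str(p) for p in err.get("loc", []))
            parts.append(f"{loc}: {err.get('msg', '')}")
        return "; ".join(parts)

    return str(detail)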

Error Codes Reference

| Code | Meaning | When It Occurs |
| --- | --- | --- |
| 400 | Bad Request | Invalid input, validation failure |
| 401 | Unauthorized | Missing/expired JWT, invalid API key |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource (e.g., duplicate IOC type+value) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Unhandled exception |
| 503 | Service Unavailable | Database down, degraded mode |

Retry Decision Matrix

| Error Class | Retry? | Backoff Strategy | Notes |
| --- | --- | --- | --- |
| Validation (400) | No | None | Fix request contract first |
| Authentication (401) | Conditional | Immediate token refresh + 1 retry | Prevent endless token loops |
| Authorization (403) | No | None | Permission issue, not transient |
| Not Found (404) | Conditional | Short retry only for eventual consistency paths | Most paths should not retry |
| Conflict (409) | Conditional | None or short retry | Often indicates idempotency conflict |
| Rate Limit (429) | Yes | Exponential + jitter | Respect server guidance when available |
| Service Unavailable (503) | Yes | Exponential + capped retry | Mark operation as deferred if persistent |
| Internal (500) | Yes (bounded) | Exponential + telemetry | Escalate after retry budget exhausted |
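
The matrix can be encoded directly in client code. A minimal sketch (the names and retry budgets below are illustrative; tune them to your own service):

from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    max_attempts: int = 0
    use_backoff: bool = False


def classify_status(status: int) -> RetryDecision:
    """Map an HTTP status code to a retry decision per the matrix above."""
    if status in (429, 503):
        return RetryDecision(retry=True, max_attempts=5, use_backoff=True)
    if status == 500:
        # Bounded retries only; escalate once the budget is exhausted.
        return RetryDecision(retry=True, max_attempts=3, use_backoff=True)
    if status == 401:
        # Refresh the token once, then allow a single retry.
        return RetryDecision(retry=True, max_attempts=1)
    # 400, 403, and most 404/409 paths are not transient.
    return RetryDecision(retry=False)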

Database Unavailable Handling

The most critical error scenario is PostgreSQL being down. AuroraSOC handles this gracefully:

from fastapi.responses import JSONResponse


class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""


@app.exception_handler(DatabaseUnavailable)
async def _handle_db_unavailable(request, exc):
    return JSONResponse(
        status_code=503,
        content={"detail": "Database unavailable — running in degraded mode"},
    )

In degraded mode, the API serves data from in-memory stores populated with demo data. This ensures the dashboard remains functional for monitoring even during database maintenance.
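
The fallback wiring is not shown in the handler above; the sketch below illustrates one way a repository selector could fall back to an in-memory store. The class names and demo data are hypothetical (only DatabaseUnavailable comes from the snippet above):

DEMO_CASES = [{"id": "demo-1", "title": "Example phishing case", "severity": "low"}]


class InMemoryCaseRepository:
    """Degraded-mode store pre-populated with demo data."""

    def __init__(self, seed: list[dict]) -> None:
        self._cases = list(seed)

    async def list_cases(self) -> list[dict]:
        return list(self._cases)


async def get_case_repository(make_db_repo):
    """Prefer PostgreSQL; fall back to the in-memory store in degraded mode."""
    try:
        return await make_db_repo()
    except DatabaseUnavailable:
        # Keep dashboards functional while the database is being repaired.
        return InMemoryCaseRepository(DEMO_CASES)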

Circuit Breaker Pattern

The orchestrator uses a circuit breaker when dispatching to agent services:

import time


class CircuitBreakerOpen(Exception):
    """Raised when calls are rejected while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed | open | half-open
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitBreakerOpen("Agent service unavailable")

        try:
            result = await func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise

Retry with Exponential Backoff

Agent dispatch includes automatic retry:

async def dispatch_with_retry(self, agent_type, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await self.circuit_breaker.call(
                self._dispatch, agent_type, prompt
            )
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, 2s, 4s
            await asyncio.sleep(wait)

Production-Safe Retry Wrapper Example

import asyncio
import random
from collections.abc import Awaitable, Callable


async def retry_with_jitter(
    operation: Callable[[], Awaitable[dict]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> dict:
    """Retry transient failures with bounded exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise

            # Jitter reduces synchronized retry storms across many clients.
            raw_delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            delay = raw_delay * (0.7 + random.random() * 0.6)
            await asyncio.sleep(delay)

    raise RuntimeError("retry loop exited unexpectedly")
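
Usage is a thin lambda or partial around the flaky call. For example, wrapping an agent dispatch (the orchestrator and prompt names here are illustrative):

result = await retry_with_jitter(
    lambda: orchestrator.dispatch("threat_hunter", prompt),
    max_attempts=4,
)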

Rate Limiting Errors

When rate limits are exceeded:

{
  "detail": "Rate limit exceeded. Try again in 42 seconds."
}

Rate limits use Redis sliding window counters:

| Category | Limit | Window |
| --- | --- | --- |
| General API | 100 requests | 1 minute |
| Investigations | 10 requests | 1 minute |
| Playbook execution | 5 requests | 1 minute |
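
A sliding window counter of this kind is typically built on a Redis sorted set keyed by client and category. The sketch below uses redis-py's asyncio client and is illustrative rather than AuroraSOC's actual implementation:

import time
import uuid

from redis.asyncio import Redis


async def allow_request(r: Redis, key: str, limit: int, window_s: int) -> bool:
    """Return True if the caller is still under `limit` requests per window."""
    now = time.time()
    member = f"{now}:{uuid.uuid4().hex}"  # unique member per request

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop entries outside the window
    pipe.zadd(key, {member: now})                  # record this request
    pipe.zcard(key)                                # count requests in the window
    pipe.expire(key, window_s)                     # let idle keys age out
    _, _, count, _ = await pipe.execute()

    return count <= limit

A request handler would translate a False result into the 429 response shown above.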

Structured Logging for Errors

All errors are logged via structlog with OpenTelemetry trace context:

logger.error(
    "agent_dispatch_failed",
    agent_type="threat_hunter",
    error=str(e),
    case_id=case_id,
    attempt=attempt,
    trace_id=get_current_span().get_span_context().trace_id,
)

This produces structured JSON logs that can be correlated with traces in Jaeger:

{
  "event": "agent_dispatch_failed",
  "agent_type": "threat_hunter",
  "error": "Connection refused",
  "case_id": "550e8400-...",
  "attempt": 2,
  "trace_id": "abc123...",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error"
}
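
The exact processor chain is project-specific, but a structlog configuration that produces JSON output of this shape looks roughly like the following (an assumption, not copied from AuroraSOC):

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # pick up bound request context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True, key="timestamp"),
        structlog.processors.JSONRenderer(),
    ],
)

logger = structlog.get_logger()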

WebSocket Error Handling

WebSocket connections handle errors silently to avoid disrupting other clients:

async def broadcast(self, data: dict) -> None:
    dead: list[str] = []
    for cid, ws in self.active.items():
        try:
            await ws.send_json(data)
        except Exception:
            dead.append(cid)  # Mark for removal
    for cid in dead:
        self.active.pop(cid, None)  # Clean up dead connections

Failed sends don't raise — they silently remove the dead connection and continue broadcasting to healthy clients.

Event Bus Error Handling

Redis Streams consumers use acknowledgment-based error handling:

async for msg_id, data in consumer.consume():
    try:
        await handler(data)
        await consumer.ack(msg_id)  # Success: acknowledge
    except Exception:
        pass  # Message stays pending and will be re-delivered

Unacknowledged messages remain in the Redis Stream's pending entries list (PEL). They are not re-delivered on their own; a consumer must reclaim them (for example with XAUTOCLAIM or XCLAIM) once they have been idle for longer than a chosen threshold.
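
A background task typically sweeps the PEL, reclaiming stale messages and dead-lettering ones that keep failing. The sketch below is an assumption about how such a sweeper could look, not AuroraSOC's implementation; the stream and group names are hypothetical, and the entry fields follow redis-py's parsed XPENDING output:

from redis.asyncio import Redis

STREAM, GROUP, CONSUMER = "aurorasoc:events", "workers", "sweeper-1"  # assumed names


async def sweep_pending(r: Redis, max_deliveries: int = 5) -> None:
    """Reclaim stale pending entries; dead-letter messages that keep failing."""
    pending = await r.xpending_range(STREAM, GROUP, "-", "+", 100)
    for entry in pending:
        msg_id = entry["message_id"]
        if entry["times_delivered"] >= max_deliveries:
            # Poison message: park a reference for manual review, then acknowledge.
            await r.xadd(f"{STREAM}:dlq", {"original_id": msg_id})
            await r.xack(STREAM, GROUP, msg_id)
        elif entry["time_since_delivered"] > 60_000:  # idle for more than 60s
            # Transfer ownership to this consumer so the message is processed again.
            await r.xclaim(STREAM, GROUP, CONSUMER, 60_000, [msg_id])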

Edge Cases to Handle Explicitly

  1. Token refresh races across concurrent requests
  2. Retry storms after upstream recovery
  3. Duplicate processing when consumers restart before acknowledgment (see the idempotency sketch after this list)
  4. Mixed success in batch operations (partial commit semantics)
  5. Long-running operations that outlive client-side timeouts
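
For edge case 3, consumers can pair at-least-once delivery with an idempotency guard so a re-delivered message is processed at most once. A minimal sketch using a Redis SET NX check (the key naming and TTL are illustrative):

from redis.asyncio import Redis


async def process_once(r: Redis, msg_id: str, data: dict, handler) -> None:
    """Skip messages that were already handled before a crash-and-redeliver."""
    # SET NX returns True only for the first consumer to claim this message id.
    first_time = await r.set(f"processed:{msg_id}", "1", nx=True, ex=86_400)
    if not first_time:
        return  # already handled; the caller can simply acknowledge again

    await handler(data)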

Security Logging Guardrail

Do not log secrets, raw credentials, or full bearer tokens in error messages. Log references, hashes, and trace IDs instead.