Error Handling
AuroraSOC implements multiple layers of error handling to ensure resilience in a security-critical environment.
When to Use This Page
Use this page when implementing API clients, worker consumers, or integration services that must remain robust during dependency failures and traffic spikes.
Prerequisites
- Familiarity with HTTP status code classes
- Basic async retry/backoff patterns
- Understanding of AuroraSOC auth and permission model
Error Handling Philosophy
Fail fast on client errors, retry transient faults with bounded backoff, and degrade gracefully when dependencies are unavailable rather than taking the whole service down.
API Error Response Format
All API errors return a consistent JSON structure:
```json
{
  "detail": "Human-readable error message"
}
```
For validation errors (Pydantic), FastAPI returns:
```json
{
  "detail": [
    {
      "loc": ["body", "severity"],
      "msg": "String should match pattern '^(critical|high|medium|low|info)$'",
      "type": "string_pattern_mismatch"
    }
  ]
}
```
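A client-side helper can normalize both response shapes into displayable messages. The sketch below is illustrative, not part of any AuroraSOC SDK:

```python
def summarize_error(payload: dict) -> list[str]:
    """Normalize both AuroraSOC error shapes into human-readable strings."""
    detail = payload.get("detail", "Unknown error")
    if isinstance(detail, str):
        # Simple errors: {"detail": "..."}
        return [detail]
    # Validation errors: {"detail": [{"loc": [...], "msg": ..., "type": ...}, ...]}
    return [
        f"{'.'.join(str(part) for part in item.get('loc', []))}: {item.get('msg', '')}"
        for item in detail
    ]
```

This keeps display logic in one place regardless of which endpoint produced the error.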
Error Codes Reference
| Code | Meaning | When It Occurs |
|---|---|---|
| 400 | Bad Request | Invalid input, validation failure |
| 401 | Unauthorized | Missing/expired JWT, invalid API key |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource (e.g., duplicate IOC type+value) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Unhandled exception |
| 503 | Service Unavailable | Database down, degraded mode |
Retry Decision Matrix
| Error Class | Retry? | Backoff Strategy | Notes |
|---|---|---|---|
| Validation (400) | No | None | Fix the request contract first |
| Authentication (401) | Conditional | Immediate token refresh + 1 retry | Prevent endless token loops |
| Authorization (403) | No | None | Permission issue, not transient |
| Not Found (404) | Conditional | Short retry only for eventual-consistency paths | Most paths should not retry |
| Conflict (409) | Conditional | None or short retry | Often indicates an idempotency conflict |
| Rate Limit (429) | Yes | Exponential + jitter | Respect server guidance when available |
| Service Unavailable (503) | Yes | Exponential + capped retry | Mark operation as deferred if persistent |
| Internal (500) | Yes (bounded) | Exponential + telemetry | Escalate after retry budget exhausted |
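The matrix can be encoded as a small classifier that clients consult before retrying. This is a hedged sketch with illustrative names, not AuroraSOC's client library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    reason: str


def classify(status: int, *, refreshed_token: bool = False) -> RetryDecision:
    """Map an HTTP status to a retry decision, following the matrix above."""
    if status == 401:
        # Retry exactly once after a token refresh; never loop on auth failures.
        return RetryDecision(not refreshed_token, "refresh token, then retry once")
    if status in (429, 503):
        return RetryDecision(True, "transient; exponential backoff with jitter")
    if status == 500:
        return RetryDecision(True, "bounded retries, then escalate")
    # 400/403 are never retryable; 404/409 are treated as non-retryable
    # by default, with opt-in retries for eventual-consistency paths.
    return RetryDecision(False, "not transient; fix the request or permissions")
```

Centralizing the decision keeps every caller consistent with the matrix instead of scattering ad hoc status checks.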
Database Unavailable Handling
The most critical error scenario is PostgreSQL being down. AuroraSOC handles this gracefully:
```python
class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""
    pass


@app.exception_handler(DatabaseUnavailable)
async def _handle_db_unavailable(request, exc):
    return JSONResponse(
        status_code=503,
        content={"detail": "Database unavailable — running in degraded mode"},
    )
```
In degraded mode, the API serves data from in-memory stores populated with demo data. This ensures the dashboard remains functional for monitoring even during database maintenance.
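A repository-level sketch of this fallback, with hypothetical names (AuroraSOC's actual data layer may differ):

```python
class DatabaseUnavailable(Exception):
    """Raised when PostgreSQL is not reachable."""


# Illustrative demo data; the real in-memory store is richer.
DEMO_CASES = [{"id": "demo-1", "severity": "info", "title": "Demo case"}]


class CaseRepository:
    def __init__(self, db=None):
        self.db = db  # None simulates an unreachable database

    def list_cases(self) -> list[dict]:
        try:
            if self.db is None:
                raise DatabaseUnavailable("postgres unreachable")
            return self.db.fetch_cases()
        except DatabaseUnavailable:
            # Degraded mode: serve demo data so the dashboard stays usable.
            return list(DEMO_CASES)
```

Returning a copy of the demo data keeps callers from mutating the shared fallback store.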
Circuit Breaker Pattern
The orchestrator uses a circuit breaker when dispatching to agent services:
```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed | open | half-open
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitBreakerOpen("Agent service unavailable")
        try:
            result = await func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
```
Retry with Exponential Backoff
Agent dispatch includes automatic retry:
```python
async def dispatch_with_retry(self, agent_type, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await self.circuit_breaker.call(
                self._dispatch, agent_type, prompt
            )
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, then 2s (the final attempt raises instead of sleeping)
            await asyncio.sleep(wait)
```
Production-safe retry wrapper example
```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def retry_with_jitter(
    operation: Callable[[], Awaitable[dict]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> dict:
    """Retry transient failures with bounded exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Jitter reduces synchronized retry storms across many clients.
            raw_delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            delay = raw_delay * (0.7 + random.random() * 0.6)
            await asyncio.sleep(delay)
    raise RuntimeError("retry loop exited unexpectedly")
```
Rate Limiting Errors
When rate limits are exceeded:
```json
{
  "detail": "Rate limit exceeded. Try again in 42 seconds."
}
```
Rate limits use Redis sliding window counters:
| Category | Limit | Window |
|---|---|---|
| General API | 100 requests | 1 minute |
| Investigations | 10 requests | 1 minute |
| Playbook execution | 5 requests | 1 minute |
Structured Logging for Errors
All errors are logged via structlog with OpenTelemetry trace context:
```python
logger.error(
    "agent_dispatch_failed",
    agent_type="threat_hunter",
    error=str(e),
    case_id=case_id,
    attempt=attempt,
    # Format the OpenTelemetry trace ID as hex so it matches Jaeger's display.
    trace_id=format(get_current_span().get_span_context().trace_id, "032x"),
)
```
This produces structured JSON logs that can be correlated with traces in Jaeger:
```json
{
  "event": "agent_dispatch_failed",
  "agent_type": "threat_hunter",
  "error": "Connection refused",
  "case_id": "550e8400-...",
  "attempt": 2,
  "trace_id": "abc123...",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error"
}
```
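Without pulling in structlog, the shape of such an event can be sketched as a plain dict that a JSON renderer would then serialize; `make_error_event` is an illustrative stand-in, not the structlog API:

```python
from datetime import datetime, timezone


def make_error_event(event: str, **fields) -> dict:
    """Build a structured error event in the shape shown above."""
    return {
        "event": event,
        **fields,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
    }
```

Because every field is a top-level key rather than text interpolated into a message, log aggregators can filter on `agent_type`, `case_id`, or `trace_id` directly.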
WebSocket Error Handling
WebSocket connections handle errors silently to avoid disrupting other clients:
```python
async def broadcast(self, data: dict) -> None:
    dead: list[str] = []
    for cid, ws in self.active.items():
        try:
            await ws.send_json(data)
        except Exception:
            dead.append(cid)  # Mark for removal
    for cid in dead:
        self.active.pop(cid, None)  # Clean up dead connections
```
Failed sends do not raise; the dead connection is silently removed and broadcasting continues to the healthy clients.
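The cleanup behavior can be exercised in isolation with stand-in sockets. `FlakyWS` and `Broadcaster` here are illustrative test doubles, not AuroraSOC classes:

```python
import asyncio


class FlakyWS:
    """Stand-in WebSocket that either records sends or always fails."""

    def __init__(self, healthy: bool):
        self.healthy = healthy
        self.sent: list[dict] = []

    async def send_json(self, data: dict) -> None:
        if not self.healthy:
            raise ConnectionError("client went away")
        self.sent.append(data)


class Broadcaster:
    def __init__(self):
        self.active: dict[str, FlakyWS] = {}

    async def broadcast(self, data: dict) -> None:
        dead: list[str] = []
        for cid, ws in self.active.items():
            try:
                await ws.send_json(data)
            except Exception:
                dead.append(cid)  # mark, don't mutate while iterating
        for cid in dead:
            self.active.pop(cid, None)


b = Broadcaster()
b.active = {"a": FlakyWS(healthy=True), "b": FlakyWS(healthy=False)}
asyncio.run(b.broadcast({"event": "ping"}))
```

After the broadcast, the failed connection is gone from `active` while the healthy one received the message; deferring removal until after the loop avoids mutating the dict during iteration.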
Event Bus Error Handling
Redis Streams consumers use acknowledgment-based error handling:
```python
async for msg_id, data in consumer.consume():
    try:
        await handler(data)
        await consumer.ack(msg_id)  # Success: acknowledge
    except Exception:
        pass  # No ack: message stays in the PEL and will be re-delivered
```
Unacknowledged messages remain in the Redis Stream's pending entries list (PEL). Redis Streams has no automatic visibility timeout; pending messages are re-delivered when a consumer reclaims them (e.g., via XAUTOCLAIM) once their idle time exceeds a configured threshold.
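The at-least-once consequence of this design can be modeled with a toy pending list; this is a conceptual sketch, not the Redis API. Because re-delivery means a message can be processed more than once, handlers must be idempotent:

```python
class PendingList:
    """Toy model of at-least-once delivery: unacked messages get re-delivered."""

    def __init__(self, messages: list[tuple[str, dict]]):
        self.pending = dict(messages)
        self.acked: set[str] = set()

    def deliver(self) -> list[tuple[str, dict]]:
        # Everything not yet acknowledged is delivered again.
        return [(mid, d) for mid, d in self.pending.items() if mid not in self.acked]

    def ack(self, msg_id: str) -> None:
        self.acked.add(msg_id)


pel = PendingList([("1-0", {"a": 1}), ("1-1", {"a": 2})])
first = pel.deliver()   # both messages delivered
pel.ack("1-0")
second = pel.deliver()  # only the unacked message comes back
```

A consumer that crashes after `handler(data)` but before `ack` will see the same message again on restart, which is exactly the duplicate-processing edge case listed in this page.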
Edge Cases to Handle Explicitly
- Token refresh races across concurrent requests
- Retry storms after upstream recovery
- Duplicate processing when consumers restart before acknowledgment
- Mixed success in batch operations (partial commit semantics)
- Long-running operations that outlive client-side timeouts
Do not log secrets, raw credentials, or full bearer tokens in error messages. Log references, hashes, and trace IDs instead.
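For example, a credential can be reduced to a stable fingerprint before logging; `token_fingerprint` is an illustrative helper, not an AuroraSOC utility:

```python
import hashlib


def token_fingerprint(token: str) -> str:
    """Return a loggable reference to a secret without revealing it."""
    # A truncated SHA-256 digest is stable per token, so repeated failures
    # for the same credential can be correlated without logging the secret.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]
    return f"sha256:{digest}"
```

Logging `token_fingerprint(token)` instead of the raw bearer token keeps error records correlatable while keeping secrets out of log storage.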