Architecture Overview

This section provides a deep-dive into AuroraSOC's architecture for developers contributing to or extending the platform. Understanding the architecture is essential before modifying any component.

If you are new to the codebase, read this page in order from top to bottom once, then use the linked deep-dive pages at the end.

The default Compose stack runs the Python API and event-processing path. Enable --profile rust-core only when you need the optional Rust fast path for high-throughput ingest and attestation workloads.
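For example, assuming the repository's standard Compose file, the opt-in profile is enabled at startup like so (a deployment sketch, not the only supported invocation):

```shell
# Default stack: Python API and event-processing path only
docker compose up -d

# Opt in to the Rust fast path for high-throughput ingest/attestation
docker compose --profile rust-core up -d
```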

If you are looking for the operator-facing startup path, use AI Agent Fleet Deployment first and return here when you need implementation detail.

When to Use This Page

Use this page before you:

  • Add new agents, tools, events, or storage integrations
  • Change message routing, task execution, or API behavior
  • Plan performance testing or production hardening work

Prerequisites

Recommended background before making architecture changes:

  • Familiarity with async Python (asyncio, FastAPI, SQLAlchemy async)
  • Basic understanding of Redis Streams consumer groups
  • Familiarity with container networking (Compose and service DNS)
  • Working knowledge of OpenTelemetry and Prometheus basics

High-Level Architecture

In default deployments, MQTT_S --> PY_EDGE --> REDIS_S is the active edge-ingest path (MQTT broker to Python edge consumer to Redis Streams). The Rust node is only present when --profile rust-core is enabled.

Project Structure

AuroraSOC/
├── aurorasoc/ # Python backend (main application)
│ ├── agents/ # AI agents (orchestrator + specialists) + factory + prompts
│ ├── api/ # FastAPI application and contract surface
│ ├── config/ # Pydantic settings (10 subsystem configs)
│ ├── core/ # Auth, DB, logging, rate limiting, tracing
│ ├── engine/ # SOAR playbook engine
│ ├── events/ # Redis Streams, NATS, MQTT consumers
│ ├── workers/ # Redis stream workers (agent task execution)
│ ├── memory/ # Three-tier agent memory system
│ ├── models/ # Pydantic domain models + enums
│ ├── services/ # Background scheduler
│ ├── tools/ # 50+ MCP tools across multiple modules
│ └── workflows/ # BeeAI AgentWorkflows
├── rust_core/ # Optional Rust fast path
│ └── src/ # Event normalizer, security middleware, publishers
├── dashboard/ # Next.js 15 frontend
│ └── src/ # React components, Zustand store, API client
├── firmware/ # Three firmware platforms
│ ├── esp32s3/ # Zephyr RTOS (C)
│ ├── nrf52840/ # Embassy-rs (Rust)
│ └── stm32/ # Ada SPARK (Ada)
├── infrastructure/ # Docker, monitoring, broker configs
├── tests/ # pytest test suite
├── alembic/ # Database migrations
└── docs/ # This Docusaurus documentation

Component Interactions

Request Flow: Alert Investigation

Runtime Data Flow (Task Worker Path)

This complements the API-centric flow above and focuses on worker correlation behavior.

Design Principles

1. Graceful Degradation

Fallback behavior is intentionally mode-aware:

  • PostgreSQL down in dummy mode -> selected read endpoints may serve in-memory showcase data
  • PostgreSQL down in dry_run or real mode -> DB-backed reads fail clearly instead of silently substituting showcase data
  • Redis down -> selected runtime protections fall back to in-memory behavior where explicitly implemented
  • pgvector unavailable -> agent memory falls back to sliding-window-only behavior
  • Agent offline -> circuit breakers and timeout controls prevent cascading failures

2. Configuration over Code

All behavior is configurable via environment variables with sensible defaults. No code changes needed for:

  • LLM provider switching
  • Port assignments
  • Connection strings
  • Feature toggles
  • Rate limits
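The pattern behind "configuration over code" can be sketched with stdlib tools. AuroraSOC's actual config classes use Pydantic settings; the dataclass, the default values, and the grouping below are illustrative only (the env var names come from elsewhere on this page).

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkerSettings:
    """Illustrative env-driven settings; defaults here are assumptions."""
    redis_batch_size: int
    redis_block_ms: int
    metrics_port: int

    @classmethod
    def from_env(cls) -> "WorkerSettings":
        # Every value has a sensible default, so no code change is needed
        # to retune a deployment -- only environment variables.
        return cls(
            redis_batch_size=int(os.getenv("REDIS_BATCH_SIZE", "10")),
            redis_block_ms=int(os.getenv("REDIS_BLOCK_MS", "5000")),
            metrics_port=int(os.getenv("AGENT_TASK_WORKER_METRICS_PORT", "9100")),
        )
```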

3. Separation of Concerns

Each module has a single responsibility:

  • agents/ — AI agent creation and configuration
  • tools/ — External system integration
  • events/ — Message transport
  • memory/ — Knowledge persistence
  • engine/ — Playbook execution
  • core/ — Cross-cutting concerns (auth, logging, tracing)

4. Event Sourcing Lite

While not a full event-sourced system, AuroraSOC captures all state changes in Redis Streams, providing:

  • Complete audit trail
  • Event replay capability
  • Decoupled producers and consumers
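A minimal sketch of the append-and-replay idea with redis-py, assuming a running Redis; the stream name and field layout are hypothetical, not AuroraSOC's actual schema:

```python
import json
import time

STREAM = "aurora:events"  # hypothetical stream name for illustration

def build_event(kind: str, payload: dict) -> dict[str, str]:
    """Flatten a state change into the string fields a stream entry stores."""
    return {
        "kind": kind,
        "ts": str(time.time()),
        "payload": json.dumps(payload, sort_keys=True),
    }

if __name__ == "__main__":
    # Requires a reachable Redis and the redis-py package.
    import redis

    r = redis.Redis()
    # Append: every state change becomes an immutable stream entry.
    r.xadd(STREAM, build_event("alert.status_changed",
                               {"id": "a1", "status": "triaged"}))
    # Replay: any consumer can re-read the full history from the start,
    # which is what provides the audit trail and replay capability.
    for entry_id, fields in r.xrange(STREAM, min="-", max="+"):
        print(entry_id, fields)
```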

Failure-Mode and Recovery Matrix

| Failure Mode | Detection Signal | Immediate Behavior | Recovery Strategy |
| --- | --- | --- | --- |
| Agent A2A endpoint unavailable | Startup connectivity probe + dispatch exception | Warning-only degraded startup, circuit-breaker protection | Restore service; breaker resets via half-open success |
| PostgreSQL unavailable | DB health checks and query failures | API can return degraded responses for selected views | Restore DB, re-run failed writes if needed |
| Redis stream lag growth | Prometheus lag/retry/dead-letter signals | Investigation latency increases | Scale worker consumers, inspect slow handlers |
| Result correlation mismatch | aurora_agent_results_unmatched_total increasing | Pending futures may time out | Validate correlation IDs and event payload schema |
| Metrics exporter port conflict | Worker startup warning | Worker continues without local metrics endpoint | Change AGENT_TASK_WORKER_METRICS_PORT and restart worker |

Performance Considerations

Latency-sensitive path

The most latency-sensitive path is:

  1. API task publish
  2. Worker task consume and dispatch
  3. Result publish and correlation
  4. API response/websocket fanout

Instrumentation to watch:

  • aurora_agent_task_worker_task_duration_ms
  • aurora_agent_result_correlation_latency_ms
  • aurora_agent_result_futures_pending
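Step 3 of the latency-sensitive path (result publish and correlation) hinges on matching each incoming result to a pending future by its task id. A minimal asyncio sketch, with hypothetical class and method names:

```python
import asyncio

class ResultCorrelator:
    """Illustrative task-id -> future map behind result correlation."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def register(self, task_id: str) -> asyncio.Future:
        # Called when the API publishes a task; the handler awaits this future.
        fut = asyncio.get_running_loop().create_future()
        self._pending[task_id] = fut
        return fut

    def resolve(self, task_id: str, result: dict) -> bool:
        # Called when the worker's result arrives off the stream.
        fut = self._pending.pop(task_id, None)
        if fut is None or fut.done():
            # An unmatched result -- the condition counted by
            # aurora_agent_results_unmatched_total.
            return False
        fut.set_result(result)
        return True

async def demo() -> dict:
    correlator = ResultCorrelator()
    fut = correlator.register("task-1")
    # Simulate the worker publishing a result.
    correlator.resolve("task-1", {"status": "done"})
    return await asyncio.wait_for(fut, timeout=1.0)
```

If no result arrives before the timeout, the pending future expires, which is why aurora_agent_result_futures_pending is worth watching alongside the duration metrics.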

Throughput tuning levers

| Lever | Location | Effect |
| --- | --- | --- |
| REDIS_BATCH_SIZE | worker and consumers | Higher throughput, potential burst latency |
| REDIS_BLOCK_MS | stream consumers | Lower polling overhead, affects responsiveness |
| Worker replica count | deployment/compose | Higher parallelism for task handling |
| A2A timeout values | dispatch layer | Prevents long hangs during partial outages |
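The first two levers map directly onto the count and block parameters of a Redis Streams consumer-group read. A redis-py sketch, with illustrative stream, group, and consumer names (only the env var names come from this page):

```python
import os

BATCH = int(os.getenv("REDIS_BATCH_SIZE", "10"))     # max entries per read
BLOCK_MS = int(os.getenv("REDIS_BLOCK_MS", "5000"))  # blocking wait per poll

def read_args(stream: str = "aurora:agent_tasks",
              group: str = "agent-task-workers",
              consumer: str = "worker-1") -> dict:
    """Keyword arguments for redis.Redis.xreadgroup.

    A larger count raises per-poll throughput at the cost of burstier
    latency; a larger block reduces polling overhead but delays shutdown
    and rebalancing responsiveness.
    """
    return {
        "groupname": group,
        "consumername": consumer,
        "streams": {stream: ">"},  # ">" = only entries never delivered to the group
        "count": BATCH,
        "block": BLOCK_MS,
    }

# Usage (requires a running Redis and redis-py):
#   r = redis.Redis()
#   entries = r.xreadgroup(**read_args())
```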

Technology Stack Decision Matrix

| Component | Technology | Why This Choice | Alternatives Considered |
| --- | --- | --- | --- |
| AI Framework | BeeAI | A2A + MCP native support | LangChain, CrewAI, AutoGen |
| API | FastAPI | Async, type-safe, OpenAPI | Django, Flask, Express |
| ORM | SQLAlchemy 2.0 | Async, mature, type-safe | Tortoise ORM, Prisma |
| Event Bus | Redis Streams | Low latency, consumer groups | Kafka, RabbitMQ |
| Federation | NATS JetStream | Lightweight, persistent | Kafka, Pulsar |
| IoT Transport | MQTT v5 | Industry standard, QoS | AMQP, CoAP |
| Vector DB | pgvector (PG ext.) | Single-DB simplicity, HNSW indexes | Qdrant, Pinecone, Weaviate, Milvus |
| Core Engine | Rust (tokio + axum, opt-in profile) | High-throughput normalization and attestation fast path | Go, C++ |
| Frontend | Next.js 15 | SSR, React, Turbopack | Nuxt, SvelteKit |
| Firmware | C / Rust / Ada | Platform-specific strengths | MicroPython, Arduino |
| Database | PostgreSQL 16 | JSONB, reliability, extensions | MySQL, MongoDB |
| Tracing | OpenTelemetry | Vendor-neutral, standard | Jaeger, Zipkin (OTLP exports to these) |

Port Map

| Port | Service | Protocol |
| --- | --- | --- |
| 8000 | FastAPI | HTTP/WS |
| 8080 | Rust Core Engine (opt-in profile) | HTTP |
| 3000 | Next.js Dashboard | HTTP |
| 5432 | PostgreSQL + pgvector | TCP |
| 6379 | Redis | TCP |
| 4222 | NATS | TCP |
| 1883 | MQTT | TCP |
| 4317 | OTLP gRPC | gRPC |
| 9000-9016 | A2A Agents | HTTP |
| 9090 | Prometheus | HTTP |
| 3001 | Grafana | HTTP |