Architecture Overview
This section provides a deep-dive into AuroraSOC's architecture for developers contributing to or extending the platform. Understanding the architecture is essential before modifying any component.
If you are new to the codebase, read this page in order from top to bottom once, then use the linked deep-dive pages at the end.
The default Compose stack follows the Python API and event-processing path. Enable --profile rust-core only when you need the optional Rust fast path for high-throughput ingest and attestation workloads.
If you are looking for the operator-facing startup path, use AI Agent Fleet Deployment first and return here when you need implementation detail.
When to Use This Page
Use this page before you:
- Add new agents, tools, events, or storage integrations
- Change message routing, task execution, or API behavior
- Plan performance testing or production hardening work
Prerequisites
Recommended background before making architecture changes:
- Familiarity with async Python (asyncio, FastAPI, SQLAlchemy async)
- Basic understanding of Redis Streams consumer groups
- Familiarity with container networking (Compose and service DNS)
- Working knowledge of OpenTelemetry and Prometheus basics
High-Level Architecture
In default deployments, the active edge-ingest path is MQTT broker -> Python edge consumer -> Redis Streams (the MQTT_S --> PY_EDGE --> REDIS_S nodes). The Rust node is only present when --profile rust-core is enabled.
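As a concrete illustration of that path, here is a minimal sketch of an edge consumer that bridges MQTT into a Redis Stream. It assumes the aiomqtt and redis-py (redis.asyncio) client libraries; the topic filter, stream name, and field layout are hypothetical and do not necessarily match the consumers in aurorasoc/events/.

```python
import asyncio
import json

import aiomqtt                      # assumed MQTT client library
import redis.asyncio as aioredis    # redis-py asyncio client


async def edge_ingest(mqtt_host: str = "mqtt", redis_url: str = "redis://redis:6379") -> None:
    r = aioredis.from_url(redis_url, decode_responses=True)
    async with aiomqtt.Client(mqtt_host, port=1883) as client:
        # Hypothetical topic filter for device telemetry.
        await client.subscribe("aurora/sensors/#")
        async for message in client.messages:
            event = {
                "topic": str(message.topic),
                "payload": message.payload.decode("utf-8", errors="replace"),
            }
            # Append to a Redis Stream; downstream consumer groups fan this out.
            await r.xadd("aurora:events:raw", {"event": json.dumps(event)})


if __name__ == "__main__":
    asyncio.run(edge_ingest())
```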
Project Structure
AuroraSOC/
├── aurorasoc/ # Python backend (main application)
│ ├── agents/ # AI agents (specialists + orchestrator), agent factory, prompts
│ ├── api/ # FastAPI application and contract surface
│ ├── config/ # Pydantic settings (10 subsystem configs)
│ ├── core/ # Auth, DB, logging, rate limiting, tracing
│ ├── engine/ # SOAR playbook engine
│ ├── events/ # Redis Streams, NATS, MQTT consumers
│ ├── workers/ # Redis stream workers (agent task execution)
│ ├── memory/ # Three-tier agent memory system
│ ├── models/ # Pydantic domain models + enums
│ ├── services/ # Background scheduler
│ ├── tools/ # 50+ MCP tools across multiple modules
│ └── workflows/ # BeeAI AgentWorkflows
├── rust_core/ # Optional Rust fast path
│ └── src/ # Event normalizer, security middleware, publishers
├── dashboard/ # Next.js 15 frontend
│ └── src/ # React components, Zustand store, API client
├── firmware/ # Three firmware platforms
│ ├── esp32s3/ # Zephyr RTOS (C)
│ ├── nrf52840/ # Embassy-rs (Rust)
│ └── stm32/ # Ada SPARK (Ada)
├── infrastructure/ # Docker, monitoring, broker configs
├── tests/ # pytest test suite
├── alembic/ # Database migrations
└── docs/ # This Docusaurus documentation
Component Interactions
Request Flow: Alert Investigation
Runtime Data Flow (Task Worker Path)
This complements the API-centric flow above and focuses on worker correlation behavior.
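To make the worker path concrete, the sketch below shows one possible worker loop: read a task from a Redis Stream through a consumer group, dispatch it, publish a result carrying the task's correlation id, then acknowledge the entry. Stream, group, and field names are hypothetical; the real workers in aurorasoc/workers/ define their own schema.

```python
import asyncio
import json

import redis.asyncio as aioredis
from redis.exceptions import ResponseError

TASK_STREAM = "aurora:agent:tasks"      # hypothetical names
RESULT_STREAM = "aurora:agent:results"
GROUP = "agent-task-workers"


async def run_worker(consumer: str, redis_url: str = "redis://redis:6379") -> None:
    r = aioredis.from_url(redis_url, decode_responses=True)
    try:
        # Create the consumer group once; ignore "already exists" errors.
        await r.xgroup_create(TASK_STREAM, GROUP, id="0", mkstream=True)
    except ResponseError:
        pass

    while True:
        entries = await r.xreadgroup(GROUP, consumer, {TASK_STREAM: ">"}, count=10, block=5000)
        for _stream, messages in entries:
            for message_id, fields in messages:
                task = json.loads(fields["task"])
                # Placeholder for the real A2A dispatch call.
                result = {"status": "ok", "task_type": task.get("type")}
                # Echo the correlation id so the API side can match the result.
                await r.xadd(RESULT_STREAM, {
                    "correlation_id": task["correlation_id"],
                    "result": json.dumps(result),
                })
                await r.xack(TASK_STREAM, GROUP, message_id)


if __name__ == "__main__":
    asyncio.run(run_worker("worker-1"))
```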
Design Principles
1. Graceful Degradation
Fallback behavior is intentionally mode-aware; a minimal fallback sketch follows this list:
- PostgreSQL down in dummy mode -> selected read endpoints may serve in-memory showcase data
- PostgreSQL down in dry_run or real mode -> DB-backed reads fail clearly instead of silently substituting showcase data
- Redis down -> selected runtime protections fall back to in-memory behavior where explicitly implemented
- pgvector unavailable -> agent memory falls back to sliding-window-only behavior
- Agent offline -> circuit breakers and timeout controls prevent cascading failures
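The sketch below illustrates the "Redis down" item: wrap the preferred shared-state path in a try block and keep a per-process substitute. This is a minimal rate-limiter example; the real fallbacks live in aurorasoc/core/, and the key layout and window size here are hypothetical.

```python
import time
from collections import defaultdict

import redis.asyncio as aioredis
from redis.exceptions import RedisError

# Per-process fallback state, used only while Redis is unreachable.
_memory_hits: dict[str, list[float]] = defaultdict(list)


async def allow_request(r: aioredis.Redis, client_id: str, limit: int = 60, window_s: int = 60) -> bool:
    key = f"ratelimit:{client_id}"          # hypothetical key layout
    try:
        # Preferred path: shared counter in Redis with an expiry window.
        count = await r.incr(key)
        if count == 1:
            await r.expire(key, window_s)
        return count <= limit
    except RedisError:
        # Degraded path: per-process sliding window kept in memory.
        now = time.monotonic()
        hits = [t for t in _memory_hits[client_id] if now - t < window_s]
        hits.append(now)
        _memory_hits[client_id] = hits
        return len(hits) <= limit
```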
2. Configuration over Code
All behavior is configurable via environment variables with sensible defaults; a minimal settings sketch follows this list. No code changes are needed for:
- LLM provider switching
- Port assignments
- Connection strings
- Feature toggles
- Rate limits
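The settings sketch below shows how this pattern usually looks with pydantic-settings: every field carries a default and can be overridden from the environment alone. The field names echo variables mentioned on this page, but the defaults and class layout are illustrative rather than the actual classes in aurorasoc/config/.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class WorkerSettings(BaseSettings):
    # Environment variables override these defaults; matching is case-insensitive.
    model_config = SettingsConfigDict(case_sensitive=False)

    redis_batch_size: int = 10                   # REDIS_BATCH_SIZE
    redis_block_ms: int = 5000                   # REDIS_BLOCK_MS
    agent_task_worker_metrics_port: int = 9102   # AGENT_TASK_WORKER_METRICS_PORT
    llm_provider: str = "openai"                 # hypothetical provider toggle


settings = WorkerSettings()   # reads the process environment once at startup
```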
3. Separation of Concerns
Each module has a single responsibility:
- agents/ — AI agent creation and configuration
- tools/ — External system integration
- events/ — Message transport
- memory/ — Knowledge persistence
- engine/ — Playbook execution
- core/ — Cross-cutting concerns (auth, logging, tracing)
4. Event Sourcing Lite
While not a full event-sourced system, AuroraSOC captures all state changes in Redis Streams, providing (a short replay sketch follows this list):
- Complete audit trail
- Event replay capability
- Decoupled producers and consumers
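The replay capability comes directly from the stream API: past entries can be re-read by id range without affecting live consumers. A minimal sketch, reusing the hypothetical stream name from the earlier examples:

```python
import asyncio

import redis.asyncio as aioredis


async def replay(redis_url: str = "redis://redis:6379") -> None:
    r = aioredis.from_url(redis_url, decode_responses=True)
    last_id = "-"                     # start of the stream
    while True:
        # Page through history oldest-to-newest; "(" makes the range exclusive.
        entries = await r.xrange("aurora:events:raw", min=last_id, max="+", count=100)
        if not entries:
            break
        for entry_id, fields in entries:
            print(entry_id, fields)   # in practice: rebuild a projection or audit view
        last_id = "(" + entries[-1][0]


if __name__ == "__main__":
    asyncio.run(replay())
```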
Failure-Mode and Recovery Matrix
| Failure Mode | Detection Signal | Immediate Behavior | Recovery Strategy |
|---|---|---|---|
| Agent A2A endpoint unavailable | Startup connectivity probe + dispatch exception | Warning-only degraded startup, circuit-breaker protection | Restore service, breaker resets via half-open success |
| PostgreSQL unavailable | DB health checks and query failures | API can return degraded responses for selected views | Restore DB, re-run failed writes if needed |
| Redis stream lag growth | Prometheus lag/retry/dead-letter signals | Investigation latency increases | Scale worker consumers, inspect slow handlers |
| Result correlation mismatch | aurora_agent_results_unmatched_total increasing | Pending futures may timeout | Validate correlation IDs and event payload schema |
| Metrics exporter port conflict | Worker startup warning | Worker continues without local metrics endpoint | Change AGENT_TASK_WORKER_METRICS_PORT and restart worker |
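The first row's recovery path ("breaker resets via half-open success") follows the standard circuit-breaker state machine. Below is a minimal sketch with illustrative thresholds; the real breaker sits in the A2A dispatch layer and may differ in detail.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Open after repeated failures, probe again after a cool-down, reset on success."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                          # closed: calls pass
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            return True                                          # half-open: allow a probe
        return False                                             # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                                    # probe succeeded -> close

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()                    # trip (or re-trip) the breaker
```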
Performance Considerations
Latency-sensitive path
The most latency-sensitive path is the following; a correlation sketch appears after the list:
- API task publish
- Worker task consume and dispatch
- Result publish and correlation
- API response/websocket fanout
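On the API side, the result publish and correlation step is typically a pending-futures map keyed by correlation id: the publish parks an asyncio.Future, the result consumer resolves it, and a timeout bounds the wait. This is a simplified sketch with hypothetical names, not the exact implementation.

```python
import asyncio
import uuid

_pending: dict[str, asyncio.Future] = {}      # what a pending-futures gauge would count


async def publish_and_wait(publish_task, timeout_s: float = 30.0) -> dict:
    correlation_id = str(uuid.uuid4())
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    _pending[correlation_id] = future
    try:
        await publish_task(correlation_id)    # e.g. XADD onto the task stream
        return await asyncio.wait_for(future, timeout_s)
    finally:
        _pending.pop(correlation_id, None)


def on_result(correlation_id: str, payload: dict) -> None:
    future = _pending.get(correlation_id)
    if future is None or future.done():
        return                                # unmatched result: worth counting in metrics
    future.set_result(payload)
```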
Instrumentation to watch:
- aurora_agent_task_worker_task_duration_ms
- aurora_agent_result_correlation_latency_ms
- aurora_agent_result_futures_pending
Throughput tuning levers
| Lever | Location | Effect |
|---|---|---|
| REDIS_BATCH_SIZE | worker and consumers | Higher throughput, potential burst latency |
| REDIS_BLOCK_MS | stream consumers | Lower polling overhead, affects responsiveness |
| Worker replica count | deployment/compose | Higher parallelism for task handling |
| A2A timeout values | dispatch layer | Prevents long hangs during partial outages |
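For orientation, the first two levers usually surface directly in the consumer read call. The wiring below is illustrative; the actual plumbing goes through the settings layer rather than raw os.getenv.

```python
import os

# Read the two stream-consumer levers from the environment (illustrative wiring).
BATCH_SIZE = int(os.getenv("REDIS_BATCH_SIZE", "10"))
BLOCK_MS = int(os.getenv("REDIS_BLOCK_MS", "5000"))

# Inside a worker loop (r is a redis.asyncio client):
#   entries = await r.xreadgroup(GROUP, consumer, {TASK_STREAM: ">"},
#                                count=BATCH_SIZE, block=BLOCK_MS)
```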
Related Pages
Technology Stack Decision Matrix
| Component | Technology | Why This Choice | Alternatives Considered |
|---|---|---|---|
| AI Framework | BeeAI | A2A + MCP native support | LangChain, CrewAI, AutoGen |
| API | FastAPI | Async, type-safe, OpenAPI | Django, Flask, Express |
| ORM | SQLAlchemy 2.0 | Async, mature, type-safe | Tortoise ORM, Prisma |
| Event Bus | Redis Streams | Low latency, consumer groups | Kafka, RabbitMQ |
| Federation | NATS JetStream | Lightweight, persistent | Kafka, Pulsar |
| IoT Transport | MQTT v5 | Industry standard, QoS | AMQP, CoAP |
| Vector DB | pgvector (PG ext.) | Single-DB simplicity, HNSW indexes | Qdrant, Pinecone, Weaviate, Milvus |
| Core Engine | Rust (tokio+axum, opt-in profile) | High-throughput normalization and attestation fast path | Go, C++ |
| Frontend | Next.js 15 | SSR, React, Turbopack | Nuxt, SvelteKit |
| Firmware | C/Rust/Ada | Platform-specific strengths | MicroPython, Arduino |
| Database | PostgreSQL 16 | JSONB, reliability, extensions | MySQL, MongoDB |
| Tracing | OpenTelemetry | Vendor-neutral, standard | Jaeger, Zipkin (OTLP exports to these) |
Port Map
| Port | Service | Protocol |
|---|---|---|
| 8000 | FastAPI | HTTP/WS |
| 8080 | Rust Core Engine (opt-in profile) | HTTP |
| 3000 | Next.js Dashboard | HTTP |
| 5432 | PostgreSQL + pgvector | TCP |
| 6379 | Redis | TCP |
| 4222 | NATS | TCP |
| 1883 | MQTT | TCP |
| 4317 | OTLP gRPC | gRPC |
| 9000-9016 | A2A Agents | HTTP |
| 9090 | Prometheus | HTTP |
| 3001 | Grafana | HTTP |