System Architecture — End-to-End Data Flows
This page explains how every component in AuroraSOC connects and communicates. You'll learn what happens when an alert enters the system, how agents coordinate to investigate it, and how results reach the analyst's dashboard.
High-Level Architecture
The Five Core Flows
Flow 1: External Alert Ingestion
What happens when an alert enters the system from an external source (SIEM, third-party feed, etc.).
External Source → FastAPI POST /api/alerts → Deduplication Check → PostgreSQL → Redis Stream (aurora:alerts)
- External source sends an HTTP POST to `/api/alerts` with alert data (title, severity, source, IOCs, MITRE techniques).
- FastAPI endpoint validates the request using Pydantic models and checks JWT authentication.
- Deduplication — a SHA-256 hash is computed from `title + source + IOCs`. If an alert with the same hash already exists, the new alert is dropped as a duplicate.
- PostgreSQL write — the alert is stored in the `alerts` table.
- Redis Stream publish — the alert is published to `aurora:alerts` so background consumers (agents, WebSocket broadcasters) can react.
- WebSocket broadcast — the FastAPI background consumer reads from `aurora:alerts` and pushes the alert to all connected dashboard clients via WebSocket.
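The deduplication step can be sketched in a few lines. The `|` separator, the IOC sort, and the in-memory `seen` set are illustrative assumptions; the real check would run against Redis or PostgreSQL rather than process memory:

```python
import hashlib

def dedup_hash(title: str, source: str, iocs: list[str]) -> str:
    # Sorting the IOCs makes the hash insensitive to their order.
    material = "|".join([title, source, *sorted(iocs)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

seen: set[str] = set()  # stand-in for the persistent dedup store

def is_duplicate(title: str, source: str, iocs: list[str]) -> bool:
    h = dedup_hash(title, source, iocs)
    if h in seen:
        return True
    seen.add(h)
    return False
```

With this scheme, resubmitting the same alert (even with its IOCs reordered) produces the same hash and is dropped.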
Flow 2: IoT/CPS Alert Ingestion (Edge → Rust → Redis)
What happens when a physical device detects something unusual.
Firmware → MQTT (mTLS) → Mosquitto → Rust Core → Normalize → Redis Stream (aurora:cps:alerts)
- Firmware (running on STM32, nRF52, or ESP32-S3) detects an anomaly — a tamper event, unexpected firmware write, debug interface activation, or sensor reading outside normal range.
- MQTT publish — the device publishes a message to a topic like `aurora/devices/{device_id}/telemetry` using QoS 2 (exactly-once) over mTLS (both sides verify certificates).
- Mosquitto broker receives and routes the message.
- Rust Core Engine subscribes to MQTT topics. When a message arrives:
- Normalizes the raw payload into a unified alert format (regardless of whether the original was CEF, Syslog, JSON, or raw binary).
- Verifies attestation if the message includes a hardware attestation certificate (ECDSA P-256 signature verification).
- Publishes the normalized alert to the `aurora:cps:alerts` Redis Stream.
- Redis Stream delivers the CPS alert to the API and agent layer.
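The normalization step can be sketched as a small function (shown in Python for brevity, though the real engine is Rust). The unified field names, defaults, and everything beyond the documented `aurora/devices/{device_id}/telemetry` topic pattern are assumptions:

```python
import json
from dataclasses import dataclass

@dataclass
class CpsAlert:
    # Illustrative unified-alert shape, not the real schema.
    device_id: str
    event: str
    severity: str
    raw_format: str

def normalize(topic: str, payload: bytes) -> CpsAlert:
    # Topic layout: aurora/devices/{device_id}/telemetry
    device_id = topic.split("/")[2]
    try:
        body = json.loads(payload)
        return CpsAlert(device_id,
                        body.get("event", "unknown"),
                        body.get("severity", "medium"),
                        "json")
    except ValueError:
        # Non-JSON (e.g., raw binary) payloads are wrapped, not parsed.
        return CpsAlert(device_id, "raw_payload", "medium", "binary")
```

Whatever the wire format, downstream consumers only ever see the unified shape.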
Flow 3: Agent Investigation (Orchestrator → Specialists)
What happens when the system decides to investigate an alert.
Alert → Orchestrator → ThinkTool → HandoffTool[Specialist] → MCP Tools → Results → Synthesize → Report
This is the most important flow. Let's trace it step by step:
- Trigger — an alert arrives (via REST API or Redis Stream) that requires investigation.
- Orchestrator receives the task — via the `aurora:agent:tasks` Redis Stream or an API call.
- ThinkTool (mandatory first step) — before taking any action, the Orchestrator is forced by `ConditionalRequirement(ThinkTool, force_at_step=1)` to write a structured reasoning plan:
  - What type of alert is this?
  - Which specialist agents are needed?
  - What order should they investigate in?
  - Are any CPS/IoT assets involved?
- HandoffTool delegation — the Orchestrator uses BeeAI's `HandoffTool` to delegate tasks to specialist agents. Each `HandoffTool` points to a specific specialist's A2A (Agent-to-Agent) HTTP endpoint:
  - `delegate_to_SecurityAnalyst → http://agent-security-analyst-svc:9001`
  - `delegate_to_ThreatHunter → http://agent-threat-hunter-svc:9002`
- Specialist receives task via A2A — the specialist agent's BeeAI `A2AServer` receives the task.
- Specialist uses MCP tools — the specialist calls its domain-specific tools. For example, `SecurityAnalyst` connects to:
  - SIEM MCP server (`:8101`) to search logs
  - SOAR MCP server (`:8102`) to check existing playbooks
  - OSINT MCP server (`:8110`) to enrich IOCs
- Specialist returns findings — structured JSON with severity, confidence, MITRE techniques, IOCs, recommendations.
- Orchestrator synthesizes — collects all specialist findings, resolves conflicts, computes overall severity.
- Human approval gate — if any recommended action is high-risk (e.g., isolating a production server, revoking certificates), the Orchestrator flags `requires_human_approval = true`. The API creates a `HumanApprovalModel` record and notifies the dashboard via WebSocket. The flow pauses until a human analyst approves or rejects.
- Results published — the final investigation result is published to `aurora:agent:results` and stored in PostgreSQL.
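The synthesis and approval-gate steps can be sketched as a small reducer over specialist findings. The field names and the max-severity/average-confidence policy are assumptions, not the real conflict-resolution logic:

```python
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def synthesize(findings: list[dict]) -> dict:
    # Overall severity: the highest level any specialist reported.
    overall = max(findings, key=lambda f: SEVERITY_RANK[f["severity"]])["severity"]
    # Confidence: a simple average across specialists.
    confidence = sum(f["confidence"] for f in findings) / len(findings)
    # Any high-risk recommendation trips the human approval gate.
    needs_approval = any(f.get("high_risk_action", False) for f in findings)
    return {
        "severity": overall,
        "confidence": round(confidence, 2),
        "requires_human_approval": needs_approval,
    }
```

A single specialist flagging a high-risk action is enough to pause the flow for a human decision.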
Flow 4: Agent Memory and Learning
How agents remember past investigations.
Investigation Complete → Summary Embedded → Qdrant Store → Future Query → Similar Cases Retrieved
- Tiered Memory — each agent has 3 memory tiers:
  - Tier 1 (Working): `SlidingMemory` — last 20-60 messages in the current conversation. This is what the LLM sees as context.
  - Tier 2 (Episodic): `EpisodicMemoryStore` backed by Qdrant — completed investigation summaries stored as vector embeddings.
  - Tier 3 (Threat Intel): `ThreatIntelMemory` backed by Qdrant + Redis — cached IOC lookups and threat intelligence.
- Auto-persist — after a configurable number of messages (e.g., every 20 `add()` calls for `ANALYST_MEMORY`), the memory system automatically persists a summary of the current conversation to the episodic store.
- Recall — when a new alert arrives that the agent hasn't seen before, it can call `recall_similar("DNS tunneling via TXT records")`, which:
  - Embeds the query text into a vector using Sentence Transformers
  - Searches Qdrant for the 5 most similar past investigation summaries
  - Returns them as additional context for the current investigation
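The recall path reduces to nearest-neighbor search over stored summaries. Here is a toy stand-in for the Qdrant query, using plain cosine similarity over pre-computed vectors (in the real system, the embeddings come from Sentence Transformers and the search runs in Qdrant):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_similar(query_vec: list[float],
                   store: list[tuple[str, list[float]]],
                   k: int = 5) -> list[str]:
    # Rank stored (summary, vector) pairs by similarity to the query.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [summary for summary, _ in ranked[:k]]
```

The default `k=5` mirrors the "5 most similar past investigation summaries" described above.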
Flow 5: Cross-Site Federation (NATS JetStream)
How multiple AuroraSOC deployments share intelligence.
Site A detects IOC → NATS publish (aurora.intel.ioc.*) → Site B receives → IOC stored locally
- Site A — an analyst or agent identifies a new IOC (malicious IP, domain, file hash).
- NATS publish — the IOC is published to a NATS JetStream subject like `aurora.intel.ioc.ip`.
- Cross-site delivery — NATS JetStream provides exactly-once delivery with message acknowledgment. The IOC is persisted in NATS until acknowledged by all subscriber sites.
- Site B receives — the IOC consumer at Site B receives the message, validates it, and stores it locally for use by agents.
Federation subjects:
| Subject Pattern | Content | Purpose |
|---|---|---|
| `aurora.intel.ioc.*` | IOC records | Share threat indicators across sites |
| `aurora.alerts.federation` | Alert summaries | Cross-site alert visibility |
| `aurora.cps.attestation` | Attestation results | Share device trust status |
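Routing an IOC to its federation subject is a simple mapping. The concrete type names (`ip`, `domain`, `hash`, `url`) are assumptions beyond the `aurora.intel.ioc.ip` example above:

```python
# Hypothetical IOC types published under aurora.intel.ioc.*
ALLOWED_IOC_TYPES = {"ip", "domain", "hash", "url"}

def ioc_subject(ioc_type: str) -> str:
    # Reject unknown types rather than publish to an unintended subject.
    if ioc_type not in ALLOWED_IOC_TYPES:
        raise ValueError(f"unknown IOC type: {ioc_type}")
    return f"aurora.intel.ioc.{ioc_type}"
```

The resulting subject string is what a site would pass to its JetStream publish call.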
Authentication and Authorization
Every request to the API must be authenticated. Here's how:
JWT Token Flow
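The token flow can be illustrated with a standard-library HS256-style sketch. The secret, claim names, and helper functions here are illustrative assumptions, not the production implementation (which would use a proper JWT library):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # illustrative only; real keys come from config

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(sub: str, role: str, ttl: int = 3600) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": sub, "role": role,
                                 "exp": int(time.time()) + ttl}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> dict:
    header, payload, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                               hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    return claims
```

Any tampering with the payload or signature fails the constant-time comparison and the request is rejected.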
RBAC (Role-Based Access Control)
The system has 5 roles, each with different permissions:
| Role | Who | Can Do |
|---|---|---|
| `admin` | System administrator | Everything — manage users, configure system, full API access |
| `analyst` | SOC analyst | View/investigate alerts, manage cases, run playbooks |
| `operator` | SOC operator | View alerts and cases, limited playbook execution |
| `viewer` | Read-only user | View dashboards and reports only |
| `api_service` | Service account | Agent-to-API communication, automated tool access |
Permissions are granular — each endpoint requires specific permissions. For example:
- `alert:create` — only `admin`, `analyst`, `api_service`
- `case:investigate` — only `admin`, `analyst`
- `playbook:execute` — only `admin`, `analyst`, `operator`
- `device:revoke_cert` — only `admin`
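A minimal sketch of this permission model, restricted to the four example permissions above (the mapping shape and the helper function are illustrative, not the real middleware):

```python
# Role → permission sets, covering only the four documented examples.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "admin": {"alert:create", "case:investigate",
              "playbook:execute", "device:revoke_cert"},
    "analyst": {"alert:create", "case:investigate", "playbook:execute"},
    "operator": {"playbook:execute"},
    "viewer": set(),
    "api_service": {"alert:create"},
}

def require_permission(role: str, permission: str) -> None:
    # Raise if the role lacks the permission; endpoints call this first.
    if permission not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} lacks {permission!r}")
```

An endpoint would call `require_permission(claims["role"], "alert:create")` after validating the JWT.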
Circuit Breaker Pattern
The Orchestrator uses a circuit breaker to handle failing specialist agents gracefully:
- CLOSED — normal operation. Requests flow to the specialist.
- OPEN — after 5 consecutive failures, the circuit "opens." All requests are immediately rejected without trying to contact the failing agent. This prevents cascading failures.
- HALF_OPEN — after 60 seconds, one test request is allowed through. If it succeeds, the circuit closes (agent is recovered). If it fails, the circuit reopens.
Retries use exponential backoff via the tenacity library — wait 1s, 2s, 4s between retries, up to 3 attempts.
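The state machine above can be sketched as follows. The class name and API are illustrative; the defaults mirror the thresholds in the text (5 consecutive failures, 60-second cooldown):

```python
import time

class CircuitBreaker:
    """Minimal CLOSED → OPEN → HALF_OPEN sketch, not the real implementation."""

    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or hitting the threshold, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "CLOSED"
        return result
```

The Orchestrator would wrap each specialist's A2A call in one breaker instance per agent, so one failing specialist never blocks the others.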
Background Services
The FastAPI app runs several background workers:
| Service | What It Does | How |
|---|---|---|
| Alert WebSocket Broadcaster | Reads aurora:alerts stream, pushes to WebSocket clients | asyncio.create_task() in app lifespan |
| Agent Thought Broadcaster | Reads aurora:audit stream, pushes agent reasoning traces | asyncio.create_task() in app lifespan |
| Scheduler | Runs periodic tasks | AsyncIOScheduler (APScheduler) |
Scheduled Tasks
The scheduler runs these periodic jobs:
| Job | Frequency | Purpose |
|---|---|---|
| Alert deduplication cleanup | Every 5 minutes | Remove expired dedup hashes |
| Scheduled threat hunting | Every hour | Proactive hunting sweeps |
| Prometheus metrics export | Every 60 seconds | Push metrics to Prometheus |
| Approval expiry check | Every 5 minutes | Auto-reject expired approval requests |
Network Ports Reference
| Service | Port | Protocol | Purpose |
|---|---|---|---|
| FastAPI | 8000 | HTTP/WS | REST API + WebSocket |
| Next.js Dashboard | 3000 | HTTP | Browser UI |
| Orchestrator | 9000 | HTTP (A2A) | Master agent coordinator |
| Specialist Agents | 9001-9015 | HTTP (A2A) | 15 specialist agents |
| MCP SIEM | 8101 | HTTP (MCP) | SIEM tool server |
| MCP SOAR | 8102 | HTTP (MCP) | SOAR tool server |
| MCP EDR | 8103 | HTTP (MCP) | EDR tool server |
| MCP Network | 8104 | HTTP (MCP) | Network tool server |
| MCP Malware | 8105 | HTTP (MCP) | Malware analysis tools |
| MCP Threat Intel | 8106 | HTTP (MCP) | Threat intel tool server |
| MCP CPS | 8107 | HTTP (MCP) | CPS/IoT tool server |
| MCP UEBA | 8108 | HTTP (MCP) | UEBA tool server |
| MCP Forensics | 8109 | HTTP (MCP) | Forensic tool server |
| MCP OSINT | 8110 | HTTP (MCP) | OSINT tool server |
| MCP Net Capture | 8111 | HTTP (MCP) | Packet capture tools |
| MCP Document | 8112 | HTTP (MCP) | Report generation tools |
| MCP Malware Intel | 8113 | HTTP (MCP) | Malware intelligence tools |
| MCP Cloud | 8114 | HTTP (MCP) | Cloud provider tools |
| MCP Vuln Intel | 8115 | HTTP (MCP) | Vulnerability intelligence |
| PostgreSQL | 5432 | TCP | Primary database |
| Redis | 6379 | TCP | Cache + Streams |
| Qdrant | 6333/6334 | HTTP/gRPC | Vector database |
| NATS | 4222/8222 | TCP/HTTP | Cross-site messaging |
| Mosquitto | 1883/8883 | MQTT/MQTTS | IoT broker (plain/TLS) |
| OTel Collector | 4317/4318 | gRPC/HTTP | Telemetry intake |
| Prometheus | 9090 | HTTP | Metrics database |
| Grafana | 4000 | HTTP | Dashboards |
Why This Architecture?
| Decision | Alternative | Why We Chose This |
|---|---|---|
| Redis Streams (not Kafka) | Apache Kafka | AuroraSOC runs co-located. Redis Streams gives sub-millisecond latency with zero JVM overhead. Kafka is designed for multi-datacenter scale we don't need. |
| NATS for federation (not Kafka) | Apache Kafka | NATS gives exactly-once delivery, a lightweight binary protocol, and leaf-node clustering. Perfect for cross-site sync without Kafka's operational weight. |
| MQTT for IoT (not HTTP) | HTTP polling | Constrained devices can't afford the per-request overhead of HTTP polling. MQTT is purpose-built for low-bandwidth, unreliable networks. QoS 2 guarantees exactly-once delivery. |
| A2A handoffs (not direct function calls) | Python function calls | Each agent runs as an independent HTTP service — can be scaled, restarted, or replaced independently. Network boundary enforces isolation. |
| MCP for tools (not hardcoded) | Import Python modules | MCP makes tool access auditable and swappable. Any MCP-compatible server works. Tool changes don't require agent redeployment. |
Next: AI Agents → — Deep dive into every specialist agent.