
System Architecture — End-to-End Data Flows

This page explains how every component in AuroraSOC connects and communicates. You'll learn what happens when an alert enters the system, how agents coordinate to investigate it, and how results reach the analyst's dashboard.


High-Level Architecture


The Five Core Flows

Flow 1: External Alert Ingestion

What happens when an alert enters the system from an external source (SIEM, third-party feed, etc.).

External Source → FastAPI POST /api/alerts → Deduplication Check → PostgreSQL → Redis Stream (aurora:alerts)
  1. External source sends an HTTP POST to /api/alerts with alert data (title, severity, source, IOCs, MITRE techniques).
  2. FastAPI endpoint validates the request using Pydantic models and checks JWT authentication.
  3. Deduplication — a SHA-256 hash is computed from title + source + IOCs. If an alert with the same hash already exists, the incoming alert is dropped as a duplicate instead of being stored again.
  4. PostgreSQL write — the alert is stored in the alerts table.
  5. Redis Stream publish — the alert is published to aurora:alerts so background consumers (agents, WebSocket broadcasters) can react.
  6. WebSocket broadcast — the FastAPI background consumer reads from aurora:alerts and pushes the alert to all connected dashboard clients via WebSocket.
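The dedup-and-publish core of steps 3-5 can be sketched as follows. This is a minimal illustration, not the API's actual code: the in-memory `seen` set stands in for the PostgreSQL/Redis dedup store, and the field names are assumed from the description above.

```python
import hashlib
import json

def dedup_hash(title: str, source: str, iocs: list[str]) -> str:
    """Compute the deduplication key from step 3: SHA-256 over
    title + source + IOCs. Sorting the IOCs makes the hash
    independent of the order they arrive in."""
    payload = json.dumps({"title": title, "source": source, "iocs": sorted(iocs)})
    return hashlib.sha256(payload.encode()).hexdigest()

seen: set[str] = set()  # stand-in for the persistent dedup store

def ingest(alert: dict) -> bool:
    """Return True if the alert is new (store + publish), False if it
    is a duplicate. The real endpoint would INSERT into the alerts
    table and XADD to the aurora:alerts Redis Stream here."""
    h = dedup_hash(alert["title"], alert["source"], alert.get("iocs", []))
    if h in seen:
        return False
    seen.add(h)
    return True
```

Note that sorting the IOC list means two alerts differing only in IOC order deduplicate to the same record.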

Flow 2: IoT/CPS Alert Ingestion (Edge → Rust → Redis)

What happens when a physical device detects something unusual.

Firmware → MQTT (mTLS) → Mosquitto → Rust Core → Normalize → Redis Stream (aurora:cps:alerts)
  1. Firmware (running on STM32, nRF52, or ESP32-S3) detects an anomaly — a tamper event, unexpected firmware write, debug interface activation, or sensor reading outside normal range.
  2. MQTT publish — the device publishes a message to a topic like aurora/devices/{device_id}/telemetry using QoS 2 (exactly-once) over mTLS (both sides verify certificates).
  3. Mosquitto broker receives and routes the message.
  4. Rust Core Engine subscribes to MQTT topics. When a message arrives:
    • Normalizes the raw payload into a unified alert format (regardless of whether the original was CEF, Syslog, JSON, or raw binary).
    • Verifies attestation if the message includes a hardware attestation certificate (ECDSA P-256 signature verification).
    • Publishes the normalized alert to aurora:cps:alerts Redis Stream.
  5. Redis Stream delivers the CPS alert to the API and agent layer.
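The normalization in step 4 can be sketched like this (in Python for consistency with the rest of this page's examples, even though the real component is the Rust Core Engine; the output field names and the CEF field positions are illustrative, not the engine's actual schema):

```python
import json
from datetime import datetime, timezone

def normalize(device_id: str, raw: bytes) -> dict:
    """Turn whatever arrived on the wire (CEF, JSON, or raw binary)
    into one unified alert shape ready for aurora:cps:alerts."""
    text = raw.decode("utf-8", errors="replace")
    if text.startswith("CEF:"):
        # CEF: version|vendor|product|dev_version|sig_id|name|severity|ext
        fields = text.split("|")
        detail = {"signature": fields[4], "name": fields[5], "severity": fields[6]}
    else:
        try:
            detail = json.loads(text)           # JSON telemetry
        except json.JSONDecodeError:
            detail = {"raw_hex": raw.hex()}     # unknown/binary payload
    return {
        "device_id": device_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
    }
```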

Flow 3: Agent Investigation (Orchestrator → Specialists)

What happens when the system decides to investigate an alert.

Alert → Orchestrator → ThinkTool → HandoffTool[Specialist] → MCP Tools → Results → Synthesize → Report

This is the most important flow. Let's trace it step by step:

  1. Trigger — an alert arrives (via REST API or Redis Stream) that requires investigation.
  2. Orchestrator receives the task — via the aurora:agent:tasks Redis Stream or an API call.
  3. ThinkTool (mandatory first step) — Before taking any action, the Orchestrator is forced by ConditionalRequirement(ThinkTool, force_at_step=1) to write a structured reasoning plan:
    • What type of alert is this?
    • Which specialist agents are needed?
    • What order should they investigate in?
    • Are any CPS/IoT assets involved?
  4. HandoffTool delegation — The Orchestrator uses BeeAI's HandoffTool to delegate tasks to specialist agents. Each HandoffTool points to a specific specialist's A2A (Agent-to-Agent) HTTP endpoint:
    delegate_to_SecurityAnalyst → http://agent-security-analyst-svc:9001
    delegate_to_ThreatHunter → http://agent-threat-hunter-svc:9002
  5. Specialist receives task via A2A — The specialist agent's BeeAI A2AServer receives the task.
  6. Specialist uses MCP tools — The specialist calls its domain-specific tools. For example, SecurityAnalyst connects to:
    • SIEM MCP server (:8101) to search logs
    • SOAR MCP server (:8102) to check existing playbooks
    • OSINT MCP server (:8110) to enrich IOCs
  7. Specialist returns findings — structured JSON with severity, confidence, MITRE techniques, IOCs, recommendations.
  8. Orchestrator synthesizes — collects all specialist findings, resolves conflicts, computes overall severity.
  9. Human approval gate — if any recommended action is high-risk (e.g., isolating a production server, revoking certificates), the Orchestrator flags requires_human_approval = true. The API creates a HumanApprovalModel record and notifies the dashboard via WebSocket. The flow pauses until a human analyst approves or rejects.
  10. Results published — the final investigation result is published to aurora:agent:results and stored in PostgreSQL.
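Steps 8-9 can be sketched in miniature: overall severity is the worst specialist verdict, and any high-risk recommended action trips the human approval gate. The severity scale and the action names in `HIGH_RISK_ACTIONS` are assumptions for illustration, not the Orchestrator's actual vocabulary.

```python
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]
HIGH_RISK_ACTIONS = {"isolate_host", "revoke_cert"}  # assumed action names

def synthesize(findings: list[dict]) -> dict:
    """Collect specialist findings, take the worst severity, and flag
    requires_human_approval when any recommendation is high-risk."""
    overall = max(
        (f["severity"] for f in findings),
        key=SEVERITY_ORDER.index,
        default="info",
    )
    actions = [a for f in findings for a in f.get("recommendations", [])]
    return {
        "severity": overall,
        "recommendations": actions,
        "requires_human_approval": any(a in HIGH_RISK_ACTIONS for a in actions),
    }
```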

Flow 4: Agent Memory and Learning

How agents remember past investigations.

Investigation Complete → Summary Embedded → Qdrant Store → Future Query → Similar Cases Retrieved
  1. Tiered Memory — each agent has 3 memory tiers:

    • Tier 1 (Working): SlidingMemory — last 20-60 messages in the current conversation. This is what the LLM sees as context.
    • Tier 2 (Episodic): EpisodicMemoryStore backed by Qdrant — completed investigation summaries stored as vector embeddings.
    • Tier 3 (Threat Intel): ThreatIntelMemory backed by Qdrant + Redis — cached IOC lookups and threat intelligence.
  2. Auto-persist — after a configurable number of messages (e.g., every 20 add() calls for ANALYST_MEMORY), the memory system automatically persists a summary of the current conversation to the episodic store.

  3. Recall — when a new alert arrives that the agent hasn't seen before, it can call recall_similar("DNS tunneling via TXT records") which:

    • Embeds the query text into a vector using Sentence Transformers
    • Searches Qdrant for the 5 most similar past investigation summaries
    • Returns them as additional context for the current investigation
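The persist/recall loop of the episodic tier can be sketched as follows. This stand-in stores vectors in a plain list instead of Qdrant and takes pre-computed vectors rather than embedding text with Sentence Transformers; the class and method names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class EpisodicStore:
    """Stand-in for the Qdrant-backed episodic tier: persist
    (vector, summary) pairs, return the top-k most similar summaries."""
    def __init__(self):
        self._items = []

    def persist(self, vector, summary):
        self._items.append((vector, summary))

    def recall_similar(self, query_vector, k=5):
        ranked = sorted(self._items,
                        key=lambda it: cosine(it[0], query_vector),
                        reverse=True)
        return [summary for _, summary in ranked[:k]]
```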

Flow 5: Cross-Site Federation (NATS JetStream)

How multiple AuroraSOC deployments share intelligence.

Site A detects IOC → NATS publish (aurora.intel.ioc.*) → Site B receives → IOC stored locally
  1. Site A — an analyst or agent identifies a new IOC (malicious IP, domain, file hash).
  2. NATS publish — the IOC is published to a NATS JetStream subject like aurora.intel.ioc.ip.
  3. Cross-site delivery — NATS JetStream provides exactly-once delivery with message acknowledgment. The IOC is persisted in NATS until acknowledged by all subscriber sites.
  4. Site B receives — the IOC consumer at Site B receives the message, validates it, and stores it locally for use by agents.
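The subject naming in step 2 can be sketched as a small helper. Only the `ip` subject appears above; the other type names are assumptions following the same aurora.intel.ioc.* pattern.

```python
def ioc_subject(ioc_type: str) -> str:
    """Map an IOC type to its NATS JetStream subject under the
    aurora.intel.ioc.* pattern (type names other than 'ip' assumed)."""
    allowed = {"ip", "domain", "hash"}
    if ioc_type not in allowed:
        raise ValueError(f"unknown IOC type: {ioc_type}")
    return f"aurora.intel.ioc.{ioc_type}"
```

Subscribers at other sites would then consume with the wildcard subject `aurora.intel.ioc.*` to receive every IOC type at once.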

Federation subjects:

| Subject Pattern | Content | Purpose |
|---|---|---|
| aurora.intel.ioc.* | IOC records | Share threat indicators across sites |
| aurora.alerts.federation | Alert summaries | Cross-site alert visibility |
| aurora.cps.attestation | Attestation results | Share device trust status |

Authentication and Authorization

Every request to the API must be authenticated. Here's how:

JWT Token Flow
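
A minimal sketch of issuing and verifying a token, assuming HS256 signing with only the standard library. The secret, claim names, and TTL here are illustrative; the production API would load the secret from configuration and likely use a dedicated JWT library.

```python
import base64, hashlib, hmac, json, time

SECRET = b"change-me"  # assumed: loaded from config in a real deployment

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(username: str, role: str, ttl: int = 3600) -> str:
    """Build header.payload.signature, signing with HMAC-SHA256."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": username, "role": role,
                               "exp": int(time.time()) + ttl}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> dict:
    """Check the signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Every authenticated endpoint runs the equivalent of `verify_token` before the RBAC check below is applied to the `role` claim.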

RBAC (Role-Based Access Control)

The system has 5 roles, each with different permissions:

| Role | Who | Can Do |
|---|---|---|
| admin | System administrator | Everything: manage users, configure system, full API access |
| analyst | SOC analyst | View/investigate alerts, manage cases, run playbooks |
| operator | SOC operator | View alerts and cases, limited playbook execution |
| viewer | Read-only user | View dashboards and reports only |
| api_service | Service account | Agent-to-API communication, automated tool access |

Permissions are granular — each endpoint requires specific permissions. For example:

  • alert:create — only admin, analyst, api_service
  • case:investigate — only admin, analyst
  • playbook:execute — only admin, analyst, operator
  • device:revoke_cert — only admin
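A permission check built from the examples above might look like this. The role-to-permission map mirrors the four example permissions listed; the real system's full permission set is larger.

```python
# Mirrors the example permissions above; illustrative, not exhaustive.
ROLE_PERMISSIONS = {
    "admin":       {"alert:create", "case:investigate",
                    "playbook:execute", "device:revoke_cert"},
    "analyst":     {"alert:create", "case:investigate", "playbook:execute"},
    "operator":    {"playbook:execute"},
    "viewer":      set(),
    "api_service": {"alert:create"},
}

def require_permission(role: str, permission: str) -> None:
    """Raise PermissionError unless the role holds the permission,
    mirroring the per-endpoint checks described above."""
    if permission not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role} lacks {permission}")
```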

Circuit Breaker Pattern

The Orchestrator uses a circuit breaker to handle failing specialist agents gracefully:

  • CLOSED — normal operation. Requests flow to the specialist.
  • OPEN — after 5 consecutive failures, the circuit "opens." All requests are immediately rejected without trying to contact the failing agent. This prevents cascading failures.
  • HALF_OPEN — after 60 seconds, one test request is allowed through. If it succeeds, the circuit closes (agent is recovered). If it fails, the circuit reopens.

Retries use exponential backoff via the tenacity library — wait 1s, 2s, 4s between retries, up to 3 attempts.
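The three-state machine above can be sketched as a small class. The thresholds match the text (5 consecutive failures, 60-second recovery window); the class itself is an illustration, not the Orchestrator's actual implementation, and it takes an injectable clock so the timeout can be tested without sleeping.

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `failure_threshold` consecutive failures;
    OPEN -> HALF_OPEN after `recovery_timeout` seconds; a successful
    probe in HALF_OPEN closes the circuit, a failure reopens it."""
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.state = "HALF_OPEN"   # allow one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```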


Background Services

The FastAPI app runs several background workers:

| Service | What It Does | How |
|---|---|---|
| Alert WebSocket Broadcaster | Reads aurora:alerts stream, pushes to WebSocket clients | asyncio.create_task() in app lifespan |
| Agent Thought Broadcaster | Reads aurora:audit stream, pushes agent reasoning traces | asyncio.create_task() in app lifespan |
| Scheduler | Runs periodic tasks | AsyncIOScheduler (APScheduler) |

Scheduled Tasks

The scheduler runs these periodic jobs:

| Job | Frequency | Purpose |
|---|---|---|
| Alert deduplication cleanup | Every 5 minutes | Remove expired dedup hashes |
| Scheduled threat hunting | Every hour | Proactive hunting sweeps |
| Prometheus metrics export | Every 60 seconds | Push metrics to Prometheus |
| Approval expiry check | Every 5 minutes | Auto-reject expired approval requests |

Network Ports Reference

| Service | Port | Protocol | Purpose |
|---|---|---|---|
| FastAPI | 8000 | HTTP/WS | REST API + WebSocket |
| Next.js Dashboard | 3000 | HTTP | Browser UI |
| Orchestrator | 9000 | HTTP (A2A) | Master agent coordinator |
| Specialist Agents | 9001-9015 | HTTP (A2A) | 15 specialist agents |
| MCP SIEM | 8101 | HTTP (MCP) | SIEM tool server |
| MCP SOAR | 8102 | HTTP (MCP) | SOAR tool server |
| MCP EDR | 8103 | HTTP (MCP) | EDR tool server |
| MCP Network | 8104 | HTTP (MCP) | Network tool server |
| MCP Malware | 8105 | HTTP (MCP) | Malware analysis tools |
| MCP Threat Intel | 8106 | HTTP (MCP) | Threat intel tool server |
| MCP CPS | 8107 | HTTP (MCP) | CPS/IoT tool server |
| MCP UEBA | 8108 | HTTP (MCP) | UEBA tool server |
| MCP Forensics | 8109 | HTTP (MCP) | Forensic tool server |
| MCP OSINT | 8110 | HTTP (MCP) | OSINT tool server |
| MCP Net Capture | 8111 | HTTP (MCP) | Packet capture tools |
| MCP Document | 8112 | HTTP (MCP) | Report generation tools |
| MCP Malware Intel | 8113 | HTTP (MCP) | Malware intelligence tools |
| MCP Cloud | 8114 | HTTP (MCP) | Cloud provider tools |
| MCP Vuln Intel | 8115 | HTTP (MCP) | Vulnerability intelligence |
| PostgreSQL | 5432 | TCP | Primary database |
| Redis | 6379 | TCP | Cache + Streams |
| Qdrant | 6333/6334 | HTTP/gRPC | Vector database |
| NATS | 4222/8222 | TCP/HTTP | Cross-site messaging |
| Mosquitto | 1883/8883 | MQTT/MQTTS | IoT broker (plain/TLS) |
| OTel Collector | 4317/4318 | gRPC/HTTP | Telemetry intake |
| Prometheus | 9090 | HTTP | Metrics database |
| Grafana | 4000 | HTTP | Dashboards |

Why This Architecture?

| Decision | Alternative | Why We Chose This |
|---|---|---|
| Redis Streams (not Kafka) | Apache Kafka | AuroraSOC runs co-located. Redis Streams gives sub-millisecond latency with zero JVM overhead. Kafka is designed for multi-datacenter scale we don't need. |
| NATS for federation (not Kafka) | Apache Kafka | NATS gives exactly-once delivery, a lightweight binary, and leaf-node clustering. Perfect for cross-site sync without Kafka's operational weight. |
| MQTT for IoT (not HTTP) | HTTP polling | Constrained devices can't afford the overhead of HTTP. MQTT is purpose-built for low-bandwidth, unreliable networks. QoS 2 guarantees exactly-once delivery. |
| A2A handoffs (not direct function calls) | Python function calls | Each agent runs as an independent HTTP service that can be scaled, restarted, or replaced independently. The network boundary enforces isolation. |
| MCP for tools (not hardcoded) | Import Python modules | MCP makes tool access auditable and swappable. Any MCP-compatible server works. Tool changes don't require agent redeployment. |

Next: AI Agents → a deep dive into every specialist agent.