
System Architecture — End-to-End Data Flows

This page explains how every component in AuroraSOC connects and communicates. You'll learn what happens when an alert enters the system, how agents coordinate to investigate it, and how results reach the analyst's dashboard.


High-Level Architecture


The Five Core Flows

Flow 1: External Alert Ingestion

What happens when an alert enters the system from an external source (SIEM, third-party feed, etc.).

External Source → FastAPI POST /api/alerts → Deduplication Check → PostgreSQL → Redis Stream (aurora:alerts)
  1. External source sends an HTTP POST to /api/alerts with alert data (title, severity, source, IOCs, MITRE techniques).
  2. FastAPI endpoint validates the request using Pydantic models and checks JWT authentication.
  3. Deduplication — a SHA-256 hash is computed from title + source + IOCs. If an alert with the same hash already exists, the incoming alert is dropped as a duplicate instead of being stored again.
  4. PostgreSQL write — the alert is stored in the alerts table.
  5. Redis Stream publish — the alert is published to aurora:alerts so background consumers (agents, WebSocket broadcasters) can react.
  6. WebSocket broadcast — the FastAPI background consumer reads from aurora:alerts and pushes the alert to all connected dashboard clients via WebSocket.
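The dedup-and-publish core of steps 3-5 can be sketched as follows. This is a minimal illustration, not the API's actual code: the in-memory `seen` set stands in for the PostgreSQL/Redis dedup store, and the field names are assumed from the description above.

```python
import hashlib
import json

def dedup_hash(title: str, source: str, iocs: list[str]) -> str:
    """Compute the deduplication key from step 3: SHA-256 over
    title + source + IOCs. Sorting the IOCs makes the hash
    independent of the order they arrive in."""
    payload = json.dumps({"title": title, "source": source, "iocs": sorted(iocs)})
    return hashlib.sha256(payload.encode()).hexdigest()

seen: set[str] = set()  # stand-in for the persistent dedup store

def ingest(alert: dict) -> bool:
    """Return True if the alert is new (store + publish), False if it
    is a duplicate. The real endpoint would INSERT into the alerts
    table and XADD to the aurora:alerts Redis Stream here."""
    h = dedup_hash(alert["title"], alert["source"], alert.get("iocs", []))
    if h in seen:
        return False
    seen.add(h)
    return True
```

Note that sorting the IOC list means two alerts differing only in IOC order deduplicate to the same record.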

Flow 2: IoT/CPS Alert Ingestion (Edge → Rust → Redis)

What happens when a physical device detects something unusual.

Firmware → MQTT (mTLS) → Mosquitto → Rust Core → Normalize → Redis Stream (aurora:cps:alerts)
  1. Firmware (running on STM32, nRF52, or ESP32-S3) detects an anomaly — a tamper event, unexpected firmware write, debug interface activation, or sensor reading outside normal range.
  2. MQTT publish — the device publishes a message to a topic like aurora/devices/{device_id}/telemetry using QoS 2 (exactly-once) over mTLS (both sides verify certificates).
  3. Mosquitto broker receives and routes the message.
  4. Rust Core Engine subscribes to MQTT topics. When a message arrives:
    • Normalizes the raw payload into a unified alert format (regardless of whether the original was CEF, Syslog, JSON, or raw binary).
    • Verifies attestation if the message includes a hardware attestation certificate (ECDSA P-256 signature verification).
    • Publishes the normalized alert to aurora:cps:alerts Redis Stream.
  5. Redis Stream delivers the CPS alert to the API and agent layer.
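The normalization in step 4 can be sketched like this (in Python for consistency with the rest of this page's examples, even though the real component is the Rust Core Engine; the output field names and the CEF field positions are illustrative, not the engine's actual schema):

```python
import json
from datetime import datetime, timezone

def normalize(device_id: str, raw: bytes) -> dict:
    """Turn whatever arrived on the wire (CEF, JSON, or raw binary)
    into one unified alert shape ready for aurora:cps:alerts."""
    text = raw.decode("utf-8", errors="replace")
    if text.startswith("CEF:"):
        # CEF: version|vendor|product|dev_version|sig_id|name|severity|ext
        fields = text.split("|")
        detail = {"signature": fields[4], "name": fields[5], "severity": fields[6]}
    else:
        try:
            detail = json.loads(text)           # JSON telemetry
        except json.JSONDecodeError:
            detail = {"raw_hex": raw.hex()}     # unknown/binary payload
    return {
        "device_id": device_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
    }
```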

Flow 3: Agent Investigation (Orchestrator → Specialists)

What happens when the system decides to investigate an alert.

Alert → Orchestrator → ThinkTool → HandoffTool[Specialist] → MCP Tools → Results → Synthesize → Report

This is the most important flow. Let's trace it step by step:

  1. Trigger — an alert arrives (via REST API or Redis Stream) that requires investigation.
  2. Orchestrator receives the task — via the aurora:agent:tasks Redis Stream or an API call.
  3. ThinkTool (mandatory first step) — Before taking any action, the Orchestrator is forced by ConditionalRequirement(ThinkTool, force_at_step=1) to write a structured reasoning plan:
    • What type of alert is this?
    • Which specialist agents are needed?
    • What order should they investigate in?
    • Are any CPS/IoT assets involved?
  4. HandoffTool delegation — The Orchestrator uses BeeAI's HandoffTool to delegate tasks to specialist agents. Each HandoffTool points to a specific specialist's A2A (Agent-to-Agent) HTTP endpoint:
    delegate_to_SecurityAnalyst → http://agent-security-analyst-svc:9001
    delegate_to_ThreatHunter → http://agent-threat-hunter-svc:9002
  5. Specialist receives task via A2A — The specialist agent's BeeAI A2AServer receives the task.
  6. Specialist uses MCP tools — The specialist calls its domain-specific tools. For example, SecurityAnalyst connects to:
    • SIEM MCP server (:8101) to search logs
    • SOAR MCP server (:8102) to check existing playbooks
    • OSINT MCP server (:8110) to enrich IOCs
  7. Specialist returns findings — structured JSON with severity, confidence, MITRE techniques, IOCs, recommendations.
  8. Orchestrator synthesizes — collects all specialist findings, resolves conflicts, computes overall severity.
  9. Human approval gate — if any recommended action is high-risk (e.g., isolating a production server, revoking certificates), the Orchestrator flags requires_human_approval = true. The API creates a HumanApprovalModel record and notifies the dashboard via WebSocket. The flow pauses until a human analyst approves or rejects.
  10. Results published — the final investigation result is published to aurora:agent:results and stored in PostgreSQL.
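Steps 8-9 can be sketched in miniature: overall severity is the worst specialist verdict, and any high-risk recommended action trips the human approval gate. The severity scale and the action names in `HIGH_RISK_ACTIONS` are assumptions for illustration, not the Orchestrator's actual vocabulary.

```python
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]
HIGH_RISK_ACTIONS = {"isolate_host", "revoke_cert"}  # assumed action names

def synthesize(findings: list[dict]) -> dict:
    """Collect specialist findings, take the worst severity, and flag
    requires_human_approval when any recommendation is high-risk."""
    overall = max(
        (f["severity"] for f in findings),
        key=SEVERITY_ORDER.index,
        default="info",
    )
    actions = [a for f in findings for a in f.get("recommendations", [])]
    return {
        "severity": overall,
        "recommendations": actions,
        "requires_human_approval": any(a in HIGH_RISK_ACTIONS for a in actions),
    }
```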

Flow 4: Agent Memory and Learning

How agents remember past investigations.

Investigation Complete → Summary Embedded → Qdrant Store → Future Query → Similar Cases Retrieved
  1. Tiered Memory — each agent has 3 memory tiers:

    • Tier 1 (Working): SlidingMemory — last 20-60 messages in the current conversation. This is what the LLM sees as context.
    • Tier 2 (Episodic): EpisodicMemoryStore backed by Qdrant — completed investigation summaries stored as vector embeddings.
    • Tier 3 (Threat Intel): ThreatIntelMemory backed by Qdrant + Redis — cached IOC lookups and threat intelligence.
  2. Auto-persist — after a configurable number of messages (e.g., every 20 add() calls for ANALYST_MEMORY), the memory system automatically persists a summary of the current conversation to the episodic store.

  3. Recall — when a new alert arrives that the agent hasn't seen before, it can call recall_similar("DNS tunneling via TXT records") which:

    • Embeds the query text into a vector using Sentence Transformers
    • Searches Qdrant for the 5 most similar past investigation summaries
    • Returns them as additional context for the current investigation
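The persist/recall loop of the episodic tier can be sketched as follows. This stand-in stores vectors in a plain list instead of Qdrant and takes pre-computed vectors rather than embedding text with Sentence Transformers; the class and method names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class EpisodicStore:
    """Stand-in for the Qdrant-backed episodic tier: persist
    (vector, summary) pairs, return the top-k most similar summaries."""
    def __init__(self):
        self._items = []

    def persist(self, vector, summary):
        self._items.append((vector, summary))

    def recall_similar(self, query_vector, k=5):
        ranked = sorted(self._items,
                        key=lambda it: cosine(it[0], query_vector),
                        reverse=True)
        return [summary for _, summary in ranked[:k]]
```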

Flow 5: Cross-Site Federation (NATS JetStream)

How multiple AuroraSOC deployments share intelligence.

Site A detects IOC → NATS publish (aurora.intel.ioc.*) → Site B receives → IOC stored locally
  1. Site A — an analyst or agent identifies a new IOC (malicious IP, domain, file hash).
  2. NATS publish — the IOC is published to a NATS JetStream subject like aurora.intel.ioc.ip.
  3. Cross-site delivery — NATS JetStream provides exactly-once delivery with message acknowledgment. The IOC is persisted in NATS until acknowledged by all subscriber sites.
  4. Site B receives — the IOC consumer at Site B receives the message, validates it, and stores it locally for use by agents.
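The subject naming in step 2 can be sketched as a small helper. Only the `ip` subject appears above; the other type names are assumptions following the same aurora.intel.ioc.* pattern.

```python
def ioc_subject(ioc_type: str) -> str:
    """Map an IOC type to its NATS JetStream subject under the
    aurora.intel.ioc.* pattern (type names other than 'ip' assumed)."""
    allowed = {"ip", "domain", "hash"}
    if ioc_type not in allowed:
        raise ValueError(f"unknown IOC type: {ioc_type}")
    return f"aurora.intel.ioc.{ioc_type}"
```

Subscribers at other sites would then consume with the wildcard subject `aurora.intel.ioc.*` to receive every IOC type at once.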

Federation subjects:

| Subject Pattern | Content | Purpose |
|---|---|---|
| aurora.intel.ioc.* | IOC records | Share threat indicators across sites |
| aurora.alerts.federation | Alert summaries | Cross-site alert visibility |
| aurora.cps.attestation | Attestation results | Share device trust status |

Authentication and Authorization

Every request to the API must be authenticated. Here's how:

JWT Token Flow
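
A minimal sketch of issuing and verifying a token, assuming HS256 signing with only the standard library. The secret, claim names, and TTL here are illustrative; the production API would load the secret from configuration and likely use a dedicated JWT library.

```python
import base64, hashlib, hmac, json, time

SECRET = b"change-me"  # assumed: loaded from config in a real deployment

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(username: str, role: str, ttl: int = 3600) -> str:
    """Build header.payload.signature, signing with HMAC-SHA256."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": username, "role": role,
                               "exp": int(time.time()) + ttl}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> dict:
    """Check the signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Every authenticated endpoint runs the equivalent of `verify_token` before the RBAC check below is applied to the `role` claim.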

RBAC (Role-Based Access Control)

The system has 5 roles, each with different permissions:

| Role | Who | Can Do |
|---|---|---|
| admin | System administrator | Everything: manage users, configure system, full API access |
| analyst | SOC analyst | View/investigate alerts, manage cases, run playbooks |
| operator | SOC operator | View alerts and cases, limited playbook execution |
| viewer | Read-only user | View dashboards and reports only |
| api_service | Service account | Agent-to-API communication, automated tool access |

Permissions are granular — each endpoint requires specific permissions. For example:

  • alert:create — only admin, analyst, api_service
  • case:investigate — only admin, analyst
  • playbook:execute — only admin, analyst, operator
  • device:revoke_cert — only admin
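A permission check built from the examples above might look like this. The role-to-permission map mirrors the four example permissions listed; the real system's full permission set is larger.

```python
# Mirrors the example permissions above; illustrative, not exhaustive.
ROLE_PERMISSIONS = {
    "admin":       {"alert:create", "case:investigate",
                    "playbook:execute", "device:revoke_cert"},
    "analyst":     {"alert:create", "case:investigate", "playbook:execute"},
    "operator":    {"playbook:execute"},
    "viewer":      set(),
    "api_service": {"alert:create"},
}

def require_permission(role: str, permission: str) -> None:
    """Raise PermissionError unless the role holds the permission,
    mirroring the per-endpoint checks described above."""
    if permission not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role} lacks {permission}")
```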

Circuit Breaker Pattern

The Orchestrator uses a circuit breaker to handle failing specialist agents gracefully:

  • CLOSED — normal operation. Requests flow to the specialist.
  • OPEN — after 5 consecutive failures, the circuit "opens." All requests are immediately rejected without trying to contact the failing agent. This prevents cascading failures.
  • HALF_OPEN — after 60 seconds, one test request is allowed through. If it succeeds, the circuit closes (agent is recovered). If it fails, the circuit reopens.

Retries use exponential backoff via the tenacity library — wait 1s, 2s, 4s between retries, up to 3 attempts.
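The three-state machine above can be sketched as a small class. The thresholds match the text (5 consecutive failures, 60-second recovery window); the class itself is an illustration, not the Orchestrator's actual implementation, and it takes an injectable clock so the timeout can be tested without sleeping.

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `failure_threshold` consecutive failures;
    OPEN -> HALF_OPEN after `recovery_timeout` seconds; a successful
    probe in HALF_OPEN closes the circuit, a failure reopens it."""
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.state = "HALF_OPEN"   # allow one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result
```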


Background Services

The FastAPI app runs several background workers:

| Service | What It Does | How |
|---|---|---|
| Alert WebSocket Broadcaster | Reads aurora:alerts stream, pushes to WebSocket clients | asyncio.create_task() in app lifespan |
| Agent Thought Broadcaster | Reads aurora:audit stream, pushes agent reasoning traces | asyncio.create_task() in app lifespan |
| Scheduler | Runs periodic tasks | AsyncIOScheduler (APScheduler) |

Scheduled Tasks

The scheduler runs these periodic jobs:

| Job | Frequency | Purpose |
|---|---|---|
| Alert deduplication cleanup | Every 5 minutes | Remove expired dedup hashes |
| Scheduled threat hunting | Every hour | Proactive hunting sweeps |
| Prometheus metrics export | Every 60 seconds | Push metrics to Prometheus |
| Approval expiry check | Every 5 minutes | Auto-reject expired approval requests |

Network Ports Reference

| Service | Port | Protocol | Purpose |
|---|---|---|---|
| FastAPI | 8000 | HTTP/WS | REST API + WebSocket |
| Next.js Dashboard | 3000 | HTTP | Browser UI |
| Orchestrator | 9000 | HTTP (A2A) | Master agent coordinator |
| Specialist Agents | 9001-9015 | HTTP (A2A) | 15 specialist agents |
| MCP SIEM | 8101 | HTTP (MCP) | SIEM tool server |
| MCP SOAR | 8102 | HTTP (MCP) | SOAR tool server |
| MCP EDR | 8103 | HTTP (MCP) | EDR tool server |
| MCP Network | 8104 | HTTP (MCP) | Network tool server |
| MCP Malware | 8105 | HTTP (MCP) | Malware analysis tools |
| MCP Threat Intel | 8106 | HTTP (MCP) | Threat intel tool server |
| MCP CPS | 8107 | HTTP (MCP) | CPS/IoT tool server |
| MCP UEBA | 8108 | HTTP (MCP) | UEBA tool server |
| MCP Forensics | 8109 | HTTP (MCP) | Forensic tool server |
| MCP OSINT | 8110 | HTTP (MCP) | OSINT tool server |
| MCP Net Capture | 8111 | HTTP (MCP) | Packet capture tools |
| MCP Document | 8112 | HTTP (MCP) | Report generation tools |
| MCP Malware Intel | 8113 | HTTP (MCP) | Malware intelligence tools |
| MCP Cloud | 8114 | HTTP (MCP) | Cloud provider tools |
| MCP Vuln Intel | 8115 | HTTP (MCP) | Vulnerability intelligence |
| PostgreSQL | 5432 | TCP | Primary database |
| Redis | 6379 | TCP | Cache + Streams |
| Qdrant | 6333/6334 | HTTP/gRPC | Vector database |
| NATS | 4222/8222 | TCP/HTTP | Cross-site messaging |
| Mosquitto | 1883/8883 | MQTT/MQTTS | IoT broker (plain/TLS) |
| OTel Collector | 4317/4318 | gRPC/HTTP | Telemetry intake |
| Prometheus | 9090 | HTTP | Metrics database |
| Grafana | 4000 | HTTP | Dashboards |

Why This Architecture?

| Decision | Alternative | Why We Chose This |
|---|---|---|
| Redis Streams (not Kafka) | Apache Kafka | AuroraSOC runs co-located. Redis Streams gives sub-millisecond latency with zero JVM overhead. Kafka is designed for multi-datacenter scale we don't need. |
| NATS for federation (not Kafka) | Apache Kafka | NATS gives exactly-once delivery, a lightweight binary, and leaf-node clustering. Perfect for cross-site sync without Kafka's operational weight. |
| MQTT for IoT (not HTTP) | HTTP polling | Constrained devices can't afford the overhead of HTTP. MQTT is purpose-built for low-bandwidth, unreliable networks. QoS 2 guarantees exactly-once delivery. |
| A2A handoffs (not direct function calls) | Python function calls | Each agent runs as an independent HTTP service that can be scaled, restarted, or replaced independently. The network boundary enforces isolation. |
| MCP for tools (not hardcoded) | Import Python modules | MCP makes tool access auditable and swappable. Any MCP-compatible server works. Tool changes don't require agent redeployment. |

Next: AI Agents → a deep dive into every specialist agent.