
Monitoring & Observability

AuroraSOC uses a comprehensive observability stack: OpenTelemetry for tracing, Prometheus for metrics, Grafana for visualization, and structured logging via structlog.

Observability Stack

OpenTelemetry Configuration

Collector (infrastructure/otel/otel-collector.yml)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "aurorasoc"
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  logging:
    loglevel: info

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Python Tracing Setup

# aurorasoc/core/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing():
    # `settings` is the application config object (provides the OTLP endpoint).
    resource = Resource.create({"service.name": "aurorasoc-api", "service.version": "2.0.0"})
    exporter = OTLPSpanExporter(endpoint=settings.observability.otlp_endpoint, insecure=True)
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

Structured Logging

# aurorasoc/core/logging.py
import structlog

def setup_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ],
        logger_factory=structlog.PrintLoggerFactory(),
    )

Output example:

{
  "event": "alert_created",
  "alert_id": "550e8400-...",
  "severity": "critical",
  "source": "suricata",
  "dedup_hash": "a3f2b8...",
  "level": "info",
  "timestamp": "2024-01-15T10:30:00Z"
}
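The `dedup_hash` field lets downstream consumers collapse duplicate alerts. The exact fields AuroraSOC hashes are not shown here; a plausible stdlib sketch, assuming the hash covers the alert source, rule, and target host:

```python
import hashlib
import json

def dedup_hash(source: str, rule_id: str, host: str) -> str:
    """Stable hash over the fields that define "the same alert".

    The field choice is illustrative; the real implementation may differ.
    """
    key = json.dumps([source, rule_id, host], separators=(",", ":"))
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# Identical inputs always map to the same hash, so repeated alerts can be dropped.
assert dedup_hash("suricata", "ET-2024", "10.0.0.5") == dedup_hash("suricata", "ET-2024", "10.0.0.5")
```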

Prometheus Scrape Targets

Configuration (infrastructure/prometheus/prometheus.yml)

Target               Port        Path      Interval
aurorasoc-api        8000        /metrics  15s
orchestrator         9000        /metrics  15s
Agent services (10)  9001-9012   /metrics  15s
rust-core            50051       /metrics  15s
Redis                6379        /metrics  15s
NATS                 8222        /varz     15s
PostgreSQL           5432        /metrics  15s
Mosquitto            1883        /metrics  15s
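A corresponding `scrape_configs` fragment might look like the following (job names and static targets are illustrative; the full file lives at infrastructure/prometheus/prometheus.yml):

```yaml
scrape_configs:
  - job_name: aurorasoc-api
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["aurorasoc-api:8000"]

  - job_name: nats
    metrics_path: /varz        # NATS monitoring endpoint; typically bridged via an exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["nats:8222"]
```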

Alert Rules

Security Operations Alerts

groups:
  - name: aurorasoc_alerts
    rules:
      - alert: HighSeverityAlertSpike
        expr: rate(aurorasoc_alerts_total{severity="critical"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual spike in critical alerts"

      - alert: AgentUnresponsive
        expr: up{job=~"agent-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent service {{ $labels.instance }} is down"

      - alert: RustCoreHighLatency
        expr: histogram_quantile(0.99, rate(aurora_event_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rust Core p99 latency exceeds 100ms"

      - alert: EventIngestionBackpressure
        expr: aurora_redis_stream_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Stream consumer lag exceeds 10K messages"

      - alert: OrchestratorHighErrorRate
        expr: rate(aurorasoc_dispatch_errors_total[5m]) / rate(aurorasoc_dispatch_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator error rate exceeds 5%"

      - alert: MQTTBrokerDown
        expr: up{job="mosquitto"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: NATSJetStreamStorageHigh
        expr: nats_jetstream_storage_bytes / nats_jetstream_storage_limit_bytes > 0.85
        for: 10m
        labels:
          severity: warning

      - alert: DatabaseConnectionPoolExhausted
        expr: aurorasoc_db_pool_available == 0
        for: 1m
        labels:
          severity: critical
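These rules assume the services export the referenced metrics. A hedged sketch of how the orchestrator counters might be registered with prometheus_client (the metric names match the rules above; the `dispatch` wrapper and its behavior are assumptions):

```python
from prometheus_client import CollectorRegistry, Counter

# Separate registry to keep the example self-contained; services would
# normally use the default REGISTRY exposed at /metrics.
registry = CollectorRegistry()

dispatch_total = Counter(
    "aurorasoc_dispatch_total", "Tasks dispatched to agents", registry=registry
)
dispatch_errors = Counter(
    "aurorasoc_dispatch_errors_total", "Dispatch failures", registry=registry
)

def dispatch(task):
    # Count every attempt; count failures separately so the alert rule's
    # error-rate ratio (errors / total) can be computed server-side.
    dispatch_total.inc()
    try:
        pass  # ... hand the task to an agent ...
    except Exception:
        dispatch_errors.inc()
        raise

dispatch("triage")
```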

CPS-Specific Alerts

  - name: aurorasoc_cps
    rules:
      - alert: CPSAttestationFailureRate
        expr: >
          rate(aurora_attestation_failures_total[5m]) /
          rate(aurora_attestation_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPS attestation failure rate exceeds 10%"

      - alert: CPSDeviceOffline
        expr: time() - aurora_device_last_seen_timestamp > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPS device {{ $labels.device_id }} offline for 5+ minutes"

      - alert: CPSTamperDetected
        expr: increase(aurora_tamper_events_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Physical tamper detected on {{ $labels.device_id }}"

      - alert: CPSFirmwareRollback
        expr: aurora_device_boot_count < aurora_device_boot_count offset 1h
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Firmware rollback detected on {{ $labels.device_id }}"

Grafana Dashboards

AuroraSOC ships with pre-configured Grafana dashboards:

Dashboard          Panels                                           Purpose
SOC Overview       Alert rates, case status, agent health           Executive summary
Agent Performance  Task duration, error rates, throughput           AI agent monitoring
CPS/IoT            Device status, attestation rates, tamper events  Physical security
Infrastructure     Redis, NATS, PostgreSQL, MQTT health             System health

Mosquitto MQTT Configuration

Production (infrastructure/mosquitto/mosquitto.conf)

# TLS listener (production)
listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
require_certificate true
# x.509 CN becomes the MQTT username
use_identity_as_username true
tls_version tlsv1.3

# Dev listener (no auth)
listener 1883
allow_anonymous true

# Persistence
persistence true
persistence_location /mosquitto/data/

# Limits: 256 KB max packet
max_packet_size 262144
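On the client side, the listener's requirements (mutual TLS, TLS 1.3) translate into an `ssl.SSLContext` like the one below. This is a stdlib sketch of the policy only; a real device client would also load the CA and its certificate (the commented paths mirror the broker config above):

```python
import ssl

def make_client_tls_context() -> ssl.SSLContext:
    # Verify the broker against the AuroraSOC CA and insist on TLS 1.3,
    # mirroring `tls_version tlsv1.3` on the listener.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = True
    # ctx.load_verify_locations("/mosquitto/certs/ca.crt")  # CA from the broker config
    # ctx.load_cert_chain("client.crt", "client.key")       # device identity (x.509 CN)
    return ctx

ctx = make_client_tls_context()
```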

ACL (infrastructure/mosquitto/acl.conf)

# Device ACL — each device can only write to its own topics
pattern write aurora/sensors/%u/telemetry
pattern write aurora/sensors/%u/alerts
pattern write aurora/sensors/%u/status
pattern read aurora/command/%u/action

# Service accounts
user rust-core-svc
topic read aurora/sensors/#
topic read aurora/attestation/#

user cps-agent-svc
topic read aurora/sensors/#
topic write aurora/command/#
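The `%u` placeholder in the pattern lines expands to the authenticated username, which under the TLS listener is the device certificate's CN. A small sketch of the resulting authorization check (this logic lives inside Mosquitto; the `may_write` function is purely illustrative):

```python
# Patterns from acl.conf: %u is replaced with the connecting client's username.
WRITE_PATTERNS = [
    "aurora/sensors/%u/telemetry",
    "aurora/sensors/%u/alerts",
    "aurora/sensors/%u/status",
]

def may_write(username: str, topic: str) -> bool:
    """True if the device may publish to `topic` under the pattern ACL."""
    return any(p.replace("%u", username) == topic for p in WRITE_PATTERNS)

assert may_write("sensor-17", "aurora/sensors/sensor-17/telemetry")
# A device cannot publish into another device's topic tree:
assert not may_write("sensor-17", "aurora/sensors/sensor-99/telemetry")
```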

NATS JetStream Configuration

# infrastructure/nats/nats-server.conf
server_name: aurora-nats-1

jetstream {
  store_dir: /data/nats
  max_mem: 512MB
  max_file: 10GB
}

accounts {
  AURORA {
    jetstream: enabled
    users: [{ user: aurora, password: $NATS_PASSWORD }]
  }
  SYS {
    users: [{ user: sys, password: $NATS_SYS_PASSWORD }]
  }
}

max_connections: 1024
max_payload: 1MB