Monitoring & Observability
AuroraSOC uses a comprehensive observability stack: OpenTelemetry for tracing, Prometheus for metrics, Grafana for visualization, and structured logging via structlog.
Observability Stack
OpenTelemetry Configuration
Collector (infrastructure/otel/otel-collector.yml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "aurorasoc"
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  logging:
    loglevel: info
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Python Tracing Setup
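# aurorasoc/core/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing():
    resource = Resource.create({"service.name": "aurorasoc-api", "service.version": "2.0.0"})
    # `settings` comes from the project's config module (import omitted); insecure=True means plaintext gRPC.
    exporter = OTLPSpanExporter(endpoint=settings.observability.otlp_endpoint, insecure=True)
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)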
Structured Logging
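# aurorasoc/core/logging.py
import structlog

def setup_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,  # include request-scoped context variables
            structlog.processors.add_log_level,  # adds the "level" key
            structlog.processors.TimeStamper(fmt="iso"),  # adds an ISO 8601 "timestamp" key
            structlog.processors.JSONRenderer(),  # must be last: serializes the event dict
        ],
        logger_factory=structlog.PrintLoggerFactory(),  # writes each line to stdout
    )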
Output example:
{
  "event": "alert_created",
  "alert_id": "550e8400-...",
  "severity": "critical",
  "source": "suricata",
  "dedup_hash": "a3f2b8...",
  "level": "info",
  "timestamp": "2024-01-15T10:30:00Z"
}
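Because each record is a single JSON object, downstream tooling can filter on fields rather than parse free text. The following is a minimal sketch, assuming one JSON object per line on stdout; the helper name `critical_events` and the sample lines are hypothetical and only modeled on the output shown above.

```python
import json

def critical_events(lines):
    """Yield parsed log records whose "severity" field is "critical"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log line (e.g. a stray traceback).
            continue
        if isinstance(record, dict) and record.get("severity") == "critical":
            yield record

# Illustrative input; values are placeholders, not real AuroraSOC output.
sample = [
    '{"event": "alert_created", "severity": "critical", "source": "suricata", "level": "info"}',
    '{"event": "alert_created", "severity": "low", "source": "suricata", "level": "info"}',
    "not json",
]

matches = list(critical_events(sample))
print(len(matches), matches[0]["source"])
```

Note that `level` (the log level) and `severity` (the alert's severity) are separate keys, so a critical alert can be logged at `info` level, as in the example record.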
Prometheus Scrape Targets
Configuration (infrastructure/prometheus/prometheus.yml)
| Target | Port | Path | Interval |
|---|---|---|---|
| aurorasoc-api | 8000 | /metrics | 15s |
| orchestrator | 9000 | /metrics | 15s |
| Agent services (10) | 9001-9012 | /metrics | 15s |
| rust-core | 50051 | /metrics | 15s |
| Redis | 6379 | /metrics | 15s |
| NATS | 8222 | /varz | 15s |
| PostgreSQL | 5432 | /metrics | 15s |
| Mosquitto | 1883 | /metrics | 15s |
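Each row of the table corresponds to one entry under `scrape_configs` in `prometheus.yml`. As a rough sketch of that mapping, the snippet below builds a few entries as plain dictionaries. The host names are assumptions (service names on a shared container network), the job names are illustrative, and only a subset of rows is shown.

```python
import json

# (job name, host, port, metrics path) rows taken from the table above.
# Host names are assumptions; adjust them to the actual deployment.
TARGETS = [
    ("aurorasoc-api", "aurorasoc-api", 8000, "/metrics"),
    ("orchestrator", "orchestrator", 9000, "/metrics"),
    ("rust-core", "rust-core", 50051, "/metrics"),
    ("nats", "nats", 8222, "/varz"),
]

def scrape_config(job, host, port, path, interval="15s"):
    """Build one Prometheus scrape_configs entry as a plain dict."""
    return {
        "job_name": job,
        "scrape_interval": interval,
        "metrics_path": path,
        "static_configs": [{"targets": [f"{host}:{port}"]}],
    }

scrape_configs = [scrape_config(*row) for row in TARGETS]
print(json.dumps({"scrape_configs": scrape_configs}, indent=2))
```

Prometheus reads YAML, so in practice the structure would be written out with a YAML serializer; JSON is printed here only to keep the sketch dependency-free.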
Alert Rules
Security Operations Alerts
groups:
  - name: aurorasoc_alerts
    rules:
      - alert: HighSeverityAlertSpike
        expr: rate(aurorasoc_alerts_total{severity="critical"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual spike in critical alerts"
      - alert: AgentUnresponsive
        expr: up{job=~"agent-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent service {{ $labels.instance }} is down"
      - alert: RustCoreHighLatency
        # Quantiles are computed over per-bucket rates, aggregated by the "le" label.
        expr: histogram_quantile(0.99, sum by (le) (rate(aurora_event_duration_seconds_bucket[5m]))) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rust Core p99 latency exceeds 100ms"
      - alert: EventIngestionBackpressure
        expr: aurora_redis_stream_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Stream consumer lag exceeds 10K messages"
      - alert: OrchestratorHighErrorRate
        expr: rate(aurorasoc_dispatch_errors_total[5m]) / rate(aurorasoc_dispatch_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator error rate exceeds 5%"
      - alert: MQTTBrokerDown
        expr: up{job="mosquitto"} == 0
        for: 2m
        labels:
          severity: critical
      - alert: NATSJetStreamStorageHigh
        expr: nats_jetstream_storage_bytes / nats_jetstream_storage_limit_bytes > 0.85
        for: 10m
        labels:
          severity: warning
      - alert: DatabaseConnectionPoolExhausted
        expr: aurorasoc_db_pool_available == 0
        for: 1m
        labels:
          severity: critical
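To make the arithmetic behind `OrchestratorHighErrorRate` concrete, the sketch below computes a simplified per-second rate from two counter samples and applies the 5% threshold. It is an illustration only: Prometheus's `rate()` additionally corrects for counter resets and extrapolates to the window edges, and the sample values are made up.

```python
def simple_rate(first, last, window_seconds):
    """Per-second increase between two counter samples.

    Simplification: counter resets and window-edge extrapolation,
    both handled by Prometheus' rate(), are ignored here.
    """
    return (last - first) / window_seconds

def error_ratio_exceeds(errors, total, window_seconds=300, threshold=0.05):
    """Mirror rate(errors[5m]) / rate(total[5m]) > threshold.

    errors and total are (first_sample, last_sample) counter pairs.
    """
    error_rate = simple_rate(errors[0], errors[1], window_seconds)
    total_rate = simple_rate(total[0], total[1], window_seconds)
    if total_rate == 0:
        # PromQL yields NaN for 0/0 (never above the threshold)
        # and +Inf for x/0 with x > 0 (always above it).
        return error_rate > 0
    return error_rate / total_rate > threshold

# 60 errors out of 900 dispatches in 5 minutes: about 6.7%, above 5%.
print(error_ratio_exceeds(errors=(120, 180), total=(10_000, 10_900)))
# 30 errors out of 900 dispatches: about 3.3%, below 5%.
print(error_ratio_exceeds(errors=(120, 150), total=(10_000, 10_900)))
```

Because both sides use the same window, the window length cancels out of the ratio; it still matters for how quickly the expression reacts to a burst of errors.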
CPS-Specific Alerts
  - name: aurorasoc_cps
    rules:
      - alert: CPSAttestationFailureRate
        expr: >
          rate(aurora_attestation_failures_total[5m]) /
          rate(aurora_attestation_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPS attestation failure rate exceeds 10%"
      - alert: CPSDeviceOffline
        expr: time() - aurora_device_last_seen_timestamp > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPS device {{ $labels.device_id }} offline for 5+ minutes"
      - alert: CPSTamperDetected
        expr: aurora_tamper_events_total > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Physical tamper detected on {{ $labels.device_id }}"
      - alert: CPSFirmwareRollback
        expr: aurora_device_boot_count < aurora_device_boot_count offset 1h
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Firmware rollback detected on {{ $labels.device_id }}"
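The `for:` field in these rules controls how long an expression must stay true before the alert fires: until then the alert is pending, and `for: 0m` fires on the first evaluation that matches. A small sketch of that state progression follows; the function name and the 60-second evaluation interval are assumptions chosen for illustration.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true), in time order.
    """
    states = []
    active_since = None  # timestamp at which the condition last became true
    for ts, condition in samples:
        if not condition:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Condition true at every evaluation, evaluated every 60 s (assumed interval).
evaluations = [(t, True) for t in range(0, 301, 60)]

# Like CPSDeviceOffline (for: 5m): pending until the condition has held 300 s.
offline = alert_states(evaluations, for_seconds=300)
# Like CPSTamperDetected (for: 0m): fires on the first matching evaluation.
tamper = alert_states(evaluations, for_seconds=0)
print(offline)
print(tamper)
```

With `CPSDeviceOffline`, the 300-second threshold in the expression and the 5-minute `for:` add up, so a device is reported roughly ten minutes after it was last seen.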
Grafana Dashboards
AuroraSOC ships with pre-configured Grafana dashboards:
| Dashboard | Panels | Purpose |
|---|---|---|
| SOC Overview | Alert rates, case status, agent health | Executive summary |
| Agent Performance | Task duration, error rates, throughput | AI agent monitoring |
| CPS/IoT | Device status, attestation rates, tamper events | Physical security |
| Infrastructure | Redis, NATS, PostgreSQL, MQTT health | System health |
Mosquitto MQTT Configuration
Production (infrastructure/mosquitto/mosquitto.conf)
# TLS listener (production)
listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
require_certificate true
# X.509 CN becomes the MQTT username
use_identity_as_username true
tls_version tlsv1.3

# Dev listener (no auth)
listener 1883
allow_anonymous true

# Persistence
persistence true
persistence_location /mosquitto/data/

# Limits
# 256 KiB
max_packet_size 262144
ACL (infrastructure/mosquitto/acl.conf)
# Device ACL: each device can only write to its own topics
pattern write aurora/sensors/%u/telemetry
pattern write aurora/sensors/%u/alerts
pattern write aurora/sensors/%u/status
pattern read aurora/command/%u/action

# Service accounts
user rust-core-svc
topic read aurora/sensors/#
topic read aurora/attestation/#

user cps-agent-svc
topic read aurora/sensors/#
topic write aurora/command/#
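The `pattern` lines rely on `%u` being replaced with the client's username (here the certificate CN) before the topic is matched. The sketch below approximates that check for the device write rules. It is not Mosquitto's implementation: the function names are hypothetical, and the username guard is an added precaution so that a crafted username cannot widen a pattern.

```python
def topic_matches(pattern, topic):
    """MQTT-style match: '+' matches one level, '#' matches the remainder."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            # '#' is only valid as the last level and matches everything below.
            return True
        if i >= len(t_levels):
            return False
        if p != "+" and p != t_levels[i]:
            return False
    return len(p_levels) == len(t_levels)

# Write patterns copied from the device section of the ACL above.
DEVICE_WRITE_PATTERNS = [
    "aurora/sensors/%u/telemetry",
    "aurora/sensors/%u/alerts",
    "aurora/sensors/%u/status",
]

def device_may_write(username, topic):
    """Return True if the device's own patterns allow publishing to topic."""
    # Assumption: reject usernames containing wildcard or separator characters.
    if any(ch in username for ch in "+#/"):
        return False
    return any(
        topic_matches(p.replace("%u", username), topic)
        for p in DEVICE_WRITE_PATTERNS
    )

print(device_may_write("dev1", "aurora/sensors/dev1/telemetry"))
print(device_may_write("dev1", "aurora/sensors/dev2/telemetry"))
```

The same `topic_matches` helper covers the service-account rules, for example `aurora/sensors/#` for `rust-core-svc`.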
NATS JetStream Configuration
# infrastructure/nats/nats-server.conf
server_name: aurora-nats-1

jetstream {
  store_dir: /data/nats
  max_mem: 512MB
  max_file: 10GB
}

accounts {
  AURORA {
    jetstream: enabled
    users: [{ user: aurora, password: $NATS_PASSWORD }]
  }
  SYS {
    users: [{ user: sys, password: $NATS_SYS_PASSWORD }]
  }
}

max_connections: 1024
max_payload: 1MB