
Monitoring & Observability

AuroraSOC uses a comprehensive observability stack: OpenTelemetry for tracing, Prometheus for metrics, Grafana for visualization, and structured logging via structlog.

Observability Stack

OpenTelemetry Configuration

Collector (infrastructure/otel/otel-collector.yml)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "aurorasoc"
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  logging:
    loglevel: info

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Python Tracing Setup

# aurorasoc/core/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing():
    # `settings` is the application config object (provides the OTLP endpoint).
    resource = Resource.create({"service.name": "aurorasoc-api", "service.version": "2.0.0"})
    exporter = OTLPSpanExporter(endpoint=settings.observability.otlp_endpoint, insecure=True)
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

Structured Logging

# aurorasoc/core/logging.py
import structlog

def setup_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ],
        logger_factory=structlog.PrintLoggerFactory(),
    )

Output example:

{
  "event": "alert_created",
  "alert_id": "550e8400-...",
  "severity": "critical",
  "source": "suricata",
  "dedup_hash": "a3f2b8...",
  "level": "info",
  "timestamp": "2024-01-15T10:30:00Z"
}
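The `dedup_hash` field lets downstream consumers collapse duplicate alerts. The exact fields AuroraSOC hashes are not shown here; a plausible stdlib sketch, assuming the hash covers the alert source, rule, and target host:

```python
import hashlib
import json

def dedup_hash(source: str, rule_id: str, host: str) -> str:
    """Stable hash over the fields that define "the same alert".

    The field choice is illustrative; the real implementation may differ.
    """
    key = json.dumps([source, rule_id, host], separators=(",", ":"))
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# Identical inputs always map to the same hash, so repeated alerts can be dropped.
assert dedup_hash("suricata", "ET-2024", "10.0.0.5") == dedup_hash("suricata", "ET-2024", "10.0.0.5")
```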

Prometheus Scrape Targets

Configuration (infrastructure/prometheus/prometheus.yml)

Target               Port        Path      Interval
aurorasoc-api        8000        /metrics  15s
orchestrator         9000        /metrics  15s
Agent services (10)  9001-9012   /metrics  15s
rust-core            50051       /metrics  15s
Redis                6379        /metrics  15s
NATS                 8222        /varz     15s
PostgreSQL           5432        /metrics  15s
Mosquitto            1883        /metrics  15s
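A corresponding `scrape_configs` fragment might look like the following (job names and static targets are illustrative; the full file lives at infrastructure/prometheus/prometheus.yml):

```yaml
scrape_configs:
  - job_name: aurorasoc-api
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["aurorasoc-api:8000"]

  - job_name: nats
    metrics_path: /varz        # NATS monitoring endpoint; typically bridged via an exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["nats:8222"]
```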

Alert Rules

Security Operations Alerts

groups:
  - name: aurorasoc_alerts
    rules:
      - alert: HighSeverityAlertSpike
        expr: rate(aurorasoc_alerts_total{severity="critical"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual spike in critical alerts"

      - alert: AgentUnresponsive
        expr: up{job=~"agent-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent service {{ $labels.instance }} is down"

      - alert: RustCoreHighLatency
        expr: histogram_quantile(0.99, rate(aurora_event_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rust Core p99 latency exceeds 100ms"

      - alert: EventIngestionBackpressure
        expr: aurora_redis_stream_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Stream consumer lag exceeds 10K messages"

      - alert: OrchestratorHighErrorRate
        expr: rate(aurorasoc_dispatch_errors_total[5m]) / rate(aurorasoc_dispatch_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator error rate exceeds 5%"

      - alert: MQTTBrokerDown
        expr: up{job="mosquitto"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: NATSJetStreamStorageHigh
        expr: nats_jetstream_storage_bytes / nats_jetstream_storage_limit_bytes > 0.85
        for: 10m
        labels:
          severity: warning

      - alert: DatabaseConnectionPoolExhausted
        expr: aurorasoc_db_pool_available == 0
        for: 1m
        labels:
          severity: critical
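These rules assume the services export the referenced metrics. A hedged sketch of how the orchestrator counters might be registered with prometheus_client (the metric names match the rules above; the `dispatch` wrapper and its behavior are assumptions):

```python
from prometheus_client import CollectorRegistry, Counter

# Separate registry to keep the example self-contained; services would
# normally use the default REGISTRY exposed at /metrics.
registry = CollectorRegistry()

dispatch_total = Counter(
    "aurorasoc_dispatch_total", "Tasks dispatched to agents", registry=registry
)
dispatch_errors = Counter(
    "aurorasoc_dispatch_errors_total", "Dispatch failures", registry=registry
)

def dispatch(task):
    # Count every attempt; count failures separately so the alert rule's
    # error-rate ratio (errors / total) can be computed server-side.
    dispatch_total.inc()
    try:
        pass  # ... hand the task to an agent ...
    except Exception:
        dispatch_errors.inc()
        raise

dispatch("triage")
```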

CPS-Specific Alerts

  - name: aurorasoc_cps
    rules:
      - alert: CPSAttestationFailureRate
        expr: >
          rate(aurora_attestation_failures_total[5m]) /
          rate(aurora_attestation_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPS attestation failure rate exceeds 10%"

      - alert: CPSDeviceOffline
        expr: time() - aurora_device_last_seen_timestamp > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPS device {{ $labels.device_id }} offline for 5+ minutes"

      - alert: CPSTamperDetected
        expr: increase(aurora_tamper_events_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Physical tamper detected on {{ $labels.device_id }}"

      - alert: CPSFirmwareRollback
        expr: aurora_device_boot_count < aurora_device_boot_count offset 1h
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Firmware rollback detected on {{ $labels.device_id }}"

Grafana Dashboards

AuroraSOC ships with pre-configured Grafana dashboards:

Dashboard          Panels                                           Purpose
SOC Overview       Alert rates, case status, agent health           Executive summary
Agent Performance  Task duration, error rates, throughput           AI agent monitoring
CPS/IoT            Device status, attestation rates, tamper events  Physical security
Infrastructure     Redis, NATS, PostgreSQL, MQTT health             System health

Mosquitto MQTT Configuration

Production (infrastructure/mosquitto/mosquitto.conf)

# TLS listener (production)
listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
require_certificate true
# x.509 CN becomes the MQTT username
use_identity_as_username true
tls_version tlsv1.3

# Dev listener (no auth)
listener 1883
allow_anonymous true

# Persistence
persistence true
persistence_location /mosquitto/data/

# Limits: 256 KB max packet
max_packet_size 262144
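On the client side, the listener's requirements (mutual TLS, TLS 1.3) translate into an `ssl.SSLContext` like the one below. This is a stdlib sketch of the policy only; a real device client would also load the CA and its certificate (the commented paths mirror the broker config above):

```python
import ssl

def make_client_tls_context() -> ssl.SSLContext:
    # Verify the broker against the AuroraSOC CA and insist on TLS 1.3,
    # mirroring `tls_version tlsv1.3` on the listener.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = True
    # ctx.load_verify_locations("/mosquitto/certs/ca.crt")  # CA from the broker config
    # ctx.load_cert_chain("client.crt", "client.key")       # device identity (x.509 CN)
    return ctx

ctx = make_client_tls_context()
```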

ACL (infrastructure/mosquitto/acl.conf)

# Device ACL — each device can only write to its own topics
pattern write aurora/sensors/%u/telemetry
pattern write aurora/sensors/%u/alerts
pattern write aurora/sensors/%u/status
pattern read aurora/command/%u/action

# Service accounts
user rust-core-svc
topic read aurora/sensors/#
topic read aurora/attestation/#

user cps-agent-svc
topic read aurora/sensors/#
topic write aurora/command/#
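The `%u` placeholder in the pattern lines expands to the authenticated username, which under the TLS listener is the device certificate's CN. A small sketch of the resulting authorization check (this logic lives inside Mosquitto; the `may_write` function is purely illustrative):

```python
# Patterns from acl.conf: %u is replaced with the connecting client's username.
WRITE_PATTERNS = [
    "aurora/sensors/%u/telemetry",
    "aurora/sensors/%u/alerts",
    "aurora/sensors/%u/status",
]

def may_write(username: str, topic: str) -> bool:
    """True if the device may publish to `topic` under the pattern ACL."""
    return any(p.replace("%u", username) == topic for p in WRITE_PATTERNS)

assert may_write("sensor-17", "aurora/sensors/sensor-17/telemetry")
# A device cannot publish into another device's topic tree:
assert not may_write("sensor-17", "aurora/sensors/sensor-99/telemetry")
```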

NATS JetStream Configuration

# infrastructure/nats/nats-server.conf
server_name: aurora-nats-1

jetstream {
  store_dir: /data/nats
  max_mem: 512MB
  max_file: 10GB
}

accounts {
  AURORA {
    jetstream: enabled
    users: [{ user: aurora, password: $NATS_PASSWORD }]
  }
  SYS {
    users: [{ user: sys, password: $NATS_SYS_PASSWORD }]
  }
}

max_connections: 1024
max_payload: 1MB