Observability overview

What this page is

The observability surfaces an engineer queries when something in AuroraSOC is wrong: structured logs, decision logs on Redis Streams, Prometheus metrics, and the audit trail. This page points at the conventions and the locations.

Why it exists this way

The architecture document treats observability as a first-class plane, not an afterthought. Engineers should be able to answer "what did the system decide and why" from logs and metrics alone, without ad-hoc print debugging.

How it works

Structured logs live behind packages/backend/aurorasoc/core/logging.py. get_logger(__name__) returns a structlog-style logger that emits JSON in production. Every log line carries context propagated through the call chain: agent_id, case_id, user_id, investigation_id. Adding a new context key is done at the boundary that knows it; downstream callers inherit it automatically.
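The inheritance behaviour described above can be sketched with the standard library alone. This is an illustration of the pattern, not the contents of core/logging.py: bind_context, unbind_context, JsonFormatter, and this get_logger signature (with its stream argument) are hypothetical stand-ins, and the real module may rely on structlog's own context helpers instead.

```python
import contextvars
import io
import json
import logging

# Hypothetical context store; the value is replaced, never mutated in place.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**keys):
    """Merge keys into the current context; return a token to undo the bind."""
    return _log_context.set({**_log_context.get(), **keys})


def unbind_context(token):
    """Restore the context that was active before the matching bind."""
    _log_context.reset(token)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the bound context."""

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
        }
        payload.update(_log_context.get())
        return json.dumps(payload, sort_keys=True)


def get_logger(name, stream=None):
    """Stand-in for the project's get_logger; writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def triage(logger):
    # Downstream code never mentions case_id, yet its log line carries it.
    logger.info("triage started")


def open_case(logger, case_id, agent_id):
    # The boundary that knows the identifiers is the one that binds them.
    token = bind_context(case_id=case_id, agent_id=agent_id)
    try:
        triage(logger)
    finally:
        unbind_context(token)


buffer = io.StringIO()
log = get_logger("aurorasoc.example", stream=buffer)
open_case(log, case_id="case-123", agent_id="triage-agent")
log.info("outside any case")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

In this sketch the first emitted line includes case_id and agent_id while the second does not, which is the symptom to look for when a bind is missing at a boundary.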

Decision logs are append-only Redis Streams owned by packages/backend/aurorasoc/events/redis_streams.py. Every agent decision, every operator approval, and every runtime-mode change publishes an event on its domain stream (decisions:agents, decisions:approvals, decisions:system). The streams are the source of truth for auditors; the logs are the source of truth for engineers.
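A minimal sketch of publishing to a domain stream follows. Only the three stream names come from this page. The field layout, the STREAMS lookup keys, ASSUMED_MAXLEN, and the helper names are assumptions made for illustration; the actual schema lives in events/redis_streams.py. The client argument is expected to behave like a redis-py client, whose xadd accepts maxlen and approximate; a recording stand-in is used here so the example runs without a Redis server.

```python
import json
import time

# Stream names are from the page; the lookup keys are illustrative.
STREAMS = {
    "agent": "decisions:agents",
    "approval": "decisions:approvals",
    "system": "decisions:system",
}
ASSUMED_MAXLEN = 100_000  # placeholder value, not the project's real setting


def build_fields(actor, action, reason, context):
    """Flatten a decision into the string-to-string map a stream entry holds."""
    return {
        "actor": actor,
        "action": action,
        "reason": reason,
        "context": json.dumps(context, sort_keys=True),
        "ts": f"{time.time():.3f}",
    }


def publish_decision(client, domain, fields, maxlen=ASSUMED_MAXLEN):
    """Append one entry to the domain stream and return the entry ID."""
    stream = STREAMS[domain]
    # approximate=True corresponds to MAXLEN ~, letting Redis trim lazily.
    return client.xadd(stream, fields, maxlen=maxlen, approximate=True)


class RecordingClient:
    """Stand-in for a Redis client that only records xadd calls."""

    def __init__(self):
        self.calls = []

    def xadd(self, name, fields, maxlen=None, approximate=True):
        self.calls.append((name, fields, maxlen, approximate))
        return f"{len(self.calls)}-0"


client = RecordingClient()
fields = build_fields(
    actor="triage-agent",
    action="escalate",
    reason="matched correlation rule",
    context={"case_id": "case-123"},
)
entry_id = publish_decision(client, "agent", fields)
```

Passing a cap on every append keeps the trimming policy next to the publish call, so a change to the limit is a visible code change and not a side effect of memory pressure.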

Metrics are Prometheus counters and gauges. Per-component modules expose their metrics as module-level globals; aurorasoc.core.metrics is the registry. Metrics worth knowing:

  • alerts_deduplicated_total: alert dedup hits.
  • network_attack_detections_total: Suricata attack classifications.
  • approval_latency_seconds: wall-clock seconds operators took to act on a pending approval.
  • inference_pool_size: current vLLM/Ollama replica count per agent.
  • buffer_truncations_total: times the Linux EDR sensor dropped its oldest disk-buffer records (see Transport and buffer).

The audit trail is the investigation_events table from Investigation persistence plus the equivalent case_events and approval_events tables. Engineers query these tables to answer the "what happened" question when the logs do not have the full picture.
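One way to answer "what happened" is to merge the three tables into a single timeline for a case. The sketch below is hypothetical in everything except the table names: the columns case_id, occurred_at, and summary are assumed for illustration and should be replaced with the real schema, and an in-memory SQLite database stands in for the production store only so the example runs on its own.

```python
import sqlite3

# Table names are from the page; the column names below are assumptions.
TABLES = ("investigation_events", "case_events", "approval_events")


def timeline(conn, case_id):
    """Return (source_table, occurred_at, summary) rows for one case, oldest first."""
    parts = [
        f"SELECT '{table}' AS source, occurred_at, summary "
        f"FROM {table} WHERE case_id = ?"
        for table in TABLES  # fixed tuple, so interpolating the name is safe
    ]
    sql = " UNION ALL ".join(parts) + " ORDER BY occurred_at"
    return conn.execute(sql, [case_id] * len(TABLES)).fetchall()


conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(
        f"CREATE TABLE {table} (case_id TEXT, occurred_at TEXT, summary TEXT)"
    )

conn.execute(
    "INSERT INTO case_events VALUES (?, ?, ?)",
    ("case-123", "2024-01-01T10:00:00Z", "case opened"),
)
conn.execute(
    "INSERT INTO investigation_events VALUES (?, ?, ?)",
    ("case-123", "2024-01-01T10:05:00Z", "containment proposed"),
)
conn.execute(
    "INSERT INTO approval_events VALUES (?, ?, ?)",
    ("case-123", "2024-01-01T10:07:00Z", "operator approved"),
)
conn.execute(
    "INSERT INTO case_events VALUES (?, ?, ?)",
    ("case-999", "2024-01-01T10:01:00Z", "unrelated case opened"),
)

rows = timeline(conn, "case-123")
for source, occurred_at, summary in rows:
    print(occurred_at, source, summary)
```

Tagging each row with its source table keeps the merged view traceable back to the table that recorded the event.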

What goes wrong

  • A log line is missing the case_id you need to filter on: the context propagation broke at some boundary. The fix is upstream from the missing line, in the function that opened the case context but failed to bind it on the logger.
  • A counter exists in code but is not in Prometheus: either the module is not registered in aurorasoc.core.metrics or the scrape target is wrong. The /metrics endpoint's plaintext output tells the two apart: a counter absent from that output is a registration problem, while a counter present there but missing in Prometheus points at the scrape target.
  • A decision stream fills faster than its consumers drain it: Redis starts evicting under memory pressure. The maxlen on each stream is set in events/redis_streams.py; raise it intentionally and resize the Redis tier accordingly.
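For the missing-counter case, a small check against the /metrics plaintext can separate a registration problem from a scrape problem. This is a rough sketch: the sample exposition text is invented for illustration, the agent label name is an assumption, and the line parsing is a simplification of the Prometheus text format that only extracts sample names.

```python
def exposed_metric_names(exposition_text):
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def diagnose(metric, exposition_text):
    """Say which side of the pipeline to inspect for a missing metric."""
    if metric in exposed_metric_names(exposition_text):
        return "exposed: check the Prometheus scrape target"
    return "not exposed: check registration in the metrics registry"


# Invented sample output; real text comes from the service's /metrics endpoint.
SAMPLE = """\
# HELP alerts_deduplicated_total Alert dedup hits.
# TYPE alerts_deduplicated_total counter
alerts_deduplicated_total 42.0
# HELP inference_pool_size Replica count per agent.
# TYPE inference_pool_size gauge
inference_pool_size{agent="triage"} 3.0
"""

print(diagnose("alerts_deduplicated_total", SAMPLE))
print(diagnose("buffer_truncations_total", SAMPLE))
```

Against a running service, the text would be fetched from the deployment's own /metrics URL and passed to diagnose unchanged.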