Observability overview
What this page is
The observability surfaces an engineer queries when something in AuroraSOC is wrong: structured logs, decision logs on Redis Streams, Prometheus metrics, and the audit trail. This page points at the conventions and the locations.
Why it exists this way
The architecture document treats observability as a first-class plane, not an afterthought. Engineers should be able to answer "what did the system decide and why" from logs and metrics alone, without ad-hoc print debugging.
How it works
Structured logs live behind
packages/backend/aurorasoc/core/logging.py.
get_logger(__name__) returns a structlog-style logger that
emits JSON in production. Every log line carries context
propagated through the call chain: agent_id, case_id,
user_id, investigation_id. Adding a new context key is
done at the boundary that knows it; downstream callers inherit
it automatically.
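The propagation described above can be sketched with the standard library alone. This is a minimal illustration, not the contents of logging.py: `bind_context`, `unbind_context`, `JsonContextFormatter`, `make_logger`, `handle_case`, and `enrich_alert` are hypothetical names, and the real module may rely on structlog's own context handling instead.

```python
import contextvars
import io
import json
import logging

# Hypothetical context store; assumed here for illustration only.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**keys):
    """Bind keys at the boundary that knows them; returns a reset token."""
    merged = {**_log_context.get(), **keys}  # new dict, never mutate in place
    return _log_context.set(merged)


def unbind_context(token):
    """Restore the context that was active before the matching bind."""
    _log_context.reset(token)


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON object, merged with the bound context."""

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
        }
        payload.update(_log_context.get())
        return json.dumps(payload, sort_keys=True)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonContextFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def enrich_alert(logger):
    # Downstream code never mentions case_id; it inherits it from the context.
    logger.info("alert enriched")


def handle_case(logger, case_id, agent_id):
    # The boundary that opens the case is the one that binds the keys.
    token = bind_context(case_id=case_id, agent_id=agent_id)
    try:
        enrich_alert(logger)
    finally:
        unbind_context(token)
```

A log line emitted inside `handle_case` carries `case_id` and `agent_id`; one emitted after it returns does not. That mirrors the rule that a missing key points at the boundary that should have bound it.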
Decision logs are an append-only Redis Stream owned by
packages/backend/aurorasoc/events/redis_streams.py.
Every agent decision, every operator approval, and every
runtime-mode change publishes an event on a domain stream
(decisions:agents, decisions:approvals,
decisions:system). The streams are the source of truth for
auditors; the logs are the source of truth for engineers.
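Publishing to a capped stream can be sketched as follows. This is an assumption-laden illustration rather than the code in redis_streams.py: `publish_decision`, the `ts` and `payload` field names, and the default `maxlen` are hypothetical. The only library behaviour relied on is redis-py's `xadd(name, fields, maxlen=..., approximate=...)`.

```python
import json
import time


def publish_decision(client, stream, decision, maxlen=100_000):
    """Append one decision event to a capped stream and return its entry ID.

    `client` is any object with a redis-py compatible `xadd` method,
    for example an instance of `redis.Redis`.
    """
    fields = {
        # Hypothetical field layout: a timestamp plus a JSON payload.
        "ts": f"{time.time():.3f}",
        "payload": json.dumps(decision, sort_keys=True),
    }
    # approximate=True corresponds to `XADD ... MAXLEN ~ n`, which lets
    # Redis trim lazily instead of on every append.
    return client.xadd(stream, fields, maxlen=maxlen, approximate=True)
```

With a real client the call would look like `publish_decision(redis.Redis(), "decisions:agents", {"agent_id": "triage", "action": "escalate"})`. Passing the client in keeps the function testable without a running Redis.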
Metrics are Prometheus counters and gauges. Per-component
modules expose their counters as module-level globals;
aurorasoc.core.metrics is the registry. Metrics worth
knowing:
- alerts_deduplicated_total: alert dedup hits.
- network_attack_detections_total: Suricata attack classifications.
- approval_latency_seconds: wall-clock seconds operators took to act on a pending approval.
- inference_pool_size: current vLLM/Ollama replica count per agent.
- buffer_truncations_total: Linux EDR sensor dropping oldest disk-buffer records (see Transport and buffer).
Audit trail is the investigation_events table from
Investigation persistence
plus the equivalent case_events and approval_events
tables. Engineers grep there for the "what happened" question
when the logs do not have the full picture.
What goes wrong
- A log line is missing the case_id you need to filter on: the
  context propagation broke at some boundary. The fix is upstream
  from the missing line, in the function that opened the case
  context but failed to bind it on the logger.
- A counter exists in code but is not in Prometheus: the module is
  not registered in aurorasoc.core.metrics or the scrape target is
  wrong. Both errors are visible in the /metrics endpoint's
  plaintext output.
- Decision stream filling faster than consumers drain: Redis
  starts evicting under memory pressure. The maxlen on each stream
  is set in events/redis_streams.py; raise it intentionally and
  resize the Redis tier accordingly.
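For the missing-counter case, the /metrics plaintext output can be checked mechanically. The sketch below parses Prometheus text exposition and reports which expected metric names are absent. `exposed_metric_names` and `missing_metrics` are hypothetical helpers, not part of aurorasoc.core.metrics.

```python
def exposed_metric_names(exposition_text):
    """Collect metric names from Prometheus text exposition output.

    Names come from `# TYPE` lines (metric families) and from sample lines.
    """
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:
                names.add(parts[2])  # `# TYPE <name> <type>`
            continue
        if line.startswith("#"):
            continue  # HELP lines and other comments
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected names that the endpoint does not expose, sorted."""
    present = exposed_metric_names(exposition_text)
    return sorted(name for name in expected if name not in present)
```

Feeding it the body of a `GET /metrics` response along with the names listed earlier on this page gives a quick answer to "is this module registered". An empty result shifts suspicion to the scrape target.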