Fleet telemetry and containment pipeline

What this page is

This page describes how a live EDR sensor (the Windows agent and the Rust collector in front of it) reports into the backend and how response actions flow back out to it. It covers the collector bridge, the agent-facing fleet endpoints, the event and heartbeat path, and the containment command channel. For the operator-facing investigation API, see the EDR API reference.

Why it exists this way

The Windows agent streams OCSF events to the collector over gRPC. The collector already forwards network flow records to the backend; the fleet pipeline extends that same bridge so one component is responsible for all sensor-to-backend traffic. The backend keeps the durable state (fleet health, risk high-water mark, queued commands) because the agent is allowed to restart, lose its buffer, or go offline at any time, and the console must still show an accurate, non-fabricated view of the endpoint.

Containment is queued rather than pushed. A fleet endpoint is addressed as edr:<sensor_id>. The backend writes the command to a Redis-backed queue, and the collector delivers it on the agent's next health check; the agent then applies the Windows Firewall isolation rule. This keeps the backend from holding a live connection to every agent and tolerates an agent that is briefly unreachable.
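The queue-then-deliver flow can be sketched as follows. This is a minimal illustration, not the backend's code: an in-memory dictionary stands in for the Redis-backed queue, and the names CommandQueue, enqueue, drain, and the sensor ID ws-0042 are hypothetical.

```python
from collections import defaultdict, deque

FLEET_PREFIX = "edr:"  # fleet endpoints are addressed as edr:<sensor_id>


class CommandQueue:
    """In-memory stand-in for the Redis-backed command queue (assumption)."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def enqueue(self, endpoint_id, action):
        """Queue an action for a fleet endpoint; nothing is pushed to the agent."""
        if not endpoint_id.startswith(FLEET_PREFIX):
            raise ValueError(f"not a fleet endpoint: {endpoint_id!r}")
        sensor_id = endpoint_id[len(FLEET_PREFIX):]
        if not sensor_id:
            raise ValueError("missing sensor id")
        self._pending[sensor_id].append({"action": action})
        return sensor_id

    def drain(self, sensor_id):
        """Return and clear everything queued for one sensor.

        In the real pipeline this would happen when the collector pulls
        pending commands around the agent's next health check.
        """
        return list(self._pending.pop(sensor_id, deque()))


queue = CommandQueue()
queue.enqueue("edr:ws-0042", "windows_isolate")
first = queue.drain("ws-0042")   # the queued isolation command
second = queue.drain("ws-0042")  # empty: the command is delivered once
```

Because the command sits in the queue until it is pulled, an agent that is offline when the operator clicks "isolate" still receives the command once it checks in again.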

How it works

The collector bridge (crates/collector/src/bridge.rs) drives four calls against /api/v1/edr/{sensor_id}:

| Bridge call | Endpoint | Purpose |
| --- | --- | --- |
| register_heartbeat | POST .../heartbeat | mark the sensor online; store identity, capabilities, isolation state, metrics |
| forward_events | POST .../events | push a compact recent-activity summary for the detail panel |
| fetch_pending_commands | GET .../pending-commands | pull queued response actions |
| fetch_risk_score | GET .../risk | seed the agent's peak-risk high-water mark after a restart |
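As a rough illustration of how the four bridge calls map onto routes, the sketch below builds the method and path for each one. It assumes the ".../" in the table stands for the /api/v1/edr/{sensor_id} base path given above; route_for, BRIDGE_CALLS, and the sensor ID ws-0042 are hypothetical names, and the actual bridge is Rust code in crates/collector/src/bridge.rs.

```python
from urllib.parse import quote

BASE_PATH = "/api/v1/edr/{sensor_id}"

# Bridge call name -> (HTTP method, path suffix), taken from the table above.
BRIDGE_CALLS = {
    "register_heartbeat": ("POST", "heartbeat"),
    "forward_events": ("POST", "events"),
    "fetch_pending_commands": ("GET", "pending-commands"),
    "fetch_risk_score": ("GET", "risk"),
}


def route_for(call, sensor_id):
    """Return (method, path) for one bridge call against one sensor."""
    method, suffix = BRIDGE_CALLS[call]
    # Percent-encode the sensor id so it cannot inject extra path segments.
    base = BASE_PATH.format(sensor_id=quote(sensor_id, safe=""))
    return method, f"{base}/{suffix}"


heartbeat_route = route_for("register_heartbeat", "ws-0042")
commands_route = route_for("fetch_pending_commands", "ws-0042")
```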

On the backend, services/edr_ingest.py accepts the telemetry, publishes EDR events on the aurora.edr.events.* NATS subject, and (when configured) writes them to ClickHouse for the SIEM hot tier. services/edr_investigation.py serves the investigation views and can talk to a sensor directly over gRPC when EDR_SENSOR_GRPC_ADDRESS is set. Isolation requests from the console first resolve the endpoint from live inventory, then call _dispatch_fleet_containment. That function queues windows_isolate for edr:<sensor_id> endpoints and falls back to the configured external action backend for everything else.
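The routing decision described for _dispatch_fleet_containment can be sketched roughly as below. This is not the function's actual body: dispatch_containment, EnforcementUnavailable, and the two callables are hypothetical stand-ins, and the inventory lookup that precedes the dispatch is omitted.

```python
FLEET_PREFIX = "edr:"


class EnforcementUnavailable(Exception):
    """Neither a fleet sensor nor an external action backend can act."""


def dispatch_containment(endpoint_id, queue_command, external_backend=None):
    """Route an isolation request to the fleet queue or the external backend.

    queue_command(sensor_id, action) and external_backend(endpoint_id) are
    illustrative callables supplied by the caller.
    """
    if endpoint_id.startswith(FLEET_PREFIX):
        sensor_id = endpoint_id[len(FLEET_PREFIX):]
        queue_command(sensor_id, "windows_isolate")
        return "queued"
    if external_backend is not None:
        external_backend(endpoint_id)
        return "external"
    # Refuse rather than pretend the host was contained.
    raise EnforcementUnavailable(endpoint_id)


queued = []
result = dispatch_containment(
    "edr:ws-0042", lambda sensor, action: queued.append((sensor, action))
)
```

Raising instead of returning a success value mirrors the stated behavior that the backend does not simulate enforcement; an API layer could translate such an error into the 503 response.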

Two optional integrations are wired through compose env vars on the api service: AURORA_CLICKHOUSE_URL (SIEM hot tier) and EDR_SENSOR_GRPC_ADDRESS (direct sensor gRPC). Both default to empty, which cleanly disables the feature rather than failing.
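One way to get the "empty means disabled" behavior is to normalize an unset or blank variable to None and let the feature short-circuit on that. The sketch below assumes this pattern; optional_setting and HotTierWriter are hypothetical names, and the list buffer is only a placeholder for a real ClickHouse insert.

```python
import os


def optional_setting(name, environ=None):
    """Return the variable's value, or None when it is unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    return value or None


class HotTierWriter:
    """Illustrative writer that becomes a no-op when no URL is configured."""

    def __init__(self, url):
        self.url = url
        self.enabled = url is not None
        self.buffer = []  # placeholder for a real ClickHouse insert

    def write(self, rows):
        if not self.enabled:
            return 0  # disabled: drop silently instead of raising
        self.buffer.extend(rows)
        return len(rows)


writer = HotTierWriter(optional_setting("AURORA_CLICKHOUSE_URL", {}))
written = writer.write([{"event": "process_start"}])  # 0 when disabled
```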

What goes wrong

  • Endpoint shows offline despite a running agent. The collector heartbeat is throttled; check that the collector can reach the API and that the agent is streaming OCSF events, which is what drives the heartbeat refresh.
  • Isolation returns 503 in real mode. The endpoint must exist in live inventory and a fleet sensor or action backend must be reachable. The backend refuses to simulate enforcement rather than fake a contained host.
  • No SIEM rows for EDR events. AURORA_CLICKHOUSE_URL is unset, so the ClickHouse writer is a no-op. Set it on the api service to enable the hot tier.
  • Risk score resets to zero after an agent restart. The agent seeds from GET .../risk; if that call fails, the local high-water mark starts cold and stays at zero until the next risky process is observed.
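The last failure mode follows from how a high-water mark is seeded. The sketch below assumes the risk score is a non-negative number and that a failed fetch falls back to zero; seed_peak_risk, PeakRisk, and the sample scores are hypothetical and only illustrate the behavior described above.

```python
def seed_peak_risk(fetch_risk_score):
    """Seed the peak from the backend, starting cold (0.0) if the call fails."""
    try:
        return max(0.0, float(fetch_risk_score()))
    except Exception:
        return 0.0


class PeakRisk:
    """Tracks the highest risk score seen so far (a high-water mark)."""

    def __init__(self, seed):
        self.peak = seed

    def observe(self, score):
        self.peak = max(self.peak, score)
        return self.peak


def unreachable_backend():
    raise ConnectionError("GET .../risk failed")


warm = PeakRisk(seed_peak_risk(lambda: 72))           # seeded from the backend
cold = PeakRisk(seed_peak_risk(unreachable_backend))  # fetch failed
cold.observe(40)  # the peak recovers only once a risky process is observed
```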