Transport and offline buffer

What this page is

The wire path between the Windows sensor and the Collector, the disk ring buffer that takes over when the wire is unavailable, the enrollment protocol that provisions mTLS material, and the reconnect backoff strategy.

Why it exists this way

The architecture document requires no telemetry loss across maintenance windows, network partitions, or Collector restarts. A purely in-memory queue is unacceptable on Windows endpoints that can go offline for hours during patching cycles, VPN drops, or laptop sleep. The disk ring buffer gives the sensor a 500 MB to 5 GB local window without making the endpoint a long-term store.

gRPC over HTTP/2 with mTLS was chosen over a custom binary protocol for the same reasons as the Linux sensor: tonic is well-maintained, client streaming matches the event-forwarding model, and the Collector ecosystem (Envoy front, eventual sidecar mesh) understands HTTP/2 natively. Reusing the same proto definitions across Linux and Windows keeps the Collector-side code path uniform.

How it works

Four Rust pieces collaborate:

  • crates/edr-windows/proto/collector.proto declares Collector with two RPCs: client-streaming StreamEvents for telemetry and unary HealthCheck for liveness probes. This proto file is shared verbatim with the Linux sensor via a workspace-level Proto directory.
  • edr_windows::enrollment provisions the mTLS identity the same way as the Linux sensor: a one-time enrollment token is exchanged for a long-lived client certificate. The certificate is stored in the Windows Certificate Store (LocalMachine\My) with private key export disallowed. The enrollment endpoint is the same Collector gRPC endpoint at https://collector.<region>.example:9443.
  • edr_windows::transport opens an HTTP/2 connection with tonic::transport::ClientTlsConfig, performs a health check on connect, then batches up to 32 events per gRPC message. Reconnect uses exponential backoff (1 second initial, 60 seconds maximum) capped so a long outage still produces one connect attempt per minute. Certificate expiry is monitored and triggers automatic re-enrollment 7 days before expiry.
  • edr_windows::buffer implements a disk ring buffer with NDJSON records, monotonic sequence numbers, and capacity enforcement via oldest-record truncation. The default size is 1 GB on Windows (larger than Linux's 500 MB because Windows ETW produces higher event volume from Kernel-File and Network providers). The buffer path is %ProgramData%\AuroraEDR\buffer\.

Offline behaviour

When the transport cannot reach the Collector, the sensor enters Offline state:

  1. All new ETW events are written to the disk buffer.
  2. A background task attempts reconnection on exponential backoff.
  3. On reconnection, buffered events are drained in FIFO order ahead of any newly-produced events.
  4. Sequence numbers let the Collector deduplicate if a batch was partially acknowledged before the connection dropped.
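As an added example (not code from the sensor itself), the following Rust model captures the offline path described above: NDJSON records with monotonic sequence numbers, oldest-record truncation counted as loss, FIFO drain that skips records already acknowledged, and the 1 s to 60 s reconnect backoff. The type and function names (RingBuffer, push, drain_after, backoff_secs) are assumptions, and an in-memory VecDeque stands in for the on-disk file.

```rust
use std::collections::VecDeque;

/// In-memory stand-in for the disk ring buffer (assumed names; example only).
/// Each record is one NDJSON line carrying its sequence number.
pub struct RingBuffer {
    capacity_bytes: usize,
    used_bytes: usize,
    next_seq: u64,
    records: VecDeque<(u64, String)>,
    /// Mirrors the buffer_truncations_total metric: records dropped as oldest.
    pub truncations_total: u64,
}

impl RingBuffer {
    pub fn new(capacity_bytes: usize) -> Self {
        RingBuffer { capacity_bytes, used_bytes: 0, next_seq: 1,
                     records: VecDeque::new(), truncations_total: 0 }
    }

    /// Append one event as an NDJSON line and return its sequence number.
    pub fn push(&mut self, event_json: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        let line = format!("{{\"seq\":{},\"event\":{}}}\n", seq, event_json);
        self.used_bytes += line.len();
        self.records.push_back((seq, line));
        // Capacity enforcement: truncate the oldest records and count the loss.
        while self.used_bytes > self.capacity_bytes {
            let (_, old) = self.records.pop_front().expect("used_bytes > 0 implies a record");
            self.used_bytes -= old.len();
            self.truncations_total += 1;
        }
        seq
    }

    /// Drain in FIFO order for replay after reconnect, skipping records the
    /// Collector already acknowledged (dedup after a partially-acked batch).
    pub fn drain_after(&mut self, last_acked: u64) -> Vec<String> {
        let mut out = Vec::new();
        while let Some((seq, line)) = self.records.pop_front() {
            self.used_bytes -= line.len();
            if seq > last_acked { out.push(line); }
        }
        out
    }
}

/// Exponential reconnect backoff: 1 s initial, doubling, capped at 60 s.
pub fn backoff_secs(attempt: u32) -> u64 {
    (1u64 << attempt.min(6)).min(60)
}
```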

Certificate rotation

Client certificates have a 30-day validity period. The sensor's watchdog task checks certificate expiry every 6 hours. When less than 7 days remain, it triggers the enrollment flow with the current certificate as the authentication credential (rotating enrollment, not re-provisioning). The old certificate is revoked on the Collector side once the new one is acknowledged.

What goes wrong

  • TransportError::Connect, Collector unreachable. The sensor falls back to the offline ring buffer; telemetry queues on disk until a reconnect attempt succeeds.
  • TransportError::Auth, mTLS handshake failed. Usually means the client certificate expired before the sensor renewed it. The enrollment flow must run before the sensor can reconnect; the buffer fills until the operator intervenes or the automatic renewal succeeds.
  • Ring buffer full, oldest records are truncated. A counter buffer_truncations_total increments in the metrics telemetry so the team sees the loss. On Windows, the buffer can grow to 5 GB before truncation begins because disk space is typically abundant on endpoints with 256 GB+ SSDs.
  • EnrollmentError::TokenExpired, the enrollment token passed its 72-hour TTL. The sensor retries enrollment with exponentially increasing backoff until the operator provisions a new token.