Transport and offline buffer

What this page is

The wire path between the sensor and the Collector, the disk ring buffer that takes over when the wire is unavailable, and the rules for draining the buffer when connectivity returns.

Why it exists this way

The architecture document requires no telemetry loss across maintenance windows, network partitions, or Collector restarts. A purely in-memory queue is unacceptable on endpoints that can go offline for hours. The disk ring buffer (ADR 007 follow-up item 3) gives the sensor a 500 MB local window without making the endpoint a long-term store.

gRPC over HTTP/2 with mTLS was chosen over a custom binary protocol for tooling reasons: tonic is well-maintained, client streaming is exactly the right shape, and the Collector ecosystem (envoy front, eventual sidecar mesh) understands HTTP/2 natively.

How it works

Three Rust pieces collaborate:

  • crates/edr-linux/proto/collector.proto declares Collector with two RPCs: client-streaming StreamEvents for telemetry and unary HealthCheck for liveness probes.
  • edr_linux::transport opens an HTTP/2 connection with tonic::transport::ClientTlsConfig, performs a health check on connect, then batches up to 32 events per gRPC message. Reconnect uses exponential backoff that starts at 1 second and is capped at 60 seconds, so even a long outage still produces one connect attempt per minute.
  • edr_linux::buffer implements a 500 MB disk ring buffer with NDJSON records, monotonic sequence numbers, and capacity enforcement via oldest-record truncation. Record types implement Deserialize, so the buffer can round-trip ProcessActivity and its child types.
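
The backoff schedule and the 32-event batching described above can be sketched as follows. This is a minimal illustration, not the code in edr_linux::transport: the constant names and the helpers backoff_delay and batches are hypothetical, and the doubling factor is an assumption (the text only fixes the 1 second start and the 60 second cap).

```rust
use std::time::Duration;

// Values taken from the text above; the constant names are hypothetical.
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60);
const MAX_BATCH: usize = 32;

/// Delay before reconnect attempt `attempt` (0-based).
/// Assumes the delay doubles on each attempt, then saturates at MAX_BACKOFF.
fn backoff_delay(attempt: u32) -> Duration {
    // 2^attempt; checked_shl returns None once the shift reaches 32 bits,
    // in which case we fall back to the largest factor and let min() cap it.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}

/// Splits a slice of pending events into groups of at most MAX_BATCH,
/// one group per outgoing gRPC message.
fn batches<T>(events: &[T]) -> std::slice::Chunks<'_, T> {
    events.chunks(MAX_BATCH)
}
```

Under these assumptions the delays run 1, 2, 4, 8, 16, 32, then 60 seconds for every later attempt, which matches the "one connect attempt per minute" floor during a long outage.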

When the transport reconnects, the buffer is drained in order ahead of any newly produced events. Sequence numbers let the Collector deduplicate if a batch was partially acknowledged before the connection dropped.
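
A rough sketch of the drain ordering and the deduplication it enables is shown below. The Record and Dedup types and the send_order helper are hypothetical. The high-water-mark check is one possible Collector-side strategy, and it assumes each sensor sends records in strictly increasing sequence order; the actual Collector logic is not described on this page.

```rust
use std::collections::VecDeque;

/// Illustrative record: a sequence number plus an opaque payload.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    seq: u64,
    payload: String,
}

/// Sensor side: buffered records go out first, in order, then live ones.
fn send_order(buffered: &mut VecDeque<Record>, live: Vec<Record>) -> Vec<Record> {
    let mut out: Vec<Record> = buffered.drain(..).collect();
    out.extend(live);
    out
}

/// Collector side (assumed): remember the highest sequence number accepted
/// from a sensor and drop anything at or below it as a replay.
#[derive(Default)]
struct Dedup {
    last_seq: Option<u64>,
}

impl Dedup {
    /// Returns true if the record is new and should be stored.
    fn accept(&mut self, rec: &Record) -> bool {
        match self.last_seq {
            Some(last) if rec.seq <= last => false,
            _ => {
                self.last_seq = Some(rec.seq);
                true
            }
        }
    }
}
```

If the connection drops after the Collector has stored records 1 and 2 of a batch, the sensor can resend the whole batch on reconnect; under this scheme only the records above the stored high-water mark are accepted.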

What goes wrong

  • TransportError::Connect: the Collector is unreachable. The daemon falls back to the offline ring buffer; telemetry queues on disk until a reconnect succeeds.
  • TransportError::Auth: the mTLS handshake failed, which usually means the client certificate has expired. The enrollment flow has to renew the certificate before the sensor can reconnect; the buffer fills until an operator intervenes.
  • Ring buffer full: the oldest records are truncated. The buffer_truncations_total counter increments so the team sees the loss in metrics. This is the worst-case loss scenario the design tolerates, and it signals that fleet-wide Collector capacity needs to grow.
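
The full-buffer case can be illustrated with a small in-memory model of oldest-record truncation. This is a sketch only: the real edr_linux::buffer is disk-backed, and the RingBuffer type, its fields, and the choice to count one truncation per evicted record are assumptions made for illustration. The truncations_total field stands in for the buffer_truncations_total counter.

```rust
use std::collections::VecDeque;

/// In-memory stand-in for the disk ring buffer (hypothetical type).
struct RingBuffer {
    capacity_bytes: usize,
    used_bytes: usize,
    next_seq: u64,
    /// (sequence number, NDJSON line), oldest at the front.
    records: VecDeque<(u64, String)>,
    /// Stand-in for the buffer_truncations_total metric.
    truncations_total: u64,
}

impl RingBuffer {
    fn new(capacity_bytes: usize) -> Self {
        RingBuffer {
            capacity_bytes,
            used_bytes: 0,
            next_seq: 0,
            records: VecDeque::new(),
            truncations_total: 0,
        }
    }

    /// Appends one line, then evicts the oldest records until the buffer
    /// fits its capacity again. Returns the sequence number assigned.
    fn push(&mut self, line: String) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.used_bytes += line.len();
        self.records.push_back((seq, line));
        while self.used_bytes > self.capacity_bytes {
            match self.records.pop_front() {
                Some((_, old)) => {
                    self.used_bytes -= old.len();
                    self.truncations_total += 1; // assumed: one count per lost record
                }
                None => break,
            }
        }
        seq
    }
}
```

Sequence numbers keep increasing across evictions, so a gap at the front of the drained stream is visible to the Collector as lost telemetry rather than being silently renumbered.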