Transport and offline buffer
What this page is
The wire path between the sensor and the Collector, the disk ring buffer that takes over when the wire is unavailable, and the rules for draining the buffer when connectivity returns.
Why it exists this way
The architecture document requires no telemetry loss across maintenance windows, network partitions, or Collector restarts. A purely in-memory queue is unacceptable on endpoints that can go offline for hours. The disk ring buffer (ADR 007 follow-up item 3) gives the sensor a 500 MB local window without making the endpoint a long-term store.
gRPC over HTTP/2 with mTLS was chosen over a custom binary protocol for tooling reasons: tonic is well-maintained, client streaming is exactly the right shape, and the Collector ecosystem (Envoy front, eventual sidecar mesh) understands HTTP/2 natively.
How it works
Three Rust pieces collaborate:
- crates/edr-linux/proto/collector.proto declares Collector with two RPCs: client-streaming StreamEvents for telemetry and unary HealthCheck for liveness probes.
- edr_linux::transport opens an HTTP/2 connection with tonic::transport::ClientTlsConfig, performs a health check on connect, then batches up to 32 events per gRPC message. Reconnect uses exponential backoff (1 second initial, 60 seconds maximum) capped so a long outage still produces one connect attempt per minute.
- edr_linux::buffer implements a 500 MB disk ring buffer with NDJSON records, monotonic sequence numbers, and capacity enforcement via oldest-record truncation. Records carry Deserialize so the buffer round-trips ProcessActivity and child types.
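The reconnect schedule and the batch limit can be sketched with the standard library alone. This is a minimal illustration, not the code in edr_linux::transport: the function names backoff_delay and batches are hypothetical, and doubling the delay on every failed attempt is an assumption, since the page only says the backoff is exponential.

```rust
use std::slice::Chunks;
use std::time::Duration;

/// Initial and maximum reconnect delays, as stated on this page.
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60);
/// Maximum number of events per gRPC message, as stated on this page.
const MAX_BATCH: usize = 32;

/// Delay before reconnect attempt `attempt` (0-based).
/// Assumption: the delay doubles after each failure.
fn backoff_delay(attempt: u32) -> Duration {
    // 2^attempt, falling back to u64::MAX when the shift would overflow.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let secs = INITIAL_BACKOFF.as_secs().saturating_mul(factor);
    // The cap keeps a long outage at one connect attempt per minute.
    Duration::from_secs(secs).min(MAX_BACKOFF)
}

/// Splits pending events into slices of at most MAX_BATCH items,
/// one slice per outgoing gRPC message.
fn batches<T>(events: &[T]) -> Chunks<'_, T> {
    events.chunks(MAX_BATCH)
}
```

Under these assumptions the delays run 1, 2, 4, 8, 16, 32 seconds and then stay at 60 seconds, and 70 pending events go out as messages of 32, 32, and 6 events.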
When the transport reconnects, the buffer is drained in order ahead of any newly-produced events. Sequence numbers let the Collector deduplicate if a batch was partially acknowledged before the connection dropped.
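The drain order and the deduplication idea can be illustrated as follows. This is a sketch under stated assumptions: Record, send_order, and Dedup are hypothetical names, and the Collector is assumed to track a single high-water-mark sequence number per sensor. The page does not describe how the Collector actually deduplicates.

```rust
use std::collections::VecDeque;

/// Hypothetical record shape; only the sequence number matters here.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    seq: u64,
    payload: String,
}

/// Sensor side: everything still in the buffer is sent first, in order,
/// followed by events produced after the reconnect.
fn send_order(buffered: &mut VecDeque<Record>, fresh: Vec<Record>) -> Vec<Record> {
    let mut out: Vec<Record> = buffered.drain(..).collect();
    out.extend(fresh);
    out
}

/// Collector side (assumed): remembers the highest sequence number accepted
/// so far and drops anything at or below it.
struct Dedup {
    last_seq: Option<u64>,
}

impl Dedup {
    fn new() -> Self {
        Dedup { last_seq: None }
    }

    /// Returns true if the record is new and should be stored.
    fn accept(&mut self, rec: &Record) -> bool {
        match self.last_seq {
            Some(last) if rec.seq <= last => false,
            _ => {
                self.last_seq = Some(rec.seq);
                true
            }
        }
    }
}
```

If a batch carrying sequence numbers 1 and 2 was stored but the connection dropped before the acknowledgement arrived, the sensor resends from 1 and this filter rejects 1 and 2 while accepting 3 onward. The scheme relies on sequence numbers being monotonic and on the buffer being drained in order, both of which the text above states.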
What goes wrong
- TransportError::Connect: Collector unreachable. The daemon falls into the offline ring buffer; telemetry queues on disk until reconnect succeeds.
- TransportError::Auth: mTLS handshake failed; usually means the client certificate expired. The enrollment flow has to renew before the sensor can reconnect; the buffer fills until the operator intervenes.
- Ring buffer full: oldest records are truncated. A counter buffer_truncations_total increments so the team sees the loss in metrics; this is the worst-case loss scenario the design tolerates and it is the signal that fleet-wide Collector capacity needs to grow.
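The truncation behaviour can be modelled in memory to make the loss accounting concrete. This is a sketch, not edr_linux::buffer: the real buffer lives on disk, the RingBuffer type and its fields are hypothetical, the NDJSON line layout is invented for illustration, and counting one truncation per dropped record is an assumption about how buffer_truncations_total is incremented.

```rust
use std::collections::VecDeque;

/// In-memory model of a byte-capped ring buffer of NDJSON lines.
struct RingBuffer {
    capacity_bytes: usize,
    used_bytes: usize,
    next_seq: u64,
    records: VecDeque<String>, // one NDJSON line per record, oldest first
    truncations_total: u64,    // stands in for buffer_truncations_total
}

impl RingBuffer {
    fn new(capacity_bytes: usize) -> Self {
        RingBuffer {
            capacity_bytes,
            used_bytes: 0,
            next_seq: 0,
            records: VecDeque::new(),
            truncations_total: 0,
        }
    }

    /// Appends one already-serialized JSON event and returns its sequence number.
    fn push(&mut self, event_json: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        // Hypothetical line layout: {"seq":N,"event":<event_json>}\n
        let line = format!("{{\"seq\":{},\"event\":{}}}\n", seq, event_json);
        self.used_bytes += line.len();
        self.records.push_back(line);
        // Enforce capacity by dropping the oldest records; the newest is always kept.
        while self.used_bytes > self.capacity_bytes && self.records.len() > 1 {
            if let Some(old) = self.records.pop_front() {
                self.used_bytes -= old.len();
                self.truncations_total += 1; // assumed: one increment per dropped record
            }
        }
        seq
    }
}
```

With a capacity of 60 bytes and 28-byte records, the third push evicts the record with sequence number 0 and the counter moves from 0 to 1. Sequence numbers keep increasing across truncations, so a gap in the received sequence is visible downstream as well as in the metric.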