MCP boundary protection

What this page is

This page covers the result contract and the transport-level guards at the MCP server boundary: the uniform tool error envelope, the per-call execution budget, per-IP rate limiting, OpenTelemetry tracing, and the in-flight drain on shutdown. Together they form the safety layer between agents and the side-effecting tools they call.

Why it exists this way

Every privileged agent action runs through an MCP tool that extends AuroraTool (packages/backend/aurorasoc/tools/base.py). Two gaps motivated this work. First, tool failure shapes had diverged and sometimes leaked the upstream exception type to the model. Second, there was no time budget, so a hung upstream could wedge an agent step forever. The architecture document already fixed the action-class taxonomy (ADR 011) and the audit envelope (ADR 012), but it was silent on what a tool returns to the agent and on how to protect an individual server process from a request flood.

How it works

Error and timeout contract (ADR 047)

Tools share one error envelope built by aurorasoc.tools.errors.tool_error_envelope(code, message, **extra), which returns {"status": "error", "error_code": <ToolErrorCode>, "error": <message>}. status and error stay for backward compatibility; error_code is the stable, machine-matchable category. ToolErrorCode is a StrEnum with invalid_input, upstream_unavailable, timeout, permission_denied, not_found, and internal. Agent-facing messages are generic; the concrete cause is logged server-side and never returned.
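The envelope shape described above can be sketched as follows. This is a minimal illustration, not the code in aurorasoc.tools.errors: the upper-case member names, the rule that extra keys are merged without overriding the three core fields, the should_retry helper, and the whois_lookup tool name are all assumptions made for the example.

```python
try:
    from enum import StrEnum
except ImportError:  # Python < 3.11: fallback used only by this sketch
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class ToolErrorCode(StrEnum):
    # Values come from the documented list; member names are assumed.
    INVALID_INPUT = "invalid_input"
    UPSTREAM_UNAVAILABLE = "upstream_unavailable"
    TIMEOUT = "timeout"
    PERMISSION_DENIED = "permission_denied"
    NOT_FOUND = "not_found"
    INTERNAL = "internal"


def tool_error_envelope(code, message, **extra):
    """Build the uniform error envelope for a failed tool call."""
    envelope = {
        "status": "error",
        # Accepts a member or its string value; rejects unknown codes.
        "error_code": ToolErrorCode(code).value,
        "error": message,
    }
    # Assumption: extra keys are added but never replace the core fields.
    for key, value in extra.items():
        envelope.setdefault(key, value)
    return envelope


def should_retry(result):
    """Hypothetical agent-side helper that branches on error_code."""
    return result.get("error_code") in {"upstream_unavailable", "timeout"}


if __name__ == "__main__":
    result = tool_error_envelope(
        ToolErrorCode.TIMEOUT, "tool call hit its timeout", tool="whois_lookup"
    )
    print(result, should_retry(result))
```

Because error_code is a stable category, a caller can decide to retry without parsing the human-readable error string.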

AuroraTool._run is an error boundary. An unexpected exception is logged once with its real type, then a sanitized AuroraToolError is raised outside the except block so neither __cause__ nor __context__ carries the original detail into the framework's error wrapper. AuroraTool.timeout_seconds sets an optional asyncio.wait_for budget; the timeout message keeps the literal token timeout so the tool-invocation auditor classifies it correctly. See ADR 047.
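The boundary pattern described above, where the sanitized error is raised only after the except block has finished, can be sketched like this. It is an illustration under assumptions rather than the body of AuroraTool._run: SanitizedToolError stands in for AuroraToolError, and run_guarded, the logger name, and the message wording are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("tool_boundary_sketch")  # hypothetical logger name


class SanitizedToolError(Exception):
    """Hypothetical stand-in for AuroraToolError."""


async def run_guarded(call, timeout_seconds=None):
    """Run an async tool body behind an error boundary and optional budget."""
    failure = None
    try:
        if timeout_seconds is None:
            return await call()
        return await asyncio.wait_for(call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Keep the literal token "timeout" in the agent-facing message.
        failure = "tool call exceeded its timeout budget"
    except Exception as exc:
        # The real exception type is logged once, server-side only.
        logger.error("tool failed with %s", type(exc).__name__)
        failure = "tool call failed"
    # Raising here, after the except block has ended, means Python does not
    # attach the original exception as __context__ or __cause__.
    raise SanitizedToolError(failure)


if __name__ == "__main__":
    async def hung_upstream():
        await asyncio.sleep(10)

    try:
        asyncio.run(run_guarded(hung_upstream, timeout_seconds=0.05))
    except SanitizedToolError as err:
        print(err, err.__context__)
```

Raising inside the except block would chain the original exception implicitly, which is how upstream detail could reach the framework's error wrapper.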

Default budget and rate limiting (ADR 049)

The per-call budget became a boundary guarantee rather than an opt-in. AuroraTool resolves it as: the explicit timeout_seconds if set; else None (no budget) when the tool is marked is_hitl; else the configured MCP_TOOL_TIMEOUT_SECONDS default. The five human-in-the-loop tools (block_ip, isolate_endpoint, edr_windows_isolate, edr_windows_kill_process, request_human_approval) are is_hitl, so they keep waiting on an analyst without a budget.
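The resolution order above is small enough to express as a single function. The sketch below assumes that an unset timeout_seconds is represented as None; resolve_timeout is a hypothetical name, not a method of AuroraTool.

```python
from typing import Optional

DEFAULT_TOOL_TIMEOUT_SECONDS = 60.0  # documented default for tool_timeout_seconds


def resolve_timeout(
    explicit: Optional[float],
    is_hitl: bool,
    default: float = DEFAULT_TOOL_TIMEOUT_SECONDS,
) -> Optional[float]:
    """Return the per-call budget in seconds, or None for no budget."""
    if explicit is not None:
        return explicit  # 1. an explicit timeout_seconds always wins
    if is_hitl:
        return None      # 2. human-in-the-loop tools wait without a budget
    return default       # 3. everything else gets the configured default


if __name__ == "__main__":
    print(resolve_timeout(None, is_hitl=False))
    print(resolve_timeout(None, is_hitl=True))
    print(resolve_timeout(300.0, is_hitl=True))
```

Note that under this ordering an explicit budget on an is_hitl tool still applies, because the explicit value is checked first.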

A pure-ASGI MCPRateLimitMiddleware throttles requests per client IP using the shared RedisRateLimiter. It is opt-in via MCP_RATE_LIMIT_ENABLED (default off), exempts health-check paths, and fails open when Redis is unavailable so a Redis blip degrades protection rather than blacking out the server. The launcher wraps the ASGI app with it after the bearer-token middleware. See ADR 049.
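A pure-ASGI middleware with these properties might look roughly like the sketch below. It is not MCPRateLimitMiddleware itself: the in-memory fixed-window limiter is a stand-in for the shared RedisRateLimiter, and the allow method, the /health path, the 429 response body, and the collect_statuses helper are assumptions for illustration.

```python
import asyncio
import time
from collections import defaultdict

HEALTH_PATHS = {"/health"}  # hypothetical exempt path


class InMemoryLimiter:
    """Stand-in for RedisRateLimiter: a fixed-window counter per key.
    Old windows are never evicted, which is acceptable only in a sketch."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._counts = defaultdict(int)

    async def allow(self, key):
        window = int(time.monotonic() // self.window_seconds)
        self._counts[(key, window)] += 1
        return self._counts[(key, window)] <= self.max_requests


class RateLimitSketch:
    """Pure-ASGI middleware: wraps an app and throttles per client IP."""

    def __init__(self, app, limiter):
        self.app = app
        self.limiter = limiter

    async def __call__(self, scope, receive, send):
        # Non-HTTP scopes and health checks pass straight through.
        if scope["type"] != "http" or scope.get("path") in HEALTH_PATHS:
            await self.app(scope, receive, send)
            return
        client = scope.get("client")
        ip = client[0] if client else "unknown"
        try:
            allowed = await self.limiter.allow(ip)
        except Exception:
            allowed = True  # fail open: a limiter outage must not block traffic
        if allowed:
            await self.app(scope, receive, send)
            return
        await send({"type": "http.response.start", "status": 429,
                    "headers": [(b"content-type", b"text/plain")]})
        await send({"type": "http.response.body", "body": b"rate limit exceeded"})


async def ok_app(scope, receive, send):
    """Trivial downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def collect_statuses(app, scope, n):
    """Drive the app n times and record each response status code."""
    statuses = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        if message["type"] == "http.response.start":
            statuses.append(message["status"])

    for _ in range(n):
        await app(scope, receive, send)
    return statuses


if __name__ == "__main__":
    app = RateLimitSketch(ok_app, InMemoryLimiter(2, 60))
    scope = {"type": "http", "path": "/mcp", "client": ("10.0.0.1", 1234)}
    print(asyncio.run(collect_statuses(app, scope, 3)))
```

Failing open trades strictness for availability: when the limiter raises, requests are admitted unthrottled instead of being rejected.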

Tracing and graceful shutdown

MCP tool invocations are traced with OpenTelemetry spans, and the server drains in-flight calls on shutdown rather than dropping them. The launcher builds the FastMCP ASGI app so the boundary guards (token middleware, mTLS context, rate limiting) apply to every request.
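One way to drain in-flight calls on shutdown is to track each call as a task and wait for the set to empty. The sketch below shows only that idea and leaves tracing out. InFlightCalls, the grace period, and the decision to cancel calls that outlive it are assumptions; the page does not say how the server implements its drain.

```python
import asyncio


class InFlightCalls:
    """Hypothetical tracker for tool calls that are still running."""

    def __init__(self):
        self._tasks = set()

    def start(self, coro):
        """Schedule a call and remember it until it finishes."""
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def drain(self, grace_seconds):
        """Wait for running calls; return how many had to be cancelled."""
        if not self._tasks:
            return 0
        _done, pending = await asyncio.wait(
            list(self._tasks), timeout=grace_seconds
        )
        # Assumption: calls that outlive the grace period are cancelled.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


if __name__ == "__main__":
    async def main():
        calls = InFlightCalls()
        calls.start(asyncio.sleep(0.05))
        cancelled = await calls.drain(grace_seconds=1.0)
        print("cancelled:", cancelled)

    asyncio.run(main())
```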

Settings

MCPSettings exposes four knobs:

  • tool_timeout_seconds (default 60)
  • rate_limit_enabled (default false)
  • rate_limit_requests (default 600)
  • rate_limit_window_seconds (default 60)

The dev stack is unaffected because rate limiting defaults off.
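As a rough picture of these defaults, the sketch below mirrors the four knobs in a dataclass and reads overrides from environment variables. MCPSettingsSketch and load_settings are hypothetical. Only MCP_TOOL_TIMEOUT_SECONDS and MCP_RATE_LIMIT_ENABLED are named on this page, so the other two variable names, the numeric types, and the set of accepted truthy strings are assumptions.

```python
import os
from dataclasses import dataclass

TRUTHY = {"1", "true", "yes", "on"}  # assumed truthy spellings


@dataclass(frozen=True)
class MCPSettingsSketch:
    """Hypothetical mirror of the documented MCPSettings defaults."""

    tool_timeout_seconds: float = 60.0
    rate_limit_enabled: bool = False
    rate_limit_requests: int = 600
    rate_limit_window_seconds: int = 60


def load_settings(env=None):
    """Build settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    d = MCPSettingsSketch()
    enabled = str(env.get("MCP_RATE_LIMIT_ENABLED", "false")).strip().lower()
    return MCPSettingsSketch(
        tool_timeout_seconds=float(
            env.get("MCP_TOOL_TIMEOUT_SECONDS", d.tool_timeout_seconds)
        ),
        rate_limit_enabled=enabled in TRUTHY,
        # The next two variable names are inferred, not taken from the page.
        rate_limit_requests=int(
            env.get("MCP_RATE_LIMIT_REQUESTS", d.rate_limit_requests)
        ),
        rate_limit_window_seconds=int(
            env.get("MCP_RATE_LIMIT_WINDOW_SECONDS", d.rate_limit_window_seconds)
        ),
    )


if __name__ == "__main__":
    print(load_settings({}))
    print(load_settings({"MCP_RATE_LIMIT_ENABLED": "true"}))
```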

What goes wrong

  • A tool returns a bare {"error": ...} instead of the envelope: it has not been migrated to tool_error_envelope. Agents cannot branch on error_code.
  • A non-HITL tool with no explicit budget now times out at the default: raise timeout_seconds for a self-bounded long-running tool, or mark it is_hitl only if it genuinely blocks on an analyst.
  • Rate limiting silently does nothing: it is off unless MCP_RATE_LIMIT_ENABLED is set, and it fails open when Redis is down.