MCP boundary protection
What this page is
The result contract and the transport-level guards at the MCP server boundary: the uniform tool error envelope, the per-call execution budget, per-IP rate limiting, OpenTelemetry tracing, and the in-flight drain on shutdown. This is the safety layer between agents and the side-effecting tools they call.
Why it exists this way
Every privileged agent action runs through an MCP tool that extends
AuroraTool (packages/backend/aurorasoc/tools/base.py).
Two gaps motivated this work. First, tool failure shapes had diverged and
sometimes leaked the upstream exception type to the model. Second, there
was no time budget, so a hung upstream could wedge an agent step
indefinitely. The architecture document fixes the action-class taxonomy
(ADR 011) and the audit envelope (ADR 012) but, before this work, was
silent on what a tool returns to the agent and on how to protect an
individual server process from a request flood.
How it works
Error and timeout contract (ADR 047)
Tools share one error envelope built by
aurorasoc.tools.errors.tool_error_envelope(code, message, **extra),
which returns {"status": "error", "error_code": <ToolErrorCode>, "error": <message>}. status and error stay for backward
compatibility; error_code is the stable, machine-matchable category.
ToolErrorCode is a StrEnum with invalid_input,
upstream_unavailable, timeout, permission_denied, not_found,
and internal. Agent-facing messages are generic; the concrete cause
is logged server-side and never returned.
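As a rough illustration of the envelope described above, the following is a minimal sketch rather than the code in aurorasoc.tools.errors. The enum members and the three fixed keys come from this page. How the **extra keys are merged (top level, never overriding the fixed keys) is an assumption, as is the fallback for interpreters older than Python 3.11.

```python
try:
    from enum import StrEnum  # Python 3.11+
except ImportError:  # older interpreters: an equivalent str-valued enum
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class ToolErrorCode(StrEnum):
    invalid_input = "invalid_input"
    upstream_unavailable = "upstream_unavailable"
    timeout = "timeout"
    permission_denied = "permission_denied"
    not_found = "not_found"
    internal = "internal"


def tool_error_envelope(code, message, **extra):
    """Build the uniform error envelope (illustrative stand-in)."""
    envelope = {
        "status": "error",                         # kept for backward compatibility
        "error_code": ToolErrorCode(code).value,   # stable, machine-matchable category
        "error": message,                          # generic, agent-facing text only
    }
    # Assumption: extra keys sit at the top level and cannot shadow the fixed keys.
    for key, value in extra.items():
        envelope.setdefault(key, value)
    return envelope
```

A caller can then branch on error_code instead of parsing the message, for example retrying only when the value is upstream_unavailable or timeout.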
AuroraTool._run is an error boundary. An unexpected exception is
logged once with its real type, then a sanitized AuroraToolError is
raised outside the except block so neither __cause__ nor
__context__ carries the original detail into the framework's error
wrapper. AuroraTool.timeout_seconds sets an optional
asyncio.wait_for budget; the timeout message keeps the literal token
timeout so the tool-invocation auditor classifies it correctly. See
ADR 047.
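The raise-outside-the-except pattern is easy to get subtly wrong, so here is a minimal sketch of it. It is not the actual AuroraTool._run: run_guarded, the logger name, and the message strings are hypothetical, and AuroraToolError is redefined locally as a stand-in. The point it shows is that an exception raised after the except block has finished has neither __cause__ nor __context__ set.

```python
import asyncio
import logging

logger = logging.getLogger("tool_boundary_sketch")  # hypothetical logger name


class AuroraToolError(Exception):
    """Stand-in for the sanitized error type named on this page."""


async def run_guarded(work, timeout_seconds=None):
    """Run `work()` (a coroutine function) behind an error and timeout boundary."""
    failure = None
    try:
        if timeout_seconds is None:
            return await work()  # no budget, e.g. for a tool that waits on an analyst
        return await asyncio.wait_for(work(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The message keeps the literal token "timeout" for the auditor.
        failure = "tool call exceeded its timeout budget"
    except Exception as exc:
        # Log the real type once, server-side; do not return it.
        logger.error("tool failed with %s", type(exc).__name__)
        failure = "tool failed with an internal error"
    # Raised after the except blocks have exited, so no exception is being
    # handled here and Python attaches neither __cause__ nor __context__.
    raise AuroraToolError(failure)
```

Raising inside the except block, even with `from None`, would still leave the original exception reachable through __context__, which is why the sketch stores the message and raises afterwards.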
Default budget and rate limiting (ADR 049)
The per-call budget became a boundary guarantee rather than an opt-in.
AuroraTool resolves it as: the explicit timeout_seconds if set;
else None (no budget) when the tool is marked is_hitl; else the
configured MCP_TOOL_TIMEOUT_SECONDS default. The five
human-in-the-loop tools (block_ip, isolate_endpoint,
edr_windows_isolate, edr_windows_kill_process,
request_human_approval) are is_hitl, so they keep waiting on an
analyst without a budget.
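The resolution order above can be written as a small pure function. This is a sketch under the assumption that "set" means "not None"; resolve_timeout and its parameter names are hypothetical and do not mirror the real AuroraTool attributes beyond timeout_seconds and is_hitl.

```python
from typing import Optional


def resolve_timeout(
    explicit: Optional[float],  # the tool's own timeout_seconds, if any
    is_hitl: bool,              # True for tools that block on an analyst
    default: float,             # the configured MCP_TOOL_TIMEOUT_SECONDS value
) -> Optional[float]:
    """Return the per-call budget in seconds, or None for no budget."""
    if explicit is not None:
        return explicit  # an explicit budget always wins, even on HITL tools
    if is_hitl:
        return None      # human-in-the-loop tools wait without a budget
    return default       # everything else gets the boundary default
```

With a default of 60, a plain tool resolves to 60, an is_hitl tool resolves to None, and a tool that sets timeout_seconds to 5 resolves to 5 regardless of is_hitl.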
A pure-ASGI MCPRateLimitMiddleware throttles requests per client IP
using the shared RedisRateLimiter. It is opt-in via
MCP_RATE_LIMIT_ENABLED (default off), exempts health-check paths, and
fails open when Redis is unavailable so a Redis blip degrades
protection rather than blacking out the server. The launcher wraps the
ASGI app with it after the bearer-token middleware. See
ADR 049.
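To make the opt-in, exempt-path, and fail-open behaviour concrete, here is a minimal pure-ASGI sketch. It is not MCPRateLimitMiddleware: the class names, the limiter's allow(key) method, the "/health" path, and the 429 response body are all assumptions, and an in-memory counter stands in for the shared RedisRateLimiter.

```python
class CountingLimiter:
    """In-memory stand-in for a Redis-backed limiter (one window, no expiry)."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.seen = {}

    async def allow(self, key):
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.max_requests


class RateLimitSketch:
    """Pure-ASGI middleware: per-client-IP throttle that fails open."""

    def __init__(self, app, limiter, enabled=False, exempt_paths=("/health",)):
        self.app = app
        self.limiter = limiter
        self.enabled = enabled                 # opt-in, off by default
        self.exempt_paths = frozenset(exempt_paths)  # assumed health-check path

    async def __call__(self, scope, receive, send):
        skip = (
            not self.enabled
            or scope["type"] != "http"
            or scope.get("path") in self.exempt_paths
        )
        if skip:
            await self.app(scope, receive, send)
            return

        client = scope.get("client")  # (host, port) tuple or None
        ip = client[0] if client else "unknown"
        try:
            allowed = await self.limiter.allow(ip)
        except Exception:
            # Fail open: a limiter outage degrades protection, not availability.
            allowed = True

        if allowed:
            await self.app(scope, receive, send)
            return

        await send({
            "type": "http.response.start",
            "status": 429,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"rate limit exceeded"})
```

Failing open is a deliberate trade-off: the alternative, failing closed, would turn a limiter outage into a full outage of every tool behind the server.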
Tracing and graceful shutdown
MCP tool invocations are traced with OpenTelemetry spans, and the server drains in-flight calls on shutdown rather than dropping them. The launcher builds the FastMCP ASGI app so the boundary guards (token middleware, mTLS context, rate limiting) apply to every request.
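This page does not describe how the drain is implemented. One common pattern, sketched here purely as an assumption, is to count in-flight calls and have the shutdown path wait, up to a grace period, for the count to reach zero. InFlightTracker, run, drain, and grace_seconds are hypothetical names.

```python
import asyncio


class InFlightTracker:
    """Counts running calls so shutdown can wait for them to finish."""

    def __init__(self):
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing is running yet

    async def run(self, coro):
        """Await `coro` while counting it as in flight."""
        self._count += 1
        self._idle.clear()
        try:
            return await coro
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()

    async def drain(self, grace_seconds):
        """Wait for in-flight calls; return False if the grace period expires."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=grace_seconds)
            return True
        except asyncio.TimeoutError:
            return False
```

A shutdown hook would stop accepting new requests first, then await drain(...), and only then close upstream connections.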
Settings
MCPSettings knobs: tool_timeout_seconds (default 60),
rate_limit_enabled (default false), rate_limit_requests
(default 600), rate_limit_window_seconds (default 60). The dev stack
is unaffected because rate limiting defaults off.
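For a quick feel of what the defaults mean, the sketch below mirrors the four knobs in a plain dataclass. The real MCPSettings class is not reproduced here; MCPSettingsSketch and its derived property are hypothetical, and only the field names and defaults are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCPSettingsSketch:
    tool_timeout_seconds: float = 60.0    # default per-call budget
    rate_limit_enabled: bool = False      # limiter is opt-in
    rate_limit_requests: int = 600        # allowed requests per window
    rate_limit_window_seconds: int = 60   # window length in seconds

    @property
    def average_requests_per_second(self) -> float:
        """Average per-IP rate the limit allows once enabled."""
        return self.rate_limit_requests / self.rate_limit_window_seconds
```

With the defaults, enabling the limiter allows 600 requests per 60-second window, an average of 10 requests per second per client IP. How bursts inside a window are treated depends on the RedisRateLimiter's windowing, which this page does not specify.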
What goes wrong
- A tool returns a bare {"error": ...} instead of the envelope: it has not been migrated to tool_error_envelope. Agents cannot branch on error_code.
- A non-HITL tool with no explicit budget now times out at the default: raise timeout_seconds for a self-bounded long-running tool, or mark it is_hitl only if it genuinely blocks on an analyst.
- Rate limiting silently does nothing: it is off unless MCP_RATE_LIMIT_ENABLED is set, and it fails open when Redis is down.