Skip to main content

Security and autonomy hardening

This page summarizes the expert-level hardening added across ADRs 035 to 041 and where each capability lives in the codebase. Each section links to its ADR for the full rationale.

Report design and AI chat (ADR 035)

Reports render as a clean, white, print-first corporate document for both HTML and PDF, with a structured Document Control block, numbered sections, and consistent tables. The design tokens live in packages/backend/aurorasoc/tools/document/templates/base.html.j2 and the PDF cover palette in tools/document/server.py. The report-generation AI chat supports multi-turn refinement (pass prior messages), validates the model output into substantive sections with one corrective retry, and distinguishes an unreachable model (deterministic fallback) from weak output.

Prompt-injection input guardrail (ADR 036)

orchestrator/guardrails/input_sanitizer.py neutralizes control, zero-width, and bidi characters, normalizes structured alert fields, and wraps attacker-influenceable content in non-forgeable UNTRUSTED-DATA fences. Every agent prompt carries an injection-resistance preamble via the agent factory, so fenced content is treated as data, never instructions. The red-team harness lives in tests/security/prompt_injection/.

AI chat adversarial defense

The interactive operator chat (POST /api/v1/chat/completions and /api/v1/chat/stream in api/main.py) applies the agent-plane guardrails plus an egress control, so a jailbreak or an indirect injection cannot turn the assistant into an attack or data-exfiltration channel. Threat model and the control for each:

  • Direct prompt injection and jailbreak ("ignore previous instructions", "you are now DAN", role or identity override). The shared system prompt prepends INJECTION_RESISTANCE_PREAMBLE and an operator-hardening directive that forbids revealing or modifying the system prompt and refuses role changes. Every user turn is run through neutralize() (control, zero-width, and bidi stripping with forged-fence defanging), and detect_injection() records a structured chat_injection_detected event for observability.
  • Indirect (second-order) injection via pasted logs, alerts, or tool output. The live-data grounding snapshot (recent alerts and cases) is wrapped in non-forgeable UNTRUSTED-DATA fences with fence(), so the model treats it as data to analyze, never as instructions.
  • System-prompt and secret exfiltration. orchestrator/guardrails/output_sanitizer.py (scrub_output, StreamScrubber) redacts secret-shaped strings (API keys, bearer tokens, JWTs, cloud credentials) and suppresses lines that reproduce the security directive, on both the non-streaming response and each streamed chunk (line-buffered so a secret that spans chunk boundaries is still caught).
  • Denial of wallet and prompt flooding. A per-user Redis sliding-window limiter (chat_limiter, 30 requests per minute) returns HTTP 429 with Retry-After.
  • Stale or fabricated time. The system prompt injects the current UTC date and time plus the active model and backend, so the assistant dates answers and reports correctly instead of guessing.

The report-from-chat marker (%%REPORT_REQUEST%%) is validated (well-formed JSON, bounded description) and failures surface as a report_error event rather than being dropped silently. See ADR 036 for the input-guardrail rationale; the chat red-team cases live in tests/security/prompt_injection/.

Pre-LLM triage filter (ADR 037)

detection/triage_filter.py scores each alert deterministically (severity, IOC reputation, asset criticality, false-positive history) before the LLM investigation. Clearly benign low-severity alerts auto-resolve with an audit reason and consume no inference; proceeding alerts carry a recommended automation tier.

Reversibility-aware autonomy and kill-switch (ADR 038)

orchestrator/actions/reversals.py records the reverse of each response action and the irreversible set. orchestrator/actions/post_exec_verification.py confirms an actuate or destructive action took effect, rolling it back on a negative verdict. orchestrator/kill_switch.py provides a global tier ceiling that the resolver applies to every call; operators engage and release it through POST /api/v1/admin/emergency-pause and /resume.

Observability (autonomy metrics, decision explainer)

services/autonomy_metrics.py exposes Prometheus metrics for guardrail denials, pre-LLM filter outcomes, canary promotions and rollbacks, and per-agent tier rank. services/decision_explainer.py renders plain-language reasons and remediation for guardrail decisions. The Grafana dashboard is infra/grafana/dashboards/agent-autonomy.json.

Web-defense hardening (ADR 039)

The inline web defense (ADR 032) gains a configurable fail mode (WEB_DEFENSE_FAIL_MODE=open|closed), a verdict cache keyed by the full inspection surface (method, path, query, inspected body, inspected headers), a per-client sliding-window rate limiter, and client reputation tracking. The runtime controls live in services/web_defense_runtime.py. Client identity for rate limiting and reputation must come from an infrastructure-verified peer header, never solely from client-supplied X-Forwarded-For.

Detection efficacy (ADR 040)

The Sigma corpus expands with curated rules and is measured two ways: an ATT&CK coverage generator (tools/scripts/detection/attack_coverage.py) emits a technique-to-rule matrix (see Detection ATT&CK coverage), and a purple-team harness (tests/detection/test_purple_team.py) drives canonical attack events through the matcher to assert true-positive coverage.

Production Vault auto-unseal (ADR 041)

infra/vault/vault-prod.hcl adds a transit auto-unseal stanza (recommended for self-hosted and air-gapped) with cloud KMS alternatives, removing the unseal-share distribution risk in production while keeping the dev Shamir default.

Em-dash prohibition

Em dashes are prohibited across the repository. The guard tools/scripts/codegen/check_no_em_dashes.py runs in CI and just lint, and tools/scripts/codegen/strip_em_dashes.py removes any that slip in.