Security and autonomy hardening
This page summarizes the expert-level hardening added across ADRs 035 to 041 and where each capability lives in the codebase. Each section links to its ADR for the full rationale.
Report design and AI chat (ADR 035)
Reports render as a clean, white, print-first corporate document for both HTML
and PDF, with a structured Document Control block, numbered sections, and
consistent tables. The design tokens live in
packages/backend/aurorasoc/tools/document/templates/base.html.j2 and the PDF
cover palette in tools/document/server.py. The report-generation AI chat
supports multi-turn refinement (pass prior messages), validates the model
output into substantive sections with one corrective retry, and distinguishes an
unreachable model (deterministic fallback) from weak output.
Prompt-injection input guardrail (ADR 036)
orchestrator/guardrails/input_sanitizer.py neutralizes control, zero-width, and
bidi characters, normalizes structured alert fields, and wraps
attacker-influenceable content in non-forgeable UNTRUSTED-DATA fences. Every
agent prompt carries an injection-resistance preamble via the agent factory, so
fenced content is treated as data, never instructions. The red-team harness lives
in tests/security/prompt_injection/.
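
The neutralize-then-fence flow described above can be sketched as follows. This is a minimal illustration, not the contents of input_sanitizer.py: the function bodies, the marker syntax, and the per-call random nonce (one plausible way to make a fence hard to forge) are all assumptions.

```python
import re
import secrets
import unicodedata

# Assumed marker label; the real fence syntax is not shown on this page.
FENCE_LABEL = "UNTRUSTED-DATA"
_FORGED_FENCE = re.compile(re.escape(FENCE_LABEL), re.IGNORECASE)


def neutralize(text: str) -> str:
    """Strip control and format characters, then defang forged fence labels.

    Unicode category Cc covers control characters; category Cf covers
    zero-width characters and bidi controls. Newline and tab are kept.
    """
    kept = [
        ch
        for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    ]
    cleaned = "".join(kept)
    # Rewrite any attacker-supplied copy of the label so it cannot close
    # or open a fence (hypothetical defanging rule).
    return _FORGED_FENCE.sub("untrusted_data", cleaned)


def fence(text: str) -> str:
    """Wrap neutralized text in markers that carry an unpredictable nonce."""
    nonce = secrets.token_hex(8)
    body = neutralize(text)
    return (
        f"<<{FENCE_LABEL} {nonce}>>\n"
        f"{body}\n"
        f"<<END {FENCE_LABEL} {nonce}>>"
    )
```

Dropping every Cf character also removes the zero-width joiner used in some emoji sequences, which is a reasonable trade-off for alert and log text but may not suit every input.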
AI chat adversarial defense
The interactive operator chat (POST /api/v1/chat/completions and
/api/v1/chat/stream in api/main.py) applies the agent-plane guardrails plus an
egress control, so a jailbreak or an indirect injection cannot turn the assistant
into an attack or data-exfiltration channel. The threat model, with the control for each threat:
- Direct prompt injection and jailbreak ("ignore previous instructions", "you
  are now DAN", role or identity override). The shared system prompt prepends
  INJECTION_RESISTANCE_PREAMBLE and an operator-hardening directive that forbids
  revealing or modifying the system prompt and refuses role changes. Every user
  turn is run through neutralize() (control, zero-width, and bidi stripping with
  forged-fence defanging), and detect_injection() records a structured
  chat_injection_detected event for observability.
- Indirect (second-order) injection via pasted logs, alerts, or tool output.
  The live-data grounding snapshot (recent alerts and cases) is wrapped in
  non-forgeable UNTRUSTED-DATA fences with fence(), so the model treats it as
  data to analyze, never as instructions.
- System-prompt and secret exfiltration.
  orchestrator/guardrails/output_sanitizer.py (scrub_output, StreamScrubber)
  redacts secret-shaped strings (API keys, bearer tokens, JWTs, cloud
  credentials) and suppresses lines that reproduce the security directive, on
  both the non-streaming response and each streamed chunk (line-buffered so a
  secret that spans chunk boundaries is still caught).
- Denial of wallet and prompt flooding. A per-user Redis sliding-window limiter
  (chat_limiter, 30 requests per minute) returns HTTP 429 with Retry-After.
- Stale or fabricated time. The system prompt injects the current UTC date and
  time plus the active model and backend, so the assistant dates answers and
  reports correctly instead of guessing.
The report-from-chat marker (%%REPORT_REQUEST%%) is validated (well-formed JSON,
bounded description) and failures surface as a report_error event rather than
being dropped silently. See ADR 036 for the input-guardrail rationale; the chat
red-team cases live in tests/security/prompt_injection/.
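
The line-buffered egress scrubbing in the threat list above can be illustrated with a short sketch. The names scrub_output and StreamScrubber come from this page, but the bodies, the regular expressions, and the [REDACTED] placeholder are assumptions made for illustration; the real module may detect more secret shapes and also suppresses directive lines, which is omitted here.

```python
import re

# Illustrative secret shapes only (assumed, not the project's actual rules).
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{16,}=*"),                  # bearer token
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped
    re.compile(r"AKIA[0-9A-Z]{16}"),                                   # AWS access key ID shape
]


def scrub_output(text: str) -> str:
    """Replace every secret-shaped match with a fixed placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class StreamScrubber:
    """Buffer streamed chunks and release only complete, scrubbed lines."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> str:
        # Hold back the trailing partial line so a secret split across two
        # chunks is reassembled before the patterns run.
        self._buf += chunk
        *complete, self._buf = self._buf.split("\n")
        return "".join(scrub_output(line) + "\n" for line in complete)

    def flush(self) -> str:
        """Scrub and return whatever remains when the stream ends."""
        rest, self._buf = self._buf, ""
        return scrub_output(rest)
```

Line buffering delays output until a newline arrives, so the caller has to invoke flush() at end of stream to emit the final partial line.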
Pre-LLM triage filter (ADR 037)
detection/triage_filter.py scores each alert deterministically (severity, IOC
reputation, asset criticality, false-positive history) before the LLM
investigation. Clearly benign low-severity alerts auto-resolve with an audit
reason and consume no inference; proceeding alerts carry a recommended automation
tier.
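
A deterministic pre-LLM score of this kind might look like the sketch below. The four inputs match the factors named above; the weights, the benign threshold, the field names, and the tier labels are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights and threshold, not the values used in triage_filter.py.
SEVERITY_POINTS = {"low": 10, "medium": 40, "high": 70, "critical": 90}
BENIGN_THRESHOLD = 15


@dataclass(frozen=True)
class Alert:
    severity: str             # "low", "medium", "high", or "critical"
    ioc_reputation: float     # 0.0 (benign) to 1.0 (known malicious)
    asset_criticality: float  # 0.0 to 1.0
    fp_rate: float            # historical false-positive fraction, 0.0 to 1.0


@dataclass(frozen=True)
class TriageDecision:
    score: float
    auto_resolved: bool
    reason: str                      # audit reason recorded with the decision
    recommended_tier: Optional[str]  # None when the alert auto-resolves


def triage(alert: Alert) -> TriageDecision:
    """Score an alert with fixed arithmetic; no model call is involved."""
    score = (
        SEVERITY_POINTS[alert.severity]
        + 30 * alert.ioc_reputation
        + 20 * alert.asset_criticality
        - 30 * alert.fp_rate
    )
    # Only low-severity alerts are eligible for auto-resolution.
    if alert.severity == "low" and score < BENIGN_THRESHOLD:
        reason = f"auto-resolved: low severity, score {score:.1f} < {BENIGN_THRESHOLD}"
        return TriageDecision(score, True, reason, None)
    # Hypothetical tier labels: riskier alerts get more human oversight.
    tier = "analyst_approval" if score >= 60 else "supervised_automation"
    return TriageDecision(score, False, "proceeds to LLM investigation", tier)
```

Because the score is pure arithmetic over the alert fields, the same alert always yields the same decision, which keeps the audit reason reproducible.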
Reversibility-aware autonomy and kill-switch (ADR 038)
orchestrator/actions/reversals.py records the reverse of each response action
and the irreversible set. orchestrator/actions/post_exec_verification.py
confirms an actuate or destructive action took effect, rolling it back on a
negative verdict. orchestrator/kill_switch.py provides a global tier ceiling
that the resolver applies to every call; operators engage and release it through
POST /api/v1/admin/emergency-pause and /resume.
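
The global tier ceiling can be pictured as a clamp applied to whatever tier the resolver would otherwise grant. The sketch below assumes an ordered set of tiers with hypothetical names and an in-process lock; the real kill-switch presumably shares its state across workers, which is not modeled here.

```python
import threading
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical autonomy tiers, ordered from least to most impactful."""
    OBSERVE = 0
    RECOMMEND = 1
    ACTUATE = 2
    DESTRUCTIVE = 3


class KillSwitch:
    """Holds a global ceiling that every tier resolution is clamped to."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ceiling = max(Tier)  # released: no restriction

    def engage(self, ceiling: Tier = Tier.OBSERVE) -> None:
        """Emergency pause: lower the ceiling for all subsequent calls."""
        with self._lock:
            self._ceiling = ceiling

    def release(self) -> None:
        """Resume: restore the unrestricted ceiling."""
        with self._lock:
            self._ceiling = max(Tier)

    def clamp(self, requested: Tier) -> Tier:
        """Return the lower of the requested tier and the current ceiling."""
        with self._lock:
            return min(requested, self._ceiling)
```

Applying the clamp inside the resolver, rather than at each call site, means no individual agent can forget to check the switch.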
Observability (autonomy metrics, decision explainer)
services/autonomy_metrics.py exposes Prometheus metrics for guardrail denials,
pre-LLM filter outcomes, canary promotions and rollbacks, and per-agent tier
rank. services/decision_explainer.py renders plain-language reasons and
remediation for guardrail decisions. The Grafana dashboard is
infra/grafana/dashboards/agent-autonomy.json.
Web-defense hardening (ADR 039)
The inline web defense (ADR 032) gains a configurable fail mode
(WEB_DEFENSE_FAIL_MODE=open|closed), a verdict cache keyed by the full
inspection surface (method, path, query, inspected body, inspected headers), a
per-client sliding-window rate limiter, and client reputation tracking. The
runtime controls live in services/web_defense_runtime.py. Client identity for
rate limiting and reputation must come from an infrastructure-verified peer
header, never solely from client-supplied X-Forwarded-For.
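
Two of these controls are easy to show in miniature: a cache key derived from the full inspection surface, and a per-client sliding-window limiter. The sketch below is an in-memory stand-in under assumed names and signatures, not the code in web_defense_runtime.py.

```python
import hashlib
import json
import time
from collections import defaultdict, deque


def verdict_cache_key(method, path, query, body: bytes, headers: dict) -> str:
    """Hash every inspected field so requests differing anywhere never collide."""
    surface = json.dumps(
        {
            "method": method.upper(),
            "path": path,
            "query": query,
            "body_sha256": hashlib.sha256(body).hexdigest(),
            # Header names are case-insensitive, so normalize before hashing.
            "headers": sorted((k.lower(), v) for k, v in headers.items()),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(surface.encode("utf-8")).hexdigest()


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window` seconds."""

    def __init__(self, limit: int, window: float, clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed hits

    def check(self, client_id: str):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # The oldest hit leaving the window frees the next slot.
            return False, self.window - (now - hits[0])
        hits.append(now)
        return True, 0.0
```

The client_id passed to check() should be the infrastructure-verified peer identity described above; keying the limiter on a client-supplied header would let an attacker rotate identities freely.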
Detection efficacy (ADR 040)
The Sigma corpus expands with curated rules and is measured two ways: an ATT&CK
coverage generator (tools/scripts/detection/attack_coverage.py) emits a
technique-to-rule matrix (see Detection ATT&CK coverage),
and a purple-team harness (tests/detection/test_purple_team.py) drives canonical
attack events through the matcher to assert true-positive coverage.
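
A technique-to-rule matrix can be derived from the ATT&CK tags that Sigma rules conventionally carry (for example attack.t1059.001). The sketch below assumes the rules are already parsed into dictionaries with title and tags keys; the function name and the sample rules are illustrative and are not taken from attack_coverage.py.

```python
import re
from collections import defaultdict

# Matches technique and sub-technique tags such as attack.t1059 or
# attack.t1059.001; tactic tags such as attack.execution are ignored.
TECHNIQUE_TAG = re.compile(r"^attack\.(t\d{4}(?:\.\d{3})?)$", re.IGNORECASE)


def coverage_matrix(rules):
    """Map each ATT&CK technique ID to the sorted titles of rules tagged with it."""
    matrix = defaultdict(list)
    for rule in rules:
        for tag in rule.get("tags", []):
            match = TECHNIQUE_TAG.match(tag)
            if match:
                matrix[match.group(1).upper()].append(rule["title"])
    return {tech: sorted(titles) for tech, titles in sorted(matrix.items())}


# Hypothetical sample rules, for illustration only.
SAMPLE_RULES = [
    {"title": "Encoded PowerShell", "tags": ["attack.execution", "attack.t1059.001"]},
    {"title": "LSASS Dump", "tags": ["attack.credential_access", "attack.t1003.001"]},
    {"title": "PowerShell Download", "tags": ["attack.t1059.001"]},
    {"title": "Untagged Rule"},
]

if __name__ == "__main__":
    for technique, titles in coverage_matrix(SAMPLE_RULES).items():
        print(technique, "->", ", ".join(titles))
```

Tag-based coverage only shows which techniques the rules claim to address; the purple-team harness is what checks that the rules actually fire on representative events.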
Production Vault auto-unseal (ADR 041)
infra/vault/vault-prod.hcl adds a transit auto-unseal stanza (recommended for
self-hosted and air-gapped deployments) with cloud KMS alternatives. This removes
the unseal-share distribution risk in production while keeping the Shamir default
for development.
Em-dash prohibition
Em dashes are prohibited across the repository. The guard
tools/scripts/codegen/check_no_em_dashes.py runs in CI and under just lint, and
tools/scripts/codegen/strip_em_dashes.py removes any that slip in.
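
The check and the fix are both small enough to sketch. The functions below are an assumed shape for illustration, not the contents of the two scripts; in particular, replacing the character with a plain hyphen is a guess at the stripping behavior.

```python
# Written as an escape so this example itself contains no em dash.
EM_DASH = "\u2014"


def find_em_dashes(text: str):
    """Return (line, column) pairs, both 1-based, for every em dash in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        col = line.find(EM_DASH)
        while col != -1:
            hits.append((lineno, col + 1))
            col = line.find(EM_DASH, col + 1)
    return hits


def strip_em_dashes(text: str, replacement: str = "-") -> str:
    """Replace each em dash with the given replacement (assumed default: hyphen)."""
    return text.replace(EM_DASH, replacement)
```

A CI guard built on find_em_dashes would exit non-zero whenever the returned list is non-empty, printing each location so the offending line is easy to find.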