LLM evaluation harness

What this page is

This page describes the shadow-then-canary harness that engineers use to promote a new model or prompt without manually grading every output. It covers the dataset loader, the shadow runner, the canary gate, and the eval_runs table that backs the promotion verdict.

Why it exists this way

ADR 009 records the decision shape. Model and prompt changes used to ship behind a manual A/B that nobody watched closely enough to catch subtle regressions. The harness gives the team a measurable promotion path and an auto-rollback signal that cannot be ignored.

How it works

Three Python modules collaborate, all under packages/backend/aurorasoc/eval/:

dataset.py: load_threat_hunter_holdout picks a deterministic cohort of closed investigations, seeded by the fixed integer THREAT_HUNTER_COHORT_SEED, so successive runs reproduce the same cohort. Rotating the seed requires a new ADR. Deletions in the underlying investigations table are rare in production and are handled by rotating the seed.
shadow.py: ShadowRunner wraps an agent dispatch. The candidate model runs in parallel with the production model; only the production output is acted on. Both completions land in the eval_runs table with a structural-agreement score (score_agreement: the Jaccard similarity of the JSON object keys combined with an exact match on the action field, weighted 30% key overlap and 70% action match). Persistence failures are logged and swallowed so that a transient eval_runs outage cannot abort the production dispatch.
canary.py: CanaryGate.evaluate inspects the rolling shadow window for a given agent and candidate model and returns a typed CanaryDecision: hold, promote, or rollback. The gate distinguishes inference_error (downstream outage; hold, do not roll back) from shadow_disagreement_breach (agreement rate below the floor; roll back). The orchestrator, not the gate, acts on the verdict, so promotion state has a single owner.
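
The structural-agreement score described for shadow.py can be sketched as follows. This is a minimal illustration, not the code in shadow.py: it assumes that only top-level keys are compared, that two empty objects count as full key agreement, and that a missing action field never counts as a match. The constant names are hypothetical.

```python
from typing import Any, Mapping

# Weights taken from the description above: 30% key overlap, 70% action match.
KEY_WEIGHT = 0.3
ACTION_WEIGHT = 0.7


def score_agreement(production: Mapping[str, Any], candidate: Mapping[str, Any]) -> float:
    """Return a structural-agreement score in [0, 1] for two parsed JSON objects."""
    prod_keys = set(production)
    cand_keys = set(candidate)
    union = prod_keys | cand_keys
    # Jaccard similarity of the top-level keys.
    # Assumption: two empty objects agree fully on keys.
    key_jaccard = len(prod_keys & cand_keys) / len(union) if union else 1.0
    # Exact match on the action field.
    # Assumption: if either side lacks "action", it is not a match.
    has_action = "action" in production and "action" in candidate
    action_match = 1.0 if has_action and production["action"] == candidate["action"] else 0.0
    return KEY_WEIGHT * key_jaccard + ACTION_WEIGHT * action_match


if __name__ == "__main__":
    prod = {"action": "isolate_host", "reason": "beaconing", "confidence": 0.9}
    cand = {"action": "isolate_host", "reason": "c2 traffic"}
    print(score_agreement(prod, cand))
```

Under these assumptions, two outputs with identical keys but different actions score 0.3, which shows how strongly the weighting favours the action match.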

The schema lives in migration 021. Composite indices on (agent_id, candidate_model, created_at) support the rolling-window query.
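
To illustrate why an index on (agent_id, candidate_model, created_at) suits the rolling-window query, here is a small sketch using an in-memory SQLite database. It is not the schema from migration 021: the agreement_score column, the index name, the use of epoch seconds for created_at, and the window_stats helper are all assumptions made for the example.

```python
import sqlite3

# Hypothetical, simplified schema; only the table name and the three indexed
# columns come from the page. Everything else is an assumption.
SCHEMA = """
CREATE TABLE eval_runs (
    id INTEGER PRIMARY KEY,
    agent_id TEXT NOT NULL,
    candidate_model TEXT NOT NULL,
    agreement_score REAL NOT NULL,
    created_at REAL NOT NULL
);
CREATE INDEX ix_eval_runs_window
    ON eval_runs (agent_id, candidate_model, created_at);
"""


def window_stats(conn, agent_id, candidate_model, now, window_seconds):
    """Return (sample_count, mean_agreement) for rows inside the rolling window."""
    cutoff = now - window_seconds
    # Equality on the first two index columns and a range on the third lets
    # the composite index narrow the scan to the window.
    count, mean = conn.execute(
        "SELECT COUNT(*), AVG(agreement_score) FROM eval_runs "
        "WHERE agent_id = ? AND candidate_model = ? AND created_at >= ?",
        (agent_id, candidate_model, cutoff),
    ).fetchone()
    return count, mean


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO eval_runs (agent_id, candidate_model, agreement_score, created_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("threat_hunter", "candidate-a", 0.0, 100.0),   # outside the window
        ("threat_hunter", "candidate-a", 1.0, 1000.0),
        ("threat_hunter", "candidate-a", 0.5, 1900.0),
        ("threat_hunter", "candidate-b", 0.0, 1950.0),  # different candidate
    ],
)

if __name__ == "__main__":
    print(window_stats(conn, "threat_hunter", "candidate-a", now=2000.0, window_seconds=1500.0))
```

The query filters by equality on agent_id and candidate_model and by range on created_at, which matches the column order of the composite index.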

ThreatHunter is wired end to end; the other eight agents inherit the same plumbing once their per-agent datasets are curated. The follow-up is tracked in the code as TODO(#0009-eval-rollout): wire remaining agents.

What goes wrong

Sample rate set too low: fewer than minimum_samples land inside the rolling window. The gate returns INSUFFICIENT_SAMPLES and holds. Raise the rate or wait for the window to fill.
High error rate masks a real regression: the gate classifies the window as INFERENCE_ERROR and holds rather than rolling back. This is by design: we do not roll back the model when the cause is a downstream outage. The signal is visible in the operator console; the runbook is to fix the downstream first.
Auto-rollback fired, but an operator graded the regression as acceptable: the operator can override the verdict via the console. The verdict stored in eval_runs.operator_verdict takes precedence over the structural score on the next gate evaluation.
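
The three cases above can be read as an ordered decision procedure. The sketch below is one plausible shape for it, not the code in canary.py. The thresholds, the order of the checks, the ShadowSample type, the "acceptable" verdict value, and the "agreement_ok" reason are all hypothetical; only the hold, promote, and rollback verdicts and the three reason names come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Verdict(str, Enum):
    HOLD = "hold"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class ShadowSample:
    score: float                            # structural-agreement score
    errored: bool = False                   # candidate inference failed
    operator_verdict: Optional[str] = None  # e.g. "acceptable" (assumed value)


@dataclass(frozen=True)
class CanaryDecision:
    verdict: Verdict
    reason: str


def evaluate(
    samples: Sequence[ShadowSample],
    minimum_samples: int = 50,     # hypothetical default
    max_error_rate: float = 0.2,   # hypothetical default
    agreement_floor: float = 0.9,  # hypothetical default
    pass_score: float = 0.7,       # hypothetical default
) -> CanaryDecision:
    # 1. Too few samples in the window: hold.
    if len(samples) < minimum_samples:
        return CanaryDecision(Verdict.HOLD, "insufficient_samples")
    # 2. High error rate points at a downstream outage: hold, never roll back.
    scored = [s for s in samples if not s.errored]
    error_rate = 1 - len(scored) / len(samples)
    if error_rate > max_error_rate or not scored:
        return CanaryDecision(Verdict.HOLD, "inference_error")

    # 3. An operator verdict, when present, overrides the structural score.
    def agrees(s: ShadowSample) -> bool:
        if s.operator_verdict is not None:
            return s.operator_verdict == "acceptable"
        return s.score >= pass_score

    agreement_rate = sum(agrees(s) for s in scored) / len(scored)
    if agreement_rate < agreement_floor:
        return CanaryDecision(Verdict.ROLLBACK, "shadow_disagreement_breach")
    return CanaryDecision(Verdict.PROMOTE, "agreement_ok")
```

Checking the error rate before the agreement rate is what makes a downstream outage hold the candidate instead of rolling it back, at the cost of temporarily hiding a real regression.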
