LLM evaluation harness
What this page is
This page describes the shadow-then-canary harness that engineers
use to promote a new model or prompt without manually grading
every output. It covers the dataset loader, the shadow runner,
the canary gate, and the eval_runs table that backs the promotion
verdict.
Why it exists this way
ADR 009 records the decision shape. Model and prompt changes used to ship behind a manual A/B that nobody watched closely enough to catch subtle regressions. The harness gives the team a measurable promotion path and an auto-rollback signal that cannot be ignored.
How it works
Three Python modules collaborate, all under packages/backend/aurorasoc/eval/:
- dataset.py: load_threat_hunter_holdout picks a deterministic cohort of closed investigations seeded by the fixed integer THREAT_HUNTER_COHORT_SEED. Successive runs reproduce the same cohort. Rotating the seed requires a new ADR; deletions in the underlying investigations table are rare in production and acceptable to handle by rotation.
- shadow.py: ShadowRunner wraps an agent dispatch. The candidate model runs in parallel with the production model; only the production output is acted on. Both completions land in the eval_runs table with a structural-agreement score (score_agreement: Jaccard of JSON object keys plus exact match on the action field, weighted 30/70 toward the action match). Persistence failures are logged and swallowed so a transient eval_runs outage cannot abort the production dispatch.
- canary.py: CanaryGate.evaluate inspects the rolling shadow window for a given agent and candidate model and returns a typed CanaryDecision: hold, promote, or rollback. The gate distinguishes inference_error (downstream outage; hold, do not roll back) from shadow_disagreement_breach (agreement rate below floor; roll back). The orchestrator owns acting on the verdict so a single source keeps promotion state consistent.
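The structural-agreement score described for shadow.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: only the name score_agreement and the 30/70 weighting come from the description above. The helper _as_object, the use of top-level keys only, the treatment of unparseable completions as empty objects, and the handling of a missing action field are all assumptions.

```python
import json
from typing import Any

KEY_WEIGHT = 0.3     # weight on Jaccard similarity of the JSON object keys
ACTION_WEIGHT = 0.7  # weight on an exact match of the "action" field


def _as_object(completion: str) -> dict[str, Any]:
    """Parse a completion as a JSON object.

    Assumption: anything that is not a JSON object counts as empty.
    """
    try:
        parsed = json.loads(completion)
    except (TypeError, ValueError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


def score_agreement(production: str, candidate: str) -> float:
    """Return a score in [0, 1]; higher means closer structural agreement."""
    prod = _as_object(production)
    cand = _as_object(candidate)

    prod_keys, cand_keys = set(prod), set(cand)
    union = prod_keys | cand_keys
    # Assumption: two empty key sets count as full key agreement.
    jaccard = len(prod_keys & cand_keys) / len(union) if union else 1.0

    # Assumption: a missing "action" on either side is a mismatch.
    action_match = (
        "action" in prod and "action" in cand and prod["action"] == cand["action"]
    )
    return KEY_WEIGHT * jaccard + ACTION_WEIGHT * (1.0 if action_match else 0.0)
```

Under this weighting, a candidate that reproduces every key but picks a different action scores only 0.3, which is why the action match dominates the agreement rate the gate sees.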
The schema lives in
migration 021.
Composite indices on (agent_id, candidate_model, created_at)
support the rolling-window query.
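To make the index's role concrete, here is a small sketch of a rolling-window query over a composite index on (agent_id, candidate_model, created_at). It uses an in-memory SQLite database purely as a stand-in; the real schema is whatever migration 021 defines. The agreement column, the index name ix_eval_runs_window, the ISO-8601 text timestamps, and the helper names are hypothetical.

```python
import sqlite3

# Hypothetical, trimmed-down shape of eval_runs for illustration only.
SCHEMA = """
CREATE TABLE eval_runs (
    id              INTEGER PRIMARY KEY,
    agent_id        TEXT NOT NULL,
    candidate_model TEXT NOT NULL,
    agreement       REAL,           -- hypothetical column name
    created_at      TEXT NOT NULL   -- ISO-8601 UTC, so text order is time order
);
CREATE INDEX ix_eval_runs_window
    ON eval_runs (agent_id, candidate_model, created_at);
"""

# Equality on the two leading index columns, range on the trailing one.
WINDOW_QUERY = """
SELECT agreement, created_at
FROM eval_runs
WHERE agent_id = ? AND candidate_model = ? AND created_at >= ?
ORDER BY created_at
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def rolling_window(conn: sqlite3.Connection, agent_id: str,
                   candidate_model: str, since: str) -> list[tuple]:
    """Return (agreement, created_at) rows at or after `since`, oldest first."""
    cursor = conn.execute(WINDOW_QUERY, (agent_id, candidate_model, since))
    return cursor.fetchall()
```

The column order in the index matters: the two equality filters narrow to one agent and candidate pair, and the trailing created_at column lets the range filter and the ordering be served from the same index.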
ThreatHunter is wired end to end; the other eight agents
inherit the same plumbing once their per-agent dataset is
curated. The follow-up is tracked as
TODO(#0009-eval-rollout): wire remaining agents in the
code.
What goes wrong
- Sample rate set too low: fewer than minimum_samples inside the rolling window. The gate returns INSUFFICIENT_SAMPLES and holds. Raise the rate or wait for the window to fill.
- High error rate masks a real regression: the gate classifies the window as INFERENCE_ERROR and holds rather than rolling back. This is by design: we do not roll back the model when the cause is a downstream outage. The signal is visible in the operator console; the runbook is to fix the downstream first.
- Auto-rollback fired but the regression was operator-graded acceptable. The operator can override the verdict via the console; the operator verdict on eval_runs.operator_verdict takes precedence over the structural score on the next gate evaluation.
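The three cases above can be read as an ordered decision procedure. The sketch below is one possible shape for it, not the actual CanaryGate.evaluate: the names CanaryDecision, minimum_samples, operator_verdict, and the reason codes come from this page, while the ShadowRun type, every threshold value, the "acceptable" verdict string, the AGREEMENT_ABOVE_FLOOR reason, and the rule that a run with no score represents an inference failure are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    HOLD = "hold"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


class Reason(Enum):
    INSUFFICIENT_SAMPLES = "insufficient_samples"
    INFERENCE_ERROR = "inference_error"
    SHADOW_DISAGREEMENT_BREACH = "shadow_disagreement_breach"
    AGREEMENT_ABOVE_FLOOR = "agreement_above_floor"  # hypothetical reason code


@dataclass(frozen=True)
class ShadowRun:
    agreement: Optional[float]              # None when inference failed (assumption)
    operator_verdict: Optional[str] = None  # e.g. "acceptable" (hypothetical value)


@dataclass(frozen=True)
class CanaryDecision:
    verdict: Verdict
    reason: Reason


def _agrees(run: ShadowRun, pass_score: float) -> bool:
    # An operator verdict, when present, overrides the structural score.
    if run.operator_verdict is not None:
        return run.operator_verdict == "acceptable"
    return run.agreement is not None and run.agreement >= pass_score


def evaluate(runs: Sequence[ShadowRun], minimum_samples: int = 50,
             agreement_floor: float = 0.8, max_error_rate: float = 0.2,
             pass_score: float = 0.7) -> CanaryDecision:
    """Classify one rolling window. All default thresholds are illustrative."""
    # 1. Too few samples: hold until the window fills.
    if not runs or len(runs) < minimum_samples:
        return CanaryDecision(Verdict.HOLD, Reason.INSUFFICIENT_SAMPLES)

    # 2. Mostly failed inference points at a downstream outage: hold, never roll back.
    scored = [run for run in runs if run.agreement is not None]
    error_rate = (len(runs) - len(scored)) / len(runs)
    if not scored or error_rate > max_error_rate:
        return CanaryDecision(Verdict.HOLD, Reason.INFERENCE_ERROR)

    # 3. Agreement rate below the floor: roll back; otherwise promote.
    agreement_rate = sum(_agrees(run, pass_score) for run in scored) / len(scored)
    if agreement_rate < agreement_floor:
        return CanaryDecision(Verdict.ROLLBACK, Reason.SHADOW_DISAGREEMENT_BREACH)
    return CanaryDecision(Verdict.PROMOTE, Reason.AGREEMENT_ABOVE_FLOOR)
```

Checking the error rate before the agreement rate is what makes a high error rate able to mask a real regression: the sketch returns a hold as soon as the window looks like an outage and never inspects the remaining scores.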