
LLM evaluation harness

What this page is

This page describes the shadow-then-canary harness that engineers use to promote a new model or prompt without manually grading every output. It covers the dataset loader, the shadow runner, the canary gate, and the eval_runs table that backs the promotion verdict.

Why it exists this way

ADR 009 records the decision shape. Model and prompt changes used to ship behind a manual A/B that nobody watched closely enough to catch subtle regressions. The harness gives the team a measurable promotion path and an auto-rollback signal that cannot be ignored.

How it works

Three Python modules collaborate, all under packages/backend/aurorasoc/eval/:

  • dataset.py: load_threat_hunter_holdout picks a deterministic cohort of closed investigations, seeded by the fixed integer THREAT_HUNTER_COHORT_SEED, so successive runs reproduce the same cohort. Rotating the seed requires a new ADR. Deletions in the underlying investigations table can change the cohort, but they are rare in production and are handled by rotating the seed.

  • shadow.py: ShadowRunner wraps an agent dispatch. The candidate model runs in parallel with the production model; only the production output is acted on. Both completions land in the eval_runs table with a structural-agreement score. score_agreement combines the Jaccard similarity of the two outputs' JSON object keys (weight 30%) with an exact match on the action field (weight 70%). Persistence failures are logged and swallowed so that a transient eval_runs outage cannot abort the production dispatch.

  • canary.py: CanaryGate.evaluate inspects the rolling shadow window for a given agent and candidate model and returns a typed CanaryDecision: hold, promote, or rollback. The gate distinguishes inference_error (downstream outage; hold, do not roll back) from shadow_disagreement_breach (agreement rate below the floor; roll back). The orchestrator alone acts on the verdict, so promotion state has a single owner and stays consistent.
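The cohort selection and the agreement score described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in dataset.py or shadow.py: pick_cohort, the seed value, and the rule that a missing action field counts as a mismatch are all hypothetical. Only the 30/70 weighting and the Jaccard-plus-exact-match shape come from the description.

```python
import random
from typing import Any, Mapping, Sequence

# Hypothetical value: the source only says the seed is a fixed integer.
THREAT_HUNTER_COHORT_SEED = 9


def pick_cohort(
    investigation_ids: Sequence[str],
    size: int,
    seed: int = THREAT_HUNTER_COHORT_SEED,
) -> list[str]:
    """Pick a reproducible cohort (hypothetical helper)."""
    # Sort first so the order in which rows arrive cannot change the result.
    ordered = sorted(investigation_ids)
    # A private RNG keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(ordered, min(size, len(ordered)))


def score_agreement(
    production: Mapping[str, Any],
    candidate: Mapping[str, Any],
) -> float:
    """Return 0.3 * Jaccard(keys) + 0.7 * exact match on 'action'."""
    prod_keys, cand_keys = set(production), set(candidate)
    union = prod_keys | cand_keys
    # Assumption: two empty objects count as structurally identical.
    jaccard = len(prod_keys & cand_keys) / len(union) if union else 1.0
    # Assumption: the action must be present in both outputs to match.
    action_match = (
        "action" in production
        and "action" in candidate
        and production["action"] == candidate["action"]
    )
    return 0.3 * jaccard + 0.7 * (1.0 if action_match else 0.0)
```

Because the sample is drawn from the sorted ID list, removing an ID shifts the positions the seeded RNG selects. That is one plausible reason deletions are handled by rotating the seed rather than patched in place.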

The schema lives in migration 021. Composite indices on (agent_id, candidate_model, created_at) support the rolling-window query.
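To illustrate how the index columns line up with the rolling-window query, here is a small sketch that uses an in-memory SQLite database as a stand-in for the real store. The table layout is an assumption: only eval_runs, agent_id, candidate_model, and created_at appear in the description, while agreement_score, the index name, and the helper functions are hypothetical. Migration 021 remains the authoritative schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified schema; migration 021 is the authoritative one.
SCHEMA = """
CREATE TABLE eval_runs (
    id INTEGER PRIMARY KEY,
    agent_id TEXT NOT NULL,
    candidate_model TEXT NOT NULL,
    created_at TEXT NOT NULL,      -- ISO 8601 in UTC, so text order is time order
    agreement_score REAL           -- hypothetical column name
);
CREATE INDEX ix_eval_runs_window
    ON eval_runs (agent_id, candidate_model, created_at);
"""


def _stamp(moment: datetime) -> str:
    # Normalise to UTC with second precision so string comparison is safe.
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


def record_run(conn, agent_id, candidate_model, created_at, score):
    conn.execute(
        "INSERT INTO eval_runs (agent_id, candidate_model, created_at, agreement_score) "
        "VALUES (?, ?, ?, ?)",
        (agent_id, candidate_model, _stamp(created_at), score),
    )


def window_scores(conn, agent_id, candidate_model, now, window):
    """Scores for one agent and candidate inside the rolling window."""
    rows = conn.execute(
        "SELECT agreement_score FROM eval_runs "
        "WHERE agent_id = ? AND candidate_model = ? AND created_at >= ? "
        "ORDER BY created_at",
        (agent_id, candidate_model, _stamp(now - window)),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
record_run(conn, "threat_hunter", "candidate-a", now - timedelta(hours=1), 1.0)
record_run(conn, "threat_hunter", "candidate-a", now - timedelta(hours=30), 0.3)  # too old
record_run(conn, "threat_hunter", "candidate-b", now - timedelta(hours=2), 0.8)   # other candidate
scores = window_scores(conn, "threat_hunter", "candidate-a", now, timedelta(hours=24))
```

The two equality filters followed by a range filter on created_at match the column order of the composite index, which is the usual reason to order the columns this way.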

ThreatHunter is wired end to end; the other eight agents inherit the same plumbing once their per-agent dataset is curated. The follow-up is tracked in the code as TODO(#0009-eval-rollout): wire remaining agents.

What goes wrong

  • Sample rate set too low: fewer than minimum_samples runs land inside the rolling window. The gate returns INSUFFICIENT_SAMPLES and holds. Raise the rate or wait for the window to fill.
  • High error rate masks a real regression: the gate classifies the window as INFERENCE_ERROR and holds rather than rolling back. This is by design: we do not roll back the model when the cause is a downstream outage. The signal is visible in the operator console; the runbook is to fix the downstream first.
  • Auto-rollback fired but the operator graded the regression as acceptable: the operator can override the verdict via the console. The value in eval_runs.operator_verdict takes precedence over the structural score on the next gate evaluation.
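The three failure modes above map onto an ordered series of checks, sketched here under stated assumptions. This is not the CanaryGate implementation: the ShadowRun shape, the threshold defaults, the "acceptable" verdict string, and the AGREEMENT_OK reason are hypothetical. Only the hold/promote/rollback verdicts, minimum_samples, and the other reason names come from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Verdict(str, Enum):
    HOLD = "hold"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class CanaryDecision:
    verdict: Verdict
    reason: str


@dataclass(frozen=True)
class ShadowRun:  # hypothetical view of one eval_runs row
    agreed: bool                             # structural score cleared the bar
    errored: bool = False                    # inference failed downstream
    operator_verdict: Optional[str] = None   # e.g. "acceptable" (assumed value)


def evaluate(
    runs: Sequence[ShadowRun],
    minimum_samples: int = 50,       # hypothetical default
    max_error_rate: float = 0.2,     # hypothetical default
    agreement_floor: float = 0.9,    # hypothetical default
) -> CanaryDecision:
    # 1. Too few samples: nothing can be concluded yet.
    if len(runs) < minimum_samples:
        return CanaryDecision(Verdict.HOLD, "INSUFFICIENT_SAMPLES")

    # 2. Downstream outage: hold, because the model is not the likely cause.
    scored = [run for run in runs if not run.errored]
    error_rate = 1.0 - len(scored) / len(runs)
    if error_rate > max_error_rate or not scored:
        return CanaryDecision(Verdict.HOLD, "INFERENCE_ERROR")

    # 3. An operator verdict takes precedence over the structural score.
    agreeing = sum(
        1 for run in scored
        if run.agreed or run.operator_verdict == "acceptable"
    )
    if agreeing / len(scored) < agreement_floor:
        return CanaryDecision(Verdict.ROLLBACK, "SHADOW_DISAGREEMENT_BREACH")

    return CanaryDecision(Verdict.PROMOTE, "AGREEMENT_OK")
```

Checking the error rate before the agreement rate is what makes a noisy outage hold instead of roll back. The function only returns a decision, leaving the orchestrator to act on it.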