Evaluation Metrics & Benchmarks

In cybersecurity operations, hallucination is unacceptable. If an LLM recommends quarantining a critical subnet because it misunderstood a benign alert, the operational damage can be severe.

To ensure AI readiness before deploying to production, AuroraSOC employs an automated evaluation suite. This document details how we benchmark fine-tuned models across specialist domains using structured keyword scoring.

The Evaluation Philosophy

We do not evaluate AuroraSOC models on generic conversational benchmarks like MMLU or human chat preference. The only metric that matters is: Does the model produce the correct security directives given a specific technical prompt?

To answer this, training/scripts/evaluate_model.py runs a suite of challenging, realistic SOC scenarios across all domains (Alert Analysis, Threat Hunting, Incident Response, etc.).

Analogy: We are not testing the model's ability to hold a pleasant conversation; we are putting it through a functional incident response drill. If it doesn't shout the correct technical keywords when shown an active ransomware event, it fails the drill.

Defining a Benchmark

Benchmarks are defined explicitly in evaluate_model.py. Each test case specifies a scenario, a set of expected keywords, and minimum required hits.

Here is an example structure:

{
    "id": "incident_response_01",
    "domain": "incident_response",
    "difficulty": "intermediate",
    "prompt": (
        "A ransomware incident has been detected. File encryption is actively occurring "
        "on server SRV-FILE-01. The ransom note references 'LockBit 3.0'. Initial access "
        "appears to be through a phishing email received 3 days ago. Provide an incident "
        "response plan following NIST SP 800-61."
    ),
    "expected_keywords": [
        "contain",
        "isolat",
        "backup",
        "lockbit",
        "phishing",
        "lateral",
        "eradicat",
        "recovery",
        "forensic",
    ],
    "min_keyword_matches": 4,
}

By keeping keywords stemmed (e.g., isolat matches isolate, isolated, and isolation), we evaluate the model's analytical competence without penalizing stylistic variations in phrasing. If the model does not formulate a plan that includes isolating the host and considering lateral movement, it fails.
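The stemmed-matching approach can be sketched as a simple case-insensitive substring check. This is an illustrative reimplementation of the scoring idea, not the actual evaluate_model.py internals; the function name and return shape are assumptions (the fields mirror the eval_results.json sample shown later in this document).

```python
# Illustrative sketch of stem-based keyword scoring (not the real
# evaluate_model.py code): a stemmed keyword like "isolat" hits any
# response containing "isolate", "isolated", or "isolation".

def score_response(response: str, expected_keywords: list[str],
                   min_matches: int) -> dict:
    """Count stemmed-keyword hits in a response via substring matching."""
    text = response.lower()
    hits = [kw for kw in expected_keywords if kw.lower() in text]
    missed = [kw for kw in expected_keywords if kw.lower() not in text]
    return {
        "keyword_hits": len(hits),
        "keyword_total": len(expected_keywords),
        "keyword_hit_rate": len(hits) / len(expected_keywords),
        "passed": len(hits) >= min_matches,
        "missed_keywords": missed,
    }
```

Because matching is substring-based, the model is free to phrase its plan however it likes, as long as the core technical actions appear somewhere in the response.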

Running the Evaluation Suite

The evaluate_model.py script utilizes the exact same serving backends as our agent fleet. It supports evaluating directly against vLLM, Ollama, or a local Unsloth checkpoint loaded in memory.
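One way a single script can serve all three backends is by routing on the --model prefix, as the CLI examples in this section suggest. The sketch below assumes that prefix convention (vllm:, ollama:, otherwise a local checkpoint path); the function name is illustrative, not the actual evaluate_model.py API.

```python
# Hypothetical backend routing based on the --model argument's prefix,
# mirroring the CLI examples in this section. A sketch, not the real code.

def resolve_backend(model_arg: str) -> tuple[str, str]:
    """Split a --model argument into (backend, model identifier)."""
    if model_arg.startswith("vllm:"):
        return "vllm", model_arg.removeprefix("vllm:")
    if model_arg.startswith("ollama:"):
        return "ollama", model_arg.removeprefix("ollama:")
    # Anything else is treated as a local checkpoint path.
    return "local", model_arg
```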

Against a Local Checkpoint

This is useful immediately after finetune_granite.py finishes, before you start up a full serving engine.

python training/scripts/evaluate_model.py \
--model training/checkpoints/granite_soc_lora

Against vLLM (Production Parity Test)

This tests the model exactly as it will run in production.

python training/scripts/evaluate_model.py \
--model vllm:granite-soc-specialist \
--vllm-base-url http://localhost:8000/v1

Against Ollama

python training/scripts/evaluate_model.py \
--model ollama:granite-soc:latest

Understanding the Results

When the evaluation completes, the script outputs a strict summary to stdout and logs a detailed eval_results.json artifact.

The evaluation runner tolerates stylistic variation, but is strict on substance:

  • Keyword Hit Rate: The percentage of defined keywords that appeared in the model's response.
  • Pass/Fail: Determined by whether the number of keyword hits meets the min_keyword_matches threshold.
  • Response Time: Time-to-last-token is tracked. A model that takes 45 seconds to diagnose an alert may be functionally useless in an automated pipeline, no matter how accurate it is.
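Rolling the per-benchmark records up into a suite-level summary is straightforward. This aggregation is a sketch under the assumption that each record carries the fields shown in the JSON sample below; the function name and summary keys are illustrative.

```python
# Illustrative roll-up of per-benchmark records into a suite summary.
# Input field names mirror the eval_results.json sample; the aggregation
# itself is an assumption, not the actual evaluate_model.py code.

def summarize(results: list[dict]) -> dict:
    passed = sum(1 for r in results if r["passed"])
    total_hits = sum(r["keyword_hits"] for r in results)
    total_kw = sum(r["keyword_total"] for r in results)
    return {
        "pass_rate": passed / len(results),
        "keyword_hit_rate": total_hits / total_kw,
        "mean_response_time_s": sum(r["response_time_s"] for r in results) / len(results),
    }
```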

A sample JSON output metric:

{
    "benchmark_id": "threat_hunt_01",
    "domain": "threat_hunting",
    "passed": true,
    "keyword_hits": 6,
    "keyword_total": 8,
    "keyword_hit_rate": 0.75,
    "response_time_s": 2.4,
    "missed_keywords": ["SPN", "krbtgt"]
}

Continuous Baseline Testing

If you alter the fine-tuning dataset, swap to a new foundation model architecture (e.g., Granite to Mistral), or change vLLM quantization strategies, you must re-run the evaluation suite and compare the new pass rate against the previous baseline.

A model should not be promoted to your production SOC environment unless it achieves a pass_rate >= 0.85 across the entire benchmark suite.
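That threshold can be enforced mechanically in CI. The sketch below assumes evaluate_model.py logs a summary containing a pass_rate field (as the eval_results.json sample suggests); the gate function and exit-code convention are illustrative, not part of the actual suite.

```python
# A possible CI gate on the 0.85 pass-rate threshold. Assumes a summary
# JSON with a "pass_rate" field; the gate itself is a sketch, not part
# of the real evaluate_model.py tooling.

import json

PASS_RATE_THRESHOLD = 0.85

def gate(summary_path: str) -> int:
    """Return 0 (promote) if pass_rate meets the threshold, else 1 (block)."""
    with open(summary_path) as f:
        summary = json.load(f)
    ok = summary["pass_rate"] >= PASS_RATE_THRESHOLD
    print(f"pass_rate={summary['pass_rate']:.2f} -> {'PROMOTE' if ok else 'BLOCK'}")
    return 0 if ok else 1
```

Wiring this into a pipeline step that runs after the evaluation suite keeps under-performing checkpoints from ever reaching the agent fleet.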