Evaluation & Export

After training, you need to evaluate the model's quality and export it into a deployable format. This page covers both the evaluation benchmark suite and the model export/serving pipeline.

Evaluation Overview

The evaluation script (training/scripts/evaluate_model.py) tests the model against 8 structured benchmarks across 12 security domains, scoring responses based on keyword matching and response quality.

Why Evaluate?

  • Catch regressions — ensure the model didn't hallucinate or forget basic SOC knowledge
  • Compare variants — generic vs. per-agent specialists, different LoRA ranks, different base models
  • Gate deployments — only deploy models that pass a minimum score threshold
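The deployment gate can be sketched as a small check. This is a hypothetical helper, not part of the evaluation script; the 0.70 threshold matches the passing score defined in the Minimum Passing Scores table below.

```python
def gate_deployment(benchmark_scores: dict[str, float],
                    threshold: float = 0.70) -> bool:
    """Deploy only if the average benchmark score clears the threshold."""
    average = sum(benchmark_scores.values()) / len(benchmark_scores)
    return average >= threshold

# One weak domain can drag the average below the gate:
scores = {"alert_triage": 0.82, "threat_hunting": 0.78, "forensics": 0.45}
print(gate_deployment(scores))  # average ≈ 0.683 → False
```

In a CI pipeline this boolean would decide whether the export step runs at all.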

Evaluation Domains

The evaluation covers 12 security domains:

| Domain | What It Tests |
| --- | --- |
| alert_triage | Classifying alert severity, IOC extraction |
| threat_hunting | Hypothesis-driven hunting, KQL queries |
| malware_analysis | Reverse engineering, YARA rules, PE analysis |
| incident_response | NIST playbooks, containment plans |
| network_security | Flow analysis, Suricata rules, DNS tunneling |
| vulnerability_management | CVE analysis, CVSS scoring, patch prioritization |
| forensics | Memory forensics, disk analysis, evidence handling |
| threat_intelligence | APT attribution, STIX/TAXII, diamond model |
| compliance | CIS benchmarks, NIST 800-53, PCI DSS |
| cloud_security | AWS/Azure IAM, container escapes, cloud trails |
| endpoint_security | EDR triage, process analysis, persistence detection |
| ics_ot_security | Modbus/DNP3 analysis, Purdue model, IEC 62443 |

Evaluation Benchmarks

Each benchmark is a structured test with a prompt, expected domain, and keyword-based scoring:

Alert Triage Benchmark

Prompt: "Analyze this alert: ET TROJAN Cobalt Strike Beacon C2 Activity
detected from 10.0.1.50 to 185.220.101.42:443"
Expected keywords: cobalt strike, beacon, c2, command and control,
severity, critical, high, ioc, indicator
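A benchmark like the one above can be represented as a plain dict. The field names here are illustrative (chosen to line up with the scoring code later on this page), not a guaranteed mirror of the script's internal structure:

```python
alert_triage_benchmark = {
    "name": "alert_triage",
    "prompt": (
        "Analyze this alert: ET TROJAN Cobalt Strike Beacon C2 Activity "
        "detected from 10.0.1.50 to 185.220.101.42:443"
    ),
    "expected_domain": "alert_triage",
    "expected_keywords": [
        "cobalt strike", "beacon", "c2", "command and control",
        "severity", "critical", "high", "ioc", "indicator",
    ],
}
print(len(alert_triage_benchmark["expected_keywords"]))  # 9
```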

Threat Hunting Benchmark

Prompt: "Create a threat hunting hypothesis for detecting lateral movement
via PsExec in an enterprise Windows environment"
Expected keywords: psexec, lateral movement, hypothesis, windows,
event log, sysmon, network, smb, admin share

Malware Analysis Benchmark

Prompt: "You received a suspicious PE file with high entropy sections and
imports from ws2_32.dll. Perform initial triage analysis."
Expected keywords: entropy, pe, portable executable, ws2_32, network,
packed, obfuscated, sandbox, import, section

Other Benchmarks

  • Incident Response — NIST-compliant response to ransomware
  • Network Security — DNS tunneling detection
  • Vulnerability Management — CVE assessment
  • Forensics — volatile memory acquisition
  • Threat Intelligence — APT campaign analysis

Scoring System

Each response is scored using keyword-based matching:

def score_response(response: str, benchmark: dict) -> dict:
    response_lower = response.lower()
    matched = [kw for kw in benchmark["expected_keywords"]
               if kw in response_lower]
    keyword_score = len(matched) / len(benchmark["expected_keywords"])

    # Response quality heuristics
    length_score = min(len(response.split()) / 100, 1.0)
    structure_score = any(marker in response
                          for marker in ["1.", "- ", "## ", "**"])

    final_score = (keyword_score * 0.5 +
                   length_score * 0.3 +
                   (0.2 if structure_score else 0.0))

    return {
        "keyword_score": keyword_score,
        "keywords_matched": matched,
        "keywords_missed": [kw for kw in benchmark["expected_keywords"]
                            if kw not in response_lower],
        "length_score": length_score,
        "structure_score": structure_score,
        "final_score": final_score,
    }

| Score Component | Weight | What It Measures |
| --- | --- | --- |
| Keyword match | 50% | Did the response mention expected security concepts? |
| Response length | 30% | Is the response substantive (>100 words = max)? |
| Structure | 20% | Does it use structured formatting (lists, headers)? |

Minimum Passing Scores

| Overall Score | Verdict |
| --- | --- |
| ≥ 0.70 | Pass — ready for deployment |
| 0.50 – 0.69 | Marginal — may need more training data or epochs |
| < 0.50 | Fail — investigate training loss, data quality, or config |
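The thresholds above map directly to a verdict function. This is a sketch of that mapping, not the evaluation script's actual code:

```python
def verdict(score: float) -> str:
    """Map an overall evaluation score to a deployment verdict."""
    if score >= 0.70:
        return "PASS"
    if score >= 0.50:
        return "MARGINAL"
    return "FAIL"

print(verdict(0.756))  # PASS
print(verdict(0.68))   # MARGINAL
```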

Running Evaluation

Evaluate an Ollama Model

# Evaluate a model served by Ollama
python training/scripts/evaluate_model.py ollama:granite-soc:latest

# With custom Ollama URL
OLLAMA_HOST=http://localhost:11434 \
python training/scripts/evaluate_model.py ollama:granite-soc:latest

Evaluate a Local Checkpoint

# Evaluate directly from a training checkpoint (no Ollama needed)
python training/scripts/evaluate_model.py training/output/generic/

# Evaluate a per-agent specialist
python training/scripts/evaluate_model.py training/output/threat_hunter/

Evaluate via Docker

docker compose -f docker-compose.training.yml run eval

Evaluate via Makefile

make train-eval

Evaluation Output

The script produces a detailed report:

╔══════════════════════════════════════════╗
║ AuroraSOC Model Evaluation Report        ║
╠══════════════════════════════════════════╣
║ Model: granite-soc:latest                ║
║ Benchmarks: 8 | Domains: 12              ║
╠══════════════════════════════════════════╣
║ Benchmark             Score   Status     ║
║ ────────────────────  ─────   ──────     ║
║ Alert Triage          0.82    PASS ✓     ║
║ Threat Hunting        0.78    PASS ✓     ║
║ Malware Analysis      0.75    PASS ✓     ║
║ Incident Response     0.80    PASS ✓     ║
║ Network Security      0.71    PASS ✓     ║
║ Vulnerability Mgmt    0.68    MARGINAL   ║
║ Forensics             0.74    PASS ✓     ║
║ Threat Intelligence   0.77    PASS ✓     ║
╠══════════════════════════════════════════╣
║ Average Score: 0.756 │ Overall: PASS     ║
╚══════════════════════════════════════════╝

Model Export Pipeline

After training and evaluation, the export pipeline converts the model into deployable formats:

Export Formats

| Format | File | Backend | Use Case |
| --- | --- | --- | --- |
| LoRA adapter | adapter_model.safetensors | N/A (requires merge) | Resume training, share fine-tune |
| FP16 merged | model-*.safetensors | vLLM | Production serving via vLLM |
| GGUF quantized | unsloth.Q8_0.gguf | Ollama | Local/edge deployment |
| HuggingFace Hub | Remote | Any | Sharing, collaboration |

Automatic Export (During Training)

The training script exports automatically based on the YAML config export section:

export:
  save_lora: true            # Always save — lightweight, resumable
  save_merged_16bit: false   # Only if you need vLLM deployment
  save_gguf: true            # For Ollama deployment
  gguf_quantization_methods:
    - "q8_0"                 # Standard quality
  push_to_hub: false         # Set true + HF_TOKEN to upload
  hub_model_name: ""         # HuggingFace repo name
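A minimal sketch of how such a config section might drive the export steps. The function and step names are illustrative placeholders, not the training script's actual API:

```python
def run_exports(export_cfg: dict) -> list[str]:
    """Return the export steps selected by the config (illustrative only)."""
    steps = []
    if export_cfg.get("save_lora"):
        steps.append("lora")            # lightweight adapter, always cheap
    if export_cfg.get("save_merged_16bit"):
        steps.append("merged_fp16")     # needed only for vLLM serving
    if export_cfg.get("save_gguf"):
        for method in export_cfg.get("gguf_quantization_methods", []):
            steps.append(f"gguf:{method}")
    if export_cfg.get("push_to_hub"):
        steps.append("hub")
    return steps

cfg = {"save_lora": True, "save_merged_16bit": False,
       "save_gguf": True, "gguf_quantization_methods": ["q8_0"]}
print(run_exports(cfg))  # ['lora', 'gguf:q8_0']
```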

Manual Export (From Checkpoint)

Re-export a model without re-training:

# Export from a LoRA checkpoint
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml \
--export-only \
--agent threat_hunter

--export-only skips training entirely and just runs the export pipeline on the existing checkpoint.

GGUF Quantization Methods

| Method | Quality | Speed | Size | When to Use |
| --- | --- | --- | --- | --- |
| q8_0 | Highest | Baseline | ~2-4 GB | Default — best quality/size tradeoff |
| q4_k_m | Good | ~20% faster | ~1-2 GB | Constrained edge devices |
| q5_k_m | Better | ~10% faster | ~1.5-3 GB | Balance between q4 and q8 |
| f16 | Perfect | Slowest | ~4-8 GB | When quality is paramount |
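One way to encode that trade-off is a small selection heuristic. Both the function and its VRAM thresholds are assumptions for illustration; they are not part of the export pipeline:

```python
def pick_gguf_method(vram_gb: float, prefer_quality: bool = True) -> str:
    """Heuristic quantization choice (thresholds are illustrative)."""
    if vram_gb >= 8 and prefer_quality:
        return "f16"      # full precision when memory allows
    if vram_gb >= 4:
        return "q8_0"     # the default quality/size tradeoff
    if vram_gb >= 3:
        return "q5_k_m"
    return "q4_k_m"       # constrained edge devices

print(pick_gguf_method(16))                        # f16
print(pick_gguf_method(16, prefer_quality=False))  # q8_0
print(pick_gguf_method(2))                         # q4_k_m
```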

Serving Models

Ollama (GGUF — Local/Edge)

Import a single model:

python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest

Import all per-agent specialists:

python training/scripts/serve_model.py ollama-all \
--output-dir training/output

What serve_model.py ollama does:

  1. Generates a Modelfile with the Granite 4 chat template:

    FROM /path/to/unsloth.Q8_0.gguf

    TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
    {{ .System }}<|end_of_text|>
    {{- end }}
    <|start_of_role|>user<|end_of_role|>
    {{ .Prompt }}<|end_of_text|>
    <|start_of_role|>assistant<|end_of_role|>
    {{ .Response }}<|end_of_text|>"""

    PARAMETER temperature 0.1
    PARAMETER top_p 0.95
    PARAMETER stop "<|end_of_text|>"
    PARAMETER stop "<|start_of_role|>"
  2. Runs ollama create <name> -f Modelfile

  3. Verifies with ollama list

Why the chat template matters: Ollama doesn't know Granite 4's unique <|start_of_role|> format by default. The Modelfile's TEMPLATE tells Ollama how to format multi-turn conversations. Without this, the model receives garbled input and produces poor responses.
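The three steps above can be sketched in a few lines. This is a simplified stand-in for what serve_model.py does, not its actual code; only the template string and parameters are taken from the Modelfile shown above:

```python
import subprocess
from pathlib import Path

# Granite 4 chat template, copied from the Modelfile above.
GRANITE_TEMPLATE = '''TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""'''

def build_modelfile(gguf_path: str) -> str:
    """Assemble an Ollama Modelfile for a Granite GGUF export."""
    return "\n".join([
        f"FROM {Path(gguf_path).resolve()}",  # absolute path avoids lookup issues
        GRANITE_TEMPLATE,
        "PARAMETER temperature 0.1",
        "PARAMETER top_p 0.95",
        'PARAMETER stop "<|end_of_text|>"',
        'PARAMETER stop "<|start_of_role|>"',
    ])

def ollama_create(name: str, gguf_path: str) -> None:
    """Write the Modelfile and register the model with Ollama."""
    Path("Modelfile").write_text(build_modelfile(gguf_path))
    subprocess.run(["ollama", "create", name, "-f", "Modelfile"], check=True)
```

`ollama list` afterwards confirms the model was registered.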

vLLM (FP16 — Production)

For high-throughput production serving:

# Start vLLM with the merged FP16 model
python training/scripts/serve_model.py vllm \
--model-path training/output/generic/merged_fp16

# Or via Docker Compose
docker compose -f docker-compose.training.yml up vllm

vLLM is preferred for production because:

  • Continuous batching — serves multiple requests simultaneously
  • PagedAttention — efficient GPU memory management
  • OpenAI-compatible API — drop-in replacement, same /v1/chat/completions endpoint
  • Higher throughput — 10-100× more requests/second than Ollama
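Because the API is OpenAI-compatible, a client needs nothing beyond the standard library. The base URL and model name below are assumptions for illustration (vLLM defaults to port 8000, but your deployment may differ):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "granite-soc") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,  # low temperature for deterministic SOC answers
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to vLLM's /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The same client code works against any OpenAI-compatible backend, which is what makes vLLM a drop-in replacement.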

When to Use Each Backend

| Factor | Ollama | vLLM |
| --- | --- | --- |
| Deployment | Local, edge, development | Production, cloud |
| Model format | GGUF (quantized) | FP16 (full precision) |
| GPU requirement | Optional (CPU fallback) | Required |
| Throughput | Low (single request) | High (batched) |
| Setup complexity | Simple (ollama create) | Moderate (GPU config) |
| Quality | Good (q8_0) to Great (f16) | Perfect (FP16) |

Comparing Models

A recommended evaluation workflow:

# 1. Evaluate the generic model
python training/scripts/evaluate_model.py ollama:granite-soc:latest > eval_generic.txt

# 2. Evaluate a per-agent specialist
python training/scripts/evaluate_model.py ollama:granite-soc-threat-hunter:latest > eval_specialist.txt

# 3. Compare
diff eval_generic.txt eval_specialist.txt

Or evaluate against the base (un-fine-tuned) model:

# Pull the base model
ollama pull granite3.2:2b

# Evaluate base
python training/scripts/evaluate_model.py ollama:granite3.2:2b > eval_base.txt

# Compare fine-tuned vs. base
diff eval_base.txt eval_generic.txt
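Instead of a raw diff, the average scores can be pulled out and compared directly. This hypothetical extractor assumes the "Average Score:" line format shown in the report above:

```python
import re

def average_score(report_text: str) -> float:
    """Pull the average score out of an evaluation report."""
    match = re.search(r"Average Score:\s*([0-9.]+)", report_text)
    if match is None:
        raise ValueError("no 'Average Score' line found in report")
    return float(match.group(1))

# Illustrative report lines, not real results:
base = "║ Average Score: 0.512 │ Overall: MARGINAL ║"
tuned = "║ Average Score: 0.756 │ Overall: PASS ║"
print(f"fine-tuning gained {average_score(tuned) - average_score(base):+.3f}")
```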

Troubleshooting

Low Evaluation Scores

| Symptom | Cause | Fix |
| --- | --- | --- |
| All keyword scores < 0.3 | Model not learning | Check training loss — it should decrease. Try more epochs. |
| Good keywords, bad structure | Model outputs unformatted text | Add structured examples to training data |
| Good on some domains, bad on others | Unbalanced training data | Check domain distribution in dataset |
| Scores vary wildly between runs | Temperature too high | Set temperature=0.1 for deterministic eval |

Export Failures

| Error | Cause | Fix |
| --- | --- | --- |
| CUDA out of memory during merge | FP16 merge needs the full model in memory | Use a machine with more VRAM, or export GGUF only |
| GGUF file too large | Using f16 quantization | Switch to q8_0 or q4_k_m |
| Ollama create fails | Modelfile path issue | Use an absolute path to the GGUF file |

Next Steps