# Evaluation & Export
After training, you need to evaluate the model's quality and export it into a deployable format. This page covers both the evaluation benchmark suite and the model export/serving pipeline.
## Evaluation Overview

The evaluation script (`training/scripts/evaluate_model.py`) tests the model against 8 structured benchmarks across 12 security domains, scoring responses based on keyword matching and response quality.
### Why Evaluate?
- Catch regressions — ensure the model didn't hallucinate or forget basic SOC knowledge
- Compare variants — generic vs. per-agent specialists, different LoRA ranks, different base models
- Gate deployments — only deploy models that pass a minimum score threshold
### Evaluation Domains

The evaluation covers 12 security domains:

| Domain | What It Tests |
|---|---|
| `alert_triage` | Classifying alert severity, IOC extraction |
| `threat_hunting` | Hypothesis-driven hunting, KQL queries |
| `malware_analysis` | Reverse engineering, YARA rules, PE analysis |
| `incident_response` | NIST playbooks, containment plans |
| `network_security` | Flow analysis, Suricata rules, DNS tunneling |
| `vulnerability_management` | CVE analysis, CVSS scoring, patch prioritization |
| `forensics` | Memory forensics, disk analysis, evidence handling |
| `threat_intelligence` | APT attribution, STIX/TAXII, diamond model |
| `compliance` | CIS benchmarks, NIST 800-53, PCI DSS |
| `cloud_security` | AWS/Azure IAM, container escapes, cloud trails |
| `endpoint_security` | EDR triage, process analysis, persistence detection |
| `ics_ot_security` | Modbus/DNP3 analysis, Purdue model, IEC 62443 |
## Evaluation Benchmarks
Each benchmark is a structured test with a prompt, expected domain, and keyword-based scoring:
### Alert Triage Benchmark

**Prompt:** "Analyze this alert: ET TROJAN Cobalt Strike Beacon C2 Activity detected from 10.0.1.50 to 185.220.101.42:443"

**Expected keywords:** cobalt strike, beacon, c2, command and control, severity, critical, high, ioc, indicator
### Threat Hunting Benchmark

**Prompt:** "Create a threat hunting hypothesis for detecting lateral movement via PsExec in an enterprise Windows environment"

**Expected keywords:** psexec, lateral movement, hypothesis, windows, event log, sysmon, network, smb, admin share
### Malware Analysis Benchmark

**Prompt:** "You received a suspicious PE file with high entropy sections and imports from ws2_32.dll. Perform initial triage analysis."

**Expected keywords:** entropy, pe, portable executable, ws2_32, network, packed, obfuscated, sandbox, import, section
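Internally, each benchmark can be represented as a plain dict. The sketch below is an assumption about that shape: `expected_keywords` mirrors the key the scoring function reads, while `name`, `domain`, and `prompt` are illustrative field names, not confirmed from the script:

```python
# Hypothetical benchmark definition. Only "expected_keywords" is known
# from the scoring code; the other field names are assumptions.
MALWARE_ANALYSIS_BENCHMARK = {
    "name": "Malware Analysis",
    "domain": "malware_analysis",
    "prompt": (
        "You received a suspicious PE file with high entropy sections and "
        "imports from ws2_32.dll. Perform initial triage analysis."
    ),
    "expected_keywords": [
        "entropy", "pe", "portable executable", "ws2_32", "network",
        "packed", "obfuscated", "sandbox", "import", "section",
    ],
}
```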
### Other Benchmarks
- Incident Response — NIST-compliant response to ransomware
- Network Security — DNS tunneling detection
- Vulnerability Management — CVE assessment
- Forensics — volatile memory acquisition
- Threat Intelligence — APT campaign analysis
## Scoring System
Each response is scored using keyword-based matching:
```python
def score_response(response: str, benchmark: dict) -> dict:
    response_lower = response.lower()
    matched = [kw for kw in benchmark["expected_keywords"]
               if kw in response_lower]
    keyword_score = len(matched) / len(benchmark["expected_keywords"])

    # Response quality heuristics
    length_score = min(len(response.split()) / 100, 1.0)
    structure_score = any(marker in response
                          for marker in ["1.", "- ", "## ", "**"])

    final_score = (keyword_score * 0.5 +
                   length_score * 0.3 +
                   (0.2 if structure_score else 0.0))

    return {
        "keyword_score": keyword_score,
        "keywords_matched": matched,
        "keywords_missed": [kw for kw in benchmark["expected_keywords"]
                            if kw not in response_lower],
        "length_score": length_score,
        "structure_score": structure_score,
        "final_score": final_score,
    }
```
| Score Component | Weight | What It Measures |
|---|---|---|
| Keyword match | 50% | Did the response mention expected security concepts? |
| Response length | 30% | Is the response substantive (>100 words = max)? |
| Structure | 20% | Does it use structured formatting (lists, headers)? |
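As a worked example of these weights, take an assumed response that matches 6 of the alert-triage benchmark's 9 keywords, runs 120 words, and uses bullet lists:

```python
# Worked example of the scoring formula with assumed inputs:
# 6/9 keywords matched, 120 words, structured formatting present.
keyword_score = 6 / 9                 # ~0.667
length_score = min(120 / 100, 1.0)    # capped at 1.0 above 100 words
structure_bonus = 0.2                 # structure markers found

final_score = keyword_score * 0.5 + length_score * 0.3 + structure_bonus
print(round(final_score, 3))  # 0.833
```

A well-structured, substantive response can therefore pass the 0.70 threshold even with a third of the keywords missing.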
### Minimum Passing Scores
| Overall Score | Verdict |
|---|---|
| ≥ 0.70 | Pass — ready for deployment |
| 0.50 – 0.69 | Marginal — may need more training data or epochs |
| < 0.50 | Fail — investigate training loss, data quality, or config |
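A deployment gate on top of these thresholds is a few lines of code; `classify_score` below is a hypothetical helper sketched for illustration, not part of `evaluate_model.py`:

```python
def classify_score(overall: float) -> str:
    """Map an overall evaluation score to the verdicts in the table above."""
    if overall >= 0.70:
        return "pass"
    if overall >= 0.50:
        return "marginal"
    return "fail"

# Only deploy models that clear the 0.70 threshold.
print(classify_score(0.756))  # pass
```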
## Running Evaluation

### Evaluate an Ollama Model
```bash
# Evaluate a model served by Ollama
python training/scripts/evaluate_model.py ollama:granite-soc:latest

# With custom Ollama URL
OLLAMA_HOST=http://localhost:11434 \
  python training/scripts/evaluate_model.py ollama:granite-soc:latest
```
### Evaluate a Local Checkpoint

```bash
# Evaluate directly from a training checkpoint (no Ollama needed)
python training/scripts/evaluate_model.py training/output/generic/

# Evaluate a per-agent specialist
python training/scripts/evaluate_model.py training/output/threat_hunter/
```
### Evaluate via Docker

```bash
docker compose -f docker-compose.training.yml run eval
```
### Evaluate via Makefile

```bash
make train-eval
```
## Evaluation Output
The script produces a detailed report:
```
╔══════════════════════════════════════════════╗
║      AuroraSOC Model Evaluation Report       ║
╠══════════════════════════════════════════════╣
║ Model: granite-soc:latest                    ║
║ Benchmarks: 8 | Domains: 12                  ║
╠══════════════════════════════════════════════╣
║ Benchmark             Score    Status        ║
║ ───────────────────── ─────    ──────        ║
║ Alert Triage          0.82     PASS ✓        ║
║ Threat Hunting        0.78     PASS ✓        ║
║ Malware Analysis      0.75     PASS ✓        ║
║ Incident Response     0.80     PASS ✓        ║
║ Network Security      0.71     PASS ✓        ║
║ Vulnerability Mgmt    0.68     MARGINAL      ║
║ Forensics             0.74     PASS ✓        ║
║ Threat Intelligence   0.77     PASS ✓        ║
╠══════════════════════════════════════════════╣
║ Average Score: 0.756  │  Overall: PASS       ║
╚══════════════════════════════════════════════╝
```
## Model Export Pipeline
After training and evaluation, the export pipeline converts the model into deployable formats:
### Export Formats

| Format | File | Backend | Use Case |
|---|---|---|---|
| LoRA adapter | `adapter_model.safetensors` | N/A (requires merge) | Resume training, share fine-tune |
| FP16 merged | `model-*.safetensors` | vLLM | Production serving via vLLM |
| GGUF quantized | `unsloth.Q8_0.gguf` | Ollama | Local/edge deployment |
| HuggingFace Hub | Remote | Any | Sharing, collaboration |
### Automatic Export (During Training)

The training script exports automatically based on the `export` section of the YAML config:
```yaml
export:
  save_lora: true              # Always save — lightweight, resumable
  save_merged_16bit: false     # Only if you need vLLM deployment
  save_gguf: true              # For Ollama deployment
  gguf_quantization_methods:
    - "q8_0"                   # Standard quality
  push_to_hub: false           # Set true + HF_TOKEN to upload
  hub_model_name: ""           # HuggingFace repo name
```
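A minimal sketch of how a training script might act on these flags once the YAML is loaded into a dict. The `run_exports` dispatcher and the exporter callables are illustrative stand-ins, not the actual routines in `finetune_granite.py`:

```python
# Illustrative dispatch over the export config; the exporter callables
# are hypothetical stand-ins for the real LoRA/FP16/GGUF export routines.
def run_exports(export_cfg: dict, exporters: dict) -> list:
    ran = []
    if export_cfg.get("save_lora", True):
        exporters["lora"]()
        ran.append("lora")
    if export_cfg.get("save_merged_16bit", False):
        exporters["merged_16bit"]()
        ran.append("merged_16bit")
    if export_cfg.get("save_gguf", False):
        for method in export_cfg.get("gguf_quantization_methods", ["q8_0"]):
            exporters["gguf"](method)
            ran.append("gguf:" + method)
    return ran

# Mirrors the YAML above: LoRA adapter + q8_0 GGUF, no FP16 merge.
cfg = {"save_lora": True, "save_merged_16bit": False,
       "save_gguf": True, "gguf_quantization_methods": ["q8_0"]}
noop = {"lora": lambda: None, "merged_16bit": lambda: None,
        "gguf": lambda m: None}
print(run_exports(cfg, noop))  # ['lora', 'gguf:q8_0']
```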
### Manual Export (From Checkpoint)
Re-export a model without re-training:
```bash
# Export from a LoRA checkpoint
python training/scripts/finetune_granite.py \
  --config training/configs/granite_soc_finetune.yaml \
  --export-only \
  --agent threat_hunter
```
`--export-only` skips training entirely and just runs the export pipeline on the existing checkpoint.
### GGUF Quantization Methods

| Method | Quality | Speed | Size | When to Use |
|---|---|---|---|---|
| `q8_0` | Highest | Baseline | ~2–4 GB | Default — best quality/size tradeoff |
| `q4_k_m` | Good | ~20% faster | ~1–2 GB | Constrained edge devices |
| `q5_k_m` | Better | ~10% faster | ~1.5–3 GB | Balance between q4 and q8 |
| `f16` | Perfect | Slowest | ~4–8 GB | When quality is paramount |
## Serving Models

### Ollama (GGUF — Local/Edge)
Import a single model:
```bash
python training/scripts/serve_model.py ollama \
  --gguf training/output/generic/unsloth.Q8_0.gguf \
  --name granite-soc:latest
```

Import all per-agent specialists:

```bash
python training/scripts/serve_model.py ollama-all \
  --output-dir training/output
```
What `serve_model.py ollama` does:

1. Generates a Modelfile with the Granite 4 chat template:

   ```
   FROM /path/to/unsloth.Q8_0.gguf
   TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
   {{ .System }}<|end_of_text|>
   {{- end }}
   <|start_of_role|>user<|end_of_role|>
   {{ .Prompt }}<|end_of_text|>
   <|start_of_role|>assistant<|end_of_role|>
   {{ .Response }}<|end_of_text|>"""
   PARAMETER temperature 0.1
   PARAMETER top_p 0.95
   PARAMETER stop "<|end_of_text|>"
   PARAMETER stop "<|start_of_role|>"
   ```

2. Runs `ollama create <name> -f Modelfile`
3. Verifies the import with `ollama list`
**Why the chat template matters:** Ollama doesn't know Granite 4's unique `<|start_of_role|>` format by default. The Modelfile's `TEMPLATE` tells Ollama how to format multi-turn conversations. Without this, the model receives garbled input and produces poor responses.
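To make the template concrete, here is a simplified Python rendering of what Ollama produces for a single system + user turn (this re-implements the template by hand for illustration; it is not how Ollama executes it internally):

```python
# Simplified re-implementation of the Modelfile TEMPLATE above, showing
# the role-tagged prompt Granite 4 expects at generation time (when the
# assistant response is still empty).
def render_granite_prompt(system: str, prompt: str) -> str:
    parts = []
    if system:  # mirrors the template's {{- if .System }} guard
        parts.append("<|start_of_role|>system<|end_of_role|>\n"
                     + system + "<|end_of_text|>")
    parts.append("<|start_of_role|>user<|end_of_role|>\n"
                 + prompt + "<|end_of_text|>")
    parts.append("<|start_of_role|>assistant<|end_of_role|>")
    return "\n".join(parts)

print(render_granite_prompt("You are a SOC analyst.", "Triage this alert."))
```

The stop parameters (`<|end_of_text|>`, `<|start_of_role|>`) then cut generation off before the model starts a new role block.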
### vLLM (FP16 — Production)
For high-throughput production serving:
```bash
# Start vLLM with the merged FP16 model
python training/scripts/serve_model.py vllm \
  --model-path training/output/generic/merged_fp16

# Or via Docker Compose
docker compose -f docker-compose.training.yml up vllm
```
vLLM is preferred for production because:
- Continuous batching — serves multiple requests simultaneously
- PagedAttention — efficient GPU memory management
- OpenAI-compatible API — drop-in replacement, same `/v1/chat/completions` endpoint
- Higher throughput — 10-100× more requests/second than Ollama
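Because the API is OpenAI-compatible, any OpenAI-style client can talk to it. A stdlib-only sketch follows; the local URL and model name are assumptions about your deployment:

```python
import json
import urllib.request

def build_chat_request(model: str, user_msg: str) -> dict:
    """Payload for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": 0.1,
    }

payload = build_chat_request("granite-soc", "Summarize this alert.")
print(json.dumps(payload, indent=2))

# Uncomment to send against a running vLLM server (assumed local URL/port):
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```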
### When to Use Each Backend
| Factor | Ollama | vLLM |
|---|---|---|
| Deployment | Local, edge, development | Production, cloud |
| Model format | GGUF (quantized) | FP16 (full precision) |
| GPU requirement | Optional (CPU fallback) | Required |
| Throughput | Low (single request) | High (batched) |
| Setup complexity | Simple (`ollama create`) | Moderate (GPU config) |
| Quality | Good (q8_0) to Great (f16) | Perfect (FP16) |
## Comparing Models
A recommended evaluation workflow:
```bash
# 1. Evaluate the generic model
python training/scripts/evaluate_model.py ollama:granite-soc:latest > eval_generic.txt

# 2. Evaluate a per-agent specialist
python training/scripts/evaluate_model.py ollama:granite-soc-threat-hunter:latest > eval_specialist.txt

# 3. Compare
diff eval_generic.txt eval_specialist.txt
```
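For a per-benchmark delta rather than a raw `diff`, a small parser works; this assumes score lines look like the sample report above (a benchmark name followed by a `0.xx` score), which may need adjusting to the script's actual output:

```python
import re

# Assumed line shape, per the sample report: "Alert Triage  0.82  PASS"
SCORE_RE = re.compile(r"([A-Za-z ]+?)\s+(\d\.\d+)")

def parse_scores(report: str) -> dict:
    """Extract {benchmark name: score} from an evaluation report."""
    scores = {}
    for line in report.splitlines():
        m = SCORE_RE.search(line)
        if m:
            scores[m.group(1).strip()] = float(m.group(2))
    return scores

generic = parse_scores("Alert Triage 0.82\nThreat Hunting 0.78")
specialist = parse_scores("Alert Triage 0.80\nThreat Hunting 0.91")

for name in generic:
    delta = specialist[name] - generic[name]
    print(name + ": " + format(delta, "+.2f"))
```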
Or evaluate against the base (un-fine-tuned) model:
```bash
# Pull the un-fine-tuned base model
ollama pull granite3.2:2b

# Evaluate base
python training/scripts/evaluate_model.py ollama:granite3.2:2b > eval_base.txt

# Compare fine-tuned vs. base
diff eval_base.txt eval_generic.txt
```
## Troubleshooting

### Low Evaluation Scores
| Symptom | Cause | Fix |
|---|---|---|
| All keyword scores < 0.3 | Model not learning | Check training loss — should decrease. Try more epochs. |
| Good keywords, bad structure | Model outputs unformatted text | Add structured examples to training data |
| Good on some domains, bad on others | Unbalanced training data | Check domain distribution in dataset |
| Scores vary wildly between runs | Temperature too high | Set temperature=0.1 for deterministic eval |
### Export Failures
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory during merge | FP16 merge needs the full model in memory | Use a machine with more VRAM, or export GGUF only |
| GGUF file too large | Using f16 quantization | Switch to q8_0 or q4_k_m |
| `ollama create` fails | Modelfile path issue | Use an absolute path to the GGUF file |
## Next Steps
- Configuration Reference — full YAML config reference
- LLM Integration: Serving Backends — deep dive on Ollama vs vLLM
- LLM Integration: Model Swap — enable fine-tuned models in AuroraSOC