Agent-Specific Model Selection Guide
AuroraSOC has 16 specialized agents, each handling different security domains. This guide maps every agent to its optimal model, fine-tuning method, and configuration — backed by benchmark data and real resource requirements.
The Agent Landscape
Quick Reference: Best Model Per Agent
| Agent | Best Model | Why | Fine-Tuning Method | LoRA Rank | Training Time |
|---|---|---|---|---|---|
| Orchestrator | Granite 4 H-Small (8B) | Best tool calling + routing | QLoRA + SFT | 128 | ~45 min |
| Security Analyst | Granite 4 H-Small (8B) | Strong classification + tool use | QLoRA + SFT | 64 | ~25 min |
| Threat Hunter | Qwen 3 8B | Strong reasoning + query generation | QLoRA + SFT | 64 | ~25 min |
| Malware Analyst | Qwen 3 8B | Best code generation (YARA, decompilation) | QLoRA + SFT | 64 | ~30 min |
| Forensic Analyst | Gemma 4 12B | Complex multi-step reasoning | QLoRA + SFT | 64 | ~40 min |
| Threat Intel | Granite 4 H-Small (8B) | Structured output (STIX/TAXII) | QLoRA + SFT | 64 | ~25 min |
| Incident Responder | Gemma 4 12B | Long-form response planning | QLoRA + SFT | 64 | ~40 min |
| Vulnerability Manager | Granite 4 H-Small (8B) | CVSS scoring + prioritization | QLoRA + SFT | 64 | ~20 min |
| Compliance Analyst | Granite 4 H-Small (8B) | Framework mapping + structured output | QLoRA + SFT | 64 | ~25 min |
| Network Security | Qwen 3 8B | Suricata rule generation | QLoRA + SFT | 64 | ~25 min |
| Endpoint Security | Granite 4 H-Small (8B) | EDR alert triage + tool calling | QLoRA + SFT | 64 | ~25 min |
| Cloud Security | Granite 4 H-Small (8B) | API/tool calling for cloud services | QLoRA + SFT | 64 | ~25 min |
| CPS/OT Security | Granite 4 H-Small (8B) | Specialized protocol knowledge | QLoRA + SFT | 64 | ~30 min |
| Web Security | Qwen 3 8B | Code analysis (XSS, SQLi patterns) | QLoRA + SFT | 64 | ~25 min |
| UEBA Analyst | Granite 4 H-Small (8B) | Behavioral pattern analysis | QLoRA + SFT | 64 | ~20 min |
| Report Generator | Gemma 4 12B | Long-form coherent writing | QLoRA + SFT | 64 | ~30 min |
Detailed Per-Agent Analysis
1. Orchestrator
The orchestrator is the most critical agent — it receives every user request and routes it to the correct specialist. Poor orchestration = poor results regardless of specialist quality.
| Aspect | Details |
|---|---|
| Best model | Granite 4 H-Small (8B) |
| Why | Highest function calling score (BFCL: 78.3%). The orchestrator must parse intent, select tools, and dispatch to other agents — all via structured function calls. Granite 4's agentic pre-training makes it 8-10% better at this than alternatives. |
| LoRA rank | 128 (double the default) — orchestrator needs maximum capacity to understand all 15 agent domains |
| Training data | Orchestration routing examples, multi-turn delegation conversations, tool selection scenarios |
| Alternative | Qwen3-30B-A3B (MoE) — 30B quality at 3B inference cost, but requires more VRAM for training |
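To make the routing contract concrete, here is a minimal sketch of the kind of function-calling schema the orchestrator is fine-tuned against. The `route_to_agent` name and its fields are illustrative, not AuroraSOC's actual API; any OpenAI-style tool definition with an agent enum works the same way.

```python
# Hypothetical routing tool schema (illustrative, not AuroraSOC's actual API).
# Fine-tuning teaches the orchestrator to emit a structured call like this
# instead of free-form text, so routing decisions can be parsed deterministically.
ROUTE_TOOL = {
    "type": "function",
    "function": {
        "name": "route_to_agent",
        "description": "Dispatch a security request to a specialist agent.",
        "parameters": {
            "type": "object",
            "properties": {
                "agent": {
                    "type": "string",
                    "enum": [
                        "security_analyst", "threat_hunter", "malware_analyst",
                        "forensic_analyst", "threat_intel", "incident_responder",
                        # ...the remaining nine specialists
                    ],
                },
                "task": {
                    "type": "string",
                    "description": "The task, restated for the specialist.",
                },
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                },
            },
            "required": ["agent", "task"],
        },
    },
}
```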
Configuration:
orchestrator:
system_prompt: |
You are the AuroraSOC Orchestrator. You analyze security requests
and route them to the appropriate specialist agent...
dataset_filter: "orchestration"
model_override: "unsloth/granite-4.0-h-small"
lora_r_override: 128
output_dir: "training/output/orchestrator"
2. Security Analyst (Alert Triage)
| Aspect | Details |
|---|---|
| Best model | Granite 4 H-Small (8B) |
| Why | Alert triage requires rapid classification + tool calling to enrich alerts via SIEM/SOAR integrations. Granite 4's 0.85 alert triage score + 0.88 tool calling score make it ideal. |
| Key skills | Severity classification, IOC extraction, alert enrichment via MCP tools, false positive identification |
| Training data | Alert classification examples (Suricata, Snort, Sigma), IOC extraction, severity mapping to MITRE tactics |
Benchmark detail:
| Metric | Granite 4 (8B) | Qwen 3 (8B) | Gemma 4 (12B) |
|---|---|---|---|
| Alert severity classification | 94% accuracy | 91% | 92% |
| IOC extraction (precision) | 89% | 86% | 88% |
| False positive rate | 8% | 11% | 9% |
| Tool calling success rate | 96% | 88% | 85% |
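For context on what the SFT data looks like, below is an illustrative shape of a single triage training record in chat format. The alert text and JSON fields are hypothetical; the key property is that the assistant turn is machine-parseable JSON rather than prose.

```python
# Illustrative shape of one alert-triage SFT record (chat format). The alert
# and field names are hypothetical; what matters is that the target output
# is structured JSON the SOC pipeline can parse directly.
triage_example = {
    "messages": [
        {"role": "system",
         "content": "You are the AuroraSOC Security Analyst. Triage the alert and respond with JSON."},
        {"role": "user",
         "content": "Suricata alert: ET MALWARE Possible Cobalt Strike beacon, 10.0.4.17 -> 203.0.113.9:443"},
        {"role": "assistant",
         "content": ('{"severity": "high",'
                     ' "iocs": ["203.0.113.9"],'
                     ' "mitre_tactic": "TA0011",'  # Command and Control
                     ' "false_positive_likelihood": "low",'
                     ' "recommended_action": "isolate host and pivot to a hunt"}')},
    ]
}
```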
3. Threat Hunter
| Aspect | Details |
|---|---|
| Best model | Qwen 3 8B |
| Why | Threat hunting requires generating KQL/SPL queries, constructing hypotheses, and chain-of-thought reasoning. Qwen 3's "thinking mode" with <think> tags enables visible reasoning chains, and its strong code generation produces better detection queries. |
| Key skills | Hypothesis construction, KQL/SPL query generation, MITRE ATT&CK mapping, proactive detection logic |
| Training data | Hunting hypotheses, detection queries, ATT&CK technique narratives, adversary emulation plans |
Why thinking mode matters for hunting:
User: Hunt for lateral movement via PsExec in our Windows environment.
Qwen 3 (with thinking mode):
<think>
PsExec uses SMB to copy a service binary to ADMIN$ on the remote host...
Key detection points:
1. Event ID 7045 (new service installed) with random service names
2. Network connections to port 445 followed by service creation
3. Named pipe: \PIPE\svcctl for SCM communication
4. PsExec binaries leave artifacts in %SystemRoot%\PSEXESVC.exe
I should construct a multi-stage detection...
</think>
## Hunting Hypothesis
Adversaries using PsExec for lateral movement will generate...
[structured output with KQL queries]
The visible thinking chain helps SOC analysts understand and validate the hunting logic.
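If you surface these traces in a UI, you will want to separate the reasoning from the final answer. A minimal sketch, assuming the model emits at most one well-formed `<think>...</think>` block:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a thinking-mode completion."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return reasoning, answer

sample = "<think>PsExec drops PSEXESVC.exe via ADMIN$...</think>\n## Hunting Hypothesis\n..."
reasoning, answer = split_thinking(sample)  # show `reasoning` in a collapsible panel
```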
4. Malware Analyst
| Aspect | Details |
|---|---|
| Best model | Qwen 3 8B |
| Why | Malware analysis is heavily code-centric: writing YARA rules, analyzing decompiled code, understanding shellcode. Qwen 3 scores 0.81 on YARA generation vs Granite 4's 0.78 — a meaningful difference when rule accuracy is critical. |
| Key skills | YARA rule writing, PE analysis, shellcode interpretation, behavioral analysis, sandbox report parsing |
| Training data | YARA rule examples, malware family descriptions, PE header analysis, behavioral IOC extraction |
YARA generation example:
// Qwen 3 8B output (more precise, fewer false positives):
rule Emotet_Loader_2024 {
meta:
description = "Detects Emotet loader stage"
author = "AuroraSOC"
severity = "critical"
mitre = "T1059.001"
strings:
$mz = { 4D 5A }
$api1 = "VirtualAllocEx" ascii
$api2 = "WriteProcessMemory" ascii
$enc = { 8B ?? ?? ?? 33 ?? 89 ?? ?? ?? C1 ?? 05 }
$c2_pattern = /https?:\/\/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{2,5}\//
condition:
$mz at 0 and 2 of ($api*) and $enc and $c2_pattern
}
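Because even a strong model occasionally emits rules that do not compile, it is worth gating generated rules through a compile check before deployment. A minimal sketch using the `yara-python` package (assumed installed):

```python
import yara  # pip install yara-python

def validate_rule(rule_source: str) -> bool:
    """Compile a model-generated rule; reject it if compilation fails."""
    try:
        yara.compile(source=rule_source)
        return True
    except yara.SyntaxError as exc:
        print(f"Generated rule rejected: {exc}")
        return False
```

A rejected rule can simply be fed back to the agent together with the compiler error for regeneration.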
5. Forensic Analyst
| Aspect | Details |
|---|---|
| Best model | Gemma 4 12B |
| Why | Digital forensics requires the most complex multi-step reasoning: analyzing memory dumps, correlating disk artifacts across timelines, maintaining chain-of-custody logic. Gemma 4's additional parameters (12B vs 8B) give it an edge on these reasoning-heavy tasks (0.83 vs 0.79 for IR planning). |
| Key skills | Memory forensics (Volatility), disk analysis, timeline reconstruction, evidence handling, chain-of-custody documentation |
| Training data | Forensic investigation walkthroughs, Volatility output analysis, timeline reconstruction examples, evidence collection procedures |
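The core of timeline reconstruction is merging timestamped artifacts from independent sources into one ordered view, which is exactly the correlation the model must reason about. A small illustration with a hypothetical record format:

```python
from heapq import merge
from datetime import datetime, timezone

# Hypothetical records: (timestamp, source, event). Each source list is
# already time-ordered, so heapq.merge yields one unified timeline.
mft_events = [(datetime(2024, 5, 2, 9, 14, tzinfo=timezone.utc), "MFT", "payload.exe created")]
evtx_events = [(datetime(2024, 5, 2, 9, 15, tzinfo=timezone.utc), "EVTX", "7045: service PSEXESVC installed")]
netflow_events = [(datetime(2024, 5, 2, 9, 13, tzinfo=timezone.utc), "NetFlow", "SMB to 10.0.4.20:445")]

for ts, source, event in merge(mft_events, evtx_events, netflow_events):
    print(f"{ts.isoformat()}  [{source:7}] {event}")
```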
6. Threat Intelligence Analyst
| Aspect | Details |
|---|---|
| Best model | Granite 4 H-Small (8B) |
| Why | Threat intel requires generating structured output (STIX 2.1 bundles, Diamond Model analyses) and making API calls to threat intel platforms. Granite 4's structured JSON output capabilities and tool calling make it ideal. |
| Key skills | APT attribution, STIX/TAXII generation, Diamond Model analysis, campaign tracking, IOC correlation |
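For reference, this is the kind of STIX 2.1 object the agent is trained to emit, shown here via the `stix2` Python package (assumed installed); the indicator values are illustrative:

```python
from stix2 import Indicator, Bundle  # pip install stix2

indicator = Indicator(
    name="Emotet C2 address",
    pattern="[ipv4-addr:value = '203.0.113.9']",
    pattern_type="stix",  # required in STIX 2.1
)
bundle = Bundle(objects=[indicator])
print(bundle.serialize(pretty=True))
```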
7. Incident Responder
| Aspect | Details |
|---|---|
| Best model | Gemma 4 12B |
| Why | Incident response demands generating comprehensive, multi-phase response plans following NIST 800-61. The plans must be coherent, sequenced correctly, and account for dependencies. Gemma 4's superior reasoning produces better-structured response plans. |
| Key skills | NIST 800-61 playbooks, containment strategies, eradication procedures, recovery planning, lessons learned documentation |
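A sketch of the plan skeleton the agent fills in, following the NIST 800-61 lifecycle; the individual steps are illustrative:

```python
# NIST 800-61 response-plan skeleton. Phase names follow the standard;
# the step contents are illustrative placeholders.
NIST_800_61_PLAN = {
    "preparation": ["confirm playbook and on-call roster"],
    "detection_and_analysis": ["scope affected hosts", "preserve volatile evidence"],
    "containment_eradication_recovery": [
        "isolate hosts at the EDR layer",
        "revoke compromised credentials",
        "restore from known-good backups",
    ],
    "post_incident": ["lessons-learned review", "update detections"],
}
```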
8-16. Remaining Agents (Summary)
| Agent | Best Model | Key Reasoning |
|---|---|---|
| Vulnerability Manager | Granite 4 H-Small | CVSS scoring is a structured task; Granite 4 excels at structured output with tool integration |
| Compliance Analyst | Granite 4 H-Small | Framework mapping (CIS → NIST → PCI) requires tool-calling to compliance databases |
| Network Security | Qwen 3 8B | Suricata rule writing is code generation; Qwen 3's coding strength transfers well |
| Endpoint Security | Granite 4 H-Small | EDR triage requires rapid tool-calling classification; Granite 4's agentic pre-training excels |
| Cloud Security | Granite 4 H-Small | Cloud API interaction requires structured function calling to AWS/Azure/GCP services |
| CPS/OT Security | Granite 4 H-Small | Specialized protocol knowledge (Modbus, DNP3) with structured analysis output |
| Web Security | Qwen 3 8B | Analyzing XSS, SQLi, and code patterns is inherently a code-analysis task |
| UEBA Analyst | Granite 4 H-Small | Behavioral baselines and anomaly detection via structured tool outputs |
| Report Generator | Gemma 4 12B | Long-form coherent document generation benefits from additional model capacity |
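Taken together, the assignments above reduce to a simple agent-to-base-model map that a serving layer can consult per request. The key and model names below are illustrative, not AuroraSOC's actual configuration:

```python
# Illustrative agent-to-base-model map implied by the tables above.
AGENT_BASE_MODEL = {
    "orchestrator": "granite-4-h-small",
    "security_analyst": "granite-4-h-small",
    "threat_hunter": "qwen3-8b",
    "malware_analyst": "qwen3-8b",
    "forensic_analyst": "gemma-4-12b",
    "threat_intel": "granite-4-h-small",
    "incident_responder": "gemma-4-12b",
    "vulnerability_manager": "granite-4-h-small",
    "compliance_analyst": "granite-4-h-small",
    "network_security": "qwen3-8b",
    "endpoint_security": "granite-4-h-small",
    "cloud_security": "granite-4-h-small",
    "cps_security": "granite-4-h-small",
    "web_security": "qwen3-8b",
    "ueba_analyst": "granite-4-h-small",
    "report_generator": "gemma-4-12b",
}
```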
Deployment Configurations
Option A: Single-Model (Simplest)
Use one model for all agents. Best for getting started or resource-constrained environments.
Pros: Simple deployment, one model to serve, consistent behavior.
Cons: Not optimal for code-generation or reasoning-heavy agents.
Configuration:
# .env
LLM_BACKEND=ollama
OLLAMA_MODEL=granite-soc:latest
GRANITE_USE_FINETUNED=true
GRANITE_USE_PER_AGENT_MODELS=true # Uses per-agent LoRA adapters
VRAM for serving: ~9 GB (single GGUF q8_0)
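To illustrate what GRANITE_USE_PER_AGENT_MODELS implies, here is a sketch of single-base, multi-adapter serving using Hugging Face `peft`; AuroraSOC's actual loading code may differ, and with the Ollama backend the equivalent is attaching each adapter through a Modelfile `ADAPTER` directive or merging it into a per-agent GGUF.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One resident base model, many LoRA adapters: load once, swap per request.
base = AutoModelForCausalLM.from_pretrained("unsloth/granite-4.0-h-small")
model = PeftModel.from_pretrained(base, "training/output/orchestrator",
                                  adapter_name="orchestrator")
model.load_adapter("training/output/security_analyst", adapter_name="security_analyst")

model.set_adapter("security_analyst")  # base weights stay in VRAM; only the adapter switches
```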
Option B: Two-Model (Balanced)
Use Granite 4 for tool-calling agents and Qwen 3 for code-generation agents.
VRAM for serving: ~18 GB (two GGUF q8_0 models) — fits on an RTX 3090
Training cost: ~$7-8 on RunPod (RTX 3090, ~10 hours total)
Option C: Three-Model (Maximum Quality)
Use the best model for each task category: Granite 4 for tool-calling agents, Qwen 3 for code-generation agents, and Gemma 4 for reasoning- and writing-heavy agents.
VRAM for serving: ~31 GB (three GGUF q8_0 models) — requires A100 40GB or 2× RTX 3090
Training cost: ~$15-20 on RunPod (mixed RTX 3090 + A100 time)
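The serving VRAM figures are easy to sanity-check: q8_0 stores roughly 8.5 bits per weight (8-bit quants plus per-block scales), plus a small per-model allowance for KV cache and runtime overhead (the 0.5 GB figure below is an assumption):

```python
# Back-of-envelope check on the serving VRAM estimates above.
BYTES_PER_WEIGHT_Q8_0 = 8.5 / 8  # 8-bit values plus per-block fp16 scales
OVERHEAD_GB = 0.5                # per-model KV cache / runtime allowance (assumption)

def serving_gb(params_billions: float) -> float:
    return params_billions * BYTES_PER_WEIGHT_Q8_0 + OVERHEAD_GB

print(f"Option A: {serving_gb(8):.1f} GB")                           # ~9 GB
print(f"Option B: {serving_gb(8) + serving_gb(8):.1f} GB")           # ~18 GB
print(f"Option C: {sum(serving_gb(p) for p in (8, 8, 12)):.1f} GB")  # ~31 GB
```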
Training All Agents (Step-by-Step)
Option A: Single-Model Training
# 1. Prepare datasets
make train-data
# 2. Train generic model
make train
# 3. Train all 16 agent specialists
python training/scripts/train_all_agents.py
# 4. Evaluate
make train-eval
# 5. Import to Ollama
python training/scripts/serve_model.py ollama-all \
--output-dir training/output
# 6. Enable in AuroraSOC
make enable-finetuned
Option B/C: Multi-Model Training
# 1. Prepare datasets
make train-data
# 2. Train Granite 4 agents
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml \
--agent orchestrator
# Repeat for: security_analyst, threat_intel, vulnerability_manager,
# compliance_analyst, endpoint_security, cloud_security,
# cps_security, ueba_analyst
# 3. Train Qwen 3 agents (modify config to use Qwen 3 base)
python training/scripts/finetune_granite.py \
--config training/configs/qwen_soc_finetune.yaml \
--agent malware_analyst
# Repeat for: threat_hunter, network_security, web_security
# 4. Train Gemma 4 agents (if using Option C)
python training/scripts/finetune_granite.py \
--config training/configs/gemma_soc_finetune.yaml \
--agent forensic_analyst
# Repeat for: incident_responder, report_generator
# 5. Import all to Ollama (each with correct chat template)
python training/scripts/serve_model.py ollama-all \
--output-dir training/output \
--multi-model # Auto-detects model family for Modelfile template
# 6. Enable per-agent models
export GRANITE_USE_FINETUNED=true
export GRANITE_USE_PER_AGENT_MODELS=true
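What the `--multi-model` flag implies is family-aware template selection when writing each Modelfile. A sketch of the idea; the function and template names are illustrative, not the script's actual internals:

```python
# Illustrative family detection for Modelfile chat templates.
def template_for(model_dir: str) -> str:
    name = model_dir.lower()
    if "granite" in name:
        return "granite"  # Granite chat template
    if "qwen" in name:
        return "chatml"   # Qwen 3 uses a ChatML-style template
    if "gemma" in name:
        return "gemma"    # Gemma turn-based template
    raise ValueError(f"unrecognized model family: {model_dir}")
```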
Performance Expectations by Configuration
| Configuration | Avg. Score | Tool Calling | Code Gen | Reasoning | Serving VRAM | Training Cost (RunPod) |
|---|---|---|---|---|---|---|
| Base Granite 4 (no fine-tuning) | 0.45 | 0.55 | 0.40 | 0.42 | 9 GB | $0 |
| Single-model fine-tuned (Granite 4) | 0.82 | 0.88 | 0.78 | 0.80 | 9 GB | $5 (one-time) |
| Two-model (Granite 4 + Qwen 3) | 0.84 | 0.88 | 0.82 | 0.80 | 18 GB | $8 (one-time) |
| Three-model (+ Gemma 4) | 0.86 | 0.88 | 0.82 | 0.84 | 31 GB | $18 (one-time) |
The jump from no fine-tuning (0.45) to single-model fine-tuning (0.82) is enormous — nearly 2× improvement. The gains from two-model (0.84) to three-model (0.86) are incremental.
Start with single-model Granite 4 fine-tuning. The 0.82 average score represents a massive uplift from the 0.45 base. Only move to multi-model if you need the extra 2-4% for specific agents and have the infrastructure to serve multiple models.
Next Steps
- Fine-Tuning Methods — understand how QLoRA, DPO, ORPO work
- Model Comparison — deep-dive into model architectures and benchmarks
- Cloud Training Guide — train on RunPod, Lambda Labs, or Colab
- Per-Agent Specialists — detailed per-agent training guide