Dataset Preparation
The quality of fine-tuned models depends entirely on the quality of training data. AuroraSOC includes a comprehensive data preparation pipeline that downloads public cybersecurity datasets, transforms them into supervised fine-tuning format, and creates per-agent domain splits.
Why This Step Matters
Fine-tuning a language model requires thousands of (prompt, response) pairs that demonstrate the exact behavior you want. Generic cybersecurity Q&A data won't teach a model to produce structured IOC extractions or map alerts to MITRE ATT&CK techniques. The preparation pipeline solves this by:
- Downloading authoritative sources — MITRE ATT&CK, MITRE CAR, Sigma rules, Atomic Red Team
- Transforming raw data into conversations — each data point becomes a system/user/assistant chat
- Adding SOC-specific system prompts — each training example includes the agent's persona prompt
- Splitting by domain — separate JSONL files for each agent's specialty
- Annotating metadata — domain, source, difficulty tags for filtering and analysis
Data Sources
The pipeline downloads from these public, freely-licensed sources:
| Source | What It Contains | License | Why It's Used |
|---|---|---|---|
| MITRE ATT&CK | Adversary techniques, tactics, procedures | Apache 2.0 | Teaches ATT&CK mapping, technique identification |
| MITRE CAR | Cyber Analytics Repository — detection rules | Apache 2.0 | Teaches detection engineering, analytics creation |
| Sigma Rules | Generic SIEM detection rules | LGPL-2.1 | Teaches log analysis, alert creation patterns |
| Atomic Red Team | Adversary emulation tests per technique | MIT | Teaches attack simulation, technique details |
| NVD CVE Feed | Vulnerability database (recent CVEs) | Public Domain | Teaches vulnerability analysis, CVSS scoring |
| OWASP Top 10 | Web application security risks | CC-BY-SA-4.0 | Teaches web security analysis |
Running the Pipeline
Quick Start
```bash
# Using Make (recommended)
make train-data

# Or directly
python training/scripts/prepare_datasets.py
```
Command-Line Options
```text
python training/scripts/prepare_datasets.py [OPTIONS]

Options:
  --skip-download    Skip downloading (use cached raw files)
  --skip-domains     Skip per-agent domain splitting
  --output-dir DIR   Output directory (default: training/data)
  --min-samples N    Minimum samples per domain before augmentation
```
What Happens
The pipeline runs in four phases:
Phase 1: Download Raw Sources
Downloads all datasets to training/data/raw/:
```text
training/data/raw/
├── mitre_attack/
│   └── enterprise-attack.json       # Full ATT&CK knowledge base
├── mitre_car/
│   └── car-analytics.json           # CAR detection analytics
├── sigma_rules/
│   └── sigma-rules-master.zip       # 3000+ Sigma rules
├── atomic_red_team/
│   └── atomic-red-team-master.zip   # 800+ atomic tests
├── nvd_cve_recent/
│   └── nvd-recent.json              # Recent CVE entries
└── owasp_top10/
    └── owasp-top10.md               # OWASP Top 10 2021
```
If files already exist, they're skipped (idempotent operation).
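The skip-if-present behavior can be sketched as a small helper. This is an illustrative implementation of the idempotency check, not the pipeline's actual download code; the function name is hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve  # stdlib; the real pipeline may use another HTTP client

def fetch_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists. Returns True if downloaded."""
    if dest.exists():
        return False  # cached copy found; skip the network call entirely
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, dest)  # network call happens only on a cache miss
    return True
```

Because the check happens before any network I/O, re-running the pipeline with a warm cache completes in seconds.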
Phase 2: Extract & Transform
Each source is parsed and converted into structured training samples:
- MITRE ATT&CK techniques → Questions about identifying, detecting, and mitigating each technique
- MITRE CAR analytics → Detection engineering scenarios (given this data source, write a detection rule)
- Sigma rules → Log analysis tasks (given this log, does this Sigma rule trigger?)
- Atomic Red Team tests → Attack simulation Q&A (what would this test generate? how to detect it?)
- CVE entries → Vulnerability assessment tasks (analyze severity, impact, remediation)
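As an illustration of the transform step, here is a minimal sketch of turning one ATT&CK technique into a supervised sample. The flat field names are simplified for clarity; the real STIX objects in enterprise-attack.json carry the T-number inside `external_references`.

```python
def technique_to_sample(tech: dict) -> dict:
    """Convert one (simplified) ATT&CK technique record into a training sample."""
    prompt = (f"Analyze this technique: {tech['id']} ({tech['name']}). "
              f"What artifacts does it leave, and how would you detect it?")
    response = (f"## Analysis: {tech['id']} — {tech['name']}\n\n"
                f"**Tactic:** {tech['tactic']}\n\n{tech['description']}")
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
        "domain": "security_analysis",
        "source": "mitre_attack",
        "difficulty": "intermediate",
    }

sample = technique_to_sample({
    "id": "T1059.001", "name": "PowerShell", "tactic": "Execution",
    "description": "Adversaries may abuse PowerShell commands and scripts...",
})
```

The system message (the agent persona prompt) is prepended in the next phase, so the transform itself only needs to produce the user/assistant pair plus metadata.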
Phase 3: Format as Chat
Each sample is converted to the Granite 4 chat format:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are the AuroraSOC Security Analyst. Analyze security alerts, extract IOCs..."
    },
    {
      "role": "user",
      "content": "Analyze this technique: T1059.001 (PowerShell). What artifacts..."
    },
    {
      "role": "assistant",
      "content": "## Analysis: T1059.001 — PowerShell Command and Scripting Interpreter\n\n**Tactic:** Execution\n**Severity:** High\n\n### Key Artifacts:\n- Event ID 4104 (Script Block Logging)..."
    }
  ],
  "domain": "security_analysis",
  "source": "mitre_attack",
  "difficulty": "intermediate"
}
```
The domain field maps to AuroraSOC agent names (used for per-agent splitting):
| Domain | Mapped Agents |
|---|---|
| security_analysis | SecurityAnalyst, EndpointSecurity |
| threat_hunting | ThreatHunter, UEBAAnalyst |
| malware_analysis | MalwareAnalyst |
| incident_response | IncidentResponder |
| network_security | NetworkSecurity |
| cps_security | CPSSecurity |
| threat_intelligence | ThreatIntel |
| digital_forensics | ForensicAnalyst |
| web_security | WebSecurity |
| cloud_security | CloudSecurity |
| vulnerability_management | VulnerabilityManager |
| orchestration | Orchestrator |
Phase 4: Split by Domain
The pipeline creates domain-specific JSONL files by filtering on the domain field:
```text
training/data/
├── soc_train.jsonl                    # All samples (for generic model)
├── soc_eval.jsonl                     # Held-out evaluation samples
└── domain/
    ├── security_analysis.jsonl        # Alert analysis & IOC extraction
    ├── threat_hunting.jsonl           # Hunting hypotheses & behavioral analysis
    ├── malware_analysis.jsonl         # YARA, sandbox, behavioral signatures
    ├── incident_response.jsonl        # Playbooks, containment, recovery
    ├── network_security.jsonl         # Flow analysis, DDoS, DNS tunneling
    ├── cps_security.jsonl             # ICS/OT, Modbus, IEC 62443
    ├── threat_intelligence.jsonl      # APT tracking, STIX/TAXII
    ├── digital_forensics.jsonl        # Disk/memory/network forensics
    ├── web_security.jsonl             # OWASP, SQLi, XSS
    ├── cloud_security.jsonl           # AWS/Azure/GCP misconfiguration
    ├── vulnerability_management.jsonl # CVE analysis, CVSS
    └── orchestration.jsonl            # Multi-agent routing scenarios
```
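The split itself is a straightforward bucket-and-write over the `domain` field. A minimal sketch (the function name is hypothetical; the actual pipeline code may differ):

```python
import json
from collections import defaultdict
from pathlib import Path

def split_by_domain(train_path: str, out_dir: str) -> dict:
    """Write one JSONL file per domain, filtering on each sample's 'domain' field.

    Returns a {domain: sample_count} summary.
    """
    buckets = defaultdict(list)
    with open(train_path) as f:
        for line in f:
            sample = json.loads(line)
            # Samples with no domain tag fall into an 'unknown' bucket
            buckets[sample.get("domain", "unknown")].append(line)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for domain, lines in buckets.items():
        (out / f"{domain}.jsonl").write_text("".join(lines))
    return {d: len(v) for d, v in buckets.items()}
```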
Adding Custom Training Data
You can supplement the public datasets with your own SOC data. Create a JSONL file where each line follows this format:
```json
{
  "messages": [
    {"role": "system", "content": "Your agent system prompt here"},
    {"role": "user", "content": "The user's security question or alert"},
    {"role": "assistant", "content": "The ideal response from the agent"}
  ],
  "domain": "security_analysis",
  "source": "custom",
  "difficulty": "intermediate"
}
```
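Rather than hand-editing JSONL, it is safer to serialize each record with `json.dumps`, which guarantees one well-formed object per line. A small helper sketch (the function name is hypothetical):

```python
import json

def append_samples(path: str, samples: list) -> None:
    """Append training samples to a JSONL file, one JSON object per line.

    json.dumps never emits raw newlines inside a record, so the output
    stays valid JSON Lines even when message content contains '\n'.
    """
    with open(path, "a", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```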
Guidelines for Custom Data
- Use the actual agent system prompts from `aurorasoc/agents/prompts.py`; this ensures the model learns to respond in the context of the correct agent persona.
- Include diverse scenarios; don't just train on one type of alert. Cover:
  - True positives (real attacks)
  - False positives (benign activity that looks suspicious)
  - Edge cases (ambiguous alerts that need context)
- Format assistant responses consistently: use structured output with Markdown headers, bullet lists, and code blocks. The model will learn this formatting.
- Tag difficulty levels: `basic`, `intermediate`, `advanced`. This helps with curriculum learning if you train in stages.
- Include MITRE ATT&CK mappings where relevant; the agents are expected to map findings to techniques.
Merging Custom Data
Append your custom data to the main training file:

```bash
cat your_custom_data.jsonl >> training/data/soc_train.jsonl
```

Or for domain-specific data:

```bash
cat your_custom_alerts.jsonl >> training/data/domain/security_analysis.jsonl
```

Then re-run training; the prepare step doesn't need to be repeated.
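Note that a blind append can introduce exact duplicates if you merge the same file twice. If that matters for your workflow, a deduplicating merge is a few lines of Python (a sketch; the function name is hypothetical):

```python
def merge_dedup(custom_path: str, train_path: str) -> int:
    """Append lines from custom_path to train_path, skipping exact duplicate lines.

    Returns the number of lines actually added.
    """
    with open(train_path) as f:
        seen = set(f)  # set of existing lines, trailing newlines included
    added = 0
    with open(custom_path) as src, open(train_path, "a") as dst:
        for line in src:
            if line not in seen:
                dst.write(line)
                seen.add(line)  # also dedupe within custom_path itself
                added += 1
    return added
```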
Data Validation
To verify your training data is correctly formatted:
```bash
# Count samples
wc -l training/data/soc_train.jsonl

# Validate JSON format
python -c "
import json
with open('training/data/soc_train.jsonl') as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
            assert 'messages' in d, f'Line {i}: missing messages'
            assert len(d['messages']) >= 2, f'Line {i}: need at least 2 messages'
        except Exception as e:
            print(f'Error on line {i}: {e}')
            break
    else:
        print(f'All {i + 1} samples valid')
"

# Check domain distribution
python -c "
import json
from collections import Counter
domains = Counter()
with open('training/data/soc_train.jsonl') as f:
    for line in f:
        d = json.loads(line)
        domains[d.get('domain', 'unknown')] += 1
for domain, count in sorted(domains.items(), key=lambda x: -x[1]):
    print(f'  {domain}: {count}')
"
```
Next Steps
With your training data prepared:
- Local GPU Training — Train on your own hardware
- Docker Training — Reproducible containerized training
- Google Colab Training — Free cloud GPU