Dataset Preparation

The quality of fine-tuned models depends entirely on the quality of training data. AuroraSOC includes a comprehensive data preparation pipeline that downloads public cybersecurity datasets, transforms them into supervised fine-tuning format, and creates per-agent domain splits.

Why This Step Matters

Fine-tuning a language model requires thousands of (prompt, response) pairs that demonstrate the exact behavior you want. Generic cybersecurity Q&A data won't teach a model to produce structured IOC extractions or map alerts to MITRE ATT&CK techniques. The preparation pipeline solves this by:

  1. Downloading authoritative sources — MITRE ATT&CK, MITRE CAR, Sigma rules, Atomic Red Team
  2. Transforming raw data into conversations — each data point becomes a system/user/assistant chat
  3. Adding SOC-specific system prompts — each training example includes the agent's persona prompt
  4. Splitting by domain — separate JSONL files for each agent's specialty
  5. Annotating metadata — domain, source, difficulty tags for filtering and analysis

Data Sources

The pipeline downloads from these public, freely-licensed sources:

| Source | What It Contains | License | Why It's Used |
| --- | --- | --- | --- |
| MITRE ATT&CK | Adversary techniques, tactics, procedures | Apache 2.0 | Teaches ATT&CK mapping, technique identification |
| MITRE CAR | Cyber Analytics Repository — detection rules | Apache 2.0 | Teaches detection engineering, analytics creation |
| Sigma Rules | Generic SIEM detection rules | LGPL-2.1 | Teaches log analysis, alert creation patterns |
| Atomic Red Team | Adversary emulation tests per technique | MIT | Teaches attack simulation, technique details |
| NVD CVE Feed | Vulnerability database (recent CVEs) | Public Domain | Teaches vulnerability analysis, CVSS scoring |
| OWASP Top 10 | Web application security risks | CC-BY-SA-4.0 | Teaches web security analysis |

Running the Pipeline

Quick Start

# Using Make (recommended)
make train-data

# Or directly
python training/scripts/prepare_datasets.py

Command-Line Options

python training/scripts/prepare_datasets.py [OPTIONS]

Options:
--skip-download     Skip downloading (use cached raw files)
--skip-domains      Skip per-agent domain splitting
--output-dir DIR    Output directory (default: training/data)
--min-samples N     Minimum samples per domain before augmentation

What Happens

The pipeline runs in four phases:

Phase 1: Download Raw Sources

Downloads all datasets to training/data/raw/:

training/data/raw/
├── mitre_attack/
│   └── enterprise-attack.json        # Full ATT&CK knowledge base
├── mitre_car/
│   └── car-analytics.json            # CAR detection analytics
├── sigma_rules/
│   └── sigma-rules-master.zip        # 3000+ Sigma rules
├── atomic_red_team/
│   └── atomic-red-team-master.zip    # 800+ atomic tests
├── nvd_cve_recent/
│   └── nvd-recent.json               # Recent CVE entries
└── owasp_top10/
    └── owasp-top10.md                # OWASP Top 10 2021

If files already exist, they're skipped (idempotent operation).
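The skip-if-present check can be sketched as below. The helper name `download_if_missing` is hypothetical, not the actual function in `prepare_datasets.py`:

```python
# Sketch of an idempotent download step: a cached file is never re-fetched,
# so re-running the pipeline is safe. Names here are illustrative.
from pathlib import Path
from urllib.request import urlretrieve

def download_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the cached file was kept.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # cached: skip, which makes repeated runs idempotent
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, dest)
    return True
```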

Phase 2: Extract & Transform

Each source is parsed and converted into structured training samples:

  • MITRE ATT&CK techniques → Questions about identifying, detecting, and mitigating each technique
  • MITRE CAR analytics → Detection engineering scenarios (given this data source, write a detection rule)
  • Sigma rules → Log analysis tasks (given this log, does this Sigma rule trigger?)
  • Atomic Red Team tests → Attack simulation Q&A (what would this test generate? how to detect it?)
  • CVE entries → Vulnerability assessment tasks (analyze severity, impact, remediation)
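A transform of the first kind can be sketched as follows. The input field names are a simplified assumption, not the real STIX schema the pipeline parses:

```python
# Illustrative Phase 2 transform: turn one simplified ATT&CK technique
# record into a chat-format training sample.
SYSTEM_PROMPT = ("You are the AuroraSOC Security Analyst. "
                 "Analyze security alerts, extract IOCs...")

def technique_to_sample(tech: dict) -> dict:
    question = (
        f"Analyze this technique: {tech['id']} ({tech['name']}). "
        "What artifacts would it leave, and how is it detected?"
    )
    answer = (
        f"## Analysis: {tech['id']} — {tech['name']}\n\n"
        f"**Tactic:** {tech['tactic']}\n\n"
        f"### Detection:\n- {tech['detection']}"
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        "domain": "security_analysis",
        "source": "mitre_attack",
        "difficulty": "intermediate",
    }

sample = technique_to_sample({
    "id": "T1059.001", "name": "PowerShell", "tactic": "Execution",
    "detection": "Enable Script Block Logging (Event ID 4104)",
})
```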

Phase 3: Format as Chat

Each sample is converted to the Granite 4 chat format:

{
  "messages": [
    {
      "role": "system",
      "content": "You are the AuroraSOC Security Analyst. Analyze security alerts, extract IOCs..."
    },
    {
      "role": "user",
      "content": "Analyze this technique: T1059.001 (PowerShell). What artifacts..."
    },
    {
      "role": "assistant",
      "content": "## Analysis: T1059.001 — PowerShell Command and Scripting Interpreter\n\n**Tactic:** Execution\n**Severity:** High\n\n### Key Artifacts:\n- Event ID 4104 (Script Block Logging)..."
    }
  ],
  "domain": "security_analysis",
  "source": "mitre_attack",
  "difficulty": "intermediate"
}
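Samples like this are stored as JSON Lines: one JSON object per line, no enclosing array. A minimal write/read round-trip (paths here are temporary, for illustration):

```python
# Write samples as JSONL and read them back, one object per line.
import json
import tempfile
from pathlib import Path

samples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}],
     "domain": "security_analysis", "source": "mitre_attack",
     "difficulty": "basic"},
]

out = Path(tempfile.mkdtemp()) / "soc_train.jsonl"
with out.open("w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")  # exactly one object per line

loaded = [json.loads(line) for line in out.open()]
```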

The domain field maps to AuroraSOC agent names (used for per-agent splitting):

| Domain | Mapped Agents |
| --- | --- |
| security_analysis | SecurityAnalyst, EndpointSecurity |
| threat_hunting | ThreatHunter, UEBAAnalyst |
| malware_analysis | MalwareAnalyst |
| incident_response | IncidentResponder |
| network_security | NetworkSecurity |
| cps_security | CPSSecurity |
| threat_intelligence | ThreatIntel |
| digital_forensics | ForensicAnalyst |
| web_security | WebSecurity |
| cloud_security | CloudSecurity |
| vulnerability_management | VulnerabilityManager |
| orchestration | Orchestrator |
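The same mapping can be written as a plain dict (a sketch; the real pipeline may represent it differently):

```python
# Domain-to-agent mapping from the table above.
DOMAIN_AGENTS = {
    "security_analysis": ["SecurityAnalyst", "EndpointSecurity"],
    "threat_hunting": ["ThreatHunter", "UEBAAnalyst"],
    "malware_analysis": ["MalwareAnalyst"],
    "incident_response": ["IncidentResponder"],
    "network_security": ["NetworkSecurity"],
    "cps_security": ["CPSSecurity"],
    "threat_intelligence": ["ThreatIntel"],
    "digital_forensics": ["ForensicAnalyst"],
    "web_security": ["WebSecurity"],
    "cloud_security": ["CloudSecurity"],
    "vulnerability_management": ["VulnerabilityManager"],
    "orchestration": ["Orchestrator"],
}

def agents_for(domain: str) -> list[str]:
    """Return the agents trained on a domain's split (empty if unknown)."""
    return DOMAIN_AGENTS.get(domain, [])
```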

Phase 4: Split by Domain

The pipeline creates domain-specific JSONL files by filtering on the domain field:

training/data/
├── soc_train.jsonl                      # All samples (for generic model)
├── soc_eval.jsonl                       # Held-out evaluation samples
└── domain/
    ├── security_analysis.jsonl          # Alert analysis & IOC extraction
    ├── threat_hunting.jsonl             # Hunting hypotheses & behavioral analysis
    ├── malware_analysis.jsonl           # YARA, sandbox, behavioral signatures
    ├── incident_response.jsonl          # Playbooks, containment, recovery
    ├── network_security.jsonl           # Flow analysis, DDoS, DNS tunneling
    ├── cps_security.jsonl               # ICS/OT, Modbus, IEC 62443
    ├── threat_intelligence.jsonl        # APT tracking, STIX/TAXII
    ├── digital_forensics.jsonl          # Disk/memory/network forensics
    ├── web_security.jsonl               # OWASP, SQLi, XSS
    ├── cloud_security.jsonl             # AWS/Azure/GCP misconfiguration
    ├── vulnerability_management.jsonl   # CVE analysis, CVSS
    └── orchestration.jsonl              # Multi-agent routing scenarios
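The filtering step itself is simple. A minimal sketch (it omits the eval split and other details of the real script):

```python
# Sketch of Phase 4: stream soc_train.jsonl once and route each line
# to its per-domain JSONL file based on the "domain" field.
import json
from collections import defaultdict
from pathlib import Path

def split_by_domain(train_path: Path, domain_dir: Path) -> dict:
    """Write one JSONL file per domain; return per-domain sample counts."""
    domain_dir.mkdir(parents=True, exist_ok=True)
    counts = defaultdict(int)
    handles = {}
    try:
        with open(train_path) as f:
            for line in f:
                sample = json.loads(line)
                domain = sample.get("domain", "unknown")
                if domain not in handles:
                    handles[domain] = open(domain_dir / f"{domain}.jsonl", "w")
                handles[domain].write(line)
                counts[domain] += 1
    finally:
        for h in handles.values():
            h.close()
    return dict(counts)
```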

Adding Custom Training Data

You can supplement the public datasets with your own SOC data. Create a JSONL file where each line follows this format:

{
  "messages": [
    {"role": "system", "content": "Your agent system prompt here"},
    {"role": "user", "content": "The user's security question or alert"},
    {"role": "assistant", "content": "The ideal response from the agent"}
  ],
  "domain": "security_analysis",
  "source": "custom",
  "difficulty": "intermediate"
}
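A small helper can keep hand-written samples in the right shape. `make_sample` is a hypothetical convenience function, not part of AuroraSOC:

```python
# Build one correctly-shaped custom JSONL line.
import json

def make_sample(system: str, user: str, assistant: str,
                domain: str, difficulty: str = "intermediate") -> str:
    """Return one JSONL line in the format shown above."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ],
        "domain": domain,
        "source": "custom",
        "difficulty": difficulty,
    }
    return json.dumps(record)

line = make_sample("Your agent system prompt here",
                   "The user's security question or alert",
                   "The ideal response from the agent",
                   domain="security_analysis")
```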

Guidelines for Custom Data

  1. Use the actual agent system prompts from aurorasoc/agents/prompts.py — this ensures the model learns to respond in the context of the correct agent persona.

  2. Include diverse scenarios — don't just train on one type of alert. Cover:

    • True positives (real attacks)
    • False positives (benign activity that looks suspicious)
    • Edge cases (ambiguous alerts that need context)
  3. Format assistant responses consistently — use structured output with Markdown headers, bullet lists, and code blocks. The model will learn this formatting.

  4. Tag difficulty levels: basic, intermediate, advanced. This helps with curriculum learning if you train in stages.

  5. Include MITRE ATT&CK mappings where relevant — the agents are expected to map findings to techniques.
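If you do tag difficulty, staged training can filter on it. A minimal curriculum-filter sketch (stage semantics are an assumption, not prescribed by the pipeline):

```python
# Keep samples at or below a given difficulty stage, for staged training.
import json

CURRICULUM = ["basic", "intermediate", "advanced"]

def stage_samples(lines, stage):
    """Return parsed samples whose difficulty is <= the given stage."""
    cutoff = CURRICULUM.index(stage)
    kept = []
    for line in lines:
        d = json.loads(line)
        if CURRICULUM.index(d.get("difficulty", "intermediate")) <= cutoff:
            kept.append(d)
    return kept
```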

Merging Custom Data

Append your custom data to the main training file:

cat your_custom_data.jsonl >> training/data/soc_train.jsonl

Or for domain-specific data:

cat your_custom_alerts.jsonl >> training/data/domain/security_analysis.jsonl

Then re-run training — the prepare step doesn't need to be repeated.
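After appending, it is worth checking that the merge did not introduce exact duplicates, which can skew training. A simple sketch:

```python
# Drop repeated JSONL lines, keeping the first occurrence and the order.
def dedupe_lines(lines):
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept
```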

Data Validation

To verify your training data is correctly formatted:

# Count samples
wc -l training/data/soc_train.jsonl

# Validate JSON format
python -c "
import json
with open('training/data/soc_train.jsonl') as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
            assert 'messages' in d, f'Line {i}: missing messages'
            assert len(d['messages']) >= 2, f'Line {i}: need at least 2 messages'
        except Exception as e:
            print(f'Error on line {i}: {e}')
            break
    else:
        print(f'All {i+1} samples valid')
"

# Check domain distribution
python -c "
import json
from collections import Counter
domains = Counter()
with open('training/data/soc_train.jsonl') as f:
    for line in f:
        d = json.loads(line)
        domains[d.get('domain', 'unknown')] += 1
for domain, count in sorted(domains.items(), key=lambda x: -x[1]):
    print(f'  {domain}: {count}')
"

Next Steps

With your training data prepared:

  1. Local GPU Training — Train on your own hardware
  2. Docker Training — Reproducible containerized training
  3. Google Colab Training — Free cloud GPU