Training Pipeline — Complete Guide for Beginners

This guide is written for someone who has never trained a machine learning model before. If you follow it from top to bottom, you will go from raw security data to a fine-tuned Granite model that AuroraSOC can serve.

Why AuroraSOC Fine-Tunes Its Own Models

A general model can know many security facts, but SOC work requires something stricter: operational reasoning under pressure. "Knowing about" security means recognizing terms like CVE, MITRE ATT&CK, or ransomware. "Thinking like" a SOC analyst means rapidly connecting those facts to affected assets, CVSS severity, exploitability, blast radius, containment steps, and remediation priority in one coherent response. AuroraSOC fine-tunes for that behavior.

Training from scratch is not realistic for most teams. Full pre-training requires massive infrastructure, typically thousands of GPUs and multi-million-dollar budgets. AuroraSOC instead uses LoRA fine-tuning on an 8B model, which can run on one strong consumer GPU in hours, not months.

Key Concepts (Plain Language First)

LoRA (Low-Rank Adaptation)

LoRA does not retrain all model parameters. It keeps the original weights frozen and trains lightweight adapter layers that encode the domain specialization. In practice, the trainable adapters often amount to well under 1% of the full parameter count.

In AuroraSOC config:

  • r=64 is the adapter rank (capacity).
  • alpha=64 is scaling for adapter updates.

Higher rank means more representational power, but more VRAM and compute cost.
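To make the cost of rank concrete, here is a back-of-the-envelope count of adapter parameters. The hidden size, layer count, and targeted projections below are illustrative assumptions, not the exact Granite 4 architecture:

```python
# Back-of-the-envelope LoRA parameter count. Layer shapes are illustrative
# assumptions, not the exact Granite architecture.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two low-rank factors per target: A (r x d_in) and B (d_out x r).
    return r * d_in + d_out * r

hidden = 4096          # assumed hidden size
layers = 32            # assumed transformer layer count
targets_per_layer = 4  # e.g. q/k/v/o projections (assumption)
r = 64                 # adapter rank, as in the AuroraSOC config

adapter = layers * targets_per_layer * lora_params(hidden, hidden, r)
full = 8_000_000_000   # approximate 8B parameter count
print(f"adapter params: {adapter:,} ({adapter / full:.2%} of the full model)")
# → adapter params: 67,108,864 (0.84% of the full model)
```

Doubling r doubles the adapter size (and its VRAM footprint) linearly, which is why rank is the first knob to turn when memory is tight.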

Unsloth

Unsloth is an open-source training stack that optimizes kernels and memory behavior for efficient fine-tuning. A practical rule of thumb is around 2x faster training and much lower VRAM use than a naive setup. That difference is why an RTX 3090 can be practical for Granite-domain LoRA where older workflows might demand much larger accelerators.

Response-only masking

During instruction tuning, each sample contains both prompt and answer text. Response-only masking computes the training loss only on the assistant's answer tokens, not on the user's prompt tokens. That focuses the learning signal on exactly what the model must generate in production.

Think of it as grading only the student's answer, not the exam question.
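A minimal sketch of the idea, assuming samples are already tokenized and the index where the assistant's answer begins is known. The -100 ignore index is the convention most training stacks use for tokens excluded from the loss:

```python
# Minimal sketch of response-only masking on an already-tokenized sample.
# Labels set to -100 are ignored by the cross-entropy loss in most stacks.
IGNORE_INDEX = -100

def mask_prompt(input_ids, response_start):
    """Copy input_ids into labels, but ignore every prompt token."""
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]

tokens = [101, 2054, 2003, 1037, 4773, 102, 7592, 2088, 102]  # toy token ids
labels = mask_prompt(tokens, response_start=6)
print(labels)  # → [-100, -100, -100, -100, -100, -100, 7592, 2088, 102]
```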

Hardware Requirements

Local hardware

Component     Minimum                  Recommended
GPU           RTX 3090 (24 GB VRAM)    A100 (80 GB)
System RAM    32 GB                    64 GB
Storage       100 GB SSD               500 GB NVMe
CUDA          12.1+                    12.4+

Cloud options

  • RunPod: RTX 3090 around $0.39/hour. Specialist training in roughly 3-4 hours is about $1.50 per run.
  • Google Colab Pro: A100 runtime is preferred. Free T4 (16 GB) is borderline for 8B and often requires aggressive memory settings.
  • Notebook path for guided runs: training/notebooks/AuroraSOC_Granite4_Finetune.ipynb.

Step-by-Step Training Walkthrough

Each step includes what it does, command(s), and realistic expected output.

Step 0: Verify your GPU

What this does: Confirms your machine can see a CUDA-capable device. If this step fails, training will fail later.

Commands:

nvidia-smi
python3 -c "import torch; print(torch.cuda.get_device_name(0))"

Expected output:

  • nvidia-smi shows your GPU and driver.
  • Python prints a GPU name such as NVIDIA GeForce RTX 3090.

Step 1: Install training dependencies

What this does: Installs the libraries used by data preparation, fine-tuning, evaluation, and export.

Command:

pip install -e ".[training]"

Expected output:

  • Dependencies install without hard errors.
  • Key package families include Unsloth, Transformers, Datasets, PEFT, TRL, and bitsandbytes.

Step 2: Prepare datasets

What this does: Builds instruction-following training data from public security corpora and synthetic SOC scenarios. The prep script fetches sources such as MITRE ATT&CK techniques, Sigma detection rules, and NVD vulnerability records, then converts them into chat-style instruction/response examples.

Command:

python training/scripts/prepare_datasets.py --output-dir training/data

Expected output:

  • Logs for each dataset source pipeline.
  • New files under training/data/ including soc_train.jsonl, soc_eval.jsonl, and domain/*.jsonl.
  • Summary line showing total sample counts.
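For orientation, a hedged sketch of what the conversion step produces for a single record. The field names and message schema below are illustrative assumptions, not the prep script's exact format:

```python
import json

# Illustrative conversion of one source record into a chat-style
# instruction/response example (assumed schema, not the exact prep format).
record = {
    "technique_id": "T1059",
    "name": "Command and Scripting Interpreter",
    "description": "Adversaries may abuse command and script interpreters...",
}

example = {
    "messages": [
        {"role": "user",
         "content": f"Explain MITRE ATT&CK technique {record['technique_id']} "
                    f"({record['name']}) and how a SOC should detect it."},
        {"role": "assistant", "content": record["description"]},
    ]
}

# JSONL convention: one JSON object per line in soc_train.jsonl / soc_eval.jsonl.
line = json.dumps(example)
print(line[:60] + "...")
```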

Step 3: Fine-tune the specialist model

What this does: Loads Granite 4 base weights, applies LoRA adapters, trains on SOC-formatted examples, saves checkpoints, and exports merged FP16 weights for serving.

Command:

python scripts/finetune_granite.py --model-type specialist

Expected output:

  • Training progress logs with decreasing loss.
  • Typical trend: loss often starts around 2.x and falls below 1.0 given sufficient data and epochs.
  • Wall time: often ~2-4 hours on RTX 3090 class hardware.
  • Export path for specialist: training/output/granite-soc-specialist/.
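For orientation, a sketch of the kind of LoRA and trainer settings such a run uses. Apart from r=64 and alpha=64 noted earlier, every value below is an assumption, not AuroraSOC's exact configuration:

```python
# Illustrative training configuration (assumed values, not the exact script).
lora_config = {
    "r": 64,                 # adapter rank (from the AuroraSOC config)
    "lora_alpha": 64,        # adapter scaling (from the AuroraSOC config)
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    "lora_dropout": 0.05,    # assumption
}
training_args = {
    "per_device_train_batch_size": 2,   # assumption
    "gradient_accumulation_steps": 8,   # effective batch of 16 (assumption)
    "learning_rate": 2e-4,              # common LoRA starting point
    "num_train_epochs": 2,              # assumption
    "bf16": True,                       # assumption; fp16 on older GPUs
}
```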

Step 4: Fine-tune the orchestrator model

What this does: Runs a second fine-tuning pass for orchestrator behavior, so coordination and delegation logic has a dedicated model artifact.

Command:

python scripts/finetune_granite.py --model-type orchestrator

Expected output:

  • Similar training logs and checkpoint saves.
  • Export path for orchestrator: training/output/granite-soc-orchestrator/.

Step 5: Evaluate

What this does: Runs benchmark prompts across security domains and scores response quality using expected keyword coverage plus timing metrics.

Command:

python scripts/evaluate_model.py --model vllm:granite-soc-specialist --vllm-base-url http://localhost:8000/v1

Expected output:

  • Per-benchmark PASS/FAIL lines.
  • Summary with pass rate and average latency.
  • Results file (for example training/eval_results.json) written to disk.

How to read scores:

  • Higher pass rate means broader benchmark coverage.
  • Keyword hit rate measures whether core domain concepts were present.
  • Latency metrics help determine serving readiness under operational constraints.
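A minimal sketch of keyword-coverage scoring under these assumptions. The function name and the 0.7 pass threshold are illustrative, not evaluate_model.py's exact logic:

```python
# Illustrative keyword-coverage scoring: what fraction of expected domain
# concepts appear in the model's response (assumed scoring scheme).
def keyword_hit_rate(response: str, expected: list[str]) -> float:
    text = response.lower()
    hits = sum(1 for kw in expected if kw.lower() in text)
    return hits / len(expected) if expected else 0.0

response = "CVE-2024-3094 is a supply-chain backdoor in xz; isolate affected hosts."
expected = ["supply-chain", "xz", "isolate", "patch"]
rate = keyword_hit_rate(response, expected)
print(f"hit rate: {rate:.2f}, PASS: {rate >= 0.7}")  # assumed 0.7 threshold
# → hit rate: 0.75, PASS: True
```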

Step 6: Deploy

What this does: Trains both model variants and starts the serving backend for runtime use.

Commands:

make train-all
make vllm-up
make vllm-status

Expected output:

  • make train-all runs specialist then orchestrator training targets.
  • make vllm-up starts the vLLM container.
  • make vllm-status returns health JSON or a reachable-status message.

Output Files and What They Mean

training/
├── output/
│   ├── granite-soc-specialist/       ← Merged FP16 weights loaded by vLLM
│   │   ├── config.json               ← Model architecture config
│   │   ├── tokenizer.json            ← Text-to-token mapping
│   │   ├── tokenizer_config.json
│   │   └── model-*.safetensors       ← Actual weight tensors
│   └── granite-soc-orchestrator/     ← Same structure, orchestrator weights
├── checkpoints/                      ← LoRA adapter checkpoints (resumable)
│   └── checkpoint-*/
└── data/                             ← Prepared training datasets

Troubleshooting (Symptom → Cause → Exact Fix)

torch.cuda.OutOfMemoryError during training

  • Cause: effective batch/sequence footprint exceeds GPU VRAM.
  • Fix: reduce batch size, increase gradient accumulation, lower max sequence length, and close other GPU workloads.
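This fix works because gradient accumulation trades VRAM for extra steps while keeping the effective batch constant. A quick sketch with assumed numbers:

```python
# Shrinking per-device batch size while holding the effective batch constant
# via gradient accumulation. The specific numbers are illustrative assumptions.
def effective_batch(per_device: int, accumulation: int, gpus: int = 1) -> int:
    return per_device * accumulation * gpus

before = effective_batch(per_device=8, accumulation=2)  # may OOM on 24 GB
after = effective_batch(per_device=2, accumulation=8)   # far less VRAM per step
print(before, after)  # → 16 16
```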

HuggingFace 401 Unauthorized

  • Cause: missing/invalid authentication token when pulling gated assets.
  • Fix: export HF_TOKEN with valid scope, then retry model/data pull.

ModuleNotFoundError: No module named 'unsloth'

  • Cause: training dependencies not installed in the active environment.
  • Fix: install training dependencies in the same Python environment used to run scripts.

Training loss is NaN from step 1

  • Cause: unstable optimizer state, bad mixed-precision configuration, or malformed data sample.
  • Fix: lower learning rate, verify bf16/fp16 compatibility, validate dataset JSONL formatting, and restart from clean checkpoint.

Training completes but vLLM cannot load model

  • Cause: incomplete export artifacts or mismatched model directory.
  • Fix: ensure merged FP16 export exists in training/output/granite-soc-specialist, then point vLLM --model to that directory.

Export validation FAILED in finetune_granite.py

  • Cause: exported weights/tokenizer cannot be reloaded for a sanity forward pass.
  • Fix: rerun export, verify disk space, confirm required files exist (config.json, tokenizer files, safetensors), and retry validation.
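A quick pre-flight check before rerunning validation might look like this. The required-file list mirrors the files named above; the glob pattern is an assumption about the export layout:

```python
from pathlib import Path

# Illustrative pre-flight check for export artifacts (assumed layout).
REQUIRED = ["config.json", "tokenizer.json", "tokenizer_config.json"]

def missing_artifacts(export_dir: str) -> list[str]:
    root = Path(export_dir)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    # Weights may be sharded (model-*.safetensors) or a single file.
    if not list(root.glob("model-*.safetensors")) and not (root / "model.safetensors").exists():
        missing.append("model-*.safetensors")
    return missing

# Report anything missing from the export directory before retrying validation.
print(missing_artifacts("training/output/granite-soc-specialist"))
```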