
Fine-Tuning Methods: A Complete Guide

This guide explains every fine-tuning technique used in modern LLM training — from QLoRA (what AuroraSOC uses by default) to full fine-tuning, DPO, and ORPO. By the end, you'll understand exactly what happens inside the GPU when you run make train, why we chose QLoRA, and when you might want a different method.

What Is Fine-Tuning?

Fine-tuning takes a pre-trained language model (one that already understands language) and teaches it new behavior specific to your task. Think of it like this:

Pre-trained Model (Granite 4, Qwen 3, Gemma 4)
↓ knows general language, coding, reasoning

Fine-Tuning with SOC Data
↓ learns alert triage, MITRE ATT&CK, incident response

Specialized SOC Model
→ can classify alerts, write YARA rules, triage incidents

The pre-trained model has billions of parameters (numbers) that encode its knowledge. Fine-tuning adjusts some or all of these parameters using your domain-specific data.

The Fine-Tuning Landscape

AuroraSOC Default

AuroraSOC uses QLoRA + SFT by default. This combination gives the best quality-per-dollar for security-domain training.


1. Full Fine-Tuning

How It Works

Full fine-tuning updates every single parameter in the model. For an 8B-parameter model like Granite 4 H-Small, that means modifying all 8 billion numbers.

VRAM Requirements

Full fine-tuning needs to store:

  • The model weights (FP16): 2 bytes × parameters
  • Gradients: 2 bytes × parameters
  • Optimizer states (AdamW): 8 bytes × parameters
  • Activations: varies with batch size
| Model | Parameters | Model (FP16) | Gradients | Optimizer | Total VRAM |
|---|---|---|---|---|---|
| Granite 4 Micro | ~1B | 2 GB | 2 GB | 8 GB | ~14 GB |
| Granite 4 H-Tiny | ~2B | 4 GB | 4 GB | 16 GB | ~28 GB |
| Granite 4 H-Small | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Qwen 3 8B | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Gemma 4 12B | ~12B | 24 GB | 24 GB | 96 GB | ~150 GB |
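
As a rough sanity check, here is a small back-of-the-envelope calculator that reproduces the arithmetic behind this table. The per-parameter byte counts come from the list above; the fixed activation overhead is an illustrative assumption, since real activation memory depends on batch size, sequence length, and gradient checkpointing.

```python
def full_finetune_vram_gb(params_billion: float, activation_overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for full fine-tuning with AdamW in FP16/BF16."""
    bytes_per_param = 2 + 2 + 8  # weights (FP16) + gradients (FP16) + AdamW optimizer states
    base_gb = params_billion * bytes_per_param  # 12 bytes/param -> ~12 GB per billion parameters
    return base_gb + activation_overhead_gb     # activation term is a loose placeholder

for name, size_b in [("Granite 4 Micro", 1), ("Granite 4 H-Small", 8), ("Gemma 4 12B", 12)]:
    print(f"{name}: ~{full_finetune_vram_gb(size_b):.0f} GB")
# Granite 4 Micro: ~14 GB, Granite 4 H-Small: ~98 GB, Gemma 4 12B: ~146 GB
```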

When to Use Full Fine-Tuning

| Situation | Recommendation |
|---|---|
| You have 80+ GB VRAM (A100 80GB, H100) | Consider it for maximum quality |
| You need maximum task accuracy | Potential ~1-3% accuracy gain over QLoRA |
| You're training a small model (<1B) | VRAM is manageable |
| You're deploying to production at scale | Minor quality gains compound |

When NOT to Use

  • You have a consumer GPU (RTX 3090/4090 with 24 GB) — won't fit
  • You're iterating quickly — training takes 5-10× longer than QLoRA
  • You have limited training data (<10K samples) — risk of catastrophic forgetting

Configuration (AuroraSOC)

Full fine-tuning is not the default in AuroraSOC, but you can enable it by modifying the training config:

```yaml
# training/configs/granite_soc_finetune.yaml
model:
  name: "ibm-granite/granite-4.0-h-tiny"  # Note: NOT the unsloth/ variant
  max_seq_length: 4096
  load_in_4bit: false                     # ← Disable quantization

lora:
  # Not used — comment out or remove the lora section entirely

training:
  per_device_train_batch_size: 1          # Reduce due to high VRAM
  gradient_accumulation_steps: 8
  bf16: true
  optim: "adamw_torch"                    # Full precision optimizer
```
warning

Full fine-tuning of Granite 4 H-Small (8B) requires ~100 GB VRAM. Use an A100 80GB with gradient checkpointing, or multiple GPUs with DeepSpeed.


2. LoRA (Low-Rank Adaptation)

The Key Insight

Instead of updating all 8 billion parameters, LoRA adds small trainable matrices alongside the existing weights. The original model stays frozen.

How LoRA Works (Visually)

The math: Instead of learning a new W' = W + ΔW (where ΔW has d×d parameters), LoRA decomposes ΔW into two small matrices: ΔW = B × A, where B is d×r and A is r×d. With rank r=64 and d=4096:

  • Original ΔW: 4096 × 4096 = 16.7M parameters
  • LoRA (A + B): (4096 × 64) + (64 × 4096) = 524K parameters (32× fewer!)
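
To make the decomposition concrete, here is a minimal LoRA linear layer sketch in PyTorch. It is illustrative only; production implementations (PEFT, Unsloth) additionally handle dtype casting, dropout, initialization schemes, and adapter merging.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update ΔW = B @ A."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 64):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # pre-trained weight stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B: d_out x r, zero-init so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x A^T B^T   (same as adding ΔW = B @ A to W)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 524288 -> the ~524K trainable parameters from the example above
```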

LoRA Rank (r) — The Most Important Hyperparameter

The rank r controls how much the model can learn:

| Rank (r) | Trainable Params | Training Speed | Quality for SOC Tasks | VRAM Overhead |
|---|---|---|---|---|
| 8 | ~2.5M | Fastest | Insufficient for complex security reasoning | +0.5 GB |
| 16 | ~5M | Very fast | Good for simple classification tasks | +1 GB |
| 32 | ~10M | Fast | Good for most single-domain tasks | +2 GB |
| 64 | ~20M | Moderate | Best for multi-domain SOC knowledge | +3 GB |
| 128 | ~40M | Slow | Marginal gains; used for orchestrator | +6 GB |
| 256 | ~80M | Very slow | Diminishing returns; risk of overfitting | +12 GB |

Why AuroraSOC Uses r=64

SOC tasks require the model to learn diverse knowledge — MITRE ATT&CK, network protocols, malware analysis, compliance frameworks. Rank 64 provides enough capacity to absorb this breadth without overfitting. The orchestrator uses r=128 because it must understand ALL domains to route tasks correctly.

VRAM Requirements (LoRA, FP16 Base)

| Model | Base Model (FP16) | LoRA Adapters (r=64) | Optimizer States | Total VRAM |
|---|---|---|---|---|
| Granite 4 H-Tiny (2B) | 4 GB | 0.04 GB | 0.3 GB | ~8 GB |
| Granite 4 H-Small (8B) | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |
| Qwen 3 8B | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |

Note: The base model must be loaded in FP16 for standard LoRA, which is why the VRAM is still high. This is where QLoRA helps.


3. QLoRA (Quantized LoRA) ⭐

The Breakthrough

QLoRA combines 4-bit quantization of the base model with LoRA adapters. The base model takes 4× less VRAM, while LoRA adapters train in FP16/BF16 for full precision gradients.

QLoRA vs LoRA vs Full Fine-Tuning (VRAM Comparison)

This is the key comparison that shows why QLoRA is the default:

| Model | Full Fine-Tuning | LoRA (FP16 base) | QLoRA (4-bit base) | Quality Loss vs Full |
|---|---|---|---|---|
| Granite 4 Micro (1B) | ~14 GB | ~8 GB | ~4 GB | <0.5% |
| Granite 4 H-Tiny (2B) | ~28 GB | ~12 GB | ~6 GB | <0.5% |
| Granite 4 H-Small (8B) | ~100 GB | ~24 GB | ~12 GB | <1% |
| Qwen 3 8B | ~100 GB | ~24 GB | ~12 GB | <1% |
| Gemma 4 12B | ~150 GB | ~36 GB | ~16 GB | <1% |

NF4 Quantization (What Makes QLoRA Special)

QLoRA uses NormalFloat4 (NF4) quantization, which is specifically designed for normally-distributed neural network weights:

| Quantization Type | Bits | How It Works | Quality |
|---|---|---|---|
| FP32 (Full) | 32 | Standard floating point | Perfect |
| FP16 / BF16 | 16 | Half precision | Near-perfect |
| INT8 | 8 | Linear integer quantization | Good |
| NF4 | 4 | Quantile-mapped to normal distribution | Surprisingly good |
| INT4 | 4 | Linear integer quantization | Lossy |

NF4 works because neural network weights follow a normal distribution. NF4 maps the 16 available 4-bit values to the quantiles of the normal distribution, minimizing information loss where it matters most (near zero, where most weights cluster).

Double Quantization

QLoRA also uses double quantization — the quantization constants themselves are quantized:

Step 1: Quantize 8B FP16 weights → 4-bit NF4 + FP32 quantization constants
Step 2: Quantize the FP32 constants → FP8 constants (saves roughly 0.4 bits per parameter)

This saves an additional ~0.4 GB for an 8B model — small but meaningful on consumer GPUs.
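
If you are not training through Unsloth, the same NF4 + double-quantization setup can be expressed directly with HuggingFace Transformers and bitsandbytes. A minimal sketch, using one of the Granite model IDs from this guide as an example:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with double quantization, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 instead of linear INT4
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # LoRA math and activations stay in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-tiny",       # example model ID from this guide
    quantization_config=bnb_config,
    device_map="auto",
)
```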

AuroraSOC QLoRA Configuration (Explained)

```yaml
# training/configs/granite_soc_finetune.yaml

model:
  name: "unsloth/granite-4.0-h-tiny"   # Unsloth-optimized variant (2× faster)
  max_seq_length: 4096                 # 4K context (sweet spot for training)
  load_in_4bit: true                   # ← THIS enables QLoRA's NF4 quantization

lora:
  r: 64                                # Rank 64 — 20M trainable params
  lora_alpha: 64                       # Scaling = alpha/r = 1.0
  lora_dropout: 0                      # Unsloth recommendation (faster)
  bias: "none"                         # Don't train biases (saves VRAM)
  target_modules:                      # Which layers get LoRA adapters:
    - "q_proj"                         # Query projection (attention)
    - "k_proj"                         # Key projection (attention)
    - "v_proj"                         # Value projection (attention)
    - "o_proj"                         # Output projection (attention)
    - "gate_proj"                      # FFN gate (feed-forward)
    - "up_proj"                        # FFN up-projection
    - "down_proj"                      # FFN down-projection
    - "shared_mlp.input_linear"        # Mamba SSM input (Granite 4 Hybrid)
    - "shared_mlp.output_linear"       # Mamba SSM output (Granite 4 Hybrid)
  use_gradient_checkpointing: "unsloth"  # 2× less VRAM than PyTorch native
```
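
For reference, the YAML above corresponds roughly to the following Unsloth calls. This is a sketch based on Unsloth's public API; AuroraSOC's training script may wire the config up differently.

```python
from unsloth import FastLanguageModel

# Load the 4-bit quantized base model (QLoRA) with Unsloth's optimized kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/granite-4.0-h-tiny",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention, FFN, and Mamba projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "shared_mlp.input_linear", "shared_mlp.output_linear",
    ],
    use_gradient_checkpointing="unsloth",
)
```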

Unsloth Optimizations (Why We Use Unsloth)

Unsloth provides hand-optimized CUDA kernels that make QLoRA training 2× faster and use 60% less VRAM compared to standard HuggingFace + PEFT:

| Metric | HuggingFace + PEFT | Unsloth | How |
|---|---|---|---|
| Training speed | Baseline | 2× faster | Custom Triton kernels for attention and LoRA |
| VRAM usage | Baseline | 60% less | Fused operations, optimized gradient checkpointing |
| Max sequence length | Limited by VRAM | 4× longer | Memory-efficient attention implementation |

tip

AuroraSOC models are prefixed with unsloth/ (e.g., unsloth/granite-4.0-h-tiny) which loads Unsloth's optimized model class automatically. If you use a non-Unsloth model ID, training will still work but will be slower and use more VRAM.


4. DoRA (Weight-Decomposed Low-Rank Adaptation)

How DoRA Differs from LoRA

DoRA decomposes weight updates into magnitude and direction components, mimicking how full fine-tuning updates weights.

DoRA vs LoRA Results

| Benchmark | LoRA (r=64) | DoRA (r=64) | Improvement |
|---|---|---|---|
| Alert Triage Accuracy | 82% | 84% | +2% |
| Threat Hunting Quality | 78% | 80% | +2% |
| Average across SOC tasks | 75.6% | 77.8% | +2.2% |
| VRAM overhead vs LoRA | Baseline | +5% | Minor |
| Training time vs LoRA | Baseline | +8% | Minor |

When to Use DoRA

  • When you need that extra 1-3% quality and can tolerate slightly longer training
  • For the orchestrator agent which needs the highest reasoning quality
  • When you've already optimized LoRA rank and hyperparameters

Configuration

```yaml
# Add to your training config
lora:
  r: 64
  use_dora: true   # ← Enable DoRA
  # ... rest of LoRA config unchanged
```
note

DoRA is available in Unsloth ≥0.15 and PEFT ≥0.13. Not all base models support DoRA — verify compatibility with your specific model.
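
At the PEFT level, DoRA is a single flag on the LoRA config. A minimal sketch, reusing the attention and FFN module names from the QLoRA config above:

```python
from peft import LoraConfig, get_peft_model

dora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    bias="none",
    use_dora=True,                 # decompose updates into magnitude + direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, dora_config)  # base_model loaded as in the QLoRA example
```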


5. SFT (Supervised Fine-Tuning)

What SFT Is

SFT is a training objective (it defines how the loss is computed), not a parameter-update method. It's the most common approach and what AuroraSOC uses.

Response Masking (Critical for Quality)

The train_on_completions: true setting in AuroraSOC's config enables response masking — the model only learns from the assistant's response, not from the system prompt or user question:

<|start_of_role|>system<|end_of_role|>
You are the AuroraSOC Security Analyst... ← MASKED (loss = 0)
<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Analyze this alert: ET TROJAN Cobalt Strike... ← MASKED (loss = 0)
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
## Alert Analysis ← TRAINED (loss computed)
**Severity:** Critical ← TRAINED
**Classification:** Command & Control (C2) ← TRAINED
... ← TRAINED
<|end_of_text|>

Without response masking, the model wastes capacity memorizing system prompts and user questions instead of learning high-quality responses.
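
One common way to implement this masking when training with HuggingFace TRL is the completion-only data collator. This is a sketch, assuming the Granite chat template shown above; AuroraSOC's train_on_completions setting may be implemented differently.

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-tiny")

# Everything before (and including) this template gets label -100, so the loss
# is computed only on the assistant's response tokens.
response_template = "<|start_of_role|>assistant<|end_of_role|>"

collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)

# trainer = SFTTrainer(model=model, train_dataset=dataset,
#                      data_collator=collator, ...)   # wired into the SFT run
```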


6. DPO (Direct Preference Optimization)

The Idea

DPO trains the model using pairs of responses — one preferred (chosen) and one rejected. The model learns to increase the probability of chosen responses and decrease rejected ones.
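
For reference, the standard DPO objective makes this precise: with a frozen reference model (usually the SFT checkpoint), the trained model is rewarded for raising the likelihood of the chosen response $y_w$ relative to the reference and penalized for raising the rejected response $y_l$:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here π_θ is the model being trained, π_ref is the frozen SFT model, σ is the sigmoid, and β controls how strongly the preferences are enforced.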

DPO vs SFT

| Aspect | SFT | DPO |
|---|---|---|
| Training data | (prompt, response) pairs | (prompt, chosen, rejected) triples |
| What it learns | "Generate responses like these" | "Prefer this style over that style" |
| Data creation effort | Moderate | High (need paired comparisons) |
| Quality improvement | Good baseline | Better alignment with desired behavior |
| When to use | First training stage | Refinement after SFT |

DPO Data Format

```json
{
  "prompt": "A Suricata alert fires: ET MALWARE Win32/Emotet CnC Activity. Source: 10.0.1.25, Dest: 185.94.111.1. What should the SOC analyst do?",
  "chosen": "## Immediate Actions\n\n1. **Escalate to Tier 2** — Emotet is a high-severity banking trojan with worm capabilities.\n2. **Isolate host 10.0.1.25** from the network immediately.\n3. **Block 185.94.111.1** at the perimeter firewall.\n\n## Investigation\n\n- Check EDR for process tree on 10.0.1.25\n- Query SIEM for other connections to 185.94.111.1\n- Map to MITRE ATT&CK: T1071.001 (Application Layer Protocol)\n\n## IOCs\n- `185.94.111.1` — Known Emotet C2\n- Check for lateral movement indicators (T1021)",
  "rejected": "You should investigate the alert and check if it's a true positive. Look at the source and destination IPs and decide if action is needed."
}
```
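
With data in that shape, a DPO stage using TRL's DPOTrainer looks roughly like the following. This is a sketch of what happens at the TRL level, not AuroraSOC's actual script (the make/python commands below are the supported path); the JSONL path is a placeholder, and argument names follow TRL's documented API.

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Dataset with "prompt", "chosen", "rejected" columns, as in the example above.
# NOTE: the file path is a hypothetical placeholder.
dataset = load_dataset("json", data_files="training/data/soc_preferences.jsonl", split="train")

config = DPOConfig(
    output_dir="training/output/dpo",
    beta=0.1,                          # the β from the DPO objective above
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)

# trainer = DPOTrainer(model=model, ref_model=None, args=config,
#                      train_dataset=dataset, processing_class=tokenizer)
# trainer.train()
```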

Running DPO Training (After SFT)

# Step 1: Complete SFT training first
make train

# Step 2: Run DPO on the SFT-trained model
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_dpo.yaml \
--base-model training/output/generic/ # Use SFT output as base

When to Use DPO for AuroraSOC

| Use Case | Recommendation |
|---|---|
| First training run | Use SFT (default) |
| Model gives correct but poorly formatted responses | DPO can teach preferred formatting |
| Model sometimes gives vague vs. specific answers | DPO can reinforce specificity |
| You have SOC analyst feedback on good/bad responses | Great DPO training signal |
| Limited training data (<5K samples) | Stick with SFT |

7. ORPO (Odds Ratio Preference Optimization)

ORPO = SFT + DPO in One Step

ORPO combines supervised fine-tuning and preference optimization into a single training stage, eliminating the need for a separate reference model.

ORPO vs DPO

| Aspect | DPO | ORPO |
|---|---|---|
| Training stages | 2 (SFT → DPO) | 1 (combined) |
| Reference model needed | Yes (frozen SFT model) | No |
| VRAM | Higher (two models in memory) | Lower (single model) |
| Training time | Longer (two stages) | Shorter (~40% less total time) |
| Quality | Excellent | Comparable |

ORPO Configuration

```yaml
# training/configs/granite_soc_orpo.yaml
training:
  method: "orpo"
  orpo_beta: 0.1   # Controls preference strength
  per_device_train_batch_size: 2
  # ... standard hyperparams
```
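
TRL also ships an ORPO trainer, so a single-stage run looks much like the DPO sketch earlier, minus the reference model. The dataset path is again a hypothetical placeholder; argument names follow TRL's ORPOConfig.

```python
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer

# ORPO consumes the same (prompt, chosen, rejected) format as DPO, but trains
# in one stage with no frozen reference model.
dataset = load_dataset("json", data_files="training/data/soc_preferences.jsonl", split="train")

config = ORPOConfig(
    output_dir="training/output/orpo",
    beta=0.1,                          # corresponds to orpo_beta in the YAML above
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)

# trainer = ORPOTrainer(model=model, args=config, train_dataset=dataset,
#                       processing_class=tokenizer)
# trainer.train()
```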

When to Use ORPO

  • You have preference data (chosen + rejected pairs) AND want to do task learning in one step
  • You're resource-constrained and can't afford two training passes
  • Your preference data is high quality (with noisy preferences, the two-stage DPO approach tends to handle the noise better)

Method Comparison Summary

At a Glance

| Method | VRAM (8B model) | Training Time | Data Required | Quality | Complexity |
|---|---|---|---|---|---|
| Full Fine-Tuning | ~100 GB | Baseline | SFT pairs | Best possible | Conceptually simple; resource-heavy |
| LoRA (FP16) | ~24 GB | 0.5× baseline | SFT pairs | 98-99% of full | Moderate |
| QLoRA | ~12 GB | 0.3× baseline | SFT pairs | 97-99% of full | Moderate |
| DoRA | ~13 GB | 0.35× baseline | SFT pairs | 98-99% of full | Moderate |
| SFT + DPO | ~12 GB (each) | 2× QLoRA | SFT pairs + preferences | Excellent alignment | High (2 stages) |
| ORPO | ~12 GB | 1.2× QLoRA | Combined pairs + prefs | Excellent alignment | Moderate |

Decision Tree

Cost Comparison (Training All 9 AuroraSOC Agents)

| Method | GPU Required | Time (9 agents) | Cloud Cost | Quality |
|---|---|---|---|---|
| QLoRA + SFT | RTX 3090 (24 GB) | ~3-5 hours | $3-5 (RunPod) | Excellent |
| LoRA + SFT | A100 40GB | ~2-3 hours | $6-9 (RunPod) | Excellent+ |
| Full FT + SFT | A100 80GB | ~8-12 hours | $25-35 (RunPod) | Best |
| QLoRA + SFT + DPO | RTX 3090 (24 GB) | ~6-10 hours | $6-10 (RunPod) | Superior alignment |
| QLoRA + ORPO | RTX 3090 (24 GB) | ~4-7 hours | $4-7 (RunPod) | Superior alignment |

Advanced: Combining Methods

For maximum quality with reasonable resources, combine methods in stages, as in the pipeline below.

Running the Full Pipeline

# Stage 1: Generic model
make train

# Stage 2: All specialists
python training/scripts/train_all_agents.py

# Stage 3 (optional): DPO refinement
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_dpo.yaml \
--base-model training/output/generic/

# Stage 4: Evaluate + export
make train-eval
make train-serve-ollama

Next Steps