Fine-Tuning Methods: A Complete Guide
This guide explains every fine-tuning technique used in modern LLM training — from QLoRA (what AuroraSOC uses by default) to full fine-tuning, DPO, and ORPO. By the end, you'll understand exactly what happens inside the GPU when you run `make train`, why we chose QLoRA, and when you might want a different method.
What Is Fine-Tuning?
Fine-tuning takes a pre-trained language model (one that already understands language) and teaches it new behavior specific to your task. Think of it like this:
```
Pre-trained Model (Granite 4, Qwen 3, Gemma 4)
        ↓  knows general language, coding, reasoning
        ↓
Fine-Tuning with SOC Data
        ↓  learns alert triage, MITRE ATT&CK, incident response
        ↓
Specialized SOC Model
        →  can classify alerts, write YARA rules, triage incidents
```
The pre-trained model has billions of parameters (numbers) that encode its knowledge. Fine-tuning adjusts some or all of these parameters using your domain-specific data.
The Fine-Tuning Landscape
AuroraSOC uses QLoRA + SFT: a 4-bit quantized base model with trainable low-rank adapters, trained on supervised (prompt, response) pairs. This gives the best quality-per-dollar for security-domain training.
1. Full Fine-Tuning
How It Works
Full fine-tuning updates every single parameter in the model. For an 8B-parameter model like Granite 4 H-Small, that means modifying all 8 billion numbers.
VRAM Requirements
Full fine-tuning needs to store:
- The model weights (FP16): 2 bytes × parameters
- Gradients: 2 bytes × parameters
- Optimizer states (AdamW): 8 bytes × parameters
- Activations: varies with batch size
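The numbers in the table below follow directly from this arithmetic. A minimal sketch, assuming FP16 weights/gradients and AdamW (the function name is ours; activations are excluded because they depend on batch size and sequence length):

```python
def full_finetune_vram_gb(params_billion: float) -> float:
    """Rough VRAM floor for full fine-tuning with AdamW, excluding activations."""
    weights = 2 * params_billion     # FP16 weights: 2 bytes per parameter
    gradients = 2 * params_billion   # FP16 gradients: 2 bytes per parameter
    optimizer = 8 * params_billion   # AdamW: two FP32 moment buffers, 4 bytes each
    return weights + gradients + optimizer

print(full_finetune_vram_gb(8))  # 96.0 -> ~100 GB once activations are added
```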
| Model | Parameters | Model (FP16) | Gradients | Optimizer | Total VRAM |
|---|---|---|---|---|---|
| Granite 4 Micro | ~1B | 2 GB | 2 GB | 8 GB | ~14 GB |
| Granite 4 H-Tiny | ~2B | 4 GB | 4 GB | 16 GB | ~28 GB |
| Granite 4 H-Small | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Qwen 3 8B | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Gemma 4 12B | ~12B | 24 GB | 24 GB | 96 GB | ~150 GB |
When to Use Full Fine-Tuning
| Situation | Recommendation |
|---|---|
| You have 80+ GB VRAM (A100 80GB, H100) | Consider it for maximum quality |
| You need maximum task accuracy | Potential ~1-3% accuracy gain over QLoRA |
| You're training a small model (<1B) | VRAM is manageable |
| You're deploying to production at scale | Minor quality gains compound |
When NOT to Use
- You have a consumer GPU (RTX 3090/4090 with 24 GB) — won't fit
- You're iterating quickly — training takes 5-10× longer than QLoRA
- You have limited training data (<10K samples) — risk of catastrophic forgetting
Configuration (AuroraSOC)
Full fine-tuning is not the default in AuroraSOC, but you can enable it by modifying the training config:
```yaml
# training/configs/granite_soc_finetune.yaml
model:
  name: "ibm-granite/granite-4.0-h-tiny"  # Note: NOT the unsloth/ variant
  max_seq_length: 4096
  load_in_4bit: false  # ← Disable quantization

lora:
  # Not used — comment out or remove the lora section entirely

training:
  per_device_train_batch_size: 1  # Reduce due to high VRAM
  gradient_accumulation_steps: 8
  bf16: true
  optim: "adamw_torch"  # Full-precision optimizer
```

Full fine-tuning of Granite 4 H-Small (8B) requires ~100 GB VRAM. Use an A100 80GB with gradient checkpointing, or multiple GPUs with DeepSpeed.
2. LoRA (Low-Rank Adaptation)
The Key Insight
Instead of updating all 8 billion parameters, LoRA adds small trainable matrices alongside the existing weights. The original model stays frozen.
How LoRA Works
The math: instead of learning a new W' = W + ΔW (where ΔW has d×d parameters), LoRA decomposes ΔW into two small matrices: ΔW = B × A, where B is d×r and A is r×d. With rank r=64 and d=4096:
- Original ΔW: 4096 × 4096 = 16.7M parameters
- LoRA (A + B): (4096 × 64) + (64 × 4096) = 524K parameters (32× fewer!)
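To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (the class name and init scale are illustrative, not PEFT's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (W + scale * B @ A).T, but never materializes ΔW
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Zero-initializing B makes the adapter a no-op at step zero, so training starts exactly from the pre-trained weights.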
LoRA Rank (r) — The Most Important Hyperparameter
The rank r controls how much the model can learn:
| Rank (r) | Trainable Params | Training Speed | Quality for SOC Tasks | VRAM Overhead |
|---|---|---|---|---|
| 8 | ~2.5M | Fastest | Insufficient for complex security reasoning | +0.5 GB |
| 16 | ~5M | Very fast | Good for simple classification tasks | +1 GB |
| 32 | ~10M | Fast | Good for most single-domain tasks | +2 GB |
| 64 | ~20M | Moderate | Best for multi-domain SOC knowledge | +3 GB |
| 128 | ~40M | Slow | Marginal gains; used for orchestrator | +6 GB |
| 256 | ~80M | Very slow | Diminishing returns; risk of overfitting | +12 GB |
SOC tasks require the model to learn diverse knowledge — MITRE ATT&CK, network protocols, malware analysis, compliance frameworks. Rank 64 provides enough capacity to absorb this breadth without overfitting. The orchestrator uses r=128 because it must understand ALL domains to route tasks correctly.
VRAM Requirements (LoRA, FP16 Base)
| Model | Base Model (FP16) | LoRA Adapters (r=64) | Optimizer States | Total VRAM |
|---|---|---|---|---|
| Granite 4 H-Tiny (2B) | 4 GB | 0.04 GB | 0.3 GB | ~8 GB |
| Granite 4 H-Small (8B) | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |
| Qwen 3 8B | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |
Note: The base model must be loaded in FP16 for standard LoRA, which is why the VRAM is still high. This is where QLoRA helps.
3. QLoRA (Quantized LoRA) ⭐
The Breakthrough
QLoRA combines 4-bit quantization of the base model with LoRA adapters. The base model takes 4× less VRAM, while LoRA adapters train in FP16/BF16 for full precision gradients.
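A minimal sketch of how this looks with HuggingFace transformers, bitsandbytes, and PEFT (the model ID and target modules here are illustrative; AuroraSOC's actual config is shown later in this section):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (explained below)
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-tiny", quantization_config=bnb
)
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=64, lora_dropout=0.0, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
model.print_trainable_parameters()  # only the BF16 LoRA adapters are trainable
```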
QLoRA vs LoRA vs Full Fine-Tuning (VRAM Comparison)
This is the key comparison that shows why QLoRA is the default:
| Model | Full Fine-Tuning | LoRA (FP16 base) | QLoRA (4-bit base) | Quality Loss vs Full |
|---|---|---|---|---|
| Granite 4 Micro (1B) | ~14 GB | ~8 GB | ~4 GB | <0.5% |
| Granite 4 H-Tiny (2B) | ~28 GB | ~12 GB | ~6 GB | <0.5% |
| Granite 4 H-Small (8B) | ~100 GB | ~24 GB | ~12 GB | <1% |
| Qwen 3 8B | ~100 GB | ~24 GB | ~12 GB | <1% |
| Gemma 4 12B | ~150 GB | ~36 GB | ~16 GB | <1% |
NF4 Quantization (What Makes QLoRA Special)
QLoRA uses NormalFloat4 (NF4) quantization, which is specifically designed for normally-distributed neural network weights:
| Quantization Type | Bits | How It Works | Quality |
|---|---|---|---|
| FP32 (Full) | 32 | Standard floating point | Perfect |
| FP16 / BF16 | 16 | Half precision | Near-perfect |
| INT8 | 8 | Linear integer quantization | Good |
| NF4 | 4 | Quantile-mapped to normal distribution | Surprisingly good |
| INT4 | 4 | Linear integer quantization | Lossy |
NF4 works because neural network weights follow a normal distribution. NF4 maps the 16 available 4-bit values to the quantiles of the normal distribution, minimizing information loss where it matters most (near zero, where most weights cluster).
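The quantile construction can be illustrated in a few lines. A simplified sketch (the real NF4 codebook in bitsandbytes is asymmetric and reserves an exact zero, so its values differ slightly):

```python
import numpy as np
from scipy.stats import norm

# Map 16 evenly spaced probabilities to quantiles of N(0, 1), then
# normalize into [-1, 1]. The levels end up densely packed near zero,
# where most neural-network weights live.
probs = np.linspace(0.01, 0.99, 16)
levels = norm.ppf(probs) / abs(norm.ppf(0.01))
print(np.round(levels, 3))
```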
Double Quantization
QLoRA also uses double quantization — the quantization constants themselves are quantized:
```
Step 1: Quantize the FP16 weights    → 4-bit NF4 values + FP32 quantization constants
Step 2: Quantize the FP32 constants  → FP8 constants
```
The second step saves an additional ~0.4 GB on an 8B model — small but meaningful on consumer GPUs.
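The arithmetic behind that number, as a back-of-the-envelope sketch (using QLoRA's defaults: one constant per 64-weight block, one second-level constant per 256 first-level constants):

```python
BLOCK = 64    # weights per first-level quantization constant
BLOCK2 = 256  # first-level constants per second-level constant

single = 32 / BLOCK                          # FP32 constant: 0.5 extra bits/weight
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # FP8 constant + nested FP32: ~0.127
saved_bits = single - double                 # ~0.373 bits per weight
print(saved_bits * 8e9 / 8 / 1e9)            # ≈ 0.37 GB saved on an 8B model
```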
AuroraSOC QLoRA Configuration (Explained)
```yaml
# training/configs/granite_soc_finetune.yaml
model:
  name: "unsloth/granite-4.0-h-tiny"  # Unsloth-optimized variant (2× faster)
  max_seq_length: 4096                # 4K context (sweet spot for training)
  load_in_4bit: true                  # ← THIS enables QLoRA's NF4 quantization

lora:
  r: 64                               # Rank 64 — ~20M trainable params
  lora_alpha: 64                      # Scaling = alpha/r = 1.0
  lora_dropout: 0                     # Unsloth recommendation (faster)
  bias: "none"                        # Don't train biases (saves VRAM)
  target_modules:                     # Which layers get LoRA adapters:
    - "q_proj"                        # Query projection (attention)
    - "k_proj"                        # Key projection (attention)
    - "v_proj"                        # Value projection (attention)
    - "o_proj"                        # Output projection (attention)
    - "gate_proj"                     # FFN gate (feed-forward)
    - "up_proj"                       # FFN up-projection
    - "down_proj"                     # FFN down-projection
    - "shared_mlp.input_linear"       # Shared MLP input (Granite 4 Hybrid blocks)
    - "shared_mlp.output_linear"      # Shared MLP output (Granite 4 Hybrid blocks)
  use_gradient_checkpointing: "unsloth"  # 2× less VRAM than PyTorch native
```
Unsloth Optimizations (Why We Use Unsloth)
Unsloth provides hand-optimized Triton kernels that make QLoRA training 2× faster and use 60% less VRAM compared to standard HuggingFace + PEFT:
| Metric | HuggingFace + PEFT | Unsloth | How |
|---|---|---|---|
| Training speed | Baseline | 2× faster | Custom Triton kernels for attention and LoRA |
| VRAM usage | Baseline | 60% less | Fused operations, optimized gradient checkpointing |
| Max sequence length | Limited by VRAM | 4× longer | Memory-efficient attention implementation |
AuroraSOC models are prefixed with `unsloth/` (e.g., `unsloth/granite-4.0-h-tiny`), which loads Unsloth's optimized model class automatically. If you use a non-Unsloth model ID, training will still work but will be slower and use more VRAM.
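For reference, here is roughly how the YAML above maps onto Unsloth's API. This is a sketch following Unsloth's documented FastLanguageModel interface; the actual wiring lives in training/scripts/finetune_granite.py:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/granite-4.0-h-tiny",
    max_seq_length=4096,
    load_in_4bit=True,                      # QLoRA: NF4-quantized base
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",   # Unsloth's memory-efficient variant
)
```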
4. DoRA (Weight-Decomposed Low-Rank Adaptation)
How DoRA Differs from LoRA
DoRA decomposes weight updates into magnitude and direction components, mimicking how full fine-tuning updates weights (see the sketch below).
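In symbols (following the DoRA paper): W' = m × (W₀ + B × A) / ‖W₀ + B × A‖_c, where m is a learned per-column magnitude vector and ‖·‖_c is the column-wise norm. A minimal PyTorch sketch (function and variable names are ours):

```python
import torch

def dora_weight(W0, A, B, m):
    """Recompose a DoRA weight: LoRA-style direction, learned magnitude.
    Shapes: W0 (d_out, d_in) frozen; B (d_out, r); A (r, d_in); m (1, d_in)."""
    W = W0 + B @ A                          # low-rank update, exactly as in LoRA
    col_norm = W.norm(dim=0, keepdim=True)  # per-column L2 norm: shape (1, d_in)
    return m * (W / col_norm)               # re-scale each column by its learned magnitude
```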
DoRA vs LoRA Results
| Benchmark | LoRA (r=64) | DoRA (r=64) | Improvement |
|---|---|---|---|
| Alert Triage Accuracy | 82% | 84% | +2% |
| Threat Hunting Quality | 78% | 80% | +2% |
| Average across SOC tasks | 75.6% | 77.8% | +2.2% |
| VRAM overhead vs LoRA | Baseline | +5% | Minor |
| Training time vs LoRA | Baseline | +8% | Minor |
When to Use DoRA
- When you need that extra 1-3% quality and can tolerate slightly longer training
- For the orchestrator agent which needs the highest reasoning quality
- When you've already optimized LoRA rank and hyperparameters
Configuration
```yaml
# Add to your training config
lora:
  r: 64
  use_dora: true  # ← Enable DoRA
  # ... rest of LoRA config unchanged
```
DoRA is available in Unsloth ≥0.15 and PEFT ≥0.13. Not all base models support DoRA — verify compatibility with your specific model.
5. SFT (Supervised Fine-Tuning)
What SFT Is
SFT is the training objective (how the loss is computed), not a parameter-update method like LoRA or QLoRA: the model learns, via token-level cross-entropy, to reproduce target responses. It's the most common approach and the one AuroraSOC uses.
Response Masking (Critical for Quality)
The `train_on_completions: true` setting in AuroraSOC's config enables response masking — the model only learns from the assistant's response, not from the system prompt or user question:

```
<|start_of_role|>system<|end_of_role|>
You are the AuroraSOC Security Analyst...        ← MASKED  (loss = 0)
<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Analyze this alert: ET TROJAN Cobalt Strike...   ← MASKED  (loss = 0)
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
## Alert Analysis                                ← TRAINED (loss computed)
**Severity:** Critical                           ← TRAINED
**Classification:** Command & Control (C2)       ← TRAINED
...                                              ← TRAINED
<|end_of_text|>
```
Without response masking, the model wastes capacity memorizing system prompts and user questions instead of learning high-quality responses.
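Under the hood, masking is implemented by setting the label of every non-response token to -100, the ignore index that PyTorch's cross-entropy loss (and hence the HF Trainer) skips. A toy sketch (the token IDs and split index are made up):

```python
import torch

input_ids = torch.tensor([[11, 12, 13, 14, 15, 16, 17]])  # toy tokenized conversation
labels = input_ids.clone()

assistant_start = 4                 # index where the assistant response begins
labels[:, :assistant_start] = -100  # -100 = ignore_index: no loss on prompt tokens
```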
6. DPO (Direct Preference Optimization)
The Idea
DPO trains the model using pairs of responses — one preferred (chosen) and one rejected. The model learns to increase the probability of chosen responses and decrease rejected ones.
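Concretely, the loss from the DPO paper compares the policy's log-probabilities against a frozen reference model. A sketch (inputs are summed per-sequence log-probs; names are ours):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))"""
    chosen_margin = pi_chosen - ref_chosen        # how much more the policy likes
    rejected_margin = pi_rejected - ref_rejected  # each response than the reference does
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Raising the chosen margin and lowering the rejected one is exactly "increase the probability of chosen responses, decrease rejected ones", anchored to the reference model so the policy can't drift arbitrarily far.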
DPO vs SFT
| Aspect | SFT | DPO |
|---|---|---|
| Training data | (prompt, response) pairs | (prompt, chosen, rejected) triples |
| What it learns | "Generate responses like these" | "Prefer this style over that style" |
| Data creation effort | Moderate | High (need paired comparisons) |
| Quality improvement | Good baseline | Better alignment with desired behavior |
| When to use | First training stage | Refinement after SFT |
DPO Data Format
```json
{
  "prompt": "A Suricata alert fires: ET MALWARE Win32/Emotet CnC Activity. Source: 10.0.1.25, Dest: 185.94.111.1. What should the SOC analyst do?",
  "chosen": "## Immediate Actions\n\n1. **Escalate to Tier 2** — Emotet is a high-severity banking trojan with worm capabilities.\n2. **Isolate host 10.0.1.25** from the network immediately.\n3. **Block 185.94.111.1** at the perimeter firewall.\n\n## Investigation\n\n- Check EDR for process tree on 10.0.1.25\n- Query SIEM for other connections to 185.94.111.1\n- Map to MITRE ATT&CK: T1071.001 (Web Protocols)\n\n## IOCs\n- `185.94.111.1` — Known Emotet C2\n- Check for lateral movement indicators (T1021)",
  "rejected": "You should investigate the alert and check if it's a true positive. Look at the source and destination IPs and decide if action is needed."
}
```
Running DPO Training (After SFT)
```bash
# Step 1: Complete SFT training first
make train

# Step 2: Run DPO on the SFT-trained model
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_dpo.yaml \
    --base-model training/output/generic/   # Use SFT output as base
```
When to Use DPO for AuroraSOC
| Use Case | Recommendation |
|---|---|
| First training run | Use SFT (default) |
| Model gives correct but poorly formatted responses | DPO can teach preferred formatting |
| Model sometimes gives vague vs. specific answers | DPO can reinforce specificity |
| You have SOC analyst feedback on good/bad responses | Great DPO training signal |
| Limited training data (<5K samples) | Stick with SFT |
7. ORPO (Odds Ratio Preference Optimization)
ORPO = SFT + DPO in One Step
ORPO combines supervised fine-tuning and preference optimization into a single training stage, eliminating the need for a separate frozen reference model.
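A sketch of the objective from the ORPO paper: the usual SFT loss on the chosen response plus an odds-ratio penalty, computed from the policy alone (inputs are mean per-token log-probs; names are ours):

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, beta=0.1):
    """SFT loss + beta * odds-ratio term; no frozen reference model needed."""
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability
    log_odds = (logp_chosen - logp_rejected) - (
        torch.log1p(-torch.exp(logp_chosen)) - torch.log1p(-torch.exp(logp_rejected))
    )
    return nll_chosen + beta * (-F.logsigmoid(log_odds)).mean()
```

Because both terms come from the same forward passes, only one model sits in memory, which is where ORPO's VRAM and time savings in the table below come from.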
ORPO vs DPO
| Aspect | DPO | ORPO |
|---|---|---|
| Training stages | 2 (SFT → DPO) | 1 (combined) |
| Reference model needed | Yes (frozen SFT model) | No |
| VRAM | Higher (two models in memory) | Lower (single model) |
| Training time | Longer (two stages) | Shorter (~40% less total time) |
| Quality | Excellent | Comparable |
ORPO Configuration
```yaml
# training/configs/granite_soc_orpo.yaml
training:
  method: "orpo"
  orpo_beta: 0.1  # Controls preference strength
  per_device_train_batch_size: 2
  # ... standard hyperparams
```
When to Use ORPO
- You have preference data (chosen + rejected pairs) AND want to do task learning in one step
- You're resource-constrained and can't afford two training passes
- Your preference data is high quality (if preferences are noisy, the two-stage SFT → DPO pipeline handles the noise better)
Method Comparison Summary
At a Glance
| Method | VRAM (8B model) | Training Time | Data Required | Quality | Complexity |
|---|---|---|---|---|---|
| Full Fine-Tuning | ~100 GB | Baseline | SFT pairs | Best possible | Conceptually simple, resource-heavy |
| LoRA (FP16) | ~24 GB | 0.5× baseline | SFT pairs | 98-99% of full | Moderate |
| QLoRA ⭐ | ~12 GB | 0.3× baseline | SFT pairs | 97-99% of full | Moderate |
| DoRA | ~13 GB | 0.35× baseline | SFT pairs | 98-99% of full | Moderate |
| SFT + DPO | ~12 GB (each) | 2× QLoRA | SFT pairs + preferences | Excellent alignment | High (2 stages) |
| ORPO | ~12 GB | 1.2× QLoRA | Combined pairs + prefs | Excellent alignment | Moderate |
Decision Tree
As a rule of thumb: start with QLoRA + SFT (the default). Add DPO or ORPO refinement once you have preference data. Step up to LoRA on an FP16 base if you have 40+ GB of VRAM, and to full fine-tuning only with 80+ GB and a hard requirement for the last 1-3% of accuracy.
Cost Comparison (Training All 9 AuroraSOC Agents)
| Method | GPU Required | Time (9 agents) | Cloud Cost | Quality |
|---|---|---|---|---|
| QLoRA + SFT ⭐ | RTX 3090 (24 GB) | ~3-5 hours | $3-5 (RunPod) | Excellent |
| LoRA + SFT | A100 40GB | ~2-3 hours | $6-9 (RunPod) | Excellent+ |
| Full FT + SFT | A100 80GB | ~8-12 hours | $25-35 (RunPod) | Best |
| QLoRA + SFT + DPO | RTX 3090 (24 GB) | ~6-10 hours | $6-10 (RunPod) | Superior alignment |
| QLoRA + ORPO | RTX 3090 (24 GB) | ~4-7 hours | $4-7 (RunPod) | Superior alignment |
Advanced: Combining Methods
The Recommended Training Pipeline
For maximum quality with reasonable resources, combine methods in stages: QLoRA + SFT for the generic model and each specialist, then optional DPO refinement, then evaluation and export.
Running the Full Pipeline
```bash
# Stage 1: Generic model
make train

# Stage 2: All specialists
python training/scripts/train_all_agents.py

# Stage 3 (optional): DPO refinement
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_dpo.yaml \
    --base-model training/output/generic/

# Stage 4: Evaluate + export
make train-eval
make train-serve-ollama
```
Next Steps
- Model Comparison Guide — compare Granite 4, Qwen 3, and Gemma 4 for SOC tasks
- Agent Model Selection — which model + method for each AuroraSOC agent
- Cloud Training Guide — train on RunPod, Lambda Labs, or vast.ai
- Evaluation & Export — benchmark and deploy your models