Fine-Tuning Methods: A Complete Guide
This guide explains every fine-tuning technique used in modern LLM training — from QLoRA (what AuroraSOC uses by default) to full fine-tuning, DPO, and ORPO. By the end, you'll understand exactly what happens inside the GPU when you run `make train`, why we chose QLoRA, and when you might want a different method.
What Is Fine-Tuning?
Fine-tuning takes a pre-trained language model (one that already understands language) and teaches it new behavior specific to your task. Think of it like this:
```
Pre-trained Model (Granite 4, Qwen 3, Gemma 4)
        ↓  knows general language, coding, reasoning
        ↓
Fine-Tuning with SOC Data
        ↓  learns alert triage, MITRE ATT&CK, incident response
        ↓
Specialized SOC Model
        →  can classify alerts, write YARA rules, triage incidents
```
The pre-trained model has billions of parameters (numbers) that encode its knowledge. Fine-tuning adjusts some or all of these parameters using your domain-specific data.
The Fine-Tuning Landscape
AuroraSOC uses QLoRA + SFT: a 4-bit quantized base model with trainable low-rank adapters, trained on supervised (prompt, response) pairs. This gives the best quality-per-dollar for security-domain training.
1. Full Fine-Tuning
How It Works
Full fine-tuning updates every single parameter in the model. For an 8B-parameter model like Granite 4 H-Small, that means modifying all 8 billion numbers.
VRAM Requirements
Full fine-tuning needs to store:
- The model weights (FP16): 2 bytes × parameters
- Gradients: 2 bytes × parameters
- Optimizer states (AdamW): 8 bytes × parameters
- Activations: varies with batch size
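The numbers in the table below follow directly from this arithmetic. A minimal sketch, assuming FP16 weights/gradients and AdamW (the function name is ours; activations are excluded because they depend on batch size and sequence length):

```python
def full_finetune_vram_gb(params_billion: float) -> float:
    """Rough VRAM floor for full fine-tuning with AdamW, excluding activations."""
    weights = 2 * params_billion     # FP16 weights: 2 bytes per parameter
    gradients = 2 * params_billion   # FP16 gradients: 2 bytes per parameter
    optimizer = 8 * params_billion   # AdamW: two FP32 moment buffers, 4 bytes each
    return weights + gradients + optimizer

print(full_finetune_vram_gb(8))  # 96.0 -> ~100 GB once activations are added
```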
| Model | Parameters | Model (FP16) | Gradients | Optimizer | Total VRAM |
|---|---|---|---|---|---|
| Granite 4 Micro | ~1B | 2 GB | 2 GB | 8 GB | ~14 GB |
| Granite 4 H-Tiny | ~2B | 4 GB | 4 GB | 16 GB | ~28 GB |
| Granite 4 H-Small | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Qwen 3 8B | ~8B | 16 GB | 16 GB | 64 GB | ~100 GB |
| Gemma 4 12B | ~12B | 24 GB | 24 GB | 96 GB | ~150 GB |
When to Use Full Fine-Tuning
| Situation | Recommendation |
|---|---|
| You have 80+ GB VRAM (A100 80GB, H100) | Consider it for maximum quality |
| You need maximum task accuracy | Potential ~1-3% accuracy gain over QLoRA |
| You're training a small model (<1B) | VRAM is manageable |
| You're deploying to production at scale | Minor quality gains compound |
When NOT to Use
- You have a consumer GPU (RTX 3090/4090 with 24 GB) — won't fit
- You're iterating quickly — training takes 5-10× longer than QLoRA
- You have limited training data (<10K samples) — risk of catastrophic forgetting
Configuration (AuroraSOC)
Full fine-tuning is not the default in AuroraSOC, but you can enable it by modifying the training config:
```yaml
# training/configs/granite_soc_finetune.yaml
model:
  name: "ibm-granite/granite-4.0-h-tiny"  # Note: NOT the unsloth/ variant
  max_seq_length: 4096
  load_in_4bit: false  # ← Disable quantization

lora:
  # Not used — comment out or remove the lora section entirely

training:
  per_device_train_batch_size: 1  # Reduce due to high VRAM
  gradient_accumulation_steps: 8
  bf16: true
  optim: "adamw_torch"  # Full-precision optimizer
```

Full fine-tuning of Granite 4 H-Small (8B) requires ~100 GB VRAM. Use an A100 80GB with gradient checkpointing, or multiple GPUs with DeepSpeed.
2. LoRA (Low-Rank Adaptation)
The Key Insight
Instead of updating all 8 billion parameters, LoRA adds small trainable matrices alongside the existing weights. The original model stays frozen.
How LoRA Works
The math: instead of learning a new W' = W + ΔW (where ΔW has d×d parameters), LoRA decomposes ΔW into two small matrices: ΔW = B × A, where B is d×r and A is r×d. With rank r=64 and d=4096:
- Original ΔW: 4096 × 4096 = 16.7M parameters
- LoRA (A + B): (4096 × 64) + (64 × 4096) = 524K parameters (32× fewer!)
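To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (the class name and init scale are illustrative, not PEFT's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (W + scale * B @ A).T, but never materializes ΔW
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Zero-initializing B makes the adapter a no-op at step zero, so training starts exactly from the pre-trained weights.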
LoRA Rank (r) — The Most Important Hyperparameter
The rank r controls how much the model can learn:
| Rank (r) | Trainable Params | Training Speed | Quality for SOC Tasks | VRAM Overhead |
|---|---|---|---|---|
| 8 | ~2.5M | Fastest | Insufficient for complex security reasoning | +0.5 GB |
| 16 | ~5M | Very fast | Good for simple classification tasks | +1 GB |
| 32 | ~10M | Fast | Good for most single-domain tasks | +2 GB |
| 64 | ~20M | Moderate | Best for multi-domain SOC knowledge | +3 GB |
| 128 | ~40M | Slow | Marginal gains; used for orchestrator | +6 GB |
| 256 | ~80M | Very slow | Diminishing returns; risk of overfitting | +12 GB |
SOC tasks require the model to learn diverse knowledge — MITRE ATT&CK, network protocols, malware analysis, compliance frameworks. Rank 64 provides enough capacity to absorb this breadth without overfitting. The orchestrator uses r=128 because it must understand ALL domains to route tasks correctly.
VRAM Requirements (LoRA, FP16 Base)
| Model | Base Model (FP16) | LoRA Adapters (r=64) | Optimizer States | Total VRAM |
|---|---|---|---|---|
| Granite 4 H-Tiny (2B) | 4 GB | 0.04 GB | 0.3 GB | ~8 GB |
| Granite 4 H-Small (8B) | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |
| Qwen 3 8B | 16 GB | 0.08 GB | 0.6 GB | ~24 GB |
Note: The base model must be loaded in FP16 for standard LoRA, which is why the VRAM is still high. This is where QLoRA helps.
3. QLoRA (Quantized LoRA) ⭐
The Breakthrough
QLoRA combines 4-bit quantization of the base model with LoRA adapters. The base model takes 4× less VRAM, while LoRA adapters train in FP16/BF16 for full precision gradients.
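A minimal sketch of how this looks with HuggingFace transformers, bitsandbytes, and PEFT (the model ID and target modules here are illustrative; AuroraSOC's actual config is shown later in this section):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (explained below)
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-tiny", quantization_config=bnb
)
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=64, lora_dropout=0.0, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
model.print_trainable_parameters()  # only the BF16 LoRA adapters are trainable
```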
QLoRA vs LoRA vs Full Fine-Tuning (VRAM Comparison)
This is the key comparison that shows why QLoRA is the default:
| Model | Full Fine-Tuning | LoRA (FP16 base) | QLoRA (4-bit base) | Quality Loss vs Full |
|---|---|---|---|---|
| Granite 4 Micro (1B) | ~14 GB | ~8 GB | ~4 GB | <0.5% |
| Granite 4 H-Tiny (2B) | ~28 GB | ~12 GB | ~6 GB | <0.5% |
| Granite 4 H-Small (8B) | ~100 GB | ~24 GB | ~12 GB | <1% |
| Qwen 3 8B | ~100 GB | ~24 GB | ~12 GB | <1% |
| Gemma 4 12B | ~150 GB | ~36 GB | ~16 GB | <1% |
NF4 Quantization (What Makes QLoRA Special)
QLoRA uses NormalFloat4 (NF4) quantization, which is specifically designed for normally-distributed neural network weights:
| Quantization Type | Bits | How It Works | Quality |
|---|---|---|---|
| FP32 (Full) | 32 | Standard floating point | Perfect |
| FP16 / BF16 | 16 | Half precision | Near-perfect |
| INT8 | 8 | Linear integer quantization | Good |
| NF4 | 4 | Quantile-mapped to normal distribution | Surprisingly good |
| INT4 | 4 | Linear integer quantization | Lossy |
NF4 works because neural network weights follow a normal distribution. NF4 maps the 16 available 4-bit values to the quantiles of the normal distribution, minimizing information loss where it matters most (near zero, where most weights cluster).
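The quantile construction can be illustrated in a few lines. A simplified sketch (the real NF4 codebook in bitsandbytes is asymmetric and reserves an exact zero, so its values differ slightly):

```python
import numpy as np
from scipy.stats import norm

# Map 16 evenly spaced probabilities to quantiles of N(0, 1), then
# normalize into [-1, 1]. The levels end up densely packed near zero,
# where most neural-network weights live.
probs = np.linspace(0.01, 0.99, 16)
levels = norm.ppf(probs) / abs(norm.ppf(0.01))
print(np.round(levels, 3))
```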
Double Quantization
QLoRA also uses double quantization — the quantization constants themselves are quantized:
```
Step 1: Quantize the FP16 weights    → 4-bit NF4 values + FP32 quantization constants
Step 2: Quantize the FP32 constants  → FP8 constants
```
The second step saves an additional ~0.4 GB on an 8B model — small but meaningful on consumer GPUs.
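The arithmetic behind that number, as a back-of-the-envelope sketch (using QLoRA's defaults: one constant per 64-weight block, one second-level constant per 256 first-level constants):

```python
BLOCK = 64    # weights per first-level quantization constant
BLOCK2 = 256  # first-level constants per second-level constant

single = 32 / BLOCK                          # FP32 constant: 0.5 extra bits/weight
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # FP8 constant + nested FP32: ~0.127
saved_bits = single - double                 # ~0.373 bits per weight
print(saved_bits * 8e9 / 8 / 1e9)            # ≈ 0.37 GB saved on an 8B model
```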
AuroraSOC QLoRA Configuration (Explained)
```yaml
# training/configs/granite_soc_finetune.yaml
model:
  name: "unsloth/granite-4.0-h-tiny"  # Unsloth-optimized variant (2× faster)
  max_seq_length: 4096                # 4K context (sweet spot for training)
  load_in_4bit: true                  # ← THIS enables QLoRA's NF4 quantization

lora:
  r: 64                               # Rank 64 — ~20M trainable params
  lora_alpha: 64                      # Scaling = alpha/r = 1.0
  lora_dropout: 0                     # Unsloth recommendation (faster)
  bias: "none"                        # Don't train biases (saves VRAM)
  target_modules:                     # Which layers get LoRA adapters:
    - "q_proj"                        # Query projection (attention)
    - "k_proj"                        # Key projection (attention)
    - "v_proj"                        # Value projection (attention)
    - "o_proj"                        # Output projection (attention)
    - "gate_proj"                     # FFN gate (feed-forward)
    - "up_proj"                       # FFN up-projection
    - "down_proj"                     # FFN down-projection
    - "shared_mlp.input_linear"       # Shared MLP input (Granite 4 Hybrid blocks)
    - "shared_mlp.output_linear"      # Shared MLP output (Granite 4 Hybrid blocks)
  use_gradient_checkpointing: "unsloth"  # 2× less VRAM than PyTorch native
```
Unsloth Optimizations (Why We Use Unsloth)
Unsloth provides hand-optimized Triton kernels that make QLoRA training 2× faster and use 60% less VRAM compared to standard HuggingFace + PEFT:
| Metric | HuggingFace + PEFT | Unsloth | How |
|---|---|---|---|
| Training speed | Baseline | 2× faster | Custom Triton kernels for attention and LoRA |
| VRAM usage | Baseline | 60% less | Fused operations, optimized gradient checkpointing |
| Max sequence length | Limited by VRAM | 4× longer | Memory-efficient attention implementation |
AuroraSOC models are prefixed with `unsloth/` (e.g., `unsloth/granite-4.0-h-tiny`), which loads Unsloth's optimized model class automatically. If you use a non-Unsloth model ID, training will still work but will be slower and use more VRAM.
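For reference, here is roughly how the YAML above maps onto Unsloth's API. This is a sketch following Unsloth's documented FastLanguageModel interface; the actual wiring lives in training/scripts/finetune_granite.py:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/granite-4.0-h-tiny",
    max_seq_length=4096,
    load_in_4bit=True,                      # QLoRA: NF4-quantized base
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",   # Unsloth's memory-efficient variant
)
```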
4. DoRA (Weight-Decomposed Low-Rank Adaptation)
How DoRA Differs from LoRA
DoRA decomposes weight updates into magnitude and direction components, mimicking how full fine-tuning updates weights (see the sketch below).
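In symbols (following the DoRA paper): W' = m × (W₀ + B × A) / ‖W₀ + B × A‖_c, where m is a learned per-column magnitude vector and ‖·‖_c is the column-wise norm. A minimal PyTorch sketch (function and variable names are ours):

```python
import torch

def dora_weight(W0, A, B, m):
    """Recompose a DoRA weight: LoRA-style direction, learned magnitude.
    Shapes: W0 (d_out, d_in) frozen; B (d_out, r); A (r, d_in); m (1, d_in)."""
    W = W0 + B @ A                          # low-rank update, exactly as in LoRA
    col_norm = W.norm(dim=0, keepdim=True)  # per-column L2 norm: shape (1, d_in)
    return m * (W / col_norm)               # re-scale each column by its learned magnitude
```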
DoRA vs LoRA Results
| Benchmark | LoRA (r=64) | DoRA (r=64) | Improvement |
|---|---|---|---|
| Alert Triage Accuracy | 82% | 84% | +2% |
| Threat Hunting Quality | 78% | 80% | +2% |
| Average across SOC tasks | 75.6% | 77.8% | +2.2% |
| VRAM overhead vs LoRA | Baseline | +5% | Minor |
| Training time vs LoRA | Baseline | +8% | Minor |
When to Use DoRA
- When you need that extra 1-3% quality and can tolerate slightly longer training
- For the orchestrator agent which needs the highest reasoning quality
- When you've already optimized LoRA rank and hyperparameters
Configuration
```yaml
# Add to your training config
lora:
  r: 64
  use_dora: true  # ← Enable DoRA
  # ... rest of LoRA config unchanged
```
DoRA is available in Unsloth ≥0.15 and PEFT ≥0.13. Not all base models support DoRA — verify compatibility with your specific model.
5. SFT (Supervised Fine-Tuning)
What SFT Is
SFT is the training objective (how the loss is computed), not a parameter-update method like LoRA or QLoRA: the model learns, via token-level cross-entropy, to reproduce target responses. It's the most common approach and the one AuroraSOC uses.
Response Masking (Critical for Quality)
The `train_on_completions: true` setting in AuroraSOC's config enables response masking — the model only learns from the assistant's response, not from the system prompt or user question:

```
<|start_of_role|>system<|end_of_role|>
You are the AuroraSOC Security Analyst...        ← MASKED  (loss = 0)
<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Analyze this alert: ET TROJAN Cobalt Strike...   ← MASKED  (loss = 0)
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
## Alert Analysis                                ← TRAINED (loss computed)
**Severity:** Critical                           ← TRAINED
**Classification:** Command & Control (C2)       ← TRAINED
...                                              ← TRAINED
<|end_of_text|>
```
Without response masking, the model wastes capacity memorizing system prompts and user questions instead of learning high-quality responses.
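Under the hood, masking is implemented by setting the label of every non-response token to -100, the ignore index that PyTorch's cross-entropy loss (and hence the HF Trainer) skips. A toy sketch (the token IDs and split index are made up):

```python
import torch

input_ids = torch.tensor([[11, 12, 13, 14, 15, 16, 17]])  # toy tokenized conversation
labels = input_ids.clone()

assistant_start = 4                 # index where the assistant response begins
labels[:, :assistant_start] = -100  # -100 = ignore_index: no loss on prompt tokens
```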
6. DPO (Direct Preference Optimization)
The Idea
DPO trains the model using pairs of responses — one preferred (chosen) and one rejected. The model learns to increase the probability of chosen responses and decrease rejected ones.
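Concretely, the loss from the DPO paper compares the policy's log-probabilities against a frozen reference model. A sketch (inputs are summed per-sequence log-probs; names are ours):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))"""
    chosen_margin = pi_chosen - ref_chosen        # how much more the policy likes
    rejected_margin = pi_rejected - ref_rejected  # each response than the reference does
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Raising the chosen margin and lowering the rejected one is exactly "increase the probability of chosen responses, decrease rejected ones", anchored to the reference model so the policy can't drift arbitrarily far.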
DPO vs SFT
| Aspect | SFT | DPO |
|---|---|---|
| Training data | (prompt, response) pairs | (prompt, chosen, rejected) triples |
| What it learns | "Generate responses like these" | "Prefer this style over that style" |
| Data creation effort | Moderate | High (need paired comparisons) |
| Quality improvement | Good baseline | Better alignment with desired behavior |
| When to use | First training stage | Refinement after SFT |
DPO Data Format
```json
{
  "prompt": "A Suricata alert fires: ET MALWARE Win32/Emotet CnC Activity. Source: 10.0.1.25, Dest: 185.94.111.1. What should the SOC analyst do?",
  "chosen": "## Immediate Actions\n\n1. **Escalate to Tier 2** — Emotet is a high-severity banking trojan with worm capabilities.\n2. **Isolate host 10.0.1.25** from the network immediately.\n3. **Block 185.94.111.1** at the perimeter firewall.\n\n## Investigation\n\n- Check EDR for process tree on 10.0.1.25\n- Query SIEM for other connections to 185.94.111.1\n- Map to MITRE ATT&CK: T1071.001 (Web Protocols)\n\n## IOCs\n- `185.94.111.1` — Known Emotet C2\n- Check for lateral movement indicators (T1021)",
  "rejected": "You should investigate the alert and check if it's a true positive. Look at the source and destination IPs and decide if action is needed."
}
```
Running DPO Training (After SFT)
```bash
# Step 1: Complete SFT training first
make train

# Step 2: Run DPO on the SFT-trained model
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_dpo.yaml \
    --base-model training/output/generic/   # Use SFT output as base
```
When to Use DPO for AuroraSOC
| Use Case | Recommendation |
|---|---|
| First training run | Use SFT (default) |
| Model gives correct but poorly formatted responses | DPO can teach preferred formatting |
| Model sometimes gives vague vs. specific answers | DPO can reinforce specificity |
| You have SOC analyst feedback on good/bad responses | Great DPO training signal |
| Limited training data (<5K samples) | Stick with SFT |
7. ORPO (Odds Ratio Preference Optimization)
ORPO = SFT + DPO in One Step
ORPO combines supervised fine-tuning and preference optimization into a single training stage, eliminating the need for a separate frozen reference model.
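A sketch of the objective from the ORPO paper: the usual SFT loss on the chosen response plus an odds-ratio penalty, computed from the policy alone (inputs are mean per-token log-probs; names are ours):

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, beta=0.1):
    """SFT loss + beta * odds-ratio term; no frozen reference model needed."""
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability
    log_odds = (logp_chosen - logp_rejected) - (
        torch.log1p(-torch.exp(logp_chosen)) - torch.log1p(-torch.exp(logp_rejected))
    )
    return nll_chosen + beta * (-F.logsigmoid(log_odds)).mean()
```

Because both terms come from the same forward passes, only one model sits in memory, which is where ORPO's VRAM and time savings in the table below come from.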
ORPO vs DPO
| Aspect | DPO | ORPO |
|---|---|---|
| Training stages | 2 (SFT → DPO) | 1 (combined) |
| Reference model needed | Yes (frozen SFT model) | No |
| VRAM | Higher (two models in memory) | Lower (single model) |
| Training time | Longer (two stages) | Shorter (~40% less total time) |
| Quality | Excellent | Comparable |
ORPO Configuration
```yaml
# training/configs/granite_soc_orpo.yaml
training:
  method: "orpo"
  orpo_beta: 0.1  # Controls preference strength
  per_device_train_batch_size: 2
  # ... standard hyperparams
```
When to Use ORPO
- You have preference data (chosen + rejected pairs) AND want to do task learning in one step
- You're resource-constrained and can't afford two training passes
- Your preference data is high quality (if preferences are noisy, the two-stage SFT → DPO pipeline handles the noise better)
Method Comparison Summary
At a Glance
| Method | VRAM (8B model) | Training Time | Data Required | Quality | Complexity |
|---|---|---|---|---|---|
| Full Fine-Tuning | ~100 GB | Baseline | SFT pairs | Best possible | Conceptually simple, resource-heavy |
| LoRA (FP16) | ~24 GB | 0.5× baseline | SFT pairs | 98-99% of full | Moderate |
| QLoRA ⭐ | ~12 GB | 0.3× baseline | SFT pairs | 97-99% of full | Moderate |
| DoRA | ~13 GB | 0.35× baseline | SFT pairs | 98-99% of full | Moderate |
| SFT + DPO | ~12 GB (each) | 2× QLoRA | SFT pairs + preferences | Excellent alignment | High (2 stages) |
| ORPO | ~12 GB | 1.2× QLoRA | Combined pairs + prefs | Excellent alignment | Moderate |
Decision Tree
As a rule of thumb: start with QLoRA + SFT (the default). Add DPO or ORPO refinement once you have preference data. Step up to LoRA on an FP16 base if you have 40+ GB of VRAM, and to full fine-tuning only with 80+ GB and a hard requirement for the last 1-3% of accuracy.
Cost Comparison (Training All 9 AuroraSOC Agents)
| Method | GPU Required | Time (9 agents) | Cloud Cost | Quality |
|---|---|---|---|---|
| QLoRA + SFT ⭐ | RTX 3090 (24 GB) | ~3-5 hours | $3-5 (RunPod) | Excellent |
| LoRA + SFT | A100 40GB | ~2-3 hours | $6-9 (RunPod) | Excellent+ |
| Full FT + SFT | A100 80GB | ~8-12 hours | $25-35 (RunPod) | Best |
| QLoRA + SFT + DPO | RTX 3090 (24 GB) | ~6-10 hours | $6-10 (RunPod) | Superior alignment |
| QLoRA + ORPO | RTX 3090 (24 GB) | ~4-7 hours | $4-7 (RunPod) | Superior alignment |
Advanced: Combining Methods
The Recommended Training Pipeline
For maximum quality with reasonable resources, combine methods in stages: QLoRA + SFT for the generic model and each specialist, then optional DPO refinement, then evaluation and export.
Running the Full Pipeline
```bash
# Stage 1: Generic model
make train

# Stage 2: All specialists
python training/scripts/train_all_agents.py

# Stage 3 (optional): DPO refinement
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_dpo.yaml \
    --base-model training/output/generic/

# Stage 4: Evaluate + export
make train-eval
make train-serve-ollama
```
Next Steps
- Model Comparison Guide — compare Granite 4, Qwen 3, and Gemma 4 for SOC tasks
- Agent Model Selection — which model + method for each AuroraSOC agent
- Cloud Training Guide — train on RunPod, Lambda Labs, or vast.ai
- Evaluation & Export — benchmark and deploy your models