Local GPU Training
This guide walks through fine-tuning IBM Granite 4 on a local machine with an NVIDIA GPU. This is the most common training method for teams that have dedicated GPU hardware.
Why Local Training?
- Full control — your data never leaves your machine
- Fast iteration — no upload/download cycles
- Persistent checkpoints — pause and resume at any time
- Direct deployment — export GGUF and import into local Ollama immediately
- Required for air-gapped environments
Quick Start
# Step 1: Install dependencies
make train-install
# Step 2: Prepare training data
make train-data
# Step 3: Train the model
make train
# Step 4: Import into Ollama
make train-serve-ollama
That's it. After these four commands, your fine-tuned model is running in Ollama, and the agent factory will use it automatically once enabled via the environment variables described in Step 5.
Step-by-Step Guide
Step 1: Install Dependencies
make train-install
This runs pip install -e ".[training]", which installs:
- unsloth — 2x faster LoRA training
- torch — PyTorch with CUDA support
- transformers — Hugging Face model loading
- trl — Hugging Face TRL, which provides the supervised fine-tuning trainer (SFTTrainer)
- datasets — Dataset loading and processing
- mamba_ssm + causal_conv1d — Mamba kernels for Granite 4 Hybrid models
The Mamba state-space model kernels compile from source on first install. This requires a working CUDA toolkit and takes ~10 minutes. Subsequent installs are instant.
Step 2: Prepare Training Data
make train-data
This downloads public cybersecurity datasets (MITRE ATT&CK, Sigma rules, Atomic Red Team, etc.) and converts them into training format. See the Dataset Preparation guide for details.
Output: training/data/soc_train.jsonl — a JSONL file where each line is a conversation in Granite's chat template format.
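A quick sanity check of the file's structure can catch malformed examples before a long training run. The sketch below assumes each JSONL line holds a {"messages": [...]} conversation of role/content pairs; the exact schema produced by make train-data may differ:

```python
import json

def validate_example(line: str) -> bool:
    """Check that one JSONL line has the expected chat structure."""
    record = json.loads(line)
    messages = record.get("messages", [])
    # Every turn needs a role and content, and at least one assistant
    # turn must exist, since loss is computed on assistant tokens only.
    return (
        bool(messages)
        and all({"role", "content"} <= set(m) for m in messages)
        and any(m["role"] == "assistant" for m in messages)
    )

sample = json.dumps({"messages": [
    {"role": "system", "content": "You are a SOC analyst assistant."},
    {"role": "user", "content": "Summarize this Sigma rule."},
    {"role": "assistant", "content": "The rule detects credential dumping."},
]})
print(validate_example(sample))  # True
```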
Step 3: Train the Model
# Train generic SOC model (all domains)
make train
# Or with custom config
python training/scripts/finetune_granite.py --config training/configs/granite_soc_finetune.yaml
What Happens During Training
- Model loading — Downloads unsloth/granite-4.0-h-tiny (or your configured variant) from Hugging Face and loads it in 4-bit quantization
- LoRA injection — Adds low-rank adapter matrices to attention + Mamba layers (adds ~1-2% parameters)
- Dataset loading — Reads training/data/soc_train.jsonl and applies the Granite chat template
- Response masking — Configures the loss to train only on assistant tokens (not system/user prompts)
- Training loop — Runs the configured number of epochs with the AdamW-8bit optimizer
- Checkpoint saving — Saves LoRA adapters every N steps to training/checkpoints/
- Final export — Saves LoRA adapters and, optionally, merged FP16 + GGUF files
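The response-masking step can be sketched in a few lines of plain Python: positions labeled -100 are ignored by PyTorch's cross-entropy loss, so only the assistant spans contribute gradient. This is an illustration of the idea, not the pipeline's actual implementation:

```python
IGNORE_INDEX = -100  # label value PyTorch's cross-entropy loss skips

def mask_non_assistant(token_ids, assistant_spans):
    """Return labels that train only on assistant completions.

    assistant_spans: (start, end) index ranges (end exclusive) covering
    the assistant turns inside the tokenized conversation.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# 10 tokens where positions 6..9 hold the assistant reply
tokens = list(range(100, 110))
labels = mask_non_assistant(tokens, [(6, 10)])
print(labels)  # [-100, -100, -100, -100, -100, -100, 106, 107, 108, 109]
```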
Monitoring Training Progress
The training script outputs a progress log:
2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:00 [INFO] AuroraSOC — Granite 4 Fine-Tuning Pipeline
2024-12-01 10:00:00 [INFO] Model: unsloth/granite-4.0-h-tiny | Agent: all
2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:15 [INFO] GPU: NVIDIA GeForce RTX 4090 | Memory: 3.2/24.0 GB (13.3%)
2024-12-01 10:00:20 [INFO] Dataset formatted: 8547 examples
2024-12-01 10:00:22 [INFO] Masking instruction tokens — training on assistant completions only
2024-12-01 10:00:22 [INFO] Starting training...
{'loss': 1.823, 'learning_rate': 2e-05, 'epoch': 0.04, 'step': 10}
{'loss': 1.512, 'learning_rate': 6.8e-05, 'epoch': 0.07, 'step': 20}
{'loss': 1.104, 'learning_rate': 1.2e-04, 'epoch': 0.11, 'step': 30}
...
{'loss': 0.423, 'learning_rate': 8.1e-06, 'epoch': 2.95, 'step': 800}
What to look for:
- Loss should decrease — starts at ~1.5-2.0 and should drop to ~0.3-0.5
- GPU memory — should stay within your VRAM limits
- Learning rate — follows a cosine schedule (ramps up, then decays)
- Steps/sec — higher is better. Unsloth typically achieves 2× throughput vs vanilla training
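The learning-rate values in the log above (climbing to ~1.2e-4, then down to 8.1e-6 near the end) are what a warmup-then-cosine schedule produces. A minimal sketch of such a schedule, assuming linear warmup and decay to zero:

```python
import math

def lr_at_step(step, total_steps, peak_lr=2.0e-4, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 800  # step count from the run above
print(lr_at_step(0, total))    # 0.0, start of warmup
print(lr_at_step(40, total))   # 0.0002, peak at the 5% mark
print(lr_at_step(800, total))  # 0.0, fully decayed
```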
Adjusting Hyperparameters
Edit training/configs/granite_soc_finetune.yaml to tune training:
training:
per_device_train_batch_size: 2 # Increase if VRAM allows (4, 8)
gradient_accumulation_steps: 4 # Effective batch = batch_size × grad_accum
num_train_epochs: 3 # 3 epochs is usually enough
learning_rate: 2.0e-4 # 2e-4 is standard for LoRA
warmup_ratio: 0.05 # 5% warmup steps
weight_decay: 0.01 # Regularization
lr_scheduler_type: cosine # Cosine decay schedule
optim: adamw_8bit # 8-bit AdamW (saves ~30% memory)
bf16: true # BFloat16 mixed precision
logging_steps: 10 # Log every 10 steps
save_steps: 200 # Checkpoint every 200 steps
save_total_limit: 3 # Keep only last 3 checkpoints
Common adjustments:
| Situation | What to Change | Why |
|---|---|---|
| Training is too slow | Increase batch_size | Better GPU utilization |
| Out of VRAM | Decrease batch_size to 1, or use a smaller model | Less memory per step |
| Loss plateaus early | Reduce learning_rate to 1e-4 | More stable convergence |
| Overfitting (loss bounces) | Reduce num_train_epochs to 1-2 | Less memorization |
| Not enough learning | Increase num_train_epochs to 5 | More passes over data |
| Small dataset (< 1000 samples) | Increase gradient_accumulation_steps to 8-16 | Larger effective batch compensates |
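Several rows in the table trade batch size against gradient accumulation. The quantity that matters for optimization is the effective batch size, which is a simple product:

```python
def effective_batch_size(per_device_batch: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples seen per optimizer step; the gradient is averaged
    over this many examples before each weight update."""
    return per_device_batch * grad_accum * num_gpus

# Defaults from granite_soc_finetune.yaml: 2 x 4 = 8
print(effective_batch_size(2, 4))   # 8
# Small-dataset row from the table: batch 1 with grad_accum 16
print(effective_batch_size(1, 16))  # 16
```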
Step 4: Export and Deploy
After training completes, the LoRA adapters are saved to training/checkpoints/granite_soc_lora/.
Export to GGUF (for Ollama)
# Generate GGUF with Q8_0 quantization
python training/scripts/finetune_granite.py \
--export-only training/checkpoints/granite_soc_lora
# Or enable auto-export in config:
# export:
# save_gguf: true
# gguf_quantization_methods: [q8_0, q4_k_m]
The GGUF file is saved to training/checkpoints/granite_soc_gguf/.
Import into Ollama
# Using serve_model.py
python training/scripts/serve_model.py ollama \
--gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf \
--name granite-soc:latest
# Or using Make
make train-serve-ollama
This:
- Generates an Ollama Modelfile with the Granite 4 chat template
- Runs ollama create granite-soc:latest -f Modelfile
- Makes the model available as granite-soc:latest in Ollama
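As a rough illustration of what that Modelfile generation involves, here is a minimal writer. The TEMPLATE line is a simplified stand-in, not the actual Granite 4 chat template that serve_model.py emits, and the temperature value is an arbitrary example:

```python
import tempfile
from pathlib import Path

def write_modelfile(gguf_path: str, out_dir: str) -> Path:
    """Write a minimal Ollama Modelfile pointing at a local GGUF."""
    modelfile = Path(out_dir) / "Modelfile"
    modelfile.write_text(
        f"FROM {gguf_path}\n"
        'TEMPLATE """{{ .System }}{{ .Prompt }}"""\n'  # simplified stand-in
        "PARAMETER temperature 0.2\n"
    )
    return modelfile

path = write_modelfile(
    "training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf",
    tempfile.mkdtemp(),
)
print(path.read_text().splitlines()[0])
# FROM training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf
```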
Verify the Model
# List Ollama models
ollama list
# Test inference
ollama run granite-soc:latest "Analyze this Suricata alert: ET TROJAN Cobalt Strike Beacon..."
Step 5: Enable Fine-Tuned Models in AuroraSOC
By default, AuroraSOC uses base Granite models. To switch to your fine-tuned model:
# Toggle via Make
make enable-finetuned
# Or set environment variables manually
export GRANITE_USE_FINETUNED=true
export GRANITE_FINETUNED_MODEL_TAG=granite-soc:latest
export GRANITE_SERVING_BACKEND=ollama
See the Plug-and-Play Model Swap guide for full details.
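Application code can honor this toggle with a small lookup. The variable names below come from this guide; the base-model fallback tag (granite4:tiny) is a hypothetical placeholder:

```python
import os

def resolve_model_tag(default: str = "granite4:tiny") -> str:
    """Pick the serving model tag based on the toggle variables above.

    default is a hypothetical base-model tag for illustration only.
    """
    if os.environ.get("GRANITE_USE_FINETUNED", "").lower() == "true":
        return os.environ.get("GRANITE_FINETUNED_MODEL_TAG", "granite-soc:latest")
    return default

os.environ["GRANITE_USE_FINETUNED"] = "true"
os.environ["GRANITE_FINETUNED_MODEL_TAG"] = "granite-soc:latest"
print(resolve_model_tag())  # granite-soc:latest
```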
Resuming From Checkpoints
If training is interrupted, resume from the last checkpoint:
python training/scripts/finetune_granite.py \
--resume training/checkpoints/granite_soc_lora/checkpoint-400
The trainer automatically finds the latest checkpoint in the directory and continues from that step.
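A sketch of how "latest checkpoint" resolution typically works for Hugging Face style checkpoint-<step> directories; the actual trainer logic may differ:

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step."""
    candidates = [
        p for p in Path(run_dir).glob("checkpoint-*")
        if p.is_dir() and re.fullmatch(r"checkpoint-\d+", p.name)
    ]
    if not candidates:
        return None
    # Compare by numeric step, not lexicographically, so
    # checkpoint-1000 beats checkpoint-400.
    return max(candidates, key=lambda p: int(p.name.rsplit("-", 1)[1]))

# With checkpoint-200 and checkpoint-400 on disk, the 400-step
# directory wins; that is the one --resume should point at.
```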
Multiple Training Runs
To train with different configurations (e.g., comparing model sizes):
# Train with tiny model
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml
# Train with small model (modify config first)
# Edit granite_soc_finetune.yaml: model.name: unsloth/granite-4.0-h-small
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml
Or create separate config files for each variant:
cp training/configs/granite_soc_finetune.yaml training/configs/granite_soc_small.yaml
# Edit the copy to use granite-4.0-h-small
python training/scripts/finetune_granite.py --config training/configs/granite_soc_small.yaml
Troubleshooting
Common Issues
| Error | Cause | Fix |
|---|---|---|
CUDA out of memory | VRAM insufficient for model+batch | Reduce batch_size to 1, or use a smaller model |
ImportError: mamba_ssm | Mamba kernels not compiled | pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2 |
Training data not found | Haven't run data prep | make train-data |
Tokenizer error | Wrong model version | Ensure using unsloth/granite-4.0-h-* models |
NaN loss | Learning rate too high | Reduce learning_rate to 1e-4 or lower |
| Training completes but GGUF export fails | Unsloth binary export issue | Export manually: model.save_pretrained_gguf(...) |
GPU Memory Optimization
If you're running out of VRAM:
- Reduce batch size to 1:
  training:
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8  # Increase to maintain effective batch
- Use a smaller model:
  model:
    name: unsloth/granite-4.0-micro  # Smallest variant
- Reduce LoRA rank:
  lora:
    r: 32  # Down from 64
    lora_alpha: 32
- Reduce sequence length:
  model:
    max_seq_length: 2048  # Down from 4096
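Halving the LoRA rank roughly halves adapter memory, because each adapted matrix contributes r × (d_in + d_out) extra parameters. A quick check, using hypothetical projection shapes:

```python
def lora_params(r: int, shapes) -> int:
    """Parameters added by LoRA: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Four hypothetical 4096x4096 projection matrices
shapes = [(4096, 4096)] * 4
print(lora_params(64, shapes))  # 2097152 adapter weights at r=64
print(lora_params(32, shapes))  # 1048576, exactly half at r=32
```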
Next Steps
- Evaluate your model — run benchmarks before deploying
- Train per-agent specialists — if you want domain-specific models
- Docker Training — for reproducible, containerized training