
Local GPU Training

This guide walks through fine-tuning IBM Granite 4 on a local machine with an NVIDIA GPU. This is the most common training method for teams that have dedicated GPU hardware.

Why Local Training?

  • Full control — your data never leaves your machine
  • Fast iteration — no upload/download cycles
  • Persistent checkpoints — pause and resume at any time
  • Direct deployment — export GGUF and import into local Ollama immediately
  • Required for air-gapped environments

Quick Start

# Step 1: Install dependencies
make train-install

# Step 2: Prepare training data
make train-data

# Step 3: Train the model
make train

# Step 4: Import into Ollama
make train-serve-ollama

That's it. After these four commands, your fine-tuned model is running in Ollama; once you enable it via environment variables (Step 5), the agent factory picks it up automatically.

Step-by-Step Guide

Step 1: Install Dependencies

make train-install

This runs pip install -e ".[training]" which installs:

  • unsloth — 2x faster LoRA training
  • torch — PyTorch with CUDA support
  • transformers — Hugging Face model loading
  • trl — Supervised Fine-Tuning Trainer
  • datasets — Dataset loading and processing
  • mamba_ssm + causal_conv1d — Mamba kernels for Granite 4 Hybrid models

First-Time Mamba Compilation

The Mamba state-space model kernels compile from source on first install. This requires a working CUDA toolkit and takes ~10 minutes. Subsequent installs are instant.

Step 2: Prepare Training Data

make train-data

This downloads public cybersecurity datasets (MITRE ATT&CK, Sigma rules, Atomic Red Team, etc.) and converts them into training format. See the Dataset Preparation guide for details.

Output: training/data/soc_train.jsonl — a JSONL file where each line is a conversation in Granite's chat template format.
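For illustration, one line of the JSONL might look like the following. The field names here are an assumption based on the common chat-messages layout, not a guarantee of the file's exact schema:

```python
import json

# Hypothetical training example in chat-messages form.
# The exact field names in soc_train.jsonl may differ.
example = {
    "messages": [
        {"role": "system", "content": "You are a SOC analyst assistant."},
        {"role": "user", "content": "Classify this Sigma rule: ..."},
        {"role": "assistant", "content": "This rule detects credential dumping (T1003)."},
    ]
}

# Each line of the JSONL file is one such conversation, serialized compactly.
line = json.dumps(example)
parsed = json.loads(line)
print(parsed["messages"][2]["role"])  # the assistant turn is what the loss trains on
```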

Step 3: Train the Model

# Train generic SOC model (all domains)
make train

# Or with custom config
python training/scripts/finetune_granite.py --config training/configs/granite_soc_finetune.yaml

What Happens During Training

  1. Model loading — Downloads unsloth/granite-4.0-h-tiny (or your configured variant) from Hugging Face and loads it in 4-bit quantization
  2. LoRA injection — Adds low-rank adapter matrices to attention + Mamba layers (adds ~1-2% parameters)
  3. Dataset loading — Reads training/data/soc_train.jsonl and applies the Granite chat template
  4. Response masking — Configures loss to only train on assistant tokens (not system/user prompts)
  5. Training loop — Runs the configured number of epochs with AdamW-8bit optimizer
  6. Checkpoint saving — Saves LoRA adapters every N steps to training/checkpoints/
  7. Final export — Saves LoRA adapters, and optionally merged FP16 + GGUF files
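Step 4 (response masking) can be illustrated with a minimal sketch. This is not Unsloth's implementation, just the underlying idea: labels for non-assistant tokens are set to -100, the index that PyTorch's cross-entropy loss ignores:

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy skips targets with this value

def mask_instruction_tokens(token_ids, assistant_mask):
    """Return training labels: assistant tokens kept, everything else ignored.

    token_ids: token ids for the full conversation
    assistant_mask: parallel list of bools, True where the token belongs
                    to an assistant completion.
    """
    return [tid if is_asst else IGNORE_INDEX
            for tid, is_asst in zip(token_ids, assistant_mask)]

# Toy example: 5 system/user prompt tokens followed by 3 assistant tokens.
tokens = [101, 7592, 2088, 102, 999, 2023, 2003, 102]
mask = [False] * 5 + [True] * 3
labels = mask_instruction_tokens(tokens, mask)
print(labels)  # first five entries are -100; the loss only sees the last three
```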

Monitoring Training Progress

The training script outputs a progress log:

2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:00 [INFO] AuroraSOC — Granite 4 Fine-Tuning Pipeline
2024-12-01 10:00:00 [INFO] Model: unsloth/granite-4.0-h-tiny | Agent: all
2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:15 [INFO] GPU: NVIDIA GeForce RTX 4090 | Memory: 3.2/24.0 GB (13.3%)
2024-12-01 10:00:20 [INFO] Dataset formatted: 8547 examples
2024-12-01 10:00:22 [INFO] Masking instruction tokens — training on assistant completions only
2024-12-01 10:00:22 [INFO] Starting training...
{'loss': 1.823, 'learning_rate': 2e-05, 'epoch': 0.04, 'step': 10}
{'loss': 1.512, 'learning_rate': 6.8e-05, 'epoch': 0.07, 'step': 20}
{'loss': 1.104, 'learning_rate': 1.2e-04, 'epoch': 0.11, 'step': 30}
...
{'loss': 0.423, 'learning_rate': 8.1e-06, 'epoch': 2.95, 'step': 800}

What to look for:

  • Loss should decrease — starts at ~1.5-2.0 and should drop to ~0.3-0.5
  • GPU memory — should stay within your VRAM limits
  • Learning rate — follows a cosine schedule (ramps up, then decays)
  • Steps/sec — higher is better. Unsloth typically achieves 2× throughput vs vanilla training
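A quick way to sanity-check a run is to parse those dict-style log lines and confirm the loss trend. A small sketch, assuming the trainer prints Python-literal dicts as in the excerpt above:

```python
import ast

log_lines = [
    "{'loss': 1.823, 'learning_rate': 2e-05, 'epoch': 0.04, 'step': 10}",
    "{'loss': 1.512, 'learning_rate': 6.8e-05, 'epoch': 0.07, 'step': 20}",
    "{'loss': 1.104, 'learning_rate': 1.2e-04, 'epoch': 0.11, 'step': 30}",
    "{'loss': 0.423, 'learning_rate': 8.1e-06, 'epoch': 2.95, 'step': 800}",
]

records = [ast.literal_eval(line) for line in log_lines]
losses = [r["loss"] for r in records]

# Healthy run: loss decreases overall and lands in the ~0.3-0.5 band.
assert losses == sorted(losses, reverse=True)
assert 0.3 <= losses[-1] <= 0.5
print(f"start={losses[0]:.3f} end={losses[-1]:.3f}")
```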

Adjusting Hyperparameters

Edit training/configs/granite_soc_finetune.yaml to tune training:

training:
  per_device_train_batch_size: 2   # Increase if VRAM allows (4, 8)
  gradient_accumulation_steps: 4   # Effective batch = batch_size × grad_accum
  num_train_epochs: 3              # 3 epochs is usually enough
  learning_rate: 2.0e-4            # 2e-4 is standard for LoRA
  warmup_ratio: 0.05               # 5% warmup steps
  weight_decay: 0.01               # Regularization
  lr_scheduler_type: cosine        # Cosine decay schedule
  optim: adamw_8bit                # 8-bit AdamW (saves ~30% memory)
  bf16: true                       # BFloat16 mixed precision
  logging_steps: 10                # Log every 10 steps
  save_steps: 200                  # Checkpoint every 200 steps
  save_total_limit: 3              # Keep only last 3 checkpoints
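The effective-batch comment above works out as follows with illustrative numbers; the 8000-example dataset size is an assumption for the sketch, not a real count:

```python
# Effective-batch arithmetic for the config values above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3
dataset_size = 8000  # assumed example size for illustration

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = dataset_size // effective_batch  # drops any final partial batch
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 1000 3000
```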

Common adjustments:

| Situation | What to Change | Why |
|---|---|---|
| Training is too slow | Increase batch_size | Better GPU utilization |
| Out of VRAM | Decrease batch_size to 1, use smaller model | Less memory per step |
| Loss plateaus early | Reduce learning_rate to 1e-4 | More stable convergence |
| Overfitting (loss bounces) | Reduce num_train_epochs to 1-2 | Less memorization |
| Not enough learning | Increase num_train_epochs to 5 | More passes over data |
| Small dataset (< 1000 samples) | Increase gradient_accumulation_steps to 8-16 | Larger effective batch compensates |

Step 4: Export and Deploy

After training completes, the LoRA adapters are saved to training/checkpoints/granite_soc_lora/.

Export to GGUF (for Ollama)

# Generate GGUF with Q8_0 quantization
python training/scripts/finetune_granite.py \
    --export-only training/checkpoints/granite_soc_lora

# Or enable auto-export in config:
# export:
# save_gguf: true
# gguf_quantization_methods: [q8_0, q4_k_m]

The GGUF file is saved to training/checkpoints/granite_soc_gguf/.

Import into Ollama

# Using serve_model.py
python training/scripts/serve_model.py ollama \
    --gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf \
    --name granite-soc:latest

# Or using Make
make train-serve-ollama

This command:

  1. Generates an Ollama Modelfile with the Granite 4 chat template
  2. Runs ollama create granite-soc:latest -f Modelfile
  3. Makes the model available as granite-soc:latest in Ollama
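The generated Modelfile is roughly of the following shape. This is a hypothetical sketch: the chat-template tokens shown (<|start_of_role|>, <|end_of_text|>) follow Granite's published chat format, but the exact file serve_model.py emits may differ:

```python
GGUF_PATH = "training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf"

# Hypothetical sketch of the Modelfile; the real serve_model.py may emit
# different PARAMETER lines or template details.
modelfile = f"""FROM {GGUF_PATH}
TEMPLATE \"\"\"<|start_of_role|>system<|end_of_role|>{{{{ .System }}}}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{{{{ .Prompt }}}}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>{{{{ .Response }}}}<|end_of_text|>\"\"\"
PARAMETER stop <|end_of_text|>
"""

print(modelfile.splitlines()[0])
# Next step would be: ollama create granite-soc:latest -f Modelfile
```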

Verify the Model

# List Ollama models
ollama list

# Test inference
ollama run granite-soc:latest "Analyze this Suricata alert: ET TROJAN Cobalt Strike Beacon..."

Step 5: Enable Fine-Tuned Models in AuroraSOC

By default, AuroraSOC uses base Granite models. To switch to your fine-tuned model:

# Toggle via Make
make enable-finetuned

# Or set environment variables manually
export GRANITE_USE_FINETUNED=true
export GRANITE_FINETUNED_MODEL_TAG=granite-soc:latest
export GRANITE_SERVING_BACKEND=ollama
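A minimal sketch of the selection logic those variables imply (illustrative only: the base tag granite4:latest and the function name are assumptions, not AuroraSOC's actual code):

```python
import os

BASE_MODEL_TAG = "granite4:latest"  # hypothetical default tag

def resolve_model_tag(env=None):
    """Pick the fine-tuned tag only when GRANITE_USE_FINETUNED is truthy."""
    env = os.environ if env is None else env
    use_finetuned = env.get("GRANITE_USE_FINETUNED", "false").lower() == "true"
    if use_finetuned:
        return env.get("GRANITE_FINETUNED_MODEL_TAG", "granite-soc:latest")
    return BASE_MODEL_TAG

print(resolve_model_tag({"GRANITE_USE_FINETUNED": "true",
                         "GRANITE_FINETUNED_MODEL_TAG": "granite-soc:latest"}))
# granite-soc:latest
```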

See the Plug-and-Play Model Swap guide for full details.

Resuming From Checkpoints

If training is interrupted, resume from the last checkpoint:

python training/scripts/finetune_granite.py \
    --resume training/checkpoints/granite_soc_lora/checkpoint-400

The trainer automatically finds the latest checkpoint in the directory and continues from that step.
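"Finds the latest checkpoint" usually means scanning for checkpoint-&lt;step&gt; directories and taking the highest step number. A sketch of that logic, assuming checkpoints follow the checkpoint-N naming shown above:

```python
from pathlib import Path
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best = None
    for path in Path(output_dir).iterdir():
        m = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and m:
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None

# Demo with a throwaway directory structure.
with tempfile.TemporaryDirectory() as d:
    for step in (200, 400):
        (Path(d) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(d).name)  # checkpoint-400
```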

Multiple Training Runs

To train with different configurations (e.g., comparing model sizes):

# Train with tiny model
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_finetune.yaml

# Train with small model (modify config first)
# Edit granite_soc_finetune.yaml: model.name: unsloth/granite-4.0-h-small
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_finetune.yaml

Or create separate config files for each variant:

cp training/configs/granite_soc_finetune.yaml training/configs/granite_soc_small.yaml
# Edit the copy to use granite-4.0-h-small
python training/scripts/finetune_granite.py --config training/configs/granite_soc_small.yaml

Troubleshooting

Common Issues

| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | VRAM insufficient for model + batch | Reduce batch_size to 1 or use a smaller model |
| ImportError: mamba_ssm | Mamba kernels not compiled | pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2 |
| Training data not found | Haven't run data prep | make train-data |
| Tokenizer error | Wrong model version | Ensure you are using unsloth/granite-4.0-h-* models |
| NaN loss | Learning rate too high | Reduce learning_rate to 1e-4 or lower |
| Training completes but GGUF export fails | Unsloth binary export issue | Export manually: model.save_pretrained_gguf(...) |

GPU Memory Optimization

If you're running out of VRAM:

  1. Reduce batch size to 1:

     training:
       per_device_train_batch_size: 1
       gradient_accumulation_steps: 8  # Increase to maintain effective batch

  2. Use a smaller model:

     model:
       name: unsloth/granite-4.0-micro  # Smallest variant

  3. Reduce LoRA rank:

     lora:
       r: 32  # Down from 64
       lora_alpha: 32

  4. Reduce sequence length:

     model:
       max_seq_length: 2048  # Down from 4096
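Why does a lower LoRA rank save memory? Each adapted weight matrix of shape (d_out, d_in) gains two factors of shape (d_out, r) and (r, d_in), so adapter parameters (and their gradients and optimizer state) scale linearly with r. A rough count, with dimensions that are assumptions for illustration rather than Granite's real ones:

```python
def lora_params(d_in, d_out, r, n_matrices):
    """Extra trainable parameters added by LoRA: (d_in + d_out) * r per matrix."""
    return (d_in + d_out) * r * n_matrices

# Assumed example dimensions, not Granite's actual architecture:
d = 2048          # hidden size
n_matrices = 200  # adapted projection matrices across all layers

for r in (64, 32):
    millions = lora_params(d, d, r, n_matrices) / 1e6
    print(f"r={r}: ~{millions:.1f}M trainable params")
# Halving r halves adapter parameters and their gradient/optimizer memory.
```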

Next Steps