
Local GPU Training

This guide walks through fine-tuning IBM Granite 4 on a local machine with an NVIDIA GPU. This is the most common training method for teams that have dedicated GPU hardware.

Why Local Training?

  • Full control — your data never leaves your machine
  • Fast iteration — no upload/download cycles
  • Persistent checkpoints — pause and resume at any time
  • Direct deployment — export GGUF and import into local Ollama immediately
  • Required for air-gapped environments

Quick Start

# Step 1: Install dependencies
make train-install

# Step 2: Prepare training data
make train-data

# Step 3: Train the model
make train

# Step 4: Import into Ollama
make train-serve-ollama

That's it. After these four commands, your fine-tuned model is running in Ollama; once you enable it via environment variables (Step 5), the agent factory picks it up automatically.

Step-by-Step Guide

Step 1: Install Dependencies

make train-install

This runs pip install -e ".[training]" which installs:

  • unsloth — 2x faster LoRA training
  • torch — PyTorch with CUDA support
  • transformers — Hugging Face model loading
  • trl — Supervised Fine-Tuning Trainer
  • datasets — Dataset loading and processing
  • mamba_ssm + causal_conv1d — Mamba kernels for Granite 4 Hybrid models

First-Time Mamba Compilation

The Mamba state-space model kernels compile from source on first install. This requires a working CUDA toolkit and takes ~10 minutes. Subsequent installs are instant.

Step 2: Prepare Training Data

make train-data

This downloads public cybersecurity datasets (MITRE ATT&CK, Sigma rules, Atomic Red Team, etc.) and converts them into training format. See the Dataset Preparation guide for details.

Output: training/data/soc_train.jsonl — a JSONL file where each line is a conversation in Granite's chat template format.
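For illustration, one line of the JSONL might look like the following. The field names here are an assumption based on the common chat-messages layout, not a guarantee of the file's exact schema:

```python
import json

# Hypothetical training example in chat-messages form.
# The exact field names in soc_train.jsonl may differ.
example = {
    "messages": [
        {"role": "system", "content": "You are a SOC analyst assistant."},
        {"role": "user", "content": "Classify this Sigma rule: ..."},
        {"role": "assistant", "content": "This rule detects credential dumping (T1003)."},
    ]
}

# Each line of the JSONL file is one such conversation, serialized compactly.
line = json.dumps(example)
parsed = json.loads(line)
print(parsed["messages"][2]["role"])  # the assistant turn is what the loss trains on
```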

Step 3: Train the Model

# Train generic SOC model (all domains)
make train

# Or with custom config
python training/scripts/finetune_granite.py --config training/configs/granite_soc_finetune.yaml

What Happens During Training

  1. Model loading — Downloads unsloth/granite-4.0-h-tiny (or your configured variant) from Hugging Face and loads it in 4-bit quantization
  2. LoRA injection — Adds low-rank adapter matrices to attention + Mamba layers (adds ~1-2% parameters)
  3. Dataset loading — Reads training/data/soc_train.jsonl and applies the Granite chat template
  4. Response masking — Configures loss to only train on assistant tokens (not system/user prompts)
  5. Training loop — Runs the configured number of epochs with AdamW-8bit optimizer
  6. Checkpoint saving — Saves LoRA adapters every N steps to training/checkpoints/
  7. Final export — Saves LoRA adapters, and optionally merged FP16 + GGUF files
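Step 4 (response masking) can be illustrated with a minimal sketch. This is not Unsloth's implementation, just the underlying idea: labels for non-assistant tokens are set to -100, the index that PyTorch's cross-entropy loss ignores:

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy skips targets with this value

def mask_instruction_tokens(token_ids, assistant_mask):
    """Return training labels: assistant tokens kept, everything else ignored.

    token_ids: token ids for the full conversation
    assistant_mask: parallel list of bools, True where the token belongs
                    to an assistant completion.
    """
    return [tid if is_asst else IGNORE_INDEX
            for tid, is_asst in zip(token_ids, assistant_mask)]

# Toy example: 5 system/user prompt tokens followed by 3 assistant tokens.
tokens = [101, 7592, 2088, 102, 999, 2023, 2003, 102]
mask = [False] * 5 + [True] * 3
labels = mask_instruction_tokens(tokens, mask)
print(labels)  # first five entries are -100; the loss only sees the last three
```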

Monitoring Training Progress

The training script outputs a progress log:

2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:00 [INFO] AuroraSOC — Granite 4 Fine-Tuning Pipeline
2024-12-01 10:00:00 [INFO] Model: unsloth/granite-4.0-h-tiny | Agent: all
2024-12-01 10:00:00 [INFO] ════════════════════════════════════════════════════════
2024-12-01 10:00:15 [INFO] GPU: NVIDIA GeForce RTX 4090 | Memory: 3.2/24.0 GB (13.3%)
2024-12-01 10:00:20 [INFO] Dataset formatted: 8547 examples
2024-12-01 10:00:22 [INFO] Masking instruction tokens — training on assistant completions only
2024-12-01 10:00:22 [INFO] Starting training...
{'loss': 1.823, 'learning_rate': 2e-05, 'epoch': 0.04, 'step': 10}
{'loss': 1.512, 'learning_rate': 6.8e-05, 'epoch': 0.07, 'step': 20}
{'loss': 1.104, 'learning_rate': 1.2e-04, 'epoch': 0.11, 'step': 30}
...
{'loss': 0.423, 'learning_rate': 8.1e-06, 'epoch': 2.95, 'step': 800}

What to look for:

  • Loss should decrease — starts at ~1.5-2.0 and should drop to ~0.3-0.5
  • GPU memory — should stay within your VRAM limits
  • Learning rate — follows a cosine schedule (ramps up, then decays)
  • Steps/sec — higher is better. Unsloth typically achieves 2× throughput vs vanilla training
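A quick way to sanity-check a run is to parse those dict-style log lines and confirm the loss trend. A small sketch, assuming the trainer prints Python-literal dicts as in the excerpt above:

```python
import ast

log_lines = [
    "{'loss': 1.823, 'learning_rate': 2e-05, 'epoch': 0.04, 'step': 10}",
    "{'loss': 1.512, 'learning_rate': 6.8e-05, 'epoch': 0.07, 'step': 20}",
    "{'loss': 1.104, 'learning_rate': 1.2e-04, 'epoch': 0.11, 'step': 30}",
    "{'loss': 0.423, 'learning_rate': 8.1e-06, 'epoch': 2.95, 'step': 800}",
]

records = [ast.literal_eval(line) for line in log_lines]
losses = [r["loss"] for r in records]

# Healthy run: loss decreases overall and lands in the ~0.3-0.5 band.
assert losses == sorted(losses, reverse=True)
assert 0.3 <= losses[-1] <= 0.5
print(f"start={losses[0]:.3f} end={losses[-1]:.3f}")
```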

Adjusting Hyperparameters

Edit training/configs/granite_soc_finetune.yaml to tune training:

training:
  per_device_train_batch_size: 2   # Increase if VRAM allows (4, 8)
  gradient_accumulation_steps: 4   # Effective batch = batch_size × grad_accum
  num_train_epochs: 3              # 3 epochs is usually enough
  learning_rate: 2.0e-4            # 2e-4 is standard for LoRA
  warmup_ratio: 0.05               # 5% warmup steps
  weight_decay: 0.01               # Regularization
  lr_scheduler_type: cosine        # Cosine decay schedule
  optim: adamw_8bit                # 8-bit AdamW (saves ~30% memory)
  bf16: true                       # BFloat16 mixed precision
  logging_steps: 10                # Log every 10 steps
  save_steps: 200                  # Checkpoint every 200 steps
  save_total_limit: 3              # Keep only last 3 checkpoints
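The effective-batch comment above works out as follows with illustrative numbers; the 8000-example dataset size is an assumption for the sketch, not a real count:

```python
# Effective-batch arithmetic for the config values above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3
dataset_size = 8000  # assumed example size for illustration

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = dataset_size // effective_batch  # drops any final partial batch
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 1000 3000
```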

Common adjustments:

| Situation | What to Change | Why |
|---|---|---|
| Training is too slow | Increase batch_size | Better GPU utilization |
| Out of VRAM | Decrease batch_size to 1, use smaller model | Less memory per step |
| Loss plateaus early | Reduce learning_rate to 1e-4 | More stable convergence |
| Overfitting (loss bounces) | Reduce num_train_epochs to 1-2 | Less memorization |
| Not enough learning | Increase num_train_epochs to 5 | More passes over data |
| Small dataset (< 1000 samples) | Increase gradient_accumulation_steps to 8-16 | Larger effective batch compensates |

Step 4: Export and Deploy

After training completes, the LoRA adapters are saved to training/checkpoints/granite_soc_lora/.

Export to GGUF (for Ollama)

# Generate GGUF with Q8_0 quantization
python training/scripts/finetune_granite.py \
    --export-only training/checkpoints/granite_soc_lora

# Or enable auto-export in config:
# export:
# save_gguf: true
# gguf_quantization_methods: [q8_0, q4_k_m]

The GGUF file is saved to training/checkpoints/granite_soc_gguf/.

Import into Ollama

# Using serve_model.py
python training/scripts/serve_model.py ollama \
    --gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf \
    --name granite-soc:latest

# Or using Make
make train-serve-ollama

This command:

  1. Generates an Ollama Modelfile with the Granite 4 chat template
  2. Runs ollama create granite-soc:latest -f Modelfile
  3. Makes the model available as granite-soc:latest in Ollama
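The generated Modelfile is roughly of the following shape. This is a hypothetical sketch: the chat-template tokens shown (<|start_of_role|>, <|end_of_text|>) follow Granite's published chat format, but the exact file serve_model.py emits may differ:

```python
GGUF_PATH = "training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf"

# Hypothetical sketch of the Modelfile; the real serve_model.py may emit
# different PARAMETER lines or template details.
modelfile = f"""FROM {GGUF_PATH}
TEMPLATE \"\"\"<|start_of_role|>system<|end_of_role|>{{{{ .System }}}}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{{{{ .Prompt }}}}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>{{{{ .Response }}}}<|end_of_text|>\"\"\"
PARAMETER stop <|end_of_text|>
"""

print(modelfile.splitlines()[0])
# Next step would be: ollama create granite-soc:latest -f Modelfile
```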

Verify the Model

# List Ollama models
ollama list

# Test inference
ollama run granite-soc:latest "Analyze this Suricata alert: ET TROJAN Cobalt Strike Beacon..."

Step 5: Enable Fine-Tuned Models in AuroraSOC

By default, AuroraSOC uses base Granite models. To switch to your fine-tuned model:

# Toggle via Make
make enable-finetuned

# Or set environment variables manually
export GRANITE_USE_FINETUNED=true
export GRANITE_FINETUNED_MODEL_TAG=granite-soc:latest
export GRANITE_SERVING_BACKEND=ollama
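A minimal sketch of the selection logic those variables imply (illustrative only: the base tag granite4:latest and the function name are assumptions, not AuroraSOC's actual code):

```python
import os

BASE_MODEL_TAG = "granite4:latest"  # hypothetical default tag

def resolve_model_tag(env=None):
    """Pick the fine-tuned tag only when GRANITE_USE_FINETUNED is truthy."""
    env = os.environ if env is None else env
    use_finetuned = env.get("GRANITE_USE_FINETUNED", "false").lower() == "true"
    if use_finetuned:
        return env.get("GRANITE_FINETUNED_MODEL_TAG", "granite-soc:latest")
    return BASE_MODEL_TAG

print(resolve_model_tag({"GRANITE_USE_FINETUNED": "true",
                         "GRANITE_FINETUNED_MODEL_TAG": "granite-soc:latest"}))
# granite-soc:latest
```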

See the Plug-and-Play Model Swap guide for full details.

Resuming From Checkpoints

If training is interrupted, resume from the last checkpoint:

python training/scripts/finetune_granite.py \
    --resume training/checkpoints/granite_soc_lora/checkpoint-400

The trainer automatically finds the latest checkpoint in the directory and continues from that step.
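"Finds the latest checkpoint" usually means scanning for checkpoint-&lt;step&gt; directories and taking the highest step number. A sketch of that logic, assuming checkpoints follow the checkpoint-N naming shown above:

```python
from pathlib import Path
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best = None
    for path in Path(output_dir).iterdir():
        m = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and m:
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None

# Demo with a throwaway directory structure.
with tempfile.TemporaryDirectory() as d:
    for step in (200, 400):
        (Path(d) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(d).name)  # checkpoint-400
```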

Multiple Training Runs

To train with different configurations (e.g., comparing model sizes):

# Train with tiny model
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_finetune.yaml

# Train with small model (modify config first)
# Edit granite_soc_finetune.yaml: model.name: unsloth/granite-4.0-h-small
python training/scripts/finetune_granite.py \
    --config training/configs/granite_soc_finetune.yaml

Or create separate config files for each variant:

cp training/configs/granite_soc_finetune.yaml training/configs/granite_soc_small.yaml
# Edit the copy to use granite-4.0-h-small
python training/scripts/finetune_granite.py --config training/configs/granite_soc_small.yaml

Troubleshooting

Common Issues

| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | VRAM insufficient for model + batch | Reduce batch_size to 1 or use a smaller model |
| ImportError: mamba_ssm | Mamba kernels not compiled | pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2 |
| Training data not found | Haven't run data prep | make train-data |
| Tokenizer error | Wrong model version | Ensure you are using unsloth/granite-4.0-h-* models |
| NaN loss | Learning rate too high | Reduce learning_rate to 1e-4 or lower |
| Training completes but GGUF export fails | Unsloth binary export issue | Export manually: model.save_pretrained_gguf(...) |

GPU Memory Optimization

If you're running out of VRAM:

  1. Reduce batch size to 1:

     training:
       per_device_train_batch_size: 1
       gradient_accumulation_steps: 8  # Increase to maintain effective batch

  2. Use a smaller model:

     model:
       name: unsloth/granite-4.0-micro  # Smallest variant

  3. Reduce LoRA rank:

     lora:
       r: 32  # Down from 64
       lora_alpha: 32

  4. Reduce sequence length:

     model:
       max_seq_length: 2048  # Down from 4096
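Why does a lower LoRA rank save memory? Each adapted weight matrix of shape (d_out, d_in) gains two factors of shape (d_out, r) and (r, d_in), so adapter parameters (and their gradients and optimizer state) scale linearly with r. A rough count, with dimensions that are assumptions for illustration rather than Granite's real ones:

```python
def lora_params(d_in, d_out, r, n_matrices):
    """Extra trainable parameters added by LoRA: (d_in + d_out) * r per matrix."""
    return (d_in + d_out) * r * n_matrices

# Assumed example dimensions, not Granite's actual architecture:
d = 2048          # hidden size
n_matrices = 200  # adapted projection matrices across all layers

for r in (64, 32):
    millions = lora_params(d, d, r, n_matrices) / 1e6
    print(f"r={r}: ~{millions:.1f}M trainable params")
# Halving r halves adapter parameters and their gradient/optimizer memory.
```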

Next Steps