# Docker-Based Training
Docker training provides a reproducible, isolated environment for fine-tuning Granite models. This is ideal for teams that want to version their training environment, run training in CI/CD, or avoid polluting their host system with CUDA/Python dependencies.
## Why Use Docker for Training?
| Benefit | Details |
|---|---|
| Reproducibility | Same container = same results, regardless of host OS |
| Isolation | Training deps don't interfere with production Python |
| CI/CD integration | Run training as a pipeline stage |
| Multi-GPU | Docker Compose can orchestrate multi-container training |
| Versioned environment | Pin exact CUDA, PyTorch, Unsloth versions |
## Prerequisites
- Docker ≥ 24.0
- Docker Compose ≥ 2.20
- NVIDIA Container Toolkit (for GPU passthrough)
- NVIDIA GPU with ≥ 8 GB VRAM
Verify GPU access in Docker:
```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
## Quick Start

```bash
# Train generic SOC model
make train-docker

# Train specific agent
make train-agent-docker AGENT=security_analyst

# Prepare data (in Docker)
make train-data-docker
```
## Docker Compose Configuration

The training infrastructure is defined in `docker-compose.training.yml`:
```yaml
services:
  # ── Data Preparation ──────────────────────────────────────
  prepare-data:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: python training/scripts/prepare_datasets.py
    volumes:
      - ./training/data:/app/training/data
    # No GPU needed for data prep

  # ── Generic Model Training ────────────────────────────────
  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

  # ── Per-Agent Training ────────────────────────────────────
  training-agent:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
      --agent ${AGENT:-security_analyst}
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Evaluation ────────────────────────────────────────────
  eval:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/evaluate_model.py
      --model training/checkpoints/granite_soc_lora
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── vLLM Serving ──────────────────────────────────────────
  vllm:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py vllm
      --model training/checkpoints/granite_soc_merged_16bit
    ports:
      - "8000:8000"
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Ollama Import ─────────────────────────────────────────
  ollama-import:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py ollama
      --gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf
      --name granite-soc:latest
    volumes:
      - ./training:/app/training
      - /usr/share/ollama:/usr/share/ollama
    network_mode: host
```
## Understanding the Services

| Service | Purpose | GPU Required | When to Use |
|---|---|---|---|
| `prepare-data` | Downloads and transforms SOC datasets | No | Once, before first training |
| `training` | Trains generic SOC model (all domains) | Yes | Main training run |
| `training-agent` | Trains per-agent specialist model | Yes | Per-agent fine-tuning |
| `eval` | Runs evaluation benchmarks | Yes | After training, before deployment |
| `vllm` | Serves merged FP16 model via vLLM | Yes | Production serving (high throughput) |
| `ollama-import` | Imports GGUF into host Ollama | No | Local deployment |
## The Training Dockerfile

`Dockerfile.training` builds the training environment:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# System deps
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY pyproject.toml .
COPY training/ training/
COPY aurorasoc/ aurorasoc/

# Install training dependencies
RUN pip install -e ".[training]"

# Compile Mamba kernels (cached in Docker layer)
RUN pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2
```
Key design decisions:

- Uses the CUDA 12.1 `devel` image (not `runtime`) because the Mamba kernels need `nvcc` to compile
- Mamba compilation is a separate layer, so it's cached across rebuilds (see the build sketch below)
- Training data is mounted as a volume, not copied into the image
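The caching shows up on rebuilds: the first build compiles the Mamba kernels (slow), while later builds reuse every layer that hasn't changed. A minimal sketch:

```bash
# First build is slow (kernel compilation); later builds reuse cached layers
docker compose -f docker-compose.training.yml build training

# Force a clean rebuild if the cache is stale or suspect
docker compose -f docker-compose.training.yml build --no-cache training
```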
## Running Training

### Step 1: Prepare Data

```bash
docker compose -f docker-compose.training.yml run --rm prepare-data
```
This downloads and prepares all SOC datasets. The data is persisted to `./training/data/` via the volume mount.
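To sanity-check the result (the exact filenames depend on the prepare script):

```bash
ls -lh training/data/
```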
### Step 2: Train

```bash
# Generic model
docker compose -f docker-compose.training.yml run --rm training

# Specific agent
AGENT=threat_hunter docker compose -f docker-compose.training.yml run --rm training-agent
```
Training output (LoRA adapters, GGUF files) is saved to `./training/checkpoints/` via the volume mount.
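After a successful run you can confirm the artifacts landed on the host. The directory names below are the ones the other compose services expect; the exact set depends on your export settings:

```bash
ls training/checkpoints/
# e.g.:
#   granite_soc_lora/           LoRA adapters (used by the eval service)
#   granite_soc_merged_16bit/   merged FP16 model (used by the vllm service)
#   granite_soc_gguf/           GGUF export (used by ollama-import)
```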
### Step 3: Evaluate

```bash
docker compose -f docker-compose.training.yml run --rm eval
```
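Because `docker compose run` accepts a command override, you can also point the evaluator at a different checkpoint, such as a per-agent one (the path below is a placeholder):

```bash
docker compose -f docker-compose.training.yml run --rm eval \
  python training/scripts/evaluate_model.py \
  --model training/checkpoints/<your-checkpoint>
```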
### Step 4: Deploy
To import the GGUF into your host Ollama instance:
```bash
docker compose -f docker-compose.training.yml run --rm ollama-import
```
This uses `network_mode: host` so the container can reach the host's Ollama service directly.
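Assuming the Ollama CLI is installed on the host, you can verify the import and smoke-test the model (the prompt is just an example):

```bash
# Confirm the model is registered
ollama list | grep granite-soc

# Quick smoke test
ollama run granite-soc:latest "Summarize the MITRE ATT&CK tactic 'Lateral Movement'."
```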
## Custom Configuration

### Overriding Training Parameters
Pass custom parameters via environment variables:
```bash
docker compose -f docker-compose.training.yml run \
  -e GRANITE_MODEL_NAME=unsloth/granite-4.0-h-small \
  -e GRANITE_EPOCHS=5 \
  -e GRANITE_BATCH_SIZE=4 \
  --rm training
```
### Using a Custom Config File
Mount your custom config:
```bash
docker compose -f docker-compose.training.yml run \
  -v "$(pwd)/my-custom-config.yaml:/app/training/configs/granite_soc_finetune.yaml" \
  --rm training
```
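For orientation, here is a hypothetical minimal config. The real schema is whatever `training/configs/granite_soc_finetune.yaml` defines; the keys below merely mirror the environment variables shown above:

```yaml
# my-custom-config.yaml (hypothetical keys; check the shipped config for the real schema)
model_name: unsloth/granite-4.0-h-small
epochs: 5
batch_size: 4
```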
## Multi-GPU Training
For machines with multiple GPUs, specify which GPU to use:
```bash
# Use GPU 1 only (-e overrides the NVIDIA_VISIBLE_DEVICES=all set in the compose file)
docker compose -f docker-compose.training.yml run --rm \
  -e NVIDIA_VISIBLE_DEVICES=1 training

# Use GPUs 0 and 1
docker compose -f docker-compose.training.yml run --rm \
  -e NVIDIA_VISIBLE_DEVICES=0,1 training
```
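To pin specific GPUs without per-run flags, a compose override file can replace `count` with `device_ids` (a sketch; the filename `docker-compose.gpu.yml` is an assumption):

```yaml
# docker-compose.gpu.yml (hypothetical override file)
# Merge it with the training file:
#   docker compose -f docker-compose.training.yml -f docker-compose.gpu.yml run --rm training
services:
  training:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]   # pin GPUs 0 and 1
              capabilities: [gpu]
```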
## Monitoring

### Viewing Logs

```bash
# Follow training logs in real time
docker compose -f docker-compose.training.yml logs -f training

# View the last 100 lines
docker compose -f docker-compose.training.yml logs --tail 100 training
```
### GPU Utilization

```bash
# Monitor the GPU from the host while the container trains
watch -n 1 nvidia-smi
```
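`nvidia-smi` covers the GPU itself; for per-container CPU and memory usage alongside it, `docker stats` also works from the host:

```bash
# Live CPU / memory / I/O figures for each running container
docker stats
```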
## Volumes and Persistence

| Host Path | Container Path | Purpose |
|---|---|---|
| `./training/data/` | `/app/training/data/` | Training datasets (survive container removal) |
| `./training/checkpoints/` | `/app/training/checkpoints/` | Model checkpoints (survive container removal) |
| `./training/configs/` | `/app/training/configs/` | Configuration files |
All training artifacts are persisted on the host — destroying the container doesn't lose your work.
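Because everything lives under `training/` on the host, archiving a finished run is a single command (the filename is illustrative):

```bash
tar czf granite-soc-checkpoints-$(date +%Y%m%d).tar.gz training/checkpoints/
```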
## Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| `docker: Error response from daemon: could not select device driver "nvidia"` | NVIDIA Container Toolkit not installed | Install `nvidia-container-toolkit` and restart Docker (see the sketch below) |
| Container exits immediately | Usually a missing dependency or config issue | Check the logs with `docker compose logs training` |
| Permission denied on volume mounts | Docker user ≠ host user | Run `chmod -R a+rw training/` on the host |
| Slow initial build | Mamba kernel compilation | Normal on the first build; cached afterwards |
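For the first issue, the usual fix on a Debian/Ubuntu host (assuming NVIDIA's apt repository is already configured) is:

```bash
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime with Docker
sudo systemctl restart docker
```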
## Next Steps
- Google Colab Training — if you don't have a local GPU
- Per-Agent Specialists — train domain-specific models
- Evaluation & Export — test your models