# Docker-Based Training
Docker training provides a reproducible, isolated environment for fine-tuning Granite models. This is ideal for teams that want to version their training environment, run training in CI/CD, or avoid polluting their host system with CUDA/Python dependencies.
## Why Use Docker for Training?
| Benefit | Details |
|---|---|
| Reproducibility | Same container = same results, regardless of host OS |
| Isolation | Training deps don't interfere with production Python |
| CI/CD integration | Run training as a pipeline stage |
| Multi-GPU | Docker Compose can orchestrate multi-container training |
| Versioned environment | Pin exact CUDA, PyTorch, Unsloth versions |
## Prerequisites
- Docker ≥ 24.0
- Docker Compose ≥ 2.20
- NVIDIA Container Toolkit (for GPU passthrough)
- NVIDIA GPU with ≥ 8 GB VRAM
Verify GPU access in Docker:
```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
## Quick Start

```bash
# Train generic SOC model
make train-docker

# Train specific agent
make train-agent-docker AGENT=security_analyst

# Prepare data (in Docker)
make train-data-docker
```
## Docker Compose Configuration

The training infrastructure is defined in `docker-compose.training.yml`:
```yaml
services:
  # ── Data Preparation ──────────────────────────────────────
  prepare-data:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: python training/scripts/prepare_datasets.py
    volumes:
      - ./training/data:/app/training/data
    # No GPU needed for data prep

  # ── Generic Model Training ────────────────────────────────
  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

  # ── Per-Agent Training ────────────────────────────────────
  training-agent:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
      --agent ${AGENT:-security_analyst}
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Evaluation ────────────────────────────────────────────
  eval:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/evaluate_model.py
      --model training/checkpoints/granite_soc_lora
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── vLLM Serving ──────────────────────────────────────────
  vllm:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py vllm
      --model training/checkpoints/granite_soc_merged_16bit
    ports:
      - "8000:8000"
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Ollama Import ─────────────────────────────────────────
  ollama-import:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py ollama
      --gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf
      --name granite-soc:latest
    volumes:
      - ./training:/app/training
      - /usr/share/ollama:/usr/share/ollama
    network_mode: host
```
## Understanding the Services

| Service | Purpose | GPU Required | When to Use |
|---|---|---|---|
| `prepare-data` | Downloads and transforms SOC datasets | No | Once, before first training |
| `training` | Trains generic SOC model (all domains) | Yes | Main training run |
| `training-agent` | Trains per-agent specialist model | Yes | Per-agent fine-tuning |
| `eval` | Runs evaluation benchmarks | Yes | After training, before deployment |
| `vllm` | Serves merged FP16 model via vLLM | Yes | Production serving (high throughput) |
| `ollama-import` | Imports GGUF into host Ollama | No | Local deployment |
## The Training Dockerfile

`Dockerfile.training` builds the training environment:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# System deps
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY pyproject.toml .
COPY training/ training/
COPY aurorasoc/ aurorasoc/

# Install training dependencies
RUN pip install -e ".[training]"

# Compile Mamba kernels (cached in Docker layer)
RUN pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2
```
Key design decisions:

- Uses the CUDA 12.1 `devel` image (not `runtime`) because the Mamba kernels need `nvcc` to compile
- Mamba compilation is a separate layer, so it's cached across rebuilds (see the build sketch below)
- Training data is mounted as a volume, not copied into the image
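The caching shows up on rebuilds: the first build compiles the Mamba kernels (slow), while later builds reuse every layer that hasn't changed. A minimal sketch:

```bash
# First build is slow (kernel compilation); later builds reuse cached layers
docker compose -f docker-compose.training.yml build training

# Force a clean rebuild if the cache is stale or suspect
docker compose -f docker-compose.training.yml build --no-cache training
```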
## Running Training

### Step 1: Prepare Data

```bash
docker compose -f docker-compose.training.yml run --rm prepare-data
```
This downloads and prepares all SOC datasets. The data is persisted to `./training/data/` via the volume mount.
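To sanity-check the result (the exact filenames depend on the prepare script):

```bash
ls -lh training/data/
```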
### Step 2: Train

```bash
# Generic model
docker compose -f docker-compose.training.yml run --rm training

# Specific agent
AGENT=threat_hunter docker compose -f docker-compose.training.yml run --rm training-agent
```
Training output (LoRA adapters, GGUF files) is saved to `./training/checkpoints/` via the volume mount.
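After a successful run you can confirm the artifacts landed on the host. The directory names below are the ones the other compose services expect; the exact set depends on your export settings:

```bash
ls training/checkpoints/
# e.g.:
#   granite_soc_lora/           LoRA adapters (used by the eval service)
#   granite_soc_merged_16bit/   merged FP16 model (used by the vllm service)
#   granite_soc_gguf/           GGUF export (used by ollama-import)
```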
### Step 3: Evaluate

```bash
docker compose -f docker-compose.training.yml run --rm eval
```
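Because `docker compose run` accepts a command override, you can also point the evaluator at a different checkpoint, such as a per-agent one (the path below is a placeholder):

```bash
docker compose -f docker-compose.training.yml run --rm eval \
  python training/scripts/evaluate_model.py \
  --model training/checkpoints/<your-checkpoint>
```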
### Step 4: Deploy
To import the GGUF into your host Ollama instance:
```bash
docker compose -f docker-compose.training.yml run --rm ollama-import
```
This uses `network_mode: host` so the container can reach the host's Ollama service directly.
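Assuming the Ollama CLI is installed on the host, you can verify the import and smoke-test the model (the prompt is just an example):

```bash
# Confirm the model is registered
ollama list | grep granite-soc

# Quick smoke test
ollama run granite-soc:latest "Summarize the MITRE ATT&CK tactic 'Lateral Movement'."
```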
## Custom Configuration

### Overriding Training Parameters
Pass custom parameters via environment variables:
```bash
docker compose -f docker-compose.training.yml run \
  -e GRANITE_MODEL_NAME=unsloth/granite-4.0-h-small \
  -e GRANITE_EPOCHS=5 \
  -e GRANITE_BATCH_SIZE=4 \
  --rm training
```
### Using a Custom Config File
Mount your custom config:
```bash
docker compose -f docker-compose.training.yml run \
  -v "$(pwd)/my-custom-config.yaml:/app/training/configs/granite_soc_finetune.yaml" \
  --rm training
```
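For orientation, here is a hypothetical minimal config. The real schema is whatever `training/configs/granite_soc_finetune.yaml` defines; the keys below merely mirror the environment variables shown above:

```yaml
# my-custom-config.yaml (hypothetical keys; check the shipped config for the real schema)
model_name: unsloth/granite-4.0-h-small
epochs: 5
batch_size: 4
```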
## Multi-GPU Training
For machines with multiple GPUs, specify which GPU to use:
```bash
# Use GPU 1 only (-e overrides the NVIDIA_VISIBLE_DEVICES=all set in the compose file)
docker compose -f docker-compose.training.yml run --rm \
  -e NVIDIA_VISIBLE_DEVICES=1 training

# Use GPUs 0 and 1
docker compose -f docker-compose.training.yml run --rm \
  -e NVIDIA_VISIBLE_DEVICES=0,1 training
```
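To pin specific GPUs without per-run flags, a compose override file can replace `count` with `device_ids` (a sketch; the filename `docker-compose.gpu.yml` is an assumption):

```yaml
# docker-compose.gpu.yml (hypothetical override file)
# Merge it with the training file:
#   docker compose -f docker-compose.training.yml -f docker-compose.gpu.yml run --rm training
services:
  training:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]   # pin GPUs 0 and 1
              capabilities: [gpu]
```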
## Monitoring

### Viewing Logs

```bash
# Follow training logs in real time
docker compose -f docker-compose.training.yml logs -f training

# View the last 100 lines
docker compose -f docker-compose.training.yml logs --tail 100 training
```
### GPU Utilization

```bash
# Monitor the GPU from the host while the container trains
watch -n 1 nvidia-smi
```
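`nvidia-smi` covers the GPU itself; for per-container CPU and memory usage alongside it, `docker stats` also works from the host:

```bash
# Live CPU / memory / I/O figures for each running container
docker stats
```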
## Volumes and Persistence

| Host Path | Container Path | Purpose |
|---|---|---|
| `./training/data/` | `/app/training/data/` | Training datasets (survive container removal) |
| `./training/checkpoints/` | `/app/training/checkpoints/` | Model checkpoints (survive container removal) |
| `./training/configs/` | `/app/training/configs/` | Configuration files |
All training artifacts are persisted on the host — destroying the container doesn't lose your work.
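Because everything lives under `training/` on the host, archiving a finished run is a single command (the filename is illustrative):

```bash
tar czf granite-soc-checkpoints-$(date +%Y%m%d).tar.gz training/checkpoints/
```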
## Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| `docker: Error response from daemon: could not select device driver "nvidia"` | NVIDIA Container Toolkit not installed | Install `nvidia-container-toolkit` and restart Docker (see the sketch below) |
| Container exits immediately | Usually a missing dependency or config issue | Check the logs with `docker compose logs training` |
| Permission denied on volume mounts | Docker user ≠ host user | Run `chmod -R a+rw training/` on the host |
| Slow initial build | Mamba kernel compilation | Normal on the first build; cached afterwards |
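For the first issue, the usual fix on a Debian/Ubuntu host (assuming NVIDIA's apt repository is already configured) is:

```bash
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime with Docker
sudo systemctl restart docker
```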
## Next Steps
- Google Colab Training — if you don't have a local GPU
- Per-Agent Specialists — train domain-specific models
- Evaluation & Export — test your models