Docker-Based Training

Docker training provides a reproducible, isolated environment for fine-tuning Granite models. This is ideal for teams that want to version their training environment, run training in CI/CD, or avoid polluting their host system with CUDA/Python dependencies.

Why Use Docker for Training?

| Benefit | Details |
|---|---|
| Reproducibility | Same container = same results, regardless of host OS |
| Isolation | Training deps don't interfere with production Python |
| CI/CD integration | Run training as a pipeline stage (sketch below) |
| Multi-GPU | Docker Compose can orchestrate multi-container training |
| Versioned environment | Pin exact CUDA, PyTorch, Unsloth versions |
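
The CI/CD row is worth a concrete illustration: because training runs entirely through make and compose, a pipeline stage is just those same commands. A minimal sketch, assuming a self-hosted GitHub Actions runner labeled gpu with Docker and the NVIDIA Container Toolkit installed (this workflow file is illustrative, not part of the repo):

# .github/workflows/train.yml (illustrative)
on: workflow_dispatch
jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: make train-data-docker
      - run: make train-docker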

Prerequisites

  • Docker ≥ 24.0
  • Docker Compose ≥ 2.20
  • NVIDIA Container Toolkit (for GPU passthrough)
  • NVIDIA GPU with ≥ 8 GB VRAM

Verify GPU access in Docker:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
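
If that command fails with a device-driver error, the NVIDIA Container Toolkit is usually missing. On Ubuntu/Debian hosts (assuming NVIDIA's apt repository is already configured; see NVIDIA's install docs otherwise):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker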

Quick Start

# Train generic SOC model
make train-docker

# Train specific agent
make train-agent-docker AGENT=security_analyst

# Prepare data (in Docker)
make train-data-docker
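
These targets are thin wrappers around the compose services documented below. Assuming the Makefile simply forwards to docker compose (the explicit commands appear again under Running Training), the equivalents are:

# make train-data-docker
docker compose -f docker-compose.training.yml run --rm prepare-data

# make train-docker
docker compose -f docker-compose.training.yml run --rm training

# make train-agent-docker AGENT=security_analyst
AGENT=security_analyst docker compose -f docker-compose.training.yml run --rm training-agent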

Docker Compose Configuration

The training infrastructure is defined in docker-compose.training.yml:

services:
  # ── Data Preparation ──────────────────────────────────────
  prepare-data:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: python training/scripts/prepare_datasets.py
    volumes:
      - ./training/data:/app/training/data
    # No GPU needed for data prep

  # ── Generic Model Training ────────────────────────────────
  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

  # ── Per-Agent Training ────────────────────────────────────
  training-agent:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/finetune_granite.py
      --config training/configs/granite_soc_finetune.yaml
      --agent ${AGENT:-security_analyst}
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Evaluation ────────────────────────────────────────────
  eval:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/evaluate_model.py
      --model training/checkpoints/granite_soc_lora
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── vLLM Serving ──────────────────────────────────────────
  vllm:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py vllm
      --model training/checkpoints/granite_soc_merged_16bit
    ports:
      - "8000:8000"
    volumes:
      - ./training:/app/training
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ── Ollama Import ─────────────────────────────────────────
  ollama-import:
    build:
      context: .
      dockerfile: Dockerfile.training
    command: >
      python training/scripts/serve_model.py ollama
      --gguf training/checkpoints/granite_soc_gguf/unsloth.Q8_0.gguf
      --name granite-soc:latest
    volumes:
      - ./training:/app/training
      - /usr/share/ollama:/usr/share/ollama
    network_mode: host
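
Before the first run, it's worth validating the file and pre-building the image once. All six services build from the same Dockerfile.training, so a single build covers them all:

# Validate the compose file, then build the shared training image
docker compose -f docker-compose.training.yml config --quiet
docker compose -f docker-compose.training.yml build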

Understanding the Services

| Service | Purpose | GPU Required | When to Use |
|---|---|---|---|
| prepare-data | Downloads + transforms SOC datasets | No | Once, before first training |
| training | Trains generic SOC model (all domains) | Yes | Main training run |
| training-agent | Trains per-agent specialist model | Yes | Per-agent fine-tuning |
| eval | Runs evaluation benchmarks | Yes | After training, before deployment |
| vllm | Serves merged FP16 model via vLLM | Yes | Production serving (high throughput) |
| ollama-import | Imports GGUF into host Ollama | No | Local deployment |

The Training Dockerfile

Dockerfile.training builds the training environment:

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# System deps
RUN apt-get update && apt-get install -y \
        python3.11 python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY pyproject.toml .
COPY training/ training/
COPY aurorasoc/ aurorasoc/

# Install training dependencies
RUN pip install -e ".[training]"

# Compile Mamba kernels (cached in Docker layer)
RUN pip install --no-build-isolation mamba_ssm==2.2.5 causal_conv1d==1.5.2

Key design decisions:

  • Uses CUDA 12.1 devel image (not runtime) because Mamba kernels need nvcc to compile
  • Mamba compilation is a separate layer so it's cached across rebuilds
  • Training data is mounted as a volume, not copied into the image
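
Should you want to verify the devel-image decision, a quick spot check confirms nvcc exists inside the built image (the entrypoint override here is purely diagnostic):

# nvcc must be present for the Mamba kernels to compile
docker compose -f docker-compose.training.yml run --rm --entrypoint nvcc training --version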

Running Training

Step 1: Prepare Data

docker compose -f docker-compose.training.yml run --rm prepare-data

This downloads and prepares all SOC datasets. The data is persisted to ./training/data/ via the volume mount.
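
A quick sanity check that the datasets landed on the host side of the volume mount:

ls -lh training/data/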

Step 2: Train

# Generic model
docker compose -f docker-compose.training.yml run --rm training

# Specific agent
AGENT=threat_hunter docker compose -f docker-compose.training.yml run --rm training-agent

Training output (LoRA adapters, GGUF files) is saved to ./training/checkpoints/ via the volume mount.
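
For long runs, you can detach instead of keeping a terminal attached (this pairs with the log commands under Monitoring below):

# Start training in the background, then follow logs whenever you like
docker compose -f docker-compose.training.yml up -d training
docker compose -f docker-compose.training.yml logs -f training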

Step 3: Evaluate

docker compose -f docker-compose.training.yml run --rm eval

Step 4: Deploy

To import the GGUF into your host Ollama instance:

docker compose -f docker-compose.training.yml run --rm ollama-import

This uses network_mode: host to access the host's Ollama service directly.
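
Afterwards, the model should appear in the host's Ollama registry; the tag matches the --name flag wired into the compose service:

ollama list | grep granite-soc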

Custom Configuration

Overriding Training Parameters

Pass custom parameters via environment variables:

docker compose -f docker-compose.training.yml run \
  -e GRANITE_MODEL_NAME=unsloth/granite-4.0-h-small \
  -e GRANITE_EPOCHS=5 \
  -e GRANITE_BATCH_SIZE=4 \
  --rm training

Using a Custom Config File

Mount your custom config:

docker compose -f docker-compose.training.yml run \
  -v ./my-custom-config.yaml:/app/training/configs/granite_soc_finetune.yaml \
  --rm training
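
If you need a starting point for that file, copy training/configs/granite_soc_finetune.yaml and edit it. Purely to illustrate the shape such an override takes, a sketch with hypothetical keys mirroring the environment variables above:

# my-custom-config.yaml (hypothetical keys; consult the shipped config for the real schema)
model_name: unsloth/granite-4.0-h-small
epochs: 5
batch_size: 4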

Multi-GPU Training

For machines with multiple GPUs, pass NVIDIA_VISIBLE_DEVICES into the container with -e. A plain VAR=... shell prefix is only used for compose-file interpolation and never reaches the container, and the training service pins the variable to all:

# Use GPU 1 only
docker compose -f docker-compose.training.yml run --rm -e NVIDIA_VISIBLE_DEVICES=1 training

# Use GPUs 0 and 1
docker compose -f docker-compose.training.yml run --rm -e NVIDIA_VISIBLE_DEVICES=0,1 training
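
To pin devices declaratively instead, a compose override file also works. A sketch using the compose spec's device_ids field (the file name is arbitrary; device_ids and count are mutually exclusive within one reservation, so inspect the merged config before running):

cat > docker-compose.gpus.yml <<'EOF'
services:
  training:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
EOF

# Inspect the merged result, then run with both files
docker compose -f docker-compose.training.yml -f docker-compose.gpus.yml config
docker compose -f docker-compose.training.yml -f docker-compose.gpus.yml run --rm training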

Monitoring

Viewing Logs

docker compose run attaches to your terminal, so training output streams live by default. If you start the service detached instead (docker compose -f docker-compose.training.yml up -d training), follow or inspect its logs:

# Follow training logs in real-time
docker compose -f docker-compose.training.yml logs -f training

# View last 100 lines
docker compose -f docker-compose.training.yml logs --tail 100 training

GPU Utilization

# Monitor GPU from host while container trains
watch -n 1 nvidia-smi
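
Two more host-side views that are often useful:

# Container-level CPU and memory usage
docker stats --no-stream

# Compact GPU telemetry, refreshed every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1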

Volumes and Persistence

| Host Path | Container Path | Purpose |
|---|---|---|
| ./training/data/ | /app/training/data/ | Training datasets (survive container removal) |
| ./training/checkpoints/ | /app/training/checkpoints/ | Model checkpoints (survive container removal) |
| ./training/configs/ | /app/training/configs/ | Configuration files |

All training artifacts are persisted on the host — destroying the container doesn't lose your work.
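
Because everything lives under ./training/, backing up a finished run is an ordinary file copy, for example:

# Snapshot trained checkpoints before the next experiment
tar czf checkpoints-$(date +%Y%m%d).tar.gz training/checkpoints/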

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| docker: Error response from daemon: could not select device driver "nvidia" | NVIDIA Container Toolkit not installed | Install nvidia-container-toolkit and restart Docker |
| Container exits immediately | Usually a missing dependency or config issue | Check logs with docker compose logs training |
| Permission denied on volume mounts | Docker user ≠ host user | Run chmod -R a+rw training/ on host |
| Slow initial build | Mamba kernel compilation | Normal on the first build; cached afterwards |

Next Steps