Cloud GPU Training Guide

Not everyone has a local GPU. This guide walks you through training AuroraSOC models on cloud GPU platforms — from zero to a fully fine-tuned model you can download and run locally.

Platform Comparison

| Platform | Best GPU | Cost/hour | Minimum Spend | Setup Time | Best For |
|---|---|---|---|---|---|
| RunPod | RTX 3090 (24 GB) | $0.69 | None | 5 min | Best value for 8B models |
| RunPod | A100 40 GB | $1.49 | None | 5 min | 12B+ models |
| Lambda Labs | A100 80 GB | $1.29 | None | 10 min | Large-scale training |
| vast.ai | RTX 3090 | ~$0.40 | None | 15 min | Cheapest option (variable) |
| Google Colab Pro | A100 40 GB | $9.99/mo (limited) | $9.99/mo | 2 min | Quick experiments |
| AWS SageMaker | A100 40 GB | ~$5.67 | Pay-per-use | 30 min | Enterprise / existing AWS |

RunPod offers the best balance of cost, ease of use, and reliability for AuroraSOC fine-tuning.

Step 1: Create Account & Add Funds

  1. Go to runpod.io and create an account
  2. Add funds: $10-15 is enough to fine-tune all 9 agents on Granite 4
  3. Navigate to GPU Cloud → Secure Cloud (recommended) or Community Cloud (cheaper)

Step 2: Choose a GPU Pod

For AuroraSOC fine-tuning, you need:

| Model Size | Minimum GPU | Recommended GPU | Cost/hr |
|---|---|---|---|
| Granite 4 H-Micro (3B) | RTX 3090 (24 GB) | RTX 3090 | $0.69 |
| Granite 4 H-Small (8B) | RTX 3090 (24 GB) | RTX 3090 | $0.69 |
| Qwen 3 8B | RTX 3090 (24 GB) | RTX 3090 | $0.69 |
| Gemma 4 12B | A100 40 GB | A100 40 GB | $1.49 |
| Qwen 3 14B | A100 40 GB | A100 40 GB | $1.49 |
| Qwen 3 32B / 30B-A3B | A100 80 GB | A100 80 GB | $2.49 |

Cost Saving

For most AuroraSOC agents, an RTX 3090 at $0.69/hr is sufficient. Only use A100 for 12B+ models.

Step 3: Select a Template

Choose the RunPod PyTorch 2.4+ template, which includes:

  • CUDA 12.4
  • Python 3.11
  • PyTorch 2.4
  • 50 GB disk (increase to 100 GB for training output)

Pod configuration:

  • Container disk: 20 GB
  • Volume disk: 100 GB (persistent — survives pod restarts)
  • Expose ports: 8888 (Jupyter), 22 (SSH)
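
If you prefer the command line to the web console, runpodctl can create an equivalent pod. This is a sketch only: the flag names below are from recent runpodctl releases and the image name is a placeholder, so verify both with runpodctl create pod --help and the template list in your console.

# Hypothetical CLI equivalent of the configuration above; flags may differ by version
runpodctl create pod \
  --name aurorasoc-train \
  --gpuType "NVIDIA GeForce RTX 3090" \
  --imageName "<runpod-pytorch-2.4-image>" \
  --containerDiskSize 20 \
  --volumeSize 100 \
  --volumePath /workspace \
  --ports "8888/http,22/tcp" \
  --secureCloud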

Step 4: Connect via SSH

# RunPod provides an SSH command — copy it from the pod dashboard:
ssh root@<pod-ip> -p <port> -i ~/.ssh/id_rsa

# First time: add your SSH key in RunPod Settings → SSH Keys

Or use the web terminal in the RunPod dashboard (no SSH setup needed).
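
If you expect to reconnect or copy files more than once, a host entry in ~/.ssh/config saves retyping the IP and port each time (the values are the placeholders from your pod dashboard):

# ~/.ssh/config entry; substitute the host, port, and key shown on your pod dashboard
Host aurorasoc-pod
    HostName <pod-ip>
    Port <port>
    User root
    IdentityFile ~/.ssh/id_rsa

# Then connect and copy with the alias:
#   ssh aurorasoc-pod
#   scp aurorasoc-pod:/workspace/AuroraSOC/training/output/gguf/*.gguf ./models/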

Step 5: Set Up the Training Environment

# Connect to your pod

# Clone AuroraSOC
git clone https://github.com/your-org/AuroraSOC.git
cd AuroraSOC

# Install dependencies
pip install -e ".[training]"

# Verify GPU
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB')"
# GPU: NVIDIA GeForce RTX 3090
# VRAM: 24.0 GB

# Verify Unsloth
python -c "from unsloth import FastLanguageModel; print('Unsloth ready!')"

Step 6: Run Training

# Prepare datasets
make train-data

# Train generic model (all domains)
make train

# Train per-agent specialists
python training/scripts/train_all_agents.py

# Or train a single agent:
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml \
--agent malware_analyst
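
Per-agent training runs for hours, and a dropped SSH session kills a foreground process. Running the job inside tmux or under nohup (both standard Linux tools; install tmux with apt if the image lacks it) keeps training alive across disconnects:

# Option A: tmux session; detach with Ctrl-b d, reattach after reconnecting
tmux new -s train
python training/scripts/train_all_agents.py
tmux attach -t train

# Option B: nohup; run in the background and log to a file
nohup python training/scripts/train_all_agents.py > train.log 2>&1 &
tail -f train.log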

Expected training times (RTX 3090, Granite 4 H-Small):

| Step | Duration | Cost |
|---|---|---|
| Dataset preparation | 5 min | $0.06 |
| Generic model training | 25 min | $0.29 |
| Per-agent training (×9 agents) | 3-4 hours | $2.30-2.76 |
| Evaluation | 10 min | $0.12 |
| GGUF export | 15 min | $0.17 |
| Total | ~4.5 hours | ~$3.10 |

Step 7: Export & Download

# Export to GGUF
make train-export

# Download to local machine (from your LOCAL terminal):
scp -P <port> root@<pod-ip>:/workspace/AuroraSOC/training/output/gguf/*.gguf ./models/

# Or use RunPod's file browser in the web UI
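
Multi-gigabyte GGUF transfers occasionally get truncated. A quick checksum on both ends confirms the download is intact (sha256sum ships with standard Linux images):

# On the pod: record checksums with relative filenames
cd /workspace/AuroraSOC/training/output/gguf && sha256sum *.gguf > gguf.sha256

# Locally, after downloading the models and gguf.sha256 into ./models/:
cd ./models && sha256sum -c gguf.sha256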

Step 8: Stop the Pod

Warning

Always stop your pod when done! RunPod charges by the hour. An idle RTX 3090 pod costs $0.69/hr = $16.56/day.

# From RunPod dashboard: Click "Stop" on your pod
# Your /workspace volume persists — you can restart later without losing data
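
You can also stop the pod from the CLI, which is convenient at the end of a scripted run. The commands below assume runpodctl is installed and configured with your API key; confirm the exact subcommands with runpodctl --help.

# List pods to find the pod ID, then stop it
runpodctl get pod
runpodctl stop pod <pod-id>

# Stopped pods may still accrue storage charges for the volume; remove the pod
# once everything is downloaded
runpodctl remove pod <pod-id>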

Lambda Labs

Lambda Labs offers on-demand A100 instances with a clean Ubuntu environment. Good for larger models (12B+).

Quick Start

# 1. Create account at lambdalabs.com
# 2. Launch instance: 1x A100 (40 GB) at $1.29/hr
# 3. SSH in:
ssh ubuntu@<instance-ip>

# 4. Set up environment
sudo apt update && sudo apt install -y git python3-pip
git clone https://github.com/your-org/AuroraSOC.git
cd AuroraSOC
pip install -e ".[training]"

# 5. Run training (same commands as RunPod)
make train-data && make train && python training/scripts/train_all_agents.py

Cost comparison for full training:

| Configuration | Lambda A100 ($1.29/hr) | RunPod RTX 3090 ($0.69/hr) |
|---|---|---|
| Generic model only | $0.54 | $0.29 |
| All 9 agents | $6.45 | $3.10 |
| All agents + 3 models (Option C) | $25.80 | $14.50 |

Lambda is ~1.9× more expensive than RunPod for the same work, but the A100 is faster for 12B+ models.


vast.ai

vast.ai is a marketplace for GPU rentals — prices vary based on supply and demand. It can be the cheapest option but requires more setup.

Quick Start

# 1. Create account at vast.ai
# 2. Install vast CLI
pip install vastai
vastai set api-key <your-key>

# 3. Search for GPUs
vastai search offers 'gpu_ram >= 24 cuda_vers >= 12.0 reliability > 0.95' \
--order 'dph_total'

# 4. Create instance (pick cheapest RTX 3090)
vastai create instance <offer-id> \
--image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel \
--disk 100

# 5. SSH in
vastai ssh-url <instance-id>
ssh -p <port> root@<host>

# 6. Same training commands
git clone https://github.com/your-org/AuroraSOC.git
cd AuroraSOC && pip install -e ".[training]"
make train-data && make train

Caution

vast.ai uses community-provided GPUs. Instances can be interrupted if the provider needs their GPU back. Always save checkpoints frequently and download results promptly.
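
One way to limit the damage from an interruption is to pull checkpoints to your local machine on a schedule while training runs. A minimal sketch, run from your LOCAL terminal (adjust the port, host, and remote path to your instance):

# Sync new or changed output files every 10 minutes; stop with Ctrl-C when training finishes
while true; do
  rsync -avz -e "ssh -p <port>" \
    "root@<host>:AuroraSOC/training/output/" ./training/output/
  sleep 600
done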


Google Colab Pro

Colab Pro is the easiest option for experiments and testing, but not ideal for full training runs.

See the dedicated Colab Training Guide for the full notebook walkthrough.

Quick summary:

  • $9.99/month for Colab Pro (A100 access, subject to availability)
  • No SSH needed — runs in browser
  • Limited to ~12 hours continuous runtime (Pro+: ~24 hours)
  • Storage via Google Drive (15 GB free, 100 GB with Google One)

Best for: Training a single agent, quick experiments, testing hyperparameters.

Not recommended for: Training all 9 agents (insufficient runtime) or multi-model configurations.


GPU Selection Guide

How Much VRAM Do You Need?

See the model-to-GPU table in Step 2: 3B-8B models fine-tune comfortably on a 24 GB card (RTX 3090), 12B-14B models need an A100 40 GB, and 32B-class models need an A100 80 GB.

GPU Speed Comparison

Real-world training times for Granite 4 H-Small (8B), single agent, 3 epochs:

| GPU | VRAM | Training Time | Cost (RunPod) |
|---|---|---|---|
| T4 | 16 GB | 52 min | $0.32 |
| RTX 3090 | 24 GB | 25 min | $0.29 |
| RTX 4090 | 24 GB | 18 min | $0.54 |
| A100 40 GB | 40 GB | 14 min | $0.35 |
| A100 80 GB | 80 GB | 12 min | $0.50 |
| H100 80 GB | 80 GB | 8 min | $0.53 |

Best Value

The RTX 3090 consistently offers the lowest total cost despite not being the fastest. At $0.69/hr and 25 minutes per agent, you spend only $0.29 per agent fine-tune.


Cost Calculator

Single-Model Configuration (Granite 4 only)

Base training: 25 min × $0.69/hr = $0.29
Per-agent (×9): 225 min × $0.69/hr = $2.59
Dataset prep: 5 min × $0.69/hr = $0.06
Export + eval: 25 min × $0.69/hr = $0.29
────────────────────────────────────────────────
Total: 280 min = 4.7 hrs → $3.23

Two-Model Configuration (Granite 4 + Qwen 3)

Granite 4 agents (×9): 225 min × $0.69/hr = $2.59
Qwen 3 agents (×4): 100 min × $0.69/hr = $1.15
Dataset prep: 5 min × $0.69/hr = $0.06
Export + eval: 40 min × $0.69/hr = $0.46
────────────────────────────────────────────────
Total: 370 min = 6.2 hrs → $4.26

Three-Model Configuration (Granite 4 + Qwen 3 + Gemma 4)

Granite 4 agents (×9): 225 min × $0.69/hr = $2.59
Qwen 3 agents (×4): 100 min × $0.69/hr = $1.15
Gemma 4 agents (×3): 120 min × $1.49/hr = $2.98
Dataset prep: 5 min × $0.69/hr = $0.06
Export + eval: 45 min × $1.49/hr = $1.12
────────────────────────────────────────────────
Total: 495 min = 8.25 hrs → $7.90
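
The same arithmetic is easy to script for other configurations. A small sketch using the assumptions above (25 minutes per 8B agent on an RTX 3090 at $0.69/hr); plug in your own measured times and rates:

# Rough cost estimate: GPU minutes x hourly rate
agents=9            # agents to fine-tune
min_per_agent=25    # measured minutes per agent fine-tune
overhead_min=55     # generic model training + dataset prep + export/eval
rate=0.69           # GPU $/hr

total_min=$(( agents * min_per_agent + overhead_min ))
# bash only does integer math, so use awk for the dollar figure
awk -v m="$total_min" -v r="$rate" \
  'BEGIN { printf "%d min (%.1f hrs) -> $%.2f\n", m, m/60, m/60*r }'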

Data Transfer Tips

Uploading Training Data

# Option 1: Clone from Git (fastest if data is in repo)
git clone --depth 1 https://github.com/your-org/AuroraSOC.git

# Option 2: rsync for large datasets
rsync -avz --progress training/data/ root@<pod-ip>:/workspace/AuroraSOC/training/data/

# Option 3: Hugging Face Hub (for public/private datasets)
pip install huggingface-hub
huggingface-cli download your-org/aurora-soc-dataset --local-dir training/data/

Downloading Results

# Download only GGUF files (smallest, ready for serving)
scp -P <port> root@<pod-ip>:/workspace/AuroraSOC/training/output/gguf/*.gguf ./models/

# Download LoRA adapters (for later merging)
scp -P <port> -r root@<pod-ip>:/workspace/AuroraSOC/training/output/*/adapter_model.safetensors ./adapters/

# Download everything
rsync -avz root@<pod-ip>:/workspace/AuroraSOC/training/output/ ./training/output/

Troubleshooting

Common Cloud Training Issues

| Issue | Cause | Solution |
|---|---|---|
| CUDA out of memory | Batch size too large or wrong GPU | Reduce per_device_train_batch_size to 1, enable gradient checkpointing |
| Connection reset | Pod terminated | Use RunPod persistent volumes; restart pod and resume from checkpoint |
| Permission denied on SSH | SSH key not configured | Add key in platform settings; use web terminal as fallback |
| Training slower than expected | Shared GPU (vast.ai) | Monitor with nvidia-smi; switch to dedicated instance |
| Disk full | Output files accumulating | Increase volume to 200 GB; delete intermediate checkpoints |
| ModuleNotFoundError: unsloth | Missing dependency | pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" |
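
When throughput looks wrong (the "Training slower than expected" row above), a minute of watching GPU utilization usually shows whether you are compute-bound, starved for data, or sharing the card:

# Refresh utilization and memory every 2 seconds while training runs
watch -n 2 nvidia-smi

# Or log a compact reading every 5 seconds to spot sustained dips
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5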

Resuming from Checkpoints

If your training is interrupted (pod preempted, connection dropped):

# Find the latest checkpoint
ls -la training/output/checkpoint-*/

# Resume training from checkpoint
python training/scripts/finetune_granite.py \
--config training/configs/granite_soc_finetune.yaml \
--resume-from-checkpoint training/output/checkpoint-500/
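
With many checkpoints on disk, the most recent one can be selected automatically instead of reading the directory listing by hand:

# Resume from the checkpoint with the highest step number
latest=$(ls -d training/output/checkpoint-* | sort -V | tail -1)
python training/scripts/finetune_granite.py \
  --config training/configs/granite_soc_finetune.yaml \
  --resume-from-checkpoint "$latest"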

Next Steps