
Serving Backends

AuroraSOC supports two serving backends for hosting Granite models: Ollama (for local/edge deployment) and vLLM (for production). This page explains when to use each, how to configure them, and how they integrate with the agent framework.

Backend Comparison

| Feature | Ollama | vLLM |
| --- | --- | --- |
| Model format | GGUF (quantized) | FP16/BF16 (full precision) |
| GPU required | No (CPU fallback) | Yes |
| Min VRAM | 4-8 GB (GGUF) | 8-16 GB (FP16) |
| Concurrent requests | Sequential | Batched (continuous batching) |
| Throughput | ~1-5 req/sec | ~50-500 req/sec |
| Latency (first token) | ~100-500 ms | ~50-200 ms |
| Setup complexity | Simple (ollama create) | Moderate |
| API protocol | Ollama native API | OpenAI-compatible /v1/chat/completions |
| Best for | Development, edge, single-user | Production, multi-user, high throughput |
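The "API protocol" row matters when calling a backend directly: Ollama exposes its native chat endpoint, while vLLM speaks the OpenAI protocol. A small illustrative helper (the function name is ours, not from the repo) maps a backend to its chat URL:

```python
def chat_endpoint(backend: str, base_url: str) -> str:
    """Return the chat-completion URL for the chosen serving backend."""
    paths = {
        "ollama": "/api/chat",             # Ollama native chat API
        "vllm": "/v1/chat/completions",    # OpenAI-compatible API
    }
    if backend not in paths:
        raise ValueError(f"unknown backend: {backend}")
    return base_url.rstrip("/") + paths[backend]
```

Any OpenAI client library can target the vLLM URL unchanged; the Ollama URL requires Ollama's own request shape.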

Ollama

Why Ollama?

Ollama is the recommended backend for development and local deployment because:

  • Zero-config setup — install and run
  • Works without a GPU (CPU fallback, slower but functional)
  • Tiny memory footprint with GGUF quantization
  • Model management built-in (pull, create, list, delete)
  • Great for testing and iterating on prompts

Installation

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Verify
ollama --version

Pulling the Base Model

ollama pull granite3.2:2b

This downloads the base Granite 3.2 model from the Ollama registry (~1.5 GB).

Importing a Fine-Tuned Model

After training, import your GGUF file:

# Using serve_model.py (recommended — handles Modelfile generation)
python training/scripts/serve_model.py ollama \
  --gguf training/output/generic/unsloth.Q8_0.gguf \
  --name granite-soc:latest

# Using the Makefile
make train-serve-ollama

The script generates a Modelfile that includes the Granite 4 chat template:

FROM /path/to/unsloth.Q8_0.gguf

TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""

PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_of_role|>"

Why the template matters: Without this template, Ollama uses a generic chat format. Granite 4 models are trained with <|start_of_role|>...<|end_of_role|> delimiters — using the wrong template causes the model to produce incoherent output.
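For illustration, a system + user turn rendered through this template looks like the following (hypothetical content; the model generates the assistant turn and stops at the configured stop tokens):

```
<|start_of_role|>system<|end_of_role|>
You are a security analyst.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Analyze this alert.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
```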

Ollama in Docker Compose

The main docker-compose.yml includes an Ollama service:

ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama_data:/root/.ollama
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

Other services reference it via OLLAMA_HOST=http://ollama:11434.

Configuration

# .env
GRANITE_SERVING_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434 # or http://ollama:11434 in Docker
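Client code in other services can resolve the host the same way the compose setup does. A minimal sketch (hypothetical helper, assuming the default local port):

```python
import os

def resolve_ollama_host() -> str:
    """Resolve the Ollama base URL, falling back to the local default."""
    return os.environ.get("OLLAMA_HOST", "http://localhost:11434")
```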

Verifying Ollama

# Check running models
ollama list

# Test inference
ollama run granite-soc:latest "Analyze this alert: ET TROJAN Cobalt Strike"

# Check model details
ollama show granite-soc:latest

Ollama Performance Tuning

| Parameter | Default | Tuning |
| --- | --- | --- |
| OLLAMA_NUM_PARALLEL | 1 | Increase for concurrent requests (uses more VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Increase if running per-agent models (each stays in VRAM) |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model in memory after the last request |
| OLLAMA_GPU_LAYERS | Auto | Number of layers offloaded to GPU; lower = less VRAM usage |

For per-agent models, set OLLAMA_MAX_LOADED_MODELS to the number of distinct models you expect to be active simultaneously:

# Allow up to 4 models in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=4

vLLM

Why vLLM?

vLLM is the recommended backend for production because:

  • Continuous batching — serves multiple requests simultaneously without queuing
  • PagedAttention — efficient GPU memory management, up to 24× higher throughput than naive serving
  • OpenAI-compatible API — drop-in replacement for any OpenAI client
  • Tensor parallelism — split large models across multiple GPUs
  • Speculative decoding — faster generation with draft models

Requirements

  • GPU: NVIDIA GPU with ≥16 GB VRAM (for Granite 4 Tiny FP16)
  • CUDA: 12.0+
  • Python: 3.10+

Installation

pip install vllm

Starting vLLM

# Serve a fine-tuned model
python training/scripts/serve_model.py vllm \
  --model-path training/output/generic/merged_fp16

# Or start vLLM directly
python -m vllm.entrypoints.openai.api_server \
  --model training/output/generic/merged_fp16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096

vLLM in Docker Compose

The training compose file includes a vLLM service:

vllm:
  image: vllm/vllm-openai:latest
  ports:
    - "8000:8000"
  volumes:
    - ./training/output:/models
  command: >
    --model /models/generic/merged_fp16
    --host 0.0.0.0
    --port 8000
    --max-model-len 4096
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Configuration

# .env
GRANITE_SERVING_BACKEND=vllm
VLLM_API_BASE=http://localhost:8000 # or http://vllm:8000 in Docker

vLLM Performance Tuning

| Parameter | Default | Description |
| --- | --- | --- |
| --max-model-len | Model's max | Maximum sequence length; lower = less VRAM |
| --tensor-parallel-size | 1 | Number of GPUs for model parallelism |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use; lower for shared GPUs |
| --max-num-batched-tokens | Auto | Maximum tokens in a single batch |
| --quantization | None | Runtime quantization: awq, gptq, squeezellm |

Multi-GPU with vLLM

For larger models or higher throughput:

# Spread across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model training/output/generic/merged_fp16 \
  --tensor-parallel-size 2

Verifying vLLM

# Check loaded models
curl http://localhost:8000/v1/models | jq

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-soc",
    "messages": [
      {"role": "system", "content": "You are a security analyst."},
      {"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}
    ],
    "temperature": 0.1,
    "max_tokens": 512
  }'
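The same request body can be built programmatically. A small sketch (illustrative helper, not from the repo) that serializes the body for any HTTP client pointed at the /v1/chat/completions endpoint:

```python
import json

def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.1, max_tokens: int = 512) -> str:
    """Serialize an OpenAI-style chat completion body for vLLM."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    })
```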

How Backends Connect to BeeAI

The Granite module translates the backend choice into the correct ChatModel constructor based on GRANITE_SERVING_BACKEND.

The agent doesn't know or care which backend is behind ChatModel. This abstraction allows seamless backend switching without changing any agent code.
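A minimal sketch of that dispatch (the helper name is hypothetical; it assumes BeeAI's ChatModel.from_name takes a "provider:model" identifier and that vLLM is reached through an OpenAI-compatible provider):

```python
def chat_model_name(backend: str, model: str) -> str:
    """Map the configured backend to a 'provider:model' identifier."""
    if backend == "ollama":
        return f"ollama:{model}"    # e.g. "ollama:granite-soc:latest"
    if backend == "vllm":
        return f"openai:{model}"    # vLLM served over the OpenAI protocol
    raise ValueError(f"unsupported GRANITE_SERVING_BACKEND: {backend}")
```

Switching backends is then a one-line configuration change rather than a code change.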

Model Registry Health Checks

The registry.py module verifies backend availability:

# Check which models are available in Ollama
available = await check_ollama_models(ollama_host)
# Returns: ["granite-soc:latest", "granite-soc-threat-hunter:latest", ...]

# Check vLLM models
available = await check_vllm_models(vllm_base)
# Returns: ["granite-soc"]

# Warmup a model (pre-load into GPU memory)
await warmup_model(config, agent_name="security_analyst")

These checks are used during:

  • Startup — verify required models are available before accepting requests
  • Health endpoints — report model status in /health API responses
  • Auto-recovery — detect when a backend goes down and fail gracefully
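An Ollama availability check like check_ollama_models typically queries Ollama's /api/tags endpoint, which returns a JSON payload of the form {"models": [{"name": "..."}, ...]}. A sketch of the parsing step (the actual registry.py implementation may differ):

```python
def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]
```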

Choosing a Backend: Decision Tree

In short: choose vLLM when you have a dedicated NVIDIA GPU with ≥16 GB VRAM and need concurrent, high-throughput serving; choose Ollama for development, edge deployments, CPU-only hosts, and single-user workloads.

Next Steps