Serving Backends
AuroraSOC supports two serving backends for hosting Granite models: Ollama (for local/edge deployment) and vLLM (for production). This page explains when to use each, how to configure them, and how they integrate with the agent framework.
Backend Comparison
| Feature | Ollama | vLLM |
|---|---|---|
| Model format | GGUF (quantized) | FP16/BF16 (full precision) |
| GPU required | No (CPU fallback) | Yes |
| Min VRAM | 4-8 GB (GGUF) | 8-16 GB (FP16) |
| Concurrent requests | Sequential | Batched (continuous batching) |
| Throughput | ~1-5 req/sec | ~50-500 req/sec |
| Latency (first token) | ~100-500ms | ~50-200ms |
| Setup complexity | Simple (ollama create) | Moderate |
| API protocol | Ollama native API | OpenAI-compatible /v1/chat/completions |
| Best for | Development, edge, single-user | Production, multi-user, high-throughput |
Ollama
Why Ollama?
Ollama is the recommended backend for development and local deployment because:
- Zero-config setup — install and run
- Works without a GPU (CPU fallback, slower but functional)
- Tiny memory footprint with GGUF quantization
- Model management built-in (pull, create, list, delete)
- Great for testing and iterating on prompts
Installation
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Verify
ollama --version
Pulling the Base Model
ollama pull granite3.2:2b
This downloads the base Granite 3.2 model from the Ollama registry (~1.5 GB).
Importing a Fine-Tuned Model
After training, import your GGUF file:
# Using serve_model.py (recommended — handles Modelfile generation)
python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest
# Using the Makefile
make train-serve-ollama
The script generates a Modelfile that includes the Granite chat template:
FROM /path/to/unsloth.Q8_0.gguf
TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""
PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_of_role|>"
Why the template matters: Without this template, Ollama uses a generic chat format. Granite models are trained with <|start_of_role|>...<|end_of_role|> delimiters — using the wrong template causes the model to produce incoherent output.
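As a quick sanity check, you can confirm that an imported model was built with the Granite delimiters by inspecting its Modelfile via `ollama show <model> --modelfile`. The helper below is a sketch (the function names are illustrative, not part of the repo):

```python
import subprocess

GRANITE_MARKERS = ("<|start_of_role|>", "<|end_of_role|>", "<|end_of_text|>")

def has_granite_template(modelfile_text: str) -> bool:
    """Return True if a Modelfile's TEMPLATE uses the Granite role delimiters."""
    return all(marker in modelfile_text for marker in GRANITE_MARKERS)

def check_model_template(model: str = "granite-soc:latest") -> bool:
    # `ollama show <model> --modelfile` prints the Modelfile the model was built from
    out = subprocess.run(
        ["ollama", "show", model, "--modelfile"],
        capture_output=True, text=True, check=True,
    ).stdout
    return has_granite_template(out)
```

If `check_model_template` returns False, re-import the GGUF with serve_model.py so the correct template is applied.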
Ollama in Docker Compose
The main docker-compose.yml includes an Ollama service:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Other services reference it via OLLAMA_HOST=http://ollama:11434.
Configuration
# .env
GRANITE_SERVING_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434 # or http://ollama:11434 in Docker
Verifying Ollama
# Check running models
ollama list
# Test inference
ollama run granite-soc:latest "Analyze this alert: ET TROJAN Cobalt Strike"
# Check model details
ollama show granite-soc:latest
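You can run the same inference test from Python against Ollama's native /api/chat endpoint. This is a minimal stdlib-only sketch; the `build_chat_request` helper is illustrative, not part of the repo:

```python
import json
import urllib.request

def build_chat_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a request for Ollama's native /api/chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()
    return urllib.request.Request(
        f"{host}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:11434", "granite-soc:latest",
        "Analyze this alert: ET TROJAN Cobalt Strike",
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the reply under "message"
        print(json.load(resp)["message"]["content"])
```

With `"stream": False` the server returns a single JSON object, which keeps the client trivial; omit it to receive newline-delimited streaming chunks instead.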
Ollama Performance Tuning
| Parameter | Default | Tuning |
|---|---|---|
| OLLAMA_NUM_PARALLEL | 1 | Increase for concurrent requests (uses more VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Increase if running per-agent models (each stays in VRAM) |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep model in memory after last request |
| OLLAMA_GPU_LAYERS | Auto | Number of layers offloaded to GPU. Lower = less VRAM usage |
For per-agent models, set OLLAMA_MAX_LOADED_MODELS to the number of distinct models you expect to be active simultaneously:
# Allow up to 4 models in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=4
vLLM
Why vLLM?
vLLM is the recommended backend for production because:
- Continuous batching — serves multiple requests simultaneously without queuing
- PagedAttention — efficient GPU memory management, up to 24× higher throughput than naive batching
- OpenAI-compatible API — drop-in replacement for any OpenAI client
- Tensor parallelism — split large models across multiple GPUs
- Speculative decoding — faster generation with draft models
Requirements
- GPU: NVIDIA GPU with ≥16 GB VRAM (for Granite 4 Tiny FP16)
- CUDA: 12.0+
- Python: 3.10+
Installation
pip install vllm
Starting vLLM
# Serve a fine-tuned model
python training/scripts/serve_model.py vllm \
--model-path training/output/generic/merged_fp16
# Or start vLLM directly
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096
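vLLM can take a minute or more to load a model, so scripts that start the server usually need to wait for it. A sketch of a readiness poll against /v1/models (the function name and the injectable `probe` parameter are illustrative, not part of the repo):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(base: str = "http://localhost:8000",
                  timeout: float = 120.0,
                  probe=None) -> bool:
    """Poll /v1/models until the server answers, or the timeout expires."""
    if probe is None:
        def probe() -> bool:
            try:
                with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(2)  # back off between probes while the model loads
    return False
```

The `probe` parameter exists so the loop can be exercised without a live server; in normal use you would just call `wait_for_vllm()` before sending traffic.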
vLLM in Docker Compose
The training compose file includes a vLLM service:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
volumes:
- ./training/output:/models
command: >
--model /models/generic/merged_fp16
--host 0.0.0.0
--port 8000
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Configuration
# .env
GRANITE_SERVING_BACKEND=vllm
VLLM_API_BASE=http://localhost:8000 # or http://vllm:8000 in Docker
vLLM Performance Tuning
| Parameter | Default | Description |
|---|---|---|
| --max-model-len | Model's max | Maximum sequence length. Lower = less VRAM. |
| --tensor-parallel-size | 1 | Number of GPUs for model parallelism. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use. Lower for shared GPUs. |
| --max-num-batched-tokens | Auto | Maximum tokens in a single batch. |
| --quantization | None | Runtime quantization: awq, gptq, squeezellm. |
Multi-GPU with vLLM
For larger models or higher throughput:
# Spread across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--tensor-parallel-size 2
Verifying vLLM
# Check loaded models
curl http://localhost:8000/v1/models | jq
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "granite-soc",
"messages": [
{"role": "system", "content": "You are a security analyst."},
{"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}
],
"temperature": 0.1,
"max_tokens": 512
}'
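Because the API is OpenAI-compatible, the same call works from any OpenAI client. Here is a stdlib-only Python equivalent of the curl test above; the `chat_completion_request` helper is illustrative, not part of the repo:

```python
import json
import urllib.request

def chat_completion_request(base: str, model: str, messages: list[dict],
                            temperature: float = 0.1,
                            max_tokens: int = 512) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = chat_completion_request(
        "http://localhost:8000", "granite-soc",
        [{"role": "system", "content": "You are a security analyst."},
         {"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}],
    )
    with urllib.request.urlopen(req) as resp:
        # OpenAI-style responses carry replies under choices[n].message
        print(json.load(resp)["choices"][0]["message"]["content"])
```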
How Backends Connect to BeeAI
The Granite module translates the backend choice into the correct ChatModel constructor. The agent doesn't know or care which backend is behind ChatModel; this abstraction allows seamless backend switching without changing any agent code.
Model Registry Health Checks
The registry.py module verifies backend availability:
# Check which models are available in Ollama
available = await check_ollama_models(ollama_host)
# Returns: ["granite-soc:latest", "granite-soc-threat-hunter:latest", ...]
# Check vLLM models
available = await check_vllm_models(vllm_base)
# Returns: ["granite-soc"]
# Warmup a model (pre-load into GPU memory)
await warmup_model(config, agent_name="security_analyst")
These checks are used during:
- Startup — verify required models are available before accepting requests
- Health endpoints — report model status in /health API responses
- Auto-recovery — detect when a backend goes down and fail gracefully
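Under the hood, each backend exposes a model-listing endpoint: Ollama's GET /api/tags returns a `models` array of objects with a `name` field, and vLLM's OpenAI-style GET /v1/models returns a `data` array of objects with an `id` field. A sketch of the response parsing (the actual registry.py helpers are async and not shown here; these function names are illustrative):

```python
def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from Ollama's GET /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def parse_vllm_models(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style GET /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]
```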
Choosing a Backend: Decision Tree
- Serving many users or need production throughput? Use vLLM.
- No GPU, or deploying to edge hardware? Use Ollama.
- Iterating on prompts or fine-tunes locally? Use Ollama.
- Otherwise, start with Ollama and move to vLLM once you need concurrent, high-throughput serving.
Next Steps
- Local Deployment — complete end-to-end setup
- Training: Evaluation & Export — export models for each backend