Serving Backends
AuroraSOC supports three serving backends for hosting LLM inference: Ollama (for local/edge deployment), vLLM (for production), and OpenAI-compatible APIs (for cloud providers or BYO endpoints). This page explains when to use each, how to configure them, and how they integrate with the agent framework.
Backend Comparison
| Feature | Ollama | vLLM | OpenAI-compatible |
|---|---|---|---|
| Model format | GGUF (quantized) | FP16/BF16 (full precision) | Provider-managed |
| GPU required | No (CPU fallback) | Yes | No (remote) |
| Min VRAM | 4-8 GB (GGUF) | 8-16 GB (FP16) | N/A |
| Concurrent requests | Sequential | Batched (continuous batching) | Provider-dependent |
| Throughput | ~1-5 req/sec | ~50-500 req/sec | Provider-dependent |
| Latency (first token) | ~100-500ms | ~50-200ms | Network + provider |
| Setup complexity | Simple (ollama create) | Moderate | Minimal (env vars only) |
| API protocol | Ollama native API | OpenAI-compatible /v1/chat/completions | OpenAI-compatible /v1/chat/completions |
| Best for | Development, edge, single-user | Production, multi-user, high-throughput | Cloud models, rapid prototyping, BYO servers |
Ollama
Why Ollama?
Ollama is the recommended backend for development and local deployment because:
- Zero-config setup — install and run
- Works without a GPU (CPU fallback, slower but functional)
- Tiny memory footprint with GGUF quantization
- Model management built-in (pull, create, list, delete)
- Great for testing and iterating on prompts
Installation
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Verify
ollama --version
Pulling the Base Model
ollama pull granite4:8b
This downloads the base Granite 4 model from the Ollama registry (~1.5 GB).
Importing a Fine-Tuned Model
After training, import your GGUF file:
# Using serve_model.py (recommended — handles Modelfile generation)
python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest
# Using the Makefile
make train-serve-ollama
The script generates a Modelfile that includes the Granite 4 chat template:
FROM /path/to/unsloth.Q8_0.gguf
TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""
PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_of_role|>"
Why the template matters: Without this template, Ollama uses a generic chat format. Granite 4 models are trained with <|start_of_role|>...<|end_of_role|> delimiters — using the wrong template causes the model to produce incoherent output.
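To make the delimiters concrete, the snippet below reconstructs in Python roughly the string that the template above renders for a single turn. This is purely illustrative (Ollama applies the template for you), and the helper name is made up for the example.

```python
# Illustration only: roughly the prompt string the Modelfile template above
# renders for one turn. Ollama does this automatically; this just makes the
# Granite 4 role delimiters visible.
def render_granite_prompt(user: str, system: str | None = None) -> str:
    parts = []
    if system:
        parts.append(f"<|start_of_role|>system<|end_of_role|>\n{system}<|end_of_text|>\n")
    parts.append(f"<|start_of_role|>user<|end_of_role|>\n{user}<|end_of_text|>\n")
    parts.append("<|start_of_role|>assistant<|end_of_role|>\n")  # the model completes from here
    return "".join(parts)

print(render_granite_prompt(
    "Analyze this alert: ET TROJAN Cobalt Strike",
    system="You are a security analyst.",
))
```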
Ollama in Docker Compose
The main docker-compose.yml includes an Ollama service:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Other services reference it via OLLAMA_BASE_URL=http://ollama:11434.
Configuration
# .env
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434 # or http://ollama:11434 in Docker
OLLAMA_MODEL=granite4:8b
OLLAMA_ORCHESTRATOR_MODEL=granite4:dense
Verifying Ollama
# Check running models
ollama list
# Test inference
ollama run granite-soc:latest "Analyze this alert: ET TROJAN Cobalt Strike"
# Check model details
ollama show granite-soc:latest
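The same checks can be scripted against Ollama's REST API using the OLLAMA_BASE_URL from the configuration above. Below is a minimal sketch with the requests library; /api/tags and /api/generate are Ollama's standard endpoints, and the model name is the example used throughout this page.

```python
# Minimal sketch: confirm the fine-tuned model is present and responding.
import os
import requests

base = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
model = os.getenv("OLLAMA_MODEL", "granite-soc:latest")

# /api/tags lists locally available models (the API equivalent of `ollama list`)
tags = requests.get(f"{base}/api/tags", timeout=10).json()
names = [m["name"] for m in tags.get("models", [])]
assert model in names, f"{model} not found; available: {names}"

# /api/generate runs a one-off, non-streaming test completion
resp = requests.post(
    f"{base}/api/generate",
    json={"model": model, "prompt": "Analyze this alert: ET TROJAN Cobalt Strike", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"][:200])
```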
Ollama Performance Tuning
| Parameter | Default | Tuning |
|---|---|---|
| OLLAMA_NUM_PARALLEL | 1 | Increase for concurrent requests (uses more VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Increase if running per-agent models (each stays in VRAM) |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep the model in memory after the last request |
| OLLAMA_GPU_LAYERS | Auto | Number of layers offloaded to the GPU; lower = less VRAM usage |
For per-agent models, set OLLAMA_MAX_LOADED_MODELS to the number of distinct models you expect to be active simultaneously:
# Allow up to 4 models in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=4
vLLM
Why vLLM?
vLLM is the recommended backend for production because:
- Continuous batching — serves multiple requests simultaneously without queuing
- PagedAttention — efficient GPU memory management; up to 24× higher throughput than naive, request-at-a-time serving
- OpenAI-compatible API — drop-in replacement for any OpenAI client
- Tensor parallelism — split large models across multiple GPUs
- Speculative decoding — faster generation with draft models
Requirements
- GPU: NVIDIA GPU with ≥16 GB VRAM (for Granite 4 Tiny FP16)
- CUDA: 12.0+
- Python: 3.10+
Installation
pip install vllm
Starting vLLM
# Serve a fine-tuned model
python training/scripts/serve_model.py vllm \
--model training/output/generic/merged_fp16 \
--served-model-name granite-soc-specialist
# Or start vLLM directly
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096
vLLM in Docker Compose
The training compose file includes a vLLM service:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
volumes:
- ./training/output:/models
command: >
--model /models/generic/merged_fp16
--host 0.0.0.0
--port 8000
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Configuration
# .env
LLM_BACKEND=vllm
VLLM_BASE_URL=http://localhost:8000/v1 # or http://vllm:8000/v1 in Docker
VLLM_MODEL=granite-soc-specialist
VLLM_ORCHESTRATOR_MODEL=granite-soc-specialist
vLLM Performance Tuning
| Parameter | Default | Description |
|---|---|---|
| --max-model-len | Model's max | Maximum sequence length. Lower = less VRAM. |
| --tensor-parallel-size | 1 | Number of GPUs for model parallelism. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use. Lower for shared GPUs. |
| --max-num-batched-tokens | Auto | Maximum tokens in a single batch. |
| --quantization | None | Runtime quantization: awq, gptq, bitsandbytes, squeezellm. |
Consumer GPU Deployment (RTX 4060 / 8 GB VRAM)
AuroraSOC runs on consumer-class GPUs using bitsandbytes INT4 quantization. This reduces model VRAM from ~16 GB (FP16) to ~5 GB, fitting comfortably in an 8 GB card.
python -m vllm.entrypoints.openai.api_server \
--model ibm-granite/granite-4.1-8b \
--quantization bitsandbytes \
--enforce-eager \
--gpu-memory-utilization 0.90 \
--dtype bfloat16 \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--host 0.0.0.0 --port 8000
Flag rationale:
| Flag | Value | Reason |
|---|---|---|
| --quantization bitsandbytes | INT4 | Reduces 8B model from ~16 GB to ~5 GB VRAM |
| --enforce-eager | — | Disables CUDA graph capture; required for bitsandbytes compatibility |
| --gpu-memory-utilization | 0.90 | Leaves 10% headroom for KV-cache overhead and OS |
| --dtype bfloat16 | bfloat16 | Preserves accuracy vs float16 on Ampere+ architectures |
| --max-model-len | 8192 | Caps sequence length to control KV-cache VRAM consumption |
| --tensor-parallel-size | 1 | Single-GPU deployment |
These flags are exposed via environment variables (VLLM_GPU_MEMORY_UTIL, VLLM_MAX_MODEL_LEN) and applied automatically when using docker-compose.gpu.yml.
Note: For ≥16 GB VRAM (e.g. RTX 3090, A10G), drop --quantization and --enforce-eager for full FP16/BF16 throughput.
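As a rough sanity check on the figures above: the saving comes almost entirely from weight precision, about 2 bytes per parameter in BF16 versus about 0.5 bytes in INT4. The overhead figure in the sketch below is an assumption, not a measurement.

```python
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
# KV cache, activations, and quantization metadata come on top; the 1-2 GiB
# overhead mentioned below is a rough assumption.
params = 8e9

bf16_gib = params * 2.0 / 1024**3   # ~2 bytes per parameter -> ~14.9 GiB
int4_gib = params * 0.5 / 1024**3   # ~0.5 bytes per parameter -> ~3.7 GiB

print(f"BF16 weights: ~{bf16_gib:.1f} GiB")
print(f"INT4 weights: ~{int4_gib:.1f} GiB (plus ~1-2 GiB overhead, roughly the ~5 GB quoted above)")
```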
Multi-GPU with vLLM
For larger models or higher throughput:
# Spread across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--tensor-parallel-size 2
Verifying vLLM
# Check loaded models
curl http://localhost:8000/v1/models | jq
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "granite-soc-specialist",
"messages": [
{"role": "system", "content": "You are a security analyst."},
{"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}
],
"temperature": 0.1,
"max_tokens": 512
}'
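Because vLLM implements the OpenAI protocol, the standard openai Python client works unchanged once base_url points at the server; sending several requests at once also exercises continuous batching. Below is a sketch using the same model name and endpoint as the curl example (the extra alert strings are just illustrative inputs).

```python
# Sketch: drive vLLM with the standard OpenAI client; concurrent requests are
# served via continuous batching instead of queuing.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def analyze(alert: str) -> str:
    resp = await client.chat.completions.create(
        model="granite-soc-specialist",
        messages=[
            {"role": "system", "content": "You are a security analyst."},
            {"role": "user", "content": f"Analyze: {alert}"},
        ],
        temperature=0.1,
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main() -> None:
    alerts = [
        "ET TROJAN Cobalt Strike Beacon",
        "ET SCAN Nmap Scripting Engine",
        "ET POLICY SSH Brute Force Attempt",
    ]
    results = await asyncio.gather(*(analyze(a) for a in alerts))
    for alert, result in zip(alerts, results):
        print(f"{alert} -> {result[:120]}")

asyncio.run(main())
```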
OpenAI-Compatible APIs
Any endpoint that implements the OpenAI /v1/chat/completions contract can be used. This includes cloud providers (Together AI, Groq, Fireworks AI, OpenAI) and local servers (llama.cpp, LM Studio, LocalAI).
Environment Variables
| Variable | Purpose |
|---|---|
| LLM_BACKEND=openai | Select the generic OpenAI backend |
| OPENAI_COMPATIBLE_BASE_URL | Full base URL (e.g. https://api.together.xyz/v1) |
| OPENAI_COMPATIBLE_MODEL | Primary model name as listed by the provider |
| OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL | Optional separate orchestrator model (falls back to above) |
| OPENAI_COMPATIBLE_API_KEY | Bearer token for authentication (if required) |
docker-compose.yml
services:
aurorasoc:
environment:
LLM_BACKEND: "openai"
OPENAI_COMPATIBLE_BASE_URL: "https://api.groq.com/openai/v1"
OPENAI_COMPATIBLE_MODEL: "llama-3.3-70b-versatile"
OPENAI_COMPATIBLE_API_KEY: "${GROQ_API_KEY}"
Verify connectivity
# Confirm the endpoint is reachable and lists models
curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" \
-H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY" | head -c 500
How Backends Connect to BeeAI
The Granite module translates the backend choice (LLM_BACKEND plus the backend-specific variables above) into the correct ChatModel constructor.
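A simplified sketch of that dispatch logic is shown below. The function and the settings dictionary are illustrative only; the real module hands equivalent values to BeeAI's ChatModel constructor, whose exact signature is not reproduced here.

```python
# Illustrative sketch of the backend dispatch; names here are placeholders,
# not the Granite module's actual code or BeeAI's actual ChatModel signature.
import os

def resolve_backend_settings() -> dict:
    backend = os.getenv("LLM_BACKEND", "ollama")
    if backend == "ollama":
        return {
            "provider": "ollama",
            "base_url": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
            "model": os.getenv("OLLAMA_MODEL", "granite4:8b"),
        }
    if backend == "vllm":
        # vLLM speaks the OpenAI protocol, so it is configured like an OpenAI endpoint
        return {
            "provider": "openai",
            "base_url": os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
            "model": os.getenv("VLLM_MODEL", "granite-soc-specialist"),
        }
    if backend == "openai":
        return {
            "provider": "openai",
            "base_url": os.environ["OPENAI_COMPATIBLE_BASE_URL"],
            "model": os.environ["OPENAI_COMPATIBLE_MODEL"],
            "api_key": os.getenv("OPENAI_COMPATIBLE_API_KEY"),
        }
    raise ValueError(f"Unsupported LLM_BACKEND: {backend}")
```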
The agent doesn't know or care which backend is behind ChatModel. This abstraction allows seamless backend switching without changing any agent code.
Model Registry Health Checks
The registry.py module verifies backend availability:
# Check which models are available in Ollama
available = await check_ollama_models(ollama_host)
# Returns: list[ModelStatus]
# Check vLLM models
available = await check_vllm_models(vllm_base)
# Returns: list[ModelStatus]
# Check OpenAI-compatible endpoint models
available = await check_openai_compatible_models(base_url, api_key)
# Returns: list[ModelStatus]
# Warmup a model (pre-load into GPU memory)
await warmup_model(config, agent_name="security_analyst")
These checks are used during:
- Startup — verify required models are available before accepting requests
- Health endpoints — report model status in /health API responses
- Auto-recovery — detect when a backend goes down and fail gracefully
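A startup check built on these helpers might look like the sketch below. The import path and the ModelStatus attribute used for the comparison are assumptions, since only the function names appear above.

```python
# Sketch of a startup verification using the registry.py helpers shown above.
# The import path and the `ModelStatus.name` attribute are assumptions.
import asyncio
import os

from registry import check_ollama_models, warmup_model  # assumed module path

REQUIRED_MODELS = {os.getenv("OLLAMA_MODEL", "granite-soc:latest")}

async def verify_models_on_startup(config) -> None:
    host = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    statuses = await check_ollama_models(host)
    available = {status.name for status in statuses}
    missing = REQUIRED_MODELS - available
    if missing:
        raise RuntimeError(f"Required models missing from Ollama: {missing}")
    # Pre-load the model so the first real request doesn't pay the load latency
    await warmup_model(config, agent_name="security_analyst")
```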
Choosing a Backend: Decision Tree
- Developing locally, running at the edge, or no dedicated GPU → Ollama
- Production with multiple concurrent users and a GPU available → vLLM (bitsandbytes INT4 on 8 GB cards, FP16/BF16 on ≥16 GB)
- No local hardware, hosted cloud models, or an existing BYO server → OpenAI-compatible
Next Steps
- Local Deployment — complete end-to-end setup
- Training: Evaluation & Export — export models for each backend