
Serving Backends

AuroraSOC supports three serving backends for hosting LLM inference: Ollama (for local/edge deployment), vLLM (for production), and OpenAI-compatible APIs (for cloud providers or BYO endpoints). This page explains when to use each, how to configure them, and how they integrate with the agent framework.

Backend Comparison

| Feature | Ollama | vLLM | OpenAI-compatible |
|---|---|---|---|
| Model format | GGUF (quantized) | FP16/BF16 (full precision) | Provider-managed |
| GPU required | No (CPU fallback) | Yes | No (remote) |
| Min VRAM | 4-8 GB (GGUF) | 8-16 GB (FP16) | N/A |
| Concurrent requests | Sequential | Batched (continuous batching) | Provider-dependent |
| Throughput | ~1-5 req/sec | ~50-500 req/sec | Provider-dependent |
| Latency (first token) | ~100-500 ms | ~50-200 ms | Network + provider |
| Setup complexity | Simple (`ollama create`) | Moderate | Minimal (env vars only) |
| API protocol | Ollama native API | OpenAI-compatible `/v1/chat/completions` | OpenAI-compatible `/v1/chat/completions` |
| Best for | Development, edge, single-user | Production, multi-user, high-throughput | Cloud models, rapid prototyping, BYO servers |

Ollama

Why Ollama?

Ollama is the recommended backend for development and local deployment because:

  • Zero-config setup — install and run
  • Works without a GPU (CPU fallback, slower but functional)
  • Tiny memory footprint with GGUF quantization
  • Model management built-in (pull, create, list, delete)
  • Great for testing and iterating on prompts

Installation

```bash
# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Verify
ollama --version
```

Pulling the Base Model

```bash
ollama pull granite4:8b
```

This downloads the base Granite 4 model from the Ollama registry (~1.5 GB).

Importing a Fine-Tuned Model

After training, import your GGUF file:

```bash
# Using serve_model.py (recommended — handles Modelfile generation)
python training/scripts/serve_model.py ollama \
  --gguf training/output/generic/unsloth.Q8_0.gguf \
  --name granite-soc:latest

# Using the Makefile
make train-serve-ollama
```

The script generates a Modelfile that includes the Granite 4 chat template:

```text
FROM /path/to/unsloth.Q8_0.gguf

TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""

PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_of_role|>"
```

Why the template matters: Without this template, Ollama uses a generic chat format. Granite 4 models are trained with <|start_of_role|>...<|end_of_role|> delimiters — using the wrong template causes the model to produce incoherent output.

Ollama in Docker Compose

The main docker-compose.yml includes an Ollama service:

```yaml
ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama_data:/root/.ollama
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```

Other services reference it via OLLAMA_BASE_URL=http://ollama:11434.

Configuration

```bash
# .env
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434  # or http://ollama:11434 in Docker
OLLAMA_MODEL=granite4:8b
OLLAMA_ORCHESTRATOR_MODEL=granite4:dense
```

Verifying Ollama

```bash
# List installed models
ollama list

# Test inference
ollama run granite-soc:latest "Analyze this alert: ET TROJAN Cobalt Strike"

# Check model details
ollama show granite-soc:latest
```

Ollama Performance Tuning

| Parameter | Default | Tuning |
|---|---|---|
| `OLLAMA_NUM_PARALLEL` | 1 | Increase for concurrent requests (uses more VRAM) |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Increase if running per-agent models (each stays in VRAM) |
| `OLLAMA_KEEP_ALIVE` | 5m | How long to keep a model in memory after the last request |
| `OLLAMA_GPU_LAYERS` | Auto | Number of layers offloaded to the GPU; lower = less VRAM usage |

For per-agent models, set OLLAMA_MAX_LOADED_MODELS to the number of distinct models you expect to be active simultaneously:

```bash
# Allow up to 4 models in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=4
```

vLLM

Why vLLM?

vLLM is the recommended backend for production because:

  • Continuous batching — serves multiple requests simultaneously without queuing
  • PagedAttention — efficient GPU memory management, up to 24× higher throughput than naive batching
  • OpenAI-compatible API — drop-in replacement for any OpenAI client
  • Tensor parallelism — split large models across multiple GPUs
  • Speculative decoding — faster generation with draft models

Requirements

  • GPU: NVIDIA GPU with ≥16 GB VRAM (for Granite 4 Tiny FP16)
  • CUDA: 12.0+
  • Python: 3.10+

Installation

```bash
pip install vllm
```

Starting vLLM

```bash
# Serve a fine-tuned model
python training/scripts/serve_model.py vllm \
  --model training/output/generic/merged_fp16 \
  --served-model-name granite-soc-specialist

# Or start vLLM directly
python -m vllm.entrypoints.openai.api_server \
  --model training/output/generic/merged_fp16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096
```

vLLM in Docker Compose

The training compose file includes a vLLM service:

```yaml
vllm:
  image: vllm/vllm-openai:latest
  ports:
    - "8000:8000"
  volumes:
    - ./training/output:/models
  command: >
    --model /models/generic/merged_fp16
    --host 0.0.0.0
    --port 8000
    --max-model-len 4096
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```

Configuration

```bash
# .env
LLM_BACKEND=vllm
VLLM_BASE_URL=http://localhost:8000/v1  # or http://vllm:8000/v1 in Docker
VLLM_MODEL=granite-soc-specialist
VLLM_ORCHESTRATOR_MODEL=granite-soc-specialist
```

vLLM Performance Tuning

| Parameter | Default | Description |
|---|---|---|
| `--max-model-len` | Model's max | Maximum sequence length; lower = less VRAM |
| `--tensor-parallel-size` | 1 | Number of GPUs for model parallelism |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU memory to use; lower for shared GPUs |
| `--max-num-batched-tokens` | Auto | Maximum tokens in a single batch |
| `--quantization` | None | Runtime quantization: `awq`, `gptq`, `bitsandbytes`, `squeezellm` |

Consumer GPU Deployment (RTX 4060 / 8 GB VRAM)

AuroraSOC runs on consumer-class GPUs using bitsandbytes INT4 quantization. This reduces model VRAM from ~16 GB (FP16) to ~5 GB, fitting comfortably in an 8 GB card.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-4.1-8b \
  --quantization bitsandbytes \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 --port 8000
```

Flag rationale:

| Flag | Value | Reason |
|---|---|---|
| `--quantization bitsandbytes` | INT4 | Reduces the 8B model from ~16 GB to ~5 GB VRAM |
| `--enforce-eager` | — | Disables CUDA graph capture; required for bitsandbytes compatibility |
| `--gpu-memory-utilization` | 0.90 | Leaves 10% headroom for KV-cache overhead and the OS |
| `--dtype` | bfloat16 | Preserves accuracy vs float16 on Ampere+ architectures |
| `--max-model-len` | 8192 | Caps sequence length to control KV-cache VRAM consumption |
| `--tensor-parallel-size` | 1 | Single-GPU deployment |

These flags are exposed via environment variables (VLLM_GPU_MEMORY_UTIL, VLLM_MAX_MODEL_LEN) and applied automatically when using docker-compose.gpu.yml.
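A hedged sketch of how that wiring might look in a compose override (the service shape follows the compose file above; the `:-` defaults and exact placement are assumptions, not the actual contents of docker-compose.gpu.yml):

```yaml
vllm:
  image: vllm/vllm-openai:latest
  command: >
    --model ibm-granite/granite-4.1-8b
    --quantization bitsandbytes
    --enforce-eager
    --gpu-memory-utilization ${VLLM_GPU_MEMORY_UTIL:-0.90}
    --max-model-len ${VLLM_MAX_MODEL_LEN:-8192}
```

With this pattern, changing the `.env` values reconfigures the server on the next `docker compose up` without editing the compose file.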

Note: For ≥16 GB VRAM (e.g. RTX 3090, A10G), drop --quantization and --enforce-eager for full FP16/BF16 throughput.

Multi-GPU with vLLM

For larger models or higher throughput:

```bash
# Spread across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model training/output/generic/merged_fp16 \
  --tensor-parallel-size 2
```

Verifying vLLM

```bash
# Check loaded models
curl http://localhost:8000/v1/models | jq

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-soc-specialist",
    "messages": [
      {"role": "system", "content": "You are a security analyst."},
      {"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}
    ],
    "temperature": 0.1,
    "max_tokens": 512
  }'
```

OpenAI-Compatible APIs

Any endpoint that implements the OpenAI /v1/chat/completions contract can be used. This includes cloud providers (Together AI, Groq, Fireworks AI, OpenAI) and local servers (llama.cpp, LM Studio, LocalAI).

Environment Variables

| Variable | Purpose |
|---|---|
| `LLM_BACKEND=openai` | Select the generic OpenAI backend |
| `OPENAI_COMPATIBLE_BASE_URL` | Full base URL (e.g. `https://api.together.xyz/v1`) |
| `OPENAI_COMPATIBLE_MODEL` | Primary model name as listed by the provider |
| `OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL` | Optional separate orchestrator model (falls back to the primary) |
| `OPENAI_COMPATIBLE_API_KEY` | Bearer token for authentication (if required) |

docker-compose.yml

```yaml
services:
  aurorasoc:
    environment:
      LLM_BACKEND: "openai"
      OPENAI_COMPATIBLE_BASE_URL: "https://api.groq.com/openai/v1"
      OPENAI_COMPATIBLE_MODEL: "llama-3.3-70b-versatile"
      OPENAI_COMPATIBLE_API_KEY: "${GROQ_API_KEY}"
```

Verifying Connectivity

```bash
# Confirm the endpoint is reachable and lists models
curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" \
  -H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY" | head -c 500
```

How Backends Connect to BeeAI

The Granite module translates the backend choice into the correct ChatModel constructor.

The agent doesn't know or care which backend is behind ChatModel. This abstraction allows seamless backend switching without changing any agent code.
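A sketch of that dispatch logic (function and default names here are hypothetical; the real code lives in the Granite module and returns BeeAI `ChatModel` instances rather than plain tuples):

```python
# Hypothetical sketch of backend dispatch: map the configured LLM_BACKEND
# to the endpoint and model the ChatModel constructor would receive.

def resolve_backend(env: dict[str, str]) -> tuple[str, str, str]:
    """Return (backend, base_url, model) for the configured backend."""
    backend = env.get("LLM_BACKEND", "ollama")
    if backend == "ollama":
        return (backend,
                env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
                env.get("OLLAMA_MODEL", "granite4:8b"))
    if backend == "vllm":
        return (backend,
                env.get("VLLM_BASE_URL", "http://localhost:8000/v1"),
                env.get("VLLM_MODEL", "granite-soc-specialist"))
    if backend == "openai":
        return (backend,
                env["OPENAI_COMPATIBLE_BASE_URL"],
                env["OPENAI_COMPATIBLE_MODEL"])
    raise ValueError(f"Unknown LLM_BACKEND: {backend!r}")
```

Switching from Ollama to vLLM is then a one-line `.env` change; nothing downstream of this function needs to know which server is answering.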

Model Registry Health Checks

The registry.py module verifies backend availability:

```python
# Check which models are available in Ollama
available = await check_ollama_models(ollama_host)
# Returns: list[ModelStatus]

# Check vLLM models
available = await check_vllm_models(vllm_base)
# Returns: list[ModelStatus]

# Check OpenAI-compatible endpoint models
available = await check_openai_compatible_models(base_url, api_key)
# Returns: list[ModelStatus]

# Warmup a model (pre-load into GPU memory)
await warmup_model(config, agent_name="security_analyst")
```

These checks are used during:

  • Startup — verify required models are available before accepting requests
  • Health endpoints — report model status in /health API responses
  • Auto-recovery — detect when a backend goes down and fail gracefully
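The Ollama check, for example, amounts to querying `/api/tags` and treating any transport error as "backend down". A simplified synchronous sketch (the real `check_ollama_models` in registry.py is async and returns `ModelStatus` objects, not strings):

```python
import json
import urllib.request

def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def check_ollama_models(host: str, timeout: float = 5.0) -> list[str]:
    """Return the models a running Ollama instance reports, or [] if down."""
    try:
        with urllib.request.urlopen(f"{host.rstrip('/')}/api/tags",
                                    timeout=timeout) as resp:
            return parse_ollama_tags(json.load(resp))
    except OSError:
        return []  # backend unreachable: fail gracefully, report no models
```

Returning an empty list instead of raising is what lets the `/health` endpoint and auto-recovery path degrade gracefully when a backend goes down.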

Choosing a Backend: Decision Tree

Next Steps