Serving Backends
AuroraSOC supports three serving backends for hosting LLM inference: Ollama (for local/edge deployment), vLLM (for production), and OpenAI-compatible APIs (for cloud providers or BYO endpoints). This page explains when to use each, how to configure them, and how they integrate with the agent framework.
Backend Comparison
| Feature | Ollama | vLLM | OpenAI-compatible |
|---|---|---|---|
| Model format | GGUF (quantized) | FP16/BF16 (full precision) | Provider-managed |
| GPU required | No (CPU fallback) | Yes | No (remote) |
| Min VRAM | 4-8 GB (GGUF) | 8-16 GB (FP16) | N/A |
| Concurrent requests | Sequential | Batched (continuous batching) | Provider-dependent |
| Throughput | ~1-5 req/sec | ~50-500 req/sec | Provider-dependent |
| Latency (first token) | ~100-500ms | ~50-200ms | Network + provider |
| Setup complexity | Simple (ollama create) | Moderate | Minimal (env vars only) |
| API protocol | Ollama native API | OpenAI-compatible /v1/chat/completions | OpenAI-compatible /v1/chat/completions |
| Best for | Development, edge, single-user | Production, multi-user, high-throughput | Cloud models, rapid prototyping, BYO servers |
Ollama
Why Ollama?
Ollama is the recommended backend for development and local deployment because:
- Zero-config setup — install and run
- Works without a GPU (CPU fallback, slower but functional)
- Tiny memory footprint with GGUF quantization
- Model management built-in (pull, create, list, delete)
- Great for testing and iterating on prompts
Installation
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Verify
ollama --version
Pulling the Base Model
ollama pull granite4:8b
This downloads the base Granite 4 model from the Ollama registry (~1.5 GB).
Importing a Fine-Tuned Model
After training, import your GGUF file:
# Using serve_model.py (recommended — handles Modelfile generation)
python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest
# Using the Makefile
make train-serve-ollama
The script generates a Modelfile that includes the Granite 4 chat template:
FROM /path/to/unsloth.Q8_0.gguf
TEMPLATE """{{- if .System }}<|start_of_role|>system<|end_of_role|>
{{ .System }}<|end_of_text|>
{{- end }}
<|start_of_role|>user<|end_of_role|>
{{ .Prompt }}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
{{ .Response }}<|end_of_text|>"""
PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|start_of_role|>"
Why the template matters: Without this template, Ollama uses a generic chat format. Granite 4 models are trained with <|start_of_role|>...<|end_of_role|> delimiters — using the wrong template causes the model to produce incoherent output.
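To make the delimiters concrete, the snippet below reconstructs in Python roughly the string that the template above renders for a single turn. This is purely illustrative (Ollama applies the template for you), and the helper name is made up for the example.

```python
# Illustration only: roughly the prompt string the Modelfile template above
# renders for one turn. Ollama does this automatically; this just makes the
# Granite 4 role delimiters visible.
def render_granite_prompt(user: str, system: str | None = None) -> str:
    parts = []
    if system:
        parts.append(f"<|start_of_role|>system<|end_of_role|>\n{system}<|end_of_text|>\n")
    parts.append(f"<|start_of_role|>user<|end_of_role|>\n{user}<|end_of_text|>\n")
    parts.append("<|start_of_role|>assistant<|end_of_role|>\n")  # the model completes from here
    return "".join(parts)

print(render_granite_prompt(
    "Analyze this alert: ET TROJAN Cobalt Strike",
    system="You are a security analyst.",
))
```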
Ollama in Docker Compose
The main docker-compose.yml includes an Ollama service:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Other services reference it via OLLAMA_BASE_URL=http://ollama:11434.
Configuration
# .env
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434 # or http://ollama:11434 in Docker
OLLAMA_MODEL=granite4:8b
OLLAMA_ORCHESTRATOR_MODEL=granite4:dense
Verifying Ollama
# Check running models
ollama list
# Test inference
ollama run granite-soc:latest "Analyze this alert: ET TROJAN Cobalt Strike"
# Check model details
ollama show granite-soc:latest
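The same checks can be scripted against Ollama's REST API using the OLLAMA_BASE_URL from the configuration above. Below is a minimal sketch with the requests library; /api/tags and /api/generate are Ollama's standard endpoints, and the model name is the example used throughout this page.

```python
# Minimal sketch: confirm the fine-tuned model is present and responding.
import os
import requests

base = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
model = os.getenv("OLLAMA_MODEL", "granite-soc:latest")

# /api/tags lists locally available models (the API equivalent of `ollama list`)
tags = requests.get(f"{base}/api/tags", timeout=10).json()
names = [m["name"] for m in tags.get("models", [])]
assert model in names, f"{model} not found; available: {names}"

# /api/generate runs a one-off, non-streaming test completion
resp = requests.post(
    f"{base}/api/generate",
    json={"model": model, "prompt": "Analyze this alert: ET TROJAN Cobalt Strike", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"][:200])
```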
Ollama Performance Tuning
| Parameter | Default | Tuning |
|---|---|---|
| OLLAMA_NUM_PARALLEL | 1 | Increase for concurrent requests (uses more VRAM) |
| OLLAMA_MAX_LOADED_MODELS | 1 | Increase if running per-agent models (each stays in VRAM) |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep the model in memory after the last request |
| OLLAMA_GPU_LAYERS | Auto | Number of layers offloaded to the GPU; lower = less VRAM usage |
For per-agent models, set OLLAMA_MAX_LOADED_MODELS to the number of distinct models you expect to be active simultaneously:
# Allow up to 4 models in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=4
vLLM
Why vLLM?
vLLM is the recommended backend for production because:
- Continuous batching — serves multiple requests simultaneously without queuing
- PagedAttention — efficient GPU memory management; up to 24× higher throughput than naive, request-at-a-time serving
- OpenAI-compatible API — drop-in replacement for any OpenAI client
- Tensor parallelism — split large models across multiple GPUs
- Speculative decoding — faster generation with draft models
Requirements
- GPU: NVIDIA GPU with ≥16 GB VRAM (for Granite 4 Tiny FP16)
- CUDA: 12.0+
- Python: 3.10+
Installation
pip install vllm
Starting vLLM
# Serve a fine-tuned model
python training/scripts/serve_model.py vllm \
--model training/output/generic/merged_fp16 \
--served-model-name granite-soc-specialist
# Or start vLLM directly
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096
vLLM in Docker Compose
The training compose file includes a vLLM service:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
volumes:
- ./training/output:/models
command: >
--model /models/generic/merged_fp16
--host 0.0.0.0
--port 8000
--max-model-len 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Configuration
# .env
LLM_BACKEND=vllm
VLLM_BASE_URL=http://localhost:8000/v1 # or http://vllm:8000/v1 in Docker
VLLM_MODEL=granite-soc-specialist
VLLM_ORCHESTRATOR_MODEL=granite-soc-specialist
vLLM Performance Tuning
| Parameter | Default | Description |
|---|---|---|
| --max-model-len | Model's max | Maximum sequence length. Lower = less VRAM. |
| --tensor-parallel-size | 1 | Number of GPUs for model parallelism. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use. Lower for shared GPUs. |
| --max-num-batched-tokens | Auto | Maximum tokens in a single batch. |
| --quantization | None | Runtime quantization: awq, gptq, bitsandbytes, squeezellm. |
Consumer GPU Deployment (RTX 4060 / 8 GB VRAM)
AuroraSOC runs on consumer-class GPUs using bitsandbytes INT4 quantization. This reduces model VRAM from ~16 GB (FP16) to ~5 GB, fitting comfortably in an 8 GB card.
python -m vllm.entrypoints.openai.api_server \
--model ibm-granite/granite-4.1-8b \
--quantization bitsandbytes \
--enforce-eager \
--gpu-memory-utilization 0.90 \
--dtype bfloat16 \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--host 0.0.0.0 --port 8000
Flag rationale:
| Flag | Value | Reason |
|---|---|---|
| --quantization bitsandbytes | INT4 | Reduces 8B model from ~16 GB to ~5 GB VRAM |
| --enforce-eager | — | Disables CUDA graph capture; required for bitsandbytes compatibility |
| --gpu-memory-utilization | 0.90 | Leaves 10% headroom for KV-cache overhead and OS |
| --dtype bfloat16 | bfloat16 | Preserves accuracy vs float16 on Ampere+ architectures |
| --max-model-len | 8192 | Caps sequence length to control KV-cache VRAM consumption |
| --tensor-parallel-size | 1 | Single-GPU deployment |
These flags are exposed via environment variables (VLLM_GPU_MEMORY_UTIL, VLLM_MAX_MODEL_LEN) and applied automatically when using docker-compose.gpu.yml.
Note: For ≥16 GB VRAM (e.g. RTX 3090, A10G), drop --quantization and --enforce-eager for full FP16/BF16 throughput.
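As a rough sanity check on the figures above: the saving comes almost entirely from weight precision, about 2 bytes per parameter in BF16 versus about 0.5 bytes in INT4. The overhead figure in the sketch below is an assumption, not a measurement.

```python
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
# KV cache, activations, and quantization metadata come on top; the 1-2 GiB
# overhead mentioned below is a rough assumption.
params = 8e9

bf16_gib = params * 2.0 / 1024**3   # ~2 bytes per parameter -> ~14.9 GiB
int4_gib = params * 0.5 / 1024**3   # ~0.5 bytes per parameter -> ~3.7 GiB

print(f"BF16 weights: ~{bf16_gib:.1f} GiB")
print(f"INT4 weights: ~{int4_gib:.1f} GiB (plus ~1-2 GiB overhead, roughly the ~5 GB quoted above)")
```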
Multi-GPU with vLLM
For larger models or higher throughput:
# Spread across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model training/output/generic/merged_fp16 \
--tensor-parallel-size 2
Verifying vLLM
# Check loaded models
curl http://localhost:8000/v1/models | jq
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "granite-soc-specialist",
"messages": [
{"role": "system", "content": "You are a security analyst."},
{"role": "user", "content": "Analyze: ET TROJAN Cobalt Strike Beacon"}
],
"temperature": 0.1,
"max_tokens": 512
}'
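Because vLLM implements the OpenAI protocol, the standard openai Python client works unchanged once base_url points at the server; sending several requests at once also exercises continuous batching. Below is a sketch using the same model name and endpoint as the curl example (the extra alert strings are just illustrative inputs).

```python
# Sketch: drive vLLM with the standard OpenAI client; concurrent requests are
# served via continuous batching instead of queuing.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def analyze(alert: str) -> str:
    resp = await client.chat.completions.create(
        model="granite-soc-specialist",
        messages=[
            {"role": "system", "content": "You are a security analyst."},
            {"role": "user", "content": f"Analyze: {alert}"},
        ],
        temperature=0.1,
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main() -> None:
    alerts = [
        "ET TROJAN Cobalt Strike Beacon",
        "ET SCAN Nmap Scripting Engine",
        "ET POLICY SSH Brute Force Attempt",
    ]
    results = await asyncio.gather(*(analyze(a) for a in alerts))
    for alert, result in zip(alerts, results):
        print(f"{alert} -> {result[:120]}")

asyncio.run(main())
```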
OpenAI-Compatible APIs
Any endpoint that implements the OpenAI /v1/chat/completions contract can be used. This includes cloud providers (Together AI, Groq, Fireworks AI, OpenAI) and local servers (llama.cpp, LM Studio, LocalAI).
Environment Variables
| Variable | Purpose |
|---|---|
| LLM_BACKEND=openai | Select the generic OpenAI backend |
| OPENAI_COMPATIBLE_BASE_URL | Full base URL (e.g. https://api.together.xyz/v1) |
| OPENAI_COMPATIBLE_MODEL | Primary model name as listed by the provider |
| OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL | Optional separate orchestrator model (falls back to above) |
| OPENAI_COMPATIBLE_API_KEY | Bearer token for authentication (if required) |
docker-compose.yml
services:
aurorasoc:
environment:
LLM_BACKEND: "openai"
OPENAI_COMPATIBLE_BASE_URL: "https://api.groq.com/openai/v1"
OPENAI_COMPATIBLE_MODEL: "llama-3.3-70b-versatile"
OPENAI_COMPATIBLE_API_KEY: "${GROQ_API_KEY}"
Verify connectivity
# Confirm the endpoint is reachable and lists models
curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" \
-H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY" | head -c 500
How Backends Connect to BeeAI
The Granite module translates the backend choice (LLM_BACKEND plus the backend-specific variables above) into the correct ChatModel constructor.
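A simplified sketch of that dispatch logic is shown below. The function and the settings dictionary are illustrative only; the real module hands equivalent values to BeeAI's ChatModel constructor, whose exact signature is not reproduced here.

```python
# Illustrative sketch of the backend dispatch; names here are placeholders,
# not the Granite module's actual code or BeeAI's actual ChatModel signature.
import os

def resolve_backend_settings() -> dict:
    backend = os.getenv("LLM_BACKEND", "ollama")
    if backend == "ollama":
        return {
            "provider": "ollama",
            "base_url": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
            "model": os.getenv("OLLAMA_MODEL", "granite4:8b"),
        }
    if backend == "vllm":
        # vLLM speaks the OpenAI protocol, so it is configured like an OpenAI endpoint
        return {
            "provider": "openai",
            "base_url": os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
            "model": os.getenv("VLLM_MODEL", "granite-soc-specialist"),
        }
    if backend == "openai":
        return {
            "provider": "openai",
            "base_url": os.environ["OPENAI_COMPATIBLE_BASE_URL"],
            "model": os.environ["OPENAI_COMPATIBLE_MODEL"],
            "api_key": os.getenv("OPENAI_COMPATIBLE_API_KEY"),
        }
    raise ValueError(f"Unsupported LLM_BACKEND: {backend}")
```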
The agent doesn't know or care which backend is behind ChatModel. This abstraction allows seamless backend switching without changing any agent code.
Model Registry Health Checks
The registry.py module verifies backend availability:
# Check which models are available in Ollama
available = await check_ollama_models(ollama_host)
# Returns: list[ModelStatus]
# Check vLLM models
available = await check_vllm_models(vllm_base)
# Returns: list[ModelStatus]
# Check OpenAI-compatible endpoint models
available = await check_openai_compatible_models(base_url, api_key)
# Returns: list[ModelStatus]
# Warmup a model (pre-load into GPU memory)
await warmup_model(config, agent_name="security_analyst")
These checks are used during:
- Startup — verify required models are available before accepting requests
- Health endpoints — report model status in /health API responses
- Auto-recovery — detect when a backend goes down and fail gracefully
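A startup check built on these helpers might look like the sketch below. The import path and the ModelStatus attribute used for the comparison are assumptions, since only the function names appear above.

```python
# Sketch of a startup verification using the registry.py helpers shown above.
# The import path and the `ModelStatus.name` attribute are assumptions.
import asyncio
import os

from registry import check_ollama_models, warmup_model  # assumed module path

REQUIRED_MODELS = {os.getenv("OLLAMA_MODEL", "granite-soc:latest")}

async def verify_models_on_startup(config) -> None:
    host = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    statuses = await check_ollama_models(host)
    available = {status.name for status in statuses}
    missing = REQUIRED_MODELS - available
    if missing:
        raise RuntimeError(f"Required models missing from Ollama: {missing}")
    # Pre-load the model so the first real request doesn't pay the load latency
    await warmup_model(config, agent_name="security_analyst")
```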
Choosing a Backend: Decision Tree
- Developing locally, running at the edge, or no dedicated GPU → Ollama
- Production with multiple concurrent users and a GPU available → vLLM (bitsandbytes INT4 on 8 GB cards, FP16/BF16 on ≥16 GB)
- No local hardware, hosted cloud models, or an existing BYO server → OpenAI-compatible
Next Steps
- Local Deployment — complete end-to-end setup
- Training: Evaluation & Export — export models for each backend