Inference Backends: vLLM and Ollama — Complete Guide

What Is an Inference Backend?

The inference backend is the engine that actually runs the model and produces answers. AuroraSOC agents are the drivers: they decide what question to ask and when; the backend, the engine under the hood, executes the model and returns the output. Agents do not need to change behavior when you switch backends, because they always send a question and consume a response. What changes is throughput, concurrency behavior, hardware profile, and operating cost.
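
To make that contract concrete, here is a minimal request of the kind every backend in this guide accepts; the host and model name are illustrative values for a local setup:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-soc-specialist",
        "messages": [{"role": "user", "content": "Summarize the latest alert."}]
      }'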

vLLM: The Production Backend

vLLM is AuroraSOC's production default. It originated at UC Berkeley and is now widely used for large-scale model serving in production environments.

Why vLLM is the default

AuroraSOC is a concurrent, multi-agent system. During incident response, the orchestrator and specialist agents can issue inference requests at nearly the same time. vLLM is engineered for that pattern.

Continuous batching in plain language

Traditional backends often process requests in a simple queue, so one request blocks the others. Under SOC load, that means every other agent sits idle while one is served.

vLLM uses continuous batching: as requests arrive, it dynamically groups work into efficient GPU passes instead of running each request in isolation.

Restaurant analogy:

  • Traditional queue: the kitchen cooks one table at a time.
  • Continuous batching: the kitchen coordinates all active tables and keeps every burner utilized.

The result is higher throughput and lower tail latency under concurrent load.
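
If you run vLLM by hand, the batching behavior is tunable at launch. A minimal sketch, assuming a locally exported model directory (the path is illustrative; --max-num-seqs is the real vLLM flag that caps how many sequences are batched together):

vllm serve ./training/output/granite-soc-specialist \
  --served-model-name granite-soc-specialist \
  --max-num-seqs 64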

PagedAttention and KV cache efficiency

A model's key-value (KV) cache is the "scratch paper" it uses to keep context during generation. In many backends, that memory is allocated in rigid chunks, which wastes VRAM.

vLLM's PagedAttention allocates this memory more dynamically, so you can fit more concurrent conversations into the same GPU memory budget.
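
The size of that shared memory pool is set at launch. A minimal sketch with real vLLM flags and illustrative values:

# --gpu-memory-utilization sets the fraction of VRAM vLLM may claim (most of it
# becomes KV-cache pages); --max-model-len caps per-request context so more
# concurrent sessions fit in the pool.
vllm serve ./training/output/granite-soc-specialist \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192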

OpenAI-compatible API

vLLM provides an OpenAI-compatible interface. In practice, code that targets OpenAI-style chat APIs can usually be reused by changing only base URL and model name. That compatibility is a major reason AuroraSOC could migrate to vLLM-default with minimal application-level changes.
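
Concretely, the request shown earlier needs only a different host, key, and model to target a hosted provider (the values below reuse the Together AI example from later in this guide):

curl -s https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3-70b-chat-hf", "messages": [{"role": "user", "content": "Summarize the latest alert."}]}'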

GPU requirements

GPU          VRAM    Suitable for
RTX 3090     24 GB   Specialist model (8B)
RTX 4090     24 GB   Specialist model (8B)
A10G         24 GB   Specialist model (8B)
A100 (40G)   40 GB   Orchestrator or both
A100 (80G)   80 GB   Both models, tensor parallel

Ollama: The Developer Fallback

Ollama is AuroraSOC's developer-friendly fallback. It is simple to run, lets you pull models with a single command, and works on laptops or in environments without production GPUs.
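
Getting a model running locally is a two-command affair (the tags match the ones AuroraSOC expects; the prompt is illustrative):

ollama pull granite4:8b                           # fetch the specialist model
ollama run granite4:8b "Classify this log line"   # one-off sanity check from the terminal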

Why Ollama is not production default

Ollama is excellent for local development and single-agent testing, but it is not optimized for high-concurrency SOC fan-out.

Queueing math example:

  • If one agent call takes ~3 seconds on CPU
  • And 16 specialist agents are effectively queued
  • The last agent may wait about 45 seconds (15 calls ahead × ~3 s each) before its work even starts

That delay is usually unacceptable for active incident response.
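
The arithmetic as a quick shell check, using the assumed figures from above:

AGENTS=16; SECS_PER_CALL=3
echo "$(( (AGENTS - 1) * SECS_PER_CALL ))s until the last agent starts"   # 45s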

OpenAI-Compatible APIs: The Cloud / BYO-Endpoint Backend

AuroraSOC can also route all agent inference to any OpenAI-compatible API — hosted services like Together AI, Groq, Fireworks, or OpenAI itself, as well as local servers such as llama.cpp, LM Studio, or text-generation-inference.

Because BeeAI's ChatModel.from_name("openai:model-name") works with any endpoint that implements the /v1/chat/completions contract, adding a new provider is purely configuration:

export LLM_BACKEND=openai
export OPENAI_COMPATIBLE_BASE_URL=https://api.together.xyz/v1
export OPENAI_COMPATIBLE_MODEL=meta-llama/Llama-3-70b-chat-hf
export OPENAI_COMPATIBLE_API_KEY=<your-key>
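
A quick way to confirm the endpoint and key are valid is to list the provider's models; /v1/models is part of the same OpenAI-compatible contract:

curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" \
  -H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY"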

Supported providers (non-exhaustive)

Provider       Example Base URL                        Notes
Together AI    https://api.together.xyz/v1             Wide model catalog
Groq           https://api.groq.com/openai/v1          Ultra-low latency
Fireworks AI   https://api.fireworks.ai/inference/v1   Fast open-model hosting
OpenAI         https://api.openai.com/v1               GPT-4o, o3-mini, etc.
llama.cpp      http://localhost:8080/v1                Local CPU/GPU inference
LM Studio      http://localhost:1234/v1                Desktop app for local models

When to use the OpenAI-compatible backend

  • You want to leverage cloud-hosted models without running your own GPU infrastructure.
  • You need access to models not available as Ollama tags or vLLM exports (e.g., GPT-4o, Claude-via-proxy).
  • You are prototyping and want instant access to frontier models before fine-tuning Granite locally.

Limitations

  • Per-agent fine-tuned model routing and Granite model normalization do not apply — the configured model name is passed through as-is.
  • Latency depends on the remote provider and network path.
  • Sensitive investigation data leaves your infrastructure boundary (unless the endpoint is on your network).

Backend decision table

Situation                                     Recommended backend
Production deployment                         vLLM
Multi-agent load testing                      vLLM
Cloud model access (GPT-4o, Llama 3, etc.)    OpenAI-compatible
BYO inference server (llama.cpp, TGI)         OpenAI-compatible
Developer laptop (no GPU)                     Ollama
CI/CD smoke test                              Ollama
Single-agent debugging                        Ollama

How to Switch Backends

  1. Open .env.
  2. Set LLM_BACKEND to vllm, ollama, or openai.
  3. Configure the matching variables (a complete Ollama example follows this list):
    • vllm: VLLM_BASE_URL, VLLM_MODEL, VLLM_ORCHESTRATOR_MODEL
    • ollama: OLLAMA_BASE_URL, OLLAMA_MODEL, OLLAMA_ORCHESTRATOR_MODEL (pull models first)
    • openai: OPENAI_COMPATIBLE_BASE_URL, OPENAI_COMPATIBLE_MODEL, and optionally OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL and OPENAI_COMPATIBLE_API_KEY
  4. If switching to Ollama, pull models:
ollama pull granite4:8b && ollama pull granite4:dense
  5. In docker-compose.yml, comment out depends_on: vllm in agent services when operating in Ollama mode.
  6. Start or refresh services:
docker compose up -d
  7. Verify the active backend:
curl http://localhost:8000/api/v1/inference/status
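
As referenced in step 3, a complete Ollama-mode example for .env (the base URL assumes Ollama on the host; inside Docker Compose the service hostname may differ):

LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=granite4:8b
OLLAMA_ORCHESTRATOR_MODEL=granite4:dense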

Verifying the Active Backend

1) vLLM health endpoint

curl http://localhost:8000/health

Expected behavior: HTTP 200 with vLLM health response.

2) Ollama version endpoint

curl http://localhost:11434/api/version

Expected behavior: JSON containing Ollama version metadata.

3) AuroraSOC inference status endpoint

curl http://localhost:8000/api/v1/inference/status

Example expected JSON:

{
  "backend": "vllm",
  "base_url": "http://vllm:8000/v1",
  "model": "granite-soc-specialist",
  "orchestrator_model": "granite-soc-specialist",
  "healthy": true
}
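
For scripted checks, the backend field can be extracted directly (assumes jq is installed):

curl -s http://localhost:8000/api/v1/inference/status | jq -r .backend   # expected: vllm, ollama, or openai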

Troubleshooting

Symptom: CUDA out of memory in vLLM logs

  • Cause: model context length or tensor layout exceeds available VRAM.
  • Fix: lower --max-model-len, reduce concurrency, or increase GPU capacity. If multi-GPU is available, adjust VLLM_TENSOR_PARALLEL to match available devices.
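
For instance, in .env (2 is illustrative; the value must match the number of visible GPUs):

# Two-way tensor parallelism; pair with a lower --max-model-len in the vLLM
# launch command if out-of-memory errors persist.
VLLM_TENSOR_PARALLEL=2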

Symptom: vLLM container exits immediately on startup

  • Cause: missing NVIDIA runtime/toolkit, invalid model path, or gated-model access failure.
  • Fix: verify NVIDIA Container Toolkit, confirm ./training/output mount contains exported model directories, and set HF_TOKEN when required.
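
Quick checks for each of those causes (run from the repository root; assumes a standard Docker setup):

docker info --format '{{json .Runtimes}}' | grep -i nvidia   # is the NVIDIA runtime registered?
ls ./training/output                                         # are exported model directories present?
grep HF_TOKEN .env                                           # is a token set for gated models?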

Symptom: agents return connection refused

  • Cause: LLM_BACKEND points to a backend that is not running or URL points to wrong host.
  • Fix: confirm service health (/health for vLLM, /api/version for Ollama) and validate VLLM_BASE_URL/OLLAMA_BASE_URL values for the runtime environment.
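
A compact health sweep, run from the host (inside containers, substitute service hostnames such as http://vllm:8000):

curl -sf http://localhost:8000/health && echo "vLLM reachable"
curl -sf http://localhost:11434/api/version && echo "Ollama reachable"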

Symptom: 404 Model Not Found from vLLM

  • Cause: VLLM_MODEL does not match --served-model-name in docker-compose.yml.
  • Fix: align model names exactly, including hyphenation and case, then restart affected services.
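
vLLM reports the names it actually serves, so the comparison is quick:

curl -s http://localhost:8000/v1/models   # each "id" must match VLLM_MODEL exactly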

Symptom: Ollama model not found error

  • Cause: configured OLLAMA_MODEL or OLLAMA_ORCHESTRATOR_MODEL was not pulled/imported.
  • Fix: run:
ollama pull granite4:8b
ollama pull granite4:dense
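
To confirm the pulls succeeded:

ollama list   # both configured tags should appear in the output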

Then restart services.

Symptom: changed LLM_BACKEND in .env but agents still use old backend

  • Cause: environment variables are loaded at container start; running containers keep previous values.
  • Fix: recreate containers with docker compose up -d (or restart affected services explicitly).
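
For example (append a service name to limit scope; "backend" is illustrative):

docker compose up -d --force-recreate           # recreate everything with current .env values
docker compose up -d --force-recreate backend   # or recreate just one service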