Inference Backends: vLLM and Ollama — Complete Guide

What Is an Inference Backend?

The inference backend is the engine that actually runs the model and produces answers. AuroraSOC agents are the drivers: they decide what question to ask and when; the backend, the engine under the hood, executes the model and returns the output. Agents do not need to change behavior when you switch backends, because they always send a question and consume a response. What changes is throughput, concurrency behavior, hardware profile, and operating cost.
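
To make that contract concrete, here is a minimal request of the kind every backend in this guide accepts; the host and model name are illustrative values for a local setup:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-soc-specialist",
        "messages": [{"role": "user", "content": "Summarize the latest alert."}]
      }'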

vLLM: The Production Backend

vLLM is AuroraSOC's production default. It originated at UC Berkeley and is now widely used for large-scale model serving in production environments.

Why vLLM is the default

AuroraSOC is a concurrent, multi-agent system. During incident response, the orchestrator and specialist agents can issue inference requests at nearly the same time. vLLM is engineered for that pattern.

Continuous batching in plain language

Traditional backends often process requests in a simple queue, so one request blocks the others. Under SOC load, that means every other agent sits idle while one is served.

vLLM uses continuous batching: as requests arrive, it dynamically groups work into efficient GPU passes instead of running each request in isolation.

Restaurant analogy:

  • Traditional queue: the kitchen cooks one table at a time.
  • Continuous batching: the kitchen coordinates all active tables and keeps every burner utilized.

The result is higher throughput and lower tail latency under concurrent load.
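
If you run vLLM by hand, the batching behavior is tunable at launch. A minimal sketch, assuming a locally exported model directory (the path is illustrative; --max-num-seqs is the real vLLM flag that caps how many sequences are batched together):

vllm serve ./training/output/granite-soc-specialist \
  --served-model-name granite-soc-specialist \
  --max-num-seqs 64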

PagedAttention and KV cache efficiency

A model's key-value (KV) cache is the "scratch paper" it uses to keep context during generation. In many backends, that memory is allocated in rigid chunks, which wastes VRAM.

vLLM's PagedAttention allocates this memory more dynamically, so you can fit more concurrent conversations into the same GPU memory budget.
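
The size of that shared memory pool is set at launch. A minimal sketch with real vLLM flags and illustrative values:

# --gpu-memory-utilization sets the fraction of VRAM vLLM may claim (most of it
# becomes KV-cache pages); --max-model-len caps per-request context so more
# concurrent sessions fit in the pool.
vllm serve ./training/output/granite-soc-specialist \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192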

OpenAI-compatible API

vLLM provides an OpenAI-compatible interface. In practice, code that targets OpenAI-style chat APIs can usually be reused by changing only base URL and model name. That compatibility is a major reason AuroraSOC could migrate to vLLM-default with minimal application-level changes.
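
Concretely, the request shown earlier needs only a different host, key, and model to target a hosted provider (the values below reuse the Together AI example from later in this guide):

curl -s https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3-70b-chat-hf", "messages": [{"role": "user", "content": "Summarize the latest alert."}]}'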

GPU requirements

GPU          VRAM    Suitable for
RTX 3090     24 GB   Specialist model (8B)
RTX 4090     24 GB   Specialist model (8B)
A10G         24 GB   Specialist model (8B)
A100 (40G)   40 GB   Orchestrator or both
A100 (80G)   80 GB   Both models, tensor parallel

Ollama: The Developer Fallback

Ollama is AuroraSOC's developer-friendly fallback. It is simple to run, lets you pull models with a single command, and works on laptops or in environments without production GPUs.
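
Getting a model running locally is a two-command affair (the tags match the ones AuroraSOC expects; the prompt is illustrative):

ollama pull granite4:8b                           # fetch the specialist model
ollama run granite4:8b "Classify this log line"   # one-off sanity check from the terminal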

Why Ollama is not production default

Ollama is excellent for local development and single-agent testing, but it is not optimized for high-concurrency SOC fan-out.

Queueing math example:

  • If one agent call takes ~3 seconds on CPU
  • And 16 specialist agents are effectively queued
  • The last agent may wait about 45 seconds (15 calls ahead × ~3 s each) before its work even starts

That delay is usually unacceptable for active incident response.
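
The arithmetic as a quick shell check, using the assumed figures from above:

AGENTS=16; SECS_PER_CALL=3
echo "$(( (AGENTS - 1) * SECS_PER_CALL ))s until the last agent starts"   # 45s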

OpenAI-Compatible APIs: The Cloud / BYO-Endpoint Backend

AuroraSOC can also route all agent inference to any OpenAI-compatible API — hosted services like Together AI, Groq, Fireworks, or OpenAI itself, as well as local servers such as llama.cpp, LM Studio, or text-generation-inference.

Because BeeAI's ChatModel.from_name("openai:model-name") works with any endpoint that implements the /v1/chat/completions contract, adding a new provider is purely configuration:

export LLM_BACKEND=openai
export OPENAI_COMPATIBLE_BASE_URL=https://api.together.xyz/v1
export OPENAI_COMPATIBLE_MODEL=meta-llama/Llama-3-70b-chat-hf
export OPENAI_COMPATIBLE_API_KEY=<your-key>
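
A quick way to confirm the endpoint and key are valid is to list the provider's models; /v1/models is part of the same OpenAI-compatible contract:

curl -s "$OPENAI_COMPATIBLE_BASE_URL/models" \
  -H "Authorization: Bearer $OPENAI_COMPATIBLE_API_KEY"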

Supported providers (non-exhaustive)

Provider       Example Base URL                        Notes
Together AI    https://api.together.xyz/v1             Wide model catalog
Groq           https://api.groq.com/openai/v1          Ultra-low latency
Fireworks AI   https://api.fireworks.ai/inference/v1   Fast open-model hosting
OpenAI         https://api.openai.com/v1               GPT-4o, o3-mini, etc.
llama.cpp      http://localhost:8080/v1                Local CPU/GPU inference
LM Studio      http://localhost:1234/v1                Desktop app for local models

When to use the OpenAI-compatible backend

  • You want to leverage cloud-hosted models without running your own GPU infrastructure.
  • You need access to models not available as Ollama tags or vLLM exports (e.g., GPT-4o, Claude-via-proxy).
  • You are prototyping and want instant access to frontier models before fine-tuning Granite locally.

Limitations

  • Per-agent fine-tuned model routing and Granite model normalization do not apply — the configured model name is passed through as-is.
  • Latency depends on the remote provider and network path.
  • Sensitive investigation data leaves your infrastructure boundary (unless the endpoint is on your network).

Backend decision table

Situation                                     Recommended backend
Production deployment                         vLLM
Multi-agent load testing                      vLLM
Cloud model access (GPT-4o, Llama 3, etc.)    OpenAI-compatible
BYO inference server (llama.cpp, TGI)         OpenAI-compatible
Developer laptop (no GPU)                     Ollama
CI/CD smoke test                              Ollama
Single-agent debugging                        Ollama

How to Switch Backends

  1. Open .env.
  2. Set LLM_BACKEND to vllm, ollama, or openai.
  3. Configure the matching variables (a complete Ollama example follows this list):
    • vllm: VLLM_BASE_URL, VLLM_MODEL, VLLM_ORCHESTRATOR_MODEL
    • ollama: OLLAMA_BASE_URL, OLLAMA_MODEL, OLLAMA_ORCHESTRATOR_MODEL (pull models first)
    • openai: OPENAI_COMPATIBLE_BASE_URL, OPENAI_COMPATIBLE_MODEL, and optionally OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL and OPENAI_COMPATIBLE_API_KEY
  4. If switching to Ollama, pull models:
ollama pull granite4:8b && ollama pull granite4:dense
  5. In docker-compose.yml, comment out depends_on: vllm in agent services when operating in Ollama mode.
  6. Start or refresh services:
docker compose up -d
  7. Verify the active backend:
curl http://localhost:8000/api/v1/inference/status
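
As referenced in step 3, a complete Ollama-mode example for .env (the base URL assumes Ollama on the host; inside Docker Compose the service hostname may differ):

LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=granite4:8b
OLLAMA_ORCHESTRATOR_MODEL=granite4:dense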

Verifying the Active Backend

1) vLLM health endpoint

curl http://localhost:8000/health

Expected behavior: HTTP 200 with vLLM health response.

2) Ollama version endpoint

curl http://localhost:11434/api/version

Expected behavior: JSON containing Ollama version metadata.

3) AuroraSOC inference status endpoint

curl http://localhost:8000/api/v1/inference/status

Example expected JSON:

{
  "backend": "vllm",
  "base_url": "http://vllm:8000/v1",
  "model": "granite-soc-specialist",
  "orchestrator_model": "granite-soc-specialist",
  "healthy": true
}
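
For scripted checks, the backend field can be extracted directly (assumes jq is installed):

curl -s http://localhost:8000/api/v1/inference/status | jq -r .backend   # expected: vllm, ollama, or openai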

Troubleshooting

Symptom: CUDA out of memory in vLLM logs

  • Cause: model context length or tensor layout exceeds available VRAM.
  • Fix: lower --max-model-len, reduce concurrency, or increase GPU capacity. If multi-GPU is available, adjust VLLM_TENSOR_PARALLEL to match available devices.
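
For instance, in .env (2 is illustrative; the value must match the number of visible GPUs):

# Two-way tensor parallelism; pair with a lower --max-model-len in the vLLM
# launch command if out-of-memory errors persist.
VLLM_TENSOR_PARALLEL=2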

Symptom: vLLM container exits immediately on startup

  • Cause: missing NVIDIA runtime/toolkit, invalid model path, or gated-model access failure.
  • Fix: verify NVIDIA Container Toolkit, confirm ./training/output mount contains exported model directories, and set HF_TOKEN when required.
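
Quick checks for each of those causes (run from the repository root; assumes a standard Docker setup):

docker info --format '{{json .Runtimes}}' | grep -i nvidia   # is the NVIDIA runtime registered?
ls ./training/output                                         # are exported model directories present?
grep HF_TOKEN .env                                           # is a token set for gated models?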

Symptom: agents return connection refused

  • Cause: LLM_BACKEND points to a backend that is not running or URL points to wrong host.
  • Fix: confirm service health (/health for vLLM, /api/version for Ollama) and validate VLLM_BASE_URL/OLLAMA_BASE_URL values for the runtime environment.
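
A compact health sweep, run from the host (inside containers, substitute service hostnames such as http://vllm:8000):

curl -sf http://localhost:8000/health && echo "vLLM reachable"
curl -sf http://localhost:11434/api/version && echo "Ollama reachable"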

Symptom: 404 Model Not Found from vLLM

  • Cause: VLLM_MODEL does not match --served-model-name in docker-compose.yml.
  • Fix: align model names exactly, including hyphenation and case, then restart affected services.
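
vLLM reports the names it actually serves, so the comparison is quick:

curl -s http://localhost:8000/v1/models   # each "id" must match VLLM_MODEL exactly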

Symptom: Ollama model not found error

  • Cause: configured OLLAMA_MODEL or OLLAMA_ORCHESTRATOR_MODEL was not pulled/imported.
  • Fix: run:
ollama pull granite4:8b
ollama pull granite4:dense
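
To confirm the pulls succeeded:

ollama list   # both configured tags should appear in the output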

Then restart services.

Symptom: changed LLM_BACKEND in .env but agents still use old backend

  • Cause: environment variables are loaded at container start; running containers keep previous values.
  • Fix: recreate containers with docker compose up -d (or restart affected services explicitly).
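
For example (append a service name to limit scope; "backend" is illustrative):

docker compose up -d --force-recreate           # recreate everything with current .env values
docker compose up -d --force-recreate backend   # or recreate just one service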