Inference Backends: vLLM and Ollama — Complete Guide
What Is an Inference Backend?
The inference backend is the engine that actually runs the model and produces answers. AuroraSOC agents are the drivers: they decide what question to ask and when; the backend executes the model and returns the output. Agents do not need to change behavior when you switch backends, because they always send a question and consume a response. What changes is throughput, concurrency behavior, hardware profile, and operating cost.
vLLM: The Production Backend
vLLM is AuroraSOC's production default. It originated at UC Berkeley and is now widely used for large-scale model serving in production environments.
Why vLLM is the default
AuroraSOC is a concurrent, multi-agent system. During incident response, the orchestrator and specialist agents can issue inference requests at nearly the same time. vLLM is engineered for that pattern.
Continuous batching in plain language
Traditional backends often process requests in a simple queue, so one request blocks the others. Under SOC load, that means 15 agents wait while one is served.
vLLM uses continuous batching: as requests arrive, it dynamically groups work into efficient GPU passes instead of running each request in isolation.
Restaurant analogy:
- Traditional queue: the kitchen cooks one table at a time.
- Continuous batching: the kitchen coordinates all active tables and keeps every burner utilized.
The result is higher throughput and lower tail latency under concurrent load.
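A toy latency model makes the difference concrete. The numbers below are illustrative assumptions (a fixed per-request cost and a modest batch-overhead factor), not measured vLLM figures:

```python
# Toy model: tail latency for a strict FIFO queue vs. an idealized
# batch that serves the whole group in one coordinated GPU pass.
# All numbers are illustrative assumptions, not vLLM benchmarks.

def fifo_tail_latency(n_requests: int, seconds_per_request: float) -> float:
    """The last request waits for everyone ahead of it, then runs."""
    return n_requests * seconds_per_request

def batched_tail_latency(n_requests: int, seconds_per_request: float,
                         batch_overhead: float = 1.5) -> float:
    """Idealized continuous batching: one pass serves the group,
    at an assumed per-batch overhead factor."""
    return seconds_per_request * batch_overhead

if __name__ == "__main__":
    n, t = 16, 3.0
    print(f"FIFO tail latency:    {fifo_tail_latency(n, t):.1f}s")
    print(f"Batched tail latency: {batched_tail_latency(n, t):.1f}s")
```

Real continuous batching interleaves token-level work rather than serving a whole batch atomically, so the true curve sits between these two extremes.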
PagedAttention and KV cache efficiency
A model's key-value (KV) cache is the "scratch paper" it uses to keep context during generation. In many backends, that memory is allocated in rigid chunks, which wastes VRAM.
vLLM's PagedAttention allocates this memory more dynamically, so you can fit more concurrent conversations into the same GPU memory budget.
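A back-of-envelope sizing calculation shows why KV-cache efficiency matters. The model shape below is an assumed 8B-class configuration (32 layers, 8 KV heads, head dimension 128, fp16), not the exact Granite geometry:

```python
# Rough KV-cache sizing for an assumed 8B-class model shape.
# These parameters are illustrative, not the actual Granite config.

def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Both keys and values are cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def concurrent_sequences(vram_budget_gib: int, context_tokens: int) -> int:
    """How many full-context sequences fit in a given KV-cache budget."""
    per_seq = kv_bytes_per_token() * context_tokens
    return int(vram_budget_gib * 2**30 // per_seq)

if __name__ == "__main__":
    print(kv_bytes_per_token())           # bytes of KV cache per token
    print(concurrent_sequences(8, 4096))  # sequences in an 8 GiB KV budget
```

With rigid chunked allocation much of that budget is reserved but unused; PagedAttention's finer-grained allocation is what lets more of the theoretical sequence count actually fit.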
OpenAI-compatible API
vLLM provides an OpenAI-compatible interface. In practice, code that targets OpenAI-style chat APIs can usually be reused by changing only base URL and model name. That compatibility is a major reason AuroraSOC could migrate to vLLM-default with minimal application-level changes.
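To illustrate the contract, here is a minimal stdlib-only client for any OpenAI-compatible `/v1/chat/completions` endpoint; the base URL, model name, and prompt below are assumptions for illustration:

```python
# Minimal client for any OpenAI-compatible /v1/chat/completions endpoint,
# standard library only. The base URL and model name are the only things
# that change between vLLM, a hosted provider, or a local server.
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Assemble the POST that every OpenAI-compatible backend accepts."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

if __name__ == "__main__":
    # Assumed local vLLM endpoint and served model name:
    req = build_chat_request("http://localhost:8000/v1",
                             "granite-soc-specialist",
                             "Summarize the open alerts.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping backends means changing only the two arguments to `build_chat_request`; the request shape stays identical.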
GPU requirements
| GPU | VRAM | Suitable for |
|---|---|---|
| RTX 3090 | 24 GB | Specialist model (8B) |
| RTX 4090 | 24 GB | Specialist model (8B) |
| A10G | 24 GB | Specialist model (8B) |
| A100 (40G) | 40 GB | Orchestrator or both |
| A100 (80G) | 80 GB | Both models, tensor parallel |
Ollama: The Developer Fallback
Ollama is AuroraSOC's developer-friendly fallback. It is simple to run, lets you pull models with a single command, and is practical on laptops or in environments without production GPUs.
Why Ollama is not production default
Ollama is excellent for local development and single-agent testing, but it is not optimized for high-concurrency SOC fan-out.
Queueing math example:
- If one agent call takes ~3 seconds on CPU
- And 16 specialist agents are effectively queued
- The last agent may wait about 45 seconds (15 preceding calls × 3 s) before its work even starts
That delay is usually unacceptable for active incident response.
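The queue-wait arithmetic generalizes to any agent count and per-call time:

```python
# Sequential queue wait when a backend serves one request at a time,
# as in the queueing example above.

def queue_wait(position: int, seconds_per_call: float) -> float:
    """Seconds the agent at a 1-based queue position waits
    before its own call starts."""
    return (position - 1) * seconds_per_call

if __name__ == "__main__":
    n_agents, t = 16, 3.0
    for pos in (1, 8, n_agents):
        print(f"agent {pos:2d} waits {queue_wait(pos, t):.1f}s")
```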
OpenAI-Compatible APIs: The Cloud / BYO-Endpoint Backend
AuroraSOC can also route all agent inference to any OpenAI-compatible API — hosted services like Together AI, Groq, Fireworks, or OpenAI itself, as well as local servers such as llama.cpp, LM Studio, or text-generation-inference.
Because BeeAI's `ChatModel.from_name("openai:model-name")` works with any endpoint that implements the `/v1/chat/completions` contract, adding a new provider is purely configuration:
export LLM_BACKEND=openai
export OPENAI_COMPATIBLE_BASE_URL=https://api.together.xyz/v1
export OPENAI_COMPATIBLE_MODEL=meta-llama/Llama-3-70b-chat-hf
export OPENAI_COMPATIBLE_API_KEY=<your-key>
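At startup, exports like these could be resolved into a model spec for `ChatModel.from_name`. The sketch below is an assumption about how that resolution might look (in particular, the `ollama:` prefix and routing vLLM through the `openai:` spec are illustrative, not confirmed AuroraSOC behavior):

```python
# Sketch: map LLM_BACKEND-style environment variables to a
# (model_spec, base_url) pair. The spec-prefix conventions here
# are assumptions for illustration.
import os

def resolve_model_spec(env: dict) -> tuple[str, str]:
    """Return (model_spec, base_url) from LLM_BACKEND-style settings."""
    backend = env.get("LLM_BACKEND", "vllm")
    if backend == "openai":
        return (f"openai:{env['OPENAI_COMPATIBLE_MODEL']}",
                env["OPENAI_COMPATIBLE_BASE_URL"])
    if backend == "vllm":
        # vLLM speaks the OpenAI contract, so it rides the same spec.
        return f"openai:{env['VLLM_MODEL']}", env["VLLM_BASE_URL"]
    if backend == "ollama":
        return f"ollama:{env['OLLAMA_MODEL']}", env["OLLAMA_BASE_URL"]
    raise ValueError(f"unknown LLM_BACKEND: {backend}")

if __name__ == "__main__":
    print(resolve_model_spec(dict(os.environ)))
```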
Supported providers (non-exhaustive)
| Provider | Example Base URL | Notes |
|---|---|---|
| Together AI | https://api.together.xyz/v1 | Wide model catalog |
| Groq | https://api.groq.com/openai/v1 | Ultra-low latency |
| Fireworks AI | https://api.fireworks.ai/inference/v1 | Fast open-model hosting |
| OpenAI | https://api.openai.com/v1 | GPT-4o, o3-mini, etc. |
| llama.cpp | http://localhost:8080/v1 | Local CPU/GPU inference |
| LM Studio | http://localhost:1234/v1 | Desktop app for local models |
When to use the OpenAI-compatible backend
- You want to leverage cloud-hosted models without running your own GPU infrastructure.
- You need access to models not available as Ollama tags or vLLM exports (e.g., GPT-4o, Claude-via-proxy).
- You are prototyping and want instant access to frontier models before fine-tuning Granite locally.
Limitations
- Per-agent fine-tuned model routing and Granite model normalization do not apply — the configured model name is passed through as-is.
- Latency depends on the remote provider and network path.
- Sensitive investigation data leaves your infrastructure boundary (unless the endpoint is on your network).
Backend decision table
| Situation | Recommended backend |
|---|---|
| Production deployment | vLLM |
| Multi-agent load testing | vLLM |
| Cloud model access (GPT-4o, Llama 3, etc.) | OpenAI-compatible |
| BYO inference server (llama.cpp, TGI) | OpenAI-compatible |
| Developer laptop (no GPU) | Ollama |
| CI/CD smoke test | Ollama |
| Single-agent debugging | Ollama |
How to Switch Backends
- Open `.env`.
- Set `LLM_BACKEND` to `vllm`, `ollama`, or `openai`.
- Configure the matching variables:
  - vllm: `VLLM_BASE_URL`, `VLLM_MODEL`, `VLLM_ORCHESTRATOR_MODEL`
  - ollama: `OLLAMA_BASE_URL`, `OLLAMA_MODEL`, `OLLAMA_ORCHESTRATOR_MODEL` (pull models first)
  - openai: `OPENAI_COMPATIBLE_BASE_URL`, `OPENAI_COMPATIBLE_MODEL`, and optionally `OPENAI_COMPATIBLE_ORCHESTRATOR_MODEL` and `OPENAI_COMPATIBLE_API_KEY`
- If switching to Ollama, pull models: `ollama pull granite4:8b && ollama pull granite4:dense`
- In `docker-compose.yml`, comment out `depends_on: vllm` in agent services when operating in Ollama mode.
- Start or refresh services: `docker compose up -d`
- Verify the active backend: `curl http://localhost:8000/api/v1/inference/status`
Verifying the Active Backend
1) vLLM health endpoint
curl http://localhost:8000/health
Expected behavior: HTTP 200 with vLLM health response.
2) Ollama version endpoint
curl http://localhost:11434/api/version
Expected behavior: JSON containing Ollama version metadata.
3) AuroraSOC inference status endpoint
curl http://localhost:8000/api/v1/inference/status
Example expected JSON:
{
"backend": "vllm",
"base_url": "http://vllm:8000/v1",
"model": "granite-soc-specialist",
"orchestrator_model": "granite-soc-specialist",
"healthy": true
}
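A small script can check that payload automatically; the endpoint path comes from the document, and the expected backend name is whatever you configured:

```python
# Check that the inference status payload reports the backend you
# expect and that it is healthy. Endpoint path is from the guide;
# run the __main__ block against a live deployment.
import json
import urllib.request

def backend_matches(status: dict, expected: str) -> bool:
    """True when the configured backend matches and reports healthy."""
    return status.get("backend") == expected and status.get("healthy") is True

if __name__ == "__main__":
    with urllib.request.urlopen(
            "http://localhost:8000/api/v1/inference/status") as resp:
        status = json.load(resp)
    print("expected backend active:", backend_matches(status, "vllm"))
```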
Troubleshooting
Symptom: CUDA out of memory in vLLM logs
- Cause: model context length or tensor layout exceeds available VRAM.
- Fix: lower `--max-model-len`, reduce concurrency, or increase GPU capacity. If multi-GPU is available, adjust `VLLM_TENSOR_PARALLEL` to match the available devices.
Symptom: vLLM container exits immediately on startup
- Cause: missing NVIDIA runtime/toolkit, invalid model path, or gated-model access failure.
- Fix: verify the NVIDIA Container Toolkit is installed, confirm the `./training/output` mount contains exported model directories, and set `HF_TOKEN` when required.
Symptom: agents return connection refused
- Cause: `LLM_BACKEND` points to a backend that is not running, or the URL points to the wrong host.
- Fix: confirm service health (`/health` for vLLM, `/api/version` for Ollama) and validate the `VLLM_BASE_URL`/`OLLAMA_BASE_URL` values for the runtime environment.
Symptom: 404 Model Not Found from vLLM
- Cause: `VLLM_MODEL` does not match `--served-model-name` in `docker-compose.yml`.
- Fix: align the model names exactly, including hyphenation and case, then restart the affected services.
Symptom: Ollama model not found error
- Cause: the configured `OLLAMA_MODEL` or `OLLAMA_ORCHESTRATOR_MODEL` was not pulled or imported.
- Fix: run `ollama pull granite4:8b` and `ollama pull granite4:dense`, then restart services.
Symptom: changed LLM_BACKEND in .env but agents still use old backend
- Cause: environment variables are loaded at container start; running containers keep previous values.
- Fix: recreate the containers with `docker compose up -d` (or restart the affected services explicitly).