Inference backend resolution
What this page is
How the agent fleet picks between vLLM, Ollama, and the OpenAI- compatible API surface at startup, what the auto-fallback behaviour is, and how the cache interacts with multi-replica deployments.
Why it exists this way
ADR 005
introduced the auto-resolver after operators reported that
the dev compose stack required setting LLM_BACKEND=ollama
manually because vLLM is not part of the small-tier stack. The
resolver removes the manual step in dev while leaving prod
deployments explicit.
How it works
The resolver lives at
packages/backend/aurorasoc/granite/backend_resolution.py.
resolve_inference_backend follows a priority chain:
- Honour an explicit
LLM_BACKEND=vllm|ollama|openaienv variable; no probing. - With
LLM_BACKEND=auto(the default in dev), the resolver probes vLLM's/healthfirst. If it responds 200 in under 2 seconds, vLLM wins. - On vLLM probe failure, fall back to Ollama's
/api/tags. If Ollama responds, it wins. - If neither answers, the resolver returns
InferenceBackend::Unavailableand the agent factory refuses to start.
The resolved value is cached for 60 seconds so a single multi-agent fan-out does not hammer the probe endpoints.
What goes wrong
- vLLM is up but Ollama is the right choice for this fleet
(operator wants to test the small-tier path). Set
LLM_BACKEND=ollamaexplicitly; the resolver respects it. - Both backends are up but disagree on responses. The resolver does not arbitrate; the LLM evaluation harness is the path for that comparison.
- Cache stale during deploy. The cache TTL is 60 seconds; an
agent that started during a Collector downtime sees the
cached
Unavailablefor that minute. Worker restart picks up the new state immediately.