
Inference backend resolution

What this page is

How the agent fleet picks between vLLM, Ollama, and the OpenAI-compatible API surface at startup, what the auto-fallback behaviour is, and how the cache interacts with multi-replica deployments.

Why it exists this way

ADR 005 introduced the auto-resolver after operators reported that the dev compose stack required setting LLM_BACKEND=ollama manually because vLLM is not part of the small-tier stack. The resolver removes the manual step in dev while leaving prod deployments explicit.

How it works

The resolver lives at packages/backend/aurorasoc/granite/backend_resolution.py. resolve_inference_backend follows a priority chain:

  1. Honour an explicit LLM_BACKEND=vllm|ollama|openai env variable; no probing.
  2. With LLM_BACKEND=auto (the default in dev), the resolver probes vLLM's /health first. If it responds 200 in under 2 seconds, vLLM wins.
  3. On vLLM probe failure, fall back to Ollama's /api/tags. If Ollama responds, it wins.
  4. If neither answers, the resolver returns InferenceBackend.Unavailable and the agent factory refuses to start.
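
The priority chain above can be sketched roughly as follows. This is a minimal illustration, not the code in backend_resolution.py: the enum member spellings, the `http_probe` helper, the injected probe callables, the treatment of an unset LLM_BACKEND as `auto`, and the ValueError on unknown values are all assumptions made for the example.

```python
import enum
import http.client
import urllib.request
from typing import Callable, Mapping


class InferenceBackend(enum.Enum):
    # Member names are illustrative; the real enum may be spelled differently.
    VLLM = "vllm"
    OLLAMA = "ollama"
    OPENAI = "openai"
    UNAVAILABLE = "unavailable"


_EXPLICIT = {
    "vllm": InferenceBackend.VLLM,
    "ollama": InferenceBackend.OLLAMA,
    "openai": InferenceBackend.OPENAI,
}


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Hypothetical helper: True only if a GET to `url` answers HTTP 200.

    urlopen's timeout bounds each blocking socket operation, so this only
    approximates a strict "responds within 2 seconds" rule.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def resolve_inference_backend(
    env: Mapping[str, str],
    probe_vllm: Callable[[], bool],
    probe_ollama: Callable[[], bool],
) -> InferenceBackend:
    # Assumption: an unset variable behaves like "auto".
    setting = env.get("LLM_BACKEND", "auto").strip().lower()

    # Step 1: an explicit choice wins and no probe is made.
    if setting in _EXPLICIT:
        return _EXPLICIT[setting]
    if setting != "auto":
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported LLM_BACKEND value: {setting!r}")

    # Step 2: vLLM is probed first.
    if probe_vllm():
        return InferenceBackend.VLLM
    # Step 3: Ollama is tried only after the vLLM probe fails.
    if probe_ollama():
        return InferenceBackend.OLLAMA
    # Step 4: nothing answered; the caller is expected to refuse to start.
    return InferenceBackend.UNAVAILABLE
```

In a real deployment the probes might be wired as `lambda: http_probe(vllm_url + "/health")` and `lambda: http_probe(ollama_url + "/api/tags")`, with `os.environ` passed as `env`; the base URLs are deployment-specific and not stated on this page. Injecting the probes as callables keeps the ordering logic testable without a network.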

The resolved value is cached for 60 seconds so a single multi-agent fan-out does not hammer the probe endpoints.
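
A time-based cache of this kind could look like the sketch below. It assumes a per-process, single-threaded cache keyed on nothing but time; the `TTLCache` class, its `invalidate` method, and the injectable clock are hypothetical names introduced for illustration, and the real resolver may store its result differently.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Hypothetical single-value cache that reloads after `ttl_seconds`."""

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None  # None means "nothing cached"

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Expired or never loaded: run the loader (the probe chain) once.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]

    def invalidate(self) -> None:
        # Force the next get() to reload, regardless of the TTL.
        self._expires_at = None
```

Note that a cache like this stores whatever the loader returns, including a negative result, until the TTL runs out. Because the state lives in process memory, each replica would hold its own copy and a restarted process would start with an empty cache.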

What goes wrong

  • vLLM is up but Ollama is the right choice for this fleet (operator wants to test the small-tier path). Set LLM_BACKEND=ollama explicitly; the resolver respects it.
  • Both backends are up but disagree on responses. The resolver does not arbitrate; the LLM evaluation harness is the path for that comparison.
  • Stale cache during deploy. The cache TTL is 60 seconds, so an agent that started during a backend outage keeps seeing the cached Unavailable result for up to that minute. Restarting the worker picks up the new state immediately.