
Local Deployment Guide

This page walks through deploying AuroraSOC with Granite LLMs on a single machine — the fastest way to go from fine-tuned model to running agents. It covers the automated setup script, manual steps, Docker Compose integration, and verification.

If you want the full platform bring-up with the dashboard, API, MCP services, orchestrator, and specialist agents, use AI Agent Fleet Deployment. This page stays focused on the model-serving side of the stack.

TL;DR — MVP-1 Single-Command Path

For the MVP-1 host-run topology with the 14-agent mesh and the dashboard, three Makefile targets are all you need after ollama serve and ollama pull granite3.2:8b:

make llm-doctor # Verify Ollama + Granite 3.2 + BeeAI ChatModel end-to-end
make stack-up # Bring up Postgres, Redis, NATS, API, A2A mesh, dashboard
make agents-smoke # Deterministic prompt to every live agent, PASS/FAIL summary

Tear everything down with make stack-down (set KEEP_INFRA=1 to keep Postgres/Redis/NATS running between iterations).
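
For example, to stop the application layer between iterations while keeping the infrastructure containers warm (this assumes the Makefile picks KEEP_INFRA up from the command line; export it first if it reads the environment instead):

make stack-down KEEP_INFRA=1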

For container-based topologies see Deployment Modes. The rest of this page is the long-form reference for each layer.

Prerequisites

Before you begin:

Requirement | Minimum | Recommended
OS | Ubuntu 22.04 / macOS 14+ | Ubuntu 24.04
RAM | 16 GB | 32 GB
GPU VRAM | 4 GB (GGUF q4) | 8+ GB (GGUF q8)
Disk | 20 GB free | 50 GB free
Ollama | v0.4+ | Latest
Docker | 24.0+ | 27.0+
Python | 3.10+ | 3.12

Automated Setup – setup_local.sh

The fastest path. This script installs everything and verifies the setup:

chmod +x scripts/setup_local.sh
./scripts/setup_local.sh

What the Script Does

  1. Checks system dependencies — verifies Python ≥ 3.10, Docker, Docker Compose, NVIDIA drivers (if GPU present)
  2. Installs Ollama — downloads and installs Ollama if not present
  3. Pulls the base model — ollama pull granite4:8b
  4. Creates Python virtualenv — installs AuroraSOC with training extras (pip install -e ".[training]")
  5. Copies .env.example to .env — seeds runtime defaults (vLLM default backend; switch to Ollama for CPU-only local mode)
  6. Runs database migrations — alembic upgrade head
  7. Verifies Ollama inference — sends a test prompt and checks for a valid response
  8. Prints status summary — shows all service URLs and next steps
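
Once the script reports success, a quick spot-check of what it should have left behind (the .env file and the pulled Granite tag from the steps above):

test -f .env && echo ".env present"
ollama list | grep granite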

When to Use the Script

  • First time setup on a new machine
  • After cloning the repository on a fresh environment
  • When onboarding a new developer who needs everything working quickly

When NOT to Use the Script

  • You already have a working environment (just update .env manually)
  • You're deploying to production (use Docker Compose instead)
  • You need a GPU-first vLLM setup immediately (the script prepares the Ollama local path)

Manual Setup (Step-by-Step)

If you prefer control over each step:

Step 1: Install Ollama

curl -fsSL https://ollama.ai/install.sh | sh
ollama serve & # Start in background

Step 2: Pull or Import a Model

Option A — Use one installed model, or pull one base model:

Check local inventory first, especially on limited data quotas:

ollama list

If a usable model is already installed, use that same tag for both agent roles. Pull granite4:8b only when no acceptable local model exists:

ollama pull granite4:8b

granite3.2:8b is also a supported single-model tag for the entire 14-agent fleet — set OLLAMA_MODEL=granite3.2:8b and OLLAMA_ORCHESTRATOR_MODEL=granite3.2:8b if that is what your ollama list already shows. Either tag runs on one warm process thanks to GRANITE_SINGLE_MODEL_MODE=true.

For quota-limited systems, do not pull a second model until the installed model has failed a small /api/chat or /api/generate probe. AuroraSOC's local MVP expects one inference service and one shared model tag for every specialist plus the orchestrator.
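
A minimal probe of an already-installed tag, substituting the tag reported by ollama list (the prompt and option values here are only illustrative):

curl -sS http://127.0.0.1:11434/api/generate \
-d '{"model":"<installed-tag>","prompt":"Reply with OK only.","stream":false,"options":{"num_predict":8}}'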

Option B — Import your fine-tuned GGUF:

# Generate Modelfile + create Ollama model
python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest

Option C — Import per-agent models:

# Create all agent-specific models at once
python training/scripts/serve_model.py ollama-all \
--output-dir training/output

This creates separate Ollama models for each trained agent:

  • granite-soc-security-analyst:latest
  • granite-soc-threat-hunter:latest
  • granite-soc-incident-responder:latest
  • (etc.)
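
To confirm they were registered, list the models by their shared prefix (the prefix follows the naming shown above):

ollama list | grep granite-soc-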

Step 3: Configure Environment

make env-init

Review .env and adjust only if you need to deviate from the host-run demo path:

# Backend selection
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434

# Demo-mode runtime defaults
SYSTEM_MODE=dummy
LOCAL_AUTH_ENABLED=true
DEV_PG_PORT=5432
DEV_REDIS_PORT=6379
DEV_NATS_PORT=4222
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=aurorasoc
PG_USER=aurora
PG_PASSWORD=aurora_dev
PG_SSLMODE=disable
REDIS_URL=redis://localhost:6379
NATS_URL=nats://localhost:4222

# Model selection (base or fine-tuned)
OLLAMA_MODEL=granite4:8b # Base model name in Ollama
OLLAMA_ORCHESTRATOR_MODEL=granite4:8b # Local single-model fallback for orchestration
GRANITE_SINGLE_MODEL_MODE=true # Force all Ollama agents to one model tag
GRANITE_USE_SHARED_MODEL_POOL=true # Reuse equivalent BeeAI ChatModel clients
GRANITE_MAX_CONCURRENT_REQUESTS=1 # Recommended for 8 GB VRAM laptops
GRANITE_REQUEST_QUEUE_SIZE=16
GRANITE_INFERENCE_TIMEOUT_SECONDS=180

# Optional: generic fine-tuned model tag
# OLLAMA_MODEL=granite-soc:latest

# Optional: use per-agent fine-tuned models (turn off single-model mode first)
GRANITE_USE_FINETUNED=true
# GRANITE_SINGLE_MODEL_MODE=false

Step 4: Verify

# Confirm models are loaded
ollama list

# Test inference directly; replace granite4:8b with your selected installed tag
curl -sS http://127.0.0.1:11434/api/chat \
-d '{"model":"granite4:8b","messages":[{"role":"user","content":"Reply with OK only."}],"stream":false,"options":{"num_predict":8,"num_ctx":1024,"temperature":0}}'

# Test via the Granite module
python -c "
from aurorasoc.granite import get_default_granite_config
config = get_default_granite_config()
print(f'Backend: {config.backend}')
print(f'Model: {config.resolve_model(\"security_analyst\")}')
"

# After the API starts, verify the dashboard-facing registry reports one model
curl -sS http://localhost:8000/api/v1/agents | jq '.agents | map(.model) | unique'

To start the host-run A2A mesh against that same model, run:

make agents-local MODEL=<installed-ollama-tag>

The launcher sets both OLLAMA_MODEL and OLLAMA_ORCHESTRATOR_MODEL to the same tag, points local A2A and MCP discovery at 127.0.0.1, and runs all specialist agents plus the orchestrator on ports 9000 through 9016.
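
A quick liveness sweep of those ports is sketched below; it assumes nc (netcat) is available and only confirms that each port accepts a TCP connection, not that the agent behind it is healthy:

for port in $(seq 9000 9016); do
  nc -z 127.0.0.1 "$port" && echo "agent port $port is listening" || echo "agent port $port is NOT listening"
done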

Step 5: Start AuroraSOC

# Start the API server
uvicorn aurorasoc.api.main:app --host 0.0.0.0 --port 8000 --reload

# Or use the Makefile
make api

If port 8000 is already occupied, stop the existing process before restarting AuroraSOC:

lsof -i :8000
kill <pid>

If PostgreSQL, Redis, or NATS already use their default host ports, change DEV_PG_PORT, DEV_REDIS_PORT, or DEV_NATS_PORT in .env and rerun docker compose -f docker-compose.dev.yml up -d.
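
For example, to move the dev Postgres mapping to 5433 when the default port is taken (the sed edit assumes DEV_PG_PORT already exists in .env, as seeded in Step 3; on macOS, sed -i needs an empty suffix argument). If the application connects through the remapped host port, PG_PORT likely needs the same change:

sed -i 's/^DEV_PG_PORT=.*/DEV_PG_PORT=5433/' .env
docker compose -f docker-compose.dev.yml up -d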

Docker Compose Deployment

For a containerised deployment of the default Compose stack:

The x-granite-env YAML Anchor

The docker-compose.yml uses a YAML anchor to avoid repeating Granite environment variables across services:

x-granite-env: &granite-env
  LLM_BACKEND: ${LLM_BACKEND:-vllm}
  VLLM_BASE_URL: ${VLLM_BASE_URL:-http://vllm:8000/v1}
  VLLM_MODEL: ${VLLM_MODEL:-granite-soc-specialist}
  VLLM_ORCHESTRATOR_MODEL: ${VLLM_ORCHESTRATOR_MODEL:-granite-soc-specialist}
  OLLAMA_BASE_URL: ${OLLAMA_BASE_URL:-http://ollama:11434}
  OLLAMA_MODEL: ${OLLAMA_MODEL:-granite4:8b}
  OLLAMA_ORCHESTRATOR_MODEL: ${OLLAMA_ORCHESTRATOR_MODEL:-granite4:8b}
  GRANITE_USE_FINETUNED: ${GRANITE_USE_FINETUNED:-false}

Why an anchor? Multiple services (API, workers, health-check) need the same Granite settings. The anchor ensures they stay in sync: change it once, and every service that references *granite-env picks up the change.

Each service merges these variables:

services:
  api:
    environment:
      <<: *granite-env
      # ... other service-specific vars

Using Host-Native Ollama With Compose

AuroraSOC now ships a checked-in override file, docker-compose.host-ollama.yml, for the common local case where Ollama already runs on the host machine and you want the Compose-managed API, worker, and agent services to reuse that single local model.

Use it like this:

export LLM_BACKEND=ollama
export OLLAMA_DOCKER_BASE_URL=http://host.docker.internal:11434
export OLLAMA_MODEL=granite4:8b
export OLLAMA_ORCHESTRATOR_MODEL=granite4:8b

docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml up -d
docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml --profile agents up -d

What the override does:

  • Repoints API, worker, orchestrator, and specialist agents at the host Ollama endpoint.
  • Adds host.docker.internal:host-gateway so Linux containers can reach the host cleanly.
  • Disables the default ollama service unless you explicitly opt into the fallback profile.
  • Keeps the local single-model path on granite4:8b for both specialists and orchestration unless you override the tags and disable GRANITE_SINGLE_MODEL_MODE.

Use this override only when host Ollama is already running and the required model tags are already pulled.
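
Before debugging further, it is worth confirming that containers can actually reach the host endpoint. A minimal probe from inside the api container, assuming Python is available in that image (as the import example further down suggests):

docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml exec api \
python -c "import urllib.request; print(urllib.request.urlopen('http://host.docker.internal:11434/api/tags').status)"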

For the minimal orchestrator plus Network Analyzer stack, the Makefile uses the same MODEL value for specialists and orchestration:

make docker-up-minimal MODEL=granite3.2:8b

Replace granite3.2:8b with whichever single model tag already appears in ollama list.

Starting Docker Compose

# Start the default Compose stack (Ollama, API, workers, monitoring)
docker compose up -d

# Check service health
docker compose ps

# View Ollama logs
docker compose logs ollama

# View API logs
docker compose logs api

Add --profile agents when you want the full agent fleet, and add --profile rust-core only when you need the optional Rust fast path.
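
For example, to bring up the default stack together with the agent fleet and the optional Rust fast path:

docker compose --profile agents --profile rust-core up -d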

Docker Compose Service Architecture

Importing Fine-Tuned Models in Docker

After training, import your GGUF into the Docker Ollama instance:

# Copy GGUF into the Ollama container
docker compose cp training/output/generic/unsloth.Q8_0.gguf ollama:/tmp/

# Exec into the container and create the model
docker compose exec ollama ollama create granite-soc:latest \
-f /tmp/Modelfile

# Or use the serve script which handles this automatically
docker compose exec api python training/scripts/serve_model.py ollama \
--gguf /models/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest

Enabling / Disabling Fine-Tuned Models

Use base model only (no fine-tuning)

# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=false
OLLAMA_MODEL=granite4:8b

All agents will use the same base model. Good for initial development.

Use a single fine-tuned generic model

# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=true
OLLAMA_MODEL=granite-soc:latest

All agents share one fine-tuned model. Good after generic training.

Use per-agent fine-tuned models

# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=true
OLLAMA_MODEL=granite-soc:latest # fallback for agents without a specialist model

The AGENT_MODEL_MAP in aurorasoc/granite/__init__.py maps each agent to its specialist:

AGENT_MODEL_MAP = {
    "security_analyst": "granite-soc-security-analyst:latest",
    "threat_hunter": "granite-soc-threat-hunter:latest",
    "incident_responder": "granite-soc-incident-responder:latest",
    # ...
}

The 4-tier resolution automatically selects the right model:

Override → Per-agent fine-tuned → Generic fine-tuned → Base

See Granite Module for the full resolution logic.

Verification Checklist

Run through this checklist after deployment:

# 1. Ollama is running and responsive
curl -s http://localhost:11434/api/tags | jq '.models[].name'

# 2. Expected models are loaded
ollama list | grep granite

# 3. Inference works
curl -s http://localhost:11434/api/chat -d '{
"model": "granite-soc:latest",
"messages": [{"role": "user", "content": "What is lateral movement?"}],
"stream": false
}' | jq '.message.content'

# 4. API server starts without errors
curl -s http://localhost:8000/health | jq

# 5. Granite module resolves models correctly
python -c "
from aurorasoc.granite import get_default_granite_config
cfg = get_default_granite_config()
for agent in ['security_analyst', 'threat_hunter', 'incident_responder']:
    print(f'{agent}: {cfg.resolve_model(agent)}')
"

Troubleshooting

Symptom | Cause | Fix
connection refused :11434 | Ollama not running | ollama serve or systemctl start ollama
model not found | Model not imported | ollama list, then ollama create or ollama pull
out of memory | GGUF too large for GPU | Use a smaller quant (Q4_K_M) or set OLLAMA_GPU_LAYERS=0 for CPU
Models load slowly | Cold start | Set OLLAMA_KEEP_ALIVE=30m to keep the model warm
Wrong output format | Missing chat template | Re-import with serve_model.py, which generates a correct Modelfile
CUDA error | Driver mismatch | Check nvidia-smi and Ollama CUDA version compatibility
API returns base model output | GRANITE_USE_FINETUNED=false | Set it to true in .env and restart
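
For the cold-start row above, keeping the model resident looks like this; 30m is only an example duration, and the Ollama server must be restarted (or its systemd unit updated) for the variable to take effect:

export OLLAMA_KEEP_ALIVE=30m
ollama serve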

Production Considerations

For production deployments beyond a single machine:

  • Use vLLM — switch backend for throughput. See Serving Backends.
  • Separate GPU node — run the LLM server on a dedicated GPU machine and point OLLAMA_BASE_URL or VLLM_BASE_URL at its IP (see the sketch after this list).
  • Model versioning — tag models with dates (granite-soc:2025-01-15) to enable rollback.
  • Health monitoring — integrate check_ollama_models() / check_vllm_models() into your monitoring stack.
  • GPU metrics — export nvidia-smi metrics to Prometheus via dcgm-exporter.
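
For the separate GPU node option, a minimal .env sketch on the AuroraSOC host, assuming the Ollama backend; the address is illustrative:

# .env on the AuroraSOC host, pointing at a dedicated GPU machine running Ollama
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://10.0.0.42:11434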

Next Steps