Local Deployment Guide
This page walks through deploying AuroraSOC with Granite LLMs on a single machine — the fastest way to go from fine-tuned model to running agents. It covers the automated setup script, manual steps, Docker Compose integration, and verification.
If you want the full platform bring-up with the dashboard, API, MCP services, orchestrator, and specialist agents, use AI Agent Fleet Deployment. This page stays focused on the model-serving side of the stack.
TL;DR — MVP-1 Single-Command Path
For the MVP-1 host-run topology with the 14-agent mesh and the dashboard, three
Makefile targets are all you need after ollama serve and ollama pull granite3.2:8b:
make llm-doctor # Verify Ollama + Granite 3.2 + BeeAI ChatModel end-to-end
make stack-up # Bring up Postgres, Redis, NATS, API, A2A mesh, dashboard
make agents-smoke # Deterministic prompt to every live agent, PASS/FAIL summary
Tear everything down with make stack-down (set KEEP_INFRA=1 to keep
Postgres/Redis/NATS running between iterations).
For container-based topologies see Deployment Modes. The rest of this page is the long-form reference for each layer.
Prerequisites
Before you begin:
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Ubuntu 22.04 / macOS 14+ | Ubuntu 24.04 |
| RAM | 16 GB | 32 GB |
| GPU VRAM | 4 GB (GGUF q4) | 8+ GB (GGUF q8) |
| Disk | 20 GB free | 50 GB free |
| Ollama | v0.4+ | Latest |
| Docker | 24.0+ | 27.0+ |
| Python | 3.10+ | 3.12 |
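The minimums above can be gated in code before anything is installed. A minimal stdlib-only sketch; the function names are illustrative and not part of AuroraSOC's scripts:

```python
import shutil
import sys

def meets_minimum(installed: tuple, minimum: str) -> bool:
    """True if an installed (major, minor) pair satisfies an 'X.Y' minimum."""
    major, minor = (int(part) for part in minimum.split("."))
    return installed >= (major, minor)

def check_prerequisites() -> list:
    """Collect readable failures instead of stopping at the first problem."""
    failures = []
    if not meets_minimum(sys.version_info[:2], "3.10"):
        failures.append("Python >= 3.10 required")
    for tool in ("ollama", "docker"):
        if shutil.which(tool) is None:
            failures.append(f"{tool} not found on PATH")
    return failures

if __name__ == "__main__":
    problems = check_prerequisites()
    print("all prerequisites met" if not problems else "\n".join(problems))
```

Running it on a machine missing Docker prints one line per failure, which is easier to act on than a stack trace mid-setup.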
Automated Setup – setup_local.sh
The fastest path. This script installs everything and verifies the setup:
chmod +x scripts/setup_local.sh
./scripts/setup_local.sh
What the Script Does
- Checks system dependencies — verifies Python ≥ 3.10, Docker, Docker Compose, NVIDIA drivers (if GPU present)
- Installs Ollama — downloads and installs Ollama if not present
- Pulls the base model — ollama pull granite4:8b
- Creates Python virtualenv — installs AuroraSOC with training extras (pip install -e ".[training]")
- Copies .env.example → .env — seeds runtime defaults (vLLM default backend; switch to Ollama for CPU-only local mode)
- Runs database migrations — alembic upgrade head
- Verifies Ollama inference — sends a test prompt and checks for a valid response
- Prints status summary — shows all service URLs and next steps
When to Use the Script
- First time setup on a new machine
- After cloning the repository on a fresh environment
- When onboarding a new developer who needs everything working quickly
When NOT to Use the Script
- You already have a working environment (just update .env manually)
- You're deploying to production (use Docker Compose instead)
- You need a GPU-first vLLM setup immediately (the script prepares the Ollama local path)
Manual Setup (Step-by-Step)
If you prefer control over each step:
Step 1: Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve & # Start in background
Step 2: Pull or Import a Model
Option A — Use one installed model, or pull one base model:
Check local inventory first, especially on limited data quotas:
ollama list
If a usable model is already installed, use that same tag for both agent roles. Pull granite4:8b only when no acceptable local model exists:
ollama pull granite4:8b
granite3.2:8b is also a supported single-model tag for the entire 14-agent
fleet — set OLLAMA_MODEL=granite3.2:8b and OLLAMA_ORCHESTRATOR_MODEL=granite3.2:8b
if that is what your ollama list already shows. Either tag runs on one warm
process thanks to GRANITE_SINGLE_MODEL_MODE=true.
For quota-limited systems, do not pull a second model until the installed model has failed a small /api/chat or /api/generate probe. AuroraSOC's local MVP expects one inference service and one shared model tag for every specialist plus the orchestrator.
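Such a probe can stay tiny. A hedged stdlib-only sketch against Ollama's /api/generate endpoint; the prompt and limits here are arbitrary choices, not AuroraSOC defaults:

```python
import json
import urllib.request

def build_probe(model: str) -> dict:
    """Minimal non-streaming /api/generate payload any healthy model can answer."""
    return {
        "model": model,
        "prompt": "Reply with OK only.",
        "stream": False,
        "options": {"num_predict": 8, "temperature": 0},
    }

def probe_model(model: str, base_url: str = "http://localhost:11434") -> bool:
    """True if the installed model answers a tiny generate request."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_probe(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return bool(json.load(resp).get("response", "").strip())
    except OSError:
        return False

if __name__ == "__main__":
    print("usable" if probe_model("granite4:8b") else "probe failed; consider pulling a model")
```

Only pull a second model once this returns a failure for every installed candidate tag.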
Option B — Import your fine-tuned GGUF:
# Generate Modelfile + create Ollama model
python training/scripts/serve_model.py ollama \
--gguf training/output/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest
Option C — Import per-agent models:
# Create all agent-specific models at once
python training/scripts/serve_model.py ollama-all \
--output-dir training/output
This creates separate Ollama models for each trained agent:
- granite-soc-security-analyst:latest
- granite-soc-threat-hunter:latest
- granite-soc-incident-responder:latest
- (etc.)
Step 3: Configure Environment
make env-init
Review .env and adjust only if you need to deviate from the host-run demo path:
# Backend selection
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434
# Demo-mode runtime defaults
SYSTEM_MODE=dummy
LOCAL_AUTH_ENABLED=true
DEV_PG_PORT=5432
DEV_REDIS_PORT=6379
DEV_NATS_PORT=4222
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=aurorasoc
PG_USER=aurora
PG_PASSWORD=aurora_dev
PG_SSLMODE=disable
REDIS_URL=redis://localhost:6379
NATS_URL=nats://localhost:4222
# Model selection (base or fine-tuned)
OLLAMA_MODEL=granite4:8b # Base model name in Ollama
OLLAMA_ORCHESTRATOR_MODEL=granite4:8b # Local single-model fallback for orchestration
GRANITE_SINGLE_MODEL_MODE=true # Force all Ollama agents to one model tag
GRANITE_USE_SHARED_MODEL_POOL=true # Reuse equivalent BeeAI ChatModel clients
GRANITE_MAX_CONCURRENT_REQUESTS=1 # Recommended for 8 GB VRAM laptops
GRANITE_REQUEST_QUEUE_SIZE=16
GRANITE_INFERENCE_TIMEOUT_SECONDS=180
# Optional: generic fine-tuned model tag
# OLLAMA_MODEL=granite-soc:latest
# Optional: use per-agent fine-tuned models (turn off single-model mode first)
# GRANITE_USE_FINETUNED=true
# GRANITE_SINGLE_MODEL_MODE=false
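To make the single-model semantics concrete, here is an illustrative resolver for the Ollama tags. This is a sketch of the behaviour, not the actual aurorasoc/granite implementation:

```python
def resolve_ollama_model(agent: str, env: dict) -> str:
    """Illustrative only: with single-model mode on, every agent, including
    the orchestrator, shares the one OLLAMA_MODEL tag."""
    if env.get("GRANITE_SINGLE_MODEL_MODE", "false").lower() == "true":
        return env["OLLAMA_MODEL"]
    if agent == "orchestrator":
        # Only with single-model mode off does the orchestrator tag diverge
        return env.get("OLLAMA_ORCHESTRATOR_MODEL", env["OLLAMA_MODEL"])
    return env["OLLAMA_MODEL"]
```

This is why a single warm Ollama process is enough for the whole fleet: every role resolves to the same tag while the mode is on.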
Step 4: Verify
# Confirm models are loaded
ollama list
# Test inference directly; replace granite4:8b with your selected installed tag
curl -sS http://127.0.0.1:11434/api/chat \
-d '{"model":"granite4:8b","messages":[{"role":"user","content":"Reply with OK only."}],"stream":false,"options":{"num_predict":8,"num_ctx":1024,"temperature":0}}'
# Test via the Granite module
python -c "
from aurorasoc.granite import get_default_granite_config
config = get_default_granite_config()
print(f'Backend: {config.backend}')
print(f'Model: {config.resolve_model(\"security_analyst\")}')
"
# After the API starts, verify the dashboard-facing registry reports one model
curl -sS http://localhost:8000/api/v1/agents | jq '.agents | map(.model) | unique'
To start the host-run A2A mesh against that same model, run:
make agents-local MODEL=<installed-ollama-tag>
The launcher sets both OLLAMA_MODEL and OLLAMA_ORCHESTRATOR_MODEL to the same tag, points local A2A and MCP discovery at 127.0.0.1, and runs all specialist agents plus the orchestrator on ports 9000 through 9016.
Step 5: Start AuroraSOC
# Start the API server
uvicorn aurorasoc.api.main:app --host 0.0.0.0 --port 8000 --reload
# Or use the Makefile
make api
If port 8000 is already occupied, stop the existing process before restarting AuroraSOC:
lsof -i :8000
kill <pid>
If PostgreSQL, Redis, or NATS already use their default host ports, change
DEV_PG_PORT, DEV_REDIS_PORT, or DEV_NATS_PORT in .env and rerun
docker compose -f docker-compose.dev.yml up -d.
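Port collisions like these can also be checked from Python before the stack starts. A small stdlib sketch; the name-to-port mapping mirrors the defaults above:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something already accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1)
        return probe.connect_ex((host, port)) == 0

if __name__ == "__main__":
    # Default ports used by the dev stack: API, Postgres, Redis, NATS
    for name, port in [("api", 8000), ("postgres", 5432), ("redis", 6379), ("nats", 4222)]:
        print(f"{name:9}:{port} {'IN USE' if port_in_use(port) else 'free'}")
```

Anything reported IN USE needs either the existing process stopped or the corresponding DEV_*_PORT remapped in .env.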
Docker Compose Deployment
For a containerised deployment of the default Compose stack:
The x-granite-env YAML Anchor
The docker-compose.yml uses a YAML anchor to avoid repeating Granite environment variables across services:
x-granite-env: &granite-env
LLM_BACKEND: ${LLM_BACKEND:-vllm}
VLLM_BASE_URL: ${VLLM_BASE_URL:-http://vllm:8000/v1}
VLLM_MODEL: ${VLLM_MODEL:-granite-soc-specialist}
VLLM_ORCHESTRATOR_MODEL: ${VLLM_ORCHESTRATOR_MODEL:-granite-soc-specialist}
OLLAMA_BASE_URL: ${OLLAMA_BASE_URL:-http://ollama:11434}
OLLAMA_MODEL: ${OLLAMA_MODEL:-granite4:8b}
OLLAMA_ORCHESTRATOR_MODEL: ${OLLAMA_ORCHESTRATOR_MODEL:-granite4:8b}
GRANITE_USE_FINETUNED: ${GRANITE_USE_FINETUNED:-false}
Why an anchor? Multiple services (API, workers, health-check) need the same Granite settings. The anchor ensures they stay in sync: change it once, and every service that references *granite-env picks up the change.
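As a rough Python analogy (Compose is not implemented this way, but the merge semantics match): the anchor behaves like spreading a shared dict first, with service-specific keys written afterwards winning.

```python
# Shared defaults, analogous to the x-granite-env: &granite-env anchor
granite_env = {
    "LLM_BACKEND": "vllm",
    "OLLAMA_MODEL": "granite4:8b",
    "GRANITE_USE_FINETUNED": "false",
}

# A service merging the block, analogous to <<: *granite-env;
# keys listed after the spread override the shared defaults.
api_env = {**granite_env, "LLM_BACKEND": "ollama"}

print(api_env["LLM_BACKEND"])   # service-specific override wins
print(api_env["OLLAMA_MODEL"])  # shared default is inherited unchanged
```

Editing granite_env in one place updates every service that spreads it, which is exactly the sync guarantee the anchor provides.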
Each service merges these variables:
services:
api:
environment:
<<: *granite-env
# ... other service-specific vars
Using Host-Native Ollama With Compose
AuroraSOC now ships a checked-in override file, docker-compose.host-ollama.yml, for the common local case where Ollama already runs on the host machine and you want the Compose-managed API, worker, and agent services to reuse that single local model.
Use it like this:
export LLM_BACKEND=ollama
export OLLAMA_DOCKER_BASE_URL=http://host.docker.internal:11434
export OLLAMA_MODEL=granite4:8b
export OLLAMA_ORCHESTRATOR_MODEL=granite4:8b
docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml up -d
docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml --profile agents up -d
What the override does:
- Repoints API, worker, orchestrator, and specialist agents at the host Ollama endpoint.
- Adds host.docker.internal:host-gateway so Linux containers can reach the host cleanly.
- Disables the default ollama service unless you explicitly opt into the fallback profile.
- Keeps the local single-model path on granite4:8b for both specialists and orchestration unless you override the tags and disable GRANITE_SINGLE_MODEL_MODE.
Use this override only when host Ollama is already running and the required model tags are already pulled.
For the minimal orchestrator plus Network Analyzer stack, the Makefile uses the same MODEL value for specialists and orchestration:
make docker-up-minimal MODEL=granite3.2:8b
Replace granite3.2:8b with whichever single model tag already appears in ollama list.
Starting Docker Compose
# Start the default Compose stack (Ollama, API, workers, monitoring)
docker compose up -d
# Check service health
docker compose ps
# View Ollama logs
docker compose logs ollama
# View API logs
docker compose logs api
Add --profile agents when you want the full agent fleet, and add
--profile rust-core only when you need the optional Rust fast path.
Docker Compose Service Architecture
Importing Fine-Tuned Models in Docker
After training, import your GGUF into the Docker Ollama instance:
# Copy GGUF into the Ollama container (copy its Modelfile alongside it the same way)
docker compose cp training/output/generic/unsloth.Q8_0.gguf ollama:/tmp/
# Exec into the container and create the model (expects the Modelfile at /tmp/Modelfile)
docker compose exec ollama ollama create granite-soc:latest \
-f /tmp/Modelfile
# Or use the serve script which handles this automatically
docker compose exec api python training/scripts/serve_model.py ollama \
--gguf /models/generic/unsloth.Q8_0.gguf \
--name granite-soc:latest
Enabling / Disabling Fine-Tuned Models
Use base model only (no fine-tuning)
# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=false
OLLAMA_MODEL=granite4:8b
All agents will use the same base model. Good for initial development.
Use a single fine-tuned generic model
# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=true
OLLAMA_MODEL=granite-soc:latest
All agents share one fine-tuned model. Good after generic training.
Use per-agent fine-tuned models
# .env
LLM_BACKEND=ollama
GRANITE_USE_FINETUNED=true
OLLAMA_MODEL=granite-soc:latest # fallback for agents without a specialist model
The AGENT_MODEL_MAP in aurorasoc/granite/__init__.py maps each agent to its specialist:
AGENT_MODEL_MAP = {
"security_analyst": "granite-soc-security-analyst:latest",
"threat_hunter": "granite-soc-threat-hunter:latest",
"incident_responder": "granite-soc-incident-responder:latest",
# ...
}
The 4-tier resolution automatically selects the right model:
Override → Per-agent fine-tuned → Generic fine-tuned → Base
See Granite Module for the full resolution logic.
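As a compressed illustration of those four tiers (the map subset, parameter names, and defaults here are illustrative; the real logic lives in the Granite Module):

```python
from typing import Optional

# Illustrative subset of AGENT_MODEL_MAP
AGENT_MODEL_MAP = {
    "security_analyst": "granite-soc-security-analyst:latest",
}

def resolve_model(agent: str,
                  override: Optional[str] = None,
                  use_finetuned: bool = True,
                  generic: Optional[str] = "granite-soc:latest",
                  base: str = "granite4:8b") -> str:
    """Override -> per-agent fine-tuned -> generic fine-tuned -> base."""
    if override:
        return override                        # Tier 1: explicit override
    if use_finetuned and agent in AGENT_MODEL_MAP:
        return AGENT_MODEL_MAP[agent]          # Tier 2: per-agent specialist
    if use_finetuned and generic:
        return generic                         # Tier 3: generic fine-tuned
    return base                                # Tier 4: base model
```

An agent with a specialist entry gets it, an unmapped agent falls back to the generic tag, and turning fine-tuning off drops everyone to the base model.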
Verification Checklist
Run through this checklist after deployment:
# 1. Ollama is running and responsive
curl -s http://localhost:11434/api/tags | jq '.models[].name'
# 2. Expected models are loaded
ollama list | grep granite
# 3. Inference works
curl -s http://localhost:11434/api/chat -d '{
"model": "granite-soc:latest",
"messages": [{"role": "user", "content": "What is lateral movement?"}],
"stream": false
}' | jq '.message.content'
# 4. API server starts without errors
curl -s http://localhost:8000/health | jq
# 5. Granite module resolves models correctly
python -c "
from aurorasoc.granite import get_default_granite_config
cfg = get_default_granite_config()
for agent in ['security_analyst', 'threat_hunter', 'incident_responder']:
print(f'{agent}: {cfg.resolve_model(agent)}')
"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| connection refused :11434 | Ollama not running | ollama serve or systemctl start ollama |
| model not found | Model not imported | ollama list → then ollama create or ollama pull |
| out of memory | GGUF too large for GPU | Use smaller quant (Q4_K_M) or set OLLAMA_GPU_LAYERS=0 for CPU |
| Models load slowly | Cold start | Set OLLAMA_KEEP_ALIVE=30m to keep model warm |
| Wrong output format | Missing chat template | Re-import with serve_model.py, which generates a correct Modelfile |
| CUDA error | Driver mismatch | Check nvidia-smi and Ollama CUDA version compatibility |
| API returns base model output | GRANITE_USE_FINETUNED=false | Set it to true in .env and restart |
Production Considerations
For production deployments beyond a single machine:
- Use vLLM — switch backend for throughput. See Serving Backends.
- Separate GPU node — run the LLM server on a dedicated GPU machine and point OLLAMA_BASE_URL or VLLM_BASE_URL at its IP.
- Model versioning — tag models with dates (granite-soc:2025-01-15) to enable rollback.
- Health monitoring — integrate check_ollama_models() / check_vllm_models() into your monitoring stack.
- GPU metrics — export nvidia-smi metrics to Prometheus via dcgm-exporter.
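For the health-monitoring point, check_ollama_models() is the real helper; the stdlib sketch below only shows the shape of the comparison a monitoring probe performs against Ollama's /api/tags inventory:

```python
import json
import urllib.request

def diff_models(expected, installed):
    """Pure comparison: which expected tags are missing from the inventory."""
    return sorted(set(expected) - set(installed))

def missing_models(expected, base_url="http://localhost:11434"):
    """Fetch Ollama's inventory and report absent tags (empty list == healthy)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        installed = [m["name"] for m in json.load(resp).get("models", [])]
    return diff_models(expected, installed)

if __name__ == "__main__":
    print(missing_models(["granite4:8b", "granite-soc:latest"]))
```

A monitoring stack can alert whenever the returned list is non-empty, which catches a rebuilt Ollama volume before agents start failing.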
Next Steps
- Serving Backends — Ollama vs vLLM deep dive
- Model Swap & Override — switch models without redeploying
- Training: Overview — go back and train a model