
Federation mesh

This is demo-grade but real inter-site federation (ADR 029): NATS leafnodes provide the transport layer, with health gossip and severity-gated alert sharing on top. The full federation controller (policy bundles, cross-site case handoff, federated search) remains deferred per the architecture document.

Transport

The primary site's NATS exposes a leafnode listener (infra/nats/nats-server.conf, port 7422, TLS); peer sites run their own NATS servers attached as leafnodes (infra/nats/nats-leaf.conf). Subject interest propagates across the link, so aurora.> traffic published at any site reaches every peer, while each side keeps an independent JetStream: distinct domain: values keep one site's JetStream API requests from being captured by another site. Plaintext dev configs for the local two-site demo live at infra/nats/dev-hub.conf and infra/nats/dev-leaf.conf.

Code map

| Piece | Where |
| --- | --- |
| Settings (FEDERATION_*) | aurorasoc/config/settings/messaging.py (FederationSettings) |
| Core logic | aurorasoc/services/federation.py |
| Worker loops | aurorasoc/workers/federation_worker.py |
| NATS subjects + clients | aurorasoc/events/nats_jetstream.py |
| Lifecycle | API lifespan in aurorasoc/api/main.py (only when enabled) |
| Tests | tests/backend/test_federation.py |

Health gossip

run_gossip_loop publishes build_health_payload() to aurora.federation.health.<site_id> every heartbeat_interval_seconds (15s default), refreshes the local SiteModel row, and runs mark_stale_links so links whose peers went silent demote to degraded and then down (thresholds: 45s / 120s by default).
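The staleness rule can be sketched as a pure function of heartbeat age. This is an illustration only, not the code in aurorasoc/services/federation.py: the function name classify_link, the constant names, and the inclusive threshold comparison are assumptions; only the 45s / 120s defaults and the status names come from this page.

```python
from datetime import datetime, timedelta, timezone

# Defaults quoted above; the constant names themselves are hypothetical.
DEGRADED_AFTER_SECONDS = 45
DOWN_AFTER_SECONDS = 120


def classify_link(
    last_heartbeat: datetime,
    now: datetime,
    degraded_after: float = DEGRADED_AFTER_SECONDS,
    down_after: float = DOWN_AFTER_SECONDS,
) -> str:
    """Map the age of the last peer heartbeat to a link status."""
    age = (now - last_heartbeat).total_seconds()
    # Check the larger threshold first so a long silence reports "down".
    if age >= down_after:
        return "down"
    if age >= degraded_after:
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for silence in (10, 60, 300):
        print(silence, classify_link(now - timedelta(seconds=silence), now))
```

With the 15s default interval, a peer has to miss roughly three heartbeats before its link is marked degraded, and eight before it is marked down.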

run_health_listener consumes peer gossip and upserts the remote SiteModel plus the undirected SiteLinkModel (link-<a>-<b>, with the two site IDs in lexicographic order). Link status derives from heartbeat age. Latency is approximated from gossip propagation age, which is accurate on one host but sensitive to clock skew across real WANs; echo-based measurement is the follow-up once the mesh leaves demo stage.
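A minimal sketch of the two derived values, assuming the gossip payload carries a send timestamp. The helper names link_id and approx_latency_ms are hypothetical, and clamping negative ages to zero is an assumption made here to absorb clock skew, not something this page states.

```python
from datetime import datetime, timedelta, timezone


def link_id(site_a: str, site_b: str) -> str:
    """Build the undirected link ID: both directions map to the same row."""
    first, second = sorted((site_a, site_b))
    return f"link-{first}-{second}"


def approx_latency_ms(sent_at: datetime, received_at: datetime) -> float:
    """Approximate latency as gossip propagation age in milliseconds.

    This compares clocks on two different hosts, so skew leaks straight
    into the result; negative ages are clamped to zero (an assumption).
    """
    age_ms = (received_at - sent_at).total_seconds() * 1000.0
    return max(0.0, age_ms)


if __name__ == "__main__":
    print(link_id("site-b", "site-a"))
    sent = datetime.now(timezone.utc)
    print(approx_latency_ms(sent, sent + timedelta(milliseconds=250)))
```

Sorting the pair is what makes the link undirected: gossip from either side resolves to the same ID, so both listeners update one row rather than creating two.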

/api/v1/sites and /api/v1/system/topology serve this state with no contract change; the operator console's SOC Site Topology map renders it directly.

Alert federation

The alert create path calls should_federate(severity). At or above federate_min_severity (default high), the alert is stamped with its origin (build_federated_alert) and published best-effort to aurora.alerts.federation.<severity>; the local write never blocks on the mesh.
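The severity gate can be illustrated as follows. This is a stand-in, not the project's should_federate: the four-level severity scale, the second parameter, the federation_subject helper, and the choice to reject unknown severities are all assumptions; only the high default and the subject pattern come from this page.

```python
# Assumed ordering; this page only names "high" and "critical".
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def should_federate(severity: str, min_severity: str = "high") -> bool:
    """Return True when severity is at or above the federation threshold."""
    rank = SEVERITY_RANK.get(severity.lower())
    floor = SEVERITY_RANK.get(min_severity.lower())
    if rank is None or floor is None:
        # Unknown labels are not shared (a conservative assumption).
        return False
    return rank >= floor


def federation_subject(severity: str) -> str:
    """Build the subject following aurora.alerts.federation.<severity>."""
    return f"aurora.alerts.federation.{severity.lower()}"


if __name__ == "__main__":
    for level in ("medium", "high", "critical"):
        print(level, should_federate(level), federation_subject(level))
```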

run_alert_listener persists peer alerts via ingest_remote_alert: own-origin echoes are dropped, replays are deduplicated on a hash of (origin_site, origin_alert_id), IOCs are normalized to the canonical dict shape, and the row lands with source=federation:<origin> plus origin metadata, which the alert queue renders as the purple origin badge.
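The echo drop and replay dedup can be sketched like this. It is a simplified illustration under several assumptions: SHA-256 as the hash, an in-memory set standing in for the database lookup, and the dict field names origin_site and origin_alert_id as the wire format. IOC normalization is omitted, and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def dedup_key(origin_site: str, origin_alert_id: str) -> str:
    """Hash the (origin_site, origin_alert_id) pair into a stable key."""
    # A separator byte keeps ("ab", "c") and ("a", "bc") distinct.
    raw = f"{origin_site}\x1f{origin_alert_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def ingest_remote_alert_sketch(
    alert: dict, local_site_id: str, seen: set
) -> Optional[dict]:
    """Return the row to persist, or None if the alert should be skipped."""
    origin = alert["origin_site"]
    if origin == local_site_id:
        return None  # our own alert echoed back over the mesh
    key = dedup_key(origin, alert["origin_alert_id"])
    if key in seen:
        return None  # replay of an alert that was already ingested
    seen.add(key)
    return {**alert, "source": f"federation:{origin}", "dedup_key": key}


if __name__ == "__main__":
    seen = set()
    peer_alert = {"origin_site": "site-b", "origin_alert_id": "42"}
    print(ingest_remote_alert_sketch(peer_alert, "site-a", seen))
    print(ingest_remote_alert_sketch(peer_alert, "site-a", seen))
```

Keying on the origin pair rather than on a local ID is what makes redelivery harmless: the same peer alert always hashes to the same key, however many times the subscriber receives it.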

Running two sites locally

just stack-up # primary stack (set FEDERATION_ENABLED=true in .env)
just stack-up-site-b # second site: own Postgres/Redis + NATS leaf, API on :8002
just migrate-site-b

tools/scripts/demo/attack_simulator.py --api-base http://localhost:8002 --scenario c2-beacon raises a critical alert at site B; within a heartbeat it appears in the primary site's queue with the origin badge, and the topology map shows the live healthy link.