Multi-instance Deployment

Run any number of tenant-server replicas behind a load balancer with no "must be on the same pod" constraints. Every service component is designed to be stateless — shared state lives in Postgres and Redis.

💡

TL;DR

Point TENANT_BASE_URL at your load balancer. All replicas share the same Postgres and Redis. No sticky sessions required for voice. The customization-agent worker can reach any replica.

Stateless by design

The tenant server holds no in-memory state that must survive a round-robin to a different replica. JWT signing keys live in Postgres (via the livekit_server namespace) and in your deploy-time passwords.yaml. Every replica reads from the same source, so any pod can mint a LiveKit JWT for any user.

The customization-agent (voice/avatar worker) calls back to /internal/customization-agent/* routes. These routes look up the org user in Postgres, validate the tenant, and proceed — no per-replica in-memory state is consulted. Point TENANT_BASE_URL at the load balancer and any replica can answer.
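To make the "no per-replica state" property concrete, here is a minimal Python sketch of a callback handler that consults only shared data. It is an illustration, not the tenant server's actual Dart code: `ORG_USERS`, `SHARED_SECRET`, `Callback`, and `handle_callback` are hypothetical names, and the dictionary stands in for a Postgres lookup.

```python
import hmac
from dataclasses import dataclass

# Hypothetical stand-in for the shared Postgres table. In a real deployment
# this would be a SQL query that every replica can run.
ORG_USERS = {("tenant-a", 42): {"active": True}}
SHARED_SECRET = "example-secret"  # assumed to come from deploy-time config


@dataclass
class Callback:
    tenant_id: str
    organization_user_id: int
    secret_header: str


def handle_callback(cb: Callback) -> int:
    """Return an HTTP-style status code using only shared state."""
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(cb.secret_header.encode(), SHARED_SECRET.encode()):
        return 401
    user = ORG_USERS.get((cb.tenant_id, cb.organization_user_id))
    if user is None or not user["active"]:
        return 404
    return 200
```

Because the function reads nothing that belongs to a particular process, the same request produces the same answer on whichever replica the load balancer picks.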

How coordination works across replicas

Anywhere replicas would otherwise step on each other under load balancing, YOffice coordinates through Redis. Each concern below maps to a distributed primitive that keeps every replica in lockstep:

Coordination concern	Why it matters across replicas	How YOffice handles it
Periodic sweeps in server.dart	Without coordination, every replica would fire every sweep → N× the work and duplicate triggers	`LeaderOnlyTimer.periodic(taskName: …)` — a Redis-elected leader runs the sweep
Per-org request rate limiting	An in-memory counter would let each replica enforce the cap on its own slice, so N replicas allow N× the limit	`DistributedRateLimiter` — atomic Redis INCRBY per tenant + time bucket
Per-IP rate limit for public forms	Requests could otherwise bypass the limit by hitting different replicas	`DistributedRateLimiter` keyed by IP + bucket
Embedding concurrency	A per-replica semaphore would multiply by N replicas and exceed provider quotas	`DistributedSemaphore` with cluster-wide capacity
Directive cache invalidation	A local-only invalidate would leave other replicas serving stale directives for up to 5 minutes	Redis generation counter — any replica's invalidate bumps it; all replicas see the bump
Real-time pub/sub fan-out	`postMessage` without `global: true` reaches only the publishing replica's subscribers	All ~170 real-time call sites pass `global: true` directly on `postMessage`, so every replica's subscribers are reached; `ClusterPubsub.broadcast` wraps the same call
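The generation-counter row can be sketched as follows. This is a simplified Python model under stated assumptions, not the actual implementation: `FakeRedis` is an in-memory stand-in for the shared Redis, `GENERATION_KEY` is a hypothetical key name, and the cache checks the counter on every read rather than modelling any local time-to-live.

```python
class FakeRedis:
    """In-memory stand-in for the shared Redis; models only INCR and GET."""

    def __init__(self):
        self._data = {}

    def incr(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def get(self, key):
        return self._data.get(key, 0)


GENERATION_KEY = "directives:generation"  # hypothetical key name


class DirectiveCache:
    """Per-replica cache that reloads whenever the shared generation changes."""

    def __init__(self, redis, loader):
        self._redis = redis
        self._loader = loader          # callable that reads the source of truth
        self._seen_generation = -1     # forces a load on first access
        self._value = None

    def get(self):
        current = self._redis.get(GENERATION_KEY)
        if current != self._seen_generation:
            self._value = self._loader()
            self._seen_generation = current
        return self._value

    def invalidate(self):
        # Bumping the shared counter is visible to every replica.
        self._redis.incr(GENERATION_KEY)


# Two caches sharing one FakeRedis play the role of two replicas.
redis = FakeRedis()
source = {"directive": "v1"}
replica_a = DirectiveCache(redis, lambda: dict(source))
replica_b = DirectiveCache(redis, lambda: dict(source))
replica_b.get()                 # replica B caches v1
source["directive"] = "v2"
replica_a.invalidate()          # replica A bumps the generation
refreshed = replica_b.get()     # replica B notices the bump and reloads
```

The point of the design is that an invalidate on one replica costs a single counter increment, and every other replica detects it with a cheap read instead of needing a direct message.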

Scaling primitives

All primitives live in command_center_tenant_server/lib/src/scaling/horizontal_scaling.dart. They require only a shared Redis cluster — no separate coordination service.

DistributedLock
Redis SET NX PX advisory lock. Used to coordinate cross-replica mutual exclusion. Auto-expires after a configurable TTL so a crashed owner cannot block the cluster forever. Long-running holders must call renew() within the TTL window.
LeaderOnlyTimer
A drop-in replacement for Timer.periodic that uses a per-task Redis leader lease. Only the elected leader replica runs the body; all other replicas sit idle. Each task elects its own leader independently, spreading load naturally across pods.
DistributedSemaphore
A counting semaphore that bounds cluster-wide concurrency with Redis slot keys. Replaces per-isolate semaphores that only enforce limits locally — important for bulk knowledge-folder embeddings and similar provider-quota-sensitive operations.
DistributedRateLimiter
Atomic INCRBY in Redis keyed by tenant + time bucket. Enforces per-org request caps across every replica simultaneously. Per-org overrides are written by the admin rate-limit settings screen and picked up by every replica within ~30 seconds.
ClusterPubsub
A thin wrapper around Serverpod session.messages that passes global: true so pub/sub events reach subscribers on every replica, not just the one that published. Real-time chat, presence, and workflow stream events all pass global: true the same way, guaranteeing cluster-wide fan-out.
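As an illustration of the lock semantics described above (SET NX PX, TTL expiry, and `renew()`), here is a small Python model. It is a sketch under assumptions and not the code in horizontal_scaling.dart: `FakeRedis` is an in-memory stand-in with an injectable clock, and `DistributedLockSketch`, `expire_if_equal`, and `delete_if_equal` are hypothetical names. Against a real Redis, the compare-and-expire and compare-and-delete steps would typically be Lua scripts so that they stay atomic.

```python
import uuid


class FakeRedis:
    """In-memory stand-in modelling SET NX PX and owner-checked updates."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in ms
        self._data = {}       # key -> (value, expires_at_ms)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > self._clock():
            return entry[0]
        self._data.pop(key, None)   # drop expired entries lazily
        return None

    def set_nx_px(self, key, value, ttl_ms):
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_ms)
        return True

    def expire_if_equal(self, key, value, ttl_ms):
        if self._live(key) != value:
            return False
        self._data[key] = (value, self._clock() + ttl_ms)
        return True

    def delete_if_equal(self, key, value):
        if self._live(key) != value:
            return False
        del self._data[key]
        return True


class DistributedLockSketch:
    def __init__(self, redis, name, ttl_ms):
        self._redis = redis
        self._key = f"lock:{name}"
        self._ttl_ms = ttl_ms
        self._token = uuid.uuid4().hex   # unique owner token for this holder

    def acquire(self):
        return self._redis.set_nx_px(self._key, self._token, self._ttl_ms)

    def renew(self):
        # Only the current owner may extend the lease.
        return self._redis.expire_if_equal(self._key, self._token, self._ttl_ms)

    def release(self):
        # Only the current owner may delete the key.
        return self._redis.delete_if_equal(self._key, self._token)
```

The owner token is what stops a replica whose lease has already expired from releasing a lock that another replica now holds. A per-task leader lease can be modelled the same way by using the task name as the lock name and renewing on each tick, although that mapping is an assumption rather than something the text confirms.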

Deployment topology

A typical production setup looks like this:

Clients (Flutter / browser)
        │  HTTPS / WSS
  Load balancer (nginx / Cloudflare / ALB)
     │                   │
tenant-server pod   tenant-server pod   … × N
     │                   │
     ├── Postgres (shared)
     ├── Redis (shared)
     ├── LiveKit Cloud
     └── MinIO / Supabase Storage

customization-agent worker(s)
  └── HTTP callbacks → TENANT_BASE_URL/internal/…
      (any pod can answer)

Deployment checklist

Point TENANT_BASE_URL at the load balancer
Set TENANT_BASE_URL to your load-balancer hostname (e.g. https://tenant.your-domain). Every callback the customization-agent worker makes — /internal/customization-agent/* — carries tenantId, organizationUserId, and a secret header. Any replica can respond; no in-memory state from the JWT-minting replica is required.
Configure shared Postgres and Redis
All replicas must read from the same Postgres instance and the same Redis cluster. The distributed primitives (locks, leader elections, semaphores, rate limits, pub/sub) only work correctly when all pods share a single Redis keyspace.
Enable sticky sessions for WebSocket (optional but recommended)
Sticky sessions are not required for voice — LiveKit Cloud connects clients directly to its own media plane after JWT minting. Sticky sessions ARE helpful for the Serverpod WebSocket so chat clients do not churn through reconnects when a pod restarts. Use ip_hash on nginx or a session-affinity policy on a managed load balancer.
Set CC_INSTANCE_ID per pod (optional)
Each pod derives a stable identity from CC_INSTANCE_ID → HOSTNAME → platform hostname → random fallback. Setting CC_INSTANCE_ID explicitly in your deployment manifest gives you consistent pod names in logs, distributed-lock owner tokens, and the /internal/instance-info debug route.
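The identity fallback chain in the last item can be sketched in Python as follows. This is an illustration only: `resolve_instance_id` is a hypothetical helper, and the `instance-` prefix on the random fallback is an assumed format, not the server's actual one.

```python
import socket
import uuid


def resolve_instance_id(env, platform_hostname=socket.gethostname):
    """Follow CC_INSTANCE_ID -> HOSTNAME -> platform hostname -> random."""
    for candidate in (env.get("CC_INSTANCE_ID"), env.get("HOSTNAME")):
        if candidate and candidate.strip():
            return candidate.strip()
    try:
        host = platform_hostname()
    except OSError:
        host = ""
    if host:
        return host
    # Assumed format for the random fallback; not stable across restarts.
    return f"instance-{uuid.uuid4().hex[:8]}"
```

In practice you would call it with `os.environ`. The random fallback changes on every restart, which is why setting CC_INSTANCE_ID explicitly gives more readable logs and lock-owner tokens.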

Voice and LiveKit Cloud

ℹ️

No sticky sessions needed for voice

LiveKit Cloud connects clients directly to its own WebRTC media plane after the tenant server issues a JWT. The tenant server is not in the media path after token issuance, so session affinity does not affect voice quality.

The customization-agent worker registers with LiveKit Cloud using the agent_name=customization identifier. It dispatches callbacks to TENANT_BASE_URL — your load balancer — and any replica can handle them. The worker itself can also run as multiple instances; each instance registers with LiveKit independently.

Per-org rate limits

Organization admins can adjust the per-org callback rate cap in Settings → Rate limit. The new value is written to Postgres and Redis immediately, and every replica picks it up without a restart within about 30 seconds, which is the AI server's own policy-cache refresh interval. The default is 240 requests per minute.
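A fixed-window limiter built on an atomic counter per tenant and time bucket can be sketched like this. It is a Python model under assumptions, not the actual `DistributedRateLimiter`: `FakeRedis`, `RateLimiterSketch`, and the `ratelimit:` key layout are hypothetical, and the per-org overrides are passed in directly instead of being refreshed from a policy cache.

```python
class FakeRedis:
    """In-memory stand-in for the shared Redis; models only INCRBY."""

    def __init__(self):
        self._data = {}

    def incrby(self, key, amount):
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]


DEFAULT_LIMIT_PER_MINUTE = 240  # default stated in the text


class RateLimiterSketch:
    def __init__(self, redis, overrides=None, window_seconds=60):
        self._redis = redis
        self._overrides = overrides or {}   # tenant_id -> custom cap
        self._window = window_seconds

    def allow(self, tenant_id, now_seconds):
        # Every replica computes the same bucket for the same wall-clock minute.
        bucket = int(now_seconds // self._window)
        key = f"ratelimit:{tenant_id}:{bucket}"   # hypothetical key layout
        # A real Redis key would also get an expiry so old buckets disappear.
        count = self._redis.incrby(key, 1)
        limit = self._overrides.get(tenant_id, DEFAULT_LIMIT_PER_MINUTE)
        return count <= limit


# Two limiter objects sharing one FakeRedis play the role of two replicas.
shared = FakeRedis()
replica_1 = RateLimiterSketch(shared, overrides={"org-1": 3})
replica_2 = RateLimiterSketch(shared, overrides={"org-1": 3})
decisions = [
    replica_1.allow("org-1", 10),
    replica_2.allow("org-1", 11),
    replica_1.allow("org-1", 12),
    replica_2.allow("org-1", 13),   # fourth request in the same minute
]
```

Since both replicas increment the same counter, the cap of 3 holds across the pair instead of each replica allowing 3 on its own.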

⚠️

Redis is required for multi-instance mode

The distributed primitives (locks, leader elections, semaphores, rate limits, pub/sub) all require Redis. If Redis is unavailable, each one degrades deliberately rather than silently duplicating work: rate limits fail open (requests are allowed rather than double-enforced), semaphores fall back to an unbounded lease, and leader-elected periodic tasks pause cluster-wide instead of running independently on every replica. This avoids the duplicate-work bugs described above, but rate limiting and concurrency caps are effectively off until Redis is back.
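The two degradation modes, fail open for rate limits and pause for leader-elected tasks, can be contrasted in a short Python sketch. All names here (`RedisUnavailable`, `allow_request`, `should_run_sweep`) are hypothetical and only illustrate the behaviour described above.

```python
class RedisUnavailable(Exception):
    """Hypothetical error raised when the shared Redis cannot be reached."""


def allow_request(check_limit):
    """Fail open: if the limiter cannot be consulted, allow the request."""
    try:
        return check_limit()
    except RedisUnavailable:
        return True


def should_run_sweep(try_acquire_leadership):
    """Pause: if leadership cannot be confirmed, this replica skips the sweep."""
    try:
        return try_acquire_leadership()
    except RedisUnavailable:
        return False
```

The asymmetry is intentional in this model: rejecting user traffic during an outage is worse than briefly over-admitting it, while running a sweep on every replica is worse than delaying it.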
