βš–οΈ Infrastructure

Multi-instance Deployment

Run any number of tenant-server replicas behind a load balancer with no "must be on the same pod" constraints. Every service component is designed to be stateless: shared state lives in Postgres and Redis.

💡
TL;DR

Point TENANT_BASE_URL at your load balancer. All replicas share the same Postgres and Redis. No sticky sessions required for voice. The customization-agent worker can reach any replica.

Stateless by design

The tenant server holds no in-memory state that must survive a round-robin to a different replica. JWT signing keys live in Postgres (via the livekit_server namespace) and in your deploy-time passwords.yaml. Every replica reads from the same source, so any pod can mint a LiveKit JWT for any user.

The customization-agent (voice/avatar worker) calls back to /internal/customization-agent/* routes. These routes look up the org user in Postgres, validate the tenant, and proceed; no per-replica in-memory state is consulted. Point TENANT_BASE_URL at the load balancer and any replica can answer.

How coordination works across replicas

Anywhere replicas would otherwise step on each other under load balancing, YOffice coordinates through Redis. Each concern below maps to a distributed primitive that keeps every replica in lockstep:

| Coordination concern | Why it matters across replicas | How YOffice handles it |
|---|---|---|
| Periodic sweeps in server.dart | Without coordination, every replica would fire every sweep → N× the work and duplicate triggers | LeaderOnlyTimer.periodic(taskName: …): a Redis-elected leader runs the sweep |
| Per-org request rate limiting | An in-memory counter would let each replica enforce the cap on its own slice, so N replicas allow N× the limit | DistributedRateLimiter: atomic Redis INCRBY per tenant + time bucket |
| Per-IP rate limit for public forms | Requests could otherwise bypass the limit by hitting different replicas | DistributedRateLimiter keyed by IP + bucket |
| Embedding concurrency | A per-replica semaphore would multiply by N replicas and exceed provider quotas | DistributedSemaphore with cluster-wide capacity |
| Directive cache invalidation | A local-only invalidate would leave other replicas serving stale directives for up to 5 minutes | Redis generation counter: any replica's invalidate bumps it; all replicas see the bump |
| Real-time pub/sub fan-out | postMessage without global: true reaches only the publishing replica's subscribers | All 137 call sites use ClusterPubsub.broadcast (global hardwired) |
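
The generation-counter row can be illustrated with a minimal sketch. It is written in Python with an in-memory stand-in for Redis and is not the YOffice implementation: the names FakeRedis, DirectiveCache, and the key directives:generation are hypothetical, chosen only to show the idea that every replica compares a shared counter before trusting its local cache.

```python
from typing import Callable, Dict


class FakeRedis:
    """Hypothetical in-memory stand-in for the two counter operations used here."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def get(self, key: str) -> int:
        # A missing key reads as generation 0 in this sketch.
        return self._data.get(key, 0)


class DirectiveCache:
    """Per-replica cache that is dropped whenever the shared generation changes."""

    def __init__(self, redis: FakeRedis, loader: Callable[[str], str],
                 gen_key: str = "directives:generation") -> None:
        self._redis = redis
        self._loader = loader          # assumed to read from the shared database
        self._gen_key = gen_key        # hypothetical key name
        self._seen_generation = redis.get(gen_key)
        self._entries: Dict[str, str] = {}

    def get(self, org_id: str) -> str:
        current = self._redis.get(self._gen_key)
        if current != self._seen_generation:
            # Some replica invalidated since we last looked: drop local entries.
            self._entries.clear()
            self._seen_generation = current
        if org_id not in self._entries:
            self._entries[org_id] = self._loader(org_id)
        return self._entries[org_id]

    def invalidate(self) -> None:
        # Bumping the shared counter is what makes the invalidate cluster-wide.
        self._redis.incr(self._gen_key)
```

In this sketch each read costs one counter lookup, which is the price of seeing another replica's invalidate on the very next read instead of waiting for a local expiry.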

Scaling primitives

All primitives live in command_center_tenant_server/lib/src/scaling/horizontal_scaling.dart. They require only a shared Redis cluster, with no separate coordination service.

  1. DistributedLock

    Redis SET NX PX advisory lock. Used to coordinate cross-replica mutual exclusion. Auto-expires after a configurable TTL so a crashed owner cannot block the cluster forever. Long-running holders must call renew() within the TTL window.

  2. LeaderOnlyTimer

    A drop-in replacement for Timer.periodic that uses a per-task Redis leader lease. Only the elected leader replica runs the body; all other replicas sit idle. Each task elects its own leader independently, spreading load naturally across pods.

  3. DistributedSemaphore

    A counting semaphore that bounds cluster-wide concurrency with Redis slot keys. Replaces per-isolate semaphores that only enforce limits locally, which matters for bulk knowledge-folder embeddings and similar provider-quota-sensitive operations.

  4. DistributedRateLimiter

    Atomic INCRBY in Redis keyed by tenant + time bucket. Enforces per-org request caps across every replica simultaneously. Per-org overrides are written by the admin rate-limit settings screen and pushed to Redis within ~30 seconds.

  5. ClusterPubsub

    A thin wrapper around Serverpod session.messages that passes global: true so pub/sub events reach subscribers on every replica, not just the one that published. All real-time chat, presence, and workflow stream events use this wrapper.
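
As an illustration of the first primitive, here is a minimal sketch of a SET NX PX advisory lock with an owner token and renewal. It is Python against an in-memory stand-in, not the Dart code in horizontal_scaling.dart; FakeRedis, its method names, and the lock: key prefix are assumptions made for the example. On a real Redis server the compare-then-expire and compare-then-delete steps would need to run atomically, typically in a Lua script.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple


class FakeRedis:
    """Hypothetical in-memory stand-in with millisecond expiry and an injectable clock."""

    def __init__(self, now_ms: Callable[[], int]) -> None:
        self._now_ms = now_ms
        self._data: Dict[str, Tuple[str, int]] = {}  # key -> (value, expires_at_ms)

    def _live(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        if entry[1] <= self._now_ms():
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return entry[0]

    def set_nx_px(self, key: str, value: str, ttl_ms: int) -> bool:
        """Mimics SET key value NX PX ttl_ms: succeeds only if the key is absent."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._now_ms() + ttl_ms)
        return True

    def compare_and_expire(self, key: str, expected: str, ttl_ms: int) -> bool:
        if self._live(key) != expected:
            return False
        self._data[key] = (expected, self._now_ms() + ttl_ms)
        return True

    def compare_and_delete(self, key: str, expected: str) -> bool:
        if self._live(key) != expected:
            return False
        del self._data[key]
        return True


class DistributedLock:
    """Advisory lock: whoever holds the key with their own token is the owner."""

    def __init__(self, redis: FakeRedis, name: str, ttl_ms: int,
                 owner: Optional[str] = None) -> None:
        self._redis = redis
        self._key = f"lock:{name}"              # hypothetical key layout
        self._ttl_ms = ttl_ms
        self._token = owner or uuid.uuid4().hex  # distinguishes this holder

    def acquire(self) -> bool:
        return self._redis.set_nx_px(self._key, self._token, self._ttl_ms)

    def renew(self) -> bool:
        # Extends the TTL only while we still own the lock.
        return self._redis.compare_and_expire(self._key, self._token, self._ttl_ms)

    def release(self) -> bool:
        # Never deletes a lock that has since passed to another owner.
        return self._redis.compare_and_delete(self._key, self._token)
```

The owner token is the important design choice: without it, a holder whose lock has already expired could release or renew a lock that now belongs to a different replica.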

Deployment topology

A typical production setup looks like this:

Clients (Flutter / browser)
        │  HTTPS / WSS
  Load balancer (nginx / Cloudflare / ALB)
     │                   │
tenant-server pod   tenant-server pod   … × N
     │                   │
     ├── Postgres (shared)
     ├── Redis (shared)
     ├── LiveKit Cloud
     └── MinIO / Supabase Storage

customization-agent worker(s)
  └── HTTP callbacks → TENANT_BASE_URL/internal/…
      (any pod can answer)

Deployment checklist

  1. Point TENANT_BASE_URL at the load balancer

    Set TENANT_BASE_URL to your load-balancer hostname (e.g. https://tenant.your-domain). Every callback the customization-agent worker makes (/internal/customization-agent/*) carries tenantId, organizationUserId, and a secret header. Any replica can respond; no in-memory state from the JWT-minting replica is required.

  2. Configure shared Postgres and Redis

    All replicas must read from the same Postgres instance and the same Redis cluster. The distributed primitives (locks, leader elections, semaphores, rate limits, pub/sub) only work correctly when all pods share a single Redis keyspace.

  3. Enable sticky sessions for WebSocket (optional but recommended)

    Sticky sessions are not required for voice: LiveKit Cloud connects clients directly to its own media plane after JWT minting. They are helpful for the Serverpod WebSocket, so that chat clients do not churn through reconnects when a pod restarts. Use ip_hash on nginx or a session-affinity policy on a managed load balancer.

  4. Set CC_INSTANCE_ID per pod (optional)

    Each pod derives a stable identity from CC_INSTANCE_ID → HOSTNAME → platform hostname → random fallback. Setting CC_INSTANCE_ID explicitly in your deployment manifest gives you consistent pod names in logs, distributed-lock owner tokens, and the /internal/instance-info debug route.
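
The identity fallback chain in step 4 can be sketched as follows. This is an illustrative Python version, not the server's code: the function name resolve_instance_id and the instance-xxxxxxxx format of the random fallback are assumptions. Only the order of the sources comes from the text above.

```python
import os
import socket
import uuid
from typing import Callable, Mapping, Optional


def resolve_instance_id(
    env: Optional[Mapping[str, str]] = None,
    platform_hostname: Callable[[], str] = socket.gethostname,
) -> str:
    """Return the first non-empty value in the documented fallback order."""
    env = os.environ if env is None else env

    # 1 and 2: explicit override first, then the HOSTNAME environment variable.
    for name in ("CC_INSTANCE_ID", "HOSTNAME"):
        value = env.get(name, "").strip()
        if value:
            return value

    # 3: hostname as reported by the operating system.
    try:
        host = platform_hostname().strip()
    except OSError:
        host = ""
    if host:
        return host

    # 4: random fallback (format is hypothetical); it changes on every start,
    # which is why setting CC_INSTANCE_ID explicitly gives more readable logs.
    return f"instance-{uuid.uuid4().hex[:8]}"
```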

Voice and LiveKit Cloud

ℹ️
No sticky sessions needed for voice

LiveKit Cloud connects clients directly to its own WebRTC media plane after the tenant server issues a JWT. The tenant server is not in the media path after token issuance, so session affinity does not affect voice quality.

The customization-agent worker registers with LiveKit Cloud using the agent_name=customization identifier. It dispatches callbacks to TENANT_BASE_URL (your load balancer), and any replica can handle them. The worker itself can also run as multiple instances; each instance registers with LiveKit independently.

Per-org rate limits

Organization admins can adjust the per-org callback rate cap in Settings → Rate Limits. The new value is written to Postgres and pushed to Redis within ~30 seconds, so all replicas pick it up without a restart. The default is 240 requests per minute.
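
A minimal sketch of a fixed-window limiter keyed by tenant and time bucket shows why a shared counter enforces one cap across all replicas. It is Python against an in-memory stand-in; the key layout (ratelimit:<tenant>:<bucket> and ratelimit:override:<tenant>), the class shape, and the way the override is read are assumptions for illustration, not the DistributedRateLimiter API.

```python
import time
from typing import Callable, Dict, Optional


class FakeRedis:
    """Hypothetical in-memory stand-in for INCRBY, GET, and SET on integers."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def incrby(self, key: str, amount: int) -> int:
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def set(self, key: str, value: int) -> None:
        self._data[key] = value


class DistributedRateLimiter:
    """Fixed-window limiter: one shared counter per tenant per time bucket."""

    def __init__(self, redis: FakeRedis, default_limit: int = 240,
                 window_seconds: int = 60,
                 now: Callable[[], float] = time.time) -> None:
        self._redis = redis
        self._default_limit = default_limit  # 240 per minute, per the text above
        self._window = window_seconds
        self._now = now

    def _limit_for(self, tenant_id: str) -> int:
        # Assumed layout: a per-org override stored next to the counters.
        override = self._redis.get(f"ratelimit:override:{tenant_id}")
        return self._default_limit if override is None else override

    def allow(self, tenant_id: str, cost: int = 1) -> bool:
        bucket = int(self._now() // self._window)
        key = f"ratelimit:{tenant_id}:{bucket}"
        # Every replica increments the same key, so the count is cluster-wide.
        # On a real server the bucket key would also need an expiry so that
        # old buckets do not accumulate.
        count = self._redis.incrby(key, cost)
        return count <= self._limit_for(tenant_id)
```

Two limiter objects sharing one store behave like two replicas: requests counted by one reduce the budget seen by the other, which is the property an in-memory counter cannot provide.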

⚠️
Redis is required for multi-instance mode

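The distributed primitives (locks, leader elections, rate limits, pub/sub) all require Redis. If Redis is not available, the server falls back to in-process behavior, which is safe for single-replica deployments but, under load balancing, will exhibit the problems listed in "How coordination works across replicas": duplicate sweeps, rate limits multiplied by the replica count, stale directive caches, and pub/sub events that reach only one replica.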