Voice providers

The voice assistant turns spoken input into text, runs it through the same chat engine your text conversations use, and speaks the answer back. That needs a speech-to-text and a text-to-speech provider. Your Office AI is provider-agnostic here — you configure Deepgram, a Google Cloud service account, or both, and there is no hardcoded default.

ℹ️
No default brain

There is no built-in voice provider. Nothing speaks until an admin configures at least one bridge. When more than one is configured, the worker picks among them by a cost-ordered fallback, not a hardcoded preference.

How the voice bridge works

The voice worker joins a LiveKit room, transcribes the user's audio, sends that text to the very same chat engine the dock uses — same system prompt, tools, and history — and speaks the reply. Because voice and text write to one chat session, you can switch between them mid-conversation.

🎙️ User audio (over LiveKit)
📝 STT (Deepgram / Google)
🧠 Chat engine (same brain as text)
🔊 TTS (Deepgram / Google)
🗣️ Spoken reply (back to the user)
One chat engine, two input modes — voice and text share the same session.
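
The flow above can be sketched in a few lines. This is a minimal illustration, not the worker's actual code: the SpeechToText, TextToSpeech, and ChatEngine interfaces, the ChatSession class, and handle_voice_turn are all hypothetical names, and the Fake* classes stand in for real providers. It shows why voice and text can interleave: both write to one session history.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ChatSession:
    """One history shared by voice and text turns (hypothetical shape)."""
    history: list[tuple[str, str]] = field(default_factory=list)


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class ChatEngine(Protocol):
    def reply(self, session: ChatSession, user_text: str) -> str: ...


def handle_voice_turn(audio: bytes, stt: SpeechToText, engine: ChatEngine,
                      tts: TextToSpeech, session: ChatSession) -> bytes:
    """Run one spoken turn: audio in, audio out, history updated."""
    user_text = stt.transcribe(audio)            # 1. speech to text
    session.history.append(("user", user_text))
    answer = engine.reply(session, user_text)    # 2. same engine as text chat
    session.history.append(("assistant", answer))
    return tts.synthesize(answer)                # 3. text to speech


# Stand-in providers so the sketch runs without any credentials.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeEngine:
    def reply(self, session: ChatSession, user_text: str) -> str:
        return f"You said: {user_text}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")
```

A typed message appended to the same ChatSession before or after a call to handle_voice_turn ends up in the same history, which is the property that lets you switch modes mid-conversation.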

The two providers

Provider | STT | TTS | Credential
Deepgram | Nova-3 | Aura-2 | One API key
Google Cloud | Cloud Speech-to-Text v2 | Cloud Text-to-Speech | Service-account JSON key

Option A — Deepgram

Deepgram is the simplest path: a single API key covers both Nova-3 STT and Aura-2 TTS.

  1. Create a Deepgram API key

    Sign in at the Deepgram console and create an API key. Deepgram provides both speech-to-text (Nova-3) and text-to-speech (Aura-2) under the one key, so a single credential covers the whole pipeline.

  2. Add it in Your Office AI

    Open Settings → Organizations → Org Settings → Voice & AI and paste the Deepgram key into the voice provider card. The key is stored encrypted and redacted on subsequent reads.

  3. Or set it on the voice worker

    For a self-hosted deployment you can instead supply the key to the voice bridge worker via its environment, alongside the LiveKit connection details it uses to join rooms.
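
For the self-hosted path, a worker would typically read its credentials at startup and fail fast if any are absent. The sketch below assumes the environment variable names DEEPGRAM_API_KEY, LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET; these follow common SDK conventions but are not confirmed for this worker, so check your deployment's own configuration reference. WorkerConfig and load_worker_config are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed variable names; your deployment may use different ones.
REQUIRED_VARS = (
    "DEEPGRAM_API_KEY",
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
)


@dataclass(frozen=True)
class WorkerConfig:
    deepgram_api_key: str
    livekit_url: str
    livekit_api_key: str
    livekit_api_secret: str


def load_worker_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    """Read the worker's settings, naming every variable that is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return WorkerConfig(
        deepgram_api_key=env["DEEPGRAM_API_KEY"],
        livekit_url=env["LIVEKIT_URL"],
        livekit_api_key=env["LIVEKIT_API_KEY"],
        livekit_api_secret=env["LIVEKIT_API_SECRET"],
    )
```

Reporting all missing variables in one error, rather than stopping at the first, saves a restart cycle per variable when you are wiring up a new deployment.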

Option B — Google Cloud service account

Google's Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a service-account key, not a plain Gemini API key — the Gemini key only authenticates the Generative Language API.

  1. Create a GCP service account

    In the Google Cloud console, in the project that owns your Google billing, create a service account (for example cc-voice-bridge). The voice worker authenticates to Cloud Speech-to-Text v2 and Cloud Text-to-Speech as this account, through Application Default Credentials backed by its JSON key.

  2. Grant two least-privilege roles

    Assign Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). These let the account call STT and TTS and nothing else.

  3. Enable the two APIs

    Enable speech.googleapis.com and texttospeech.googleapis.com in APIs & Services → Library if they are not already on.

  4. Create and upload the JSON key

    Generate a JSON key for the service account. Upload it in Settings → Organizations → Org Settings → Voice & AI; the card validates the JSON shape and redacts the private key on later reads. For a self-hosted deploy, you can instead mount the file on the voice worker and set GOOGLE_APPLICATION_CREDENTIALS to its path.
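
If you want to check a key file before uploading it, the shape validation and redaction described in step 4 might look roughly like this. The field names are the ones Google writes into a service-account JSON key; the exact checks the settings card performs are not documented here, so treat validate_service_account_key and redact_service_account_key as hypothetical helpers, not the product's implementation.

```python
import json

# Fields present in a Google service-account JSON key.
REQUIRED_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
)


def validate_service_account_key(raw: str) -> dict:
    """Parse the uploaded text and check it looks like a service-account key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not data.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if data["type"] != "service_account":
        raise ValueError(f"unexpected key type: {data['type']!r}")
    return data


def redact_service_account_key(data: dict) -> dict:
    """Return a copy that is safe to show back to an admin."""
    redacted = dict(data)
    redacted["private_key"] = "[REDACTED]"
    return redacted
```

A plain API key pasted into the same field fails at the JSON parsing step, which catches the most common mix-up early.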

⚠️
Service account, not API key

Cloud Speech v2 and Cloud TTS reject a plain Gemini API key. They need Application Default Credentials, which in this setup means a downloaded service-account JSON key. Treat that JSON like a password: anyone holding it can run STT and TTS on your project's bill.

How providers map to voice modes

The voice mode (under Org Settings → Voice & AI) decides which bridge the worker uses. The key point: the mode chooses among the providers you configured — it never introduces a provider of its own.

Mode | Behaviour
Auto (recommended) | Picks the cheapest configured bridge that still gives full chat-engine parity, in fallback order: Deepgram → Google → a Gemini Live bridge as a last resort. It is a cost-ordered fallback, not a default provider.
Unified | Forces the STT → chat-engine → TTS pipeline. Choose this if you want a loud failure when no bridge provider is configured rather than a silent fallback.
Realtime bridge | Pins a Gemini Live realtime model as the audio transport with the chat-engine brain attached. Useful when you specifically want the realtime model in the loop.

💡
Auto degrades gracefully

In Auto mode the worker never crashes if a provider credential is missing, unreadable, or unauthorized — it logs the failure and falls through to the next bridge in order. A stale Google key degrades to "voice still works" rather than "voice is broken".
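
The Auto-mode behaviour can be pictured as a loop over the fallback order. The sketch below is an assumption-laden illustration: pick_bridge, FALLBACK_ORDER, and the factory callables are hypothetical, and the real worker's internals may differ. It only demonstrates the two documented properties: unconfigured providers are skipped, and a provider that fails to initialise is logged and passed over instead of crashing the worker.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("voice.bridge")

# Order taken from the mode table: cheapest full-parity bridge first.
FALLBACK_ORDER = ("deepgram", "google", "gemini_live")


def pick_bridge(
    factories: dict[str, Callable[[], object]],
) -> Optional[tuple[str, object]]:
    """Return (name, bridge) for the first provider that initialises.

    `factories` holds one zero-argument constructor per *configured*
    provider; a provider with no entry is simply not configured.
    """
    for name in FALLBACK_ORDER:
        factory = factories.get(name)
        if factory is None:
            continue  # not configured, so never considered
        try:
            return name, factory()
        except Exception as exc:  # missing, unreadable, or unauthorized key
            logger.warning("bridge %s unavailable, trying next: %s", name, exc)
    return None  # nothing configured or nothing usable
```

For example, with a revoked Deepgram key and a working Google key, the Deepgram factory raises, a warning is logged, and the function returns the Google bridge. Unified mode, by contrast, would surface that failure instead of falling through.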

Cost notes

Streaming Cloud Speech-to-Text v2 is roughly $0.024 per minute of audio, and Cloud TTS Neural2 voices are about $16 per million characters. You pay only for the audio you actually transcribe or speak; an idle session costs nothing. Deepgram is usage-billed in the same way: per minute of audio for STT and per character for TTS. As with every provider, admins set per-organisation spend caps, and a cap of 0 disables voice rather than removing the limit.
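
To make those rates concrete, here is a small worked estimate using the approximate Google figures quoted above. The helper names are hypothetical, and voice_allowed encodes only the one documented rule (a cap of 0 disables voice); treating a reached cap as "blocked" is an assumption about how the limit is enforced.

```python
# Approximate rates quoted in this section; check current pricing before budgeting.
STT_USD_PER_MINUTE = 0.024
TTS_USD_PER_MILLION_CHARS = 16.0


def estimate_google_cost(audio_minutes: float, spoken_chars: int) -> float:
    """Usage-based cost: transcribed minutes plus synthesized characters."""
    stt_cost = audio_minutes * STT_USD_PER_MINUTE
    tts_cost = spoken_chars / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return stt_cost + tts_cost


def voice_allowed(spent_usd: float, cap_usd: float) -> bool:
    """A cap of 0 turns voice off; it does not mean 'unlimited'."""
    if cap_usd <= 0:
        return False
    return spent_usd < cap_usd  # assumed: voice stops once the cap is reached
```

With these rates, 100 minutes of transcribed audio and 50,000 spoken characters come to about 100 × 0.024 + 0.05 × 16 = $3.20.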

ℹ️
Next

See the Unified Voice + Chat guide for how voice behaves in the app, then finish the setup chain with Transactional email.