The voice assistant turns spoken input into text, runs it through the same chat engine your text conversations use, and speaks the answer back. That requires a speech-to-text (STT) provider and a text-to-speech (TTS) provider. Your Office AI is provider-agnostic here: you configure Deepgram, a Google Cloud service account, or both, and there is no hardcoded default.
There is no built-in voice provider. Nothing speaks until an admin configures at least one bridge. When more than one is configured, the worker picks among them by a cost-ordered fallback, not a hardcoded preference.
The voice worker joins a LiveKit room, transcribes the user's audio, sends that text to the very same chat engine the dock uses — same system prompt, tools, and history — and speaks the reply. Because voice and text write to one chat session, you can switch between them mid-conversation.
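The shared-session idea can be sketched in a few lines of Python. This is an illustration only, not the worker's actual code: `ChatSession`, `handle_turn`, `handle_voice_turn`, and the stand-in `fake_stt` / `fake_tts` / `echo_engine` functions are hypothetical names, and the LiveKit transport is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChatSession:
    """One history shared by typed and spoken turns (hypothetical type)."""
    history: List[Tuple[str, str]] = field(default_factory=list)


def handle_turn(session: ChatSession, user_text: str,
                chat_engine: Callable[[ChatSession, str], str]) -> str:
    """Run one turn through the chat engine and record it in the session."""
    reply = chat_engine(session, user_text)
    session.history.append(("user", user_text))
    session.history.append(("assistant", reply))
    return reply


def handle_voice_turn(session: ChatSession, audio: bytes,
                      stt: Callable[[bytes], str],
                      tts: Callable[[str], bytes],
                      chat_engine: Callable[[ChatSession, str], str]) -> bytes:
    """A voice turn is a text turn with STT in front and TTS behind."""
    text = stt(audio)
    reply = handle_turn(session, text, chat_engine)
    return tts(reply)


# Stand-in providers so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def echo_engine(session: ChatSession, text: str) -> str:
    turn_number = len(session.history) // 2 + 1
    return f"turn {turn_number}: {text}"
```

Calling `handle_turn` for a typed message and then `handle_voice_turn` for a spoken one on the same `ChatSession` appends both exchanges to one history, which is what makes switching mid-conversation possible.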
| Provider | STT | TTS | Credential |
|---|---|---|---|
| Deepgram | Nova-3 | Aura-2 | One API key |
| Google Cloud | Cloud Speech-to-Text v2 | Cloud Text-to-Speech | Service-account JSON key |
Deepgram is the simplest path: a single API key covers both Nova-3 STT and Aura-2 TTS.
Sign in at the Deepgram console and create an API key. The same key is used for both Nova-3 speech-to-text and Aura-2 text-to-speech, so no second credential is needed.
Open Settings → Organizations → Org Settings → Voice & AI and paste the Deepgram key into the voice provider card. The key is stored encrypted and redacted on subsequent reads.
For a self-hosted deployment you can instead supply the key to the voice bridge worker via its environment, alongside the LiveKit connection details it uses to join rooms.
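As a rough sketch of the self-hosted route, a worker could read its settings at startup like this. The variable names `DEEPGRAM_API_KEY`, `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` follow common convention but are assumptions here, as are `WorkerEnv` and `load_worker_env`; check your deployment's own configuration reference for the names it actually reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed variable names, not confirmed by this guide.
REQUIRED_LIVEKIT_VARS = ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")


@dataclass(frozen=True)
class WorkerEnv:
    livekit_url: str
    livekit_api_key: str
    livekit_api_secret: str
    # Optional because Google may be the only bridge configured.
    deepgram_api_key: Optional[str]


def load_worker_env(env: Mapping[str, str] = os.environ) -> WorkerEnv:
    """Read worker settings, failing early if LiveKit details are absent."""
    missing = [name for name in REQUIRED_LIVEKIT_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing LiveKit settings: " + ", ".join(missing))
    return WorkerEnv(
        livekit_url=env["LIVEKIT_URL"],
        livekit_api_key=env["LIVEKIT_API_KEY"],
        livekit_api_secret=env["LIVEKIT_API_SECRET"],
        deepgram_api_key=env.get("DEEPGRAM_API_KEY") or None,
    )
```

Treating the Deepgram key as optional mirrors the rule that no provider is a built-in default: its absence is not an error in itself, it just means that bridge is not configured.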
Google's Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a service-account key, not a plain Gemini API key — the Gemini key only authenticates the Generative Language API.
In the Google Cloud console, in the project that owns your Google billing, create a service account (for example cc-voice-bridge). Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate through Application Default Credentials, supplied here as a service-account key, not with a plain API key.
Assign Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). These let the account call STT and TTS and nothing else.
Enable speech.googleapis.com and texttospeech.googleapis.com in APIs & Services → Library if they are not already on.
Generate a JSON key for the service account. Upload it in Settings → Organizations → Org Settings → Voice & AI (the card validates the JSON shape and redacts the private key on later reads), or mount it on the voice worker via GOOGLE_APPLICATION_CREDENTIALS for a self-hosted deployment.
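To make "validates the JSON shape and redacts the private key" concrete, here is a minimal sketch of that kind of check. It is not the settings card's actual implementation: `validate_service_account_json` and `redact` are hypothetical helpers, and the set of fields checked is an assumption based on the standard layout of a downloaded service-account key.

```python
import json

# Fields present in a standard service-account key file. Checking exactly
# these is an assumption of this sketch, not the product's documented rule.
REQUIRED_FIELDS = ("type", "project_id", "private_key_id",
                   "private_key", "client_email")


def validate_service_account_json(raw: str) -> dict:
    """Parse an uploaded key and reject anything that is not a service-account key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not data.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if data["type"] != "service_account":
        raise ValueError(f"expected type 'service_account', got {data['type']!r}")
    if "BEGIN PRIVATE KEY" not in data["private_key"]:
        raise ValueError("private_key does not look like a PEM block")
    return data


def redact(data: dict) -> dict:
    """Return a copy that is safe to show in a settings screen."""
    safe = dict(data)
    safe["private_key"] = "[redacted]"
    return safe
```

A check like this would also catch the common mistake of pasting a Gemini API key into the card, since a bare key string is not a JSON object with these fields.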
Cloud Speech v2 and Cloud TTS reject a plain Gemini API key. They need Application Default Credentials, which in this setup means a downloaded service-account JSON key. Treat that JSON like a password; anyone holding it can run STT and TTS on your project's bill.
The voice mode (under Org Settings → Voice & AI) decides which bridge the worker uses. The key point: the mode chooses among the providers you configured — it never introduces a provider of its own.
| Mode | Behaviour |
|---|---|
| Auto (recommended) | Picks the cheapest configured bridge that still gives full chat-engine parity, in fallback order: Deepgram → Google → a Gemini Live bridge as a last resort. It is a cost-ordered fallback, not a default provider. |
| Unified | Forces the STT → chat-engine → TTS pipeline. Choose this if you want a loud failure when no bridge provider is configured rather than a silent fallback. |
| Realtime bridge | Pins a Gemini Live realtime model as the audio transport with the chat-engine brain attached. Useful when you specifically want the realtime model in the loop. |
In Auto mode the worker never crashes if a provider credential is missing, unreadable, or unauthorized — it logs the failure and falls through to the next bridge in order. A stale Google key degrades to "voice still works" rather than "voice is broken".
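The Auto behaviour described above amounts to walking a fixed order and skipping anything that is unconfigured or fails its credential check. A minimal sketch, assuming hypothetical bridge identifiers and a hypothetical `pick_bridge` helper (the real worker's names and checks are not shown in this guide):

```python
from typing import Callable, Dict, List, Optional

# Fallback order taken from the Auto row of the table above.
# The identifier strings themselves are illustrative.
FALLBACK_ORDER = ["deepgram", "google", "gemini_live"]


def pick_bridge(checks: Dict[str, Callable[[], bool]],
                log: List[str]) -> Optional[str]:
    """Return the first bridge in order whose credential check passes.

    `checks` maps a configured bridge to a function that returns True when
    its credential is usable. A bridge absent from `checks` is treated as
    not configured. Failures are logged and skipped, never raised.
    """
    for name in FALLBACK_ORDER:
        check = checks.get(name)
        if check is None:
            continue  # not configured: the mode never adds a provider itself
        try:
            if check():
                return name
            log.append(f"{name}: credential rejected")
        except Exception as exc:  # unreadable key, network error, and so on
            log.append(f"{name}: {exc}")
    return None  # nothing usable: voice stays off


def stale_google_key() -> bool:
    raise PermissionError("service-account key is unauthorized")
```

With `checks = {"google": stale_google_key, "gemini_live": lambda: True}`, the function logs the Google failure and returns `"gemini_live"`, which is the "voice still works" outcome for a stale key.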
Streaming Cloud Speech-to-Text v2 costs roughly $0.024 per minute of audio, and Cloud TTS Neural2 voices cost about $16 per million characters. You pay only for the audio you actually transcribe or speak; there is no idle session cost. Deepgram is billed on a similar basis, per minute of transcribed audio and per character of synthesized speech. As with every provider, admins set per-organisation spend caps, and a cap of 0 disables voice rather than removing the limit.
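For budgeting, the arithmetic is simple enough to sketch. The rates below are the approximate figures quoted above, not live pricing, and `estimate_google_cost` and `voice_allowed` are hypothetical helpers that only illustrate the rule that a cap of 0 means "off", not "unlimited".

```python
# Approximate rates quoted in this guide; confirm against current pricing.
STT_USD_PER_MINUTE = 0.024
TTS_USD_PER_MILLION_CHARS = 16.0


def estimate_google_cost(audio_minutes: float, spoken_chars: int) -> float:
    """Estimated USD for transcribing `audio_minutes` and speaking `spoken_chars`."""
    stt_cost = audio_minutes * STT_USD_PER_MINUTE
    tts_cost = spoken_chars / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return stt_cost + tts_cost


def voice_allowed(spend_cap_usd: float, spent_usd: float) -> bool:
    """A cap of 0 disables voice; otherwise allow until the cap is reached."""
    if spend_cap_usd <= 0:
        return False
    return spent_usd < spend_cap_usd
```

As a worked example, 100 minutes of transcribed audio plus 50,000 spoken characters comes to about 100 × $0.024 + 0.05 × $16 = $2.40 + $0.80 = $3.20.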
See the Unified Voice + Chat guide for how voice behaves in the app, then finish the setup chain with Transactional email.