Voice providers

The voice assistant turns spoken input into text, runs it through the same chat engine your text conversations use, and speaks the answer back. That requires a speech-to-text (STT) provider and a text-to-speech (TTS) provider. Your Office AI is provider-agnostic here: you configure Deepgram, a Google Cloud service account, or both, and there is no hardcoded default.

ℹ️

No default brain

There is no built-in voice provider. Nothing speaks until an admin configures at least one bridge. When more than one is configured, the worker picks among them by a cost-ordered fallback, not a hardcoded preference.

How the voice bridge works

The voice worker joins a LiveKit room, transcribes the user's audio, sends that text to the very same chat engine the dock uses — same system prompt, tools, and history — and speaks the reply. Because voice and text write to one chat session, you can switch between them mid-conversation.

🎙️ User audio: over LiveKit

📝 STT: Deepgram / Google

🧠 Chat engine: same brain as text

🔊 TTS: Deepgram / Google

🗣️ Spoken reply: back to the user

One chat engine, two input modes — voice and text share the same session.
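
The flow above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: `ChatSession`, `voice_turn`, and the `fake_*` stand-ins are hypothetical names, and the real worker streams audio over LiveKit rather than passing byte strings around.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# All names here are hypothetical; they only mirror the flow described above.
Turn = Tuple[str, str, str]  # (role, input mode, text)


@dataclass
class ChatSession:
    """One conversation history shared by voice and text turns."""

    history: List[Turn] = field(default_factory=list)

    def ask(self, text: str, mode: str, engine: Callable[[str], str]) -> str:
        """Record the user turn, run the shared engine, record the reply."""
        self.history.append(("user", mode, text))
        reply = engine(text)
        self.history.append(("assistant", mode, reply))
        return reply


def voice_turn(
    audio: bytes,
    session: ChatSession,
    stt: Callable[[bytes], str],
    engine: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """STT -> chat engine -> TTS, writing to the same session as text."""
    text = stt(audio)
    reply = session.ask(text, "voice", engine)
    return tts(reply)


# Stand-in providers so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_engine(text: str) -> str:
    return f"echo: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


session = ChatSession()
spoken = voice_turn(b"what is on my calendar", session, fake_stt, fake_engine, fake_tts)
typed = session.ask("and tomorrow?", "text", fake_engine)  # same session, text mode
```

Because both calls append to the same `history`, the typed follow-up sees the spoken turn before it, which is what makes switching modes mid-conversation possible.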

The two providers

Provider	STT	TTS	Credential
Deepgram	Nova-3	Aura-2	One API key
Google Cloud	Cloud Speech-to-Text v2	Cloud Text-to-Speech	Service-account JSON key

Option A — Deepgram

Deepgram is the simplest path: a single API key covers both Nova-3 STT and Aura-2 TTS.

Create a Deepgram API key
Sign in at the Deepgram console and create an API key. The same key authorizes both Nova-3 speech-to-text and Aura-2 text-to-speech, so one credential covers the whole pipeline.
Add it in Your Office AI
Open Settings → Organizations → Org Settings → Voice & AI and paste the Deepgram key into the voice provider card. The key is stored encrypted and is redacted on subsequent reads.
Or set it on the voice worker
For a self-hosted deployment you can instead supply the key to the voice bridge worker via its environment, alongside the LiveKit connection details it uses to join rooms.
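
As a rough sketch of the self-hosted path, a worker could read its settings from the environment and fail early when one is missing. The variable names `DEEPGRAM_API_KEY`, `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` are assumptions based on common convention; check your deployment's own configuration reference for the names the voice worker actually reads. `WorkerConfig` and `load_worker_config` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed variable names; the real worker may use different ones.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY",
)


@dataclass(frozen=True)
class WorkerConfig:
    livekit_url: str
    livekit_api_key: str
    livekit_api_secret: str
    deepgram_api_key: str


def load_worker_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build the config, naming every missing variable in one error."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return WorkerConfig(
        livekit_url=env["LIVEKIT_URL"],
        livekit_api_key=env["LIVEKIT_API_KEY"],
        livekit_api_secret=env["LIVEKIT_API_SECRET"],
        deepgram_api_key=env["DEEPGRAM_API_KEY"],
    )
```

Reporting all missing variables at once saves a restart per typo when bringing up a new deployment.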

Option B — Google Cloud service account

Google's Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a service-account key, not a plain Gemini API key — the Gemini key only authenticates the Generative Language API.

Create a GCP service account
In the Google Cloud console, in the project that carries your Google Cloud billing, create a service account (for example cc-voice-bridge). Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate through Application Default Credentials, supplied here as a service-account key, not a plain API key.
Grant two least-privilege roles
Assign Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). These let the account call STT and TTS and nothing else.
Enable the two APIs
Enable speech.googleapis.com and texttospeech.googleapis.com in APIs & Services → Library if they are not already on.
Create and upload the JSON key
Generate a JSON key for the service account. Upload it in Settings → Org Settings → Voice & AI (the card validates the JSON shape and redacts the private key on later reads), or mount it on the voice worker via GOOGLE_APPLICATION_CREDENTIALS for a self-hosted deploy.
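
To illustrate what "validates the JSON shape and redacts the private key" could mean, here is a minimal sketch. The field names are the ones Google puts in a downloaded service-account key file; the exact checks the settings card performs are not documented here, so `validate_service_account_json` and `redact_service_account` are hypothetical stand-ins.

```python
import json
from typing import Any, Dict

# Fields present in a downloaded Google service-account key file.
REQUIRED_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
)


def validate_service_account_json(raw: str) -> Dict[str, Any]:
    """Parse the upload and reject anything that is not a service-account key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not data.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if data["type"] != "service_account":
        raise ValueError("type must be 'service_account'")
    return data


def redact_service_account(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that is safe to show again, with the private key hidden."""
    redacted = dict(data)
    redacted["private_key"] = "[REDACTED]"
    return redacted
```

Redacting a copy rather than the stored value keeps the original usable for authentication while ensuring later reads never expose the key.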

⚠️

Service account, not API key

Cloud Speech v2 and Cloud TTS reject a plain Gemini API key. They need Application Default Credentials — i.e. a downloaded service-account JSON. Treat that JSON like a password; anyone holding it can run STT and TTS on your project's bill.

How providers map to voice modes

The voice mode (under Org Settings → Voice & AI) decides which bridge the worker uses. The key point: the mode chooses among the providers you configured — it never introduces a provider of its own.

Mode	Behaviour
Auto (recommended)	Picks the cheapest configured bridge that still gives full chat-engine parity, in fallback order: Deepgram → Google → a Gemini Live bridge as a last resort. It is a cost-ordered fallback, not a default provider.
Unified	Forces the STT → chat-engine → TTS pipeline. Choose this if you want a loud failure when no bridge provider is configured rather than a silent fallback.

💡

Auto degrades gracefully

In Auto mode the worker never crashes if a provider credential is missing, unreadable, or unauthorized — it logs the failure and falls through to the next bridge in order. A stale Google key degrades to "voice still works" rather than "voice is broken".
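
The two modes can be sketched as one selection function. This is an illustration under stated assumptions, not the worker's actual code: the bridge names, `pick_bridge`, and the idea of a factory that raises on a bad credential are all hypothetical, as is returning `None` when Auto finds nothing usable and letting Unified try each pipeline provider before failing.

```python
import logging
from typing import Callable, Dict, Optional, Tuple

logger = logging.getLogger("voice_worker")

# Fallback order from the mode table; the names are illustrative.
PIPELINE_BRIDGES = ("deepgram", "google")
AUTO_ORDER = PIPELINE_BRIDGES + ("gemini_live",)

BridgeFactory = Callable[[], object]


def pick_bridge(
    mode: str, factories: Dict[str, BridgeFactory]
) -> Optional[Tuple[str, object]]:
    """Return (name, bridge) for the first configured bridge that starts.

    `factories` holds only the providers an admin configured; a factory
    raises if its credential is missing, unreadable, or unauthorized.
    """
    if mode not in ("auto", "unified"):
        raise ValueError(f"unknown voice mode: {mode}")
    order = AUTO_ORDER if mode == "auto" else PIPELINE_BRIDGES
    for name in order:
        factory = factories.get(name)
        if factory is None:
            continue  # not configured, so never considered
        try:
            return name, factory()
        except Exception as exc:
            # Log and fall through to the next bridge in order.
            logger.warning("bridge %s unavailable: %s", name, exc)
    if mode == "unified":
        # Unified fails loudly instead of falling back silently.
        raise RuntimeError("no STT/TTS bridge provider is usable")
    return None  # assumption: the caller treats this as "voice unavailable"
```

For example, with a stale Deepgram key and a working Google credential, `pick_bridge("auto", ...)` logs the Deepgram failure and returns the Google bridge.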

Cost notes

Streaming Cloud Speech-to-Text v2 costs roughly $0.024 per minute of audio, and Cloud TTS Neural2 voices cost about $16 per million characters. You pay only for the audio you actually transcribe or speak; there is no idle session cost. Deepgram is billed similarly, per minute of audio for STT and per character for TTS. As with every provider, admins set per-organisation spend caps, and a cap of 0 disables voice rather than removing the limit.
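
A quick worked example using the approximate Google rates quoted above: 100 minutes of transcribed audio plus 50,000 spoken characters comes to about $2.40 + $0.80 = $3.20. The sketch below does the same arithmetic and shows the cap rule; the function names are hypothetical, the rates are the rough figures from this page rather than a live price list, and treating the cap as "spend must stay below it" is an assumption.

```python
# Approximate rates quoted on this page; check current pricing before relying on them.
STT_USD_PER_MINUTE = 0.024
TTS_USD_PER_MILLION_CHARS = 16.0


def estimate_google_cost(audio_minutes: float, tts_characters: int) -> float:
    """Estimated spend in USD, rounded to cents. No idle cost is added."""
    stt = audio_minutes * STT_USD_PER_MINUTE
    tts = tts_characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return round(stt + tts, 2)


def voice_allowed(spent_usd: float, cap_usd: float) -> bool:
    """A cap of 0 disables voice; it does not mean 'unlimited'."""
    if cap_usd == 0:
        return False
    return spent_usd < cap_usd


monthly = estimate_google_cost(audio_minutes=100, tts_characters=50_000)
```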

ℹ️

Next

See the Unified Voice + Chat guide for how voice behaves in the app, then finish the setup chain with Transactional email.
