🎙️Admin & Settings

Unified voice + text chat

In Your Office AI, voice always rides on the same chat engine as text, so speaking has the same tools, context, and memory as typing. This page covers how the voice pipeline is chosen and how to provision the credentials it needs.

One brain for voice and text

Your Office AI treats voice as "unified": every voice path routes through the chat engine, so a spoken request can use #-attachments, @-mentions, web search, files, and memory exactly like a typed one. Voice and text share one brain, so speaking always gives you the platform's full feature set — that full-parity guarantee is built into the platform.

The unified pipeline looks like this:

user audio → speech-to-text → chat engine (your normal LLM + tools + memory) → text-to-speech → user audio

🗣️You speakVoice or avatar mode

🎙️Speech-to-textDeepgram / Google

🧠Chat engineYour LLM + tools + memory

🔊Text-to-speechSpoken reply

Voice rides the same chat engine as text — full parity for tools, #-context, and memory.

Because voice and text write to the same chat session, switching mid-conversation preserves context automatically.

The two pipelines

Under the hood, two pipelines can satisfy the parity rule. Both run the chat engine as the brain; they differ in how audio gets in and out.

Pipeline	How it works	Providers
Bridge	Speech-to-text → chat engine → text-to-speech.	Deepgram (Nova-3 STT + Aura-2 TTS) or Google (Cloud Speech v2 + Cloud TTS).
Realtime bridge	Gemini Live as the audio transport, with the chat engine wired in as the thinking layer. Used as a last resort.	Gemini Live (a realtime API key).

Modes

The mode you choose in Org Settings tells the worker how to pick a pipeline.

Mode	Behaviour
Auto (recommended)	Picks the most reliable pipeline that still gives full parity, trying Deepgram, then Google, and only falling back automatically to a Gemini Live realtime bridge as a last resort. If no provider is configured it surfaces a clear spoken message so you always know voice is routing through the full chat engine.
Unified only	Pins the bridge (Deepgram or Google) as the only transport, so an admin who opts in gets a clear, explicit message if no bridge provider is configured rather than a quiet fallback.

💡

Which mode should I pick?

Auto suits almost everyone — it always keeps full chat-engine parity and automatically falls back to the Gemini Live realtime bridge only if no bridge provider is configured. Choose Unified only when you want the bridge to be the sole transport, with a loud failure instead of a silent fallback. Pinning the realtime bridge as a standalone mode is no longer offered — its audio path proved unreliable, so today it only ever runs as an automatic fallback inside Auto.

Why a service account (for the Google bridge)

Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a Google Cloud service-account key. They reject a plain Gemini API key (which only authenticates the Generative Language API). If you'd rather not use Google for the bridge, configure a Deepgram key instead — the bridge accepts either provider.

Provisioning a Google key — about 3 minutes in GCP

💡

You only do this once

The same JSON key works for every voice session in the organization. Rotate it on whatever cadence your security policy demands — uploading a new key in the UI replaces the old one atomically.

Open the GCP IAM Service Accounts page
In the Google Cloud console, switch to the project that owns your Google billing and open IAM & Admin → Service Accounts. Click Create service account. Name it something memorable like cc-voice-bridge — the name is only used for your own audit log.
Grant two roles
Assign the service account these two predefined roles: Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). Both are least-privilege — the account can only call STT and TTS, nothing else in your project.
Enable the two APIs
In APIs & Services → Library, enable speech.googleapis.com (Cloud Speech-to-Text API) and texttospeech.googleapis.com (Cloud Text-to-Speech API). If they are already enabled the pages will say so — no action needed.
Create and download the JSON key
On the service account detail page, open the Keys tab and click Add key → Create new key → JSON. A file like cc-voice-bridge-<random>.json is downloaded to your machine. Treat it like a password: anyone who has it can call STT and TTS on your project bill.
Upload it in Your Office AI
Open Settings → Organizations → Org Settings → Voice & AI. In the Cloud voice card, click Upload service-account JSON and pick the file you just downloaded — or paste its contents. The card validates the JSON shape client-side; the private key is redacted on subsequent reads.
Pick a mode
Auto is recommended: the worker chooses the most reliable pipeline that still gives full chat-engine parity, preferring Deepgram, then Google, and falling back to a Gemini Live bridge automatically only as a last resort. Choose Unified only if you want a loud failure when no bridge provider is configured.

Voice with an avatar

Voice can be paired with an animated avatar tile — Simli and Tavus are both supported avatar providers — for an avatar input mode in chat. The voice catalog also recognizes a broad range of STT, TTS, and avatar providers that admins can configure.

Verifying the pipeline is active

After uploading credentials and starting a new voice session, the status row at the top of the card reflects the active pipeline and the derived Google project id and service-account email (when using the Google bridge):

✅

Active state

Unified voice + text chat — active. Voice routes through your normal chat backend (same LLM, tools, and memory).

Removing or rotating credentials

The card has a Remove credentials action that zeroes out the stored JSON, project id, client email, and configured flag in one transaction. To rotate, mint a new JSON key in GCP, upload it in the UI, and revoke the old key in the GCP console. Changes take effect on the next voice session — no restart needed.

Cost notes

With the bridge you pay per minute of speech-to-text and per character of text-to-speech, and only for the audio you actually transcribe and synthesize — there's no idle session cost. Deepgram is generally the cheapest bridge provider; the Gemini Live realtime bridge bills at realtime rates and is reserved for when no bridge provider is available.

Security

ℹ️

Your service-account key stays protected

The service-account JSON is sensitive: it's stored server-side, and the admin-facing API redacts it on every read after the initial upload — only the derived project id and client email come back. The worker fetches the full JSON over a private network path scoped to your tenant, so the key never leaves trusted infrastructure.

← PreviousOrganization Settings Next →Members & Roles