In Your Office AI, voice always rides on the same chat engine as text, so speaking has the same tools, context, and memory as typing. This page covers how the voice pipeline is chosen and how to provision the credentials it needs.
Your Office AI treats voice as "unified": every voice path routes through the chat engine, so a spoken request can use #-attachments, @-mentions, web search, files, and memory exactly like a typed one. Because voice and text share one brain, speaking gives you the platform's full feature set.
The unified pipeline looks like this:
user audio → speech-to-text → chat engine (your normal LLM + tools + memory) → text-to-speech → user audio
Because voice and text write to the same chat session, switching mid-conversation preserves context automatically.
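The flow above can be sketched in a few lines. This is only an illustration of the idea, not the platform's implementation: `ChatSession`, `voice_turn`, and the stub functions are hypothetical names, and the speech-to-text, text-to-speech, and chat engine are stand-in callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs


@dataclass
class ChatSession:
    """Hypothetical shared session: typed and spoken turns land in one history."""
    history: History = field(default_factory=list)

    def send(self, user_text: str, brain: Callable[[History], str]) -> str:
        # Record the user turn, ask the brain for a reply given the whole
        # history, then record the reply as well.
        self.history.append(("user", user_text))
        reply = brain(self.history)
        self.history.append(("assistant", reply))
        return reply


def voice_turn(
    audio_in: bytes,
    session: ChatSession,
    stt: Callable[[bytes], str],
    brain: Callable[[History], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """user audio -> speech-to-text -> chat engine -> text-to-speech -> user audio."""
    text = stt(audio_in)
    reply = session.send(text, brain)  # same call a typed message would make
    return tts(reply)


# Stand-in components for the sketch (assumptions, not real providers).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def counting_brain(history: History) -> str:
    user_turns = sum(1 for role, _ in history if role == "user")
    return f"user turns so far: {user_turns}"
```

Because a typed message and a spoken one both go through `session.send`, a voice turn that follows a typed turn sees the earlier history without any extra plumbing.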
Under the hood, two pipelines can satisfy the parity rule. Both run the chat engine as the brain; they differ in how audio gets in and out.
| Pipeline | How it works | Providers |
|---|---|---|
| Bridge | Speech-to-text → chat engine → text-to-speech. | Deepgram (Nova-3 STT + Aura-2 TTS) or Google (Cloud Speech v2 + Cloud TTS). |
| Realtime bridge | Gemini Live as the audio transport, with the chat engine wired in as the thinking layer. Used as a last resort. | Gemini Live (a realtime API key). |
The mode you choose in Org Settings tells the worker how to pick a pipeline.
| Mode | Behaviour |
|---|---|
| Auto (recommended) | Picks the cheapest pipeline that still gives full parity, in this order: Deepgram, then Google, then the Gemini Live realtime bridge. Whichever transport it picks, voice still routes through the full chat engine. If no provider is configured, it surfaces a clear spoken message. |
| Unified only | Pins the bridge (Deepgram or Google) as the only transport. If no bridge provider is configured, the admin gets a clear, explicit message rather than a quiet fallback. |
| Realtime bridge | Pins Gemini Live as the audio transport with the chat-engine brain attached — the same full feature set, at realtime cost. |
Auto suits almost everyone: it always keeps full chat-engine parity and simply picks the most cost-effective transport. Choose Unified only when you want the bridge to be the sole transport, or Realtime bridge when you want to pin Gemini Live as the audio transport with the chat-engine brain attached.
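The selection rules in the mode table can be expressed as a small function. This is a sketch under stated assumptions rather than the worker's actual code: the `Mode` enum, the provider identifiers, and `select_transport` are hypothetical names, Unified only is assumed to prefer Deepgram over Google in the same order as Auto, and returning `None` stands in for "surface an explicit message to the user".

```python
from enum import Enum
from typing import Optional, Set


class Mode(Enum):
    AUTO = "auto"
    UNIFIED_ONLY = "unified_only"
    REALTIME_BRIDGE = "realtime_bridge"


# Preference order described for Auto: cheapest full-parity transport first.
AUTO_ORDER = ("deepgram", "google", "gemini_live")
# Unified only never falls back to the realtime bridge.
# Assumption: it keeps the same Deepgram-before-Google preference.
BRIDGE_ORDER = ("deepgram", "google")


def select_transport(mode: Mode, configured: Set[str]) -> Optional[str]:
    """Return the provider to use, or None when nothing suitable is configured."""
    if mode is Mode.AUTO:
        candidates = AUTO_ORDER
    elif mode is Mode.UNIFIED_ONLY:
        candidates = BRIDGE_ORDER
    else:
        candidates = ("gemini_live",)

    for provider in candidates:
        if provider in configured:
            return provider
    # Assumption: the caller turns None into a clear message, not a silent fallback.
    return None
```

For example, with only Google and Gemini Live credentials configured, Auto picks `"google"`, while Unified only with nothing but a Gemini Live key returns `None` instead of falling back.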
Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a Google Cloud service-account key. They reject a plain Gemini API key (which only authenticates the Generative Language API). If you'd rather not use Google for the bridge, configure a Deepgram key instead — the bridge accepts either provider.
The same JSON key works for every voice session in the organization. Rotate it on whatever cadence your security policy demands — uploading a new key in the UI replaces the old one atomically.
In the Google Cloud console, switch to the project that owns your Google billing and open IAM & Admin → Service Accounts. Click Create service account. Name it something memorable like cc-voice-bridge — the name is only used for your own audit log.
Assign the service account these two predefined roles: Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). Both are least-privilege — the account can only call STT and TTS, nothing else in your project.
In APIs & Services → Library, enable speech.googleapis.com (Cloud Speech-to-Text API) and texttospeech.googleapis.com (Cloud Text-to-Speech API). If they are already enabled the pages will say so — no action needed.
On the service account detail page, open the Keys tab and click Add key → Create new key → JSON. A file like cc-voice-bridge-<random>.json is downloaded to your machine. Treat it like a password: anyone who has it can call STT and TTS at your project's expense.
Open Settings → Organizations → Org Settings → Voice & AI. In the Cloud voice card, click Upload service-account JSON and pick the file you just downloaded — or paste its contents. The card validates the JSON shape client-side; the private key is redacted on subsequent reads.
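If you want to sanity-check the file before uploading it, a shape check along the following lines is one option. It is a sketch only: the card's real validation rules are not documented here, and `validate_service_account_json` and `REQUIRED_FIELDS` are hypothetical names. The field names themselves (`type`, `project_id`, `private_key`, `client_email`) are the standard ones in a Google Cloud service-account key file.

```python
import json
from typing import Dict

# Fields this sketch treats as mandatory (an assumption about what matters).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_service_account_json(raw: str) -> Dict[str, str]:
    """Check the key's shape and return the non-secret fields derived from it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # Every required field must be present and a non-empty string.
    missing = [
        name for name in REQUIRED_FIELDS
        if not isinstance(data.get(name), str) or not data[name]
    ]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))

    if data["type"] != "service_account":
        raise ValueError('"type" must be "service_account"')
    if "BEGIN PRIVATE KEY" not in data["private_key"]:
        raise ValueError("private_key does not look like a PEM block")

    # Only the derived, non-secret values are returned.
    return {
        "project_id": data["project_id"],
        "client_email": data["client_email"],
    }
```

A plain Gemini API key, or a JSON file of another type, fails this check with a `ValueError` that names the problem.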
Auto is recommended: the worker chooses the cheapest pipeline that still gives full chat-engine parity, preferring Deepgram, then Google, then a Gemini Live bridge. Choose Unified only if you want a loud failure when no bridge provider is configured, or Realtime bridge to pin Gemini Live as the audio transport with the chat-engine brain attached.
Voice can be paired with an animated avatar tile, powered by Simli, for an avatar input mode in chat. The voice catalog also recognizes a broad range of STT, TTS, and avatar providers that admins can configure; Simli is the in-product avatar.
After uploading credentials and starting a new voice session, the status row at the top of the card reflects the active pipeline and the derived Google project id and service-account email (when using the Google bridge):
Unified voice + text chat — active. Voice routes through your normal chat backend (same LLM, tools, and memory).
The card has a Remove credentials action that zeroes out the stored JSON, project id, client email, and configured flag in one transaction. To rotate, mint a new JSON key in GCP, upload it in the UI, and revoke the old key in the GCP console. Changes take effect on the next voice session — no restart needed.
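To illustrate what clearing all four fields in one transaction means, here is a minimal sketch using SQLite from the Python standard library. The table name, column names, and functions are hypothetical and chosen only for the example; the platform's actual storage layer is not described on this page.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table holding one row of voice credentials per organization.
    conn.execute(
        """
        CREATE TABLE org_voice_credentials (
            org_id           TEXT PRIMARY KEY,
            credentials_json TEXT,
            project_id       TEXT,
            client_email     TEXT,
            configured       INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def remove_credentials(conn: sqlite3.Connection, org_id: str) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the four fields are cleared together or not at all.
    with conn:
        conn.execute(
            """
            UPDATE org_voice_credentials
               SET credentials_json = NULL,
                   project_id       = NULL,
                   client_email     = NULL,
                   configured       = 0
             WHERE org_id = ?
            """,
            (org_id,),
        )
```

The point of the single transaction is that a reader never observes a half-cleared state, such as a configured flag that is still set after the JSON is gone.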
With the bridge you pay per minute of speech-to-text and per character of text-to-speech, and only for the audio you actually transcribe and synthesize, so there is no idle session cost. Deepgram is generally the cheapest bridge provider. The Gemini Live realtime bridge bills at realtime rates; in Auto mode it is used only when no bridge provider is configured.
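The billing model above reduces to simple arithmetic. The function below is a hypothetical helper for estimating bridge cost; it takes the rates as parameters because real prices depend on your provider and plan, so look them up on the provider's current pricing page rather than relying on any number used in an example.

```python
def estimate_bridge_cost(
    stt_minutes: float,
    tts_characters: int,
    stt_rate_per_minute: float,
    tts_rate_per_character: float,
) -> float:
    """Estimate bridge cost: per-minute STT plus per-character TTS.

    All rates are caller-supplied; none are built in.
    """
    values = (stt_minutes, tts_characters, stt_rate_per_minute, tts_rate_per_character)
    if min(values) < 0:
        raise ValueError("minutes, characters, and rates must be non-negative")
    stt_cost = stt_minutes * stt_rate_per_minute
    tts_cost = tts_characters * tts_rate_per_character
    return stt_cost + tts_cost
```

A session that sits idle transcribes zero minutes and synthesizes zero characters, so the estimate is zero regardless of the rates.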
The service-account JSON is sensitive, and Your Office AI keeps it safe. It is stored server-side and encrypted at rest the same way other tenant secrets are, and the admin-facing API redacts the private key on every read, so only the derived project id and client email come back. The worker fetches the full JSON over a private network path scoped to your tenant, so the key never leaves trusted infrastructure.
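The redaction behaviour can be pictured as an allow-list applied on every read. This is a sketch of the idea, not the admin API's real code or response shape: `redact_for_admin` and the `configured` field in the result are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def redact_for_admin(stored: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the admin-facing view of stored credentials.

    An allow-list is used instead of deleting "private_key", so any other
    field in the stored JSON is withheld by default as well.
    """
    if not stored:
        return {"configured": False, "project_id": None, "client_email": None}
    return {
        "configured": True,
        "project_id": stored.get("project_id"),
        "client_email": stored.get("client_email"),
    }
```

Copying out only the two derived fields means a future addition to the key file cannot leak through the read path by accident.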