
Unified voice + text chat

In Your Office AI, voice always rides on the same chat engine as text, so speaking has the same tools, context, and memory as typing. This page covers how the voice pipeline is chosen and how to provision the credentials it needs.

One brain for voice and text

Your Office AI treats voice as "unified": every voice path routes through the chat engine, so a spoken request can use #-attachments, @-mentions, web search, files, and memory exactly like a typed one. That full-parity guarantee is built into the platform.

The unified pipeline looks like this:

user audio → speech-to-text → chat engine (your normal LLM + tools + memory) → text-to-speech → user audio
Diagram: 🗣️ You speak (voice or avatar mode) → 🎙️ Speech-to-text (Deepgram / Google) → 🧠 Chat engine (your LLM + tools + memory) → 🔊 Text-to-speech (spoken reply)
Caption: Voice rides the same chat engine as text, with full parity for tools, #-context, and memory.

Because voice and text write to the same chat session, switching mid-conversation preserves context automatically.
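As a rough illustration of the flow above, the sketch below models one shared session that both typed and spoken turns append to. Every name here (ChatSession, chat_engine, voice_turn, and the toy fake_stt / fake_tts providers) is hypothetical and stands in for platform internals that this page does not document.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChatSession:
    """One shared transcript for typed and spoken turns (hypothetical model)."""
    turns: List[Tuple[str, str, str]] = field(default_factory=list)  # (role, channel, text)


def chat_engine(session: ChatSession, text: str, channel: str) -> str:
    """Stand-in for the real chat engine (LLM + tools + memory)."""
    session.turns.append(("user", channel, text))
    reply = f"(reply to: {text})"  # a real engine would call the LLM here
    session.turns.append(("assistant", channel, reply))
    return reply


def voice_turn(
    session: ChatSession,
    audio: bytes,
    stt: Callable[[bytes], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """user audio -> speech-to-text -> chat engine -> text-to-speech -> audio."""
    text = stt(audio)
    reply = chat_engine(session, text, channel="voice")
    return tts(reply)


# Toy providers so the sketch runs without any cloud account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Because voice_turn calls the same chat_engine function a typed message would, a spoken turn lands in the same transcript, which is what lets context survive a mid-conversation switch.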

The two pipelines

Under the hood, two pipelines can satisfy the parity rule. Both run the chat engine as the brain; they differ in how audio gets in and out.

| Pipeline | How it works | Providers |
| --- | --- | --- |
| Bridge | Speech-to-text → chat engine → text-to-speech. | Deepgram (Nova-3 STT + Aura-2 TTS) or Google (Cloud Speech v2 + Cloud TTS). |
| Realtime bridge | Gemini Live as the audio transport, with the chat engine wired in as the thinking layer. Used as a last resort. | Gemini Live (a realtime API key). |

Modes

The mode you choose in Org Settings tells the worker how to pick a pipeline.

| Mode | Behaviour |
| --- | --- |
| Auto (recommended) | Picks the cheapest pipeline that still gives full parity, in order: Deepgram, then Google, then a Gemini Live realtime bridge. If no provider is configured, it surfaces a clear spoken message saying so. |
| Unified only | Pins the bridge (Deepgram or Google) as the only transport. If no bridge provider is configured, the admin gets a clear, explicit message rather than a quiet fallback. |
| Realtime bridge | Pins Gemini Live as the audio transport with the chat-engine brain attached. Same full feature set, at realtime cost. |
💡 Which mode should I pick?

Auto suits almost everyone — it always keeps full chat-engine parity and just picks the most cost-effective transport. Choose Unified only when you want the bridge to be the sole transport, or Realtime bridge to pin Gemini Live as the audio transport with the chat-engine brain attached.
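The selection rules for the three modes can be sketched as a small function. This is a minimal illustration of the ordering described above, not the worker's actual code: the Mode enum, the provider strings, and pick_transport are all assumed names.

```python
from enum import Enum
from typing import Optional, Set


class Mode(Enum):
    AUTO = "auto"
    UNIFIED_ONLY = "unified_only"
    REALTIME_BRIDGE = "realtime_bridge"


# Hypothetical provider identifiers, in the preference order the docs describe.
BRIDGE_ORDER = ("deepgram", "google")
REALTIME = "gemini_live"


def pick_transport(mode: Mode, configured: Set[str]) -> Optional[str]:
    """Return the transport to use, or None when nothing usable is configured.

    None is the case where the platform would surface an explicit
    "no provider configured" message instead of falling back quietly.
    """
    if mode is Mode.REALTIME_BRIDGE:
        # Pinned to Gemini Live; bridge providers are ignored.
        return REALTIME if REALTIME in configured else None

    # Auto and Unified only both prefer the bridge: Deepgram first, then Google.
    for provider in BRIDGE_ORDER:
        if provider in configured:
            return provider

    # Only Auto may fall through to the realtime bridge as a last resort.
    if mode is Mode.AUTO and REALTIME in configured:
        return REALTIME
    return None
```

For example, with only a Google key and a Gemini Live key configured, Auto and Unified only would both resolve to "google", while Realtime bridge would resolve to "gemini_live".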

Why a service account (for the Google bridge)

Cloud Speech-to-Text v2 and Cloud Text-to-Speech authenticate with a Google Cloud service-account key. They reject a plain Gemini API key (which only authenticates the Generative Language API). If you'd rather not use Google for the bridge, configure a Deepgram key instead — the bridge accepts either provider.

Provisioning a Google key — about 3 minutes in GCP

💡 You only do this once

The same JSON key works for every voice session in the organization. Rotate it on whatever cadence your security policy demands — uploading a new key in the UI replaces the old one atomically.

  1. Open the GCP IAM Service Accounts page

    In the Google Cloud console, switch to the project that owns your Google billing and open IAM & Admin → Service Accounts. Click Create service account. Name it something memorable like cc-voice-bridge — the name is only used for your own audit log.

  2. Grant two roles

    Assign the service account these two predefined roles: Cloud Speech-to-Text Client (roles/speech.client) and Cloud Text-to-Speech User (roles/cloudtts.client). Both are least-privilege — the account can only call STT and TTS, nothing else in your project.

  3. Enable the two APIs

    In APIs & Services → Library, enable speech.googleapis.com (Cloud Speech-to-Text API) and texttospeech.googleapis.com (Cloud Text-to-Speech API). If they are already enabled the pages will say so — no action needed.

  4. Create and download the JSON key

    On the service account detail page, open the Keys tab and click Add key → Create new key → JSON. A file like cc-voice-bridge-<random>.json is downloaded to your machine. Treat it like a password: anyone who has it can call STT and TTS on your project bill.

  5. Upload it in Your Office AI

    Open Settings → Organizations → Org Settings → Voice & AI. In the Cloud voice card, click Upload service-account JSON and pick the file you just downloaded — or paste its contents. The card validates the JSON shape client-side; the private key is redacted on subsequent reads.

  6. Pick a mode

    Auto is recommended: the worker chooses the cheapest pipeline that still gives full chat-engine parity, preferring Deepgram, then Google, then a Gemini Live bridge. Choose Unified only if you want a loud failure when no bridge provider is configured, or Realtime bridge to pin Gemini Live as the audio transport with the chat-engine brain attached.
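Before uploading in step 5, you may want to sanity-check the downloaded file yourself. The sketch below checks for the fields a Google service-account key file normally contains (type, project_id, private_key, client_email). It is an assumption about what a reasonable shape check looks like, not the validation the Cloud voice card actually runs, and check_service_account_json is a hypothetical helper name.

```python
import json
from typing import List

# Fields present in a standard Google service-account JSON key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_json(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for name in REQUIRED_FIELDS:
        value = data.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty field: {name}")

    # A plain Gemini API key, or another credential type, fails this check.
    key_type = data.get("type")
    if isinstance(key_type, str) and key_type and key_type != "service_account":
        problems.append(f"unexpected type: {key_type}")
    return problems
```

A shape check like this only catches the wrong file or a truncated paste. It does not prove the key is still valid or that the roles and APIs from steps 2 and 3 are in place.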

Voice with an avatar

Voice can be paired with an animated avatar tile, powered by Simli, for an avatar input mode in chat. The voice catalog also recognizes a broad range of STT, TTS, and avatar providers that admins can configure; Simli is the in-product avatar.

Verifying the pipeline is active

After you upload credentials and start a new voice session, the status row at the top of the card shows the active pipeline and, when using the Google bridge, the derived Google project ID and service-account email:

Active state

Unified voice + text chat — active. Voice routes through your normal chat backend (same LLM, tools, and memory).

Removing or rotating credentials

The card has a Remove credentials action that zeroes out the stored JSON, project id, client email, and configured flag in one transaction. To rotate, mint a new JSON key in GCP, upload it in the UI, and revoke the old key in the GCP console. Changes take effect on the next voice session — no restart needed.

Cost notes

With the bridge you pay per minute of speech-to-text and per character of text-to-speech, and only for the audio you actually transcribe and synthesize — there's no idle session cost. Deepgram is generally the cheapest bridge provider; the Gemini Live realtime bridge bills at realtime rates and is reserved for when no bridge provider is available.
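To make the billing model concrete, the sketch below adds a per-minute speech-to-text charge to a per-character text-to-speech charge. The function name and both rate parameters are placeholders; take real rates from your provider's current price list, since this page does not state them.

```python
from decimal import Decimal


def bridge_cost(
    stt_seconds: int,
    tts_chars: int,
    stt_rate_per_minute: Decimal,  # placeholder: use your provider's real rate
    tts_rate_per_char: Decimal,    # placeholder: use your provider's real rate
) -> Decimal:
    """Estimate bridge cost: transcribed minutes plus synthesized characters.

    Idle time contributes nothing, because only audio that is actually
    transcribed or synthesized is counted.
    """
    minutes = Decimal(stt_seconds) / Decimal(60)
    return minutes * stt_rate_per_minute + Decimal(tts_chars) * tts_rate_per_char
```

Decimal is used instead of float so that small per-character rates add up without rounding drift.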

Security

ℹ️ Your service-account key stays protected

The service-account JSON is sensitive, and Your Office AI keeps it safe. It is stored server-side and encrypted at rest the same way other tenant secrets are, and the admin-facing API redacts the private key on every read: only the derived project id and client email come back. The worker fetches the full JSON over a private network path scoped to your tenant, so the key never leaves trusted infrastructure.
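The read-side redaction can be pictured as an allow-list: the admin view is built only from the fields that are safe to show. This is an illustrative sketch, and redact_for_admin and the response field names are assumptions rather than the platform's real API.

```python
from typing import Any, Dict


def redact_for_admin(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Build the admin-facing view from an allow-list of safe fields.

    Copying only named fields (rather than deleting private_key from a copy)
    means any new secret field added later stays hidden by default.
    """
    return {
        "configured": True,
        "project_id": stored.get("project_id"),
        "client_email": stored.get("client_email"),
    }
```

An allow-list is the safer design choice here because forgetting to list a field hides it, whereas forgetting to strip a field from a deny-list would leak it.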