Answer real phone calls with Gemini: bridge Gemini Live to Sautikit

A caller dials your number. Instead of an IVR tree, a natural voice answers, listens, thinks, and replies in real time, on any phone, with no app to install. That is what you get when you bridge the Google Gemini Live API to Sautikit's live audio stream, and this tutorial wires it end to end.

TL;DR

Sautikit's <Stream> voice action forks live call audio to your WebSocket as raw PCM frames; you relay them to the Gemini Live API and write Gemini's synthesized PCM back on the same socket to speak into the call.

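Use outputSamplingRate="16000" so the Sautikit leg is 16 kHz mono PCM in both directions, matching the rate you tag on audio sent to Gemini; confirm the rate of the audio Gemini returns before assuming the bridge needs no resampling.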

<Stream> is returned as application/xml today (JSON stream support is on the roadmap); your WS server must accept the audio.drachtio.org subprotocol.

Chat and app-based assistants assume a smartphone, a data plan, and a download. A phone number assumes none of that. Anyone with a handset (a feature phone on a rural network, a landline, a roaming SIM) can reach an AI voice agent by dialing. For support lines, appointment booking, order status, or after-hours triage, that reach is the whole point: you meet callers where they already are.

The hard part has always been the audio pipe: getting live call audio out to an LLM and synthesized audio back in fast enough to feel like a conversation. Sautikit's Stream verb is that pipe.

Inbound call
  → Sautikit voice_callback returns RAW XML <Stream .../>
  → Sautikit opens a WebSocket to your Node 'ws' server (binary PCM in/out)
  → your server relays PCM ⇄ Google Gemini Live API (WebSocket)
  → Gemini's synthesized PCM is written back on the Sautikit socket
  → audio plays into the live call

Two WebSockets, one bridge process. Sautikit is the telephony leg; Gemini is the intelligence leg. Your server is the translator that keeps sample rates and framing aligned.

When a call connects, Sautikit fetches your voice_callback_url. For realtime AI you return raw XML (not the JSON actions form) with a <Stream> element. Set the Content-Type to application/xml.

import express from "express";
 
const app = express();
app.use(express.urlencoded({ extended: false }));
 
app.post("/voice", (req, res) => {
  // req.body includes From, Digits, etc. for JSON flows; here we go raw XML.
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stream
    name="gemini"
    url="wss://your-server.example.com/gemini"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-server.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>`;
 
  res.set("Content-Type", "application/xml");
  res.send(xml);
});
 
app.listen(3000);

track="both_tracks" forwards both the caller and any outbound audio; use inbound_track if you only want the caller's voice into Gemini. outputSamplingRate="16000" tells Sautikit to deliver 16 kHz PCM, which is a common rate for realtime LLM audio; keep every leg on the same rate to avoid resampling.
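If the stream URL or callback URL ever comes from configuration, or carries query parameters, the attribute values need XML escaping before they go into the response. The sketch below is one way to build the same <Stream> document programmatically. buildStreamXml and escapeXmlAttr are hypothetical helper names, not part of any Sautikit SDK, and the attribute set is simply the one used in the webhook above.

```javascript
// Escape the five XML special characters so attribute values stay well-formed.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: renders the <Stream> response used in this tutorial.
// Attributes left undefined are omitted from the output.
function buildStreamXml({ name, url, track = "both_tracks", statusCallback }) {
  const attrs = {
    name,
    url,
    track,
    outputSamplingRate: "16000",
    statusCallback,
    // Only ask for lifecycle events when there is somewhere to send them.
    statusEvents: statusCallback
      ? "stream-started stream-stopped stream-error"
      : undefined,
  };
  const rendered = Object.entries(attrs)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}="${escapeXmlAttr(value)}"`)
    .join(" ");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    `<Response>\n  <Stream ${rendered} />\n</Response>`
  );
}
```

In the /voice handler you would then call res.send(buildStreamXml({ name: "gemini", url: "wss://your-server.example.com/gemini" })) after setting the Content-Type.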

Sautikit connects to your url and requires the audio.drachtio.org WebSocket subprotocol. Reject the handshake if it is absent. Incoming messages are binary PCM frames.

import { WebSocketServer } from "ws";
import { openGeminiSession } from "./gemini.js";
 
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});
 
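wss.on("connection", async (sautiSocket) => {
  // ws completes the handshake even when handleProtocols returns false,
  // so close any socket that did not negotiate the subprotocol.
  if (sautiSocket.protocol !== "audio.drachtio.org") {
    sautiSocket.close(1002, "audio.drachtio.org subprotocol required");
    return;
  }

  // One Gemini Live session per call.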
  const gemini = await openGeminiSession({
    // Gemini → call: write synthesized PCM back on the SAME Sautikit socket.
    onAudio: (pcmChunk) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) {
        sautiSocket.send(pcmChunk); // binary frame plays into the call
      }
    },
    // Barge-in: caller started talking, stop the current reply.
    onInterrupt: () => {
      // Optionally signal Sautikit to flush any buffered playback here.
    },
  });
 
  // Call → Gemini: forward each inbound PCM frame.
  sautiSocket.on("message", (data, isBinary) => {
    if (isBinary) gemini.sendAudio(data); // 16 kHz mono PCM
  });
 
  sautiSocket.on("close", () => gemini.close());
  sautiSocket.on("error", () => gemini.close());
});

The bridge is deliberately thin: bytes in from the call go to Gemini, bytes out from Gemini go back to the call. All the conversation logic lives inside the Gemini session.

The Gemini Live API is itself a WebSocket: you open a session, send a setup message selecting a live model and audio in/out config, then stream audio chunks up and receive synthesized audio chunks down. Model names and exact field names move fast; check the current ai.google.dev Live API docs for the live model ID and config schema. The pattern below is stable.

import WebSocket from "ws";
 
// Extract base64 PCM audio from a Gemini Live server message.
// Verify the exact path against current ai.google.dev docs; the Live API
// returns audio inline as base64 under serverContent model turn parts.
function extractInlineAudio(msg) {
  const parts = msg?.serverContent?.modelTurn?.parts ?? [];
  for (const p of parts) {
    const data = p?.inlineData?.data;
    if (data) return data; // base64-encoded PCM
  }
  return null;
}
 
// NOTE: model id, message field names, and config keys change;
// verify against current ai.google.dev Live API docs before shipping.
const GEMINI_URL =
  "wss://generativelanguage.googleapis.com/…?key=" +
  process.env.GEMINI_API_KEY;
 
export async function openGeminiSession({ onAudio, onInterrupt }) {
  const ws = new WebSocket(GEMINI_URL);
 
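  // Wait for the socket to open; reject if the connection fails first.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });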
 
  // 1) Setup: choose a live model + request audio output.
  //    (This config does not set an output sample rate; check the docs for it.)
  ws.send(
    JSON.stringify({
      setup: {
        model: "models/<current-live-model>", // ← from ai.google.dev
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: {
          parts: [{ text: "You are a concise phone support agent." }],
        },
      },
    })
  );
 
  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
 
    // 2) Synthesized audio out → play into the call.
    const audioB64 = extractInlineAudio(msg); // per current schema
    if (audioB64) onAudio(Buffer.from(audioB64, "base64"));
 
    // 3) Barge-in: Gemini reports the caller interrupted the model.
    if (msg?.serverContent?.interrupted) onInterrupt();
  });
 
  return {
    // Send inbound call PCM up as base64 realtime input.
    sendAudio(pcm) {
      ws.send(
        JSON.stringify({
          realtimeInput: {
            mediaChunks: [
              {
                mimeType: "audio/pcm;rate=16000",
                data: pcm.toString("base64"),
              },
            ],
          },
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}

The load-bearing details: request AUDIO as a response modality, tag uploaded chunks as audio/pcm;rate=16000, and decode the base64 audio Gemini returns before forwarding it to Sautikit. Everything else is prompt and policy.
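One detail worth hardening: extractInlineAudio above returns only the first audio part of a message, and JSON.parse will throw on any frame that is not JSON. The sketch below is a more defensive variant that returns every inline audio part, already decoded, along with its MIME type. It assumes the same serverContent.modelTurn.parts[].inlineData path used above, which should be verified against the current ai.google.dev docs; extractAllInlineAudio is a hypothetical name.

```javascript
// Parse one Live API frame and return every inline audio part it carries.
// Returns [] for frames that are not JSON or contain no audio.
function extractAllInlineAudio(raw) {
  let msg;
  try {
    msg = JSON.parse(raw.toString());
  } catch {
    return []; // not JSON: ignore rather than crash the bridge
  }
  const parts = msg?.serverContent?.modelTurn?.parts;
  if (!Array.isArray(parts)) return [];
  return parts
    .map((part) => part?.inlineData)
    .filter((inline) => inline && typeof inline.data === "string")
    .map((inline) => ({
      mimeType: inline.mimeType ?? null, // may carry the sample rate
      pcm: Buffer.from(inline.data, "base64"),
    }));
}
```

Inside the message handler this would replace the single extractInlineAudio call with a loop: for (const { pcm } of extractAllInlineAudio(raw)) onAudio(pcm).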

Mismatched sample rates are the number-one cause of chipmunk or slow-motion audio. You already told Sautikit outputSamplingRate="16000", so:

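Frames arriving from Sautikit are 16 kHz mono PCM; send them to Gemini as rate=16000.
Frames Gemini returns are only safe to write straight back if they are also 16 kHz. Google's Live API docs have listed 24 kHz as the output rate, so check the rate on the audio you receive and, if it differs, resample to 16 kHz before writing to the Sautikit socket.

If you ever set outputSamplingRate="8000", you must resample both directions. The closer every leg stays to 16 kHz, the closer the bridge stays to a byte pump.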
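When a leg does not match, the bridge has to convert. The sketch below shows one simple approach for 16-bit little-endian mono PCM: read the rate from a MIME type such as audio/pcm;rate=24000 and resample with linear interpolation. Both function names are hypothetical, and the assumption that returned audio parts carry a rate parameter in their MIME type should be checked against the current docs. Linear interpolation applies no low-pass filter and treats each chunk independently, so it is fine for a first test but a proper resampling library is the better choice for production audio quality.

```javascript
// Read the sample rate from a MIME type such as "audio/pcm;rate=24000".
function parsePcmRate(mimeType, fallback = 16000) {
  const match = /rate=(\d+)/.exec(mimeType ?? "");
  return match ? Number(match[1]) : fallback;
}

// Linear-interpolation resampler for 16-bit little-endian mono PCM.
function resamplePcm16(input, fromRate, toRate) {
  if (fromRate === toRate) return input;
  const inSamples = Math.floor(input.length / 2);
  const outSamples = Math.floor((inSamples * toRate) / fromRate);
  const output = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    const pos = (i * fromRate) / toRate; // fractional index into the input
    const left = Math.floor(pos);
    const right = Math.min(left + 1, inSamples - 1);
    const frac = pos - left;
    const a = input.readInt16LE(left * 2);
    const b = input.readInt16LE(right * 2);
    output.writeInt16LE(Math.round(a + (b - a) * frac), i * 2);
  }
  return output;
}
```

In the Gemini-to-call direction this would wrap the outbound chunk: resamplePcm16(pcm, parsePcmRate(mimeType), 16000) before sending it on the Sautikit socket.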

Natural conversation means the caller can talk over the agent. Gemini's Live API detects this and emits an interruption signal (serverContent.interrupted in the pattern above). When you see it, stop feeding the current reply into the call so the caller is not talking over stale audio. Because playback flows through the Sautikit socket you control, dropping queued outbound chunks on interrupt is enough to make the agent feel responsive. Note that the bridge above forwards each chunk the moment it arrives, so there is nothing to drop unless you hold outbound audio in a queue of your own and release it at playback speed.
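A minimal sketch of such a queue follows. It buffers Gemini's audio, releases it in 20 ms frames on a timer, and discards whatever is still pending when an interruption arrives. PlaybackQueue is a hypothetical class, the 20 ms frame size is an arbitrary choice, and whether Sautikit itself buffers audio you have already sent is something to confirm in its docs; setInterval pacing also drifts, so treat this as a starting point.

```javascript
// Hypothetical paced playback queue for 16-bit mono PCM.
// send(frame) is whatever writes a binary frame to the Sautikit socket.
class PlaybackQueue {
  constructor(send, { sampleRate = 16000, frameMs = 20 } = {}) {
    this.send = send;
    this.frameMs = frameMs;
    this.frameBytes = (sampleRate * 2 * frameMs) / 1000; // 640 at 16 kHz
    this.pending = Buffer.alloc(0);
    this.timer = null;
  }

  // Append audio from Gemini and start the pacing timer if it is idle.
  enqueue(pcm) {
    this.pending = Buffer.concat([this.pending, pcm]);
    if (!this.timer) {
      this.timer = setInterval(() => this.tick(), this.frameMs);
    }
  }

  // Release one frame, or stop the timer when nothing is left.
  tick() {
    if (this.pending.length === 0) return this.stop();
    const frame = this.pending.subarray(0, this.frameBytes);
    this.pending = this.pending.subarray(frame.length);
    this.send(frame);
  }

  // Barge-in: throw away everything that has not been sent yet.
  flush() {
    this.pending = Buffer.alloc(0);
    this.stop();
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

Wired into the bridge, onAudio would call queue.enqueue(pcmChunk) and onInterrupt would call queue.flush(), with the queue constructed per call around sautiSocket.send.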

The statusEvents="stream-started stream-stopped stream-error" on your <Stream> element tells Sautikit to POST lifecycle events to your statusCallback. Each carries a callSessionState of StreamStarted, StreamStopped, or StreamError plus a streamSid.

app.post("/stream-status", express.json(), (req, res) => {
  const { callSessionState, streamSid } = req.body;
  console.log(`[stream] ${streamSid} → ${callSessionState}`);
  // StreamError → alert; StreamStopped → tear down the Gemini session.
  res.sendStatus(200);
});

Use StreamStarted to confirm the pipe is up, StreamError to page yourself, and StreamStopped to close the matching Gemini session and free resources.
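Closing "the matching Gemini session" from a status callback needs some lookup from streamSid to session. One way to sketch it is a Map that the WebSocket handler fills and the status handler consults. Everything here is hypothetical: the function names are illustrative, and how your WebSocket handler learns the streamSid for its connection is an assumption to check against what Sautikit actually sends when the stream opens.

```javascript
// Hypothetical registry of live Gemini sessions, keyed by streamSid.
const sessions = new Map();

// Call this from the WebSocket handler once the streamSid is known.
function registerSession(streamSid, gemini) {
  sessions.set(streamSid, gemini);
}

// Apply one lifecycle event; returns a short label describing what was done.
function handleStreamStatus({ callSessionState, streamSid }) {
  const gemini = sessions.get(streamSid);
  switch (callSessionState) {
    case "StreamStopped":
    case "StreamError":
      if (gemini) gemini.close(); // free the Gemini side of the bridge
      sessions.delete(streamSid);
      return callSessionState === "StreamError" ? "alert" : "closed";
    default:
      return "noted"; // e.g. StreamStarted
  }
}
```

The /stream-status route could then call handleStreamStatus(req.body) and page someone when it returns "alert".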

Do I need a special endpoint for realtime streaming?

No. You reuse the same voice_callback_url as any Sautikit call. The difference is you return raw <Stream> XML with Content-Type: application/xml instead of the JSON actions array.

Why must the WebSocket accept the audio.drachtio.org subprotocol?

Sautikit negotiates that subprotocol when it opens the socket. If your ws server does not offer it back during the handshake, the connection is refused and no audio flows. Confirm it in your handleProtocols callback.

Can the AI voice agent both listen and speak on one connection?

Yes. <Stream> is bidirectional: audio Sautikit sends you is the caller; binary PCM you send back on the same socket is played into the call. You never open a second connection to Sautikit.

What does an AI voice call cost?

Standard voice pricing applies: inbound calls are free (KES 0) and outbound bills at KES 3.00/min, billed per second from the moment the call connects. The Gemini API is billed separately by Google. See /pricing for the source of truth.
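As a rough illustration of per-second billing at the quoted outbound rate, the arithmetic can be sketched as below. The function name is hypothetical, and the calculation ignores any rounding or minimum charge Sautikit may apply, so treat /pricing as authoritative.

```javascript
// Estimate the Sautikit charge for an outbound call billed per second.
// Does not include Gemini API usage, which Google bills separately.
function estimateOutboundCostKes(connectedSeconds, ratePerMinuteKes = 3.0) {
  return (connectedSeconds * ratePerMinuteKes) / 60;
}
```

For example, estimateOutboundCostKes(90) gives 4.5, that is KES 4.50 for a 90-second outbound call.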

Will there be a JSON version of <Stream>?

Yes, it is on the roadmap. Today <Stream> is returned as application/xml; a JSON form embeddable in the { actions: [...] } response is coming.

Create a Sautikit workspace and claim a number (from KES 116, instant).
Top up over M-Pesa; no card required.
Deploy your ws bridge, point a number's voice_callback_url at the webhook above, and dial in to talk to Gemini.

Start with Sautikit → · See pricing → · Need SMS, WhatsApp & an agent desk? Helloduty →

Voice actions reference: every verb, including <Stream> attributes.
Place your first call: the POST /v1/calls basics under the streaming layer.
