A call center used to mean a room full of headsets and a queue that only moves as fast as the people in it. You can now put an AI voice agent on the front line: it answers every inbound call instantly, handles the routine ones end to end, and hands the hard ones to a human. This guide wires that whole system on Sautikit, from the number all the way to human overflow.
TL;DR
An inbound Sautikit number returns <Stream> from your voice webhook, which forks live call audio to your WebSocket bridge as raw PCM; the bridge relays it to the Gemini Live API and writes synthesized PCM back on the same socket to speak into the call.
Keep every leg at 16 kHz mono PCM (outputSamplingRate="16000"), and your WS server must accept the audio.drachtio.org subprotocol.
When the AI escalates or every agent is busy, your flow returns <Dial> to route the caller to a human; <Conference> gives you a shared queue.
Inbound is KES 0, billed per second; the Gemini/LLM bill is yours directly on Google. No per-minute AI tax.
The whole system is four moving parts wired in a line, and it is the same media pipe the Gemini bridge tutorial describes, extended with routing.
Inbound call to your Sautikit number
→ Sautikit fetches your voice webhook → you return RAW XML <Stream .../>
→ Sautikit opens a WebSocket to your Node 'ws' bridge (binary PCM in/out)
→ your bridge relays PCM ⇄ Google Gemini Live API (WebSocket)
→ Gemini's synthesized PCM is written back on the Sautikit socket → plays into the call
── escalation / overflow ──
→ AI decides to hand off (or all agents busy)
→ your flow returns <Dial> → the call rings a human agent (or a <Conference> queue)
Two WebSockets, one bridge process. Sautikit is the telephony leg; Gemini is the intelligence leg; your bridge is the byte pump that keeps sample rates and framing aligned. The routing layer (queue + human overflow) is plain voice actions your webhook returns, so you never leave the Sautikit control plane to escalate.
Two things make this a call center rather than a single-agent demo: the caller has to be answered even when humans are unavailable, and the AI has to be able to give up and pass the call to a person. Both are handled by returning different voice actions, which we get to in Step 4.
From now on, every call to that number causes Sautikit to POST to https://your-app.example.com/voice. What you return there decides what the caller hears. For the AI call center, the first thing you return is a <Stream>.
When a call connects, Sautikit fetches your webhook. For a realtime AI agent you return raw XML (not the JSON actions form) with a <Stream> element, and set Content-Type: application/xml.
The attributes that carry weight: url (a wss:// endpoint, required), track (inbound_track, outbound_track, or both_tracks; required), and outputSamplingRate (8000 or 16000; required — use 16000 for AI so you never resample). name is an optional label; statusCallback plus statusEvents (space-separated) POST stream lifecycle events so you know when the pipe is live. Audio on the socket is 16-bit little-endian signed PCM, mono.
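As a minimal sketch of the webhook body described above, the helper below validates the three required attributes and renders the XML. The attribute names come from the text; the `<Response>` root is an assumption borrowed from the /handoff example later in this guide, and `buildStreamXml`, `escapeXmlAttr`, and the `wss://bridge.example.com/stream` URL are hypothetical names used for illustration only.

```javascript
// Escape characters that are unsafe inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the raw XML a voice webhook could return for a realtime AI agent.
// ASSUMPTION: <Stream> sits inside a <Response> root, as in the /handoff example.
function buildStreamXml({
  url,
  track,
  outputSamplingRate = 16000,
  name,
  statusCallback,
  statusEvents, // array of event names; rendered space-separated
}) {
  if (!/^wss:\/\//.test(url)) throw new Error("url must be a wss:// endpoint");
  const tracks = ["inbound_track", "outbound_track", "both_tracks"];
  if (!tracks.includes(track)) {
    throw new Error("track must be one of: " + tracks.join(", "));
  }
  if (![8000, 16000].includes(Number(outputSamplingRate))) {
    throw new Error("outputSamplingRate must be 8000 or 16000");
  }

  const attrs = {
    url,
    track,
    outputSamplingRate,
    name,
    statusCallback,
    statusEvents: statusEvents?.join(" "),
  };
  // Drop optional attributes that were not provided.
  const rendered = Object.entries(attrs)
    .filter(([, v]) => v !== undefined && v !== "")
    .map(([k, v]) => `${k}="${escapeXmlAttr(v)}"`)
    .join(" ");

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Stream ${rendered}/>`,
    "</Response>",
  ].join("\n");
}

// Usage: send this string from the webhook with Content-Type: application/xml.
const streamXml = buildStreamXml({
  url: "wss://bridge.example.com/stream",
  track: "inbound_track",
});
console.log(streamXml);
```

Validating before rendering means a typo in `track` or a plain `ws://` URL fails in your own logs instead of as a silent call.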
Native JSON stream is rolling out: the stream() helper in @sautikit/node already emits the JSON shape, but the server runtime does not accept it yet. Until it lands, return the XML <Stream> form from Step 2 as the working path. The helper is safe to adopt now for typing and structure; just serve XML at the webhook.
Sautikit connects to your url and requires the audio.drachtio.org WebSocket subprotocol. Reject the handshake if it is absent. Incoming messages are binary PCM frames from the caller; binary frames you send back are played into the call. This mirrors the Gemini bridge tutorial exactly — reuse it verbatim.
import { WebSocketServer } from "ws";
import { openGeminiSession } from "./gemini.js";
// escalate() is the handoff helper defined in Step 4; import it from wherever you keep it.

const wss = new WebSocketServer({
  port: 8080,
  // ws passes the subprotocols offered by the client as a Set.
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});

wss.on("connection", async (sautiSocket) => {
  // One Gemini Live session per call.
  const gemini = await openGeminiSession({
    // Gemini → call: write synthesized PCM back on the SAME Sautikit socket.
    onAudio: (pcmChunk) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) {
        sautiSocket.send(pcmChunk); // binary frame plays into the call
      }
    },
    // Barge-in: caller started talking, drop the current reply.
    onInterrupt: () => {
      // Optionally flush buffered outbound audio here.
    },
    // The agent asked to escalate → flag the call for human overflow.
    onEscalate: (reason) => escalate(sautiSocket, reason),
  });

  // Call → Gemini: forward each inbound PCM frame.
  sautiSocket.on("message", (data, isBinary) => {
    if (isBinary) gemini.sendAudio(data); // 16 kHz mono PCM
  });

  sautiSocket.on("close", () => gemini.close());
  sautiSocket.on("error", () => gemini.close());
});
The Gemini Live session is itself a WebSocket: open it, send a setup message selecting a live model with audio in/out, then stream audio up and receive synthesized audio down. Model IDs and exact field names move fast — check the current ai.google.dev Live API docs for the live model ID and config schema. The pattern is stable.
import WebSocket from "ws";

// Extract base64 PCM audio from a Gemini Live server message.
// Verify the exact path against current ai.google.dev docs; the Live API
// returns audio inline as base64 under serverContent model turn parts.
function extractInlineAudio(msg) {
  const parts = msg?.serverContent?.modelTurn?.parts ?? [];
  for (const p of parts) {
    const data = p?.inlineData?.data;
    if (data) return data; // base64-encoded PCM
  }
  return null;
}

// NOTE: model id, message field names, and config keys change;
// verify against current ai.google.dev Live API docs before shipping.
const GEMINI_URL =
  "wss://generativelanguage.googleapis.com/…?key=" + process.env.GEMINI_API_KEY;

export async function openGeminiSession({ onAudio, onInterrupt, onEscalate }) {
  const ws = new WebSocket(GEMINI_URL);
  // Wait for the socket to open; reject if the connection fails first.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });

  // Setup: choose a live model and request audio output.
  // This message does not set an output sample rate; confirm the rate
  // Gemini returns in the docs before writing it to a 16 kHz call leg.
  ws.send(
    JSON.stringify({
      setup: {
        model: "models/<current-live-model>", // ← from ai.google.dev
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: {
          parts: [
            {
              text:
                "You are a concise call-center agent. Answer routine questions. " +
                "If the caller needs a human, is angry, or asks for one, say you " +
                "are transferring them and stop talking.",
            },
          ],
        },
      },
    })
  );

  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());

    // Synthesized audio out → play into the call.
    const audioB64 = extractInlineAudio(msg);
    if (audioB64) onAudio(Buffer.from(audioB64, "base64"));

    // Barge-in: Gemini reports the caller interrupted the model.
    if (msg?.serverContent?.interrupted) onInterrupt?.();

    // Escalation: your prompt/tool signals a handoff. Detect it however you
    // model it (a tool call, a sentinel phrase) and bubble it up.
    const text = msg?.serverContent?.modelTurn?.parts
      ?.map((p) => p?.text ?? "")
      .join("");
    if (text && /transferring you|connecting you to an agent/i.test(text)) {
      onEscalate?.("agent-requested");
    }
  });

  return {
    // Send inbound call PCM up as base64 realtime input.
    sendAudio(pcm) {
      ws.send(
        JSON.stringify({
          realtimeInput: {
            mediaChunks: [
              { mimeType: "audio/pcm;rate=16000", data: pcm.toString("base64") },
            ],
          },
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}
The load-bearing details are unchanged from the Gemini post: request AUDIO as a response modality, tag uploaded chunks as audio/pcm;rate=16000, and decode the base64 audio Gemini returns before writing it back to Sautikit. The only addition here is the onEscalate hook — a signal your bridge raises when the agent decides the call belongs to a human.
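Writing Gemini's decoded audio straight back to Sautikit only sounds right if both sides use the same sample rate. Check the Live API docs for the output rate; if it differs from the 16 kHz call leg (24 kHz, for example), the bridge has to convert before sending. The sketch below is one way to do that with linear interpolation on 16-bit little-endian mono PCM. `resamplePcm16` is a hypothetical helper, not part of any SDK, and it applies no low-pass filter, so treat it as a starting point and not as production DSP.

```javascript
// Resample 16-bit little-endian mono PCM by linear interpolation.
// Sketch only: no anti-aliasing filter, and each chunk is handled
// independently, so chunk boundaries are not perfectly continuous.
function resamplePcm16(input, fromRate, toRate) {
  if (fromRate === toRate) return input;

  const inSamples = Math.floor(input.length / 2);
  const outSamples = Math.floor((inSamples * toRate) / fromRate);
  const out = Buffer.alloc(outSamples * 2);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outSamples; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, inSamples - 1); // clamp at the last sample
    const frac = pos - i0;
    const s0 = input.readInt16LE(i0 * 2);
    const s1 = input.readInt16LE(i1 * 2);
    out.writeInt16LE(Math.round(s0 + (s1 - s0) * frac), i * 2);
  }
  return out;
}

// Usage: 20 ms of 24 kHz audio (480 samples) becomes 20 ms at 16 kHz.
const chunk24k = Buffer.alloc(480 * 2);
const chunk16k = resamplePcm16(chunk24k, 24000, 16000);
console.log(chunk16k.length); // 640 bytes = 320 samples
```

In the bridge this would sit inside `onAudio`, wrapping the chunk before `sautiSocket.send`. If the docs confirm 16 kHz output, the function returns its input untouched and costs nothing.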
The AI handles the routine calls. The center needs an answer for the rest: the caller asks for a person, the model gets stuck, or it is a case you never want AI to touch. That is the overflow path, and it is plain voice actions.
When your bridge raises onEscalate, or when a status check says every agent is busy, redirect the live call to a webhook that returns <Dial> to a human's number:
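import fetch from "node-fetch";

// Move the live call off the Stream and onto a human-routing flow.
// callIdFor() is a helper you supply that maps a bridge socket to its Sautikit call ID.
async function escalate(sautiSocket, reason) {
  const res = await fetch(
    `https://api.sautikit.com/v1/calls/${callIdFor(sautiSocket)}/redirect`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.SAUTIKIT_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ routing_url: "https://your-app.example.com/handoff" }),
    }
  );
  // Surface a failed redirect so the caller is not left on a silent line.
  if (!res.ok) console.error(`redirect failed (${reason}): HTTP ${res.status}`);
}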
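The redirect call relies on `callIdFor`, which this guide leaves to you. One possible shape is a small registry keyed by socket, sketched below. It assumes your voice webhook appended the call ID to the `<Stream>` url as a `callId` query parameter and that Sautikit preserves the query string when it connects; both are assumptions to verify, and `registerCall` is a hypothetical name.

```javascript
// Map each live socket to its call ID. A WeakMap lets entries disappear
// once the socket is garbage-collected, so closed calls do not leak.
const callIds = new WeakMap();

// ASSUMPTION: the webhook put ?callId=... on the <Stream> url.
// requestUrl is the path and query of the upgrade request, e.g. req.url.
function registerCall(socket, requestUrl) {
  const callId = new URL(requestUrl, "ws://localhost").searchParams.get("callId");
  if (!callId) throw new Error("stream URL is missing the callId parameter");
  callIds.set(socket, callId);
  return callId;
}

// Look up the call ID for a socket registered earlier.
function callIdFor(socket) {
  const callId = callIds.get(socket);
  if (!callId) throw new Error("no call registered for this socket");
  return callId;
}

// Usage with a stand-in socket object:
const fakeSocket = {};
registerCall(fakeSocket, "/stream?callId=abc123");
console.log(callIdFor(fakeSocket)); // abc123
```

With the `ws` package, the connection handler receives the upgrade request as its second argument, so the bridge could call `registerCall(sautiSocket, req.url)` at the top of `wss.on("connection", (sautiSocket, req) => ...)`.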
Your /handoff webhook then picks a human. The simplest form dials one agent; a real center dials a list, ringing the next free person:
app.post("/handoff", (req, res) => {
  const agents = ["+254711111111", "+254722222222"]; // your on-call agents
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say language="en-KE">Connecting you to an agent. Please hold.</Say>
  <Dial timeout="20">
    <Number>${agents[0]}</Number>
  </Dial>
</Response>`;
  res.set("Content-Type", "application/xml");
  res.send(xml);
});
<Dial> bridges the caller to a human; when the agent hangs up, the call ends. Give it a timeout so an unanswered ring falls through to voicemail or a second agent instead of leaving the caller in silence.
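To ring a list of agents in turn, one option is to emit one `<Dial>` per agent and let each timeout fall through to the next. The sketch below assumes Sautikit runs the elements in a response in order and moves on when a `<Dial>` goes unanswered; confirm that behaviour before relying on it. `buildHandoffXml` and the closing `<Say>` message are illustrative, not part of the platform.

```javascript
// Build a handoff response that tries each agent in order.
// ASSUMPTION: an unanswered <Dial> falls through to the next element.
function buildHandoffXml(agents, { timeout = 20 } = {}) {
  if (!Number.isInteger(timeout) || timeout <= 0) {
    throw new Error("timeout must be a positive integer number of seconds");
  }
  // Keep only numbers in international format so nothing unsafe reaches the XML.
  const valid = agents.filter((n) => /^\+\d{7,15}$/.test(n));
  if (valid.length === 0) throw new Error("no valid agent numbers");

  const dials = valid
    .map((n) => `  <Dial timeout="${timeout}">\n    <Number>${n}</Number>\n  </Dial>`)
    .join("\n");

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    '  <Say language="en-KE">Connecting you to an agent. Please hold.</Say>',
    dials,
    // Reached only if every dial above timed out.
    '  <Say language="en-KE">All agents are busy. Please call back shortly.</Say>',
    "</Response>",
  ].join("\n");
}

// Usage: two agents, 15 seconds of ringing each.
const handoffXml = buildHandoffXml(["+254711111111", "+254722222222"], { timeout: 15 });
console.log(handoffXml);
```

Sequential dialing keeps the caller's wait bounded at roughly `timeout` times the number of agents, which is worth stating in the hold message for longer lists.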
The escalation decision is yours to model. Common triggers: the caller says "agent" or "human", the model emits a handoff tool call, sentiment turns negative, or the topic is on a never-AI list (disputes, cancellations). Whichever you pick, the mechanics are the same — stop the Stream, return <Dial> or <Conference>.
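The triggers listed above can be collected into a single decision function so the bridge has one place to ask whether a call should leave the AI. This is a sketch under stated assumptions: the `handoff_to_human` tool name, the sentiment scale of -1 to 1 with a -0.5 threshold, and the topic labels are all hypothetical and stand in for whatever your own classifier and tools produce.

```javascript
// Topics that should never be handled by the AI (from the never-AI list).
const NEVER_AI_TOPICS = new Set(["dispute", "cancellation"]);

// The caller explicitly asks for a person.
const HUMAN_REQUEST = /\b(agent|human)\b/i;

// Return a reason string if the call should escalate, or null to stay with the AI.
// All inputs are optional; pass whatever signals you have for the current turn.
function escalationReason({ callerText = "", toolCall = null, sentiment = 0, topic = null } = {}) {
  if (topic && NEVER_AI_TOPICS.has(topic)) return "never-ai-topic";
  if (toolCall?.name === "handoff_to_human") return "tool-call"; // hypothetical tool name
  if (HUMAN_REQUEST.test(callerText)) return "caller-requested";
  if (sentiment <= -0.5) return "negative-sentiment"; // hypothetical threshold
  return null;
}

// Usage:
console.log(escalationReason({ callerText: "Can I talk to a human?" })); // caller-requested
console.log(escalationReason({ callerText: "What is my balance?" }));    // null
```

Ordering the checks from hard rules to soft signals means a never-AI topic wins even when the caller sounds calm. The returned reason can be passed straight to `onEscalate`.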
A caller feels a natural conversation when the agent replies within ~800 ms of them finishing, and can interrupt mid-sentence (barge-in). With the Gemini Live bridge, endpointing and interruption are handled inside the live session: Gemini emits an interruption signal (serverContent.interrupted) when the caller talks over the model, and because playback flows through the Sautikit socket you control, dropping queued outbound chunks on interrupt is enough to feel responsive. For the full latency-budget breakdown (STT, LLM first token, TTS first frame) and the turn-based alternative, see the pillar guide Ship an AI voice agent.
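Dropping queued audio on interrupt presupposes that the bridge holds a queue at all. A minimal sketch of one follows: it splits synthesized audio into 20 ms frames, releases one frame per timer tick, and empties itself when the caller barges in. `OutboundAudioQueue` is a hypothetical class, the 20 ms frame size is an assumption chosen for illustration, and `setInterval` pacing is approximate.

```javascript
// 20 ms of 16 kHz, 16-bit mono PCM: 320 samples * 2 bytes.
const FRAME_BYTES = 640;

class OutboundAudioQueue {
  // send is called with one frame at a time, e.g. (f) => sautiSocket.send(f)
  constructor(send) {
    this.send = send;
    this.frames = [];
    this.timer = null;
  }

  // Split a synthesized chunk into 20 ms frames and start pacing them out.
  enqueue(pcm) {
    for (let off = 0; off < pcm.length; off += FRAME_BYTES) {
      this.frames.push(pcm.subarray(off, off + FRAME_BYTES));
    }
    if (!this.timer) this.timer = setInterval(() => this.tick(), 20);
  }

  // Send the next frame, or stop the timer once the queue is empty.
  tick() {
    const frame = this.frames.shift();
    if (frame) this.send(frame);
    else this.stop();
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }

  // Barge-in: discard everything not yet sent. Returns the number of frames dropped.
  flush() {
    const dropped = this.frames.length;
    this.frames = [];
    this.stop();
    return dropped;
  }
}

// Usage: 50 ms of audio becomes three frames; the caller interrupts after one.
const sentFrames = [];
const outbound = new OutboundAudioQueue((frame) => sentFrames.push(frame));
outbound.enqueue(Buffer.alloc(1600));
outbound.tick(); // normally driven by the 20 ms timer
const droppedFrames = outbound.flush(); // the remaining two frames never play
console.log(sentFrames.length, droppedFrames); // 1 2
```

In the bridge, `onAudio` would call `enqueue` and `onInterrupt` would call `flush`. Pacing on your side keeps the unsent backlog small, so an interruption cuts the reply off within about one frame.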
The economics are the point. Sautikit bills the call, per second, in KES:
Inbound calls are free (KES 0) — an AI agent answering every inbound call costs nothing on the telephony leg.
Outbound (including a <Dial> leg to an agent on the PSTN) is KES 3.00/min, billed per second from the moment the leg connects.
Numbers are from KES 116/month, activated instantly.
The LLM/Gemini cost is on your own Google bill, billed directly by Google for the tokens and audio you use. That is deliberate: there is no per-minute "AI voice" surcharge layered on top of your model spend. You pay Sautikit for telephony and Google for intelligence, each at cost, with no middle markup taxing every minute.
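To make the telephony side concrete, the sketch below estimates a month's Sautikit bill from the prices quoted above. The traffic figures in the usage example are invented for illustration, the number rental uses the "from" price as a floor, and Google's model charges are deliberately left out because they are billed separately.

```javascript
// Prices quoted in this guide.
const OUTBOUND_KES_PER_MIN = 3.0; // billed per second
const NUMBER_KES_PER_MONTH = 116; // starting price; your number may cost more

// Estimate monthly telephony spend in KES. Inbound legs cost KES 0,
// so only escalated calls (outbound <Dial> legs) and number rental count.
function monthlyTelephonyKes({ inboundCalls, escalationRate, avgHumanLegSeconds, numbers = 1 }) {
  const escalated = Math.round(inboundCalls * escalationRate);
  const dialSeconds = escalated * avgHumanLegSeconds;
  const dialKes = (dialSeconds * OUTBOUND_KES_PER_MIN) / 60;
  const numberKes = numbers * NUMBER_KES_PER_MONTH;
  return { escalated, dialKes, numberKes, totalKes: dialKes + numberKes };
}

// Usage: 1,000 inbound calls, 10% escalated, 3 minutes with a human each.
const estimate = monthlyTelephonyKes({
  inboundCalls: 1000,
  escalationRate: 0.1,
  avgHumanLegSeconds: 180,
});
console.log(estimate);
// { escalated: 100, dialKes: 900, numberKes: 116, totalKes: 1016 }
```

The 900 calls the AI resolves on its own add nothing to the telephony bill, so the escalation rate is the lever that moves this number.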
Can I use OpenAI or another model instead of Gemini?
Yes. The bridge is model-agnostic. Swap openGeminiSession for a session against OpenAI's realtime API, Claude, or a self-hosted model. Sautikit only delivers and receives PCM; what you connect it to is your choice. Keep both legs at 16 kHz mono PCM and the telephony side does not change.
Does the native JSON stream action work on the server today?
Not yet. @sautikit/node's stream() emits the JSON shape ahead of the runtime, so you can adopt it for typing now, but you must return the XML <Stream> form from your webhook. Native JSON stream is rolling out.
How do callers wait when every agent is busy?
Put them in a <Conference> queue with hold music; the next free agent joins and takes the oldest caller. For a small team, a <Dial> with a timeout that falls through to voicemail is enough.
What must my WebSocket server do to accept the audio?
Select the audio.drachtio.org subprotocol in the handshake response (return it from your handleProtocols callback). If the server does not select it, Sautikit refuses the connection and no audio flows. Treat incoming binary frames as 16-bit little-endian signed PCM, mono, at the rate you set in outputSamplingRate (16 kHz here).