A customer calls your restaurant. Instead of a busy tone or an overwhelmed till operator, an AI voice agent picks up, greets them, and takes the whole order in natural language: "two chicken burgers, one with no onions, a large fries, and a soda." It reads the order back, confirms the total, and drops a clean, structured ticket into your kitchen system. No app, no menu link, no waiting on hold. Just a phone call.
This tutorial wires that end to end, mirroring the Gemini Live bridge pattern from our realtime voice engine post, with the LLM calling your menu and order tools.
TL;DR
A caller dials your number → your webhook returns the <Stream> verb → Sautikit forks live call audio to your WebSocket bridge → you relay it to the Gemini Live API → synthesized speech flows back into the call.
Give Gemini tools (addItem, applyModifier, getTotal, placeOrder) so the model turns spoken orders into structured tickets.
Use outputSamplingRate="16000" for 16 kHz mono PCM end to end; your WS server must accept the audio.drachtio.org subprotocol.
<Stream> is served as XML today. A native JSON stream form is rolling out; XML is the working path right now.
A quick-service chain in Nairobi loses orders every busy evening: lines are engaged, staff are at the counter, and callers hang up. Voice ordering fixes the bottleneck without hiring a night shift. A phone number reaches everyone (a feature phone, a landline, a roaming SIM) with nothing to install, and as you expand to new cities across Africa the same line scales with you.
The hard part was always turning a spoken, messy order into structured data your kitchen can act on. An LLM that can call your functions mid-conversation does exactly that: it hears "make that two, no onions," and calls addItem and applyModifier with the right arguments.
Inbound call
→ Sautikit voice_callback returns RAW XML <Stream .../>
→ Sautikit opens a WebSocket to your Node 'ws' bridge (binary PCM in/out)
→ your bridge relays PCM ⇄ Google Gemini Live API (WebSocket)
→ Gemini calls YOUR tools: addItem / applyModifier / getTotal / placeOrder
→ Gemini's synthesized PCM is written back on the Sautikit socket
→ audio plays into the live call; the order lands in your system
Two WebSockets, one bridge process. Sautikit is the telephony leg; Gemini is the intelligence leg. Your bridge translates bytes between them and hosts the menu tools the model calls.
Claim a number (from KES 116, instant), then set its routing_url to the webhook that returns your stream directive. You can do this from the dashboard or with a PATCH request.
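The webhook behind routing_url only has to answer with the stream directive. Here is a minimal sketch, assuming the directive sits in a `<Response>` wrapper like the `<Dial>` example later in this post, and that `url` and `outputSamplingRate` are the attribute names. The bridge URL and the helper names (`streamDirective`, `handleVoiceCallback`) are hypothetical; check the Sautikit docs for the exact shape.

```javascript
// Escape a value for safe use inside an XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the stream directive.
// ASSUMPTION: the <Response> wrapper and attribute names mirror the other
// examples in this post; verify them against the Sautikit docs.
function streamDirective(bridgeUrl) {
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    "<Response>\n" +
    `  <Stream url="${escapeXmlAttr(bridgeUrl)}" outputSamplingRate="16000"/>\n` +
    "</Response>"
  );
}

// Node http-style handler: answer every voice callback with the directive.
function handleVoiceCallback(req, res) {
  // Hypothetical bridge URL; point this at your own WebSocket server.
  const xml = streamDirective("wss://bridge.example.com/orders");
  res.writeHead(200, { "Content-Type": "application/xml" });
  res.end(xml);
}
```

The handler has the `(req, res)` signature that Node's built-in `http.createServer` expects, so it can be passed to it directly or mounted as a route in whatever framework you already run.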
Sautikit connects to your url and requires the audio.drachtio.org subprotocol. Reject the handshake if it is absent. Incoming messages are binary PCM frames. Each call gets one Gemini Live session, and that session is handed your menu tools.
import { WebSocketServer } from "ws";
import { openGeminiSession } from "./gemini.js";
import { createOrder } from "./order.js";

const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});

wss.on("connection", async (sautiSocket) => {
  // A fresh order cart for this caller.
  const order = createOrder();

  const gemini = await openGeminiSession({
    // The tools Gemini may call while the caller talks.
    tools: {
      addItem: ({ name, quantity = 1 }) => order.addItem(name, quantity),
      applyModifier: ({ item, modifier }) => order.applyModifier(item, modifier),
      getTotal: () => ({ total_kes: order.total() }),
      placeOrder: () => order.place(), // files the ticket into your system
    },
    // Gemini → call: write synthesized PCM back on the SAME Sautikit socket.
    onAudio: (pcmChunk) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) {
        sautiSocket.send(pcmChunk); // 16-bit LE PCM, plays into the call
      }
    },
    onInterrupt: () => {
      // Caller barged in: drop queued outbound audio here.
    },
  });

  // Call → Gemini: forward each inbound PCM frame.
  sautiSocket.on("message", (data, isBinary) => {
    if (isBinary) gemini.sendAudio(data); // 16 kHz mono PCM
  });

  sautiSocket.on("close", () => gemini.close());
  sautiSocket.on("error", () => gemini.close());
});
The bridge stays thin: bytes from the call go up to Gemini, bytes from Gemini go back into the call. The conversation logic and the menu tools live in the session.
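The onInterrupt hook is easier to fill in if outbound audio passes through a small queue rather than going straight to the socket. The sketch below assumes 16 kHz, 16-bit mono PCM, so 20 ms of audio is 640 bytes. The class name, the frame size, and the idea of pacing frames on a 20 ms timer are illustrative assumptions, not Sautikit requirements.

```javascript
// 16,000 samples/s * 2 bytes per sample * 0.02 s = 640 bytes per 20 ms frame.
const FRAME_BYTES = 640;

class OutboundAudioQueue {
  constructor(send) {
    this.send = send; // function that writes one frame to the call socket
    this.pending = Buffer.alloc(0);
  }

  // Append synthesized PCM received from the model.
  enqueue(pcmChunk) {
    this.pending = Buffer.concat([this.pending, pcmChunk]);
  }

  // Send at most one frame of up to FRAME_BYTES; call this on a 20 ms timer.
  // Returns true if a frame was sent, false if the queue was empty.
  tick() {
    if (this.pending.length === 0) return false;
    const frame = this.pending.subarray(0, FRAME_BYTES);
    this.pending = this.pending.subarray(frame.length);
    this.send(frame);
    return true;
  }

  // Barge-in: forget everything that has not been sent yet.
  clear() {
    this.pending = Buffer.alloc(0);
  }
}
```

In the bridge, onAudio would call `queue.enqueue(pcmChunk)`, onInterrupt would call `queue.clear()`, and a `setInterval(() => queue.tick(), 20)` started per call (and cleared on close) would drain the queue. The model can generate audio faster than real time, so without a queue there is nothing left on your side to drop when the caller interrupts.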
Inside the Gemini session you declare those tools and dispatch tool-call messages to your handlers. Model IDs and exact field names move fast; check the current ai.google.dev Live API docs for the live model ID and config schema. The pattern is stable.
import WebSocket from "ws";

// NOTE: model id, message field names, and config keys change;
// verify against current ai.google.dev Live API docs before shipping.
const GEMINI_URL =
  "wss://generativelanguage.googleapis.com/…?key=" + process.env.GEMINI_API_KEY;

export async function openGeminiSession({ tools, onAudio, onInterrupt }) {
  const ws = new WebSocket(GEMINI_URL);

  // Wait for the socket to open; reject if the connection fails first.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });

  // Setup: pick a live model, request AUDIO out, declare the menu tools.
  ws.send(
    JSON.stringify({
      setup: {
        model: "models/<current-live-model>", // ← from ai.google.dev
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: {
          parts: [
            {
              text:
                "You take phone orders for a quick-service restaurant. " +
                "Use addItem and applyModifier as the caller speaks. " +
                "Read the order back and confirm the total before placeOrder.",
            },
          ],
        },
        tools: [
          {
            functionDeclarations: [
              { name: "addItem", description: "Add a menu item with a quantity." },
              { name: "applyModifier", description: "Apply a modifier, e.g. no onions." },
              { name: "getTotal", description: "Return the running order total in KES." },
              { name: "placeOrder", description: "File the confirmed order." },
            ],
          },
        ],
      },
    })
  );

  ws.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());

    // Synthesized audio out → play into the call.
    const audioB64 = extractInlineAudio(msg); // per current schema
    if (audioB64) onAudio(Buffer.from(audioB64, "base64"));

    // The model wants to call one of YOUR menu tools.
    for (const call of msg?.toolCall?.functionCalls ?? []) {
      const handler = tools[call.name];
      const result = handler ? await handler(call.args ?? {}) : { error: "unknown" };
      ws.send(
        JSON.stringify({
          toolResponse: {
            // Echo the call id so the model can match the response to its
            // call (verify the field against the current schema).
            functionResponses: [{ id: call.id, name: call.name, response: result }],
          },
        })
      );
    }

    // Barge-in: caller interrupted the model.
    if (msg?.serverContent?.interrupted) onInterrupt();
  });

  return {
    // Upload one PCM frame from the call, base64-encoded.
    sendAudio(pcm) {
      ws.send(
        JSON.stringify({
          realtimeInput: {
            mediaChunks: [
              { mimeType: "audio/pcm;rate=16000", data: pcm.toString("base64") },
            ],
          },
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}

// Pull base64 PCM out of a Gemini Live server message.
// Verify the exact path against current ai.google.dev docs.
function extractInlineAudio(msg) {
  const parts = msg?.serverContent?.modelTurn?.parts ?? [];
  for (const p of parts) {
    if (p?.inlineData?.data) return p.inlineData.data; // base64 PCM
  }
  return null;
}
The load-bearing details: request AUDIO as a response modality, tag uploaded chunks as audio/pcm;rate=16000, declare your tools in setup, and answer each toolCall with a matching toolResponse. When the caller says "add a large fries," the model calls addItem({ name: "fries", quantity: 1 }) and applyModifier({ item: "fries", modifier: "large" }); when they confirm, it calls placeOrder().
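The declarations in the setup message above carry only a name and a description. The model tends to pick better arguments when each tool also declares its parameters. The sketch below assumes the Gemini function-declaration format accepts an OpenAPI-style `parameters` object; confirm the type names and casing against the current ai.google.dev docs. The `missingArgs` helper is hypothetical, added so a handler can refuse an incomplete call before touching the cart.

```javascript
// ASSUMPTION: Gemini function declarations accept an OpenAPI-style
// `parameters` schema; confirm type names against ai.google.dev.
const functionDeclarations = [
  {
    name: "addItem",
    description: "Add a menu item with a quantity.",
    parameters: {
      type: "OBJECT",
      properties: {
        name: { type: "STRING", description: "Menu item name, e.g. fries." },
        quantity: { type: "INTEGER", description: "How many; defaults to 1." },
      },
      required: ["name"],
    },
  },
  {
    name: "applyModifier",
    description: "Apply a modifier, e.g. no onions.",
    parameters: {
      type: "OBJECT",
      properties: {
        item: { type: "STRING", description: "Item already in the cart." },
        modifier: { type: "STRING", description: "Free-text change to the item." },
      },
      required: ["item", "modifier"],
    },
  },
  { name: "getTotal", description: "Return the running order total in KES." },
  { name: "placeOrder", description: "File the confirmed order." },
];

// Hypothetical helper: list the required arguments a tool call is missing.
// Returns null when the tool name is not declared at all.
function missingArgs(call) {
  const decl = functionDeclarations.find((d) => d.name === call.name);
  if (!decl) return null;
  const required = decl.parameters?.required ?? [];
  return required.filter((key) => call.args?.[key] === undefined);
}
```

A dispatcher could call `missingArgs(call)` first and answer with `{ error: "missing: name" }` instead of invoking the handler, which gives the model a chance to ask the caller again.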
Good ordering agents read the order back before committing. That is prompt work: instruct the model to summarize the cart and confirm the KES total (via getTotal) before calling placeOrder. Because placeOrder is your function, that is where you write the ticket to your kitchen display, POS, or database, and where you can reject anything invalid.
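The bridge imports createOrder from ./order.js without showing it. Here is a minimal sketch of what such a cart could look like, with validation in addItem and place. The menu, its KES prices, and the ticket shape are invented for illustration; in the real module you would add `export` in front of `createOrder` and load the menu from your own system.

```javascript
// HYPOTHETICAL menu: names and KES prices are placeholders, not real data.
const MENU = { "chicken burger": 450, fries: 200, soda: 100 };

function createOrder() {
  const lines = []; // each line: { name, quantity, modifiers }

  const total = () =>
    lines.reduce((sum, line) => sum + MENU[line.name] * line.quantity, 0);

  return {
    // Add a menu item; returns an error object the model can read back.
    addItem(name, quantity = 1) {
      const key = String(name).toLowerCase();
      if (!Object.hasOwn(MENU, key)) return { error: `not on the menu: ${name}` };
      if (!Number.isInteger(quantity) || quantity < 1) {
        return { error: "quantity must be a positive integer" };
      }
      lines.push({ name: key, quantity, modifiers: [] });
      return { ok: true };
    },

    // Attach a modifier to the most recently added matching line.
    applyModifier(item, modifier) {
      const key = String(item).toLowerCase();
      const line = [...lines].reverse().find((l) => l.name === key);
      if (!line) return { error: `not in the cart: ${item}` };
      line.modifiers.push(String(modifier));
      return { ok: true };
    },

    total,

    // Reject an empty cart; otherwise return the ticket to file.
    place() {
      if (lines.length === 0) return { error: "the cart is empty" };
      // Write to your kitchen display, POS, or database here.
      return { ok: true, ticket: { lines, total_kes: total() } };
    },
  };
}
```

Returning `{ error: ... }` rather than throwing keeps the failure inside the tool response, so the model can tell the caller that an item is unavailable instead of the call going silent.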
For edge cases the model cannot resolve (a special request, an allergy question, a payment dispute), fall back to a human. End the stream and dial a staff phone with the <Dial> verb:
<Response>
  <Say language="en-KE">One moment, connecting you to our team.</Say>
  <Dial>
    <Number>+254700000001</Number>
  </Dial>
</Response>
You can return this from a follow-up webhook when your placeOrder handler (or a tool the model calls, like escalate) decides the call needs a person.
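One way to wire the escalation path is sketched below. It assumes an `escalate` tool, which is hypothetical and would need to be declared alongside the other tools, plus a hypothetical helper that renders the handoff XML shown above. How the follow-up webhook is triggered depends on Sautikit's call-control behaviour, which this sketch does not assume.

```javascript
// Render the handoff response shown above for a given staff number.
function handoffXml(staffNumber) {
  // Accept only E.164-style numbers so nothing unexpected lands in the XML.
  if (!/^\+[1-9]\d{6,14}$/.test(staffNumber)) {
    throw new Error(`not an E.164 number: ${staffNumber}`);
  }
  return [
    "<Response>",
    '  <Say language="en-KE">One moment, connecting you to our team.</Say>',
    "  <Dial>",
    `    <Number>${staffNumber}</Number>`,
    "  </Dial>",
    "</Response>",
  ].join("\n");
}

// HYPOTHETICAL tool state: records why the model gave up, so a follow-up
// webhook can decide to return handoffXml() for this call.
function createEscalation() {
  let reason = null;
  return {
    // Tool handler the model calls, e.g. escalate({ reason: "allergy question" }).
    escalate: ({ reason: why = "unspecified" } = {}) => {
      reason = why;
      return { ok: true };
    },
    needsHuman: () => reason !== null,
    reason: () => reason,
  };
}
```

Keeping one escalation object per call, next to the order cart, lets the follow-up webhook check `needsHuman()` and log the reason before handing the caller to staff.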
Inbound calls are billed per second in KES from the moment the call connects; see /pricing for the current inbound rate and number rental. There is no premium for using <Stream> or AI: it is standard inbound voice.
The Gemini/LLM cost is on your own bill. Google charges you directly for Live API usage; Sautikit only meters the telephony leg. Budget the two separately.
How does the AI turn speech into a structured order?
Through tool calls. You declare addItem, applyModifier, getTotal, and placeOrder in the Gemini setup message. As the caller talks, the model calls them with structured arguments, and your handlers build a clean ticket with no transcription parsing on your side.
Why must the WebSocket accept the audio.drachtio.org subprotocol?
Sautikit negotiates that subprotocol when it opens the socket. If your ws server does not offer it back in the handshake, the connection is refused and no audio flows. Confirm it in your handleProtocols callback.
Can I use JSON instead of the XML <Stream> today?
Return the XML form for now. The stream() helper in @sautikit/node already emits a JSON stream action, but the runtime does not support it yet; the native JSON form of <Stream> is still rolling out. XML is the working path.