A customer dials your delivery line. Instead of "press 1 for order status," a natural voice picks up: "Hi, which order are you calling about?" They read out a tracking number, the agent looks up the shipment, and says "Your parcel is out for delivery, about 40 minutes away. Should I keep the same address in Westlands?" No IVR tree, no app, no data plan. That is an AI voice agent for logistics, and this tutorial wires it end to end.
TL;DR
Point a number's routing at your webhook; return a <Stream> voice action so Sautikit forks live call audio to your WebSocket as raw PCM frames.
Relay those frames to the Google Gemini Live API, expose your logistics tools (lookup shipment, ETA, reschedule) as function calls, and write Gemini's synthesized PCM back on the same socket to speak into the call.
Use outputSamplingRate="16000" so every leg agrees on 16 kHz mono PCM. Your WS server must accept the audio.drachtio.org subprotocol.
Same pattern works outbound: POST /v1/calls to proactively call a customer about a delivery, then run the identical <Stream> flow on answer.
Last-mile delivery across African cities (Nairobi, Lagos, Accra, Kampala) runs on voice. A customer whose parcel is late calls the courier. A driver who can't find the gate calls dispatch. A recipient who moved offices wants to change the drop-off. Each of those is a phone call that today lands on a human, or worse, on nobody, and the delivery fails.
Failed deliveries are expensive: a second dispatch, a return-to-warehouse, an unhappy customer. And the call volume is spiky and repetitive, exactly the shape of work an AI voice agent handles well. The caller speaks naturally ("where's my order, it was meant to come this morning"), the agent understands intent, looks up the shipment, and answers or reschedules, in one call, with no keypad gymnastics.
The hard part has always been the audio pipe: getting live call audio out to an LLM and synthesized speech back in fast enough to feel conversational. Sautikit's <Stream> verb is that pipe, and the Gemini Live API is the brain.
Inbound: customer/driver dials your number
→ Sautikit fetches your routing_url webhook
→ webhook returns RAW XML <Stream .../>
→ Sautikit opens a WebSocket to your Node 'ws' bridge (binary PCM in/out)
→ bridge relays PCM ⇄ Google Gemini Live API (WebSocket)
→ Gemini calls YOUR tools: trackShipment(), reschedule(), confirmAddress()
→ Gemini's synthesized PCM is written back on the Sautikit socket
→ audio plays into the live call
Outbound: POST /v1/calls to proactively call about a delivery
→ on answer, the same routing_url returns the same <Stream> → same bridge
Two WebSockets, one bridge process. Sautikit is the telephony leg; Gemini is the intelligence leg; your bridge is the translator that keeps sample rates aligned and runs the logistics tool-calls. The LLM never touches your database directly: it asks for a lookup or a reschedule via a function call, and your code decides what to do.
When a call connects, Sautikit fetches your webhook. For a realtime AI agent you return raw XML with a <Stream> element and set Content-Type: application/xml.
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false }));

app.post("/voice", (req, res) => {
  // req.body carries From, To, etc. Use From to pre-load the caller's
  // recent shipments before Gemini even asks (see Step 3).
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stream
    name="dispatch-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error"
  />
</Response>`;
  res.set("Content-Type", "application/xml");
  res.send(xml);
});

app.listen(3000);
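track="both_tracks" forwards both the caller's audio and the audio played into the call; use inbound_track if you only want the caller's voice to reach Gemini, which is usually what an AI agent needs so the model does not hear its own speech. outputSamplingRate="16000" tells Sautikit to deliver 16 kHz mono PCM, the common input rate for realtime LLM audio. Keep every leg at the same rate where you can; if the model's output rate differs, resample in the bridge before writing audio back to the socket.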
Stream attributes worth knowing: url (wss, required), track (inbound_track | outbound_track | both_tracks, required), outputSamplingRate (8000 | 16000, required; use 16000 for AI), plus optional name, headerMetadata (JSON handshake headers), openMetadata (an opaque UTF-8 first text frame you can use to pass the caller's From or an order id to your bridge), statusCallback, and statusEvents. Audio is always 16-bit LE PCM.
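Because the sample format is fixed (16-bit little-endian, mono), frame sizes and durations follow from simple arithmetic, which helps when sizing buffers or sanity-checking what arrives on the socket. A minimal sketch: the helper names are illustrative, and the 20 ms frame length is an assumption for the example, not a documented Sautikit frame size.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit PCM, one channel

// Bytes needed for `ms` milliseconds of mono 16-bit PCM at `sampleRate` Hz.
function pcmBytesForMs(ms, sampleRate) {
  return Math.round((sampleRate * ms) / 1000) * BYTES_PER_SAMPLE;
}

// Duration in milliseconds of a mono 16-bit PCM buffer of `byteLength` bytes.
function pcmDurationMs(byteLength, sampleRate) {
  return (byteLength / BYTES_PER_SAMPLE / sampleRate) * 1000;
}

console.log(pcmBytesForMs(20, 16000)); // 640
console.log(pcmBytesForMs(20, 8000)); // 320
console.log(pcmDurationMs(32000, 16000)); // 1000
```

At 16 kHz, one second of call audio is 32,000 bytes, so even a long call is a modest stream for a single bridge process.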
Sautikit connects to your url and requires the audio.drachtio.org WebSocket subprotocol. Reject the handshake if it is absent. Incoming messages are binary PCM frames. The bridge relays audio both ways and wires Gemini's function calls to your logistics backend.
import { WebSocketServer } from "ws";
import { openGeminiSession } from "./gemini.js";
import { trackShipment, reschedule } from "./logistics.js";

const SUBPROTOCOL = "audio.drachtio.org";

const wss = new WebSocketServer({
  port: 8080,
  // Select the subprotocol Sautikit offers during the handshake.
  handleProtocols: (protocols) =>
    protocols.has(SUBPROTOCOL) ? SUBPROTOCOL : false,
});

wss.on("connection", async (sautiSocket) => {
  // Depending on the ws version, returning false above may not abort the
  // handshake, so also check what was actually negotiated.
  if (sautiSocket.protocol !== SUBPROTOCOL) {
    sautiSocket.close(1002, "subprotocol required");
    return;
  }

  // One Gemini Live session per call.
  const gemini = await openGeminiSession({
    // Gemini → call: write synthesized PCM back on the SAME Sautikit socket.
    onAudio: (pcmChunk) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) {
        sautiSocket.send(pcmChunk); // binary frame plays into the call
      }
    },
    // Gemini asks to run one of your logistics tools.
    onToolCall: async (name, args) => {
      if (name === "trackShipment") {
        // Look up by tracking id, or fall back to the caller's From number.
        return await trackShipment(args.trackingId, args.callerFrom);
      }
      if (name === "reschedule") {
        return await reschedule(args.trackingId, args.newDate, args.address);
      }
      return { error: "unknown_tool" };
    },
    // Barge-in: caller started talking, stop the current reply.
    onInterrupt: () => {
      // Optionally drop queued outbound chunks here.
    },
  });

  // Call → Gemini: forward each inbound PCM frame.
  sautiSocket.on("message", (data, isBinary) => {
    if (isBinary) gemini.sendAudio(data); // 16 kHz mono PCM
  });
  sautiSocket.on("close", () => gemini.close());
  sautiSocket.on("error", () => gemini.close());
});
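The onInterrupt hook above is left empty. One way to fill it, sketched under the assumption that you pace outbound audio yourself instead of writing each chunk the moment it arrives: keep a small FIFO of pending chunks and clear it on barge-in. OutboundAudioQueue is a hypothetical helper, not part of Sautikit or the Gemini SDK.

```javascript
// Hypothetical helper: a FIFO of PCM chunks waiting to be written to the call.
class OutboundAudioQueue {
  constructor() {
    this.chunks = [];
  }

  // Queue a chunk of synthesized PCM (a Buffer).
  push(chunk) {
    this.chunks.push(chunk);
  }

  // Take the next chunk to send, or undefined if nothing is queued.
  next() {
    return this.chunks.shift();
  }

  // Barge-in: discard everything not yet sent; returns the bytes dropped.
  clear() {
    const dropped = this.chunks.reduce((total, c) => total + c.length, 0);
    this.chunks = [];
    return dropped;
  }

  get size() {
    return this.chunks.length;
  }
}

const queue = new OutboundAudioQueue();
queue.push(Buffer.alloc(640));
queue.push(Buffer.alloc(640));
console.log(queue.clear()); // 1280
console.log(queue.size); // 0
```

In the bridge, onAudio would call queue.push, a timer would call queue.next() and write the result to the Sautikit socket, and onInterrupt would call queue.clear() so the caller is not talked over by stale audio.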
The Gemini Live session declares the tools and dispatches function calls back to the bridge. Model names and exact field names move fast; check the current ai.google.dev Live API docs for the live model ID and config schema. The pattern below is stable.
import WebSocket from "ws";

// The tools Gemini may call. Descriptions are the "prompt" the model reads.
const tools = [
  {
    functionDeclarations: [
      {
        name: "trackShipment",
        description:
          "Look up a delivery by tracking id, or by the caller's phone " +
          "number if they don't have one. Returns status and ETA.",
        parameters: {
          type: "object",
          properties: {
            trackingId: { type: "string" },
            callerFrom: { type: "string", description: "E.164 caller number" },
          },
        },
      },
      {
        name: "reschedule",
        description: "Reschedule a delivery to a new date and/or address.",
        parameters: {
          type: "object",
          properties: {
            trackingId: { type: "string" },
            newDate: { type: "string", description: "ISO date" },
            address: { type: "string" },
          },
          required: ["trackingId"],
        },
      },
    ],
  },
];

// NOTE: model id, message field names, and config keys change;
// verify against current ai.google.dev Live API docs before shipping.
const GEMINI_URL =
  "wss://generativelanguage.googleapis.com/…?key=" + process.env.GEMINI_API_KEY;

export async function openGeminiSession({ onAudio, onToolCall, onInterrupt }) {
  const ws = new WebSocket(GEMINI_URL);

  // Wait for the socket to open; reject if the connection fails.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });

  // Setup: live model, AUDIO out, and our logistics tools.
  // Check the docs for the model's output sample rate and resample in the
  // bridge if it is not 16 kHz.
  ws.send(
    JSON.stringify({
      setup: {
        model: "models/<current-live-model>", // ← from ai.google.dev
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: {
          parts: [
            {
              text:
                "You are a delivery dispatch agent for a courier company. " +
                "Help callers check ETAs, confirm addresses, and reschedule " +
                "drop-offs. Be concise. Use the tools to look up real data; " +
                "never invent a shipment status or time.",
            },
          ],
        },
        tools,
      },
    })
  );

  ws.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());

    // Synthesized audio out → play into the call.
    const parts = msg?.serverContent?.modelTurn?.parts ?? [];
    for (const p of parts) {
      const b64 = p?.inlineData?.data;
      if (b64) onAudio(Buffer.from(b64, "base64")); // base64 PCM
    }

    // Tool call: run it and send the result back so Gemini can speak it.
    const calls = msg?.toolCall?.functionCalls ?? [];
    for (const call of calls) {
      const result = await onToolCall(call.name, call.args ?? {});
      ws.send(
        JSON.stringify({
          toolResponse: {
            functionResponses: [
              { id: call.id, name: call.name, response: { result } },
            ],
          },
        })
      );
    }

    // Barge-in: caller interrupted the model.
    if (msg?.serverContent?.interrupted) onInterrupt();
  });

  return {
    // Upload one chunk of caller audio as base64-encoded 16 kHz PCM.
    sendAudio(pcm) {
      ws.send(
        JSON.stringify({
          realtimeInput: {
            mediaChunks: [
              { mimeType: "audio/pcm;rate=16000", data: pcm.toString("base64") },
            ],
          },
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}
Your logistics.js is ordinary business code, nothing AI-specific:
// Hypothetical data-access module; swap in your own database client.
import { db } from "./db.js";

export async function trackShipment(trackingId, callerFrom) {
  // Prefer the tracking id; fall back to the most recent shipment
  // for the caller's phone number.
  const shipment = trackingId
    ? await db.shipments.byTrackingId(trackingId)
    : await db.shipments.latestForPhone(callerFrom);
  if (!shipment) return { found: false };
  return {
    found: true,
    status: shipment.status, // e.g. "out_for_delivery"
    etaMinutes: shipment.etaMinutes,
    address: shipment.address,
  };
}

export async function reschedule(trackingId, newDate, address) {
  // Only write the fields the caller actually changed, so a date-only
  // reschedule does not blank out the stored address.
  const changes = {};
  if (newDate) changes.newDate = newDate;
  if (address) changes.address = address;
  await db.shipments.update(trackingId, changes);
  return { rescheduled: true, ...changes };
}
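Since the model supplies the tool arguments, it is worth validating them before they reach the database. A minimal sketch of that check for the reschedule tool: validateRescheduleArgs and its error codes are hypothetical, and the rule that the new date must not be in the past is an assumed business rule.

```javascript
// Hypothetical validator for the `reschedule` tool arguments.
// `today` is injected as "YYYY-MM-DD" so the check is easy to test.
function validateRescheduleArgs(args, today) {
  const { trackingId, newDate, address } = args ?? {};
  if (typeof trackingId !== "string" || trackingId.trim() === "") {
    return { ok: false, error: "missing_tracking_id" };
  }
  if (newDate === undefined && address === undefined) {
    return { ok: false, error: "nothing_to_change" };
  }
  if (newDate !== undefined) {
    // Accept plain ISO calendar dates only, e.g. "2025-03-14".
    const looksIso =
      typeof newDate === "string" && /^\d{4}-\d{2}-\d{2}$/.test(newDate);
    if (!looksIso || Number.isNaN(Date.parse(newDate))) {
      return { ok: false, error: "invalid_date" };
    }
    // Zero-padded ISO dates sort correctly as plain strings.
    if (newDate < today) return { ok: false, error: "date_in_past" };
  }
  return { ok: true };
}

console.log(validateRescheduleArgs({ trackingId: "TRK1", newDate: "2025-03-14" }, "2025-03-10"));
console.log(validateRescheduleArgs({ trackingId: "TRK1", newDate: "tomorrow" }, "2025-03-10"));
```

Returning a structured error instead of throwing lets the bridge hand the failure back to Gemini as the tool result, so the agent can ask the caller for a usable date.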
The load-bearing details: request AUDIO as a response modality, tag uploaded chunks as audio/pcm;rate=16000, declare your tools in setup, and return each toolCall result via toolResponse so the model can voice it. Everything else is prompt and policy.
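One assumption in the relay deserves a check: it writes Gemini's audio straight onto a 16 kHz socket. Google's Live API documentation has described output audio as 24 kHz PCM; if that holds for your model, the bridge must downsample before writing, or speech will play back slow and low-pitched. A minimal sketch using linear interpolation: resamplePcm16 is an illustrative name, and a production bridge would low-pass filter first or use a dedicated resampling library.

```javascript
// Resample 16-bit little-endian mono PCM from `fromRate` to `toRate`
// using linear interpolation between neighbouring samples.
function resamplePcm16(input, fromRate, toRate) {
  const inSamples = Math.floor(input.length / 2);
  const outSamples = Math.floor((inSamples * toRate) / fromRate);
  const output = Buffer.alloc(outSamples * 2);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outSamples; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, inSamples - 1); // clamp at the last sample
    const frac = pos - i0;
    const s0 = input.readInt16LE(i0 * 2);
    const s1 = input.readInt16LE(i1 * 2);
    output.writeInt16LE(Math.round(s0 + (s1 - s0) * frac), i * 2);
  }
  return output;
}

// Six samples at 24 kHz become four samples at 16 kHz.
const src = Buffer.alloc(12);
[0, 300, 600, 900, 1200, 1500].forEach((s, i) => src.writeInt16LE(s, i * 2));
const out = resamplePcm16(src, 24000, 16000);
console.log(out.length / 2); // 4
```

In the bridge this would wrap the onAudio callback, for example sautiSocket.send(resamplePcm16(pcmChunk, 24000, 16000)). Resampling chunk by chunk leaves a small discontinuity at each boundary; carrying the fractional position across chunks removes it.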
The same bridge powers proactive calls: "Your parcel is 15 minutes away, will someone be there?" Trigger it from your dispatch logic when a driver goes out for delivery.
await fetch("https://api.sautikit.com/v1/calls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SAUTIKIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    to: ["+254712345678"], // the customer
    from: "+254711000001", // your claimed number
    clientRequestId: "delivery-42", // your idempotency/reference tag
  }),
});
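Before placing the call, it is cheap to reject malformed numbers locally instead of spending a request on them. A small sketch: isE164 and buildCallBody are illustrative helpers, and the body shape simply mirrors the request above.

```javascript
// E.164: "+" followed by at most 15 digits, the first of which is non-zero.
function isE164(number) {
  return typeof number === "string" && /^\+[1-9]\d{1,14}$/.test(number);
}

// Build the JSON body for POST /v1/calls, or throw on a malformed number.
function buildCallBody(to, from, clientRequestId) {
  for (const n of [to, from]) {
    if (!isE164(n)) throw new Error(`not an E.164 number: ${n}`);
  }
  return JSON.stringify({ to: [to], from, clientRequestId });
}

console.log(isE164("+254712345678")); // true
console.log(isE164("0712345678")); // false
console.log(buildCallBody("+254712345678", "+254711000001", "delivery-42"));
```

Numbers stored in local format (such as 0712...) need normalizing to E.164 before they reach this check.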
When the customer answers, Sautikit fetches the same routing_url, gets the same <Stream> XML, and connects to the same bridge. Pass the shipment id via openMetadata on the <Stream> element (or look it up by the to number) so the agent opens with context instead of asking the caller to identify themselves.
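Passing context through openMetadata means embedding a string inside an XML attribute, so it has to be escaped. A minimal sketch, assuming you choose to encode the context as JSON (the attribute itself is opaque); buildStreamXml and the shipmentId field are hypothetical.

```javascript
// Escape a string for use inside a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical builder: encodes call context as JSON in openMetadata.
function buildStreamXml({ url, context }) {
  const meta = escapeXmlAttr(JSON.stringify(context));
  return (
    `<?xml version="1.0" encoding="UTF-8"?>` +
    `<Response><Stream url="${escapeXmlAttr(url)}" track="both_tracks" ` +
    `outputSamplingRate="16000" openMetadata="${meta}" /></Response>`
  );
}

console.log(
  buildStreamXml({
    url: "wss://your-app.example.com/audio",
    context: { shipmentId: "delivery-42" },
  })
);
```

Escaping the ampersand first matters: doing it last would re-escape the entities produced by the other replacements.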
Voice is billed per second in KES from the moment the call connects, not rounded up to a full minute. Inbound is free (KES 0); outbound bills at KES 3.00/min (KES 0.05/sec). So a 40-second outbound ETA call costs about KES 2.00. See /pricing for the source of truth.
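The arithmetic is simple enough to fold into a dispatch dashboard or a budget alert. A sketch using the rates quoted above, working in cents to avoid floating-point drift; callCostKes is an illustrative helper, and rounding partial seconds up is an assumption to confirm against /pricing.

```javascript
const OUTBOUND_CENTS_PER_SECOND = 5; // KES 0.05/sec, i.e. KES 3.00/min
const INBOUND_CENTS_PER_SECOND = 0; // inbound is free

// Telephony cost in KES for a connected call lasting `seconds`.
// Assumption: a partial second is billed as a whole second.
function callCostKes(seconds, direction) {
  const rate =
    direction === "outbound"
      ? OUTBOUND_CENTS_PER_SECOND
      : INBOUND_CENTS_PER_SECOND;
  return (Math.ceil(seconds) * rate) / 100;
}

console.log(callCostKes(40, "outbound").toFixed(2)); // "2.00"
console.log(callCostKes(90, "outbound").toFixed(2)); // "4.50"
console.log(callCostKes(300, "inbound").toFixed(2)); // "0.00"
```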
Gemini (or other LLM) usage is billed separately by the model provider, Google in this case, on your own account; Sautikit only meters the telephony leg. Streaming audio through an LLM does not change Sautikit's per-second voice rate.
How does the agent know which shipment the caller means?
Two ways: the caller reads a tracking id, or your trackShipment tool falls back to the caller's From number to find their most recent delivery. You can pre-load that lookup in the /voice webhook using req.body.From and pass it into the session via openMetadata.
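On the bridge side, that context arrives as the first text frame, ahead of the binary audio. A minimal sketch of reading it, assuming the webhook encoded the context as JSON; parseOpenMetadata and the from field are hypothetical.

```javascript
// Hypothetical: interpret a text frame as JSON call context.
// Returns null for binary (audio) frames or anything that is not a JSON object.
function parseOpenMetadata(data, isBinary) {
  if (isBinary) return null;
  try {
    const parsed = JSON.parse(data.toString("utf8"));
    return parsed !== null && typeof parsed === "object" ? parsed : null;
  } catch {
    return null;
  }
}

const frame = Buffer.from('{"from":"+254712345678"}', "utf8");
console.log(parseOpenMetadata(frame, false)); // { from: '+254712345678' }
console.log(parseOpenMetadata(Buffer.alloc(640), true)); // null
```

In the bridge's message handler, a non-binary frame would go through this function and be stored as the call's context, while binary frames continue on to gemini.sendAudio.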
Do inbound and outbound need different code?
No. Outbound differs only in the trigger (POST /v1/calls). On answer, both directions fetch the same routing_url, return the same <Stream>, and run the same bridge and tools.
Why must the WebSocket accept the audio.drachtio.org subprotocol?
Sautikit negotiates that subprotocol when it opens the socket. If your ws server does not offer it back during the handshake, the connection is refused and no audio flows. Confirm it in your handleProtocols callback.