An AI appointment-booking voicebot answers an inbound call and lets the caller book, reschedule, or cancel in ordinary spoken language. No "press 1 for bookings" menu, no rigid script. The caller says "I'd like to move my cleaning to next Thursday afternoon," and the agent confirms the slot, updates the calendar, and reads back the new time.
With Sautikit you get the real-time audio pipe and the phone number; you bring the intelligence. The <Stream> verb forks live caller audio to your WebSocket server, you relay it to the LLM of your choice for speech-to-text, reasoning, and text-to-speech, and you stream audio back into the call. You own the model, the prompt, and the booking logic. Sautikit bills the call by the second.
The loop is real-time and full-duplex:
1. A caller dials your Sautikit number, whose routing_url points to your voice webhook.
2. Your webhook replies with a <Response> containing a <Stream> verb.
3. Sautikit opens a WebSocket to the url in that verb. Your server must advertise the audio.drachtio.org subprotocol during the handshake, or the connection is rejected.
Because the stream is bidirectional on one socket, the caller and the agent can talk over each other, and the agent can barge in or pause, which is exactly what natural conversation needs.
Your server owns the conversation state and the calendar. Use the call SID (sent in the stream metadata) to key a session, run tool-calls from the LLM against your booking database or calendar API, and confirm each change back to the caller in speech before you commit it.
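The confirm-before-commit rule above can be sketched as a small session store keyed by call SID. This is a minimal illustration, not Sautikit's API: the names getSession, proposeChange, and resolvePending, the shape of the change object, and the in-memory Map are all assumptions, and a real commit callback would most likely be an async call to your calendar API.

```javascript
// Hypothetical in-memory session store, keyed by the call SID from the stream metadata.
const sessions = new Map();

function getSession(callSid) {
  if (!sessions.has(callSid)) {
    sessions.set(callSid, { callSid, pending: null, committed: [] });
  }
  return sessions.get(callSid);
}

// Stage a change the model wants to make. Nothing is written yet;
// the returned sentence is what the agent reads back to the caller.
function proposeChange(callSid, change) {
  const session = getSession(callSid);
  session.pending = change;
  return `Just to confirm: ${change.action} ${change.service} on ${change.when}?`;
}

// Commit the staged change only if the caller said yes; otherwise discard it.
// `commit` is a placeholder for your own booking write.
function resolvePending(callSid, callerConfirmed, commit) {
  const session = getSession(callSid);
  const change = session.pending;
  if (!change) return { status: "nothing_pending" };
  session.pending = null;
  if (!callerConfirmed) return { status: "discarded", change };
  commit(change);
  session.committed.push(change);
  return { status: "committed", change };
}

// Drop the session when the call ends.
function endSession(callSid) {
  sessions.delete(callSid);
}
```

Keeping the staged change separate from the committed list means a dropped call or a misheard "yes" never leaves a half-written booking behind.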
Endpoints you call:
- POST /v1/numbers: claim a phone number for the voicebot.
- PATCH /v1/numbers/{number_id}: set or update the routing_url (your voice webhook).
- GET /v1/calls/{call_sid}: fetch the call detail record after the call ends, for logging and billing reconciliation.
Voice actions used:
- Stream: fork live caller audio to your WebSocket and play audio back. This is the core of the real-time loop.
- Say: optional TTS greeting before the stream opens.
- Dial: optional human handoff that connects the caller to a receptionist when the AI hits something it can't resolve.
When Sautikit POSTs to your routing_url, reply with the following response. It opens a bidirectional stream at 16 kHz, the higher of the two supported rates and the one to use for AI models.
<Response>
  <Stream
    name="booking-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>
Attribute notes:
- url (required): your wss:// WebSocket endpoint.
- track (required): inbound_track, outbound_track, or both_tracks. Use both_tracks so the agent hears the caller and its own output.
- outputSamplingRate (required): 8000 or 16000. Use 16000 for AI models.
- name (optional): a label echoed back in stream events.
- statusCallback / statusEvents (optional): where lifecycle events are POSTed, and which ones (stream-started, stream-stopped, stream-error).
You can also pass headerMetadata (a JSON blob sent as HTTP handshake headers, handy for auth) and openMetadata (opaque UTF-8 delivered in the first text frame).
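If your webhook builds this reply in code rather than serving a static file, the attribute rules above can be checked before the XML goes out. The following is a sketch only: buildStreamResponse and escapeAttr are hypothetical helper names, and the validation simply mirrors the attribute notes (a wss:// url, one of the three track values, a rate of 8000 or 16000). It does not cover headerMetadata or openMetadata.

```javascript
const TRACKS = new Set(["inbound_track", "outbound_track", "both_tracks"]);
const RATES = new Set([8000, 16000]);

// Escape characters that would break a double-quoted XML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: validate the required attributes, then render the reply.
function buildStreamResponse({ url, track, outputSamplingRate, name, statusCallback, statusEvents }) {
  if (typeof url !== "string" || !url.startsWith("wss://")) {
    throw new Error("url must be a wss:// endpoint");
  }
  if (!TRACKS.has(track)) throw new Error(`unsupported track: ${track}`);
  if (!RATES.has(outputSamplingRate)) {
    throw new Error(`unsupported outputSamplingRate: ${outputSamplingRate}`);
  }
  const attrs = {
    name,
    url,
    track,
    outputSamplingRate,
    statusCallback,
    // statusEvents is passed as an array and joined with spaces
    statusEvents: statusEvents && statusEvents.join(" "),
  };
  const rendered = Object.entries(attrs)
    .filter(([, v]) => v !== undefined && v !== null && v !== "")
    .map(([k, v]) => `    ${k}="${escapeAttr(v)}"`)
    .join("\n");
  return `<Response>\n  <Stream\n${rendered} />\n</Response>`;
}
```

Optional attributes that are left undefined are simply omitted, so the same helper serves both a bare stream and one with status callbacks.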
Your WebSocket server is the glue between the Sautikit socket and your LLM. Advertising the audio.drachtio.org subprotocol is mandatory.
import { WebSocketServer } from "ws";
import { connectLLM } from "./llm.js"; // your Gemini Live / OpenAI wrapper

const wss = new WebSocketServer({
  port: 8080,
  // MUST advertise this subprotocol or Sautikit rejects the handshake
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});

wss.on("connection", (sautikit) => {
  // Open your model session (STT + reasoning + TTS)
  const llm = connectLLM({
    systemPrompt:
      "You are the booking agent for Whitedent Clinic, Nairobi. " +
      "Book, reschedule, or cancel appointments. Confirm the date and " +
      "time back to the caller before committing. Speak the caller's language.",
    onAudio: (pcm) => sautikit.send(pcm), // model audio -> back into the call
  });

  sautikit.on("message", (data, isBinary) => {
    if (isBinary) {
      // Live caller audio: 16-bit LE signed PCM -> feed the model
      llm.pushAudio(data);
      return;
    }
    // First text frame carries openMetadata (call SID, etc.)
    try {
      const meta = JSON.parse(data.toString());
      llm.setContext({ callSid: meta.call_sid });
    } catch (err) {
      // openMetadata is opaque UTF-8, so skip frames that are not JSON
      console.warn("Unparseable text frame:", err.message);
    }
  });

  // Release the model session when the call ends
  sautikit.on("close", () => llm.close());
  // Log socket errors; ws emits "close" afterwards, which runs the cleanup above
  sautikit.on("error", (err) => console.error("Socket error:", err.message));
});
Your connectLLM wrapper is where booking tool-calls live: when the model decides to write a slot, call your calendar API, then have the model confirm the result to the caller.
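One way to structure those tool-calls inside the wrapper is a dispatcher that maps a tool name from the model to a calendar operation and returns a result the model can speak. Everything here is an assumption for illustration: the tool names (book_appointment, cancel_appointment, reschedule_appointment), the { name, args } shape of a tool call, and the in-memory calendar that stands in for your real calendar API.

```javascript
// Hypothetical in-memory calendar: slot string -> booking. Replace with your calendar API.
function createCalendar() {
  const slots = new Map();
  return {
    book(slot, caller) {
      if (slots.has(slot)) return { ok: false, reason: "slot_taken" };
      slots.set(slot, { caller });
      return { ok: true, slot };
    },
    cancel(slot) {
      return slots.delete(slot) ? { ok: true, slot } : { ok: false, reason: "not_found" };
    },
    reschedule(from, to, caller) {
      if (!slots.has(from)) return { ok: false, reason: "not_found" };
      if (slots.has(to)) return { ok: false, reason: "slot_taken" };
      slots.delete(from);
      slots.set(to, { caller });
      return { ok: true, slot: to };
    },
  };
}

// Route a model tool call ({ name, args }) to the matching calendar operation.
function createToolDispatcher(calendar) {
  const handlers = new Map([
    ["book_appointment", ({ slot, caller }) => calendar.book(slot, caller)],
    ["cancel_appointment", ({ slot }) => calendar.cancel(slot)],
    ["reschedule_appointment", ({ from, to, caller }) => calendar.reschedule(from, to, caller)],
  ]);
  return function dispatch(toolCall) {
    const handler = handlers.get(toolCall.name);
    // Unknown tools return a structured failure instead of throwing mid-call
    if (!handler) return { ok: false, reason: "unknown_tool" };
    return handler(toolCall.args || {});
  };
}
```

Returning { ok, reason } rather than throwing lets the model explain a failure to the caller ("that slot is taken, would Friday work?") and keep the conversation going.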
To point your number at the webhook, set its routing_url:
curl -X PATCH "https://api.sautikit.com/v1/numbers/{number_id}" \
  -H "Authorization: Bearer $SAUTIKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"routing_url": "https://your-app.example.com/voice"}'
Sautikit bills the inbound call per second in KES for the time the call is live on the platform, and nothing more. There is no per-minute "AI voice" surcharge and no fee for the number of audio frames or WebSocket bytes.
The AI cost (STT, the LLM, and TTS) comes from your own provider: Gemini, OpenAI, or the GPU bill for a self-hosted model. That separation is the point: you pay the telephony leg to Sautikit at raw per-second rates and the intelligence leg to whoever you chose, with no platform tax stacked between them. A three-minute booking call costs you three minutes of per-second inbound telephony plus whatever your model provider charges for three minutes of audio.
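The three-minute example can be worked through as arithmetic. In the sketch below, estimateCallCostKes is a hypothetical helper and both rates are placeholders chosen to make the sums easy to follow, not real Sautikit or provider prices. It also assumes your model provider can be approximated by a per-minute audio rate, which may not match how your provider actually bills.

```javascript
// Round to two decimal places to keep currency sums tidy.
const round2 = (n) => Math.round(n * 100) / 100;

// Hypothetical estimator: telephony is per second, model cost is approximated per minute.
function estimateCallCostKes({ durationSeconds, telephonyRatePerSecondKes, modelRatePerMinuteKes }) {
  const telephonyKes = round2(durationSeconds * telephonyRatePerSecondKes);
  const modelKes = round2((durationSeconds / 60) * modelRatePerMinuteKes);
  return { telephonyKes, modelKes, totalKes: round2(telephonyKes + modelKes) };
}

// Placeholder rates only: 0.5 KES per second of telephony, 2 KES per minute of model audio.
const estimate = estimateCallCostKes({
  durationSeconds: 180,
  telephonyRatePerSecondKes: 0.5,
  modelRatePerMinuteKes: 2,
});
console.log(estimate); // { telephonyKes: 90, modelKes: 6, totalKes: 96 }
```

With these placeholder rates, the call costs 180 x 0.5 = 90 KES on the telephony leg and 3 x 2 = 6 KES on the model leg, and the two lines stay separate so each can be reconciled against its own invoice.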