SautiKit

AI receptionist: a 24/7 virtual front desk that never misses a call

Answer every inbound call with an AI voice agent using the Stream voice action, your own LLM, and a warm transfer to a human via Dial.

Tags: use-case, ai-voice-agent, stream, receptionist, llm

© 2026 Sautikit. All rights reserved • Powered by Helloduty

Sautikit provides voice API services for application developers. Numbers provisioned on this platform are not configured for emergency calling (e.g. 999 / 112). Do not use Sautikit numbers as a replacement for a primary phone line.

Summary

An AI receptionist answers every inbound call, greets the caller by voice, understands what they want in natural language, answers the common questions, and hands off to a human only when it needs to. No hold music, no voicemail, no missed calls after hours. The caller talks; your AI listens, thinks, and talks back in real time.

Sautikit makes this a media problem, not a telephony problem. You attach a webhook to a number, return a <Stream> voice action, and Sautikit forks the live call audio to your WebSocket server as raw PCM. You relay that audio to any LLM voice model (Gemini Live, OpenAI Realtime, or a self-hosted stack) and send the synthesised reply back on the same socket. When the AI decides a human is needed, your flow returns a <Dial> to warm-transfer the caller.

Who this is for

  • SMEs and professional-services offices — a law firm, property agency, clinic, or accountancy — that cannot afford to miss an inbound call.
  • Teams expanding across Africa that need a front desk in every timezone without hiring one per office.
  • Developers building an AI voice agent who want the media transport handled and their own model in control of the conversation.
  • Product managers replacing legacy voicemail and after-hours forwarding with a measurable, always-on agent.

How it works — the real-time loop

  1. A caller dials your Sautikit number. The number's routing_url points at your voice webhook.
  2. Sautikit POSTs the call details to that webhook. Your server responds with an XML <Response> containing a <Stream> action.
  3. Sautikit opens a WebSocket to the url in your <Stream>. Your server must advertise the audio.drachtio.org subprotocol on the handshake, or the connection is rejected.
  4. Sautikit forks the live caller audio down that socket as binary PCM frames (16-bit little-endian).
  5. You relay those frames to your LLM (Gemini Live, OpenAI Realtime, or self-hosted). The model transcribes, reasons, and generates a spoken reply.
  6. You send the reply back as PCM frames on the same socket. Sautikit plays them into the call. This is full-duplex: audio flows both ways at once, so the caller can interrupt.
  7. When the AI decides to escalate, your webhook flow returns a <Dial> to a human's number and the caller is warm-transferred.

Note: Use outputSamplingRate="16000" for AI agents. Sampling at 16 kHz gives the model wider-band audio than the 8 kHz PSTN default, which helps both transcription accuracy and voice quality.
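
Because the wire format is 16-bit little-endian PCM, frame sizes are easy to reason about. The sketch below decodes one binary frame into samples and works out how much audio it holds. It assumes one channel per frame, which this page does not state; with track="both_tracks" the layout may differ, so treat channels as a value to verify. The names decodePcm16le and frameDurationMs are illustrative, not part of any Sautikit SDK.

```javascript
// Decode 16-bit little-endian PCM bytes (a Node.js Buffer) into signed samples.
function decodePcm16le(frame) {
  if (frame.length % 2 !== 0) {
    throw new RangeError("PCM16 frame must contain an even number of bytes");
  }
  const samples = new Int16Array(frame.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = frame.readInt16LE(i * 2);
  }
  return samples;
}

// Milliseconds of audio in a frame. `channels` is an assumption (mono by default).
function frameDurationMs(byteLength, sampleRate, channels = 1) {
  const bytesPerSample = 2; // 16-bit PCM
  return (byteLength * 1000) / (bytesPerSample * channels * sampleRate);
}

// 640 bytes at 16 kHz mono is 320 samples, which is 20 ms of audio.
const frame = Buffer.alloc(640);
frame.writeInt16LE(-1234, 0);
console.log(decodePcm16le(frame)[0], frameDurationMs(frame.length, 16000));
```

The same arithmetic is useful for sizing buffers: at 16 kHz mono, one second of audio is 32,000 bytes.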

API surface

Endpoints you call:

  • POST /v1/numbers: claim a phone number for the front desk.
  • PATCH /v1/numbers/{number_id}: set the routing_url to your voice webhook.
  • GET /v1/calls/{call_sid}: fetch the call detail record after the call ends.
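
A rough sketch of how the second and third calls could be assembled. Only the paths and the routing_url field come from this page; the base URL, the bearer-token header, and the JSON request body are assumptions to check against the API reference, and the helper names are hypothetical. The functions build request descriptors without sending anything, so they can be inspected before being passed to fetch.

```javascript
// Placeholder values: the base URL and bearer-token auth are assumptions,
// not taken from the Sautikit API reference.
const API_BASE = "https://api.sautikit.example";

// Build (but do not send) the PATCH that sets a number's routing_url.
function buildRoutingUpdate(numberId, routingUrl, apiKey) {
  return {
    url: `${API_BASE}/v1/numbers/${encodeURIComponent(numberId)}`,
    init: {
      method: "PATCH",
      headers: {
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ routing_url: routingUrl }), // assumed JSON body
    },
  };
}

// Build the GET for a call detail record once the call has ended.
function buildCallLookup(callSid, apiKey) {
  return {
    url: `${API_BASE}/v1/calls/${encodeURIComponent(callSid)}`,
    init: { method: "GET", headers: { Authorization: `Bearer ${apiKey}` } },
  };
}

const req = buildRoutingUpdate("num_123", "https://your-app.example.com/voice", "sk_test");
console.log(req.init.method, req.url);
// To send it: const res = await fetch(req.url, req.init);
```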

Voice actions used:

  • Stream: fork live call audio to your WebSocket for real-time AI.
  • Dial: warm-transfer the caller to a human when the AI escalates.
  • Say: a TTS greeting fallback if the media socket is unavailable.

Note: Stream ships today via the raw XML form only: return the <Stream> element in an application/xml response and Sautikit forwards it to the media layer unchanged. A native JSON stream action is on the roadmap; until then, use the XML form for real-time media forking.

Example

1. The XML your webhook returns

When the number is dialled, your webhook replies with an application/xml body opening the media stream:

<Response>
  <Stream
    name="receptionist"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>

track="both_tracks" forks both call legs so your model hears the caller and its own playback. outputSamplingRate is the PCM rate Sautikit sends and expects back. Audio on the wire is 16-bit little-endian PCM.
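
If the stream URL or callback is built at runtime, the XML is safer to generate than to concatenate by hand. A minimal sketch, assuming you render the body from a plain object: buildStreamResponse and escapeXmlAttr are hypothetical helpers, and the query string on the url is invented only to show that an ampersand must be escaped inside an attribute.

```javascript
// Escape a value for use inside a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render the <Response><Stream /></Response> body from a plain object.
// Attribute names are passed through unchanged, so use the ones shown above.
function buildStreamResponse(attrs) {
  const rendered = Object.entries(attrs)
    .map(([name, value]) => `    ${name}="${escapeXmlAttr(value)}"`)
    .join("\n");
  return `<Response>\n  <Stream\n${rendered} />\n</Response>`;
}

const xml = buildStreamResponse({
  name: "receptionist",
  url: "wss://your-app.example.com/audio?tenant=acme&lang=en", // hypothetical query string
  track: "both_tracks",
  outputSamplingRate: 16000,
  statusCallback: "https://your-app.example.com/stream-status",
  statusEvents: "stream-started stream-stopped stream-error",
});
console.log(xml);
```

Your webhook would send this string with a Content-Type of application/xml.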

2. The WebSocket bridge (Node.js)

Your server terminates the socket, relays PCM to your LLM, and pipes the model's PCM back. The one hard requirement: advertise the audio.drachtio.org subprotocol.

import WebSocket, { WebSocketServer } from "ws";
import { connectToLLM } from "./llm.js"; // your adapter: Gemini Live / OpenAI Realtime / self-hosted

const SUBPROTOCOL = "audio.drachtio.org"; // required by Sautikit

const wss = new WebSocketServer({
  port: 8080,
  // Select the subprotocol only when the client offers it (ws v8+ passes a Set).
  handleProtocols: (protocols) => (protocols.has(SUBPROTOCOL) ? SUBPROTOCOL : false),
});

wss.on("connection", async (call) => {
  const llm = await connectToLLM({ sampleRate: 16000 });

  // Caller audio (binary PCM) -> LLM. Non-binary frames are ignored.
  call.on("message", (frame, isBinary) => {
    if (isBinary) llm.sendAudio(frame);
  });

  // LLM audio (binary PCM) -> back into the call on the same socket.
  // Skip the write if the caller has already hung up.
  llm.on("audio", (pcm) => {
    if (call.readyState === WebSocket.OPEN) call.send(pcm, { binary: true });
  });

  // When the model decides to escalate, close the stream so your
  // voice webhook flow can return the <Dial> below.
  llm.on("handoff", () => call.close());
});
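
One gap in a bridge shaped like the one above: the message handler is attached only after connectToLLM resolves, so caller audio that arrives while the model session is still connecting may be dropped. A minimal sketch of one way to cover that window is a small gate that queues frames until the model is ready. createAudioGate is a hypothetical helper, and the frame cap is an arbitrary value to tune.

```javascript
// Hold frames until a sink is ready, then flush in order and pass through.
// maxFrames bounds memory if the model connection is slow (value is arbitrary).
function createAudioGate(maxFrames = 250) {
  const pending = [];
  let sink = null;
  return {
    push(frame) {
      if (sink) {
        sink(frame);
        return;
      }
      if (pending.length === maxFrames) pending.shift(); // drop the oldest frame
      pending.push(frame);
    },
    open(nextSink) {
      sink = nextSink;
      while (pending.length > 0) sink(pending.shift());
    },
  };
}

// Usage with stand-in frames; in the bridge, `push` would be called from the
// socket's "message" handler and `open` once the LLM session is connected.
const gate = createAudioGate(2);
const received = [];
gate.push("f1");
gate.push("f2");
gate.push("f3"); // exceeds the cap, so "f1" is dropped
gate.open((audioFrame) => received.push(audioFrame));
gate.push("f4");
console.log(received); // [ 'f2', 'f3', 'f4' ]
```

In the bridge this would mean registering the message handler before the await and calling gate.open with a function that forwards to llm.sendAudio afterwards.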

3. Escalating to a human

When the AI hands off, end the stream and let your flow return a <Dial> to the human's number, warm-transferring the caller:

<Response>
  <Say>Connecting you to the front desk now.</Say>
  <Dial>+254700000001</Dial>
</Response>
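
If the destination number comes from configuration or a staff directory, it helps to validate it before rendering the response. The sketch below assumes numbers are written in E.164 form, as in the example above; whether Sautikit accepts other formats is not stated here. buildHandoffResponse and escapeXmlText are hypothetical helpers.

```javascript
// E.164: a plus sign, then 2 to 15 digits with no leading zero.
const E164 = /^\+[1-9]\d{1,14}$/;

// Escape text for use as XML element content.
function escapeXmlText(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render the <Say> + <Dial> handoff body for a validated number.
function buildHandoffResponse(number, announcement) {
  if (!E164.test(number)) {
    throw new Error(`Not an E.164 number: ${number}`);
  }
  return [
    "<Response>",
    `  <Say>${escapeXmlText(announcement)}</Say>`,
    `  <Dial>${number}</Dial>`,
    "</Response>",
  ].join("\n");
}

console.log(buildHandoffResponse("+254700000001", "Connecting you to the front desk now."));
```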

Pricing notes

The inbound call leg is billed per second in KES for as long as the call is live on the Sautikit platform — the same rate whether the AI is handling the caller or the call has been transferred. Once <Dial> connects a human, the outbound leg is billed per second too, for the duration of the connected call.

There is no separate Sautikit charge for opening the media stream or for the WebSocket round-trips. Your LLM and voice-model costs (Gemini Live, OpenAI, or self-hosted compute) are billed by that provider on their own metering — Sautikit only moves the audio.

Warning: The AI holds the call open while it thinks and speaks, and per-second billing runs the whole time. Keep model latency low and end the stream promptly on hang-up or handoff so you are not paying for dead air.
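
To make the two-leg billing concrete, here is a small worked sketch. The rates are made-up placeholders, not Sautikit prices, and estimateCallCostCents is a hypothetical helper. Amounts are kept in integer KES cents to avoid floating-point drift.

```javascript
// Per-second billing for the two legs described above.
// Rates are KES cents per second and are placeholders, not real prices.
function estimateCallCostCents({
  inboundSeconds,
  dialedSeconds = 0,
  inboundRateCents,
  outboundRateCents,
}) {
  const inboundCents = inboundSeconds * inboundRateCents;
  const outboundCents = dialedSeconds * outboundRateCents;
  return { inboundCents, outboundCents, totalCents: inboundCents + outboundCents };
}

// A 3-minute call where a human was connected for the last 60 seconds.
// The inbound leg is billed for all 180 seconds; the dialed leg for 60.
const cost = estimateCallCostCents({
  inboundSeconds: 180,
  dialedSeconds: 60,
  inboundRateCents: 5,
  outboundRateCents: 10,
});
console.log(cost.totalCents / 100); // 15 (KES, at the placeholder rates)
```

With these placeholder rates, every extra second the AI spends on the line adds 5 cents to the inbound leg, which is why low model latency matters.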

Next steps

  • Voice actions concept: the full <Stream> attribute table and the action-response loop.
  • Build an AI voice engine with Gemini: wiring Stream to Gemini Live end to end.
  • How to build an AI voice agent: design patterns for real-time voice agents.
  • Dial voice action: warm transfer, caller ID, and connected-leg options.
  • AI support agent use case: the same Stream loop applied to inbound support.