SautiKit

AI support voice agent: deflect and triage inbound calls, escalate live

Build an AI voice agent that answers support FAQs, checks account context, resolves simple issues, and hands off to a human on the same call.

Tags: use-case, ai-voice-agent, stream, customer-support, llm



Summary

An AI customer-support voice agent answers inbound support calls, resolves the routine ones, and escalates the rest. It handles FAQs, looks up account or order context from your backend, walks the caller through simple fixes, and connects a human agent on the same call when it hits something it can't solve. The goal is deflection: most callers get an answer without ever waiting in a queue.

With Sautikit you drive the agent in real time. Your number's routing_url returns a <Stream> voice action; Sautikit forks live caller audio to your WebSocket server; your bridge relays that audio to the LLM of your choice and streams synthesized speech back into the call. Because your server owns the conversation, the same LLM turn that decides "I can't resolve this" can end the stream and return a <Dial> to a human.

Who this is for

  • Fintechs and ISPs running a support line where most calls are balance checks, outage status, password resets, or plan questions.
  • Backend teams that already have account, billing, or ticketing APIs and want a voice front end over them.
  • Product managers who need to cut queue times and human-agent minutes without dropping the option to reach a person.
  • Anyone building a real-time voice agent on their own LLM (Gemini Live, OpenAI Realtime, or self-hosted) rather than a closed vendor bot.

How it works

The real-time loop:

  1. A caller dials your Sautikit number. Its routing_url points at your voice webhook.
  2. Your webhook returns an XML <Response> containing a <Stream> action with your wss:// URL.
  3. Sautikit opens a WebSocket to your server. Your server must advertise the audio.drachtio.org subprotocol on the handshake, or Sautikit rejects the connection.
  4. Sautikit forks live caller audio to that socket as binary frames: 16-bit little-endian PCM at the outputSamplingRate you requested (use 16000 for AI models).
  5. Your bridge relays the audio to your LLM. The model can call your internal tools and APIs mid-conversation to pull account context, order status, or ticket history.
  6. You send synthesized PCM back on the same socket. Sautikit plays it into the call. This continues turn by turn.
  7. When the agent can't resolve the issue, your flow stops the stream and returns a <Dial> to a human agent, connecting them on the same call.
Note: Real-time streaming ships via the raw XML <Stream> element. Native JSON stream support is on the roadmap; until then, return <Stream> in an application/xml response. See the voice actions concept.

State management

Your server owns the conversation. Key the LLM session and any tool results by the call SID from the stream handshake, so a mid-call escalation to <Dial> can pass everything the human agent needs (verified identity, account ID, what the caller already tried).
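One way to hold that per-call state is a small in-memory store keyed by call SID. This is a minimal sketch, not Sautikit's API: the class name, field names, and the `handoff` shape are all hypothetical, and how the call SID reaches your server (for example via headerMetadata) is an assumption you should check against the Stream reference.

```javascript
// Hypothetical in-memory store keyed by call SID. If you run more than one
// bridge process, back this with a shared store instead of a Map.
class CallStateStore {
  constructor() {
    this.calls = new Map();
  }

  // Create the record the first time a call SID is seen, then reuse it.
  get(callSid) {
    if (!this.calls.has(callSid)) {
      this.calls.set(callSid, {
        callSid,
        verified: false, // has the caller passed identity checks?
        accountId: null, // filled in by an account lookup tool
        attempted: [],   // fixes the caller already tried
      });
    }
    return this.calls.get(callSid);
  }

  // Record a troubleshooting step so the human agent doesn't repeat it.
  noteAttempt(callSid, step) {
    this.get(callSid).attempted.push(step);
  }

  // Snapshot everything for the human agent and drop the record.
  handoff(callSid, reason) {
    const state = this.get(callSid);
    this.calls.delete(callSid);
    return { ...state, reason };
  }
}
```

Calling `handoff` at escalation time gives you one object to attach to the ticket or screen-pop the human agent sees.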

API surface

Endpoints you call:

  • POST /v1/numbers: claim a phone number for your support line.
  • PATCH /v1/numbers/{number_id}: set or update the routing_url (your voice webhook).
  • GET /v1/calls/{call_sid}: fetch the call detail record after the call ends.
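As an illustration of the second endpoint, the sketch below builds the PATCH request that points a number at your voice webhook. Only the path and the routing_url field come from the list above; the host name, the Bearer auth header, and the JSON body shape are assumptions made for the example, so check the API reference before relying on them.

```javascript
// ASSUMPTIONS: API_BASE, the Bearer scheme, and the JSON body shape are
// illustrative. Only the path and the routing_url field come from the docs.
const API_BASE = "https://api.sautikit.example"; // hypothetical host

function buildSetRoutingUrlRequest(numberId, routingUrl, apiKey) {
  new URL(routingUrl); // throws a TypeError on a malformed URL
  return {
    url: `${API_BASE}/v1/numbers/${encodeURIComponent(numberId)}`,
    init: {
      method: "PATCH",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ routing_url: routingUrl }),
    },
  };
}

// Usage (Node 18+ ships a global fetch):
//   const { url, init } = buildSetRoutingUrlRequest(
//     "num_123", "https://your-app.example.com/voice", process.env.SAUTIKIT_API_KEY);
//   const res = await fetch(url, init);
```

Keeping request construction separate from the network call makes the request easy to inspect and unit test.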

Voice actions used:

  • Stream: fork live call audio to your WebSocket for real-time AI.
  • Dial: connect the caller to a human agent for escalation.
  • Say: optional TTS for a greeting or a fallback message before streaming starts.

<Stream> attributes

| Attribute | Required | Notes |
| --- | --- | --- |
| url | yes | wss:// endpoint Sautikit connects to. |
| track | yes | inbound_track, outbound_track, or both_tracks. |
| outputSamplingRate | yes | 8000 or 16000. Use 16000 for AI models. |
| name | no | Friendly identifier echoed in stream status events. |
| headerMetadata | no | JSON headers sent on the WebSocket handshake. |
| openMetadata | no | Opaque UTF-8 payload sent as the first text frame. |
| statusCallback | no | URL Sautikit POSTs stream status events to. |
| statusEvents | no | Space-separated subset of stream-started, stream-stopped, stream-error. |

Audio on the socket is 16-bit little-endian PCM. Your server must accept the audio.drachtio.org subprotocol.
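To make the frame format concrete, here is a small sketch of helpers for working with the binary frames in Node.js. The function names are hypothetical, and the math assumes a single audio channel; whether both_tracks delivers one channel or two is not stated here, so verify that before reusing the duration calculation.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit PCM means two bytes per sample

// Duration of one binary frame in milliseconds (mono assumed).
function frameDurationMs(frame, sampleRate = 16000) {
  return (frame.length * 1000) / (BYTES_PER_SAMPLE * sampleRate);
}

// Decode a Buffer into signed 16-bit samples, reading little-endian
// explicitly so the result does not depend on host byte order.
function decodePcm16le(frame) {
  const samples = new Int16Array(Math.floor(frame.length / BYTES_PER_SAMPLE));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = frame.readInt16LE(i * BYTES_PER_SAMPLE);
  }
  return samples;
}

// Peak amplitude scaled to 0..1, usable for crude silence detection.
function peakLevel(samples) {
  let peak = 0;
  for (const s of samples) peak = Math.max(peak, Math.abs(s));
  return peak / 32768;
}
```

A 640-byte frame at 16000 Hz works out to 20 ms of audio under these assumptions.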

Example

1. The XML your webhook returns

When the number is dialled, Sautikit POSTs to your routing_url. Reply with application/xml:

<Response>
  <Stream
    name="support-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>
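If you would rather generate that XML than hand-write it, a sketch along these lines can build the response and serve it. The helper names are hypothetical; the required attributes and the application/xml content type come from this page, and the handler uses the plain (req, res) shape that Node's built-in http server accepts.

```javascript
// Escape XML-special characters so attribute values can't break the markup.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

const REQUIRED = ["url", "track", "outputSamplingRate"];

// Render a <Response> holding one <Stream>, failing fast on missing attributes.
function buildStreamResponse(attrs) {
  for (const key of REQUIRED) {
    if (attrs[key] === undefined) throw new Error(`<Stream> needs ${key}`);
  }
  const rendered = Object.entries(attrs)
    .map(([name, value]) => `${name}="${escapeXmlAttr(value)}"`)
    .join(" ");
  return `<Response><Stream ${rendered} /></Response>`;
}

// Webhook handler: reply to the routing_url POST with application/xml.
function voiceWebhook(req, res) {
  const xml = buildStreamResponse({
    name: "support-agent",
    url: "wss://your-app.example.com/audio",
    track: "both_tracks",
    outputSamplingRate: 16000,
  });
  res.writeHead(200, { "Content-Type": "application/xml" });
  res.end(xml);
}
```

To serve it, pass the handler to Node's http module, for example `http.createServer(voiceWebhook).listen(3000)`.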

2. WebSocket bridge (Node.js)

This sketch shows where the LLM plugs in, where you'd invoke an internal lookup, and how you'd signal an escalation. Wire the LLM client and PCM plumbing to your provider.

import { WebSocketServer } from "ws";
 
// Sautikit negotiates the `audio.drachtio.org` subprotocol on connect.
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});
 
wss.on("connection", (ws) => {
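  // startLLMSession is a placeholder for your provider's realtime client.
  const llm = startLLMSession({
    // Your internal tools the model can call mid-conversation.
    tools: {
      async getAccount({ msisdn }) {
        // Encode the number: a raw "+" in a query string decodes as a space.
        const res = await fetch(
          `https://internal.example.com/accounts?phone=${encodeURIComponent(msisdn)}`,
        );
        if (!res.ok) throw new Error(`Account lookup failed: ${res.status}`);
        return res.json(); // balance, plan, open tickets, outage status...
      },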
    },
    // The model calls this when it can't resolve the issue.
    onEscalate: (reason) => escalateToHuman(ws, reason),
  });
 
  ws.on("message", (data, isBinary) => {
    if (isBinary) {
      // Live caller audio: 16-bit LE PCM at 16000 Hz. Feed it to the model.
      llm.pushAudio(data);
    }
  });
 
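  // Model output: PCM back on the same socket. Sautikit plays it into the call.
  // Skip the send if the socket has already closed (e.g. after escalation).
  llm.on("audio", (pcm) => {
    if (ws.readyState === ws.OPEN) ws.send(pcm, { binary: true });
  });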
});
 
function escalateToHuman(ws, reason) {
  // Close the stream, then return a <Dial> from your routing flow so the
  // caller is connected to a human agent on the same call.
  ws.close();
  // e.g. redirect the call to a webhook that responds with:
  //   <Response><Dial><Number>+254720000010</Number></Dial></Response>
}
Note: <Stream> forks audio; it does not, by itself, end or redirect the call. To escalate, hand the call back to your routing flow and return a <Dial> to your human agent (or a SIP address / queue) as the next action.
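The escalation response itself is small enough to generate. The sketch below builds the <Dial> XML shown in the comment above and serves it from a webhook; the function names and the E.164 shape check are additions for illustration, and how your flow redirects the live call to this webhook is left to the Dial and voice actions references.

```javascript
// Build the escalation response. The number check is a rough E.164 shape
// test ("+" followed by 8 to 15 digits), not a guarantee the number exists.
function buildDialResponse(agentNumber) {
  if (!/^\+[1-9]\d{7,14}$/.test(agentNumber)) {
    throw new Error(`Not an E.164 number: ${agentNumber}`);
  }
  return `<Response><Dial><Number>${agentNumber}</Number></Dial></Response>`;
}

// Hypothetical escalation webhook in the (req, res) shape node:http expects.
function escalationWebhook(req, res) {
  res.writeHead(200, { "Content-Type": "application/xml" });
  res.end(buildDialResponse("+254720000010"));
}
```

Validating the number before rendering keeps malformed input out of the XML and surfaces configuration mistakes early.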

Pricing notes

  • Call time: the inbound leg is billed per second in KES for as long as the call is live on Sautikit, including the AI-handled portion. Per-second billing means a 40-second deflected FAQ costs 40 seconds, not a rounded-up minute.
  • Escalation leg: once <Dial> connects a human agent, the per-second rate continues across the connected legs.
  • LLM cost: model inference runs on your provider (Gemini, OpenAI, or your own hardware) and is billed by them, not by Sautikit.
  • Net effect: every call the agent resolves end to end saves the human-agent minutes that call would otherwise have consumed, on top of shortening the queue.
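The per-second point is easy to check with arithmetic. The sketch below compares per-second billing with rounding up to whole minutes; the rate of 3 KES per minute is a made-up figure chosen only to keep the numbers round, not a Sautikit price.

```javascript
// HYPOTHETICAL rate for illustration only; see the pricing page for real rates.
const RATE_KES_PER_MINUTE = 3;

// Per-second billing: pay for exactly the seconds used.
function perSecondCost(seconds, ratePerMinute = RATE_KES_PER_MINUTE) {
  return (seconds * ratePerMinute) / 60;
}

// Per-minute billing: every started minute is charged in full.
function perMinuteRoundedCost(seconds, ratePerMinute = RATE_KES_PER_MINUTE) {
  return Math.ceil(seconds / 60) * ratePerMinute;
}

console.log(perSecondCost(40));        // 2
console.log(perMinuteRoundedCost(40)); // 3
```

At that illustrative rate, a 40-second deflected call costs 2 KES billed per second against 3 KES if rounded up to a full minute.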

Next steps

  • Voice actions concept: full <Stream> reference, status events, and the media handshake.
  • How to build an AI voice agent: end-to-end walkthrough of the streaming loop.
  • AI voice engine with Gemini and Sautikit: wiring a Gemini Live model to the audio fork.
  • Dial voice action: connecting callers to a human agent for escalation.
  • AI receptionist use case: the same streaming pattern applied to front-desk call handling.