AI support voice agent: deflect and triage inbound calls, escalate live

An AI customer-support voice agent answers inbound support calls, resolves the routine ones, and escalates the rest. It handles FAQs, looks up account or order context from your backend, walks the caller through simple fixes, and connects a human agent on the same call when it hits something it can't solve. The goal is deflection: most callers get an answer without ever waiting in a queue.

With Sautikit you drive the agent in real time. The webhook at your number's routing_url returns a <Stream> voice action; Sautikit forks live caller audio to your WebSocket server; your bridge relays that audio to the LLM of your choice and streams synthesized speech back into the call. Because your server owns the conversation, the same LLM turn that decides "I can't resolve this" can end the stream and return a <Dial> to a human.

Who this is for:

Fintechs and ISPs running a support line where most calls are balance checks, outage status, password resets, or plan questions.
Backend teams that already have account, billing, or ticketing APIs and want a voice front end over them.
Product managers who need to cut queue times and human-agent minutes without dropping the option to reach a person.
Anyone building a real-time voice agent on their own LLM (Gemini Live, OpenAI Realtime, or self-hosted) rather than a closed vendor bot.

The real-time loop:

A caller dials your Sautikit number. Its routing_url points at your voice webhook.
Your webhook returns an XML <Response> containing a <Stream> action with your wss:// URL.
Sautikit opens a WebSocket to your server and offers the audio.drachtio.org subprotocol. Your server must accept that subprotocol on the handshake, or Sautikit rejects the connection.
Sautikit forks live caller audio to that socket as binary frames: 16-bit little-endian PCM at the outputSamplingRate you requested (use 16000 for AI models).
Your bridge relays the audio to your LLM. The model can call your internal tools and APIs mid-conversation to pull account context, order status, or ticket history.
You send synthesized PCM back on the same socket. Sautikit plays it into the call. This continues turn by turn.
When the agent can't resolve the issue, your flow stops the stream and returns a <Dial> to a human agent, connecting them on the same call.

Your server owns the conversation. Key the LLM session and any tool results by the call SID from the stream handshake, so a mid-call escalation to <Dial> can pass everything the human agent needs (verified identity, account ID, what the caller already tried).
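
One way to do that keying is a small in-memory map, sketched below. Every function and field name here (`openSession`, `verified`, `accountId`, `attempted`) is hypothetical, and how you read the call SID off the handshake is not shown. A production bridge would likely use a shared store with expiry rather than process memory.

```javascript
// Hypothetical in-memory store: one entry per live call, keyed by call SID.
const sessions = new Map();

function openSession(callSid) {
  const session = { callSid, verified: false, accountId: null, attempted: [] };
  sessions.set(callSid, session);
  return session;
}

function getSession(callSid) {
  const session = sessions.get(callSid);
  if (!session) throw new Error(`unknown call: ${callSid}`);
  return session;
}

// Merge a tool result (for example an account lookup) into the call's state.
function recordToolResult(callSid, patch) {
  return Object.assign(getSession(callSid), patch);
}

// Remember a fix the caller has already tried, so the human need not repeat it.
function noteAttempt(callSid, step) {
  getSession(callSid).attempted.push(step);
}

// Snapshot of what a human agent needs at escalation time.
function handoffSummary(callSid) {
  const { verified, accountId, attempted } = getSession(callSid);
  return { callSid, verified, accountId, attempted: [...attempted] };
}

// Call this when the stream stops so finished calls do not accumulate.
function closeSession(callSid) {
  sessions.delete(callSid);
}
```

At escalation you would pass `handoffSummary(callSid)` to whatever surfaces context to the human agent, such as a ticket note or a screen pop.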

Endpoints you call:

POST /v1/numbers: claim a phone number for your support line.
PATCH /v1/numbers/{number_id}: set or update the routing_url (your voice webhook).
GET /v1/calls/{call_sid}: fetch the call detail record after the call ends.
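
As a rough sketch of the PATCH call, the helper below only builds the request so you can inspect it before sending. The path comes from the list above; the base URL, the Bearer authorization scheme, and the JSON body shape with a `routing_url` field are all assumptions, so check them against the API reference. `buildRoutingUpdate` is a hypothetical name.

```javascript
// Builds (but does not send) a request that sets a number's routing_url.
// ASSUMPTIONS: baseUrl, Bearer auth, and the JSON body shape are illustrative.
function buildRoutingUpdate({ baseUrl, apiKey, numberId, routingUrl }) {
  new URL(routingUrl); // throws a TypeError if routingUrl is not a valid URL
  return {
    url: `${baseUrl}/v1/numbers/${encodeURIComponent(numberId)}`,
    init: {
      method: "PATCH",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ routing_url: routingUrl }),
    },
  };
}
```

With Node 18 or later you could send it with `const { url, init } = buildRoutingUpdate({...}); const res = await fetch(url, init);`.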

Voice actions used:

Stream: fork live call audio to your WebSocket for real-time AI.
Dial: connect the caller to a human agent for escalation.
Say: optional TTS for a greeting or a fallback message before streaming starts.

Stream attributes:

Attribute	Required	Notes
`url`	yes	`wss://` endpoint Sautikit connects to.
`track`	yes	`inbound_track`, `outbound_track`, or `both_tracks`.
`outputSamplingRate`	yes	`8000` or `16000`. Use `16000` for AI models.
`name`	no	Friendly identifier echoed in stream status events.
`headerMetadata`	no	JSON headers sent on the WebSocket handshake.
`openMetadata`	no	Opaque UTF-8 payload sent as the first text frame.
`statusCallback`	no	URL Sautikit POSTs stream status events to.
`statusEvents`	no	Space-separated subset of `stream-started`, `stream-stopped`, `stream-error`.

Audio on the socket is 16-bit little-endian PCM. Your server must accept the audio.drachtio.org subprotocol.
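
To make the frame format concrete, here is a minimal sketch that inspects one binary frame with Node's built-in Buffer. It assumes a single mono channel; with `both_tracks` the layout may differ, so confirm it against the Stream reference. `describePcmFrame` is a hypothetical helper name.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

// Returns the sample count, duration, and peak amplitude of one PCM frame.
function describePcmFrame(frame, sampleRate = 16000) {
  const sampleCount = Math.floor(frame.length / BYTES_PER_SAMPLE);
  let peak = 0;
  for (let i = 0; i < sampleCount; i++) {
    // Each sample is a signed 16-bit little-endian integer.
    const sample = frame.readInt16LE(i * BYTES_PER_SAMPLE);
    peak = Math.max(peak, Math.abs(sample));
  }
  return {
    sampleCount,
    durationMs: (sampleCount * 1000) / sampleRate,
    peak,
  };
}
```

A 640-byte frame at 16000 Hz holds 320 samples, which is 20 ms of audio. The peak value could feed a simple silence check before you forward audio to the model.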

When the number is dialled, Sautikit POSTs to your routing_url. Reply with application/xml:

<Response>
  <Stream
    name="support-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>
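
If you generate that XML in code rather than serving a static file, a small builder can validate the attributes from the table before responding. This is a minimal sketch: `buildStreamResponse` and `escapeXmlAttr` are hypothetical helpers, not part of any Sautikit SDK, and the validation reflects only the values listed above.

```javascript
const TRACKS = new Set(["inbound_track", "outbound_track", "both_tracks"]);
const RATES = new Set([8000, 16000]);

// Escape the characters that are unsafe inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Required attributes are validated; optional ones (name, statusCallback,
// statusEvents, ...) are passed through as given.
function buildStreamResponse({ url, track, outputSamplingRate, ...optional }) {
  if (!/^wss:\/\//.test(url)) throw new Error("url must be a wss:// endpoint");
  if (!TRACKS.has(track)) throw new Error(`unsupported track: ${track}`);
  if (!RATES.has(outputSamplingRate)) {
    throw new Error(`unsupported outputSamplingRate: ${outputSamplingRate}`);
  }
  const attrs = { url, track, outputSamplingRate, ...optional };
  const rendered = Object.entries(attrs)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}="${escapeXmlAttr(value)}"`)
    .join(" ");
  return `<Response><Stream ${rendered} /></Response>`;
}
```

Your webhook handler would send the returned string with a `Content-Type: application/xml` header.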

This sketch shows where the LLM plugs in, where you'd invoke an internal lookup, and how you'd signal an escalation. Wire the LLM client and PCM plumbing to your provider.

import { WebSocketServer } from "ws";
 
// Sautikit negotiates the `audio.drachtio.org` subprotocol on connect.
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});
 
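wss.on("connection", (ws) => {
  // startLLMSession is a placeholder for your own wrapper around your
  // LLM provider's real-time client; it is not defined in this sketch.
  const llm = startLLMSession({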
    // Your internal tools the model can call mid-conversation.
    tools: {
      async getAccount({ msisdn }) {
        // Encode the number: a raw "+" in a query string is read as a space.
        const res = await fetch(
          `https://internal.example.com/accounts?phone=${encodeURIComponent(msisdn)}`,
        );
        return res.json(); // balance, plan, open tickets, outage status...
      },
    },
    // The model calls this when it can't resolve the issue.
    onEscalate: (reason) => escalateToHuman(ws, reason),
  });
 
  ws.on("message", (data, isBinary) => {
    if (isBinary) {
      // Live caller audio: 16-bit LE PCM at 16000 Hz. Feed it to the model.
      llm.pushAudio(data);
    }
  });
 
  // Model output: PCM back on the same socket. Sautikit plays it into the call.
  llm.on("audio", (pcm) => ws.send(pcm, { binary: true }));
});
 
function escalateToHuman(ws, reason) {
  // Close the stream, then return a <Dial> from your routing flow so the
  // caller is connected to a human agent on the same call.
  console.log("Escalating to a human agent:", reason);
  ws.close();
  // e.g. redirect the call to a webhook that responds with:
  //   <Response><Dial><Number>+254720000010</Number></Dial></Response>
}
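
The escalation webhook's response body can be produced by a small helper like the one below. It is a sketch only: `buildDialResponse` is a hypothetical name, the E.164 check is a sanity guard rather than a Sautikit requirement, and the XML shape is copied from the comment in the sketch above.

```javascript
// E.164: a plus sign, then 2 to 15 digits, with a non-zero first digit.
const E164 = /^\+[1-9]\d{1,14}$/;

// Returns the XML that connects the caller to a human agent's number.
function buildDialResponse(agentNumber) {
  if (!E164.test(agentNumber)) {
    throw new Error(`not an E.164 number: ${agentNumber}`);
  }
  return `<Response><Dial><Number>${agentNumber}</Number></Dial></Response>`;
}
```

Validating the number up front means a misconfigured agent line fails in your code, where you can fall back to another number, instead of failing on the live call.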

Billing:

Call time: the inbound leg is billed per second in KES for as long as the call is live on Sautikit, including the AI-handled portion. Per-second billing means a 40-second deflected FAQ costs 40 seconds, not a rounded-up minute.
Escalation leg: once <Dial> connects a human agent, the per-second rate continues across the connected legs.
LLM cost: model inference runs on your provider (Gemini, OpenAI, or your own hardware) and is billed by them, not by Sautikit.
Net effect: every call the agent resolves end to end is a human-agent minute you didn't pay a person for, on top of the shorter queue.
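
The per-second arithmetic can be checked with a few lines of code. The rate below is hypothetical and chosen only to make the numbers round; substitute your actual rate. Amounts are kept in integer KES cents to avoid floating-point drift, and call durations are assumed to be whole seconds.

```javascript
// HYPOTHETICAL rate for illustration only: 5 KES cents per second.
const RATE_CENTS_PER_SECOND = 5;

// Per-second billing: pay for exactly the seconds used.
function perSecondCost(seconds, rate = RATE_CENTS_PER_SECOND) {
  return seconds * rate;
}

// Per-minute billing, for comparison: round up to the next full minute.
function perMinuteCost(seconds, rate = RATE_CENTS_PER_SECOND) {
  return Math.ceil(seconds / 60) * 60 * rate;
}
```

At this illustrative rate, a 40-second call costs 200 cents per second-billed versus 300 cents if rounded up to a full minute.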

Related reading:

Voice actions concept: full <Stream> reference, status events, and the media handshake.
How to build an AI voice agent: end-to-end walkthrough of the streaming loop.
AI voice engine with Gemini and Sautikit: wiring a Gemini Live model to the audio fork.
Dial voice action: connecting callers to a human agent for escalation.
AI receptionist use case: the same streaming pattern applied to front-desk call handling.
