An AI voice agent is a program that answers or places a phone call, listens, thinks, and talks back: end to end, no human on the line. Building one in 2026 is mostly about wiring four layers together and keeping latency low enough that the caller never feels the machine.
TL;DR
A phone AI voice agent has four layers: telephony (place/receive calls + fork audio), speech-to-text, an LLM for reasoning, and text-to-speech.
Two build paths: turn-based (record → STT → LLM → TTS → loop) is simplest; realtime full-duplex via <Stream> + a realtime LLM gives natural barge-in.
Sautikit is the telephony layer any LLM plugs into: a working number costs ~KES 116 and activates instantly, billed in KES.
Every phone AI voice agent, regardless of framework, is the same pipeline:
Telephony layer: places and receives the PSTN/SIP call, and exposes the live audio. This is Sautikit: POST /v1/calls to dial out, a voice callback to handle inbound, and the <Stream> verb to fork media to your code.
Speech-to-text (STT): turns caller audio into text. Deepgram, Whisper, or a realtime provider.
LLM reasoning: decides what to say and which tools to call. GPT, Claude, or Gemini.
Text-to-speech (TTS): turns the LLM's reply back into audio the caller hears.
Two cross-cutting concerns make or break the experience: turn-taking / barge-in (letting the caller interrupt) and the latency budget (how long from "caller stops talking" to "agent starts talking").
The caller perceives a natural conversation when the agent responds in under ~800 ms after they stop speaking. That budget is shared:
Endpointing (deciding the caller is done): 100–300 ms
STT final transcript: 100–200 ms
LLM first token: 200–500 ms
TTS first audio frame: 100–300 ms
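Summing the ranges above end to end makes the problem concrete. This is a small arithmetic sketch: the stage names and the `BUDGET_MS` constant are illustrative labels for the figures in the list, not part of any API.

```javascript
// Latency ranges in ms, copied from the list above; the key names are illustrative.
const stages = {
  endpointing: [100, 300],
  sttFinal: [100, 200],
  llmFirstToken: [200, 500],
  ttsFirstFrame: [100, 300],
};
const BUDGET_MS = 800; // approximate target from the text

// Best-case and worst-case latency if every stage runs strictly in series.
function serialRange(stageMap) {
  return Object.values(stageMap).reduce(
    ([lo, hi], [min, max]) => [lo + min, hi + max],
    [0, 0],
  );
}

const [best, worst] = serialRange(stages);
console.log(`serial: ${best} to ${worst} ms against a ${BUDGET_MS} ms budget`);
console.log(`worst case overshoots by ${worst - BUDGET_MS} ms`);
```

In series the stages take 500 to 1,300 ms. Only the best case fits the budget, and the midpoint (900 ms) already misses it.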
You cannot afford to run these strictly in series and stay under budget. The trick is streaming: start LLM inference on partial transcripts, and start TTS on the first LLM tokens. Which path you pick (turn-based or full-duplex) decides how much streaming you get.
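One way to start TTS on the first LLM tokens is to flush text to the synthesizer at sentence boundaries instead of waiting for the whole reply. The sketch below assumes your LLM client exposes text deltas as an async iterable. `sentenceChunks` and `fakeTokens` are hypothetical names, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a deliberate simplification.

```javascript
// Group streamed LLM text deltas into sentence-sized chunks for TTS.
async function* sentenceChunks(tokenStream) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    // Emit every complete sentence currently sitting in the buffer.
    let match;
    while ((match = buffer.match(/^(.*?[.!?])\s+/s)) !== null) {
      yield match[1].trim();
      buffer = buffer.slice(match[0].length);
    }
  }
  if (buffer.trim() !== "") yield buffer.trim(); // flush whatever is left
}

// Stand-in for a streaming LLM response (hypothetical).
async function* fakeTokens() {
  for (const t of ["Your order ", "shipped. It ", "arrives ", "Friday."]) yield t;
}

async function demo() {
  const sentences = [];
  for await (const s of sentenceChunks(fakeTokens())) sentences.push(s);
  return sentences;
}

demo().then((sentences) => console.log(sentences));
```

Each yielded sentence can go to TTS while the LLM is still generating the next one, so the caller hears the first sentence sooner. Because a boundary needs trailing whitespace, a figure like "3.00" is not split at its decimal point.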
The simplest agent uses Sautikit voice actions and never touches raw audio. Each turn: record the caller, transcribe, ask the LLM, say the answer, and redirect back into the loop. Your voice callback returns JSON actions.
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false }));

// First turn: greet, then record the caller and post to /turn
app.post("/voice", (req, res) => {
  res.json({
    actions: [
      { say: { text: "Hi, this is Sauti support. How can I help?", language: "en-KE" } },
      { record: { action: "/turn", maxLength: 20, transcribe: true } },
    ],
  });
});

// Each subsequent turn: LLM answers, then we record again
app.post("/turn", async (req, res) => {
  const caller = req.body.transcription_text ?? "";
  const reply = await askLLM(caller); // your GPT/Claude/Gemini call
  res.json({
    actions: [
      { say: { text: reply, language: "en-KE" } },
      { record: { action: "/turn", maxLength: 20, transcribe: true } },
    ],
  });
});

app.listen(3000);
This is robust and easy to debug, but the caller cannot interrupt; they must wait for say to finish before the next record starts. Good for IVR-style flows, form-filling, and confirmations. Not good for open conversation.
For a natural agent that supports barge-in, you fork the live call audio to your own WebSocket with the <Stream> verb and pipe it into a realtime LLM. Return the verb in its raw XML form from your voice callback.
Sautikit then opens a WebSocket to your server (which must accept the audio.drachtio.org subprotocol) and streams raw binary PCM frames. The socket is bidirectional: send binary PCM frames back on the same socket to play audio into the call. That is where your STT → realtime LLM → TTS pipeline lives, and where barge-in happens: when the caller starts talking, you stop your outbound audio.
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: () => "audio.drachtio.org",
});

wss.on("connection", (ws) => {
  ws.on("message", (frame, isBinary) => {
    if (!isBinary) return handleStatusEvent(JSON.parse(frame.toString()));
    // frame is raw 16 kHz PCM from the caller; feed to STT / realtime LLM
    pipeline.write(frame);
  });
  // to speak: ws.send(pcmChunk) with a binary Buffer of PCM
  pipeline.on("audio", (pcm) => ws.send(pcm));
});
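The barge-in rule itself (stop outbound audio when the caller starts talking) can be sketched with a simple energy gate. This is an illustration, not Sautikit's mechanism. It assumes frames are 16-bit signed little-endian mono PCM (the text states only raw 16 kHz PCM). `BargeInGate`, the threshold, and the frame count are hypothetical values to tune. An energy threshold is a crude stand-in for a real voice-activity detector.

```javascript
// Root-mean-square level of a 16-bit little-endian PCM frame, scaled to 0..1.
function frameRms(frame) {
  const samples = Math.floor(frame.length / 2);
  if (samples === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const s = frame.readInt16LE(i * 2) / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples);
}

// Holds agent audio waiting to be sent and drops it when the caller talks.
class BargeInGate {
  constructor({ threshold = 0.05, framesToTrigger = 3 } = {}) {
    this.threshold = threshold; // assumed level that counts as speech
    this.framesToTrigger = framesToTrigger; // consecutive loud frames required
    this.loudFrames = 0;
    this.outbound = []; // agent PCM chunks not yet sent
  }

  queue(pcm) {
    this.outbound.push(pcm);
  }

  // Call for every inbound caller frame. Returns true when barge-in fires.
  onCallerFrame(frame) {
    this.loudFrames = frameRms(frame) >= this.threshold ? this.loudFrames + 1 : 0;
    if (this.loudFrames >= this.framesToTrigger && this.outbound.length > 0) {
      this.outbound.length = 0; // stop talking: discard unsent agent audio
      return true;
    }
    return false;
  }

  // Next chunk to pass to ws.send(), or undefined when nothing is queued.
  next() {
    return this.outbound.shift();
  }
}
```

In the WebSocket handler you would call `onCallerFrame(frame)` on each binary message and send `next()` on a timer paced to the audio rate. When barge-in fires, also cancel any in-flight LLM and TTS generation so no stale audio is queued afterwards.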
Inbound is free (KES 0): point a claimed number at your voice callback and answer for nothing. Outbound is KES 3.00/min, billed per second from the moment the call connects.
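As a rough worked example of per-second billing at these rates, the sketch below estimates the telephony charge for one call. The function name is hypothetical. How fractions of a cent are rounded is not stated here, so rounding up to the next cent is an assumption.

```javascript
const OUTBOUND_KES_PER_MIN = 3.0; // rate quoted in the text
const INBOUND_KES_PER_MIN = 0; // inbound is free per the text

// Estimated charge in KES for a call billed per connected second.
// Computed in cents; rounding up to a whole cent is an assumption.
function estimateCallCostKes(connectedSeconds, direction = "outbound") {
  const ratePerMin =
    direction === "inbound" ? INBOUND_KES_PER_MIN : OUTBOUND_KES_PER_MIN;
  const cents = Math.ceil((connectedSeconds * ratePerMin * 100) / 60);
  return cents / 100;
}

console.log(estimateCallCostKes(95)); // 95-second outbound call
console.log(estimateCallCostKes(95, "inbound")); // same call, inbound
```

Under these assumptions a 95-second outbound call comes to KES 4.75 and the same call inbound costs nothing. STT, LLM, and TTS charges come on top from their own providers.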
A useful agent does more than talk: it looks up an order, checks a balance, books a slot. That is standard LLM tool calling: your LLM emits a tool call, your handler runs it (query your DB, hit M-Pesa, etc.), and you feed the result back before generating the spoken reply. The telephony layer does not change; only your reasoning loop grows. Keep tool calls under ~400 ms or speak a filler line ("Let me check that for you") while the tool runs, so the latency budget holds.
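The filler-line rule can be sketched as a race between the tool call and a timer. `callToolWithFiller`, `runTool`, and `speak` are hypothetical names: `runTool` stands for whatever executes the LLM's tool call and `speak` for whatever plays a line to the caller. The 400 ms threshold is the figure from the paragraph above.

```javascript
const FILLER_AFTER_MS = 400; // threshold suggested in the text

// Runs a tool call; if it has not settled within the threshold, speaks one
// filler line, then returns the tool result once it arrives.
async function callToolWithFiller(runTool, speak, fillerAfterMs = FILLER_AFTER_MS) {
  let timer;
  const fillerTimer = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), fillerAfterMs);
  });
  const toolPromise = runTool(); // starts immediately and keeps running
  try {
    const first = await Promise.race([toolPromise.then(() => "done"), fillerTimer]);
    if (first === "timeout") await speak("Let me check that for you.");
    return await toolPromise;
  } finally {
    clearTimeout(timer); // do not leave the timer pending after a fast tool
  }
}

// Usage with stand-in functions: a slow lookup triggers the filler line.
const slowLookup = () =>
  new Promise((resolve) => setTimeout(() => resolve({ status: "shipped" }), 600));
const speakToConsole = async (text) => console.log(`agent says: ${text}`);

callToolWithFiller(slowLookup, speakToConsole).then((result) =>
  console.log("tool result:", result),
);
```

The tool keeps running while the filler plays, so the filler adds no delay beyond its own playback time. A tool that finishes inside the threshold never triggers it.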
Managed platforms like Vapi, Retell, and Bland bundle telephony + STT + LLM + TTS behind one config. They are fast to start and fine for prototypes. Teams move to build-your-own when they need: control over the exact prompt and voice, the freedom to swap models, data residency, and predictable pricing. Building on a raw voice API like Sautikit gives you that control, plus KES-denominated, prepaid billing instead of USD-per-minute markup. A working Sautikit number is ~KES 116 and activates instantly, versus the ~KES 5,000 setup plus ~KES 2,500/month a legacy Kenyan provider charges to provision a number. See the full breakdown in Sautikit vs Vapi, Retell & Bland.
What is an AI voice agent?
A program that handles a phone call end to end: it answers or places the call, transcribes what the caller says, uses an LLM to decide a response, and speaks back with TTS, with no human on the line.
Do I need machine learning expertise to build one?
No. You call hosted STT, LLM, and TTS APIs and wire them together. The hard parts (endpointing, barge-in, and staying inside the latency budget) are engineering, not model training.
Turn-based or realtime: which should I build?
Start turn-based (record → STT → LLM → say) for confirmations, OTPs, and form-filling. Move to realtime <Stream> full-duplex when you need natural interruption and open-ended conversation.
Can I use any LLM with Sautikit?
Yes. Sautikit is the telephony layer only and is model-agnostic: plug in GPT, Claude, Gemini, or a self-hosted model. It just delivers and receives the audio.
How much does it cost to run?
Sautikit charges KES 3.00/min outbound (billed per second from the moment the call connects) and KES 0 inbound. Numbers are from KES 116/month. Your STT/LLM/TTS providers bill separately.