Build an AI voice agent for fintech: support calls + payment reminders

A borrower calls to ask about their balance, and a natural voice answers, looks up the loan, and explains the next payment. The same agent, running outbound, dials a list of due accounts, politely reminds each borrower, offers to take a promise-to-pay, and hands off to a human when the conversation needs one. There is no IVR tree and no app to install, and it works on any handset. This tutorial wires that agent end to end for a fintech: Sautikit's <Stream> verb bridged to the Google Gemini Live API, with your own account tools in the loop.

TL;DR

One AI voice agent does two jobs: inbound support (borrower dials your number) and outbound reminders (you POST /v1/calls). Both use the same <Stream> bridge.

Sautikit forks live call audio to your WebSocket as raw 16-bit LE PCM @ 16 kHz; you relay it to Gemini Live and write synthesized PCM back on the same socket. Your WS server must accept the audio.drachtio.org subprotocol.

The LLM calls your account tools: look up a balance by caller From, record a promise-to-pay, start an M-Pesa payment intent. When it is out of depth, return <Dial> to a human agent.

Outbound collections are regulated: get consent, honour opt-out and quiet hours, and record lawfully. See the compliance callout before you dial.

A fintech gets two distinct wins from a voice agent, and they share nearly all of the same plumbing:

Inbound support. A customer dials your support line to check a balance, ask when a payment is due, or dispute a charge. The agent answers instantly, 24/7, and deflects the routine questions that clog a contact centre.
Outbound reminders and collections. You dial borrowers whose payment is due or overdue. The agent greets them by name, states the amount and date, offers to take a payment or a promise-to-pay, and escalates the hard cases to a human.

Both are the same real-time audio bridge. The only differences are who initiates the call and what the system prompt tells the agent to do. Build the bridge once; point it at either direction.

Inbound: customer dials your number
  → Sautikit fetches your routing_url → you return RAW <Stream> XML
Outbound: you POST /v1/calls → on answer, Sautikit fetches the same routing_url

  → Sautikit opens a WebSocket to your Node 'ws' bridge (binary PCM in/out)
  → your bridge relays PCM ⇄ Google Gemini Live API (WebSocket)
  → Gemini calls YOUR tools: getAccount(), recordPromiseToPay(), startMpesaIntent()
  → Gemini's synthesized PCM is written back on the Sautikit socket → plays into the call
  → out of depth? return <Dial> to a human agent

Two WebSockets, one bridge process. Sautikit is the telephony leg; Gemini is the intelligence leg; your account tools are what make it a fintech agent instead of a chatbot. The bridge itself stays thin: bytes in from the call go up to Gemini, bytes down from Gemini go back into the call.

Claim a number, then set its routing URL to your webhook. When any call on that number connects (inbound or an answered outbound call), Sautikit fetches this URL to learn what to do.

curl -s -X PATCH https://api.sautikit.com/v1/numbers/{number_id} \
  -H "Authorization: Bearer $SAUTIKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "routing_url": "https://your-app.example.com/voice" }'

For the outbound half, you place calls with POST /v1/calls. On answer, Sautikit fetches the same routing_url, so one webhook serves both directions:

curl -X POST https://api.sautikit.com/v1/calls \
  -H "Authorization: Bearer $SAUTIKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"to":["+254712345678"],"from":"+254711000001","clientRequestId":"reminder-42"}'

clientRequestId is your correlation handle: stamp it with the loan or ticket ID so a retry never double-dials and your webhook can tie the audio session back to the borrower.
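As a minimal sketch of that idea, the helper below derives the request body for one due loan. The `loan` shape (`id`, `msisdn`, `dueDate`) and the `reminder-<loan>-<dueDate>` format are assumptions for illustration, not something Sautikit prescribes; the point is only that the same loan and due date always yield the same clientRequestId.

```javascript
// Sketch: build the POST /v1/calls body for one due loan.
// `loan` and its fields (id, msisdn, dueDate) are hypothetical shapes.
function buildReminderCall(loan, fromNumber) {
  // Same loan + same due date => same clientRequestId on every retry.
  const clientRequestId = `reminder-${loan.id}-${loan.dueDate}`;
  return { to: [loan.msisdn], from: fromNumber, clientRequestId };
}

const body = buildReminderCall(
  { id: "42", msisdn: "+254712345678", dueDate: "2026-07-10" },
  "+254711000001"
);
console.log(JSON.stringify(body));
```

Because the ID is derived rather than random, a crashed batch job can be rerun from the top and produce identical requests for the loans it already attempted.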

When the call connects, Sautikit fetches your webhook. For real-time AI you return raw XML (not the JSON actions form) with a <Stream> element, served as application/xml.

import express from "express";
 
const app = express();
app.use(express.urlencoded({ extended: false }));
 
app.post("/voice", (req, res) => {
  // req.body carries From, To, etc. Use From to pre-load the borrower's account.
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stream
    name="fintech-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>`;
 
  res.set("Content-Type", "application/xml");
  res.send(xml);
});
 
app.listen(3000);

track="both_tracks" forwards both the caller and any outbound audio; use inbound_track if you only want the borrower's voice into Gemini. outputSamplingRate="16000" tells Sautikit to deliver 16 kHz mono PCM, the rate the Gemini bridge expects. Keep every leg at 16 kHz so the bridge stays a byte pump with no resampling.
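To make the "byte pump" concrete, here is the arithmetic for 16 kHz, 16-bit, mono PCM as a small sketch. The 20 ms frame length in the example is an illustrative figure, not a documented Sautikit frame size; the helpers are useful mainly for sanity-checking buffer sizes and for turning a byte count back into a duration.

```javascript
const SAMPLE_RATE_HZ = 16000; // matches outputSamplingRate="16000"
const BYTES_PER_SAMPLE = 2;   // 16-bit PCM
const CHANNELS = 1;           // mono

// Bytes of PCM needed to hold `ms` milliseconds of audio.
function pcmBytesForMs(ms) {
  return (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * ms) / 1000;
}

// Duration in milliseconds represented by `byteLength` bytes of PCM.
function pcmMsForBytes(byteLength) {
  return (byteLength * 1000) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS);
}

console.log(pcmBytesForMs(1000)); // 32000 bytes per second
console.log(pcmBytesForMs(20));   // 640 bytes for an illustrative 20 ms frame
console.log(pcmMsForBytes(3200)); // 100 ms
```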

If you prefer the SDK's typed builder, the shape reads like this:

import { stream, voiceResponse } from "@sautikit/node";
 
export function buildVoiceResponse() {
  return voiceResponse(
    stream({
      name: "fintech-agent",
      url: "wss://your-app.example.com/audio",
      track: "both_tracks",
      outputSamplingRate: 16000,
      statusCallback: "https://your-app.example.com/stream-status",
      statusEvents: ["stream-started", "stream-stopped", "stream-error"],
    })
  );
}

Sautikit connects to your url and requires the audio.drachtio.org WebSocket subprotocol. Reject the handshake if it is absent. Incoming messages are binary 16-bit LE PCM @ 16 kHz frames. The bridge relays audio both ways and lets Gemini call your account tools.

import { WebSocketServer } from "ws";
import { openGeminiSession } from "./gemini.js";
import { getAccount, recordPromiseToPay, startMpesaIntent } from "./accounts.js";
 
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});
 
wss.on("connection", async (sautiSocket, req) => {
  // Identify the borrower up front. The caller's number arrives in the
  // stream handshake metadata (or pass your own via headerMetadata/openMetadata).
  const caller = new URL(req.url, "http://x").searchParams.get("from");

  // Attach socket listeners before awaiting Gemini: frames that arrive while
  // the session is still opening are buffered instead of silently dropped.
  let gemini = null;
  const pending = [];

  // Call → Gemini: forward each inbound PCM frame (16-bit LE PCM @ 16 kHz).
  sautiSocket.on("message", (data, isBinary) => {
    if (!isBinary) return;
    if (gemini) gemini.sendAudio(data);
    else pending.push(data);
  });

  sautiSocket.on("close", () => gemini?.close());
  sautiSocket.on("error", () => gemini?.close());

  gemini = await openGeminiSession({
    // Give the model the account tools it may call mid-conversation.
    tools: {
      // Look up balance / loan by caller number or account id.
      async getAccount({ accountId } = {}) {
        return getAccount(accountId ?? caller);
      },
      // Persist a promise-to-pay the borrower agrees to on the call.
      async recordPromiseToPay({ accountId, amount, payBy }) {
        return recordPromiseToPay({ accountId: accountId ?? caller, amount, payBy });
      },
      // Kick off an M-Pesa STK push intent for an in-call payment.
      async startMpesaIntent({ accountId, amount }) {
        return startMpesaIntent({ accountId: accountId ?? caller, amount });
      },
    },
    // Gemini → call: write synthesized PCM back on the SAME Sautikit socket.
    onAudio: (pcmChunk) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) sautiSocket.send(pcmChunk);
    },
    // Barge-in: borrower started talking, stop the current reply.
    onInterrupt: () => {
      // Drop any queued outbound chunks so the agent stops talking over them.
    },
  });

  // The call may have ended while Gemini was opening; if so, clean up.
  if (sautiSocket.readyState !== sautiSocket.OPEN) {
    gemini.close();
    return;
  }
  // Otherwise send the buffered frames in arrival order.
  for (const frame of pending.splice(0)) gemini.sendAudio(frame);
});
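The onInterrupt handler above is left as a comment because the bridge writes each chunk straight to the socket, so there is nothing queued to drop. One way to give barge-in something to act on is to hold Gemini's chunks in a small buffer and release them one at a time. The class below is a sketch of that idea only; `OutboundAudioQueue` is a hypothetical helper, not part of Sautikit or the ws library, and how you pace `drainOne()` (a timer, or once per inbound frame) is left to you.

```javascript
// Sketch of a flushable outbound buffer for synthesized audio.
class OutboundAudioQueue {
  constructor(send) {
    this.send = send; // e.g. (chunk) => sautiSocket.send(chunk)
    this.chunks = [];
  }

  // Queue a chunk received from the model.
  push(chunk) {
    this.chunks.push(chunk);
  }

  // Send the oldest queued chunk, if any. Returns true when something was sent.
  drainOne() {
    const chunk = this.chunks.shift();
    if (chunk === undefined) return false;
    this.send(chunk);
    return true;
  }

  // Barge-in: discard everything not yet sent. Returns how many chunks were dropped.
  flush() {
    const dropped = this.chunks.length;
    this.chunks.length = 0;
    return dropped;
  }
}

const sent = [];
const queue = new OutboundAudioQueue((chunk) => sent.push(chunk));
queue.push(Buffer.from([1, 2]));
queue.push(Buffer.from([3, 4]));
queue.push(Buffer.from([5, 6]));
queue.drainOne();              // first chunk goes out to the call
const dropped = queue.flush(); // borrower interrupts: drop the rest
console.log(sent.length, dropped); // 1 2
```

With this in place, onAudio would call `queue.push(pcmChunk)` and onInterrupt would call `queue.flush()`. Note that flushing only affects audio still held by the bridge; chunks already written to the socket are out of its hands.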

Your tool functions are ordinary async code hitting your own core-banking or ledger service. A minimal accounts.js:

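// accounts.js: your ledger, your rules. These are illustrative shapes.
// `db` and `mpesa` stand in for your own data-access and payments clients;
// the module paths below are placeholders.
import { db } from "./db.js";
import { mpesa } from "./mpesa.js";

export async function getAccount(idOrMsisdn) {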
  const acct = await db.loans.findByCallerOrId(idOrMsisdn);
  if (!acct) return { found: false };
  return {
    found: true,
    accountId: acct.id,
    name: acct.borrowerName,
    balanceDue: acct.balanceDueMinor / 100, // KES
    dueDate: acct.dueDate,                   // e.g. "2026-07-10"
    status: acct.status,                     // current | overdue
  };
}
 
export async function recordPromiseToPay({ accountId, amount, payBy }) {
  const ptp = await db.ptp.create({ accountId, amountMinor: Math.round(amount * 100), payBy });
  return { ok: true, promiseId: ptp.id, amount, payBy };
}
 
export async function startMpesaIntent({ accountId, amount }) {
  // Trigger your M-Pesa STK push so the borrower repays on the same handset.
  const intent = await mpesa.stkPush({ accountId, amountMinor: Math.round(amount * 100) });
  return { ok: true, checkoutRequestId: intent.checkoutRequestId };
}
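Because the arguments to these functions come from the model, it is worth checking them before anything is written to the ledger. The function below is a minimal sketch of such a guard for recordPromiseToPay; the specific rules (amount must not exceed the balance, payBy must be a YYYY-MM-DD date that is not in the past) and the error codes are assumptions for illustration, and the date check compares against the current UTC date rather than the borrower's local date.

```javascript
// Sketch: reject implausible model-supplied arguments before touching the ledger.
// Rules and error codes here are hypothetical; adapt them to your product.
function validatePromiseToPay({ amount, payBy }, balanceDue, today = new Date()) {
  if (typeof amount !== "number" || !Number.isFinite(amount) || amount <= 0) {
    return { ok: false, error: "invalid_amount" };
  }
  if (amount > balanceDue) {
    return { ok: false, error: "amount_exceeds_balance" };
  }
  // Shape check only: expects "YYYY-MM-DD", matching dueDate in getAccount.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(String(payBy))) {
    return { ok: false, error: "invalid_date" };
  }
  // ISO dates compare correctly as strings.
  const todayIso = today.toISOString().slice(0, 10);
  if (payBy < todayIso) {
    return { ok: false, error: "date_in_past" };
  }
  return { ok: true };
}

const today = new Date("2026-07-10T00:00:00Z");
console.log(validatePromiseToPay({ amount: 500, payBy: "2026-07-15" }, 1200, today));
console.log(validatePromiseToPay({ amount: 5000, payBy: "2026-07-15" }, 1200, today));
```

Returning a structured error instead of throwing lets the tool result go back to the model, which can then ask the borrower for a corrected amount or date.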

The Gemini Live session is itself a WebSocket: you open it, send a setup message selecting a live model with audio in/out and your tool declarations, then stream PCM up and receive synthesized PCM down. Model IDs and exact field names move fast, so check the current ai.google.dev Live API docs for the live model ID, the tool-declaration schema, and the audio config. The pattern below is stable.

import WebSocket from "ws";
 
// Declare the tools the model may call. Names must match the `tools` map
// passed into openGeminiSession; verify the schema against ai.google.dev.
const toolDeclarations = [
  { name: "getAccount", description: "Look up balance and loan by account id or caller number." },
  { name: "recordPromiseToPay", description: "Record a promise-to-pay: accountId, amount, payBy date." },
  { name: "startMpesaIntent", description: "Start an M-Pesa STK push for an in-call payment." },
];
 
// The Live API returns audio inline as base64 under serverContent model-turn
// parts; verify the exact path against current ai.google.dev docs.
function extractInlineAudio(msg) {
  const parts = msg?.serverContent?.modelTurn?.parts ?? [];
  for (const p of parts) {
    if (p?.inlineData?.data) return p.inlineData.data; // base64 PCM
  }
  return null;
}
 
const GEMINI_URL =
  "wss://generativelanguage.googleapis.com/…?key=" + process.env.GEMINI_API_KEY;
 
export async function openGeminiSession({ tools, onAudio, onInterrupt }) {
  const ws = new WebSocket(GEMINI_URL);
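  // Reject if the socket fails before it opens, so the caller can handle it.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });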
 
  ws.send(
    JSON.stringify({
      setup: {
        model: "models/<current-live-model>", // ← from ai.google.dev
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: {
          parts: [
            {
              text:
                "You are a polite fintech support and payment-reminder agent. " +
                "Verify the borrower, look up their account with getAccount before " +
                "quoting figures, offer to take a payment or a promise-to-pay, and " +
                "never disclose account details before confirming identity.",
            },
          ],
        },
        // Declare the callable tools; the runtime returns tool-call requests.
        tools: [{ functionDeclarations: toolDeclarations }],
      },
    })
  );
 
  ws.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());
 
    const audioB64 = extractInlineAudio(msg);
    if (audioB64) onAudio(Buffer.from(audioB64, "base64"));
 
    if (msg?.serverContent?.interrupted) onInterrupt();
 
    // Tool call: run your function, send the result back to the model.
    for (const call of msg?.toolCall?.functionCalls ?? []) {
      const fn = tools[call.name];
      const result = fn ? await fn(call.args ?? {}) : { error: "unknown_tool" };
      ws.send(
        JSON.stringify({
          toolResponse: {
            functionResponses: [{ id: call.id, name: call.name, response: result }],
          },
        })
      );
    }
  });
 
  return {
    sendAudio(pcm) {
      ws.send(
        JSON.stringify({
          realtimeInput: {
            mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: pcm.toString("base64") }],
          },
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}

The load-bearing details: request AUDIO as a response modality, tag uploaded chunks as audio/pcm;rate=16000, declare your tools so the model can call them, and feed each tool result back as a toolResponse. Everything else is prompt and policy.

Some conversations should not be handled by a bot: a dispute, a hardship case, a borrower who asks for a person. Detect that (a dedicated escalate tool the model can call, or a phrase trigger in your bridge) and end the stream by returning a <Dial> to a human agent.

// When your escalate tool fires, respond to Sautikit's next webhook fetch
// with a Dial instead of a Stream, connecting the borrower to an agent.
app.post("/voice/escalate", (req, res) => {
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say language="en-KE">Let me connect you to an agent.</Say>
  <Dial record="true">
    <Number>+254711222333</Number>
  </Dial>
</Response>`;
  res.set("Content-Type", "application/xml");
  res.send(xml);
});

record="true" on the <Dial> keeps a recording of the human leg. For collections, recording is not just QA: a timestamped, recorded interaction is exactly the kind of evidence a regulated lender wants on file when a borrower later disputes what was agreed. Store recordings against the clientRequestId / loan ID so you can retrieve the right call in seconds.
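The phrase trigger mentioned earlier can be as simple as a substring match. The sketch below assumes your bridge has a text transcript of the borrower's latest turn, which this tutorial does not set up, and the phrase list is a hypothetical starting point rather than a vetted one.

```javascript
// Hypothetical phrase list; tune it for your languages and product.
const ESCALATION_PHRASES = [
  "speak to a person",
  "human agent",
  "dispute",
  "cannot pay",
  "can't pay",
];

// Returns true when the borrower's words suggest a human should take over.
function shouldEscalate(transcriptText) {
  const text = transcriptText.toLowerCase();
  return ESCALATION_PHRASES.some((phrase) => text.includes(phrase));
}

console.log(shouldEscalate("I want to dispute this charge")); // true
console.log(shouldEscalate("When is my payment due?"));       // false
```

A match like this is blunt, so it works best as a backstop next to the model's own escalate tool rather than as the only signal.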

Outbound reminders and collections are regulated. Build consent and lawful recording in from the start, not as an afterthought.

Consent to call. Only dial borrowers who agreed to phone contact at onboarding. Keep opt-in on record per account.
Opt-out and quiet hours. Honour "do not call" immediately and suppress that number on every future run. Do not dial outside lawful calling hours, and never contact a borrower's third-party contacts.
Lawful recording. Disclose that the call is recorded, record for a lawful purpose (evidence, dispute resolution, QA), and retain and secure recordings per your data-protection obligations. In several African markets, data-protection law and financial-conduct rules both apply, so check the regime where your borrowers are.
Honesty and tone. The agent must identify itself and your business, state that it is an automated assistant, and stay non-harassing. Debt-collection harassment is unlawful in a growing number of jurisdictions.

Treat these as hard requirements. The engineering above is the easy part; staying compliant is what keeps the channel open.
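One way to enforce the consent, opt-out, and quiet-hours rules is a gate that every account passes through before it reaches POST /v1/calls. The sketch below is illustrative only: the account fields (`phoneConsent`, `optedOut`), the time zone, and the 08:00 to 18:00 window are hypothetical placeholders, not legal guidance, and must be replaced with the rules that apply where your borrowers are.

```javascript
// Hypothetical policy values; replace with the rules that apply in your market.
const POLICY = { timeZone: "Africa/Nairobi", earliestHour: 8, latestHour: 18 };

// Hour of day (0-23) for `date` in the given IANA time zone.
function localHour(date, timeZone) {
  const parts = new Intl.DateTimeFormat("en-GB", {
    hour: "2-digit",
    hourCycle: "h23",
    timeZone,
  }).formatToParts(date);
  return Number(parts.find((p) => p.type === "hour").value);
}

// Decide whether an account may be dialed right now, with a reason when not.
function mayDial(account, now = new Date(), policy = POLICY) {
  if (!account.phoneConsent) return { allowed: false, reason: "no_consent" };
  if (account.optedOut) return { allowed: false, reason: "opted_out" };
  const hour = localHour(now, policy.timeZone);
  if (hour < policy.earliestHour || hour >= policy.latestHour) {
    return { allowed: false, reason: "quiet_hours" };
  }
  return { allowed: true };
}

// 06:00 UTC is 09:00 in Nairobi (UTC+3), inside the example window.
const morning = new Date(Date.UTC(2026, 6, 10, 6, 0));
console.log(mayDial({ phoneConsent: true, optedOut: false }, morning));
```

Logging the returned reason for every skipped account gives you an audit trail showing why a borrower was not called on a given run.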

Voice bills per second from the moment the call connects, in KES. Inbound support calls are free (KES 0); outbound reminder calls bill at KES 3.00/min (KES 0.05/sec), so a 25-second reminder costs about KES 1.25, a fraction of the cost of a call placed by a human agent. Gemini usage is billed separately by Google on your own account; Sautikit does not mark it up or meter it. See /pricing for the source of truth.
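The arithmetic is easy to check in code. This sketch uses the outbound rate quoted above and works in cents to avoid floating-point drift; the campaign size in the last line is an invented example, and /pricing remains the authority on actual rates.

```javascript
// KES 0.05 per second (KES 3.00 per minute), held as integer cents.
const OUTBOUND_CENTS_PER_SECOND = 5;

// Cost in KES of one outbound call lasting `seconds` whole seconds.
function outboundCostKes(seconds) {
  return (seconds * OUTBOUND_CENTS_PER_SECOND) / 100;
}

console.log(outboundCostKes(25));        // 1.25
console.log(outboundCostKes(60));        // 3
console.log(1000 * outboundCostKes(25)); // 1250 for a hypothetical 1,000-call run
```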

Can one webhook serve both inbound support and outbound reminders?

Yes. Point the number's routing_url at your /voice webhook. Inbound calls fetch it on connect; answered outbound calls (placed via POST /v1/calls) fetch the same URL. Branch on direction or on clientRequestId if you want a different system prompt per job.
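A minimal sketch of that branch, assuming outbound reminder calls are placed with a clientRequestId that starts with "reminder-" (as in the reminder-42 example) and that your webhook can read the ID back. How Sautikit delivers it to the webhook is not shown in this tutorial, so treat the input as an assumption; the prompt texts are placeholders.

```javascript
// Placeholder prompts; write real ones for each job.
const SYSTEM_PROMPTS = {
  support: "You are a polite fintech support agent. Answer account questions.",
  reminder: "You are a polite payment-reminder agent. State the amount and due date.",
};

// Hypothetical convention: reminder calls carry a "reminder-" clientRequestId.
// Anything else, including inbound calls with no ID, is treated as support.
function pickJob(clientRequestId) {
  const isReminder =
    typeof clientRequestId === "string" && clientRequestId.startsWith("reminder-");
  return isReminder ? "reminder" : "support";
}

console.log(pickJob("reminder-42")); // reminder
console.log(pickJob(undefined));     // support
console.log(SYSTEM_PROMPTS[pickJob("reminder-42")]);
```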

How does the agent know which borrower is on the line?

For inbound, the caller's number arrives in the request and stream handshake, so you look up the account by From. For outbound, you already know the loan ID: pass it through clientRequestId and, if needed, openMetadata on the <Stream> so your bridge can bind the audio session to the account.

Do I need a separate endpoint for streaming?

No. You reuse the same webhook as any Sautikit call. The difference is that you return raw <Stream> XML with Content-Type: application/xml instead of the JSON actions array, and your WebSocket server must accept the audio.drachtio.org subprotocol during the handshake.

Build an AI call center on Sautikit: the same bridge, scaled to a full inbound desk.
Bridge Gemini Live to Sautikit's Stream verb: the flagship real-time bridge this tutorial builds on.
AI support agent use case: where an inbound voice agent fits your support stack.
Outbound lead qualification use case: the outbound-dial pattern, applied to sales.
Voice actions reference: every verb, including <Stream> and <Dial> attributes.
