End-to-end tutorial using Twilio, Ultravox, and FastAPI
What we’re building
A phone number that, when called, is answered by an AI agent that can:
- Greet the caller and verify their identity
- Answer questions from a knowledge base
- Schedule meetings on a calendar
- Hand off the transcript to a workflow engine when the call ends
No human in the loop. Real audio, real conversation, real tool calls.
The three actors
flowchart LR
Caller([Caller])
Twilio[Twilio]
VoxFlow[VoxFlow<br/>FastAPI]
Ultravox[Ultravox AI]
Caller -- PSTN --> Twilio
Twilio -- HTTPS webhook --> VoxFlow
VoxFlow -- WSS --> Ultravox
Twilio <-- Media Stream WSS --> VoxFlow
- Twilio owns the phone number and the audio pipe.
- Ultravox is a hosted speech-to-speech AI (think GPT-4o realtime but multi-vendor).
- VoxFlow is the FastAPI middleware that glues them together.
The call lifecycle
- Inbound HTTP: Twilio POSTs to `/incoming-call` when a call arrives.
- TwiML response: VoxFlow returns XML instructing Twilio to open a WebSocket to `/media-stream`.
- Dual WebSockets: VoxFlow now holds two WebSockets simultaneously, one to Twilio (caller audio in/out) and one to Ultravox (AI audio in/out).
- Audio relay: Bytes flow in both directions with format conversion (μ-law ↔ PCM, see post 2).
- Tool calls: When the AI decides to call a tool (`schedule_meeting`, `queryCorpus`), VoxFlow validates the params and dispatches.
- Cleanup: When either side closes, the session is popped and the transcript is shipped to n8n.
Minimum code to receive a call
Two outbound calls (n8n for the greeting, Ultravox to provision an AI session) followed by a TwiML response. That’s the entire handshake.
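A minimal sketch of that handshake, kept free of framework imports so it runs on its own. `fetch_greeting` and `create_ultravox_session` are hypothetical async helpers standing in for the n8n and Ultravox requests, and `SESSIONS` and `public_host` are illustrative names. `CallSid` and `From` are fields Twilio includes in its voice webhook, and `<Connect><Stream>` is the TwiML that requests a bidirectional media stream.

```python
from xml.sax.saxutils import quoteattr

# In-memory session store keyed by Twilio CallSid (illustrative only).
SESSIONS: dict[str, dict] = {}


def build_twiml(stream_url: str) -> str:
    """Return TwiML telling Twilio to open a bidirectional media stream."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f"<Stream url={quoteattr(stream_url)} />"
        "</Connect></Response>"
    )


async def handle_incoming_call(form, public_host, fetch_greeting,
                               create_ultravox_session) -> str:
    """Handle the body of POST /incoming-call and return the TwiML reply."""
    call_sid = form["CallSid"]
    caller = form.get("From", "unknown")

    # Outbound call 1: ask the workflow engine for a greeting.
    greeting = await fetch_greeting(caller)
    # Outbound call 2: provision an AI session and get its WebSocket URL.
    ultravox_url = await create_ultravox_session(greeting)

    # Remember the session so the media-stream handler can find it later.
    SESSIONS[call_sid] = {
        "caller": caller,
        "ultravox_url": ultravox_url,
        "transcript": [],
    }
    return build_twiml(f"wss://{public_host}/media-stream")
```

In a FastAPI route you would pass `await request.form()` as `form` and return the string wrapped in `Response(content=twiml, media_type="application/xml")`.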
What makes this hard
The naive demo is easy. Production is hard because:
- Two WebSockets, one fail-fast lifecycle: if either closes you must close the other. See post 3 on `asyncio.TaskGroup`.
- Audio format mismatch: Twilio sends 8 kHz μ-law, Ultravox expects 16-bit PCM. Post 2.
- LLM tools can hallucinate parameters: never trust the JSON. Post 7.
- One prompt is not enough: verification, main conversation, and summary need distinct system prompts. Post 4.
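To make "never trust the JSON" concrete, here is a sketch of validating arguments for `schedule_meeting` before dispatching. The parameter names (`attendee_email`, `start`, `duration_minutes`) and the allowed duration range are assumptions for illustration; the real tool schema may differ. It uses only the standard library, though a schema library would work just as well.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class ToolArgumentError(ValueError):
    """The model produced arguments we refuse to act on."""


@dataclass(frozen=True)
class ScheduleMeetingArgs:
    attendee_email: str
    start: datetime
    duration_minutes: int


def parse_schedule_meeting(raw_json: str,
                           now: Optional[datetime] = None) -> ScheduleMeetingArgs:
    """Parse and validate model-supplied arguments (hypothetical schema)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ToolArgumentError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolArgumentError("arguments must be a JSON object")

    email = data.get("attendee_email")
    if not isinstance(email, str) or "@" not in email:
        raise ToolArgumentError("attendee_email is missing or malformed")

    try:
        start = datetime.fromisoformat(data.get("start"))
    except (TypeError, ValueError) as exc:
        raise ToolArgumentError("start must be an ISO 8601 string") from exc
    if start.tzinfo is None:
        raise ToolArgumentError("start must include a UTC offset")
    if start <= (now or datetime.now(timezone.utc)):
        raise ToolArgumentError("start must be in the future")

    duration = data.get("duration_minutes")
    # bool is a subclass of int, so reject it explicitly.
    if (isinstance(duration, bool) or not isinstance(duration, int)
            or not 5 <= duration <= 240):
        raise ToolArgumentError("duration_minutes must be an integer from 5 to 240")

    return ScheduleMeetingArgs(email, start, duration)
```

On a `ToolArgumentError`, one reasonable choice is to return the error message to the model as the tool result so it can ask the caller again, rather than booking anything.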
What you need to follow along
| Service | Cost to try | Purpose |
|---|---|---|
| Twilio | Free trial credit | Phone number + audio streaming |
| Ultravox | Free tier | Speech-to-speech AI |
| n8n | Self-host or cloud free | Workflow engine for greetings + transcripts |
| ngrok | Free | Expose your localhost to Twilio webhooks |
Clone the repo, copy .env.example to .env, fill in the keys, and run:
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Call your Twilio number. The AI picks up.