Twilio sometimes goes quiet without sending stop. Your handler waits forever, your worker leaks, and your bill grows. Here’s the 15-line fix.
The bug you’ll only see in production
In every Twilio Media Stream tutorial, the receive loop looks like this:
    while True:
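Filled out, such a loop typically looks something like the sketch below. It assumes a FastAPI/Starlette-style `websocket.receive_text()` and Twilio's JSON messages with an `event` field; the function name `naive_receive_loop` is made up for illustration.

```python
import asyncio
import json


async def naive_receive_loop(websocket) -> int:
    """Read Twilio Media Stream events until a "stop" event arrives.

    `websocket` is assumed to expose an awaitable `receive_text()`.
    Returns the number of media frames seen.
    """
    media_frames = 0
    while True:
        # No upper bound on this await: if Twilio goes silent, it never returns.
        raw = await websocket.receive_text()
        event = json.loads(raw)
        if event.get("event") == "media":
            media_frames += 1  # real code would forward the audio payload here
        elif event.get("event") == "stop":
            return media_frames
```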
This works for 99% of calls. The 1% kills you.
What actually happens at scale:
- Network partition between Twilio and your pod: TCP keepalive eventually trips, but not for minutes (hours, with OS defaults).
- Twilio container restart: they drop the stream without sending stop.
- Caller hangs up in a weird state: Twilio sometimes sends nothing.
- Mobile dies in a tunnel: the connection just stops emitting bytes.
In all four cases, await receive_text() blocks forever. Your worker is now:
- Tying up an asyncio task
- Holding an open Ultravox WebSocket (which costs money)
- Holding session state in memory
- Counted as a “live” connection by your load balancer
Multiply your daily call volume by the fraction of calls that hit one of these states. After a week, your replicas need restarting “because of a memory leak.”
The fix: wrap the receive in a timeout
    WS_IDLE_TIMEOUT_SECONDS = 60  # tune to your call profile
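A minimal sketch of the wrapped receive, not the exact production code. It assumes a `websocket` object with an awaitable `receive_text()`. To stay self-contained it defines a local `WebSocketDisconnect` stand-in for Starlette's exception and uses a plain `Counter` in place of the labeled Prometheus metric. The helper name `receive_with_idle_timeout` and the logger name are hypothetical.

```python
import asyncio
import logging
import time
from collections import Counter

logger = logging.getLogger("voxflow.ws")  # hypothetical logger name

WS_IDLE_TIMEOUT_SECONDS = 60  # tune to your call profile

# Stand-in for a labeled counter such as
# voxflow_call_disconnects_total{reason=...}; illustration only.
disconnects_by_reason: Counter = Counter()


class WebSocketDisconnect(Exception):
    """Local stand-in for starlette.websockets.WebSocketDisconnect."""

    def __init__(self, code: int = 1000, reason: str = "") -> None:
        super().__init__(code, reason)
        self.code = code
        self.reason = reason


async def receive_with_idle_timeout(
    websocket, timeout: float = WS_IDLE_TIMEOUT_SECONDS
) -> str:
    """Return the next text frame, or raise WebSocketDisconnect on idle."""
    started = time.monotonic()
    try:
        return await asyncio.wait_for(websocket.receive_text(), timeout=timeout)
    except asyncio.TimeoutError:
        idle_for = time.monotonic() - started
        # Label the disconnect so zombies are distinguishable from call ends.
        disconnects_by_reason["idle_timeout"] += 1
        # Log how long we actually waited, plus the configured limit.
        logger.warning(
            "idle timeout: no frame for %.1fs (limit %.1fs)", idle_for, timeout
        )
        # Synthetic disconnect: reuse the normal teardown path, no cancel().
        # 1006 is the "abnormal closure" code from RFC 6455.
        raise WebSocketDisconnect(code=1006, reason="idle_timeout") from None
```

In the receive loop, `await websocket.receive_text()` then becomes `await receive_with_idle_timeout(websocket)`.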
Three things to notice:
1. We raise WebSocketDisconnect, not cancel(). This is the key insight. The whole call lifecycle is wrapped in a TaskGroup that already knows how to tear down on WebSocketDisconnect. Reusing that path means cleanup runs uniformly — sockets close, transcripts flush, sessions get popped — whether the disconnect was natural or synthetic. One teardown code path, zero special cases.
2. We label the disconnect. This is why voxflow_call_disconnects_total{reason="idle_timeout"} from the metrics post matters. Without the label, your “disconnect” graph blurs together call ends and zombies. With it, idle timeouts become a dedicated SLI: a sudden spike means an upstream network problem.
3. We log the timeout duration. When you tune WS_IDLE_TIMEOUT_SECONDS later, you want to see which calls actually crossed the line. A bare “timeout” log line is useless six months in.
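To illustrate the first point, here is a simplified sketch of a single teardown path. It swaps the TaskGroup for a plain `try`/`finally` to keep the example short. `handle_call`, the cleanup step names, and the local `WebSocketDisconnect` stand-in are all hypothetical.

```python
import asyncio


class WebSocketDisconnect(Exception):
    """Local stand-in for starlette.websockets.WebSocketDisconnect."""


async def handle_call(receive_loop, cleanup_log: list) -> str:
    """Run one call; cleanup is identical for natural and synthetic disconnects."""
    reason = "completed"
    try:
        await receive_loop()
    except WebSocketDisconnect as exc:
        # Natural hang-ups and idle timeouts both arrive here.
        reason = str(exc) or "client_disconnect"
    finally:
        # One teardown path; steps are recorded here instead of executed.
        cleanup_log.extend(["close_sockets", "flush_transcript", "pop_session"])
    return reason
```

Because the idle timeout raises the same exception type as a real disconnect, there is no second cleanup branch to keep in sync.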
How to pick the threshold
Your timeout needs to be:
- Longer than the longest legitimate silence in a call. During a 30-second “let me look that up” hold, Twilio still ships periodic media frames (silence is still bytes), so this is rarely the limit.
- Shorter than your replica’s lifetime budget. If you scale down pods after 1 hour of “no new calls,” your zombies must die before then or they’ll prevent scale-down.
- Longer than your ping interval, if you have one. Twilio doesn’t, but if you add app-level keepalives, the timeout must be >= 2× your keepalive interval.
60 seconds is the sweet spot for voice. 30 seconds is too aggressive (you’ll kill real calls during long backend lookups). 5 minutes is too lenient (zombies linger).
Make it an env var (WS_IDLE_TIMEOUT_SECONDS) and tune from metrics, not from gut feel.
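One way to read the setting, sketched under assumptions: the helper `load_idle_timeout` and its `keepalive_interval` parameter are hypothetical, and the 2× check simply encodes the keepalive rule from the list above.

```python
import os
from typing import Mapping, Optional

DEFAULT_IDLE_TIMEOUT_SECONDS = 60.0


def load_idle_timeout(
    env: Mapping[str, str] = os.environ,
    keepalive_interval: Optional[float] = None,
) -> float:
    """Read WS_IDLE_TIMEOUT_SECONDS, falling back to the 60 s default."""
    raw = env.get("WS_IDLE_TIMEOUT_SECONDS", "")
    timeout = float(raw) if raw.strip() else DEFAULT_IDLE_TIMEOUT_SECONDS
    if timeout <= 0:
        raise ValueError("WS_IDLE_TIMEOUT_SECONDS must be positive")
    # If app-level keepalives exist, require at least 2x their interval.
    if keepalive_interval is not None and timeout < 2 * keepalive_interval:
        raise ValueError(
            f"idle timeout {timeout}s is below 2x keepalive interval "
            f"({keepalive_interval}s)"
        )
    return timeout
```

Failing fast at startup on a bad value is a design choice here; clamping to a safe range would be a reasonable alternative.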
Why not just rely on TCP keepalive?
OS-level TCP keepalive defaults are:
- Linux: 2 hours (7200 s) idle before the first probe, then up to 9 probes 75 s apart, and only if the socket has SO_KEEPALIVE enabled.
- Most cloud LBs: drop idle TCP at 60s–10min.
Two problems:
- 2 hours is much too slow to catch the zombie before it hurts you.
- The LB will drop your TCP socket without telling the application layer, so your receive_text() still blocks until the next write attempt errors out.
Application-layer timeout is the only one that’s both fast and reliable.
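To put numbers on the first problem, this small calculation compares worst-case detection time under the Linux defaults quoted above with the 60-second application timeout. The variable names are illustrative.

```python
# Linux defaults quoted above: idle time, probe count, probe interval.
KEEPALIVE_IDLE_SECONDS = 7200
KEEPALIVE_PROBES = 9
KEEPALIVE_PROBE_INTERVAL_SECONDS = 75

APP_IDLE_TIMEOUT_SECONDS = 60

# The kernel declares the peer dead only after every probe goes unanswered.
tcp_worst_case_seconds = (
    KEEPALIVE_IDLE_SECONDS + KEEPALIVE_PROBES * KEEPALIVE_PROBE_INTERVAL_SECONDS
)

print(f"TCP keepalive worst case: {tcp_worst_case_seconds} s "
      f"({tcp_worst_case_seconds / 60:.2f} min)")
print(f"Application timeout:      {APP_IDLE_TIMEOUT_SECONDS} s")
```

That is 7875 seconds, a little over two hours and eleven minutes, against one minute for the application-layer check.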
Why not WebSocket ping/pong?
You could. The websockets library on the Ultravox side does (we set ping_interval=20.0). For Twilio’s Media Stream, the protocol doesn’t standardize ping behavior, so the simpler approach — if a frame doesn’t arrive in N seconds, we’re done — is more portable and matches the actual failure mode you care about (no media = no call).
The graph that proves it works
After deploying the timeout, you should see a flat-ish line on:
    sum(rate(voxflow_call_disconnects_total{reason="idle_timeout"}[5m]))
A non-zero baseline is normal (zombies exist). A sudden spike means an upstream problem — Twilio incident, your TLS termination flaking, ISP routing event. Either way it’s noticed in 5 minutes instead of “next time the on-call notices replicas are slow.”
Takeaway
Every long-lived WebSocket loop needs an idle timeout, every timeout needs to feed the normal disconnect path, and every disconnect needs a labeled metric. Fifteen lines of code prevent a class of production memory leaks that’s invisible in dev and obvious in prod. Add it on day one, not after the first incident.