Generic “instrument your FastAPI” advice doesn’t help. Here are the five signals that actually predict and explain voice-agent incidents.
“Use Prometheus” is not a metrics strategy
Most metrics tutorials end at “expose /metrics and scrape it.” That’s the easy part. The hard part is picking which metrics, with which labels, that will actually answer the questions your on-call asks at 2am:
- Why did call volume drop?
- Which tool is the LLM calling that’s failing?
- Is the slowness in our code or in n8n?
- How many calls just… end without an explanation?
For a voice agent, five metric families cover ~90% of those questions.
The shortlist
| Metric | Type | Labels | What it answers |
|---|---|---|---|
| voxflow_calls_total | Counter | direction (inbound/outbound) | Volume, day-over-day deltas, “are we still taking calls?” |
| voxflow_call_disconnects_total | Counter | reason (normal/idle_timeout/error) | Quality-of-service. The unexplained drops live here. |
| voxflow_tool_invocations_total | Counter | tool, outcome (ok/invalid_params/unknown/error) | Which LLM tool is broken? Is the model hallucinating tool names? |
| voxflow_n8n_requests_total | Counter | outcome (2xx/4xx/5xx/timeout/transport_error) | Is our backend integration healthy? |
| voxflow_n8n_request_duration_seconds | Histogram | (none) | Latency for the dominant downstream; feeds p95 SLO alerts. |
Five. That’s the whole list. Resist the urge to add more before you’ve used these.
Why these labels and no others
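Don’t put call_sid on a counter. Every distinct label value creates a new time series, so an unbounded label like call_sid grows without limit and eventually exhausts Prometheus’s memory. The call_sid belongs in logs (see the contextvars post), not in metrics.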
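Do put outcome on every counter that can fail. A single counter with {outcome="ok"|"error"} lets you compute error rate as sum(rate(...{outcome="error"}[5m])) / sum(rate(...[5m])), the canonical Google SRE “fraction of bad requests” recipe. The sum() on both sides matters: without it, PromQL matches series label-for-label, so the error series is only ever divided by itself.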
Do split disconnects by reason. normal (call ended) and idle_timeout (Twilio went silent for 60s) and error (exception in handler) need different alerts. Lumping them together hides the only one that actually means something is wrong.
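One way to keep those label values bounded is to collapse every raw result into a fixed set before it reaches a counter. The sketch below is a minimal illustration, not the post's actual code: the helper names `n8n_outcome` and `disconnect_reason` are hypothetical, `TimeoutError` stands in for whatever timeout exception your HTTP client raises, and treating every status outside 2xx/4xx as `5xx` is an assumption.

```python
from typing import Optional


def n8n_outcome(status: Optional[int] = None,
                exc: Optional[BaseException] = None) -> str:
    """Collapse an HTTP result into one of the five outcome label values."""
    if exc is not None:
        # TimeoutError is a stand-in for the HTTP client's timeout exception.
        return "timeout" if isinstance(exc, TimeoutError) else "transport_error"
    if status is not None and 200 <= status < 300:
        return "2xx"
    if status is not None and 400 <= status < 500:
        return "4xx"
    # Assumption: anything else is counted as a server-side failure.
    return "5xx"


def disconnect_reason(exc: Optional[BaseException], idle_timed_out: bool) -> str:
    """Pick exactly one of normal / idle_timeout / error for a finished call."""
    if exc is not None:
        return "error"
    return "idle_timeout" if idle_timed_out else "normal"
```

Because both helpers can only return values from the table, no code path can leak a raw status code or exception message into a label.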
Wiring it up
The whole metrics module is ~40 lines:
# app/core/metrics.py
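A minimal sketch of what such a module could look like, assuming the `prometheus_client` library. The metric names, types, and labels come from the shortlist; the Python variable names, help strings, bucket boundaries, and the dedicated registry are assumptions made for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps this sketch self-contained; the library's
# default global registry works too in a single-process app.
REGISTRY = CollectorRegistry()

CALLS = Counter(
    "voxflow_calls_total", "Calls handled, by direction.",
    ["direction"], registry=REGISTRY,
)
DISCONNECTS = Counter(
    "voxflow_call_disconnects_total", "Call disconnects, by reason.",
    ["reason"], registry=REGISTRY,
)
TOOL_INVOCATIONS = Counter(
    "voxflow_tool_invocations_total", "LLM tool invocations, by tool and outcome.",
    ["tool", "outcome"], registry=REGISTRY,
)
N8N_REQUESTS = Counter(
    "voxflow_n8n_requests_total", "Requests sent to n8n, by outcome.",
    ["outcome"], registry=REGISTRY,
)
N8N_LATENCY = Histogram(
    "voxflow_n8n_request_duration_seconds", "n8n request latency in seconds.",
    # Assumed buckets, chosen to bracket a 5s p95 threshold; tune to your traffic.
    buckets=(0.1, 0.25, 0.5, 1, 2.5, 5, 10),
    registry=REGISTRY,
)
```

Call sites then use `CALLS.labels(direction="inbound").inc()` and `N8N_LATENCY.observe(seconds)`.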
And the endpoint:
Instrumentation points (where, exactly)
You only have to touch four places:
# api/endpoints/calls.py, at the top of /incoming-call:
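As a rough sketch of what four instrumentation sites could look like, assuming `prometheus_client` and framework-free functions. Everything except the metric names and label values is hypothetical: the function names, the `KNOWN_TOOLS` registry, the `send` callable, reporting hallucinated tool names under `tool="unknown"`, and treating `TypeError` as invalid parameters.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

REGISTRY = CollectorRegistry()
CALLS = Counter("voxflow_calls_total", "Calls.", ["direction"], registry=REGISTRY)
DISCONNECTS = Counter("voxflow_call_disconnects_total", "Disconnects.", ["reason"], registry=REGISTRY)
TOOLS = Counter("voxflow_tool_invocations_total", "Tool calls.", ["tool", "outcome"], registry=REGISTRY)
N8N_REQUESTS = Counter("voxflow_n8n_requests_total", "n8n requests.", ["outcome"], registry=REGISTRY)
N8N_LATENCY = Histogram("voxflow_n8n_request_duration_seconds", "n8n latency.", registry=REGISTRY)

# Hypothetical tool registry: tool name -> callable taking keyword params.
KNOWN_TOOLS = {"lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"}}


def on_incoming_call() -> None:
    # Site 1: top of the /incoming-call handler.
    CALLS.labels(direction="inbound").inc()


def run_call_session(handle_audio) -> None:
    # Site 2: wherever the media loop exits; exactly one reason per call.
    reason = "normal"
    try:
        handle_audio()
    except TimeoutError:
        reason = "idle_timeout"  # expected ending, so it is not re-raised
    except Exception:
        reason = "error"
        raise
    finally:
        DISCONNECTS.labels(reason=reason).inc()


def dispatch_tool(name: str, params: dict):
    # Site 3: the function that executes LLM tool calls.
    tool = KNOWN_TOOLS.get(name)
    if tool is None:
        # Assumption: hallucinated names share one label value, keeping cardinality bounded.
        TOOLS.labels(tool="unknown", outcome="unknown").inc()
        return None
    try:
        result = tool(**params)
    except TypeError:
        # Crude assumption: a TypeError means the model passed bad parameters.
        TOOLS.labels(tool=name, outcome="invalid_params").inc()
        return None
    except Exception:
        TOOLS.labels(tool=name, outcome="error").inc()
        raise
    TOOLS.labels(tool=name, outcome="ok").inc()
    return result


def call_n8n(send) -> int:
    # Site 4: the n8n client wrapper. `send` is a hypothetical callable
    # that performs the request and returns the HTTP status code.
    outcome = "transport_error"
    start = time.perf_counter()
    try:
        status = send()
        outcome = {2: "2xx", 4: "4xx"}.get(status // 100, "5xx")
        return status
    except TimeoutError:
        outcome = "timeout"
        raise
    finally:
        N8N_LATENCY.observe(time.perf_counter() - start)
        N8N_REQUESTS.labels(outcome=outcome).inc()
```

Putting the increments in `finally` blocks is the design choice that matters: a call or request that dies on an exception still gets counted, which is what keeps the “unexplained” bucket empty.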
That’s it. No middleware, no wrapper functions, no metaclasses.
Five queries that answer real questions
Drop these in a Grafana dashboard and you’re done:
Call volume (1-minute resolution):
sum(rate(voxflow_calls_total[1m])) by (direction)
Tool error rate (per tool):
sum(rate(voxflow_tool_invocations_total{outcome!="ok"}[5m])) by (tool)
n8n p95 latency:
histogram_quantile(0.95,
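To see why bucket boundaries matter for this query, here is a small Python sketch of the interpolation that Prometheus documents for histogram_quantile: find the bucket containing the target rank, then interpolate linearly inside it. It is an illustration of the idea, not Prometheus's implementation, and the bucket counts in the usage note are made up.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Past the last finite bucket: the highest finite bound is the best guess.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

With the made-up buckets `[(0.5, 50), (1.0, 80), (2.5, 95), (5.0, 99), (math.inf, 100)]`, the p95 estimate lands on the 2.5s boundary and anything above p99 is capped at 5.0s. The estimate can never be more precise than the bucket it falls into, so put a boundary at or near your alert threshold.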
Disconnect breakdown (the gold one):
sum(rate(voxflow_call_disconnects_total[5m])) by (reason)
Calls in flight (an estimate derived from counters: rate() is already per-second, so multiply it by the average call duration in seconds):
sum(rate(voxflow_calls_total[1m])) * <avg_call_duration_seconds>
…or instrument an actual Gauge if you need exact concurrency.
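If you do want exact concurrency, a minimal sketch with `prometheus_client` could look like this. The metric name `voxflow_calls_in_flight` and the `handle_call` wrapper are hypothetical and not part of the five-metric shortlist.

```python
from prometheus_client import CollectorRegistry, Gauge

REGISTRY = CollectorRegistry()

# Hypothetical metric name, used only for this illustration.
CALLS_IN_FLIGHT = Gauge(
    "voxflow_calls_in_flight", "Calls currently connected.", registry=REGISTRY
)


def handle_call(session) -> None:
    # track_inprogress() increments on entry and decrements on exit,
    # even if the session raises.
    with CALLS_IN_FLIGHT.track_inprogress():
        session()
```

Note that a gauge is per process: with several worker processes you would need to sum it across instances in the query.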
Alerts that won’t page you for nothing
Two are enough to start:
- alert: VoxFlowToolFailureRateHigh
The thresholds: 10% tool failures sustained for 10 minutes, or a 5s n8n p95 sustained for 10 minutes. Both correspond to “users are noticing right now.”
What you’ll be tempted to add (and shouldn’t)
- Per-LLM-token counters. These belong in your LLM provider’s dashboard, not yours.
- Per-prompt-version metrics. Use logs + traces.
- Audio-quality histograms. These belong in Twilio Voice Insights.
Every metric is a permanent commitment. Start with five.
Takeaway
For a voice agent, the five metrics worth instrumenting are: calls (by direction), disconnects (by reason), tool invocations (by tool + outcome), n8n requests (by outcome), and n8n latency. Five counters/histograms, four instrumentation sites, two alerts. Anything else can wait until you have a question those don’t answer.