Putting a voice model on a real phone number is no longer just an AI demo exercise. OpenAI now documents low-latency live audio sessions, SIP call ingress, and server-side sideband control in its Realtime stack. Twilio documents the WebSocket bridge, greeting behavior, language settings, interruptions, and session callbacks through Conversation Relay. Once that stack is live on a published number, the real question shifts. It is no longer only whether the model can talk. It is whether the call can be answered, directed, handed off, and accounted for without losing the caller along the way.
What changed technically
OpenAI's Realtime and voice-agent guides make a useful distinction between live speech-to-speech sessions and chained voice pipelines. If the conversation needs to feel immediate, with barge-in, natural turn-taking, and realtime tool use, speech-to-speech is the better fit. If the business needs tighter control over transcription, text reasoning, and speech output, a chained pipeline is often the safer choice. That matters because most companies do not need a general-purpose talking assistant on their phone line. They need a system that can greet callers, understand intent, ask a few qualifying questions, and route the call cleanly.
The more important shift is that OpenAI now documents the phone entry path directly. The SIP guide shows how to connect a real number through a SIP provider such as Twilio, point the trunk at OpenAI's SIP endpoint, receive a realtime.call.incoming webhook, and use the returned call_id to accept, reject, monitor, refer, or hang up the call. That moves the setup out of browser-demo territory. There is now an official path for real inbound calls, plus the control points needed to decide what should happen before the model ever answers.
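As a rough sketch of that webhook step, the Python below pulls the call_id out of an incoming-call event and builds the URL for a follow-up action. The event shape (a data object holding call_id) and the /v1/realtime/calls/{call_id}/{action} path pattern are assumptions drawn from OpenAI's SIP guide and should be checked against the current docs. The function names are illustrative, and no network request is made.

```python
import json
from typing import Optional

# Assumed base path for call control; confirm against the SIP guide.
API_BASE = "https://api.openai.com/v1/realtime/calls"
CALL_ACTIONS = {"accept", "reject", "refer", "hangup"}


def parse_incoming_call(raw_body: str) -> Optional[str]:
    """Return the call_id from a realtime.call.incoming event, else None."""
    event = json.loads(raw_body)
    if event.get("type") != "realtime.call.incoming":
        return None
    # Assumed layout: the call_id sits under the event's "data" object.
    return event.get("data", {}).get("call_id")


def call_action_url(call_id: str, action: str) -> str:
    """Build the control URL for one of the documented call actions."""
    if action not in CALL_ACTIONS:
        raise ValueError(f"unsupported call action: {action}")
    return f"{API_BASE}/{call_id}/{action}"
```

A production handler would also verify the webhook signature before trusting the body, and would send the request with the account's API key.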
Twilio's role in that stack matters just as much. Conversation Relay under <Connect> sends a live call into a WebSocket-based application flow, handles speech-to-text and text-to-speech, and passes structured speech input to the app while turning the app's text responses back into speech. Twilio's documentation also shows that the action callback fires when the Conversation Relay session ends and can return session status and other call details. In practical terms, that means the call is observable. You can log it, inspect it, and use its end state in follow-up workflows.
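The message exchange can be sketched as a single function that takes one inbound WebSocket message and returns an optional reply. The message types and field names used here (prompt with voicePrompt, dtmf with digit, text with token and last, end with handoffData) follow Twilio's Conversation Relay documentation as best understood and should be treated as assumptions. The echo reply and the "press 0" rule are placeholders, not recommended behavior.

```python
import json
from typing import Optional


def handle_relay_message(raw: str) -> Optional[str]:
    """Map one inbound Conversation Relay message to an optional JSON reply."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "prompt":
        # voicePrompt is assumed to carry the caller's transcribed speech.
        heard = msg.get("voicePrompt", "")
        reply = f"You said: {heard}"  # placeholder for real application logic
        return json.dumps({"type": "text", "token": reply, "last": True})
    if kind == "dtmf" and msg.get("digit") == "0":
        # Hypothetical rule: pressing 0 ends the AI session and asks for a human.
        handoff = json.dumps({"reason": "caller_pressed_0"})
        return json.dumps({"type": "end", "handoffData": handoff})
    # setup, interrupt, error, and unknown types get no spoken reply here.
    return None
```

Keeping the handler a pure function like this makes the call logic testable without a live call or an open socket.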
Why this becomes a telephony operations project
The moment an AI agent sits behind a public business number, caller experience becomes an operational responsibility. Every branch in the call flow needs an answer. Who gets through? What greeting plays first? Can the caller interrupt? How long is silence tolerated? What happens when the model cannot complete the request? When should the call go to a person? OpenAI's SIP docs expose accept, reject, and refer flows. Twilio exposes greeting, interruption, timeout, DTMF, and session-callback controls. Those are not prompt tweaks. They are call handling rules.
Twilio's <ConversationRelay> attributes make that concrete. You can explicitly set welcomeGreetingInterruptible, interruptible, interruptSensitivity, speechTimeout, dtmfDetection, and reportInputDuringAgentSpeech. Twilio even notes that the default for reportInputDuringAgentSpeech changed. That is a good reminder that production call behavior should be specified, not left to defaults. If you want to capture caller speech while the agent is talking without interrupting playback, that is a deliberate choice. If you want keypad input sent to the app, that is another one. Both have direct business consequences.
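One way to keep those choices explicit is to generate the TwiML from code, so every behavior is a reviewed value rather than an inherited default. The sketch below uses only Python's standard XML library. The attribute values ("none", "speech", "medium", "true") are assumptions about what Twilio accepts and should be checked against the ConversationRelay reference. speechTimeout is left out because the text does not give its accepted values, and the URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_relay_twiml(ws_url: str, action_url: str) -> str:
    """Return TwiML that connects a call to Conversation Relay with explicit settings."""
    response = ET.Element("Response")
    # The action URL receives the callback when the relay session ends.
    connect = ET.SubElement(response, "Connect", {"action": action_url})
    ET.SubElement(connect, "ConversationRelay", {
        "url": ws_url,
        "welcomeGreeting": "Thanks for calling. How can I help?",
        # All values below are illustrative choices, not Twilio defaults.
        "welcomeGreetingInterruptible": "none",
        "interruptible": "speech",
        "interruptSensitivity": "medium",
        "dtmfDetection": "true",
        "reportInputDuringAgentSpeech": "none",
    })
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_relay_twiml("wss://example.com/relay", "https://example.com/relay-ended"))
```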
Multilingual behavior is another area where teams tend to underestimate the work. Twilio lets you set one shared language for speech-to-text and text-to-speech, override each side separately with transcriptionLanguage and ttsLanguage, and define provider and voice settings per language with nested <Language> elements. The docs also warn that automatic language detection in multi mode requires a specific provider combination: Deepgram for transcription and ElevenLabs for text-to-speech. Use the wrong combination and the session errors out. That is exactly the sort of detail that can make a multilingual demo look good in testing and fail badly on a real number.
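A pre-deployment check can catch that provider mismatch before a caller does. The sketch below is a hypothetical validator, not part of any Twilio SDK: the LanguageSetting type, the "multi" language code, and the provider spellings are assumptions made for illustration. The only rule it encodes is the one stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LanguageSetting:
    code: str                    # e.g. "en-US", or "multi" for auto-detection
    transcription_provider: str  # assumed spelling, e.g. "Deepgram"
    tts_provider: str            # assumed spelling, e.g. "ElevenLabs"


def language_config_errors(settings: List[LanguageSetting]) -> List[str]:
    """Return human-readable problems found in the language settings."""
    errors = []
    for setting in settings:
        if setting.code != "multi":
            continue
        stt_ok = setting.transcription_provider.lower() == "deepgram"
        tts_ok = setting.tts_provider.lower() == "elevenlabs"
        if not (stt_ok and tts_ok):
            errors.append(
                "multi mode needs Deepgram transcription and ElevenLabs speech, "
                f"got {setting.transcription_provider} and {setting.tts_provider}"
            )
    return errors
```

Running a check like this in CI, against the same settings that generate the TwiML, turns a runtime session error into a failed build.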
Keep the tools and rules on the server
One of the most useful pieces of guidance in OpenAI's docs has little to do with voice quality. It is about control. The Realtime server-controls guide recommends keeping tool use and business logic on the application server, using a sideband connection so the live session has one connection for the caller leg and another for the server. That server-side connection can monitor the session, update instructions dynamically, and answer tool calls. For SIP calls, the application server connects to the same session over WebSocket using the call_id from the incoming-call webhook and keeps that connection alive for the duration of the call.
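The sideband pattern can be sketched as three small builders: one for the WebSocket URL, one for an instruction update, and one for answering a tool call. The URL form (a call_id query parameter on the Realtime endpoint) and the event names (session.update, conversation.item.create with a function_call_output item, response.create) are taken from OpenAI's Realtime docs as best understood and remain assumptions here. The helper names are invented, and actually opening the socket is left to whichever WebSocket client the server already uses.

```python
import json
from typing import List
from urllib.parse import quote


def sideband_url(call_id: str) -> str:
    """WebSocket URL the server uses to join the live SIP session (assumed form)."""
    return f"wss://api.openai.com/v1/realtime?call_id={quote(call_id)}"


def instructions_update(text: str) -> str:
    """Client event that replaces the session instructions mid-call."""
    return json.dumps({
        "type": "session.update",
        "session": {"type": "realtime", "instructions": text},
    })


def tool_result_events(tool_call_id: str, result: dict) -> List[str]:
    """Events that return a tool result and ask the model to continue."""
    output_item = {
        "type": "function_call_output",
        "call_id": tool_call_id,       # id of the tool call, not the phone call
        "output": json.dumps(result),  # tool output is sent as a JSON string
    }
    return [
        json.dumps({"type": "conversation.item.create", "item": output_item}),
        json.dumps({"type": "response.create"}),
    ]
```

Because the tool result is produced on the server, the CRM lookup or calendar check behind it never has to be described to, or reachable from, the caller-facing leg.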
That architecture is what makes a phone agent usable in a real business. Lead routing, calendar checks, CRM lookups, branching rules, and escalation logic belong in server-side code, not buried in a fragile prompt and not exposed on a client-facing surface. The SIP flow also gives you room to decide whether to accept the call at all, reject unsupported cases with a SIP status code, or refer an active call to another telephone number or SIP URI. A useful voice agent should not try to improvise its way through every edge case. It should know when to hand the call over.
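Admission and handoff rules of this kind fit naturally into plain server-side functions. The sketch below is illustrative only: the blocklist, the escalation triggers, and the human desk number are hypothetical, and the payload keys (status_code for reject, target_uri for refer) are assumptions about the SIP control endpoints that should be confirmed in OpenAI's docs. SIP status 603 is the standard "Decline" response.

```python
from dataclasses import dataclass, field
from typing import Optional

BLOCKED_CALLERS = {"+15550000000"}   # hypothetical blocklist
HUMAN_DESK_URI = "tel:+15551230000"  # hypothetical transfer target


@dataclass(frozen=True)
class CallDecision:
    action: str                      # "accept", "reject", or "refer"
    payload: dict = field(default_factory=dict)


def decide_admission(caller: str) -> CallDecision:
    """Decide, before the model answers, whether the call is taken at all."""
    if caller in BLOCKED_CALLERS:
        return CallDecision("reject", {"status_code": 603})
    # Session configuration for the accept request is omitted in this sketch.
    return CallDecision("accept")


def decide_escalation(intent: str, failed_turns: int) -> Optional[CallDecision]:
    """Decide, during the call, whether to refer the caller to a person."""
    if intent == "billing_dispute" or failed_turns >= 2:
        return CallDecision("refer", {"target_uri": HUMAN_DESK_URI})
    return None
```

The point of the split is that admission runs once on the incoming-call webhook, while escalation runs repeatedly on the sideband connection as the conversation develops.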
Twilio's session callbacks reinforce the same design choice. The action callback examples include normal completions, failures, and sessions ended by the application, including handoff data. Operationally, that gives you a place to record why the AI session ended, whether a human escalation was requested, and whether the transport failed before the conversation finished. If the WebSocket leg drops, that is not a model-quality issue. It is an operational fault, and it needs fallback behavior, alerting, and retry logic.
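A small classifier on the action callback is usually enough to separate those outcomes. In the sketch below, the parameter names SessionStatus and HandoffData and the "failed" status value are assumptions based on Twilio's callback examples. The outcome labels are invented for this illustration.

```python
import json


def classify_session_end(params: dict) -> dict:
    """Turn action-callback parameters into a record for logging and alerting."""
    status = params.get("SessionStatus", "")
    handoff_raw = params.get("HandoffData")
    handoff = None
    if handoff_raw:
        try:
            handoff = json.loads(handoff_raw)
        except json.JSONDecodeError:
            # Keep unparseable handoff data rather than dropping it.
            handoff = {"raw": handoff_raw}
    if status == "failed":
        outcome = "transport_failure"   # operational fault, not a model issue
    elif handoff is not None:
        outcome = "handoff_requested"
    else:
        outcome = "completed"
    return {
        "outcome": outcome,
        "handoff": handoff,
        "alert": outcome == "transport_failure",
    }
```

The alert flag is where fallback behavior would attach, for example dialing a human queue or paging whoever owns the WebSocket service.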
Cost control is part of call design
OpenAI's Realtime cost guide is a useful correction to the idea that voice spend can be sorted out later. Realtime voice sessions accumulate text and audio tokens across turns, and input transcription can also be billed separately when enabled. The docs show where to read usage from response.done events and from completed input-audio transcription events. So cost monitoring does not have to be an afterthought. It can be part of the implementation from the start.
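A per-call usage tally can be as simple as a counter fed by the sideband event stream. The field names below (input_token_details, output_token_details, text_tokens, audio_tokens, cached_tokens, and a usage object on the transcription-completed event) reflect the Realtime event shapes as best understood and should be read as assumptions. Transcription usage may be reported in a different unit than tokens, which this sketch ignores.

```python
from collections import Counter

TRANSCRIPTION_DONE = "conversation.item.input_audio_transcription.completed"


def add_usage(totals: Counter, event: dict) -> None:
    """Accumulate token usage from one Realtime server event into totals."""
    kind = event.get("type")
    if kind == "response.done":
        usage = event.get("response", {}).get("usage") or {}
        inp = usage.get("input_token_details", {})
        out = usage.get("output_token_details", {})
        totals["input_text"] += inp.get("text_tokens", 0)
        totals["input_audio"] += inp.get("audio_tokens", 0)
        totals["input_cached"] += inp.get("cached_tokens", 0)
        totals["output_text"] += out.get("text_tokens", 0)
        totals["output_audio"] += out.get("audio_tokens", 0)
    elif kind == TRANSCRIPTION_DONE:
        usage = event.get("usage") or {}
        # Billed separately from the conversation itself when enabled.
        totals["transcription"] += usage.get("total_tokens", 0)
```

Writing the totals to the same record as the session outcome makes cost per call and cost per qualified lead straightforward to report.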
The same guide also explains why session design affects spend. Prompt caching can reduce input-token cost in multi-turn sessions, but changing instructions or tool definitions mid-session can reduce cache efficiency. Truncation matters too. Once the conversation grows past the input window, older items are dropped, and repeated truncation can hurt caching further. OpenAI documents practical levers such as a smaller post-instruction token window and a truncation retention ratio below 1.0 to create more headroom between truncations. In business terms, the cheapest phone agent is usually the one with a narrow objective, limited memory, and a fast route to human handoff once the call stops being automatable.
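The effect of the retention ratio is easier to see with a toy simulation. The model below is deliberately simplified and is not how the API is implemented: it assumes every turn adds a fixed number of tokens, that truncation cuts the context to window times the retention ratio, and that each truncation event disturbs the cached prefix once.

```python
from typing import Iterable


def count_truncations(turn_tokens: Iterable[int], window: int, retention_ratio: float) -> int:
    """Count truncation events in a toy model of a growing conversation."""
    if not 0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    context = 0
    truncations = 0
    for added in turn_tokens:
        context += added
        if context > window:
            # Drop older items until only the retained share of the window remains.
            context = int(window * retention_ratio)
            truncations += 1
    return truncations


if __name__ == "__main__":
    turns = [1000] * 20
    print(count_truncations(turns, window=8000, retention_ratio=1.0))  # 12
    print(count_truncations(turns, window=8000, retention_ratio=0.5))  # 3
```

Under these assumptions, a 20-turn call truncates 12 times at a ratio of 1.0 and 3 times at 0.5. The lower ratio discards more history at each event but leaves the cached prefix undisturbed for longer stretches.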
What a sensible implementation looks like
- Define the call objective first: lead capture, triage, after-hours overflow, appointment qualification, or simple routing.
- Set admission and escalation rules before writing prompts: which calls are accepted, which are rejected, which are transferred, and what information must be collected before handoff.
- Keep tools, routing logic, and internal business rules on the server-side control channel so the phone experience can change without exposing private logic at the edge.
- Configure Twilio language, interruption, timeout, and reporting behavior explicitly instead of relying on platform defaults.
- Track session status, transfer reasons, transcript quality, token usage, and failure modes from day one so the number can be operated, not just demonstrated.
This is the sort of work where Greg is useful as a freelance operator. The job is not just choosing a model or making the voice sound polished. It is defining the call flow, wiring Twilio and OpenAI across SIP, webhooks, and WebSockets, keeping sensitive business logic on the server, setting handoff rules, tuning multilingual behavior, and adding transcript, retry, and cost controls around the system. The success metric is simple enough: qualify or route calls without damaging lead capture. Once that is the bar, an AI phone agent on a real number is clearly a telephony operations project.
Need help with this kind of work?
If you need this scoped and wired without turning lead capture into a telecom experiment, Greg can design the call flow, server controls, and handoff rules. Get in touch with Greg.