OpenAI is stepping up its game in the enterprise voice AI market with the new gpt-realtime model. Built to handle intricate instructions, it delivers voices that feel natural and expressive—ideal for applications like customer support and real-time translation.
Accessible via the updated Realtime API, this model brings fresh voices such as Cedar and Marin alongside improved versions of existing ones. OpenAI worked closely with customers who are creating voice applications to fine-tune gpt-realtime, ensuring it fits a range of real-world scenarios—from academic tutoring to helping users navigate customer service funnels.
The model is speech-to-speech: it understands spoken prompts and answers in speech, so the exchange feels like talking with another person. For example, T-Mobile showcased an AI assistant helping customers find new phones, while Zillow demonstrated a similar agent guiding users through neighborhood choices.
What sets gpt-realtime apart is its ability to switch languages mid-sentence and follow detailed instructions, such as speaking with a French accent when asked. The model also shows a marked accuracy improvement: its score on the Big Bench Audio reasoning benchmark rose from 65.6% to 82.8%, a gain of 17.2 percentage points over the previous model. Although we haven’t seen a direct side-by-side comparison with competitors like ElevenLabs, OpenAI is clearly focused on refining instruction-following and function-calling capabilities.
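To make the instruction-following point concrete, here is a minimal sketch of how a client might ask for a specific voice and speaking style at the start of a session. It only builds the JSON message and does not open a connection. The event name `session.update`, the nested field layout, and the lowercase voice identifiers `cedar` and `marin` are assumptions based on the general shape of the Realtime API and should be checked against the current API reference. `build_session_update` is a hypothetical helper, not part of any SDK.

```python
import json


def build_session_update(instructions: str, voice: str = "marin") -> dict:
    """Build a session configuration message as a plain dict.

    The field layout below is an assumption, not a confirmed schema.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            # Free-text guidance the model is expected to follow while speaking.
            "instructions": instructions,
            # Assumed location of the output voice setting.
            "audio": {"output": {"voice": voice}},
        },
    }


event = build_session_update(
    "Speak English with a French accent. If the caller switches language "
    "mid-sentence, switch with them.",
    voice="cedar",
)

# Serialized form that a client would send over its open connection.
payload = json.dumps(event)
```

Keeping the style guidance in a single instructions string means the accent or language policy can be changed without touching the rest of the session setup.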
The updated Realtime API also supports more features, including MCP (the Model Context Protocol), image input so the model can work with visuals in real time, and the Session Initiation Protocol (SIP), which connects applications to phones and makes the API a better fit for contact centers. Users can also save and reuse prompts, further streamlining their workflow.
Early tests suggest that gpt-realtime has been well received, and OpenAI has cut its price by 20%: audio input tokens now cost $32 per million and audio output tokens $64 per million. If you’ve struggled to find a voice AI that’s both secure and conversationally natural, this model is worth a look.
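For a rough sense of what those prices mean in practice, the sketch below estimates the audio cost of a workload from its token counts. The two per-million prices and the 20% figure come from the article. The token counts in the example are made up for illustration, and the "implied previous price" is simple arithmetic that assumes the 20% cut applied equally to input and output, not a figure quoted by OpenAI.

```python
# Prices quoted in the article, in US dollars per million audio tokens.
INPUT_USD_PER_MILLION = 32
OUTPUT_USD_PER_MILLION = 64
PRICE_CUT_PERCENT = 20


def estimate_audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in US dollars for the given audio token counts."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


def implied_previous_price(current_price: int,
                           cut_percent: int = PRICE_CUT_PERCENT) -> float:
    """Back out the price before the cut, assuming a uniform percentage reduction."""
    return current_price * 100 / (100 - cut_percent)


# Hypothetical workload: 250,000 input tokens and 100,000 output tokens.
example_cost = estimate_audio_cost(250_000, 100_000)
print(f"Estimated cost: ${example_cost:.2f}")
print(f"Implied earlier input price: ${implied_previous_price(INPUT_USD_PER_MILLION):.0f} per million")
print(f"Implied earlier output price: ${implied_previous_price(OUTPUT_USD_PER_MILLION):.0f} per million")
```

Because output tokens cost twice as much as input tokens, an agent that talks more than it listens will see its bill dominated by the output side.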