
OpenAI’s New Voice AI Model, GPT-4o-Transcribe, Instantly Adds Speech Capabilities to Your Text Apps



OpenAI has faced challenges in the past involving its voice AI models and celebrity impersonations, yet the organization persists in enhancing its offerings in this space.

Recently, the company announced three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts. The models are initially available through OpenAI’s application programming interface (API), so third-party developers can integrate them into their own applications, and they can also be tried out on a demo site, OpenAI.fm, designed for limited testing and fun.

The gpt-4o-mini-tts model features vocal customization capabilities, allowing users to modify various attributes such as accent, pitch, and tone through text prompts. This new flexibility is expected to help address prior concerns regarding the mimicry of specific individuals’ voices—an issue highlighted by OpenAI’s previous interactions with actress Scarlett Johansson. Although the company had denied imitating her voice, it subsequently removed the relevant voice option. Users now have the autonomy to dictate how their AI voice interacts with them.
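For developers, that customization takes the form of a plain-text voice prompt passed alongside the text to be spoken. Below is a minimal sketch using OpenAI’s Python SDK: the model name comes from the announcement, while the voice preset, prompt wording, and file name are illustrative, and the exact parameter names should be checked against the current API reference.

```python
# Minimal sketch: steering gpt-4o-mini-tts with a free-text voice prompt.
# The "instructions" string is the text-prompt customization described above;
# the voice preset, input text, and output file name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Thanks for calling. Your order shipped this morning and should arrive Friday.",
    instructions="Speak like a calm, reassuring support agent with a slightly upbeat tone.",
)
speech.write_to_file("reply.mp3")
```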

During a demonstration with VentureBeat, OpenAI’s Jeff Harris illustrated the versatility of the new models, showcasing the ability to alter voices to sound like diverse characters such as a mad scientist or a peaceful yoga instructor simply through text prompts.

Exploring Enhanced Capabilities of GPT-4o

These newly introduced models build upon the existing GPT-4o framework that was launched in May 2024, which currently supports the voice and text functionalities of ChatGPT for numerous users. OpenAI has refined the base model with additional data focusing on enhancing transcription and speech processing capabilities. However, the company has not disclosed a timeline for when these models will be integrated into ChatGPT.

“ChatGPT’s requirements differ regarding cost and performance, so while I anticipate a future integration, our current focus is on API users,” Harris remarked.

The new models are intended to replace OpenAI’s Whisper speech-to-text model, which has been in use for the past two years, and boast lower word error rates across industry benchmarks. They are optimized for noisy environments and handle diverse accents and varying speech speeds across more than 100 languages.

A comparison on OpenAI’s website illustrates that the gpt-4o-transcribe models achieve an impressive word error rate of just 2.46% in English, significantly outperforming Whisper across 33 languages.

“Our models incorporate noise cancellation and a semantic voice activity detector, which enhances transcription accuracy by determining when a speaker concludes a thought,” explained Harris.

It’s important to note that the gpt-4o-transcribe models do not offer “diarization,” meaning they cannot label or separate individual speakers. They treat the incoming audio, whether it contains one voice or several, as a single input channel and respond with a single output voice.
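In practice, the transcription call mirrors the existing Whisper endpoint, with only the model name changing. A minimal sketch with the Python SDK, under that assumption (the audio file name is illustrative):

```python
# Minimal sketch: transcribing a recorded file with the new model.
# The model name comes from the announcement; the file name is illustrative.
from openai import OpenAI

client = OpenAI()

with open("support_call.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# One text stream, no per-speaker labels, per the diarization caveat above.
print(transcript.text)
```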

OpenAI is also encouraging creativity within its community by hosting a competition to find innovative uses for its demo voice site, OpenAI.fm. Participants can share their experiences online and tag the @openAI account on X. The competition’s winner will receive a custom-made Teenage Engineering radio featuring the OpenAI logo, one of only three in existence, according to Olivier Godement, OpenAI’s Head of Product.

A Wealth of Audio Application Opportunities

These capabilities make the models strong candidates for customer service call centers, meeting transcription, and AI-driven assistants. Furthermore, OpenAI’s newly launched Agents SDK makes it easier for developers to add voice interaction to existing applications built on the text-based GPT-4o model with minimal code, roughly “nine lines of code,” according to an OpenAI YouTube event.

For instance, an online retail app based on GPT-4o could quickly adapt to respond verbally to user queries such as “Tell me about my last orders” by implementing these new models with minor adjustments.
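That wrap-around pattern can be illustrated without the Agents SDK at all: transcribe the spoken question, hand the text to whatever GPT-4o logic the app already has, and speak the reply. The sketch below assumes a hypothetical handle_order_question helper standing in for the app’s existing text pipeline; it is an illustration of the pattern, not OpenAI’s nine-line example.

```python
# Illustrative pattern: speech in -> existing text logic -> speech out.
from openai import OpenAI

client = OpenAI()

def handle_order_question(question: str) -> str:
    """Hypothetical stand-in for the app's existing GPT-4o text logic."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You answer questions about the user's recent orders."},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

# 1) Transcribe the spoken query, e.g. "Tell me about my last orders."
with open("user_query.wav", "rb") as f:
    question = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe", file=f
    ).text

# 2) Reuse the existing text pipeline unchanged.
answer = handle_order_question(question)

# 3) Speak the answer back.
client.audio.speech.create(
    model="gpt-4o-mini-tts", voice="alloy", input=answer
).write_to_file("answer.mp3")
```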

“For the first time, we’re enabling streaming speech-to-text, allowing developers to receive a continuous text stream from audio input in real time, enhancing the naturalness of conversations,” said Harris.
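A sketch of what that streaming flow could look like with the Python SDK is below; it assumes the transcriptions endpoint accepts a stream=True flag for the new models and yields incremental text events, which should be verified against the current API reference.

```python
# Sketch of streaming speech-to-text (assumption: stream=True is supported for
# the gpt-4o-transcribe models and yields incremental delta events).
from openai import OpenAI

client = OpenAI()

with open("live_recording.wav", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=f,
        stream=True,
    )
    for event in stream:
        # Event field names are assumptions: print partial text as it arrives.
        delta = getattr(event, "delta", None)
        if delta:
            print(delta, end="", flush=True)
print()
```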

Developers seeking low-latency, real-time AI voice experiences are advised to utilize OpenAI’s speech-to-speech models via the Realtime API.

Availability and Pricing

The new voice models are available for immediate use through OpenAI’s API at the following prices (a rough cost back-of-envelope follows the list):

gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)

gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)

gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)
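The per-minute figures in parentheses imply roughly 1,000 audio tokens per minute of speech, which makes budgeting straightforward; the sketch below is an approximation derived from the published rates, not an official calculator.

```python
# Rough cost estimate from the published per-token rates.
# ~1,000 audio tokens per minute is implied by the ~$0.006/min figure above.
AUDIO_TOKENS_PER_MINUTE = 1_000  # approximation

def transcription_cost_usd(minutes: float, price_per_million_tokens: float) -> float:
    tokens = minutes * AUDIO_TOKENS_PER_MINUTE
    return tokens / 1_000_000 * price_per_million_tokens

# e.g. 10,000 minutes of call audio per month through gpt-4o-transcribe ($6.00 / 1M tokens)
print(f"${transcription_cost_usd(10_000, 6.00):,.2f}")  # -> $60.00
```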

However, these models enter a highly competitive arena within the AI transcription and speech technology market. Companies like ElevenLabs have also launched their Scribe model, which includes diarization and boasts a word error rate of 3.3% in English, priced at $0.40 per hour of input audio (roughly $0.006 per minute).

Additionally, Hume AI has introduced Octave TTS, a model allowing for detailed customization of pronunciation and emotional tone based purely on user instructions, rather than pre-set voices. While Octave TTS pricing varies, it does feature a free tier allowing users to utilize 10 minutes of audio.

The open-source community is also advancing in audio and speech technology, with models like Orpheus 3B available under a permissive Apache 2.0 license, enabling developers to access and run the model without incurring costs, assuming they have the appropriate hardware or cloud resources.

Adoption and Initial Feedback

Many organizations have already integrated OpenAI’s new audio models into their systems, reporting notable enhancements in voice AI capabilities, as evidenced by testimonials provided to VentureBeat.

EliseAI—focused on automating property management—discovered that OpenAI’s text-to-speech model enriched tenant interactions with a more natural and emotionally engaging approach. The upgraded voices resulted in improved satisfaction among tenants and higher rates of successful issue resolution in calls.

Decagon, which specializes in AI-driven voice solutions, experienced a 30% boost in transcription accuracy with the integration of OpenAI’s speech recognition technology. This accuracy improvement has enabled Decagon’s AI agents to function more reliably in real-world conditions, even amidst background noise. The implementation of the new model was quick, taking just one day to incorporate it into their system.

However, not all feedback on OpenAI’s recent release has been positive. Ben Hylak, co-founder of Dawn AI app analytics software and a former human interface designer at Apple, expressed concern on X that the new models may represent a step back from real-time voice, a departure from OpenAI’s earlier emphasis on low-latency conversational AI in ChatGPT.

The announcement was also leaked early on X (formerly Twitter), where TestingCatalog News shared details of the upcoming models before the official reveal, crediting the user @StivenTheDev; the post quickly gained traction.

Looking forward, OpenAI is dedicated to continually enhancing its audio models and is exploring tailored voice functionalities while prioritizing safety and responsible AI usage. Beyond audio advancements, the company is making significant investments in multimodal AI technologies, including video, to foster more dynamic and engaging agent interactions.

Source: venturebeat.com
