Gavin Purcell, co-host of the AI for Humans podcast, shared a video on Reddit in which a human player posing as an embezzler argues with an AI-driven boss. The exchange is so convincing that it is hard to tell which speaker is the human and which is the AI.

Achieving “Near-Human Quality”

At its core, Sesame’s Conversational Speech Model (CSM) combines two AI models, a backbone and a decoder, both based on Meta’s Llama architecture. Together they process text and audio to produce highly realistic speech. The models vary in scale; the largest has 8.3 billion parameters and was trained on nearly one million hours of predominantly English audio.

Conventional text-to-speech systems typically work in two stages, first generating semantic tokens (high-level representations of speech) and then acoustic details. Sesame’s CSM instead uses a single-stage, multimodal transformer that processes text and audio together, an approach similar to the one behind OpenAI’s voice technology.

In evaluations without conversational context, human judges struggled to distinguish CSM-generated audio from natural human speech, which suggests the model approaches human quality for isolated utterances. When conversational context was provided, however, evaluators consistently preferred the human voices, pointing to a remaining gap in contextual understanding and delivery.

Brendan Iribe, co-founder of Sesame, acknowledged the model’s limitations in a comment on Hacker News. He noted that the AI’s tone and pacing are often inappropriate and that it struggles with interruptions and the overall flow of conversation. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.

Source
arstechnica.com

Uncanny AI Voice Demo Generates Both Awe and Unease Online
