Gavin Purcell, co-host of the AI for Humans podcast, posted a video on Reddit in which a human player pretends to be an embezzler and argues with an AI-driven boss. The exchange is convincing enough that it is hard to tell which speaker is the human and which is the AI.
Achieving “Near-Human Quality”
At its core, Sesame’s Conversational Speech Model (CSM) combines two AI models, a backbone and a decoder, both built on Meta’s Llama architecture. Together they process text and audio in tandem to produce realistic speech. Sesame trained the model at several sizes; the largest has 8.3 billion parameters and was trained on nearly one million hours of predominantly English audio.
Conventional text-to-speech systems typically work in two stages, first generating a high-level semantic representation of the speech and then filling in the acoustic details. Sesame’s CSM instead uses a single-stage, multimodal transformer that handles text and audio inputs together, an approach similar to the one used in OpenAI’s voice technology.
In evaluations without conversational context, human judges struggled to distinguish CSM-generated audio from natural human speech. When conversational context was included, however, evaluators consistently preferred the human voices, which suggests the model still has room to improve in contextual understanding and delivery.
Brendan Iribe, co-founder of Sesame, acknowledged the model’s limitations in a discussion on Hacker News. He pointed out that the AI’s tone and pacing are often inappropriate and that it still struggles with interruptions and the overall flow of conversation. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he said, an apparent reference to the uncanny valley.
Source
arstechnica.com