Over the weekend, Meta surprised the tech community by unveiling Llama 4, its newest family of AI language models. The parent company of Facebook, Instagram, WhatsApp, and Quest VR introduced not one but three versions of the model, built on a "Mixture-of-Experts" architecture and trained with a novel approach using fixed hyperparameters, termed MetaP.
These versions come equipped with extensive context windows, allowing the models to process significant amounts of information in a single interaction.
However, the AI community's reaction since the announcement, particularly to the immediate release of the Llama 4 Scout and Llama 4 Maverick models for download, has been mixed and largely critical.
Llama 4 Sparks Confusion and Criticism Among AI Users
A post that surfaced on the North American Chinese language forum 1point3acres—allegedly from a researcher within Meta’s GenAI team—claimed that Llama 4 struggled with performance on third-party benchmarks. The post suggested that management proposed blending test sets from various benchmarks during the post-training phase to achieve more favorable results. “It was an effort to present a better outcome across different metrics,” the post stated.
Despite skepticism surrounding its authenticity, reactions from community members indicated doubts about the benchmarks themselves.
One user on X, identified as @cto_junior, expressed strong doubts, stating, "I suspect Meta miscalculated some aspects in the released weights … if not, they should reconsider their workforce instead of acquiring Nous." The comment referenced an independent test in which Llama 4 Maverick scored a meager 16% on the aider polyglot coding benchmark, substantially below older models such as DeepSeek V3 and Claude 3.7 Sonnet.
AI researcher and author Andriy Burkov weighed in, critiquing the claimed 10 million-token context window of Llama 4 Scout. He remarked, “Though Meta claims a 10M context, it is virtual; no model has been trained on prompts longer than 256k tokens. Thus, using more than that leads to subpar outputs.”
On Reddit, user Dr_Karminski echoed disappointment, showcasing Llama 4’s lackluster performance against DeepSeek’s V3 model in coding challenges, such as simulating complex movements.
Nathan Lambert, a former Meta researcher now with the Allen Institute for Artificial Intelligence, highlighted discrepancies on his blog regarding a benchmark comparison released by Meta. He noted the comparison used a version of Llama 4 Maverick that had not been made officially available. Lambert criticized this as a misleading tactic: “This raises questions about the integrity of the results presented, as it appears that a version of the model optimized for conversational output was used instead of the one accessible to the public.”
Amidst a barrage of criticism, Ahmad Al-Dahle, Meta’s VP and Head of GenAI, addressed the concerns on X, stating, “We’re excited to get Llama 4 into your hands and are already receiving positive feedback. However, we acknowledge reports of varying quality across different applications. We anticipate it will take time for public implementations to stabilize as we address pertinent issues.”
Al-Dahle went on to refute claims of training on test sets, emphasizing, “Those allegations are untrue, and we firmly oppose any such practices. Variability in quality likely stems from the need to stabilize the implementations further.”
Even with this statement, the response consisted largely of further complaints regarding performance and requests for more comprehensive documentation on Llama 4’s specifications and training processes. Concerns were raised as to why this release appears more problematic than previous iterations.
This reaction follows the recent departure of Joelle Pineau, a leading figure in Meta's AI research, who announced she was leaving the company, expressing gratitude for her time there, just as Llama 4 launched.
The initial reception of Llama 4 among the AI community has not been overwhelmingly positive, and concerns about its performance persist. The upcoming Meta LlamaCon on April 29 will likely serve as a platform for further discussion and inquiry regarding these developments. The AI community is watching closely as the situation evolves.
Source: venturebeat.com