The landscape of artificial intelligence experienced a significant transformation in January 2025, following the release of DeepSeek R1, an open-source reasoning language model developed by the relatively obscure Chinese startup DeepSeek, an offshoot of Hong Kong’s High-Flyer Capital Management. The model demonstrated impressive capabilities, outperforming comparable offerings from established American companies such as Meta.
As DeepSeek’s technology gained traction among researchers and businesses, it reportedly sent Meta into a state of alarm. The company discovered that the R1 model had been developed at a fraction of the cost of many major competitors — purportedly for just a few million dollars, which is comparable to the salaries of some of Meta’s own AI leaders.
Until this revelation, Meta’s strategy had centered on developing top-tier open-source models branded as “Llama” to empower researchers and businesses, particularly those with fewer than 700 million monthly users, who could use the models for free, while requiring larger entities to negotiate special licensing terms. The performance of DeepSeek R1, achieved with considerably lower financial investment, highlighted deficiencies in Meta’s previously released Llama models, particularly the latest version, Llama 3.3, which had been released just a month earlier.
In response, Mark Zuckerberg, Meta’s CEO, took to Instagram to announce the launch of a new series of Llama models. Among the initially available offerings are Llama 4 Maverick, with 400 billion parameters, and Llama 4 Scout, featuring 109 billion parameters. Developers can access these models now through llama.com or the AI code-sharing platform Hugging Face.
Additionally, a 2-trillion-parameter model, Llama 4 Behemoth, was previewed, though it is still in training and no release date has been disclosed. A model’s parameter count is a rough indicator of its complexity and capacity, with more parameters generally correlating with stronger performance across a range of tasks.
One of the standout features of the Llama 4 models is their multimodal capabilities, enabling them to process and generate not just text but also video and images, although audio capabilities were not explicitly mentioned. Furthermore, these new models boast extensive context windows, allowing for 1 million tokens in Llama 4 Maverick and an impressive 10 million tokens in Llama 4 Scout. This capacity translates to the ability to handle vast amounts of information in single interactions, which is particularly beneficial for fields requiring detailed input, such as medicine and engineering.
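To put those figures in perspective, here is a minimal back-of-the-envelope sketch. It assumes a rough heuristic of about four characters per token for English text rather than Llama 4’s actual tokenizer, and simply checks whether a body of text fits within each model’s advertised window:

```python
# Back-of-the-envelope check of whether a body of text fits in a context window.
# Assumes ~4 characters per token, a common rough heuristic for English text;
# the actual Llama 4 tokenizer will produce different counts.
CHARS_PER_TOKEN = 4

CONTEXT_WINDOWS = {
    "llama-4-maverick": 1_000_000,   # 1M tokens, per Meta's announcement
    "llama-4-scout": 10_000_000,     # 10M tokens
}

def estimated_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(texts: list[str], model: str, reserve_for_output: int = 8_192) -> bool:
    """Check whether the combined documents still leave room for the model's reply."""
    total = sum(estimated_tokens(t) for t in texts)
    return total + reserve_for_output <= CONTEXT_WINDOWS[model]

# Roughly 30 MB of plain text is ~7.5M estimated tokens under this heuristic.
big_corpus = ["x" * 30_000_000]
print(fits_in_context(big_corpus, "llama-4-maverick"))  # False: overflows the 1M window
print(fits_in_context(big_corpus, "llama-4-scout"))     # True: fits within the 10M window
```

Under that heuristic, tens of megabytes of plain text would overflow Maverick’s 1-million-token window but still fit comfortably within Scout’s 10-million-token window.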
A Commitment to Mixture-of-Experts Architecture
All three models leverage a “mixture-of-experts” (MoE) architecture, a technique popularized by earlier models from OpenAI and Mistral. This approach combines multiple smaller specialized models, or “experts,” within a larger framework, improving performance and efficiency. Llama 4 Maverick, for instance, draws on 128 different experts but runs only the ones needed for a given input, which optimizes resource usage.
According to Meta’s blog post, this architecture yields significant resource savings: “Only a subset of the total parameters are activated while serving these models,” which reduces costs and speeds up inference. Llama 4 Maverick can be deployed on a single Nvidia H100 DGX host for simplicity, or run with distributed inference for greater efficiency.
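To make the idea concrete, here is a minimal, illustrative mixture-of-experts layer in PyTorch. The toy sizes, expert count, and top-1 routing are chosen for brevity and are not Llama 4’s actual configuration: a small router scores the experts for each token, and only the selected expert’s parameters are actually used.

```python
# Minimal sketch of a mixture-of-experts (MoE) feed-forward layer.
# Illustrative only: layer sizes, expert count, and top-1 routing are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, d_hidden: int = 256,
                 num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                             # (num_tokens, num_experts)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)   # best expert(s) per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                      # tokens routed to expert e
                if mask.any():                                   # skip experts with no work
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)        # 10 token embeddings of width 64
layer = ToyMoELayer()
print(layer(tokens).shape)          # torch.Size([10, 64])
```

Because only one expert runs per token in this sketch, the compute per token stays close to that of a single small feed-forward block even though the layer holds four experts’ worth of parameters, which is the efficiency argument behind MoE.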
While Scout and Maverick are publicly available for self-hosting, Meta has not introduced hosted APIs or pricing tiers for its infrastructure. Instead, the company is focused on distributing models through open downloads and integration with its platforms like WhatsApp and Messenger.
Meta estimates the cost to run Llama 4 Maverick at between $0.19 and $0.49 for every million tokens processed, highlighting its cost-effective nature compared to proprietary models, which might incur charges as high as $4.38 per million tokens.
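To see what that spread means over a sustained workload, a quick cost estimate using the per-million-token figures quoted above might look like this (the 500-million-token monthly volume is a hypothetical figure chosen for illustration):

```python
# Back-of-the-envelope serving-cost comparison based on the per-million-token
# figures quoted above. Real bills depend on the input/output mix and the provider.
def cost_usd(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

monthly_tokens = 500_000_000  # hypothetical workload: 500M tokens per month

for label, price in [
    ("Llama 4 Maverick (low estimate)", 0.19),
    ("Llama 4 Maverick (high estimate)", 0.49),
    ("Proprietary model (quoted comparison)", 4.38),
]:
    print(f"{label}: ${cost_usd(monthly_tokens, price):,.2f}/month")

# Llama 4 Maverick (low estimate): $95.00/month
# Llama 4 Maverick (high estimate): $245.00/month
# Proprietary model (quoted comparison): $2,190.00/month
```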
The Llama 4 series is particularly geared towards reasoning, coding, and problem-solving tasks, though it does not exhibit the advanced reasoning chains seen in models like OpenAI’s “o” series or DeepSeek R1. Instead, it aims to compete with traditional LLMs and multimodal models, such as OpenAI’s GPT-4o and DeepSeek’s V3, with Llama 4 Behemoth potentially posing a threat to DeepSeek R1.
To bolster reasoning capabilities, Meta has instituted specialized training methodologies, such as:
- Removing over half of “easy” prompts during supervised fine-tuning to enhance challenge levels.
- Implementing continuous reinforcement learning that increases the complexity of prompts over time.
- Utilizing pass@k evaluation combined with curriculum sampling to boost performance across mathematical, logical, and coding tasks (a brief sketch of the pass@k metric follows this list).
- Introducing MetaP, a new technique for tuning hyperparameters that could improve training efficiency across various model sizes and media formats.
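For context, pass@k measures the probability that at least one of k sampled solutions to a problem passes its tests. A minimal sketch of the standard unbiased estimator from the code-generation literature (not Meta’s internal evaluation harness) is shown below:

```python
# Unbiased pass@k estimator: given n samples per problem of which c passed,
# estimate the probability that at least one of k randomly chosen samples passes.
# pass@k = 1 - C(n - c, k) / C(n, k)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples generated, c = samples that passed, k = sampling budget."""
    if n - c < k:          # too few failures to fill k picks: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 of them correct
print(round(pass_at_k(200, 30, 1), 3))   # 0.15 (simply the fraction correct)
print(round(pass_at_k(200, 30, 10), 3))  # roughly 0.81
```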
MetaP stands out as a potentially transformative tool, enabling researchers to apply learnings from smaller models to larger counterparts, streamlining the development process for future offerings.
A Robust but Not Overwhelmingly Dominant Model Family
In Zuckerberg’s announcement, he stated, “Our goal is to build the world’s leading AI, open source it, and make it universally accessible,” reinforcing Meta’s commitment to open-source developments. However, while Llama 4 models are touted as among the most powerful in their class, they do not necessarily outperform every competitor.
Meta has highlighted specific models that Llama 4 beats, including:
Llama 4 Behemoth
- Outperforms GPT-4.5, Gemini 2.0 Pro, and Claude Sonnet 3.7 on:
  - MATH-500: 95.0
  - GPQA Diamond: 73.7
  - MMLU Pro: 82.2
Llama 4 Maverick
- Beats GPT-4o and Gemini 2.0 Flash on several multimodal reasoning benchmarks:
  - ChartQA, DocVQA, MathVista, MMMU
- Comparable performance to DeepSeek v3.1, utilizing less than half the active parameters (17B).
- Benchmark scores:
  - ChartQA: 90.0 (versus GPT-4o’s 85.7)
  - DocVQA: 94.4 (versus 92.8)
  - MMLU Pro: 80.5
- Cost-efficient: $0.19–$0.49 per million tokens.
Llama 4 Scout
- Matches or outperforms other models such as Mistral 3.1, Gemini 2.0 Flash-Lite, and Gemma 3 on:
  - DocVQA: 94.4
  - MMLU Pro: 74.3
  - MathVista: 70.7
- Notable 10 million token context length, well-suited for managing extensive documents, codebases, and multi-turn analyses.
- Optimally designed for deployment on a single H100 GPU (a brief self-hosting sketch follows this list).
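As a rough idea of what self-hosting Scout on a single GPU could look like, here is a hypothetical sketch using Hugging Face Transformers with 4-bit quantization. The model ID is a placeholder, and whether a given checkpoint loads through this generic interface and fits in a single H100’s memory are assumptions, not Meta’s published recipe:

```python
# Hypothetical sketch: loading a Llama 4 Scout checkpoint from Hugging Face with
# 4-bit quantization to reduce memory use on a single H100-class GPU.
# The model ID below is a placeholder for illustration; check llama.com or
# Hugging Face for the actual repository name and license terms.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout"  # placeholder ID, verify the real one

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # int4 weights to cut memory use

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",   # place layers on the available GPU(s)
)

prompt = "Summarize the key changes introduced in Llama 4:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```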
Assessing Llama 4 Against DeepSeek
While the Llama 4 family makes significant advancements, it’s essential to compare it with other reasoning-focused models, particularly DeepSeek R1 and OpenAI’s “o” series. When analyzing Llama 4 Behemoth against DeepSeek R1 and OpenAI o1 models, nuances emerge:
Benchmark | Llama 4 Behemoth | DeepSeek R1 | OpenAI o1-1217 |
---|---|---|---|
MATH-500 | 95.0 | 97.3 | 96.4 |
GPQA Diamond | 73.7 | 71.5 | 75.7 |
MMLU | 82.2 | 90.8 | 91.8 |
The comparisons indicate that:
- In MATH-500, Llama 4 Behemoth is slightly behind both DeepSeek R1 and OpenAI o1.
- For GPQA Diamond, Behemoth surpasses DeepSeek R1 but falls short of OpenAI o1.
- In MMLU, it trails behind both competitors but still outperforms earlier models like Gemini 2.0 Pro and GPT-4.5.
In summary, while DeepSeek R1 and OpenAI o1 maintain advantages in specific metrics, Llama 4 Behemoth remains a formidable contender in the reasoning landscape.
Emphasizing Safety and Political Neutrality
Meta has prioritized model alignment and safety with initiatives like Llama Guard, Prompt Guard, and CyberSecEval, which help developers detect unsafe responses and potential vulnerabilities. The company also asserts that Llama 4 shows marked improvement on political bias, acknowledging that earlier models tended to lean left; it claims Llama 4 now gives more balanced answers that represent a broader spectrum of political opinions.
Current Status of Llama 4
Meta’s Llama 4 models signify a blend of efficiency, accessibility, and sophisticated performance across diverse applications, with Scout and Maverick publicly available and Behemoth on the horizon. This suite of models positions Meta as a competitive player in the open-source AI landscape, challenging the dominance of proprietary offerings from entities like OpenAI, Anthropic, DeepSeek, and Google.
For businesses and researchers aiming to develop advanced AI solutions — whether for enterprise-scale applications, research initiatives, or extensive analytical tasks — Llama 4 delivers tailored options with a strong emphasis on reasoning and multimodal capabilities.
Source
venturebeat.com