Amid ongoing debate over its Llama 4 model family, Meta faces fresh pressure as Nvidia introduces a fully open-source large language model (LLM) built on Meta’s earlier Llama-3.1-405B-Instruct. Nvidia claims the new model, Llama-3.1-Nemotron-Ultra-253B-v1, delivers exceptional performance, surpassing notable competitors such as DeepSeek R1 across multiple third-party benchmarks.
The 253-billion-parameter model is tailored for robust reasoning, instruction following, and AI assistant workflows. Nvidia first unveiled it at its annual GPU Technology Conference (GTC) in March.
The release of the model highlights Nvidia’s ongoing commitment to enhancing performance through architectural advancements and meticulous post-training processes.
As of April 7, 2025, the model is available on Hugging Face, with open weights and post-training datasets. It is designed to operate in both “reasoning on” and “reasoning off” modes, letting developers switch between complex reasoning and simpler, direct outputs via the system prompt.
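For context, the toggle is driven entirely by the system prompt. Below is a minimal Python sketch of that pattern, assuming the “detailed thinking on/off” phrasing from the model’s Hugging Face listing; verify the exact wording against the current model card.

```python
# Minimal sketch of the reasoning toggle. The "detailed thinking on/off"
# system-prompt convention is taken from the Hugging Face model card and
# should be verified before use; everything else here is illustrative.
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    mode = "on" if reasoning else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Summarize the attention mechanism.", reasoning=True)
```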
Optimized for Efficient Inference
The Llama-3.1-Nemotron-Ultra-253B is an extension of Nvidia’s prior efforts in crafting inference-optimized LLMs. Through a process known as Neural Architecture Search (NAS), the model employs distinct structural modifications, including omitted attention layers, fused feedforward networks (FFNs), and adjustable FFN compression ratios.
This restructured design not only diminishes the model’s memory and computational needs but also maintains high output quality, facilitating deployment on a single 8x H100 GPU node.
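To make the NAS result concrete, here is a loose, hypothetical sketch of the kind of per-block configuration such a search could produce; the field names and values are illustrative and do not reflect Nvidia’s actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical per-block layout of a NAS-derived transformer: some blocks
# skip attention entirely, and FFN widths vary block to block. Names and
# numbers are illustrative, not Nvidia's real schema.
@dataclass
class BlockConfig:
    has_attention: bool   # False => the attention layer is skipped
    ffn_expansion: float  # FFN width as a multiple of the hidden size

blocks = [
    BlockConfig(has_attention=True, ffn_expansion=4.0),   # standard block
    BlockConfig(has_attention=False, ffn_expansion=2.5),  # attention skipped
    BlockConfig(has_attention=True, ffn_expansion=1.0),   # compressed FFN
]
```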
Nvidia asserts that the model combines strong performance with economical deployment, compatible with Nvidia’s B100 (Blackwell) and Hopper GPU generations and validated in BF16 and FP8 precision modes.
Enhanced Through Post-Training
The model’s foundational capabilities have been refined through a comprehensive post-training pipeline consisting of multiple phases. This includes supervised fine-tuning across diverse areas such as mathematics, coding, conversational AI, and tool utilization, followed by reinforcement learning via Group Relative Policy Optimization (GRPO) to elevate instruction adherence and reasoning capabilities.
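GRPO’s distinguishing feature is that it scores each sampled response relative to a group of responses to the same prompt, rather than training a separate value model. A minimal sketch of that group-relative advantage computation, with dummy reward values, might look like this:

```python
import statistics

# Sketch of GRPO's group-relative advantage: normalize each response's
# reward against the mean and standard deviation of its sampling group.
# Rewards below are dummy values for illustration.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))
```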
During this process, the model underwent knowledge distillation over 65 billion tokens, followed by continual pretraining on a further 88 billion tokens.
The training datasets drew on sources such as FineWeb, Buzz-V1.2, and Dolma, with post-training prompts generated from both public and synthetic sources, including prompts designed to teach the model to distinguish between its reasoning modes.
Superior Performance Across Diverse Benchmarks
Evaluation metrics indicate meaningful improvements, especially when the model is operated in reasoning-enabled mode. For example, performance on the MATH500 benchmark increased significantly from 80.40% in standard mode to 97.00% when reasoning was enabled.
Furthermore, on the AIME25 benchmark, scores improved from 16.67% to 72.50%, and the LiveCodeBench results more than doubled, rising from 29.03% to 66.31%.
Impressive advancements were also recorded in tool-use tasks such as BFCL V2 and function composition, as well as in graduate-level question answering (GPQA), where the model scored 76.01% in reasoning mode versus 56.60% without.
All benchmarks utilized a maximum sequence length of 32,000 tokens, with each assessment being repeated up to 16 times to ensure reliability.
When compared to DeepSeek R1, an advanced MoE model boasting 671 billion parameters, Llama-3.1-Nemotron-Ultra-253B demonstrates competitive results despite having fewer than half the parameters. It outperforms in tasks such as GPQA (76.01 vs. 71.5), IFEval instruction following (89.45 vs. 83.3), and LiveCodeBench coding tasks (66.31 vs. 65.9).
Conversely, DeepSeek R1 shows notable advantages in certain mathematical assessments, particularly AIME25 (79.8 vs. 72.50) and slightly on MATH500 (97.3 vs. 97.00).
These findings suggest that Nvidia’s dense model matches or surpasses the Mixture-of-Experts (MoE) alternative on reasoning and alignment tasks, while trailing slightly in math-heavy evaluations.
Integration and Usability
The model supports the Hugging Face Transformers library (version 4.48.3 recommended) and is capable of managing input and output sequences up to 128,000 tokens.
Developers can influence the model’s reasoning capabilities through system prompts and can select decoding methods suited to specific tasks.
For reasoning tasks, Nvidia recommends temperature sampling at 0.6 with a top-p of 0.95, while deterministic outputs can be achieved with greedy decoding (see the sketch below).
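Putting those settings together, here is a hedged sketch of a Transformers generation call; the model id, the trust_remote_code flag, and the system-prompt toggle reflect the Hugging Face listing at the time of writing and should be checked against the model card, and the model’s hardware footprint (an 8x H100 node) far exceeds a typical workstation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the recommended settings: temperature 0.6 / top_p 0.95 for
# reasoning mode, greedy decoding (do_sample=False) for determinism.
# Model id and trust_remote_code are per the Hugging Face listing; verify.
model_id = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # one of the validated precisions
    device_map="auto",
    trust_remote_code=True,      # NAS-derived custom architecture
)

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,   # set False for greedy, deterministic decoding
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```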
Llama-3.1-Nemotron-Ultra-253B is designed to be multilingual, effectively handling English and several other languages, including German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
This model is also versatile enough for a range of applications including chatbot creation, AI agent functionalities, retrieval-augmented generation (RAG), and code production.
Commercial Licensing
The model is released under the Nvidia Open Model License, subject to the Llama 3.1 Community License Agreement, and is approved for commercial deployment.
Nvidia has stressed the necessity of responsible AI practices, urging teams to examine the model’s alignment, safety, and bias considerations pertinent to their intended applications.
As Oleksii Kuchaiev, Director of AI Model Post-Training at Nvidia, noted in a recent post, the team is excited about the public release, describing it as an open 253-billion-parameter model with reasoning that can be toggled on and off, shipped with open weights and training data.
Source
venturebeat.com