Fastest, Largest, Strongest: NVIDIA Blackwell Sweeps MLPerf Training 6.0
Every breakthrough AI model starts the same way: with a training run. The infrastructure running those training jobs shapes everything: how fast teams can iterate, what scale of model they can build and whether those jobs complete reliably. As models grow in size, complexity and intelligence, the demands on training infrastructure are also rising.
Key Takeaways
- NVIDIA delivers the performance, scale and reliability that frontier training requires - in benchmarks and beyond.
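- Performance: MLPerf Training 6.0 added two new mixture-of-experts (MoE) pretraining workloads, DeepSeek-V3 671B and GPT-OSS-20B, and the NVIDIA platform delivered the fastest time to train on all seven benchmarks.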
- Within each rack-scale system, fifth-generation NVIDIA NVLink Switches connect all 72 GPUs at high bandwidth into a unified pool of compute and memory, enabling them to act as one giant GPU.
Large-scale MoE training faces the same all-to-all communication challenge as MoE inference - tokens must be routed across GPUs to reach the right expert subnetwork - and NVLink's bandwidth advantage is what makes that fast and efficient at scale.
- 6x Performance Over GB200 NVL72: In this round, GB300 NVL72 delivered up to 1.
- NVIDIA also submitted results at 5,120 GPUs with NVIDIA GB200 NVL72 systems on Llama 3.1 405B, one of the largest dense LLMs in the suite.
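- CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking.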
Stats & Key Facts
- NVIDIA continues to push low-precision training innovation across different model architectures, most recently using NVFP4 to pretrain the massive 550-billion-parameter NVIDIA Nemotron 3 Ultra model.
- On DeepSeek-V3 671B, the largest MoE model in the suite, NVIDIA scaled its submission to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date.
- Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark.

Performance: Fastest Time to Train on Every Benchmark
MLPerf Training 6.0 added two new mixture-of-experts (MoE) pretraining workloads to the suite: DeepSeek-V3 671B and GPT-OSS-20B, reflecting the growing centrality of MoE architectures. The NVIDIA platform was the only one submitted across every benchmark, and it delivered the fastest time to train on all seven.
This round, NVIDIA submitted results on both NVIDIA GB200 NVL72 and GB300 NVL72 rack-scale systems. Within each rack-scale system, fifth-generation NVIDIA NVLink Switches connect all 72 GPUs at high bandwidth into a unified pool of compute and memory, enabling them to act as one giant GPU.
Large-scale MoE training faces the same all-to-all communication challenge as MoE inference - tokens must be routed across GPUs to reach the right expert subnetwork - and NVLink's bandwidth advantage is what makes that fast and efficient at scale.
NVIDIA also showcased NVFP4 training methods that increase performance while meeting strict accuracy requirements across large- and small-scale pretraining as well as fine-tuning workloads. NVIDIA continues to push low-precision training innovation across different model architectures, most recently using NVFP4 to pretrain the massive 550-billion-parameter NVIDIA Nemotron 3 Ultra model.
In this round, GB300 NVL72 also delivered higher performance than GB200 NVL72. Key Blackwell Ultra capabilities drive this improvement: higher compute density with NVFP4, expanded memory capacity and a higher power ceiling that lets the GPU sustain peak performance.
Scale: Largest Blackwell Cluster in MLPerf Training
To support distributed training at scale, NVIDIA offers two complementary scale-out networking platforms - NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet - giving data centers the flexibility to build large-scale clusters optimized for their infrastructure.
On DeepSeek-V3 671B, the largest MoE model in the suite, NVIDIA scaled its submission to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date. NVIDIA also submitted results at 5,120 GPUs with NVIDIA GB200 NVL72 systems on Llama 3.1 405B, one of the largest dense LLMs in the suite.
This round's results also reflect the deep co-engineering between NVIDIA and its partners on system architecture, networking and software:
- Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark.
- CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking.
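To make the MoE all-to-all routing challenge described above more concrete, here is a minimal Python sketch of how token routing turns into GPU-to-GPU traffic. It is an illustration under stated assumptions, not NVIDIA's implementation: the function names (`route_tokens`, `off_gpu_fraction`), the random stand-in for a learned router, and the even split of experts and tokens across GPUs are all hypothetical choices made for the example.

```python
import random


def route_tokens(num_tokens, num_experts, num_gpus, top_k, seed=0):
    """Return send_counts[src][dst]: token copies GPU `src` sends to GPU `dst`.

    Assumptions (hypothetical, for illustration only):
    - experts and tokens are split evenly across GPUs,
    - a random choice stands in for the learned router.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    tokens_per_gpu = num_tokens // num_gpus
    send_counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src in range(num_gpus):
        for _ in range(tokens_per_gpu):
            # Each token is dispatched to top_k distinct experts.
            for expert in rng.sample(range(num_experts), top_k):
                dst = expert // experts_per_gpu  # GPU that hosts this expert
                send_counts[src][dst] += 1
    return send_counts


def off_gpu_fraction(send_counts):
    """Fraction of token copies that must leave the GPU they started on."""
    total = sum(map(sum, send_counts))
    local = sum(send_counts[i][i] for i in range(len(send_counts)))
    return (total - local) / total


if __name__ == "__main__":
    counts = route_tokens(num_tokens=4096, num_experts=64, num_gpus=8, top_k=2)
    print(f"off-GPU fraction: {off_gpu_fraction(counts):.2f}")
```

With roughly uniform routing, most token copies land on an expert hosted by a different GPU, so nearly every GPU exchanges data with every other GPU on each MoE layer. That is the traffic pattern for which interconnect bandwidth becomes the limiting factor.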
For more details please read the original article at NVIDIA Blog.