Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

Overview

v5 ships the MoE foundations: expert backends, dynamic weight loading, and distributed execution that make MoE extensible and easy to build on. NVIDIA NeMo AutoModel is an open library part of the NVIDIA NeMo framework for building custom generative AI models at scale. NeMo AutoModel builds cleanly on top of v5, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels, and it leans on v5's dynamic weight loading to bring those optimizations to a broad and growing set of model families.

Key Takeaways

7x higher training throughput and 29-32% less GPU memory on fine-tuning MoE models than native Transformers v5, using the same from_pretrained() API: a single import line, with no other code changes.
This blog details how this combination works and how users can fine-tune MoE models faster without changing their APIs.
In addition, v5 made distributed training first-class by integrating PyTorch's DeviceMesh directly into from_pretrained().
NeMo AutoModel builds on top of v5 by subclassing AutoModelForCausalLM, and adding Expert Parallelism (EP), DeepEP fused all-to-all dispatch, and TransformerEngine kernels.
The next section walks through how the two work together and the performance gains we measured, from full fine-tuning NVIDIA Nemotron 3 Ultra 550B A55B across 16 nodes down to single-node models such as Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B .
NeMo AutoModel: Same API, More Performance One of NeMo AutoModel's goals is API compatibility with HuggingFace Transformers to enable open-source community.
Only the import changes: That single import does a lot of work.
For popular MoE architectures like Qwen3, NVIDIA Nemotron , GPT-OSS, and DeepSeek V3, NeMo AutoModel ships hand-tuned implementations with TransformerEngine attention, fused linear layers, and custom expert kernels.
Where NeMo AutoModel really shines is scaling MoE models to multi-GPU training.

Stats & Key Facts

#7x higher training throughput and 29-32% less GPU memory on fine-tuning MoE models than native Transformers v5, using the same from_pretrained() API: a single import line, with no other code changes.
#The next section walks through how the two work together and the performance gains we measured, from full fine-tuning NVIDIA Nemotron 3 Ultra 550B A55B across 16 nodes down to single-node models such as Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B .

7x higher training throughput and 29-32% less GPU memory on fine-tuning MoE models than native Transformers v5, using the same from_pretrained() API: a single import line, with no other code changes. This blog details how this combination works and how users can fine-tune MoE models faster without changing their APIs. Background The rise of MoE models has introduced new challenges to efficient training: Routing tokens across hundreds of experts, fusing expert matmuls into a single kernel, sharding weights across GPUs, and overlapping communication with computation all require infrastructure beyond what a general-purpose library provides out of the box.

Transformers v5 ("v5") introduced first-class MoE support such as expert backends , dynamic weight loading , and tensor parallel plans for distributed execution. In addition, v5 made distributed training first-class by integrating PyTorch's DeviceMesh directly into from_pretrained(). NeMo AutoModel builds on top of v5 by subclassing AutoModelForCausalLM, and adding Expert Parallelism (EP), DeepEP fused all-to-all dispatch, and TransformerEngine kernels.

DeepEP is the piece v5 doesn't have yet: it overlaps communication with expert compute. And because NeMo AutoModel rides v5's reversible weight conversion to load each model, it can focus its engineering on these reusable core ops instead of per-model checkpoint plumbing, while save_pretrained() still emits standard HF checkpoints that tools like vLLM and SGLang can load. The next section walks through how the two work together and the performance gains we measured, from full fine-tuning NVIDIA Nemotron 3 Ultra 550B A55B across 16 nodes down to single-node models such as Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B .

For more details please read the original article at Hugging Face.

Continue Learning

Foundations

AI Fundamentals: Your First Steps

Foundations

History of AI: From Turing to Today

Foundations

How AI Actually Works (Under the Hood)

Originally published by Hugging Face

Read the original

Stats & Key Facts

Continue Learning

Comments