Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP
Hugging Face published the second part of its PyTorch profiling series, showing how to read GPU performance data and speed up a common AI building block by packaging its math into fewer kernels. The standard version fired 5 separate GPU kernels per forward pass, while compiling the same model with torch.compile cut that to 4 by fusing the small pointwise steps into one kernel. The fused path skipped roughly 50 MB of intermediate memory traffic per pass and ran in 89.4 microseconds, slightly faster than a hand-tuned alternative at 92.8 microseconds.
Key Takeaways
- The article teaches profiling by walking through a Multilayer Perceptron, a stack of math layers found inside most modern AI models, and measuring exactly where GPU time goes.
- Running the model in standard eager mode launched 5 GPU kernels per forward pass: 3 matrix multiplies, 1 activation step, and 1 element-wise multiply.
- Applying torch.compile fused the small pointwise operations plus a reshape into a single kernel, dropping the count to 4 and keeping intermediate results in fast on-chip registers instead of slow GPU memory.
- The fusion removed about 50 MB of intermediate tensor writes to GPU memory on every forward pass, cutting a full round-trip to high-bandwidth memory.
- The compiled kernel measured 89.4 microseconds versus 92.8 microseconds for the hand-tuned Liger GEGLU MLP, a gap of roughly 4 percent.
- The choice is framed as a trade-off: the compiled kernel is faster for one fixed input size, while the hand-tuned kernel stays consistent across changing input sizes without recompiling.
Stats & Key Facts
- 5 GPU kernels launched per forward pass in standard eager mode, reduced to 4 after compiling
- Roughly 50 MB of intermediate memory traffic removed per forward pass through fusion
- 89.4 microseconds measured for the compiled fused kernel
- 92.8 microseconds measured for the hand-tuned Liger GEGLU MLP, about 4 percent slower
- Test tensor shaped 8192 by 768, formed from a batch of 64 sequences at 128 tokens each
- Model width of 768 and hidden width of 3072 used as the fixed test dimensions
What a Multilayer Perceptron Is and Why It Matters for AI Cost
The example centers on a small but widely used component of neural networks.
A Multilayer Perceptron, often shortened to MLP, is a stack of math layers that sits inside most modern AI models, including the language models behind chat assistants. Each layer multiplies inputs by learned weights and applies a shaping function, then passes the result forward.
Because this block runs over and over during both training and everyday use, how efficiently it runs on the GPU directly shapes the compute bill. The Hugging Face post uses it as a teaching case to show that the same model produces the same answers at different costs depending on how its math is packaged for the hardware.
Five Kernels in Eager Mode: Where the GPU Time Goes
Profiling the standard run exposed five separate units of GPU work.
- Three matrix multiplies, the heavy math steps that combine inputs with weights.
- One activation step using a GeLU shaping function with the tanh approximation.
- One element-wise multiply that combines two intermediate results.
- Each kernel hand-off and each trip to GPU memory adds overhead the profiler makes visible.
Standard PyTorch runs code in what is called eager mode, executing each operation as a separate GPU kernel. The profiler showed five kernels per forward pass, and the gaps between them are time the chip spends coordinating rather than computing.
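A minimal sketch of a gated MLP with this shape of work, for illustration only. The class name, the layer names (`gate_proj`, `up_proj`, `down_proj`), and the choice of `bias=False` are assumptions, not taken from the Hugging Face post; the comments mark which line would map to which of the five eager-mode kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GegluMLP(nn.Module):
    """Illustrative gated MLP: down(gelu_tanh(gate(x)) * up(x))."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072) -> None:
        super().__init__()
        # Hypothetical layer names; bias=False is an assumption.
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.gate_proj(x)                 # matrix multiply 1
        up = self.up_proj(x)                     # matrix multiply 2
        act = F.gelu(gate, approximate="tanh")   # activation step (GeLU, tanh approximation)
        hidden = act * up                        # element-wise multiply
        return self.down_proj(hidden)            # matrix multiply 3
```

In eager mode each commented line runs as its own operation, so `act` and `hidden` are materialized as full tensors between steps.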
How torch.compile Fuses the Math Into Fewer Kernels
Compiling the same model collapsed the small steps into one combined kernel.
torch.compile is a built-in PyTorch feature that analyzes the model and rewrites its operations into more efficient GPU code with almost no change to the original program. In this case it merged the two pointwise steps, the activation and the multiply, along with a reshape, into a single fused kernel.
The result dropped the kernel count from five to four: three matrix multiplies plus one fused kernel. The fusion also keeps intermediate results in registers, fast memory that sits inside the chip, instead of writing them out to the much slower high-bandwidth memory and reading them back.
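As a rough sketch of what the fusion targets, the two pointwise steps can be written as one small function and handed to `torch.compile`. The function names here are hypothetical, and the post compiles the whole model rather than an isolated function; whether the two operations actually fuse into one kernel depends on the backend and hardware, so this is only an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F


def geglu_pointwise(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # In eager mode these are two separate operations:
    # GeLU with the tanh approximation, then an element-wise multiply.
    return F.gelu(gate, approximate="tanh") * up


def make_compiled_pointwise():
    # torch.compile wraps the function; the actual compilation is deferred
    # until the wrapped function is first called with real inputs.
    return torch.compile(geglu_pointwise)
```

Calling the function returned by `make_compiled_pointwise()` should give the same values as `geglu_pointwise`, up to floating-point tolerance; the difference is in how many kernels run and where the intermediate lives.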
Cutting 50 MB of Memory Traffic Per Forward Pass
The biggest saving came from avoiding a round-trip to GPU memory.
In eager mode, the intermediate result of the activation step, a tensor shaped 8192 by 3072 (the 8192 input rows at the hidden width) in the model's working format, makes a full round-trip through high-bandwidth memory before the next step uses it. That intermediate weighs in at roughly 50 MB, which matches two bytes per element.
Moving data to and from this memory is one of the slowest parts of GPU work. By keeping the intermediate on-chip, the fused kernel erases that round-trip entirely, which is where much of the speedup comes from. Less memory traffic means less waiting and lower energy use per pass.
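The 50 MB figure can be sanity-checked with simple arithmetic. The sketch below assumes the intermediate has the hidden width of 3072 and that each element takes 2 bytes (a 16-bit format); the element size is an assumption inferred from the numbers, not something stated here. The helper name is hypothetical.

```python
def intermediate_bytes(batch: int, seq_len: int, width: int, bytes_per_elem: int) -> int:
    """Size in bytes of one (batch * seq_len) x width tensor."""
    return batch * seq_len * width * bytes_per_elem


# 64 sequences x 128 tokens = 8192 rows, hidden width 3072,
# 2 bytes per element (assumed 16-bit format).
size = intermediate_bytes(batch=64, seq_len=128, width=3072, bytes_per_elem=2)
print(size)                   # 50331648 bytes
print(round(size / 1e6, 1))   # 50.3 (MB)
```

At the model width of 768 the same calculation gives about 12.6 MB, so the roughly 50 MB quoted lines up with a tensor at the hidden width rather than at the input width.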
Compiled Kernel Versus Hand-Tuned Liger: 89.4 vs 92.8 Microseconds
The post compares the automatically compiled kernel against a kernel an expert wrote by hand.
- Compiled fused kernel: 89.4 microseconds, generated for this exact input shape.
- Liger GEGLU MLP: 92.8 microseconds, a pre-built kernel fetched from the Hugging Face Hub through the kernels library.
- The two land within about 4 percent of each other on speed.
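The "about 4 percent" figure follows directly from the two timings. A short worked check, with hypothetical variable names, shows that the exact value depends slightly on which timing is used as the baseline:

```python
compiled_us = 89.4   # compiled fused kernel
liger_us = 92.8      # hand-tuned Liger GEGLU MLP

gap_us = liger_us - compiled_us

# Gap relative to the compiled kernel (how much slower Liger is).
pct_vs_compiled = 100 * gap_us / compiled_us
# Gap relative to the Liger kernel (how much faster compiled is).
pct_vs_liger = 100 * gap_us / liger_us

print(round(pct_vs_compiled, 1))  # 3.8
print(round(pct_vs_liger, 1))     # 3.7
```

Either way the difference is a little under 4 percent, or about 3.4 microseconds per forward pass.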
The Liger GEGLU MLP is a hand-written kernel from the open source Liger library, downloaded as a version-pinned, pre-built package rather than compiled on the user's machine. It avoids compile overhead such as retracing the model and checking guards each time conditions change.
The Real Trade-Off: Specialized Speed Versus Stable Flexibility
The post reframes the comparison away from a simple human-versus-machine contest.
The compiled kernel is fast precisely because it was generated for one exact input size. Change the batch size or sequence length and the compiler might need to rebuild, paying that setup cost again.
The hand-tuned Liger kernel is generic. It stays consistent across changing input sizes without recompiling, picking its launch settings from the data it receives. The author frames the choice as a fast generic kernel versus a kernel specialized for one fixed shape, not slow versus fast.
How the Profiling Was Done and on What Hardware
The walkthrough relied on standard PyTorch tooling and a fixed test setup.
- Used the built-in PyTorch profiler to capture per-kernel timing.
- Viewed the captured traces in Perfetto, a trace visualization tool.
- Ran on an NVIDIA A100 GPU with 80 GB of memory.
- Fixed the test at a batch of 64 sequences, 128 tokens each, model width 768, hidden width 3072, so the numbers are repeatable.
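A setup along these lines can be sketched with `torch.profiler`. This is not the post's actual script: the function name, the warm-up count, and the optional trace path are assumptions for illustration. The sketch records CUDA activity only when a GPU is available, and can export a Chrome-format JSON trace, which Perfetto can open.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_forward(model, x, trace_path=None, warmup=3):
    """Profile one forward pass and return aggregated per-operation stats."""
    use_cuda = torch.cuda.is_available()
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad():
        # Warm-up runs (an assumption) so one-time setup cost is not profiled.
        for _ in range(warmup):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()

        with profile(activities=activities) as prof:
            model(x)
            if use_cuda:
                # Wait for queued GPU work so it lands inside the trace.
                torch.cuda.synchronize()

    if trace_path is not None:
        # Chrome trace JSON; loadable in Perfetto.
        prof.export_chrome_trace(trace_path)
    return prof.key_averages()
```

For example, `print(profile_forward(model, x).table(row_limit=10))` prints a summary table, and passing `trace_path="trace.json"` writes a file that can be loaded into Perfetto to inspect individual kernels and the gaps between them.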
Frequently Asked Questions
What is kernel fusion and why does it speed up an AI model?
Kernel fusion combines several small GPU operations into one, so the chip does the work in a single pass instead of handing off between separate steps. It speeds things up by avoiding repeated trips to slow GPU memory, keeping intermediate results in fast on-chip registers.
How much faster was the fused approach in this example?
The compiled fused kernel ran in 89.4 microseconds versus 92.8 microseconds for the hand-tuned Liger version, a difference of about 4 percent. The larger structural win was cutting kernel launches from 5 to 4 and removing roughly 50 MB of memory traffic per forward pass.
What is torch.compile?
torch.compile is a built-in PyTorch feature that analyzes a model and rewrites its operations into more efficient GPU code with almost no change to the original program. In this example it automatically fused the activation, multiply, and reshape steps into one kernel.
Should a team use the compiled kernel or the hand-tuned one?
It depends on the workload. The compiled kernel is slightly faster for one fixed input size, while the hand-tuned Liger kernel stays consistent across changing input sizes without rebuilding, which suits situations where batch size or sequence length varies.
Why does this matter for a business running AI products?
The same model produces the same answers at different compute costs depending on how its math is packaged for the GPU. Cutting kernel launches and memory traffic lowers the GPU bill behind AI products without changing what the model does.
The post shows that small engineering choices in how a model's math is packaged for the GPU directly shape both speed and the compute bill. For business readers, the lesson is that running AI more cheaply is often about smarter execution, not just bigger hardware.