Timing Trick Cuts Energy Used in LLM Training by Up to 14 Percent

Overview

Researchers at the University of Twente in the Netherlands cut the energy used to train a large language model by up to 14 percent with almost no slowdown, by finely tuning how fast a GPU's clocks tick during computation. The method, presented at the Computing Frontiers conference in Sicily, adjusts the chip's speed at the level of tiny tasks called kernels rather than across whole training passes. In tests on a 1.3 billion parameter model, the approach saved up to 14.6 percent of energy with only a 0.6 percent increase in training time. For an industry where training a single frontier model can consume tens of gigawatt-hours, savings on that scale would be substantial.

Key Takeaways

A team at the University of Twente reduced GPU energy use in LLM training by up to 14 percent while keeping training speed nearly unchanged.
The technique is dynamic voltage and frequency scaling (DVFS), a method known since the 1990s, applied at a much finer level than before.
Modern GPUs have two clocks, one for the computing core and one for memory; the trick slows the clock of whichever part is not the bottleneck, so the chip draws less power.
The key advance is adjusting clock speed per kernel, a tiny computational task, instead of once per training pass as earlier attempts did.
Tests trained GPT-3-XL, a 1.3 billion parameter model, on a single Nvidia RTX 3080 Ti GPU, saving about 14.6 percent of energy with a 0.6 percent slowdown.
Newer GPU hardware such as Nvidia's Blackwell line switches clock speeds faster, which the researchers expect will unlock fuller savings.

Stats & Key Facts

Up to 14 percent energy savings in LLM training, measured at 14.6 percent in the team's main test.
0.6 percent slowdown in training time, the small performance cost of the method.
About 2 percent energy savings from older pass-level DVFS, the prior approach the new method beats.
1.3 billion parameters in GPT-3-XL, the model used in the experiment.
About 40 kernels per neural network layer, the fine-grained units the team tuned individually.
50 gigawatt-hours estimated to train GPT-4 in 2023, equal to the yearly power use of about 5,000 American homes.

Why LLM Training Burns So Much Electricity

The energy cost of building frontier AI models has grown into a serious expense and environmental concern.

Training a large language model means running billions of calculations across specialized chips for days or weeks. OpenAI's GPT-4 took an estimated 50 gigawatt-hours to train in 2023, roughly the yearly electricity use of 5,000 American homes. Since then the computing resources behind top models have only grown, though companies rarely publish exact power figures.

Against that backdrop, even small percentage savings translate into large amounts of electricity and money. A 14 percent cut applied across a training run that consumes tens of gigawatt-hours represents real reductions in cost and carbon, which is why the University of Twente result drew attention.
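
As a rough sense of scale, the arithmetic behind these figures can be sketched in a few lines. The inputs are the article's own numbers; applying the single-GPU savings rate to a GPT-4-sized run is purely an assumption for illustration, not a result the researchers reported.

```python
# Illustrative arithmetic only. Inputs come from the article; extending the
# single-GPU savings rate to a full-scale training run is an assumption.
GPT4_TRAINING_GWH = 50.0   # estimated GPT-4 training energy
HOMES_EQUIVALENT = 5_000   # homes powered for a year by that energy
SAVINGS_FRACTION = 0.146   # best-case savings measured on one GPU

# Implied yearly electricity use per home, in megawatt-hours (1 GWh = 1,000 MWh).
mwh_per_home = GPT4_TRAINING_GWH * 1_000 / HOMES_EQUIVALENT

# Energy that would be saved if the same rate held at that scale.
saved_gwh = GPT4_TRAINING_GWH * SAVINGS_FRACTION
saved_homes = saved_gwh * 1_000 / mwh_per_home

print(f"Implied use per home: {mwh_per_home:.1f} MWh/year")
print(f"Hypothetical savings: {saved_gwh:.1f} GWh, about {saved_homes:.0f} home-years")
```

Under those assumptions the saving would be about 7.3 gigawatt-hours, or the yearly use of roughly 730 homes.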

How Slowing the Right GPU Clock Saves Power

The method targets wasted power by slowing whichever part of the chip is not the bottleneck at any moment.

›Every chip uses at least one clock that triggers each operation; faster ticking means faster work but higher power draw.
›Modern GPUs have two clocks, one for the computing core and one for the memory.
›When the core is crunching numbers, the memory clock slows down to cut power without hurting speed.
›When the core waits for data from memory, the core clock slows while the memory clock speeds up.
›Fully switching off the memory section is not an option because GPU designs do not expose that control and restarting it mid-calculation takes too long.
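
The rule described in the list above can be sketched as a small decision function. This is a minimal illustration, not the researchers' code: the `Kernel` type, its two timing fields, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Hypothetical profile of one GPU kernel."""
    name: str
    compute_ms: float  # time the core would need if memory were never a limit
    memory_ms: float   # time memory would need if the core were never a limit


def clock_to_lower(kernel: Kernel) -> str:
    """Return the clock that is not the bottleneck for this kernel."""
    if kernel.compute_ms >= kernel.memory_ms:
        return "memory"  # core-bound: the memory side has slack
    return "core"        # memory-bound: the core has slack


# Invented example kernels: one core-bound, one memory-bound.
kernels = [Kernel("matmul", 4.0, 1.0), Kernel("layernorm", 0.3, 0.9)]
decisions = {k.name: clock_to_lower(k) for k in kernels}
print(decisions)  # {'matmul': 'memory', 'layernorm': 'core'}
```

In a real system the two timings would have to be measured or estimated per kernel; here they are simply given.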

The Per-Kernel Tuning That Made the Difference

The real innovation was adjusting clock speeds at a much finer timescale than earlier researchers tried.

DVFS has existed since at least the 1990s, but earlier efforts to apply it to LLM training either slowed calculations too much or worked at too coarse a level to save useful energy. Prior work set one clock speed for the forward pass, where data flows through the model, and another for backpropagation, where the model's weights get adjusted.

GPU work breaks down into tiny tasks called kernels, such as a single vector-vector multiplication. The Twente team split one neural network layer into roughly 40 kernels and tuned the clock frequencies for each one. This fine-grained control found far greater savings than the per-pass approach, which delivered only about 2 percent.
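
A toy comparison may help show why finer granularity finds more savings. The sketch below assumes a hypothetical profile table with invented timings and energies for three kernels under three clock settings; it does not reproduce the paper's measurements or its search method. With one setting for the whole pass, any setting that slows a single kernel too much is ruled out for all of them; with one setting per kernel, each kernel can take the cheapest option that does not slow it down.

```python
# Hypothetical profile: kernel -> {setting: (time_ms, energy_mj)}. Invented numbers.
PROFILE = {
    "matmul":    {"base": (4.0, 1200.0), "low_mem": (4.0, 1000.0), "low_core": (6.0, 1100.0)},
    "layernorm": {"base": (1.0, 250.0),  "low_mem": (1.5, 260.0),  "low_core": (1.0, 200.0)},
    "softmax":   {"base": (0.5, 120.0),  "low_mem": (0.8, 130.0),  "low_core": (0.5, 100.0)},
}
SLOWDOWN_BUDGET = 0.01  # allow at most 1 percent extra time (assumed budget)


def totals(plan):
    """Total (time_ms, energy_mj) for a plan mapping kernel -> setting."""
    time_ms = sum(PROFILE[k][s][0] for k, s in plan.items())
    energy_mj = sum(PROFILE[k][s][1] for k, s in plan.items())
    return time_ms, energy_mj


def per_pass_plan():
    """One setting for every kernel, as in the coarser earlier approach."""
    best = {k: "base" for k in PROFILE}
    base_time, _ = totals(best)
    for setting in ("low_mem", "low_core"):
        candidate = {k: setting for k in PROFILE}
        time_ms, energy_mj = totals(candidate)
        if time_ms <= base_time * (1 + SLOWDOWN_BUDGET) and energy_mj < totals(best)[1]:
            best = candidate
    return best


def per_kernel_plan():
    """Cheapest setting per kernel that keeps that kernel within the budget."""
    plan = {}
    for kernel, options in PROFILE.items():
        limit = options["base"][0] * (1 + SLOWDOWN_BUDGET)
        allowed = {s: te for s, te in options.items() if te[0] <= limit}
        plan[kernel] = min(allowed, key=lambda s: allowed[s][1])
    return plan


coarse, fine = per_pass_plan(), per_kernel_plan()
print("per pass:  ", coarse, totals(coarse))
print("per kernel:", fine, totals(fine))
```

In this made-up table the per-pass plan cannot leave the baseline setting, while the per-kernel plan lowers energy from 1570 to 1300 millijoules at the same total time.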

Why the GPU Cannot Do This on Its Own

GPUs already adjust their own clock speeds automatically, but the manual method beats them.

Modern GPUs run DVFS automatically when their internal systems sense rising or falling demand, so a natural reaction is to let the chip handle it. Lead author Jeffrey Spaan notes the GPU lacks foresight about which kernels will run next, so it has to make on-the-fly best-effort guesses and never reaches the same savings.

The researchers, by contrast, know the sequence of kernels in advance and plan the clock adjustments ahead of time. Spaan describes his work as finding computing waste by optimizing the hardware for the software rather than the other way around.

What the Experiment Actually Measured

The published results come from a controlled test on a single consumer-grade GPU.

›The team trained GPT-3-XL, a 1.3 billion parameter model, to validate the approach.
›They ran the test on one Nvidia RTX 3080 Ti GPU and focused on training a single layer to save time.
›The kernel-level method reached up to 14.6 percent energy savings against roughly 2 percent for the older pass-level technique.
›Training time grew by only 0.6 percent, meaning speed stayed nearly identical.
›The work appears in a paper titled Reducing Compute Waste in LLMs through Kernel-Level DVFS.
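
The energy and time figures above also imply how much average power draw fell, since energy is average power multiplied by time. The short calculation below uses only the two reported percentages; the variable names are illustrative.

```python
# Energy = average power x time, so power ratio = energy ratio / time ratio.
energy_ratio = 1 - 0.146  # energy after tuning, relative to baseline
time_ratio = 1 + 0.006    # training time after tuning, relative to baseline

power_ratio = energy_ratio / time_ratio
power_drop_percent = (1 - power_ratio) * 100

print(f"Average power falls by about {power_drop_percent:.1f} percent")
```

Because the run takes slightly longer, average power has to fall by slightly more than the energy does, about 15.1 percent here.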

Limits Today and Bigger Savings Ahead

The 14 percent figure is a best case, and the researchers see room to improve it.

Switching clock frequencies is not instant, so how much of the savings is realized depends on how quickly the hardware reacts, and the 14 percent result is the best case on the tested GPU. Newer chips such as Nvidia's Blackwell line switch frequencies faster, which the team expects will let more of the theoretical savings show up in practice.
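
One way to picture the effect of switching delay is a planner that changes clocks only when a kernel runs long enough to amortize the switch. The sketch below is an assumption-laden illustration: the rule of thumb (`min_ratio`), the latencies, and the kernel durations are all invented, and the researchers' actual scheduling logic is not described in the article.

```python
def plan_with_latency(kernels, switch_latency_ms, min_ratio=10.0):
    """Assign a clock setting to each kernel in a known sequence.

    kernels: list of (name, duration_ms, preferred_setting).
    A switch happens only if the kernel lasts at least min_ratio times the
    switching latency (a hypothetical rule); otherwise the current setting stays.
    """
    current = "base"
    plan = []
    for name, duration_ms, preferred in kernels:
        if preferred != current and duration_ms >= min_ratio * switch_latency_ms:
            current = preferred
        plan.append((name, current))
    return plan


def tuned_count(plan, kernels):
    """Count kernels that actually run at their preferred setting."""
    return sum(
        1 for (_, chosen), (_, _, preferred) in zip(plan, kernels) if chosen == preferred
    )


# Invented kernel sequence and latencies, in milliseconds.
sequence = [
    ("matmul_1", 4.0, "low_mem"),
    ("layernorm", 0.3, "low_core"),
    ("softmax", 0.5, "low_core"),
    ("matmul_2", 3.0, "low_mem"),
]
slow_hw = plan_with_latency(sequence, switch_latency_ms=0.1)
fast_hw = plan_with_latency(sequence, switch_latency_ms=0.01)

print("slower switching:", tuned_count(slow_hw, sequence), "of", len(sequence))
print("faster switching:", tuned_count(fast_hw, sequence), "of", len(sequence))
```

With the slower hypothetical latency only the two long kernels run at their preferred setting; with the faster one all four do, which is the intuition behind expecting more savings on faster-switching hardware.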

To make the method practical for others, the researchers are building an automated tool that works out the best frequency settings for a given workload. That would let teams apply the technique without manually mapping every kernel themselves.

Frequently Asked Questions

How much energy does this method actually save?

In the team's test it cut energy use by up to 14.6 percent, which is the basis for the headline figure of up to 14 percent. The older pass-level version of the technique saved only around 2 percent.

Does saving energy make training slower?

Hardly at all. The experiment showed only a 0.6 percent increase in training time, so the speed stayed essentially the same while energy dropped.

What is DVFS in plain terms?

DVFS, or dynamic voltage and frequency scaling, means changing how fast a chip's clock ticks, and the voltage that drives it, while the chip is working. Slowing the clock on the part of the chip that is waiting at a given moment lowers power use without hurting the part doing the real work.

Why is adjusting per kernel better than earlier methods?

Earlier methods set clock speeds once for the forward pass and once for backpropagation, which is too coarse. Tuning each of the roughly 40 kernels in a layer matches the chip's speed to the work much more closely, capturing far more savings.

Will this work on the latest GPUs?

The researchers expect even better results on newer hardware. Chips like Nvidia's Blackwell switch clock frequencies faster, which should let more of the potential savings appear in real training runs.

By tuning GPU clock speeds at the level of individual computational tasks, the University of Twente team cut LLM training energy by up to 14 percent with almost no speed cost. With faster-switching chips and an automated tuning tool in the works, the savings could grow and reach a wider set of training operations.