DiffusionGemma: 4x faster text generation

Overview

Google DeepMind released DiffusionGemma, an experimental open-weights model that writes text through a diffusion process instead of one token at a time, running up to 4x faster than standard autoregressive models on GPUs. The model is a 26B Mixture of Experts design that activates only about 3.8B parameters per step and generates 256 tokens in parallel. It reaches more than 1,000 tokens per second on a single NVIDIA H100 and is free to use under the Apache 2.0 license.

Key Takeaways

DiffusionGemma generates a whole 256-token block at once and refines it over several passes, the source of its speed advantage over models that produce one token at a time.
It is a 26B Mixture of Experts model that activates roughly 3.8B parameters during inference, so it stays fast while drawing on a large knowledge base.
Speed reaches over 1,000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on a consumer RTX 5090 graphics card.
When quantized to 4-bit precision, the model fits in about 18GB of VRAM, putting it within reach of a single high-end consumer card.
The weights are free under the permissive Apache 2.0 license for both commercial and research use, available on Hugging Face, Kaggle, and Google Cloud.
Google states output quality is lower than standard Gemma 4 and recommends Gemma 4 for production work needing top quality.

Stats & Key Facts

›26B total parameter Mixture of Experts model, sized for a large knowledge base
›About 3.8B parameters active per inference step, keeping compute low
›Up to 4x faster text generation than standard autoregressive models on GPUs
›More than 1,000 tokens per second on a single NVIDIA H100 GPU
›More than 700 tokens per second on a consumer NVIDIA GeForce RTX 5090
›256 tokens generated in parallel on each forward pass
›Fits within roughly 18GB of VRAM when quantized to 4-bit precision

Text Diffusion Replaces Word-by-Word Generation

The core change is how the model puts words on the page.

Standard language models build a sentence token by token, in strict order, committing to each word before moving to the next. DiffusionGemma works differently. It starts with a block of random placeholder tokens and refines them across several passes until clear text emerges, similar to how an image diffusion model sharpens noise into a picture.

Because the model treats a 256-token block as a single canvas, it fills and corrects all positions at the same time. As confident words lock into place, they guide the refinement of the remaining ones. This parallel approach is what produces the speed gain over one-word-at-a-time generation.
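
The refinement loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: the function names, the `<mask>` placeholder, and the confidence-based commit schedule are all assumptions made for illustration, and the stand-in "denoiser" simply peeks at the target text and attaches a random confidence. Unlike the real model, it never revises a token once it is committed. What it does show is the general shape of the idea: every open position gets a proposal on each pass, and the most confident proposals lock in first.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, chosen for this sketch


def toy_denoiser(block, target, rng):
    """Stand-in for a real model: propose (token, confidence) for every masked slot.

    It cheats by reading the target, so only the scheduling logic is realistic.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def refine_block(target, num_passes=4, seed=0):
    """Fill a fully masked block over several passes, most confident slots first."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    for step in range(num_passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break  # every position is already committed
        # Spread the remaining slots evenly over the remaining passes
        # (ceiling division), so the last pass commits whatever is left.
        remaining_passes = num_passes - step
        quota = -(-len(proposals) // remaining_passes)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _confidence) in ranked[:quota]:
            block[i] = tok
    return block


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(refine_block(words, num_passes=3)))
```

The number of passes, not the number of tokens, sets the number of model calls in this sketch, which is the intuition behind the speed advantage over generating one token per call.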

26B Mixture of Experts With Only 3.8B Active Parameters

The architecture balances size with efficiency.

›Total size of 26B parameters in a Mixture of Experts layout, giving the model a broad knowledge base.
›Only about 3.8B parameters activate during any single inference step, which keeps the model fast and light on compute.
›Built on the Gemma 4 family with an added diffusion head that handles the block generation.
›Uses bidirectional attention, so the model reads context in both directions and can correct earlier tokens as it writes.
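
To make the "26B total, 3.8B active" split concrete, here is a minimal sketch of top-k expert routing, the general mechanism Mixture of Experts models use to activate only part of their weights. The expert count (4) and `k=2` below are hypothetical values picked for the example; the announcement does not state DiffusionGemma's router configuration, and `route_top_k` is an illustrative helper, not a library function.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Figures from the announcement, in billions of parameters.
TOTAL_PARAMS_B = 26.0
ACTIVE_PARAMS_B = 3.8
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    # Hypothetical router scores for one token across 4 experts.
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
    print(f"active share of weights: {active_fraction:.1%}")
```

With the published figures, roughly 15% of the weights do work on any given step, while the rest sit idle until the router selects them.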

Speed Benchmarks on Data Center and Consumer GPUs

Google published throughput figures on two classes of hardware.

›More than 1,000 tokens per second on a single NVIDIA H100 data center GPU.
›More than 700 tokens per second on an NVIDIA GeForce RTX 5090, a high-end consumer card.
›Up to 4x faster overall than comparable autoregressive models on GPUs.
›256 tokens produced per forward pass, the structural reason behind the throughput numbers.
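
As a rough worked example, the published throughput figures translate into wall-clock time per 256-token block as follows. The helper name is made up for this sketch, and because the figures are quoted as "more than", the results are upper bounds on latency rather than measurements.

```python
BLOCK_TOKENS = 256  # block size quoted in the announcement


def seconds_per_block(tokens_per_second, block_tokens=BLOCK_TOKENS):
    """Time to emit one full block at a given sustained throughput."""
    return block_tokens / tokens_per_second


if __name__ == "__main__":
    for name, tps in [("H100", 1000), ("RTX 5090", 700)]:
        print(f"{name}: at most {seconds_per_block(tps):.3f} s per block")
```

At these rates a full block lands in roughly a quarter to a third of a second, which is the range where inline editing and similar interactive uses start to feel immediate.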

Runs on a Single Card With 18GB of VRAM

Hardware needs stay modest once the model is compressed.

When quantized to 4-bit precision, DiffusionGemma fits within roughly 18GB of VRAM. That figure matters because it brings the model within reach of a single high-end consumer graphics card rather than requiring a cluster of data center hardware.
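
A back-of-the-envelope calculation shows where a figure in that range comes from. The sketch below counts weight storage only, and it assumes all 26B parameters must be resident in memory even though only about 3.8B are active per step. How the remaining gap up to 18GB is spent (quantization metadata, activations, runtime buffers, or layers kept at higher precision) is not broken down in the announcement, so treat that part as a guess.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9      # total parameters, from the announcement
QUOTED_VRAM_GB = 18      # quoted footprint when quantized to 4-bit

weights_4bit_gb = weight_memory_gb(TOTAL_PARAMS, 4)
weights_16bit_gb = weight_memory_gb(TOTAL_PARAMS, 16)
headroom_gb = QUOTED_VRAM_GB - weights_4bit_gb  # everything that is not raw weights

if __name__ == "__main__":
    print(f"4-bit weights:  {weights_4bit_gb:.1f} GB")
    print(f"16-bit weights: {weights_16bit_gb:.1f} GB")
    print(f"headroom within 18 GB: {headroom_gb:.1f} GB")
```

The raw 4-bit weights come to about 13GB, versus about 52GB at 16-bit precision, which is why quantization is the step that makes a single-card deployment plausible.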

For a business team, this lowers the cost barrier to running a fast open model locally. Work stays on owned hardware, which helps with data privacy and removes per-token usage fees tied to hosted services.

Open License and Day-Zero Framework Support

The release is built for immediate adoption across common tooling.

›Released under the Apache 2.0 license, a permissive license allowing commercial and research use.
›Weights available on Hugging Face, Kaggle, and Google Cloud's model garden.
›Day-zero support across MLX, vLLM, Hugging Face Transformers, and NVIDIA NeMo.
›Additional support through Unsloth, NVIDIA NIM, and a JAX diffusion toolbox, with llama.cpp support arriving soon.

Where DiffusionGemma Fits and Where It Does Not

Google frames the model around a clear trade between speed and polish.

The model is aimed at speed-critical work such as inline editing, rapid iteration, and local low-latency applications where fast turnaround matters more than top output quality. The parallel structure also suits non-linear text such as code infilling and structured sequences.

Google is direct about the limits. Output quality is deliberately lower than standard Gemma 4, and the company recommends Gemma 4 for production cases that need maximum quality. DiffusionGemma is labeled experimental, so teams should treat it as a tool for responsive prototypes and interactive features rather than a drop-in replacement for a flagship model.

Frequently Asked Questions

What makes DiffusionGemma faster than a normal language model?

It generates a full block of 256 tokens at once and refines the block over several passes, instead of producing one word at a time in order. This parallel approach is the source of its speed advantage of up to 4x on GPUs.

What hardware do I need to run it?

When quantized to 4-bit precision, the model fits within about 18GB of VRAM, so it runs on a single high-end consumer card such as an NVIDIA GeForce RTX 5090. On that card it reaches over 700 tokens per second, and on a data center H100 it reaches over 1,000.

Is DiffusionGemma free to use commercially?

Yes. The weights are released under the Apache 2.0 license, a permissive license that allows both commercial and research use, and they are available on Hugging Face, Kaggle, and Google Cloud.

Is it as good as standard Gemma 4?

No. Google states output quality is deliberately lower than standard Gemma 4 because the model prioritizes speed. Google recommends using Gemma 4 for production cases that need the highest quality.

What is DiffusionGemma best used for?

Google positions it for speed-critical and interactive work such as inline editing, rapid iteration, code infilling, and local low-latency applications where fast turnaround matters more than top polish.

DiffusionGemma shows a clear trade between speed and polish, offering an open, fast model that runs on a single card for teams building responsive, low-latency applications. For work needing the highest output quality, Google still points to standard Gemma 4.