NVIDIA Accelerates Google DeepMind's DiffusionGemma for Local AI
Google DeepMind released DiffusionGemma, an experimental open model that writes text by refining whole blocks at once instead of one word at a time, and NVIDIA has tuned it to run faster across its GeForce RTX, RTX PRO, and DGX hardware. The model reaches more than 1,000 tokens per second on a single NVIDIA H100 GPU and over 700 tokens per second on a consumer RTX 5090, roughly four times the speed of a comparable standard model. It targets fast, single-user work such as chat and on-device assistants where low delay matters most. The weights are open under the Apache 2.0 license and free to test through NVIDIA-hosted APIs.
Key Takeaways
- DiffusionGemma is a new open text model from Google DeepMind that builds sentences by refining a full block of words at once, similar to how image diffusion models turn noise into a picture.
- NVIDIA optimized the model to run across its chips, from consumer GeForce RTX GPUs to DGX data center systems, so the same model scales from a desktop to the cloud.
- Speed is the headline: the model is roughly four times faster than an equivalent word-by-word model in single-user settings, which lowers waiting time for chat and assistant tasks.
- It uses a mixture-of-experts design with 26 billion total parameters but activates only 3.8 billion per step, keeping the compute load lighter than the full size suggests.
- The weights are open under the Apache 2.0 license, and businesses can test it free through NVIDIA-hosted APIs at build.nvidia.com or download it from Hugging Face.
- Google notes the model trades some output quality for speed, so it sits below the standard Gemma 4 model on general accuracy.
Stats & Key Facts
- #1,000-plus tokens per second of text generation throughput on a single NVIDIA H100 GPU.
- #700-plus tokens per second on a consumer-grade NVIDIA GeForce RTX 5090.
- #Up to 2,000 tokens per second on the larger NVIDIA DGX Station system.
- #Around 4x faster than a comparable word-by-word model in single-user scenarios.
- #26 billion total parameters in the mixture-of-experts model, with only 3.8 billion active per step.
- #Up to 256 tokens denoised in parallel during each generation step.

How DiffusionGemma Writes Text in Parallel Instead of Word by Word
The core idea borrows from how AI image tools work.
Most text AI models work left to right, producing one word, then the next, with each word depending on the ones before it. DiffusionGemma takes a different path. It starts with a block of random placeholder tokens, then refines that whole block at once over several passes until readable text emerges, much the way image diffusion models turn visual noise into a finished picture.
This design lets every word in a block consider all the others at the same time, a property called bidirectional attention. In practice the model denoises up to 256 tokens in a single step rather than emitting them one at a time. That parallel approach is what opens the door to the higher speeds NVIDIA reports.
Speed Numbers Across NVIDIA RTX and DGX Hardware
NVIDIA published throughput figures for several chips.
- ›More than 1,000 tokens per second on a single NVIDIA H100 data center GPU.
- ›Over 700 tokens per second on a consumer NVIDIA GeForce RTX 5090 desktop card.
- ›About 150 tokens per second on the compact NVIDIA DGX Spark system.
- ›Up to 2,000 tokens per second on the larger NVIDIA DGX Station.
- ›Roughly four times the speed of an equivalent word-by-word model for single-user work.
A Mixture-of-Experts Model That Stays Light on Compute
The 26 billion parameter size is less heavy than it sounds.
DiffusionGemma pairs a diffusion head with Google's Gemma 4 architecture, a mixture-of-experts model with 26 billion total parameters. A mixture-of-experts design splits the model into many smaller specialist sections and only switches on the ones a given task needs.
Because of that, the model activates only 3.8 billion parameters per step rather than the full 26 billion. The result is a model whose running cost is closer to a small model even though its total knowledge base is large, which helps it fit on workstation and high-end consumer GPUs.
What Businesses Would Use It For
The model is aimed at fast, single-user tasks.
- ›Interactive chat where short response delay keeps conversations feeling natural.
- ›On-device assistants that plan steps and act without sending data to the cloud.
- ›Agentic loops, where an AI repeats many quick reasoning steps in sequence.
- ›Latency-sensitive workflows for a single person rather than large batch jobs serving many users at once.
The Speed-for-Quality Tradeoff Buyers Should Weigh
Faster does not mean better at everything.
NVIDIA and Google present this as an experimental, speed-optimized model, not a replacement for their strongest systems. Google states that DiffusionGemma trails the standard Gemma 4 model on output quality, so general accuracy is lower in exchange for the speed gains.
The parallel approach shows strength on structured, rule-heavy tasks. Reporting on the release notes a fine-tuned version reached 80 percent accuracy on Sudoku puzzles, where the base model scored zero. For everyday business writing and broad reasoning, teams should test it against their current model before switching.
Open Weights, Apache 2.0 License, and How to Try It
Access is open and broad.
- ›Released as an open-weight model under the Apache 2.0 license, which permits commercial use.
- ›Free to test through NVIDIA-hosted APIs at build.nvidia.com.
- ›Available for download on Hugging Face for teams that want to run it themselves.
- ›Supported by common tools including Hugging Face Transformers, vLLM, Unsloth for fine-tuning, and NVIDIA's NeMo framework, with llama.cpp support listed as coming.
- ›Local consumer deployment is still maturing, since some runtimes lacked full support at launch.
Frequently Asked Questions
What makes DiffusionGemma different from a normal AI text model?
Normal text models write one word at a time, with each word built on the ones before it. DiffusionGemma refines a whole block of up to 256 words at once over several passes, which lets it generate text much faster for single-user tasks.
How fast is DiffusionGemma?
NVIDIA reports more than 1,000 tokens per second on a single H100 data center GPU and over 700 tokens per second on a consumer RTX 5090. That is roughly four times faster than a comparable word-by-word model in single-user scenarios.
Is DiffusionGemma free to use?
Yes. The weights are open under the Apache 2.0 license, which allows commercial use. Businesses can test it free through NVIDIA-hosted APIs at build.nvidia.com or download it from Hugging Face to run themselves.
Is it as accurate as Google's best models?
No. Google describes it as an experimental, speed-focused model that trails the standard Gemma 4 model on output quality. It shows particular strength on structured, rule-based tasks but is not a full replacement for top-tier general models.
What hardware do I need to run it?
It runs across NVIDIA GeForce RTX GPUs, RTX PRO 6000 workstations, and DGX systems, scaling from a local desktop to the cloud. Its mixture-of-experts design activates only 3.8 billion of 26 billion parameters per step, which helps it fit on high-end consumer and workstation cards.
DiffusionGemma signals a shift toward text AI that prioritizes speed for single-user work, and NVIDIA's tuning makes that speed reachable from a desktop GPU up to a data center. The open license and free trial access give businesses a low-cost way to test whether the speed gains fit their tasks before committing.
Continue Learning
Comments
Sign in to join the conversation