Evolving Dataflow to process massive datasets for machine learning

Overview

Google created MapReduce more than 20 years ago to handle its early data-processing scaling problems. The company has since evolved its internal data platform, Flume, the successor to MapReduce, with work focused on scalability, efficiency, and developer experience. Many of those features are now available in Dataflow, Google's fully managed batch and streaming platform. The post explains the new capabilities and how Google Cloud customers apply them.

Key Takeaways

Flume is the successor to MapReduce, and many of its innovations ship in Dataflow, which is built on the same core technology Google uses for its own demanding internal workloads.
Scalability features include liquid sharding, global compute, automatic pipeline optimization, rate-limiting of external API calls, and tandem pools for serverless remote inference.
Efficiency features target accelerators like TPUs through heterogeneous worker pools, TPU-aware autoscaling, duty-cycle policy enforcement, and TPU fungibility.
Modern machine learning steps such as data ingestion, transformation, and feature extraction depend on processing very large datasets.
The work is driven by AI-era demands, from training models like Gemini to powering autonomous vehicles like Waymo.

Stats & Key Facts

#MapReduce was created more than 20 years ago
#Scale of data processing at Google has grown over the last 20 years

Why Google Evolved Its Data Platform

The AI era demands efficient, large-scale data processing.

›Training frontier models like Gemini by Google DeepMind requires large-scale data processing.
›Fully autonomous vehicles like Waymo also depend on processing massive datasets.
›Data ingestion, transformation, and feature extraction all rely on large datasets.

To meet the scale required across Google, the company evolved Flume, the successor to the original MapReduce, with a focus on scalability, efficiency, and a better developer experience. Many of those innovations are available in Dataflow, Google's fully managed batch and streaming platform built on the same core technology Google uses internally.

Features for Massive Scalability

Several features address the challenges of immense scale and are available in Dataflow.

›Liquid sharding dynamically splits work units during execution for on-the-fly rebalancing, helping pipelines with uneven data distribution and stragglers.
›Global compute scales by dynamically scheduling workloads across Google's global infrastructure, choosing location based on factors like data locality and resource availability.
›Automatic pipeline optimization fuses consecutive operations into a single stage to reduce I/O and stage-transition overhead.
›Rate-limiting external API calls manages load on external services, important for ML pipelines that call external APIs for tasks like model evaluation.
›Tandem pools support serverless remote inference by hosting, sharing, managing, and autoscaling external model servers.
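
The fusion idea in the list above can be illustrated with a small Python sketch. It is not Dataflow's optimizer; it only shows the principle, assuming every stage is an element-wise function. The names `run_unfused`, `fuse`, and `run_fused` are hypothetical, and the returned counter is a stand-in for the I/O and stage-transition overhead the text mentions.

```python
from typing import Any, Callable, Iterable, List, Tuple

Stage = Callable[[Any], Any]


def run_unfused(data: Iterable[Any], stages: List[Stage]) -> Tuple[List[Any], int]:
    """Run each stage separately, materializing every intermediate collection."""
    current = list(data)
    materialized = 0
    for stage in stages:
        current = [stage(x) for x in current]
        materialized += 1  # one collection written out per stage
    return current, materialized


def fuse(stages: List[Stage]) -> Stage:
    """Compose consecutive element-wise stages into a single function."""
    def fused(x: Any) -> Any:
        for stage in stages:
            x = stage(x)
        return x
    return fused


def run_fused(data: Iterable[Any], stages: List[Stage]) -> Tuple[List[Any], int]:
    """Run all stages as one fused stage, so only the final result is materialized."""
    fused = fuse(stages)
    return [fused(x) for x in data], 1


# Hypothetical three-step pipeline: increment, double, format as text.
stages = [lambda x: x + 1, lambda x: x * 2, str]
unfused_out, unfused_writes = run_unfused(range(4), stages)
fused_out, fused_writes = run_fused(range(4), stages)
```

Both paths produce the same output, but the fused path passes each element through all three functions in one go instead of writing two intermediate collections. A real optimizer also has to decide where fusion is not legal or not profitable, which this sketch ignores.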

Boosting Efficiency With Accelerators

Several features improve utilization and cost efficiency for teams using accelerators like TPUs.

›Heterogeneous worker pools let developers set custom resource requirements per stage, so TPU-intensive work runs on TPU workers while other stages use CPU workers.
›TPU-aware autoscaling avoids assigning too many TPU workers at startup and improves efficiency during later autoscaling.
›Duty-cycle policy enforcement scales down TPU workloads when the accelerator's duty cycle is low and scales back up when utilization improves.
›TPU fungibility allows jobs to be scheduled on the most suitable TPU version and cell location, based on quota and resource availability.
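
As a rough illustration of the duty-cycle policy described above, the following Python sketch picks a worker count from an observed duty cycle. Everything specific here is an assumption made for the example: the function name `next_worker_count`, the 0.3 and 0.7 thresholds, the halving and doubling steps, and the worker bounds. The source does not state how Google's policy is actually tuned.

```python
def next_worker_count(
    current: int,
    duty_cycle: float,
    low: float = 0.3,      # hypothetical "duty cycle is low" threshold
    high: float = 0.7,     # hypothetical "utilization improved" threshold
    min_workers: int = 1,
    max_workers: int = 8,
) -> int:
    """Return the TPU worker count to use for the next interval.

    duty_cycle is the fraction of time (0.0 to 1.0) the accelerator was busy.
    """
    if duty_cycle < low:
        # Accelerators are mostly idle: scale down, but keep at least min_workers.
        return max(min_workers, current // 2)
    if duty_cycle > high:
        # Accelerators are busy again: scale back up, capped at max_workers.
        return min(max_workers, current * 2)
    # Between the thresholds: leave the pool unchanged to avoid flapping.
    return current
```

For example, `next_worker_count(4, 0.1)` returns 2 and `next_worker_count(2, 0.9)` returns 4. Keeping a gap between the two thresholds is a common way to stop the pool from oscillating when utilization hovers near a single cutoff.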

Improving the Developer Experience

Google designed the developer experience for the wide mix of backgrounds and tools found across the company, and invested in capabilities that support two goals.

›Rapid prototyping and iteration.
›Reliable production operations.

From Internal Platform to Dataflow

The features described come from Google's internal platform and are surfaced through Dataflow.

›Dataflow is a fully managed batch and streaming platform.
›It is built on the same core technology Google uses for its most demanding internal workloads.
›Google Cloud customers put these features into action through Dataflow.

Frequently Asked Questions

What is Flume?

Flume is Google's internal data platform and the successor to the original MapReduce, evolved with a focus on scalability, efficiency, and developer experience.

How do these innovations reach customers?

Many of the Flume innovations are available in Dataflow, Google's fully managed batch and streaming platform built on the same core technology Google uses internally.

What does liquid sharding do?

Liquid sharding dynamically splits work units during execution to rebalance on the fly, helping pipelines with uneven data distribution and stragglers.
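
A minimal Python sketch of the splitting step, assuming a work unit is a contiguous index range that a worker processes in order. The `WorkUnit` class and its `split` method are hypothetical names for illustration and do not reflect Flume's or Dataflow's internal API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkUnit:
    start: int     # first index in the range (inclusive)
    end: int       # last index in the range (exclusive)
    position: int  # next index the current worker has not processed yet

    def remaining(self) -> int:
        """Number of elements still to be processed."""
        return self.end - self.position

    def split(self) -> Optional["WorkUnit"]:
        """Hand off the second half of the unprocessed range.

        Shrinks this unit in place and returns a new unit for another worker,
        or None if fewer than two elements remain.
        """
        if self.remaining() < 2:
            return None
        mid = self.position + self.remaining() // 2
        tail = WorkUnit(start=mid, end=self.end, position=mid)
        self.end = mid
        return tail


# A straggler that has only reached index 10 of a 100-element range.
unit = WorkUnit(start=0, end=100, position=10)
tail = unit.split()
```

After the split, the straggler keeps indices 10 to 54 and an idle worker can pick up `tail`, which covers 55 to 99. Already-processed elements are never reassigned, so a split can happen mid-execution without redoing work.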

How does the platform handle TPUs more efficiently?

It uses heterogeneous worker pools, TPU-aware autoscaling, duty-cycle policy enforcement, and TPU fungibility to improve TPU utilization and cost efficiency.

Why does AI need large-scale data processing?

Machine learning steps such as data ingestion, transformation, and feature extraction rely on processing massive datasets. That processing underpins training models like Gemini and powering autonomous vehicles like Waymo.

By making many of Flume's innovations available in Dataflow, Google brings its internal scalability and efficiency work to Cloud customers running large AI workloads.