RL without TD learning

Overview

This Berkeley BAIR blog post introduces a reinforcement learning approach based on a divide-and-conquer paradigm rather than temporal difference (TD) learning. The author argues that TD learning struggles to scale to long-horizon tasks because errors in bootstrapped value estimates accumulate across the horizon. The proposed divide-and-conquer method splits a trajectory into two equal-length segments and combines their values. In theory, the number of Bellman recursions then grows logarithmically with the horizon rather than linearly, which offers a path toward scalable off-policy RL.

Key Takeaways

›The post proposes doing reinforcement learning by divide and conquer instead of temporal difference learning.
›The problem setting is off-policy RL, which can use old data, demonstrations and other sources rather than only fresh data.
›TD learning struggles on long-horizon tasks because bootstrapping errors accumulate over the entire horizon.
›n-step TD mixes TD and Monte Carlo returns, but it only reduces the number of Bellman recursions by a constant factor and adds variance.
›Divide and conquer splits a trajectory into two equal-length segments and combines their values to update the value of the full trajectory.
›In theory, the number of Bellman recursions then grows logarithmically with the horizon rather than linearly, and there is no hyperparameter like n to tune.

On-policy versus off-policy RL

The author reviews the two main classes of RL algorithms.

›On-policy RL can only use fresh data from the current policy and must discard old data after each update.
›PPO, GRPO and policy gradient methods in general belong to the on-policy category.
›Off-policy RL can use any data, including old experience, human demonstrations and Internet data.

The author notes that off-policy RL is more general and flexible than on-policy RL, but also harder. Q-learning is the best-known off-policy algorithm. In domains where data collection is expensive, such as robotics, dialogue systems and healthcare, off-policy RL is often the only option, which makes it an important problem.
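As a rough illustration of the difference in data usage, the sketch below keeps transitions from several sources in one replay buffer and samples from all of them. The `ReplayBuffer` class, its method names and the toy transitions are hypothetical and do not come from the post. An on-policy method would instead throw its data away after each update.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Off-policy learning may draw any stored transition, whatever its origin.
        k = min(batch_size, len(self.storage))
        return random.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=1000)
    buffer.add(("s0", "a0", 0.0, "s1"))  # old experience from an earlier policy
    buffer.add(("s1", "a1", 1.0, "s2"))  # a human demonstration
    buffer.add(("s2", "a0", 0.0, "s3"))  # fresh data from the current policy
    print(buffer.sample(batch_size=2))
```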

The scalability problem with TD learning

The post explains why TD-based value learning does not scale to long horizons.

›Off-policy RL typically trains a value function with TD learning using the Bellman update rule.
›The error in the next value propagates to the current value through bootstrapping.
›These errors accumulate over the entire horizon, which makes TD learning struggle on long-horizon tasks.
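The bootstrapping described above can be sketched with a tabular Q-learning backup. This is a minimal sketch, not the post's code: the function names, the nested-list Q-table and the toy error model (every backup adds the same fixed error) are assumptions made for illustration.

```python
def td_backup(q, s, a, r, s_next, done, gamma=0.99, lr=1.0):
    """Tabular Q-learning backup toward r + gamma * max_a' Q(s', a')."""
    # Bootstrapping: the target reuses the current estimate at the next state,
    # so any error in q[s_next] leaks into q[s][a].
    bootstrap = 0.0 if done else max(q[s_next])
    target = r + gamma * bootstrap
    q[s][a] += lr * (target - q[s][a])
    return target


def accumulated_error(horizon, per_backup_error, gamma=1.0):
    """Toy model (an assumption): each backup inherits the downstream error
    and adds a fixed error of its own."""
    error = 0.0
    for _ in range(horizon):
        error = gamma * error + per_backup_error
    return error


if __name__ == "__main__":
    q = [[0.0, 0.0], [0.5, 2.0]]  # q[state][action]
    print(td_backup(q, s=0, a=1, r=1.0, s_next=1, done=False, gamma=0.9))
    # Under the toy model, error grows with the number of chained backups.
    print(accumulated_error(horizon=10, per_backup_error=0.01))
    print(accumulated_error(horizon=100, per_backup_error=0.01))
```

Under this toy model with no discounting, the error at the start of the chain grows in proportion to the horizon, which is the scaling concern the post raises.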

The author says that, as of 2025, there are reasonably good recipes for scaling on-policy RL such as PPO and GRPO, but no scalable off-policy RL algorithm that handles complex, long-horizon tasks well.

Mixing TD with Monte Carlo

n-step TD is a partial fix that the author finds unsatisfactory.

›n-step TD (TD-n) uses the actual Monte Carlo return for the first n steps, then a bootstrapped value for the rest.
›This reduces the number of Bellman recursions by a factor of n, so errors accumulate less.
›In the limit as n goes to infinity, it recovers pure Monte Carlo value learning.

The author calls this highly unsatisfactory for two reasons. First, it does not fundamentally solve error accumulation; it only reduces the recursions by a constant factor n. Second, as n grows, the method suffers from high variance and suboptimality, so n must be carefully tuned for each task.
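The n-step target can be written out as a short sketch. The function name `n_step_target` and its arguments are hypothetical. For Q-learning, `bootstrap_value` would be the maximum Q-value at the state reached after n steps.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of the first n observed rewards (the Monte Carlo part)
    plus gamma**n times a bootstrapped value estimate (the TD part)."""
    n = len(rewards)
    monte_carlo_part = sum(gamma**i * r for i, r in enumerate(rewards))
    return monte_carlo_part + gamma**n * bootstrap_value


if __name__ == "__main__":
    # n = 1 is ordinary one-step TD: r + gamma * V(next state).
    print(n_step_target([1.0], bootstrap_value=10.0, gamma=0.5))
    # n = 3 relies on three real rewards and discounts the estimate by gamma**3.
    print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

A larger n shrinks the weight on the bootstrapped estimate but leans more on the observed rewards, which is where the extra variance and the need to tune n come from.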

The third paradigm: divide and conquer

The author proposes divide and conquer as a fundamentally different approach.

›The key idea is to divide a trajectory into two equal-length segments and combine their values.
›Combining segment values updates the value of the full trajectory.
›In theory, the number of Bellman recursions then grows logarithmically with the horizon rather than linearly.

Unlike n-step TD, divide and conquer does not require choosing a hyperparameter like n, and the author argues it does not necessarily suffer from high variance or suboptimality. The author claims this third paradigm may provide an ideal solution to off-policy RL that scales to arbitrarily long-horizon tasks.
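The split-and-combine idea in this section can be illustrated with a classic analogue: path doubling on a small deterministic chain. This is not the author's algorithm. The chain environment, the step-count cost and all function names are assumptions chosen to show why combining two halves needs only logarithmically many rounds.

```python
import math

INF = math.inf


def chain_costs(num_states):
    """One-step costs on a hypothetical chain 0 -> 1 -> ... -> num_states - 1."""
    return [
        [0 if i == j else (1 if j == i + 1 else INF) for j in range(num_states)]
        for i in range(num_states)
    ]


def combine_halves(cost):
    """Best cost from i to j through any midpoint w. If `cost` covers paths of
    up to k steps, the result covers paths of up to 2k steps."""
    n = len(cost)
    return [
        [min(cost[i][w] + cost[w][j] for w in range(n)) for j in range(n)]
        for i in range(n)
    ]


def rounds_until_solved(num_states):
    """Count combine rounds until the start-to-end cost becomes finite."""
    cost = chain_costs(num_states)
    rounds = 0
    while cost[0][num_states - 1] == INF:
        cost = combine_halves(cost)
        rounds += 1
    return rounds, cost[0][num_states - 1]


if __name__ == "__main__":
    # 17 states means a 16-step horizon from the first state to the last.
    print(rounds_until_solved(17))
```

Each round doubles the path length that the table covers, so a 16-step horizon needs 4 rounds here, whereas a one-step backup would have to be chained 16 times.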

Frequently Asked Questions

What problem does this post address?

It addresses the lack of a scalable off-policy RL algorithm for complex, long-horizon tasks, which the author attributes to error accumulation in temporal difference learning.

Why does TD learning struggle with long horizons?

In TD learning, the error in the next value propagates to the current value through bootstrapping, and these errors accumulate over the entire horizon.

What is the divide-and-conquer idea?

It divides a trajectory into two equal-length segments and combines their values to update the value of the full trajectory. In theory, this makes the number of Bellman recursions logarithmic in the horizon rather than linear.

How is divide and conquer better than n-step TD?

n-step TD only reduces the number of Bellman recursions by a constant factor n and adds variance as n grows, so n must be tuned for each task. Divide and conquer, in theory, makes the number of recursions logarithmic in the horizon and has no such hyperparameter to tune.
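For a sense of scale, the arithmetic behind this comparison can be sketched as follows. The function names and the horizon of 1,000 steps are illustrative assumptions. The counts are idealized: they only count how many times a value estimate is bootstrapped from another estimate.

```python
import math


def td_recursions(horizon, n=1):
    """Chained bootstraps needed by n-step TD to span the horizon."""
    return math.ceil(horizon / n)


def divide_and_conquer_recursions(horizon):
    """Halvings needed to reduce the horizon to single steps, i.e. the
    ceiling of log2(horizon), computed with exact integer arithmetic."""
    return (horizon - 1).bit_length()


if __name__ == "__main__":
    horizon = 1000
    print("1-step TD:", td_recursions(horizon))
    print("10-step TD:", td_recursions(horizon, n=10))
    print("divide and conquer:", divide_and_conquer_recursions(horizon))
```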

What is off-policy RL and why does it matter?

Off-policy RL can use any data, including old experience, demonstrations and Internet data, which is essential in domains like robotics, dialogue systems and healthcare where data collection is expensive.

The author presents divide and conquer as a promising third paradigm for value learning that could make off-policy RL scale to arbitrarily long-horizon tasks.

Originally published by Berkeley BAIR
