[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al., Princeton

Latent Space · Jan 2, 2026

AI Summary

The episode explores the NeurIPS Best Paper on RL1000, in which Kevin Wang and his Princeton team showed that scaling reinforcement learning networks to 1,000 layers with a self-supervised, contrastive objective unlocks dramatic performance gains. They explain why traditional value-based RL failed to benefit from depth; how residual connections, layer normalization, and a shift to classification-based learning reveal a "critical depth" beyond which performance multiplies once enough data (≈15M transitions) is available; and why depth is more parameter-efficient than width. The discussion also covers the role of JAX-accelerated environments, the potential for distilling deep teachers into lightweight students, and the broader implication that RL can now scale like language and vision by adopting self-supervised representation learning.

Episode Description

From undergraduate research seminars at Princeton to winning the Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, and the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification). We also cover why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), and the "critical depth" phenomenon where performance doesn't just improve, it multiplies, once you cross 15M+ transitions and add the right architectural components. Finally, we get into why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning), how deep-teacher, shallow-student distillation could unlock deployment at scale (train frontier capabilities with 1,000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of manual data collection), and their thesis that RL is finally ready to scale like language and vision: not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.

We discuss:

The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together and states along different trajectories are pushed apart—turning RL into a classification problem (see the contrastive-loss sketch after this list)

Why naive scaling failed: doubling depth degraded performance, but doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the "critical depth" phenomenon (see the residual-block sketch after this list)

Scaling depth vs. width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance

The JAX + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off (see the parallel-rollout sketch after this list)

The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly, it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression

Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension
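
To make the classification framing above concrete, here is a minimal JAX sketch of an InfoNCE-style contrastive critic loss. It illustrates contrastive goal-conditioned RL in general rather than the authors' implementation; the encoder outputs, batch size, and dot-product similarity are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def contrastive_critic_loss(sa_repr, future_repr):
    """InfoNCE-style classification loss over a batch.

    sa_repr:     (batch, dim) representations of (state, action) pairs, phi(s, a)
    future_repr: (batch, dim) representations of future states drawn from the
                 same trajectories, psi(s_future)

    Row i should give its highest score to future_repr[i] (same trajectory)
    and low scores to every other row (different trajectories).
    """
    logits = sa_repr @ future_repr.T                 # (batch, batch) similarity matrix
    log_probs = jax.nn.log_softmax(logits, axis=-1)  # each row is a classification problem
    return -jnp.mean(jnp.diag(log_probs))            # positives live on the diagonal

# Toy usage with random arrays standing in for encoder outputs.
key_sa, key_fut = jax.random.split(jax.random.PRNGKey(0))
sa = jax.random.normal(key_sa, (256, 64))
fut = jax.random.normal(key_fut, (256, 64))
print(contrastive_critic_loss(sa, fut))
```

Note that no reward or TD target appears anywhere: the learning signal is purely which future state belongs to which (state, action) pair, which is the sense in which the objective is self-supervised classification rather than value regression.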
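
Next, the residual-block sketch referenced in the bullets: a rough illustration, again in plain JAX rather than the paper's code, of a pre-LayerNorm residual MLP block of the kind that makes very deep stacks trainable, together with the parameter-count arithmetic behind the depth-vs-width comparison. The width, initializer scale, and pre-LN placement are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def residual_block(params, x):
    # Pre-LN residual block: x + MLP(LayerNorm(x)). The identity skip path
    # keeps gradients flowing even when hundreds of blocks are stacked.
    h = layer_norm(x)
    h = jax.nn.relu(h @ params["w1"] + params["b1"])
    h = h @ params["w2"] + params["b2"]
    return x + h

def init_blocks(key, depth, width):
    blocks = []
    for k in jax.random.split(key, depth):
        k1, k2 = jax.random.split(k)
        scale = 1.0 / jnp.sqrt(width)
        blocks.append({
            "w1": jax.random.normal(k1, (width, width)) * scale,
            "b1": jnp.zeros(width),
            "w2": jax.random.normal(k2, (width, width)) * scale,
            "b2": jnp.zeros(width),
        })
    return blocks

def encoder(blocks, x):
    for p in blocks:
        x = residual_block(p, x)
    return x

# Parameter-count intuition: each block costs roughly 2 * width**2 weights,
# so adding blocks (depth) grows parameters linearly, while doubling the
# width quadruples the cost of every block (quadratic growth). For example:
#   width=256, 1 block    -> ~2 * 256**2 ≈ 131k params
#   width=256, 500 blocks -> ~65M params (linear in depth)
#   width=512, 500 blocks -> ~262M params (4x from doubling width)
x = jnp.ones((32, 256))
blocks = init_blocks(jax.random.PRNGKey(0), depth=8, width=256)
print(encoder(blocks, x).shape)  # (32, 256)
```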
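
Finally, the parallel-rollout sketch: a toy example of how JAX's vmap and lax.scan step many environments in lockstep on one accelerator, which is what makes collecting tens of millions of transitions cheap. The dynamics, policy, and shapes below are hypothetical placeholders, not Brax/MJX or the authors' training loop.

```python
import jax
import jax.numpy as jnp

def toy_step(state, action):
    # Hypothetical stand-in for a GPU-accelerated simulator step (e.g. Brax/MJX).
    return state + 0.1 * action

def toy_policy(params, state, key):
    # Hypothetical Gaussian policy: one linear layer plus exploration noise.
    mean = jnp.tanh(state @ params)
    return mean + 0.1 * jax.random.normal(key, mean.shape)

def rollout(params, init_states, key, horizon=64):
    """Step num_envs environments in parallel for `horizon` steps.

    init_states: (num_envs, obs_dim). One scanned, vmapped call yields
    num_envs * horizon transitions, so data collection scales with the GPU.
    """
    def step_fn(states, step_key):
        keys = jax.random.split(step_key, states.shape[0])
        actions = jax.vmap(toy_policy, in_axes=(None, 0, 0))(params, states, keys)
        next_states = jax.vmap(toy_step)(states, actions)
        return next_states, (states, actions, next_states)

    _, transitions = jax.lax.scan(step_fn, init_states, jax.random.split(key, horizon))
    return transitions  # each array: (horizon, num_envs, dim)

# 4096 parallel environments, 8-dimensional observations and actions.
params = jnp.zeros((8, 8))
states = jnp.zeros((4096, 8))
obs, acts, next_obs = rollout(params, states, jax.random.PRNGKey(0))
print(obs.shape, acts.shape)  # (64, 4096, 8) (64, 4096, 8)
```

With 4,096 environments and a 64-step horizon, a single call already yields about 262k transitions, so repeated rollouts reach the 15M+ regime quickly.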

RL1000 Team (Princeton)

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1

Chapters

00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience

00:01:11 Team Introductions and Princeton Research Origins

00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow

00:04:35 Self-Supervised RL: A Different Approach to Scaling

00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth

00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients

00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives

00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning

00:09:44 From TD Errors to Classification: Why This Objective Scales

00:11:06 Architecture Details: Building on BRO and SimBa

00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision

00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling

00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure

00:18:05 World Models and Next State Classification

00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning

00:22:37 Unlocking Batch Size Scaling Through Network Capacity

00:24:10 Compute Requirements: State-of-the-Art on a Single GPU

00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling
