Stanford CS25: Transformers United V6 I From Representation Learning to World Modeling

Stanford Online
Apr 22, 2026

Why It Matters

Causal JEPA equips AI systems with human‑like, object‑centric reasoning, accelerating reliable simulation and planning for robotics and autonomous technologies.

Key Takeaways

  • JEPA replaces pixel-level prediction with prediction in latent space, scored by embedding compatibility.
  • Causal JEPA uses object-centric slots to model interactions between objects.
  • The energy-based objective relies on contrastive or regularization techniques to prevent representation collapse.
  • Action-conditioned transformers enable planning in simulated environments for robotics.
  • Slot attention provides permutation-invariant object representations for learning world models.
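
As a rough illustration of the first and third takeaways, the sketch below computes a latent-space energy between a predicted embedding and a target embedding, and updates a target encoder as an exponential moving average (EMA) of the online encoder. It is a toy under stated assumptions, not the speakers' implementation: the linear encoders, the squared-distance energy, the decay value, and every name (`W_online`, `W_target`, `W_pred`, `latent_energy`, `ema_update`) are hypothetical, and no gradient step is shown.

```python
import numpy as np

def ema_update(target, online, decay=0.99):
    """Move target-encoder weights a small step toward the online encoder."""
    return decay * target + (1.0 - decay) * online

def latent_energy(pred, target):
    """Energy as mean squared distance between predicted and target embeddings."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
obs_dim, latent_dim = 32, 8

# Hypothetical linear "encoders"; real models would be deep networks.
W_online = rng.normal(size=(obs_dim, latent_dim)) * 0.1  # online (context) encoder
W_target = W_online.copy()                               # EMA (target) encoder
W_pred = np.eye(latent_dim)                              # predictor acting in latent space

x_context = rng.normal(size=(4, obs_dim))  # toy stand-in for current observations
x_future = rng.normal(size=(4, obs_dim))   # toy stand-in for future observations

z_context = x_context @ W_online           # embed the context
z_pred = z_context @ W_pred                # predict the future embedding
z_target = x_future @ W_target             # target embedding (treated as a constant)

# Low energy means the prediction is compatible with the target; no pixels are reconstructed.
energy = latent_energy(z_pred, z_target)

# After an (omitted) optimizer step on the online encoder, the target follows slowly.
W_target = ema_update(W_target, W_online)
```

The slowly moving target encoder is one of the regularization tricks the summary mentions; on its own, minimizing this energy with two freely trained encoders could be satisfied by mapping everything to the same point.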

Summary

The Stanford CS25 lecture introduced the Joint Embedding Predictive Architecture (JEPA) and its causal extension as a new paradigm for world modeling, one that predicts in latent space rather than reconstructing raw pixels. The speakers explained that JEPA scores a predicted future with an energy-based compatibility function instead of a generative likelihood, and that it relies on contrastive or regularization techniques to prevent representation collapse.

Key technical points included the shift from auto-regressive pixel decoders to latent-space encoders, the use of exponential moving average (EMA) encoders, and the integration of object-centric slot attention to capture inter-object dynamics. By masking object slots and conditioning on actions, the model learns causal relationships that generalize to unseen scenarios, as demonstrated on datasets such as Push-T, CLEVR-VQA, and multi-object physics simulations. The lecture also covered practical details: bidirectional transformers for masked token prediction, positional encodings for slot identity, and action-conditioned predictors for planning.

Examples such as autonomous-driving models (Gaia) and high-fidelity video generators (Sora) illustrated real-world relevance, while the monkey-banana thought experiment underscored the model's ability to infer hidden causes. Overall, the approach promises more robust, sample-efficient simulators for robotics and AI agents, letting them reason about object interactions and plan actions without exhaustive pixel-level supervision.
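
To make the idea of planning with an action-conditioned predictor more concrete, here is a minimal sketch that rolls candidate action sequences forward in latent space and keeps the one whose final latent lands closest to a goal latent. Everything in it is an assumption for illustration: the summary does not say which planner the speakers used, so random shooting is swapped in as a simple stand-in, and the linear predictor (`A`, `B`) and the names `predict_next` and `plan_random_shooting` are hypothetical.

```python
import numpy as np

def predict_next(z, a, A, B):
    """Toy action-conditioned predictor: next latent from current latent and action."""
    return z @ A + a @ B

def plan_random_shooting(z0, z_goal, A, B, horizon=5, n_candidates=256, seed=0):
    """Sample action sequences, roll them out in latent space, keep the lowest-cost one."""
    rng = np.random.default_rng(seed)
    action_dim = B.shape[0]
    # Candidate plans: (n_candidates, horizon, action_dim), actions in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    for t in range(horizon):
        z = predict_next(z, candidates[:, t, :], A, B)
    # Cost is the squared latent distance to the goal; no pixels are decoded.
    costs = np.sum((z - z_goal) ** 2, axis=-1)
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

latent_dim, action_dim = 4, 2
A = np.eye(latent_dim)                    # latent persists when no action is taken
B = np.zeros((action_dim, latent_dim))    # each action pushes one latent coordinate
B[0, 0] = 1.0
B[1, 1] = 1.0

z0 = np.zeros(latent_dim)                     # latent of the current state
z_goal = np.array([1.0, -1.0, 0.0, 0.0])      # latent of the goal state
plan, cost = plan_random_shooting(z0, z_goal, A, B)
do_nothing_cost = float(np.sum((z0 - z_goal) ** 2))  # cost if the agent never acts
```

In a real system the predictor would be the learned transformer and the latents would come from the trained encoder; the point of the sketch is only that the search and the cost both live in latent space.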

Original Description

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
April 9, 2026
This seminar covers:
• How world models are increasingly moving away from reconstruction and toward prediction in latent space
• Two recent JEPA-based approaches that illustrate this shift from complementary angles
Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/
Guest Speakers: Hazel Nam & Lucas Maes (Brown University)
Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology; Director, Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
