Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR

May 27, 2026

Stanford Online

Why It Matters

GRPO simplifies reinforcement learning on verifiable tasks by dropping the value network, which lowers compute cost and makes it easier to train reliable, math-capable language models.

Key Takeaways

•RLVR targets verifiable tasks like math using reinforcement learning.
•Overoptimization limits RLHF due to reward model overfitting.
•PPO implementations are fragile; advantage estimation and KL clipping cause instability.
•GRPO removes the value network and uses z-scored rewards within each group of sampled responses as advantages, which simplifies training.
•Open-source releases (DeepSeek Math) showcase GRPO as a practical alternative to PPO.
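
The z-score advantage mentioned in the GRPO takeaway can be sketched in a few lines. This is a minimal illustration, not the DeepSeek Math implementation: the function name `group_advantages`, the `eps` stabilizer, the use of population (rather than sample) standard deviation, and the example rewards are all assumptions made for the sketch.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Z-score each reward against the other responses sampled for the same prompt.

    No value network is involved: the group mean acts as the baseline.
    `eps` (an assumption of this sketch) avoids division by zero when
    every response in the group receives the same reward.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt,
# scored 1.0 by a verifier if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# A group where every answer is correct carries no learning signal.
uniform = group_advantages([1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages of equal magnitude, while a group with identical rewards yields all-zero advantages, so such prompts contribute no gradient.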

Summary

The lecture introduces Reinforcement Learning from Verifiable Rewards (RLVR) as the next frontier beyond instruction tuning and RLHF, focusing on tasks such as mathematics and code where outcomes can be objectively verified. It highlights recent OpenAI announcements that a thinking model solved a longstanding open math problem, underscoring the relevance of RLVR for high-stakes reasoning.

Key insights include the over-optimization problem that limits RLHF: reward models quickly overfit when fed ever more preference data, which creates a bottleneck. The professor revisits Proximal Policy Optimization (PPO), noting its theoretical appeal but practical fragility: advantage estimation, KL clipping, and value-function baselines introduce high variance and demand delicate engineering. Two concrete examples illustrate these challenges. A student's PPO implementation broke when KL penalties were clipped at zero, and many open-source baselines misuse generalized advantage estimation (GAE), effectively reducing PPO to a bandit problem.

The DeepSeek Math paper's GRPO algorithm is presented as a streamlined alternative that discards the costly value network and computes advantages by z-score normalizing rewards within each group of responses sampled for the same prompt. The implication is that, for verifiable domains, GRPO offers a more stable, compute-efficient path to high-performing models, potentially reshaping research pipelines and reducing reliance on brittle PPO implementations.
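
The summary does not say exactly how clipping KL penalties at zero broke the student's PPO implementation. One plausible mechanism, shown here as a sketch under stated assumptions, is that the common per-token estimate log π(x) − log π_ref(x) can be negative for individual tokens even though its expectation (the true KL) is non-negative, so clamping each term at zero inflates the penalty. The two three-token distributions below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical next-token distributions over a 3-token vocabulary.
policy = [0.7, 0.2, 0.1]     # current policy pi
reference = [0.5, 0.3, 0.2]  # frozen reference model pi_ref

# Per-token KL estimate for a token x sampled from the policy:
# k1(x) = log pi(x) - log pi_ref(x). It is negative wherever pi(x) < pi_ref(x).
k1 = [math.log(p) - math.log(q) for p, q in zip(policy, reference)]

# Exact expectations under the policy, so there is no sampling noise.
true_kl = sum(p * k for p, k in zip(policy, k1))
clipped_kl = sum(p * max(k, 0.0) for p, k in zip(policy, k1))

print(f"true KL:                {true_kl:.4f}")
print(f"KL with terms clamped:  {clipped_kl:.4f}")
```

In this example the clamped estimate is several times larger than the true KL (roughly 0.236 versus 0.085), so a penalty built from it would pull the policy toward the reference model much harder than intended.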

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch

To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/

Percy Liang

Professor of Computer Science (and courtesy in Statistics)

Tatsunori Hashimoto

Assistant Professor of Computer Science

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
