Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR
Why It Matters
GRPO simplifies reinforcement learning for verifiable tasks, lowering compute costs and accelerating the deployment of reliable, math‑capable language models.
Key Takeaways
- RLVR targets verifiable tasks like math using reinforcement learning.
- Overoptimization limits RLHF due to reward model overfitting.
- PPO implementation is fragile; advantage estimation and KL clipping cause instability.
- GRPO removes the value network, using a z-score advantage to simplify training.
- Open-source releases (DeepSeek Math) showcase GRPO as a practical alternative to PPO.
Summary
The lecture introduces Reinforcement Learning from Verifiable Rewards (RLVR) as the next frontier beyond instruction tuning and RLHF, focusing on tasks such as mathematics and code where outcomes can be objectively verified. It highlights recent OpenAI announcements that a thinking model solved a longstanding open math problem, underscoring the relevance of RLVR for high-stakes reasoning.
Key insights include the over-optimization problem that plagues RLHF: reward models quickly overfit when fed endless preference data, creating a bottleneck. The professor revisits Proximal Policy Optimization (PPO), noting its theoretical appeal but practical fragility: advantage estimation, KL clipping, and value-function baselines introduce high variance and demand delicate engineering. Concrete examples illustrate these challenges. A student's PPO implementation broke when KL penalties were clipped at zero, and many open-source baselines misuse generalized advantage estimators, effectively reducing PPO to a bandit problem.
The DeepSeek Math paper's GRPO algorithm is presented as a streamlined alternative that discards the costly value network and computes advantages via z-score normalization. The implication is that, for verifiable domains, GRPO offers a more stable, compute-efficient path to high-performing models, potentially reshaping research pipelines and reducing reliance on brittle PPO implementations.
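As a rough illustration of the z-score advantage described above, the following is a minimal sketch rather than the DeepSeek Math implementation. It assumes a binary reward from a toy answer checker and a group of sampled completions for a single prompt. The names `verify_answer` and `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def verify_answer(completion: str, reference: str) -> float:
    """Hypothetical verifier: reward 1.0 if the last token equals the reference answer."""
    tokens = completion.strip().split()
    return 1.0 if tokens and tokens[-1] == reference else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each reward against the other samples drawn for the same prompt.

    No learned value network is involved: the group mean acts as the baseline.
    eps (an assumed constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt whose reference answer is "12".
completions = ["so the answer is 12", "I get 15", "therefore 12", "12"]
rewards = [verify_answer(c, "12") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, advantage in zip(completions, advantages):
    print(f"{advantage:+.3f}  {completion}")
```

In this sketch, correct completions receive a positive advantage and the incorrect one a negative advantage. If every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.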