Stanford CS221 | Autumn 2025 | Lecture 9: Policy Gradient

Stanford Online
Mar 9, 2026

Why It Matters

Understanding policy‑gradient and off‑policy techniques equips AI engineers to design scalable RL solutions that learn directly from raw observations, accelerating deployment in robotics, language models, and other high‑dimensional domains.

Key Takeaways

  • Policy‑gradient methods learn policies directly, bypassing value tables.
  • The on‑policy vs. off‑policy distinction drives exploration‑exploitation trade‑offs in reinforcement learning.
  • Bootstrapping accelerates learning by using estimated future rewards.
  • Function approximation replaces tabular Q‑tables for high‑dimensional states.
  • Off‑policy Q‑learning can learn optimal policy from arbitrary data.

Summary

The lecture revisits reinforcement learning fundamentals before shifting focus to policy‑based approaches that learn the policy itself rather than a value function. After reviewing Markov decision processes, Q‑learning, SARSA, and the role of exploration policies, the instructor frames the discussion around four key dimensions: model‑based vs. model‑free, on‑policy vs. off‑policy, bootstrapping vs. full‑rollout, and tabular vs. function‑approximation methods.
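The on‑policy/off‑policy and bootstrapping dimensions can be made concrete by comparing the tabular update rules for SARSA and Q‑learning. The sketch below is illustrative, not the lecture's own code; the hyperparameters and the `epsilon_greedy` exploration policy are assumptions chosen for clarity.

```python
import random
from collections import defaultdict

# Illustrative hyperparameters (assumed, not from the lecture).
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def epsilon_greedy(Q, state, actions):
    """Exploration policy: random action with prob. EPSILON, else greedy."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy: the target uses the max over next actions,
    i.e. the optimal policy, regardless of how `a` was chosen."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: the target uses the action the behavior policy
    actually takes next, so it estimates the executed policy's value."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

The only difference between the two updates is the target: `max` over next actions (Q‑learning) versus the sampled next action (SARSA), which is exactly what makes one off‑policy and the other on‑policy.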

Key insights include the distinction between on‑policy algorithms like SARSA, which estimate the value of the policy actually being executed, and off‑policy methods such as Q‑learning, which target the optimal policy regardless of the exploration strategy. Bootstrapping is highlighted as a technique that substitutes immediate reward plus the estimated value of the next state for a full rollout, dramatically speeding convergence. The lecture also stresses that tabular Q‑tables become infeasible in high‑dimensional spaces, prompting the move to function approximation—parameterizing Q(s,a) with neural networks or linear features.
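Combining bootstrapping with function approximation gives the semi‑gradient Q‑learning update, where Q(s,a) = w · φ(s,a) for a weight vector w. The sketch below assumes a made‑up feature map `phi`; in practice φ would encode domain features or be replaced by a neural network.

```python
import numpy as np

# Illustrative constants (assumed, not from the lecture).
GAMMA, ALPHA = 0.95, 0.01

def phi(state, action, dim=4):
    """Hypothetical feature vector for a (state, action) pair;
    a stand-in for real domain features."""
    rng = np.random.default_rng(hash((state, action)) % (2**32))
    return rng.standard_normal(dim)

def q_value(w, state, action):
    """Linear function approximation: Q(s, a) = w . phi(s, a)."""
    return w @ phi(state, action)

def semi_gradient_q_update(w, s, a, r, s_next, actions):
    """One Q-learning step: bootstrap the target from the current
    estimate at s_next, then move w along the TD error."""
    target = r + GAMMA * max(q_value(w, s_next, a2) for a2 in actions)
    td_error = target - q_value(w, s, a)
    return w + ALPHA * td_error * phi(s, a)
```

The same update structure carries over when φ is replaced by a neural network, which is why this formulation scales to the high‑dimensional settings the lecture highlights.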

Notable quotes underscore the conceptual shift: “The target represents what we expect the utility to be after incorporating feedback,” and the instructor emphasizes that “off‑policy algorithms let you explore aggressively while still estimating the optimal policy.” Examples range from a simple tram‑walking MDP to real‑world scenarios like robotic vision and language model token generation, illustrating the scalability challenges.

Implications for practitioners are clear: mastering the trade‑offs between on‑ and off‑policy learning, choosing appropriate bootstrapping strategies, and adopting function approximation are essential steps toward building state‑of‑the‑art reinforcement‑learning systems capable of handling complex, high‑dimensional environments.

Original Description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
Please follow along with the course schedule: https://stanford-cs221.github.io/autumn2025/
Teaching Team
Percy Liang, Associate Professor of Computer Science (and, by courtesy, of Statistics)
