Stanford CS221 | Autumn 2025 | Lecture 8: Reinforcement Learning

Stanford Online · Mar 9, 2026

Why It Matters

Understanding RL fundamentals equips practitioners to build systems that learn optimal strategies without explicit models, a competitive edge for AI‑powered decision‑making in dynamic markets.

Key Takeaways

  • MDPs consist of states, actions, probabilities, rewards, and discount factor
  • Policy evaluation computes expected utility of a fixed policy via recursion
  • Value iteration adds a max operator to find optimal policy values
  • Reinforcement learning tackles MDPs when transition model is unknown
  • An RL agent updates its policy using feedback from environment interactions
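The policy-evaluation and value-iteration takeaways above can be sketched in code. Below is a minimal value-iteration sketch on the flaky tram MDP. The rewards (–1 for walking, –2 for the tram) and the 40% tram failure chance come from the lecture; the choice of N = 10 blocks, the "tram doubles your position" dynamics, and a discount factor of 1 are assumptions based on the standard CS221 transportation example, not the lecture's verbatim code.

```python
# Value iteration on the flaky tram MDP (a sketch, with assumed dynamics:
# walk moves s -> s+1 for reward -1; tram moves s -> 2s with prob 0.6,
# else stays at s, for reward -2 either way; discount factor = 1).

N = 10  # hypothetical number of blocks; state N is the terminal goal


def actions(s):
    """Actions available in state s (none at or past the goal)."""
    acts = []
    if s + 1 <= N:
        acts.append("walk")
    if 2 * s <= N:
        acts.append("tram")
    return acts


def transitions(s, a):
    """Return (next_state, probability, reward) triples for (s, a)."""
    if a == "walk":
        return [(s + 1, 1.0, -1)]
    # tram: succeeds with probability 0.6, otherwise stays put
    return [(2 * s, 0.6, -2), (s, 0.4, -2)]


def value_iteration(eps=1e-6):
    """Iterate V(s) = max_a sum_{s'} P(s'|s,a) [R + V(s')] to convergence."""
    V = {s: 0.0 for s in range(1, N + 1)}
    while True:
        newV = {}
        for s in range(1, N + 1):
            if s == N or not actions(s):
                newV[s] = 0.0  # terminal state has zero future utility
                continue
            newV[s] = max(
                sum(p * (r + V[sp]) for sp, p, r in transitions(s, a))
                for a in actions(s)
            )
        if max(abs(newV[s] - V[s]) for s in V) < eps:
            return newV
        V = newV


V = value_iteration()

# Extract the optimal policy: pick the action maximizing expected utility.
policy = {
    s: max(actions(s),
           key=lambda a: sum(p * (r + V[sp]) for sp, p, r in transitions(s, a)))
    for s in range(1, N)
}
```

With these assumed dynamics, states near the goal prefer walking (the tram would overshoot), while state 5 prefers the tram: even with the 40% failure loop, one expected-cost-of-roughly-3.33 tram ride beats five blocks of walking.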

Summary

The lecture revisits Markov Decision Processes (MDPs) before launching into reinforcement learning (RL). It outlines the core components of an MDP (states, actions, transition probabilities, rewards, and discount factor) using the illustrative "flaky tram" example, and clarifies how a policy maps states to actions while a rollout evaluates its performance.

Key insights include the policy-evaluation recurrence that computes expected utility for a given policy, and value iteration, which inserts a max operator to derive optimal state values and extract the optimal policy. The instructor contrasts static policies with dynamic RL agents, showing how agents receive feedback (state, action, reward, next state) and can adjust their internal policy over time. Code snippets demonstrate simulation of rollouts and the distinction between a fixed policy and an agent that can learn.

Notable examples feature the flaky tram MDP, where walking costs –1 and taking the tram costs –2 with a 40% failure chance, and a simple static agent that merely follows a preset policy. The professor emphasizes the reward-vs-cost perspective, explains sparse versus dense reward structures, and visualizes the agent-environment loop as a sequence of actions and observations.

The discussion sets the stage for model-based and model-free RL algorithms (Monte Carlo, SARSA, and Q-learning), highlighting that RL enables optimal decision making even when the underlying MDP is unknown. This foundation is crucial for developing autonomous systems that learn from interaction, a capability increasingly demanded in business automation and AI-driven products.
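The agent-environment loop and the static agent described in the summary can be sketched as follows. The environment samples transitions from the flaky-tram dynamics; the agent picks an action, receives (state, action, reward, next state), and, if it were a learning agent, would update its policy in `incorporate_feedback`. The class and function names (`StaticAgent`, `rollout`, `step`) and the exact tram dynamics are illustrative assumptions, not the lecture's verbatim code.

```python
# A sketch of the agent-environment loop: a static agent follows a preset
# policy and ignores feedback. Assumed dynamics: walk moves s -> s+1 for
# reward -1; tram moves s -> 2s with prob 0.6, else stays, for reward -2.
import random

N = 10  # goal state (hypothetical number of blocks)


def step(s, a, rng):
    """Environment: sample (reward, next_state) for action a in state s."""
    if a == "walk":
        return -1, s + 1
    # tram: 40% chance of failing and staying put; costs -2 either way
    return (-2, 2 * s) if rng.random() < 0.6 else (-2, s)


class StaticAgent:
    """Follows a fixed policy; incorporate_feedback is a no-op."""

    def __init__(self, policy):
        self.policy = policy

    def get_action(self, s):
        return self.policy[s]

    def incorporate_feedback(self, s, a, r, sp):
        pass  # a learning agent (e.g. Q-learning) would update here


def rollout(agent, rng, max_steps=100):
    """Run one episode from state 1; return total (undiscounted) utility."""
    s, utility = 1, 0
    for _ in range(max_steps):
        if s == N:
            break
        a = agent.get_action(s)
        r, sp = step(s, a, rng)
        agent.incorporate_feedback(s, a, r, sp)
        utility += r
        s = sp
    return utility


rng = random.Random(0)
always_walk = StaticAgent({s: "walk" for s in range(1, N)})
print(rollout(always_walk, rng))  # walking 1 -> 10 costs -9
```

Swapping `StaticAgent` for a class with a non-trivial `incorporate_feedback` is exactly the step from a fixed policy to an RL agent: the loop's interface (state, action, reward, next state) stays the same.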

Original Description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
Please follow along with the course schedule: https://stanford-cs221.github.io/autumn2025/
Teaching Team
Percy Liang, Associate Professor of Computer Science (and courtesy in Statistics)
