Stanford CS221 | Autumn 2025 | Lecture 7: Markov Decision Processes

Stanford Online
Mar 9, 2026

Why It Matters

Understanding MDPs equips businesses and AI practitioners to model uncertainty, design optimal policies, and leverage reinforcement learning for smarter, data‑driven decisions.

Key Takeaways

  • MDPs extend search by modeling stochastic action outcomes.
  • Rewards replace costs: a cost of c becomes a reward of −c in MDPs.
  • Policies map states to actions, unlike deterministic action sequences.
  • Transition probabilities must sum to one for each state-action pair.
  • Rollouts simulate policies to evaluate expected performance in MDPs.
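To make the transition-probability constraint concrete, here is a minimal sketch of the flaky-tram transitions in Python. The 0.4 failure probability and the −1/−2 rewards come from the lecture summary; the state encoding (position as an integer, failure leaving you in place) is an illustrative assumption.

```python
# Flaky-tram transitions: from state s, "walk" moves to s+1 deterministically;
# "tram" moves to 2*s with prob 0.6 (success) and stays at s with prob 0.4
# (failure). State encoding and the stay-in-place failure outcome are
# illustrative assumptions, not taken from the lecture.

def transitions(state, action):
    """Return a list of (next_state, probability, reward) triples."""
    if action == "walk":
        return [(state + 1, 1.0, -1)]   # reward -1 = negative walking cost
    elif action == "tram":
        return [(2 * state, 0.6, -2),   # tram succeeds
                (state, 0.4, -2)]       # tram fails; reward is -2 either way
    raise ValueError(f"unknown action: {action}")

# Sanity check: probabilities must sum to one for each state-action pair.
for action in ("walk", "tram"):
    assert abs(sum(p for _, p, _ in transitions(3, action)) - 1.0) < 1e-9
```

Note that a single action now has multiple successors, each weighted by a probability; this is exactly what separates an MDP transition function from a deterministic successor function.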

Summary

The lecture introduces Markov Decision Processes (MDPs) as the stochastic extension of deterministic search problems, positioning them as the foundation for reinforcement learning. After reviewing search’s core ingredients (start state, successor function, costs, and end criterion), the professor highlights that real‑world decisions often involve uncertainty, which MDPs capture through probabilistic outcomes. The acronym breaks down formally as Markov (the state captures all relevant history), Decision (the agent chooses actions), and Process (the system evolves sequentially).

The flaky‑tram example illustrates how a 0.4 failure probability creates multiple successors for a single action, converting deterministic costs into expected rewards: walking yields a reward of –1, while taking the tram yields –2 whether it succeeds (probability 0.6) or fails (probability 0.4). Transition functions must assign probabilities that sum to one for each state‑action pair, and rewards are expressed as negative costs for consistency. A second example is a dice‑game MDP in which quitting yields $10 and staying yields $4 with a chance that the game continues.

The professor defines a policy as a mapping from states to actions, contrasting it with a fixed action sequence, and demonstrates rollouts (simulated executions of a policy) to assess expected returns. The discussion emphasizes that optimal solutions to MDPs are policies, not single paths. The implications are clear: MDPs provide a rigorous framework for modeling uncertainty in logistics, finance, and AI, enabling the design of policies that maximize expected reward. Mastery of transition probabilities, reward structures, and rollout evaluation prepares students to apply reinforcement‑learning techniques to real‑world decision problems.
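The rollout idea from the lecture can be sketched against the dice‑game MDP. The $10 quit reward and $4 stay reward come from the summary; the 2/3 probability of the game continuing after a stay is an assumed parameter for illustration, since the summary does not state the exact value.

```python
import random

def rollout(policy, continue_prob=2/3, seed=None):
    """Simulate one episode of the dice game and return the total reward.
    continue_prob is an assumed parameter (not given in the summary)."""
    rng = random.Random(seed)
    total, state = 0, "in"
    while state == "in":
        if policy(state) == "quit":
            total += 10          # quitting pays $10 and ends the game
            state = "end"
        else:
            total += 4           # staying pays $4 ...
            if rng.random() >= continue_prob:
                state = "end"    # ... but the game may end anyway
    return total

# Estimate the expected return of the always-stay policy by averaging many
# simulated episodes -- exactly what a rollout evaluation does.
stay = lambda s: "stay"
estimate = sum(rollout(stay, seed=i) for i in range(10000)) / 10000
print(f"estimated expected return of 'stay': ${estimate:.2f}")
```

Under these assumed parameters the analytic answer is $4 per stay times an expected 3 stays, i.e. $12, which beats the $10 from quitting immediately; averaging rollouts recovers this without doing the algebra, which is the point of simulation-based policy evaluation.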

Original Description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
Please follow along with the course schedule: https://stanford-cs221.github.io/autumn2025/
Teaching Team
Percy Liang, Associate Professor of Computer Science (and courtesy in Statistics)
