Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

Stanford Online | May 28, 2026

Why It Matters

Accurate, comparable evaluation lets firms benchmark generative models, prioritize improvements, and avoid costly releases that look good but miss user intent.

Key Takeaways

  • Human evaluation remains primary for text-to-image quality assessment.
  • Aesthetics and prompt adherence are the two core evaluation dimensions.
  • Binary, scalar, and pairwise comparisons each trade off noise and effort.
  • Win‑rate metrics must weight opponent strength to reflect true performance.
  • Elo‑style rating updates can quantify surprise and model improvements.

Summary

Lecture 7 of Stanford’s CME‑296 course turns to evaluating text‑to‑image generators, arguing that you can’t improve what you can’t measure.

The lecturer breaks evaluation into two primary axes, visual aesthetics and prompt adherence, and walks through three human-rating schemes: a 1-to-5 Likert scale, a binary good/bad label, and pairwise preference tests. He notes that finer scales capture nuance but introduce inter-rater noise; binary judgments are simpler but lack absolute reference points; and pairwise comparisons reduce variance by asking judges only to choose the better of two images.
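As a rough illustration of how pairwise preference judgments might be aggregated, the sketch below computes a raw win rate per model. The function name `win_rates`, the `(winner, loser)` tuple format, and the model names are assumptions made for this example; they do not come from the lecture.

```python
from collections import Counter


def win_rates(comparisons):
    """Return the fraction of pairwise comparisons each model won.

    comparisons: iterable of (winner, loser) model-name pairs,
    one entry per human judgment (hypothetical data format).
    """
    wins = Counter()
    games = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Every model that appeared in at least one comparison gets a rate.
    return {model: wins[model] / games[model] for model in games}


# Made-up judgments for three hypothetical models.
votes = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "B")]
rates = win_rates(votes)
print(rates)  # A: 0.75, B: 0.25, C: 0.5
```

This raw rate treats every opponent as equally strong, which is the limitation that motivates rating-based alternatives.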

Using a teddy-bear-reading-a-book example, he illustrates how an image can be aesthetically pleasing yet fail prompt adherence. He then introduces a win-rate metric and shows why raw win percentages are misleading: they ignore how strong each opponent was. As a remedy he proposes an Elo-style expected score, E = 1 / (1 + 10^((R_opponent - R_self) / 400)), so that a victory over a stronger opponent counts for more than one over a weaker opponent.
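The expected-score formula can be sketched in a few lines. The update rule below is the standard Elo update, R_new = R + K * (S - E); the summary does not say which update or K-factor the lecture uses, so `elo_update` and `k=32` are assumptions chosen for illustration.

```python
def expected_score(r_self, r_opponent):
    """Elo expected score of `self` against `opponent` (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_self) / 400))


def elo_update(r_self, r_opponent, outcome, k=32):
    """Return the new rating after one comparison.

    outcome: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    k: step size (assumed value; not taken from the lecture).
    """
    surprise = outcome - expected_score(r_self, r_opponent)
    return r_self + k * surprise


# Evenly matched models: a win is only mildly surprising.
print(elo_update(1500, 1500, 1.0))  # 1516.0

# A 1400-rated model beating an 1800-rated one is a large surprise,
# so it gains more rating points than in the evenly matched case.
underdog_gain = elo_update(1400, 1800, 1.0) - 1400
favorite_gain = elo_update(1800, 1400, 1.0) - 1800
```

The term `outcome - expected_score(...)` is the "surprise": it is close to zero when the favorite wins and close to one when the underdog wins, which is how wins against stronger opponents end up weighted more heavily.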

Adopting these calibrated metrics enables more reliable leaderboards, faster model iteration, and clearer ROI for companies deploying generative AI in advertising, design, or content creation, where both visual quality and faithful prompt execution are business‑critical.

Original Description

To follow along with the course schedule and syllabus, visit: https://cme296.stanford.edu/syllabus/
Chapters:
00:00:00 Introduction
00:05:19 Motivation
00:10:48 Human ratings
00:19:43 Elo rating system
00:26:37 Reference-free metrics
00:29:15 Fréchet inception distance (FID)
00:42:30 CLIPScore
00:44:51 PickScore
00:45:41 Reference-based metrics
00:48:07 Mean squared error (MSE)
00:49:36 Peak signal-to-noise ratio (PSNR)
00:51:54 Structural similarity (SSIM)
01:01:09 Perceptual similarity (LPIPS)
01:05:03 Multimodal LLMs
01:13:10 Faithfulness evaluation (TIFA)
01:17:29 Visual question answering score (VQA)
01:24:40 MLLM-as-a-Judge
01:34:17 Benchmarks
For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
Afshine Amidi is an Adjunct Lecturer at Stanford University.
Shervine Amidi is an Adjunct Lecturer at Stanford University.
