
Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation

•May 28, 2026

Stanford Online


Why It Matters

Accurate, comparable evaluation lets firms benchmark generative models, prioritize improvements, and avoid costly releases that look good but miss user intent.

Key Takeaways

•Human evaluation remains primary for text-to-image quality assessment.
•Aesthetics and prompt adherence are the two core evaluation dimensions.
•Binary, scalar, and pairwise comparisons each trade off noise and effort.
•Win‑rate metrics must weight opponent strength to reflect true performance.
•Elo‑style rating updates can quantify surprise and model improvements.

Summary

Lecture 7 of Stanford’s CME‑296 course turns to evaluating text‑to‑image generators, arguing that you can’t improve what you can’t measure.

The lecturer breaks evaluation into two primary axes, visual aesthetics and prompt adherence, and walks through three human-rating schemes: a 1-to-5 Likert scale, a binary good/bad label, and pairwise preference tests. Finer scales capture nuance but introduce inter-rater noise. Binary judgments are simpler but lack absolute reference points. Pairwise comparisons reduce variance by letting judges choose the better of two images.
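As a rough illustration of how the three rating schemes could be turned into numbers, the sketch below aggregates each kind of judgment. The function names and the sample ratings are hypothetical and are not taken from the lecture; it assumes every rater judged the same image or model.

```python
def likert_mean(scores: list[int]) -> float:
    """Average of 1-to-5 ratings collected for one image or model."""
    return sum(scores) / len(scores)


def binary_rate(labels: list[bool]) -> float:
    """Fraction of raters who labelled the output 'good'."""
    return sum(labels) / len(labels)


def pairwise_preference_rate(winners: list[str], model: str) -> float:
    """Fraction of head-to-head comparisons in which `model` was preferred.

    `winners` holds the name of the preferred model for each comparison,
    and every comparison is assumed to involve `model`.
    """
    return winners.count(model) / len(winners)


if __name__ == "__main__":
    # Hypothetical ratings, made up for this example.
    print(likert_mean([4, 5, 3, 4]))
    print(binary_rate([True, True, False, True]))
    print(pairwise_preference_rate(["A", "B", "A", "A", "B"], "A"))
```

Each aggregate compresses many judgments into one number, so in practice it would normally be reported together with the number of raters and some measure of their disagreement.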

Using a teddy-bear-reading-a-book example, he illustrates how an image can be aesthetically pleasing yet fail prompt adherence. He then introduces a win-rate metric and shows why raw win percentages are misleading: they ignore how strong each opponent was. As a remedy he proposes an Elo-style expected score, $E_{\text{self}} = 1 / (1 + 10^{(R_{\text{opponent}} - R_{\text{self}})/400})$, so that a win over a stronger opponent, which the formula rates as less likely, counts for more than a win over a weaker one.
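The expected-score formula can be sketched in a few lines of Python. The update rule shown here is the standard Elo update, and the step size `k = 32` is an assumed value; the summary does not say which constants the lecture uses, and the example ratings are made up.

```python
def expected_score(r_self: float, r_opponent: float) -> float:
    """Elo expected score of `self` against `opponent`, between 0 and 1."""
    return 1.0 / (1.0 + 10.0 ** ((r_opponent - r_self) / 400.0))


def update_rating(r_self: float, r_opponent: float, outcome: float,
                  k: float = 32.0) -> float:
    """Return the new rating of `self` after one pairwise comparison.

    outcome: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    k: step size (assumed value, not taken from the lecture).
    The change is proportional to the "surprise": actual minus expected score.
    """
    return r_self + k * (outcome - expected_score(r_self, r_opponent))


if __name__ == "__main__":
    weak, strong = 1000.0, 1200.0  # hypothetical model ratings
    print(round(expected_score(weak, strong), 3))    # 0.24
    print(round(update_rating(weak, strong, 1.0), 1))  # 1024.3 (upset win)
    print(round(update_rating(strong, weak, 1.0), 1))  # 1207.7 (expected win)
```

With these numbers, the lower-rated model gains about 24 points for an upset win while the higher-rated model gains under 8 for an expected one, which is the sense in which wins are weighted by opponent strength.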

Adopting these calibrated metrics enables more reliable leaderboards, faster model iteration, and clearer ROI for companies deploying generative AI in advertising, design, or content creation, where both visual quality and faithful prompt execution are business‑critical.

Original Description

Learn more details about this course: https://online.stanford.edu/courses/cme296-diffusion-and-large-vision-models

To follow along with the course schedule and syllabus, visit: https://cme296.stanford.edu/syllabus/

Chapters:

00:00:00 Introduction

00:05:19 Motivation

00:10:48 Human ratings

00:19:43 Elo rating system

00:26:37 Reference-free metrics

00:29:15 Fréchet inception distance (FID)

00:42:30 CLIPScore

00:44:51 PickScore

00:45:41 Reference-based metrics

00:48:07 Mean squared error (MSE)

00:49:36 Peak signal-to-noise ratio (PSNR)

00:51:54 Structural similarity (SSIM)

01:01:09 Perceptual similarity (LPIPS)

01:05:03 Multimodal LLMs

01:13:10 Faithfulness evaluation (TIFA)

01:17:29 Visual question answering score (VQA)

01:24:40 MLLM-as-a-Judge

01:34:17 Benchmarks

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education

Afshine Amidi is an Adjunct Lecturer at Stanford University.

Shervine Amidi is an Adjunct Lecturer at Stanford University.

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNdy8rt2rZ4T2xM0OjADnfu
