Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation
Why It Matters
Accurate, comparable evaluation lets firms benchmark generative models, prioritize improvements, and avoid costly releases that look good but miss user intent.
Key Takeaways
- Human evaluation remains primary for text-to-image quality assessment.
- Aesthetics and prompt adherence are the two core evaluation dimensions.
- Binary, scalar, and pairwise comparisons each trade off noise and effort.
- Win‑rate metrics must weight opponent strength to reflect true performance.
- Elo‑style rating updates can quantify surprise and model improvements.
Summary
Lecture 7 of Stanford’s CME‑296 course turns to evaluating text‑to‑image generators, arguing that you can’t improve what you can’t measure.
The professor breaks evaluation into two primary axes—visual aesthetics and prompt adherence—and walks through three human‑rating schemes: a 1‑to‑5 Likert scale, a binary good/bad label, and pairwise preference tests. He notes that finer scales capture nuance but introduce inter‑rater noise, while binary judgments are simpler but lack absolute reference points, and pairwise comparisons reduce variance by letting judges choose the better of two images.
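As a rough illustration of how the three rating schemes reduce to numbers, the sketch below aggregates a handful of made-up judgments. The ratings, the model labels "A" and "B", and the function names are all hypothetical and do not come from the lecture; the spread of the Likert scores is used here only as a simple stand-in for inter-rater noise.

```python
from statistics import mean, stdev

# Hypothetical ratings from five judges for one generated image (illustrative only).
likert_scores = [4, 5, 3, 4, 2]                  # 1-to-5 scale
binary_labels = [True, True, False, True, True]  # good / bad
pairwise_votes = ["A", "A", "B", "A", "B"]       # preferred model in an A-vs-B test


def likert_summary(scores):
    """Mean score plus sample standard deviation (a crude noise indicator)."""
    return mean(scores), stdev(scores)


def binary_rate(labels):
    """Fraction of judges who labeled the image 'good'."""
    return sum(labels) / len(labels)


def pairwise_win_rate(votes, model):
    """Fraction of head-to-head comparisons won by `model`."""
    return votes.count(model) / len(votes)


if __name__ == "__main__":
    avg, spread = likert_summary(likert_scores)
    print(f"Likert mean={avg:.2f}, std={spread:.2f}")
    print(f"Binary good-rate={binary_rate(binary_labels):.2f}")
    print(f"Pairwise win rate of A={pairwise_win_rate(pairwise_votes, 'A'):.2f}")
```

The pairwise number is only meaningful relative to the specific opponent it was measured against, which is the limitation the win-rate discussion picks up next.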
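Using a teddy‑bear‑reading‑a‑book example, he illustrates how an image can be aesthetically pleasing yet fail prompt adherence. He then introduces a win‑rate metric and shows why raw win percentages are misleading: they treat a win over a weak opponent the same as a win over a strong one. To correct for this, he proposes an Elo‑style expected score, E_self = 1 / (1 + 10^((R_opponent - R_self) / 400)), where R denotes a model's rating. A win against a stronger opponent has a low expected score, so it is more surprising and counts for more than a win against a weaker one.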
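A minimal sketch of the expected-score formula follows, together with the usual Elo update in which a rating moves in proportion to the surprise (actual outcome minus expected score). The step size `k = 32` and the example ratings are assumptions chosen for illustration; the summary does not say which values the lecture uses.

```python
def expected_score(r_self: float, r_opponent: float) -> float:
    """Elo expected score: the predicted chance that 'self' wins the comparison."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_self) / 400.0))


def elo_update(r_self: float, r_opponent: float, outcome: float, k: float = 32.0) -> float:
    """Return the new rating for 'self'.

    outcome: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    k: step size (assumed value, not taken from the lecture).
    """
    surprise = outcome - expected_score(r_self, r_opponent)
    return r_self + k * surprise


if __name__ == "__main__":
    # Beating a much stronger opponent yields a large gain...
    print(round(elo_update(1500, 1900, outcome=1.0), 2))
    # ...while beating a much weaker one barely moves the rating.
    print(round(elo_update(1900, 1500, outcome=1.0), 2))
```

With equal ratings the expected score is 0.5, and a 400-point deficit gives an expected score of 1/11, so the same win is worth far more against the stronger opponent.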
Adopting these calibrated metrics enables more reliable leaderboards, faster model iteration, and clearer ROI for companies deploying generative AI in advertising, design, or content creation, where both visual quality and faithful prompt execution are business‑critical.