Are AI Benchmarks Telling The Full Story? [SPONSORED]

Machine Learning Street Talk • December 20, 2025

Why It Matters

Human‑centric, demographically balanced evaluation reveals safety and trust gaps that technical benchmarks hide, guiding developers toward AI that works responsibly for diverse real‑world users.

Summary

The video critiques the current reliance on technical AI benchmarks, arguing that they miss the human-centric aspects of large language model (LLM) performance. Andrew Gordon and Nora Petrova of Prolific explain that while models may ace exams like MMLU or Humanity's Last Exam, those scores do not guarantee a safe, trustworthy, or engaging user experience. They advocate a shift toward human-preference leaderboards that capture dimensions such as helpfulness, communication style, adaptability, trust, and perceived personality, using stratified, demographically representative samples.

Prolific's initial User Experience Leaderboard, tested with 500 U.S. participants, evolved into the HUMAINE leaderboard, which runs comparative battles between models and ranks them with the TrueSkill algorithm, prioritizing the matchups that yield the most information and reduce uncertainty fastest. Unlike the open Chatbot Arena, which lacks demographic data and can be biased by uneven sampling, HUMAINE samples participants based on census-derived age, ethnicity, and political alignment, allowing separate tournaments for each demographic group and a consolidated, statistically robust ranking. The methodology penalizes low-effort queries and rewards multi-step conversations, ensuring richer feedback than a simple "which response do you prefer?" vote.
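To make the ranking mechanics concrete, here is a minimal sketch of a TrueSkill-style update for a single head-to-head battle, plus one possible way to pick the next matchup. It is an illustration under stated assumptions, not Prolific's implementation: it uses the simplified two-player, no-draw form of the update, the conventional default parameters (mu = 25, sigma = 25/3, beta = 25/6), hypothetical model names, and an outcome-entropy heuristic as a stand-in for "information gain".

```python
import math
from dataclasses import dataclass
from itertools import combinations

BETA = 25.0 / 6  # performance noise (conventional TrueSkill default; an assumption here)


@dataclass
class Rating:
    mu: float = 25.0         # current skill estimate
    sigma: float = 25.0 / 3  # uncertainty about that estimate


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    # erfc keeps precision in the lower tail better than 1 + erf
    return 0.5 * math.erfc(-x / math.sqrt(2))


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Simplified two-player, no-draw TrueSkill update (no dynamics term)."""
    c = math.sqrt(2 * BETA**2 + winner.sigma**2 + loser.sigma**2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # how surprising the result was
    w = v * (v + t)         # how much uncertainty to remove, in (0, 1)
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1 - winner.sigma**2 / c**2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1 - loser.sigma**2 / c**2 * w),
    )
    return new_winner, new_loser


def outcome_entropy(a: Rating, b: Rating) -> float:
    """Entropy (bits) of the predicted result; highest when the battle is a toss-up."""
    c = math.sqrt(2 * BETA**2 + a.sigma**2 + b.sigma**2)
    p = _cdf((a.mu - b.mu) / c)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def next_pair(ratings: dict[str, Rating]) -> tuple[str, str]:
    """Hypothetical selection rule: battle the pair whose outcome is least predictable."""
    return max(
        combinations(ratings, 2),
        key=lambda pair: outcome_entropy(ratings[pair[0]], ratings[pair[1]]),
    )


if __name__ == "__main__":
    ratings = {name: Rating() for name in ("model_a", "model_b", "model_c")}
    first, second = next_pair(ratings)
    # Pretend a participant preferred `first` in this battle.
    ratings[first], ratings[second] = update(ratings[first], ratings[second])
    for name, r in ratings.items():
        print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f}")
```

Each battle shifts the winner's estimate up and the loser's down, and shrinks the uncertainty of both. Choosing the least predictable matchup is one simple way to spend participant effort where it changes the ranking most; the actual selection criterion used by the leaderboard may differ.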

Key findings from the pilot indicate that while leading models performed similarly on task-oriented metrics like helpfulness, they lagged on personality, cultural relevance, and background awareness. This suggests that current fine-tuning practices may not adequately capture the nuanced, subjective qualities users value, potentially because training on the entire internet does not translate into a coherent, user-aligned persona. The presenters also highlight systemic issues in existing leaderboards, such as private testing advantages for some companies and the lack of safety metrics, underscoring the need for transparent, human-centric evaluation frameworks.

If adopted broadly, HUMAINE-style leaderboards could reshape AI development priorities, pushing firms to optimize for safety, trust, and cultural alignment alongside raw performance. By providing actionable, demographically balanced insights, these leaderboards promise more reliable assessments of how models will behave in real-world, high-stakes applications such as mental-health support or policy advice, thereby fostering responsible AI deployment.

Original Description

Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: the human experience.
Why High Benchmark Scores Don’t Mean Better AI
Joining us are Andrew Gordon (Staff Researcher in Behavioral Science) and Nora Petrova (AI Researcher) from Prolific. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.

Key Insights in This Episode:
• The F1 Car Analogy: Andrew explains why a model that excels at "Humanity's Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.
• The "Wild West" of AI Safety: As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training, citing recent controversial incidents like Grok-3's "Mecha Hitler."
• Fixing the "Leaderboard Illusion": The team critiques current popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can lead to biased results and how companies can "game" the system.
• The Xbox Secret to AI Ranking: Discover how Prolific uses TrueSkill (the same algorithm Microsoft developed for Xbox Live matchmaking) to create a fairer, more statistically sound leaderboard for LLMs.
• The Personality Gap: Early data from the HUMAINE Leaderboard suggests that while AI is getting smarter, it is actually performing worse on metrics like personality, culture, and "sycophancy" (the tendency for models to become annoying "people-pleasers").

About the HUMAINE Leaderboard
Moving beyond simple "A vs. B" testing, the researchers discuss their new framework that samples participants based on census data (Age, Ethnicity, Political Alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.
Are we building models for benchmarks, or are we building them for humans? It’s time to change the scoreboard.
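As a rough illustration of the census-based sampling described above, the sketch below turns population shares for one demographic variable into per-group participant quotas using largest-remainder rounding. The function name, the age bands, and the shares are hypothetical placeholders rather than real census figures or Prolific's actual procedure; the 500-participant sample size echoes the pilot mentioned in the summary.

```python
def allocate_quotas(population_shares: dict[str, float], sample_size: int) -> dict[str, int]:
    """Split sample_size across groups in proportion to their population shares.

    Uses largest-remainder rounding so the quotas always sum to sample_size.
    """
    total = sum(population_shares.values())
    exact = {g: sample_size * share / total for g, share in population_shares.items()}
    quotas = {g: int(x) for g, x in exact.items()}  # round every group down first
    leftover = sample_size - sum(quotas.values())
    # Hand the remaining seats to the groups that lost the most to rounding.
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


if __name__ == "__main__":
    # Placeholder shares for illustration only, NOT real census data.
    age_shares = {"18-29": 0.20, "30-44": 0.25, "45-59": 0.25, "60+": 0.30}
    print(allocate_quotas(age_shares, sample_size=500))
```

Recruiting to fixed quotas like these, rather than accepting whoever shows up, is what keeps any single group (for example, tech enthusiasts) from dominating the votes. A real study crossing several variables at once would need joint rather than per-variable targets.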
Rescript link:
https://app.rescript.info/public/share/IDqwjY9Q43S22qSgL5EkWGFymJwZ3SVxvrfpgHZLXQc

TIMESTAMPS:
00:00:00 Introduction & The Benchmarking Problem
00:01:58 The Fractured State of AI Evaluation
00:03:54 AI Safety & Interpretability
00:05:45 Bias in Chatbot Arena
00:06:45 Prolific's Three Pillars Approach
00:09:01 TrueSkill Ranking & Efficient Sampling
00:12:04 Census-Based Representative Sampling
00:13:00 Key Findings: Culture, Personality & Sycophancy

REFERENCES:
Paper:
[00:00:15] MMLU
https://arxiv.org/abs/2009.03300
[00:05:10] Constitutional AI
https://arxiv.org/abs/2212.08073
[00:06:45] The Leaderboard Illusion
https://arxiv.org/abs/2504.20879
[00:09:41] HUMAINE Framework Paper
https://huggingface.co/blog/ProlificAI/humaine-framework
Company:
[00:00:30] Prolific
https://www.prolific.com
[00:01:45] Chatbot Arena
https://lmarena.ai/
Person:
[00:00:35] Andrew Gordon
https://www.linkedin.com/in/andrew-gordon-03879919a/
[00:00:45] Nora Petrova
https://www.linkedin.com/in/nora-petrova/
Algorithm:
[00:09:01] Microsoft TrueSkill
https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/
Leaderboard:
[00:09:21] Prolific HUMAINE Leaderboard
https://www.prolific.com/humaine
[00:09:31] HUMAINE HuggingFace Space
https://huggingface.co/spaces/ProlificAI/humaine-leaderboard
[00:10:21] Prolific AI Leaderboard Portal
https://www.prolific.com/leaderboard
Dataset:
[00:09:51] Prolific Social Reasoning RLHF Dataset
https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf
Organization:
[00:10:31] MLCommons
https://mlcommons.org/