Transparent, reproducible scores let developers compare real‑world performance, reducing reliance on opaque leaderboards.
The rapid saturation of classic benchmarks such as MMLU, GSM8K, and HumanEval has exposed a growing disconnect between headline scores and actual model capabilities. As models approach near‑human performance on these tests, researchers and practitioners struggle to trust a single leaderboard, especially when disparate sources report conflicting numbers. This fragmentation hampers progress, making it difficult to gauge true advancements or identify gaps that matter in production settings.
Hugging Face’s new decentralized evaluation framework tackles this problem by turning the Hub into a living audit trail for model performance. Dataset repositories can now declare themselves as official benchmarks, providing an eval.yaml spec that any user can run and verify. Model owners publish their results in a standardized .eval_results directory, while any community member can open a pull request with additional scores, linking to papers, third‑party logs, or raw outputs. The platform automatically aggregates these contributions, displaying both author and community entries on a unified leaderboard and tagging reproducible results with verified badges. All data is exposed via public APIs, enabling developers to build custom dashboards or analytics pipelines.
For the broader AI ecosystem, this openness promises more reliable model selection, faster identification of overfitting to test sets, and a clearer path toward evaluating emerging tasks that current benchmarks ignore. Companies can integrate the Hub’s API to monitor competitor performance or to audit internal models against community‑reported baselines. While the feature won’t eliminate benchmark saturation, it makes the evaluation process visible, auditable, and collaborative—key steps toward aligning research metrics with real‑world utility.
TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges mark results that can be reproduced.
Let's be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet, based on usage reports, some models that ace these benchmarks still can't reliably browse the web, write production code, or handle multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.
Furthermore, there is a second gap within the reported benchmark scores themselves: different sources report different numbers. Across model cards, papers, and evaluation platforms, scores rarely align, so the community lacks a single source of truth.
Decentralized and transparent evaluation reporting.
We are taking evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and letting the entire community openly report scores for benchmarks. We're starting with a shortlist of four benchmarks and will expand to the most relevant benchmarks over time.
Dataset repos can now register as benchmarks (e.g., MMLU-Pro, GPQA, and HLE are already live). They automatically aggregate reported results from across the Hub and display a leaderboard on the dataset card. Each benchmark defines its eval spec in an eval.yaml file, based on the Inspect AI format, so anyone can reproduce it, and reported results need to align with that task definition.
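To make the reproduction path concrete, here is a minimal sketch of an Inspect AI task of the kind such a spec describes. The task name, toy dataset, and scorer are placeholders, not the definition of any registered benchmark.

```python
# Minimal Inspect AI task sketch (illustrative only; real benchmarks on the Hub
# define their own datasets, solvers, and scorers via eval.yaml).
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def toy_benchmark():
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],  # placeholder sample
        solver=generate(),  # plain generation, no special prompting
        scorer=match(),     # exact-match scoring against the target
    )
```

Running something like `inspect eval toy_benchmark.py --model <provider>/<model>` produces eval logs that a score submission can then link to.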

Eval scores live in .eval_results/*.yaml in the model repo. They appear on the model card and are fed into benchmark datasets. Both the model author’s results and open pull requests for results will be aggregated. Model authors will be able to close score PRs and hide results.
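As a purely illustrative sketch (the field names are hypothetical, not the official .eval_results schema), generating such a file could look like this:

```python
# Write a hypothetical eval-results file into .eval_results/.
# The keys below are illustrative only; follow the documented schema in practice.
from pathlib import Path

import yaml

results = {
    "benchmark": "some-org/some-benchmark",      # hypothetical benchmark dataset id
    "metric": "accuracy",
    "value": 0.712,
    "source": "https://example.com/eval-logs",   # link to logs, a paper, etc.
}

out_dir = Path(".eval_results")
out_dir.mkdir(exist_ok=True)
(out_dir / "some-benchmark.yaml").write_text(yaml.safe_dump(results, sort_keys=False))
```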
Any user can submit evaluation results for any model via a PR. Results are shown as "community" results without waiting for model authors to merge or close the PR. Submitters can link to sources such as a paper, a model card, a third-party evaluation platform, or Inspect eval logs, and scores can be discussed like any other PR. Since the Hub is Git-based, there is a full history of when evals were added and when changes were made.
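A community submission could then be opened with the huggingface_hub client; the repo id, file path, and commit message below are placeholders:

```python
# Open a pull request that adds an eval-results file to another user's model repo.
# All identifiers here are hypothetical and only illustrate the flow.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj=".eval_results/some-benchmark.yaml",  # local results file
    path_in_repo=".eval_results/some-benchmark.yaml",
    repo_id="some-org/some-model",                        # hypothetical target model
    repo_type="model",
    create_pr=True,                                       # submit as a community PR
    commit_message="Add community eval results for some-org/some-benchmark",
)
```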

To learn more about evaluation results, check out the documentation.
Decentralizing evaluation will surface scores that already exist across the community in sources like model cards and papers. Once surfaced, the community can build on top of them to aggregate, track, and understand scores across the field. All scores will also be exposed via Hub APIs, making it easy to build curated leaderboards, dashboards, and other tooling.
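The aggregated-score API itself isn't detailed here, but because the results files live in model repos, they can already be fetched with the standard huggingface_hub client; the repo id below is a placeholder:

```python
# Fetch and parse any .eval_results/*.yaml files published in a model repo.
# The repo id is hypothetical; the file layout follows the description above.
import yaml
from huggingface_hub import HfApi, hf_hub_download

repo_id = "some-org/some-model"
api = HfApi()

result_files = [
    f for f in api.list_repo_files(repo_id)
    if f.startswith(".eval_results/") and f.endswith(".yaml")
]

for filename in result_files:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(local_path) as fh:
        print(filename, yaml.safe_load(fh))
```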
Community evals do not replace benchmarks, so leaderboards and closed evals with published results remain crucial. However, we believe it’s important to contribute to the field with open eval results based on reproducible eval specs.
This won’t solve benchmark saturation or close the benchmark‑reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.
Above all, we hope to make the Hub an active place to build and share reproducible benchmarks, particularly for new tasks and domains that still challenge SOTA models.
Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo.
Check out the scores on the benchmark dataset page.
Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist.
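For the registration step, adding the spec is an ordinary file commit to your own dataset repo; this sketch assumes a local eval.yaml and a hypothetical repo id:

```python
# Add an eval.yaml spec to your benchmark dataset repo.
# The repo id is a placeholder; the eval.yaml contents follow your Inspect AI-based spec.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="eval.yaml",           # your local spec file
    path_in_repo="eval.yaml",
    repo_id="your-org/your-benchmark",     # hypothetical dataset repo
    repo_type="dataset",
    commit_message="Register benchmark eval spec",
)
```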
The feature is in beta. We’re building in the open. Feedback welcome.