Featuring Every Eval Ever Results on Hugging Face Model Pages
Every Eval Ever (EEE) enables cross-posting and interpreting evaluation results, while linking to open models, leaderboards, and a unified, standardized metadata store. EEE launched in February 2026 as a project of the EvalEval Coalition, the first cross-institutional effort to improve how AI evaluation results get reported by both first- and third-party evaluators.
Key Takeaways
- Hugging Face launched Community Evals in February 2026 to decentralize how benchmark scores get reported on the Hub. Combined, Community Evals and EEE patch gaps in how users, researchers, and policymakers trust, understand, and choose evaluations and models.
- These gaps can arise from evaluation settings that we found are commonly unreported.
- Reproducing just those runs from scratch would cost somewhere in the hundreds of thousands of dollars, which is a reasonable argument for not letting the data scatter once someone has paid to generate it.
- First-party evaluators reporting on their own models and third-party evaluators reporting on someone else's can both submit to Community Evals and to EEE, and anyone browsing the Hub gets results that trace back to a full record.
- When you submit your data through your organization's official Hugging Face account, your results show up with a verified checkmark on EvalEval, a signal to readers that the numbers come straight from the source.
- The list of official benchmarks grows over time.
Stats & Key Facts
- The same model on the same benchmark often returns different scores depending on who ran it and how; LLaMA 65B, for one, has been reported at both 63.
- Since launching, the datastore on Hugging Face has grown to around 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled from 31 different reporting formats.
Hugging Face launched Community Evals in February 2026 to decentralize how benchmark scores get reported on the Hub. Combined, Community Evals and EEE patch gaps in how users, researchers, and policymakers trust, understand, and choose evaluations and models. Evaluation results are how we measure model capabilities, compare models against each other, and reason about safety and governance, yet they are scattered and hard to compare.
They live in papers, leaderboards, blog posts, harness logs, and other places, each in its own format. The same model on the same benchmark often returns different scores depending on who ran it and how; LLaMA 65B, for one, has been reported at both 63. These gaps can arise from evaluation settings that we found are commonly unreported.
EEE is our fix for the reporting side. It's one JSON schema for an evaluation result that records:
- who ran it
- which model
- how it was accessed
- generation settings
- what the metric actually means
A companion JSONL file for per-sample outputs is recommended. The schema was built with feedback from researchers and policy researchers, and it takes in results from any source, so harness logs, leaderboard scrapes, and paper numbers all end up in the same shape.
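To make the idea of a single record shape more concrete, here is a minimal Python sketch of one evaluation result plus a per-sample JSONL companion. Every field name, value, and helper in it (`evaluator`, `access_method`, `missing_fields`, `to_jsonl`, and so on) is a hypothetical placeholder chosen for illustration. It is not the actual EEE schema, which should be taken from the project's own documentation.

```python
import json

# Hypothetical field names for illustration only; NOT the actual EEE schema.
REQUIRED_FIELDS = (
    "evaluator",            # who ran it
    "model",                # which model
    "access_method",        # how it was accessed
    "generation_settings",  # generation settings
    "metric",               # what the metric actually means
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def to_jsonl(samples: list[dict]) -> str:
    """Serialize per-sample outputs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(sample, sort_keys=True) for sample in samples)


# Placeholder values only; no real model, benchmark, or score is implied.
record = {
    "evaluator": {"name": "example-lab", "party": "third"},
    "model": "example-org/example-model",
    "access_method": "local-weights",
    "generation_settings": {"temperature": 0.0, "max_new_tokens": 256},
    "metric": {
        "name": "accuracy",
        "description": "fraction of exact-match answers",
        "value": 0.5,
    },
}
samples = [
    {"sample_id": 0, "output": "A", "correct": True},
    {"sample_id": 1, "output": "C", "correct": False},
]

record_json = json.dumps(record, indent=2)  # the single result record
samples_jsonl = to_jsonl(samples)           # the companion per-sample file

if __name__ == "__main__":
    print("missing fields:", missing_fields(record))
    print(record_json)
    print(samples_jsonl)
```

Keeping the settings next to the score in one record is what would let two differing numbers for the same model and benchmark be compared on how they were produced.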
For more details please read the original article at Hugging Face.