olmo-eval: An evaluation workbench for the model development loop

Overview

Ai2 released olmo-eval, an open evaluation workbench built for the day-to-day loop of developing a language model rather than just scoring finished ones. It builds on Ai2's earlier OLMES standard from 2024 and aims to cut the work of adding new evaluations, make it easier to run benchmarks flexibly across model checkpoints, and support prompt-by-prompt analysis of results. The tool treats agentic and multi-turn evaluation as a first-class use case and adds analysis tools for judging whether an intervention actually beat the baseline or whether the difference is just noise.

Key Takeaways

Ai2 released olmo-eval, an open evaluation workbench for the model development loop.
It builds on OLMES, the Open Language Model Evaluation Standard introduced in 2024.
olmo-eval is designed for repeatedly evaluating a model as its data, architecture, hyperparameters, and scale change.
Agentic and multi-turn evaluation is supported as a first-class use case.
Benchmarks can run either directly for speed or in an isolated container when a locked-down environment is needed, with the lightweight path as the default.
Stronger analysis tools help judge whether an intervention improved on the baseline or whether a difference, such as a 2.4pp (percentage point) change, is just noise.

Stats & Key Facts

›OLMES was introduced in 2024.
›The post uses a 2.4pp change in performance as an example of a borderline result.
›Code is available at github.com/allenai/olmo-eval.

Why Ai2 built olmo-eval

Building an LLM means evaluating it over and over across many interventions.

›Every change to data, architecture, or hyperparameters sends developers back through the same evaluation loop.
›That loop includes adding or reconfiguring benchmarks, re-running them on each new checkpoint, and checking results.
›Developers also need to confirm that a gain in a small experiment still holds on a full training run.

Ai2 argues that most evaluation tools are not designed for this work. They are built either to run established benchmarks across finished models or to run a model through multi-step, tool-using problems in a sandbox. According to the post, they do not keep up with a model that is constantly changing, nor do they reflect how a model behaves under specific real-world conditions.

Building on OLMES

olmo-eval extends Ai2's earlier evaluation standard.

›OLMES, the Open Language Model Evaluation Standard, was introduced in 2024.
›It was meant to make LLM benchmark scores easier to compare across releases.
›It became the basis for evaluating Ai2's open models from Olmo to Tulu.

The post notes that the same models were often scored on the same benchmarks in different ways, with aspects like prompt formatting and task formulation varying from paper to paper. As a result, claims about which models performed best were often not reproducible. OLMES pinned those benchmarking choices down in an open, documented standard, and olmo-eval extends that work across the rest of LLM development.
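To make "prompt formatting and task formulation" concrete, the sketch below renders one question in two common ways: a multiple-choice prompt with labeled options and a cloze-style set of candidate continuations. The question, choices, and function names are invented for illustration and are not taken from OLMES or olmo-eval.

```python
# Hypothetical example item; not from any real benchmark.
QUESTION = "Which gas do plants absorb from the atmosphere?"
CHOICES = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]


def multiple_choice_prompt(question, choices):
    """Labeled options; the model is expected to answer with a letter."""
    lines = [f"Question: {question}"]
    for label, choice in zip("ABCD", choices):
        lines.append(f" {label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def cloze_prompts(question, choices):
    """One full continuation per choice; a scorer would compare their likelihoods."""
    return [f"Question: {question}\nAnswer: {choice}" for choice in choices]


mc = multiple_choice_prompt(QUESTION, CHOICES)
cz = cloze_prompts(QUESTION, CHOICES)
print(mc)
print(cz[1])
```

The same model can score differently under each rendering, which is why a documented standard has to fix the choice.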

What olmo-eval improves

olmo-eval aims to reduce friction and add flexibility.

›It cuts down the work of implementing new evaluations.
›It offers more flexibility in defining where and how evaluations run.
›It makes it easier to compose individual components into larger workflows.

A model's final score is only part of the evaluation process, so olmo-eval focuses on the steps around that score, including running benchmarks across checkpoints and analyzing results prompt by prompt instead of as a single overall number.
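A small sketch of why prompt-level analysis matters: two checkpoints can share the same overall accuracy while getting different prompts right. The result dictionaries and helper names here are hypothetical and do not reflect olmo-eval's actual data format.

```python
# Hypothetical per-prompt results: prompt id -> whether the checkpoint was correct.
baseline = {"p1": True, "p2": True, "p3": False, "p4": False}
candidate = {"p1": True, "p2": False, "p3": True, "p4": False}


def accuracy(results):
    """Fraction of prompts answered correctly."""
    return sum(results.values()) / len(results)


def flipped_prompts(before, after):
    """List prompts whose outcome changed between two checkpoints."""
    shared = sorted(before.keys() & after.keys())
    return {
        "gained": [p for p in shared if not before[p] and after[p]],
        "lost": [p for p in shared if before[p] and not after[p]],
    }


flips = flipped_prompts(baseline, candidate)
print(accuracy(baseline), accuracy(candidate))  # identical aggregate scores
print(flips)  # yet one prompt was gained and one was lost
```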

How it compares to Harbor

olmo-eval overlaps with Harbor but serves a different scope.

›Harbor is an open framework for evaluating AI agents inside containerized, sandboxed environments.
›Harbor is aimed mainly at running and publishing agent benchmarks.
›olmo-eval is built for the everyday work of developing a model.

Harbor runs everything the same way, inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval instead lets developers choose how each benchmark runs. A benchmark that just needs a model to answer questions can run directly, which is faster and cheaper, while a benchmark that needs a locked-down environment, such as one that runs code the model wrote, gets an isolated container setup. The lightweight path is the default, and olmo-eval opts for the heavier setup only when a benchmark requires it. Both tools keep benchmarks separate from the runtime policy.
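The default-to-lightweight policy described above can be sketched as a small dispatch function. This is an illustration only: `Runtime`, `BenchmarkSpec`, `executes_model_code`, and `choose_runtime` are hypothetical names, not olmo-eval's configuration or API.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    DIRECT = "direct"        # lightweight path, assumed to be the default
    CONTAINER = "container"  # isolated, locked-down environment


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical benchmark description, kept separate from the runtime choice."""
    name: str
    executes_model_code: bool = False  # e.g. runs code the model wrote


def choose_runtime(spec: BenchmarkSpec) -> Runtime:
    """Pick the heavy path only when the benchmark requires isolation."""
    if spec.executes_model_code:
        return Runtime.CONTAINER
    return Runtime.DIRECT


qa = BenchmarkSpec("toy_qa")
coding = BenchmarkSpec("toy_codegen", executes_model_code=True)
print(qa.name, choose_runtime(qa).value)
print(coding.name, choose_runtime(coding).value)
```

Keeping the benchmark description free of runtime details is what lets the policy change without touching the benchmark itself.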

Adding benchmarks and analyzing results

How you add a benchmark depends on what the benchmark needs.

›A basic evaluation uses a short definition, with options to let a model use tools as it works through a benchmark.
›A benchmark that already has its own code and procedure uses a thin wrapper so olmo-eval can run it as is.
›Results are reported alongside other benchmark scores in the same format.
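The two routes above can be sketched as follows: a short declarative definition for a basic evaluation, and a thin wrapper around a benchmark that ships its own procedure, both reporting in one shared format. Every class, field, and function name here is hypothetical; the post does not show olmo-eval's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]  # assumed: a model maps a prompt to a text answer


@dataclass
class Result:
    """Shared report format for every benchmark in this sketch."""
    benchmark: str
    metric: str
    score: float


@dataclass
class SimpleBenchmark:
    """Short definition: a name plus prompt/answer pairs scored by exact match."""
    name: str
    examples: List[Dict[str, str]]

    def run(self, model: Model) -> Result:
        correct = sum(model(ex["prompt"]).strip() == ex["answer"] for ex in self.examples)
        return Result(self.name, "accuracy", correct / len(self.examples))


@dataclass
class WrappedBenchmark:
    """Thin wrapper: delegates to existing benchmark code and repackages its score."""
    name: str
    external_run: Callable[[Model], float]
    metric: str = "score"

    def run(self, model: Model) -> Result:
        return Result(self.name, self.metric, float(self.external_run(model)))


def toy_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def legacy_harness(model: Model) -> float:
    # Stand-in for a benchmark that already has its own code and procedure.
    return 1.0 if model("What is 2 + 2?") == "4" else 0.0


simple = SimpleBenchmark("toy_arithmetic", [
    {"prompt": "What is 2 + 2?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
])
wrapped = WrappedBenchmark("toy_legacy", legacy_harness)
results = [b.run(toy_model) for b in (simple, wrapped)]
for r in results:
    print(r)
```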

Stronger analysis tools help developers judge whether an intervention actually improved on the baseline or the difference amounts to noise. The post raises the example of whether a 2.4pp change in performance is enough to make a call.
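One common way to approach the noise question is a paired bootstrap over per-prompt outcomes. The post does not say which statistical method olmo-eval uses, so the sketch below is a generic technique on synthetic data, with hypothetical function and variable names. It builds 500 prompts on which the candidate is 2.4pp ahead and estimates how often a resampled evaluation set would show no gain.

```python
import random


def paired_bootstrap(baseline, candidate, n_resamples=1000, seed=0):
    """Return (observed gain, fraction of resamples where the gain is <= 0).

    baseline and candidate are per-prompt 0/1 scores on the same prompts.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts")
    n = len(baseline)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping each prompt's pair together.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_gain <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Synthetic data: 500 prompts, candidate loses 18 and gains 30 (net +12 = 2.4pp).
baseline_scores = [1] * 300 + [0] * 200
candidate_scores = [0] * 18 + [1] * 282 + [1] * 30 + [0] * 170

gain, frac_not_better = paired_bootstrap(baseline_scores, candidate_scores)
print(f"observed gain: {gain * 100:.1f}pp")
print(f"share of resamples with no gain: {frac_not_better:.3f}")
```

A small share suggests the gain is unlikely to be a sampling artifact of this prompt set; it says nothing about other noise sources such as training seed variance.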

Frequently Asked Questions

What is olmo-eval?

It is an open evaluation workbench released by Ai2 that is built for the everyday loop of developing a language model, including adding benchmarks, running them across checkpoints, and analyzing results.

How does olmo-eval relate to OLMES?

It builds on OLMES, the Open Language Model Evaluation Standard introduced in 2024, and extends it across the rest of LLM development.

How is olmo-eval different from Harbor?

Harbor runs everything inside sealed, reproducible containers and is aimed mainly at running and publishing agent benchmarks, while olmo-eval is built for the everyday work of developing a model and lets you choose whether each benchmark runs directly or in an isolated container.

Does olmo-eval support agent evaluation?

Yes. Agentic and multi-turn evaluation is supported as a first-class use case.

Where can I find the code?

The code is available at github.com/allenai/olmo-eval.

olmo-eval gives model developers an open, flexible workbench to evaluate constantly changing models and tell real gains from noise.