By @IBMResearch
Publication Date: 2025-12-16 16:00:00
Much of today’s progress in AI can be attributed to benchmarks — rigorous, standardized tests designed to expose what large language models can and can’t do, and to spur further innovation.
Benchmarks provide a structured way to compare models side by side and gauge their biases, risks, and fitness for a given task. The popularity of an LLM can rise or fall on its benchmark score. Why, then, are these shared, make-or-break exams so poorly documented?
Elizabeth Daly, a senior technical staff member at IBM Research, kept asking herself that question as she surveyed the inconsistent, often incomplete state of benchmark documentation.
“I started to see this massive disconnect between what information is available and how it’s represented in the test harnesses,” she said. “Is the benchmark telling you that the model’s really good at algebra, or only when the test questions are presented as multiple-choice?”
At the time, Daly and other researchers across IBM were working…