Evaluating AI Benchmarks: A Call for Quality and Relevance
The quest to establish effective benchmarks in artificial intelligence has taken a significant step forward, with researchers outlining the criteria that define a good benchmark. “Discussing the quality and purpose of benchmarks is crucial,” notes researcher Ivanova. “While we recognize this as an important issue, the challenge lies in the absence of a unified standard. This paper seeks to introduce a set of evaluation criteria that can aid in this regard.”
The paper’s publication coincided with the launch of BetterBench, a website that ranks the most prominent AI benchmarks. The assessment considers multiple factors, including the involvement of experts in the benchmark’s design, the clarity of the capabilities being tested, and fundamental elements such as the presence of feedback channels and whether the benchmark has undergone peer review.
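To make this kind of rubric more concrete, here is a minimal, purely illustrative sketch in Python of how a benchmark might be scored against criteria like those described above. It is not BetterBench’s actual methodology or scoring scale; the criterion names, the equal weighting, and the example entry are assumptions made only for illustration.

    # Hypothetical sketch: score a benchmark against illustrative design criteria.
    # None of the names or weights below reflect BetterBench's real rubric.
    from dataclasses import dataclass

    @dataclass
    class BenchmarkAssessment:
        name: str
        domain_experts_involved: bool      # were subject-matter experts part of the design?
        capability_clearly_defined: bool   # is the tested capability explicitly specified?
        feedback_channel_exists: bool      # can users report problems with items or scoring?
        peer_reviewed: bool                # has the benchmark undergone peer review?

        def score(self) -> float:
            """Return the fraction of these illustrative criteria the benchmark satisfies."""
            criteria = [
                self.domain_experts_involved,
                self.capability_clearly_defined,
                self.feedback_channel_exists,
                self.peer_reviewed,
            ]
            return sum(criteria) / len(criteria)

    # Example usage with a made-up benchmark entry.
    example = BenchmarkAssessment(
        name="example-benchmark",
        domain_experts_involved=True,
        capability_clearly_defined=False,
        feedback_channel_exists=True,
        peer_reviewed=False,
    )
    print(f"{example.name}: {example.score():.0%} of criteria met")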
Among the benchmarks evaluated, the MMLU received the lowest ratings. Dan Hendrycks, director of the Center for AI Safety and a co-creator of the MMLU benchmark, expressed his disagreement with the rankings. “I am an author of some highly-rated papers, yet I believe the lower-ranked benchmarks actually outperform them,” he stated. Despite this, Hendrycks agrees that advancing the field necessitates the development of improved benchmarks.
However, some experts raise concerns that the proposed criteria may overlook broader implications. “This paper introduces valuable aspects such as implementation and documentation criteria,” says Marius Hobbhahn, CEO of Apollo Research, an organization focused on AI evaluations. “But the paramount question remains: are we measuring the right concepts? It’s possible to fulfill all criteria yet still produce a benchmark that fails to address the essential issues.”
For instance, a benchmark designed to evaluate moral reasoning may not serve its intended purpose if the concept it measures is vaguely defined and domain experts were not consulted in its design, points out AI researcher Amelia Hardy, another author of the study. “A lot hinges on the clarity of what is being measured and who is included in the benchmark’s development process,” she emphasizes.
Efforts to improve AI benchmarking are already underway, with organizations like Epoch AI taking proactive measures. Its recently developed benchmark was created in collaboration with 60 mathematicians and vetted by two winners of the Fields Medal, the highest accolade in mathematics. This level of expert involvement fulfills one of the criteria established by the BetterBench initiative. Nonetheless, today’s most advanced AI models can answer fewer than 2% of the benchmark’s questions, indicating substantial room for improvement.
“Our goal was to encapsulate the full spectrum of contemporary mathematical research,” explains Tamay Besiroglu, associate director at Epoch AI. Despite the benchmark’s difficulty, Besiroglu expects AI models to perform well on it within just four to five years.
Source: www.technologyreview.com