Epoch AI, a nonprofit research organisation that studies trends in artificial intelligence and works to ensure the technology develops in line with ethical principles, has launched a new AI benchmark to test large language models (LLMs) on their reasoning and mathematical problem-solving skills.
Called FrontierMath, this tool features hundreds of expert-level, unpublished mathematics problems that could serve as an ongoing benchmark for tracking the progress of AI in complex mathematical reasoning.
The research group said these range from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory.
“We developed it through collaboration with over 60 mathematicians from leading institutions, including professors, IMO question writers, and Fields medalists,” it added.
It added that even the most advanced LLMs have scored under 2 percent on the new benchmark.
Epoch AI claims that current benchmarks like GSM8K and MATH are inadequate because data contamination allows AI models to achieve unnaturally high scores on them.
FrontierMath is said to address these issues with a set of unique, unpublished problems, reducing the risk of data contamination. The problems are designed to be “guess-proof”: they can only be solved through strong logical reasoning, making accidental correct answers highly unlikely.
As the research paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1 percent chance of guessing correctly without the proper reasoning.
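To see why exact, large answers make blind guessing impractical, consider the following minimal sketch of an answer-checking routine. It is a hypothetical illustration, not Epoch AI's actual evaluation harness; the problem IDs, reference answers, and the assumed size of the plausible-answer space are invented for the example.

```python
# Hypothetical sketch of "guess-proof" answer checking.
# Not Epoch AI's real harness; all problem IDs and answers are illustrative.

from fractions import Fraction

# Reference answers are exact objects (a large integer, an exact rational),
# so a submission only counts if it matches exactly.
REFERENCE_ANSWERS = {
    "toy-problem-1": 2_541_865_828_329,    # large exact integer
    "toy-problem-2": Fraction(355, 113),   # exact rational, not a float approximation
}

def is_correct(problem_id: str, submitted) -> bool:
    """Exact comparison: no partial credit, no tolerance window."""
    return submitted == REFERENCE_ANSWERS[problem_id]

# If plausible answers span a huge range (assumed here to be ~10^12 values),
# the chance of a blind hit is far below 1 percent.
plausible_answer_space = 10**12
print(f"Chance of a random correct guess: {1 / plausible_answer_space:.2e}")

print(is_correct("toy-problem-1", 2_541_865_828_329))  # True: exact match
print(is_correct("toy-problem-2", 3.1415929))          # False: float != exact Fraction
```

The design choice the sketch highlights is that answers can be verified automatically and unambiguously, while the answer space is far too large for pattern-matching or random guessing to score above chance.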
Epoch AI asserts that to truly gauge AI’s capabilities, benchmarks should focus on creative problem-solving that requires sustained reasoning over multiple steps. Many experts in the field agree that current benchmarks fall short in accurately assessing the depth of an AI model’s capabilities.
The group aims to collaborate further with the mathematics and AI research communities to refine and expand the benchmark, keeping it relevant and challenging for future AI systems. It plans to conduct regular evaluations, providing a standardised measure of progress and tracking how reasoning abilities improve over time and with scale.