Icelandic Math Eval: A Competitive Mathematics Benchmark for Large Language Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We introduce Icelandic Math Eval, the first comprehensive benchmark for evaluating large language models (LLMs) on competitive mathematics problems in Icelandic. Our dataset comprises 1,027 problems from Icelandic mathematics competitions spanning 1984 to 2025, covering algebra, geometry, number theory, and combinatorics across ten difficulty levels. We evaluate three state-of-the-art models (Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5) using a dual evaluation methodology that poses each problem both with and without its multiple-choice options. Our results reveal several key findings: (1) the models achieve 81–93% overall accuracy, demonstrating substantial cross-lingual transfer of mathematical reasoning capabilities; (2) a 17.5 percentage point performance drop on problems containing images highlights persistent challenges in multimodal mathematical reasoning; (3) a 6.7 percentage point gap between the two evaluation modes suggests that multiple-choice formats may overestimate genuine reasoning ability; and (4) performance degrades systematically with difficulty, falling to 43% accuracy on the most challenging problems. Using an LLM-as-judge evaluation approach, we provide detailed analyses across problem types, difficulty levels, and model capabilities. This work contributes to multilingual AI evaluation and demonstrates the importance of developing rigorous benchmarks for diverse languages to ensure comprehensive assessment of AI capabilities.
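Purely as an illustrative sketch, and not the authors' implementation, the Python below shows one way a dual-mode, LLM-as-judge evaluation like the one summarized above could be structured. All names here (`Problem`, `pose`, `judge`, `evaluate`) and the text-in/text-out model interface are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A model call is abstracted as text-in / text-out so the sketch does
# not depend on any particular provider SDK (hypothetical interface).
LLMCall = Callable[[str], str]

@dataclass
class Problem:
    statement: str                # problem text, in Icelandic
    choices: Optional[list[str]]  # multiple-choice options, if any
    reference_answer: str         # ground-truth answer

def pose(problem: Problem, with_choices: bool) -> str:
    """Build the prompt for one of the two evaluation modes."""
    prompt = problem.statement
    if with_choices and problem.choices:
        options = "\n".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(problem.choices)
        )
        prompt += "\n\n" + options
    return prompt + "\n\nGive your final answer on the last line."

def judge(candidate: str, reference: str, call_judge: LLMCall) -> bool:
    """Ask a judge model whether the candidate answer is equivalent
    to the reference (tolerating rephrasings, e.g. '1/2' vs '0.5')."""
    verdict = call_judge(
        "Do these two answers to a math problem agree? "
        "Reply with exactly YES or NO.\n"
        f"Reference: {reference}\nCandidate: {candidate}"
    )
    return verdict.strip().upper().startswith("YES")

def evaluate(problems: list[Problem],
             call_model: LLMCall,
             call_judge: LLMCall,
             with_choices: bool) -> float:
    """Return accuracy over `problems` in a single evaluation mode."""
    correct = 0
    for p in problems:
        answer = call_model(pose(p, with_choices))
        if judge(answer, p.reference_answer, call_judge):
            correct += 1
    return correct / len(problems)
```

Under these assumptions, running `evaluate` twice per model, once with `with_choices=True` and once with `with_choices=False`, would produce the per-mode accuracies whose difference corresponds to the mode gap reported in the abstract.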