Charting the European LLM Benchmarking Landscape: A New Taxonomy and Registry
Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026
Abstract
While new benchmarks for large language models (LLMs) are developed continuously to keep pace with the growing capabilities of new models and of AI in general, the use and evaluation of LLMs in non-English languages remain a poorly charted landscape. We give a concise overview of recent developments in LLM benchmarking and then propose a new taxonomy for categorizing benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a registry of benchmarks that implements the new categorization and documents each benchmark with a rich set of metadescriptors. While still at a pilot stage, such a registry can lead to more coordinated development of benchmarks for European languages. We conclude with a review of current trends and advocate for greater linguistic and cultural sensitivity in evaluation methods.