
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI: 10.63317/5nxcp3zw7vdz

Abstract

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low- and medium-resource languages. We show that benchmarks containing synthetic or machine-translated data that has not been verified in any way commonly contain severely flawed test examples that are likely to skew results and undermine the tests’ validity. We warn against using such methods without verification in low- and medium-resource settings, as translation quality can, at best, only be as good as MT quality for a given language at a given time. Indeed, the results of our quantitative error analysis of existing benchmarks for Icelandic show clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.

Details

Paper ID
lrec2026-main-369
Pages
pp. 4702–4715
BibKey
ingimundarson-etal-2026-who
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Finnur Ágúst Ingimundarson
  • Steinunn Rut Friðriksdóttir
  • Bjarki Ármannsson
  • Iris Nowenstein
  • Steinþór Steingrímsson
