
Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/32e7ij6myh5i

Abstract

Evaluating the quality of answers to a given instruction is a demanding and time-consuming task, limiting the scalability of human assessment. Large language models (LLMs) have been proposed as automatic judges to reduce this effort, but their reliability in low-resource contexts remains uncertain. Moreover, the premise that humans are reliable judges of fine-grained response quality must itself be assessed if correlation with automated judges is to be treated as a gold standard. In this work, we investigate the performance of various LLM-as-a-judge models in a low-resource scenario, namely Basque, and evaluate their correlation with human judgments. We also measure the agreement among human judgments themselves, to assess their viability as a valid reference. To perform our experiments, we translated and manually post-edited the Just-Eval benchmark, a suite covering fine-grained aspects of response quality. We further extend the evaluation with a novel category aimed at judging both language consistency and grammaticality. Our results show that state-of-the-art models exhibit fairly poor correlations with humans and among themselves, calling for the development of dedicated LLM-as-a-judge models for this language.

Details

Paper ID
lrec2026-main-019
Pages
pp. 281-298
BibKey
ponce-etal-2026-judging
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • David Ponce
  • Harritxu Gete
  • Thierry Etchegoyhen
  • Irune Zubiaga
  • Aitor Soroa
