Are Social Biases in LLMs Consistent across Generative Tasks? A Case Study for Basque
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Most bias benchmarks for Large Language Models (LLMs) rely on multiple-choice formats, overlooking subtler biases that emerge in open-ended text generation. This gap is particularly relevant for low-resource languages such as Basque, where culturally grounded evaluation resources are limited. We introduce BasqBBG (Basque Bias Benchmark for Generation), the first systematic benchmark for social bias in Basque Natural Language Generation (NLG), covering eight bias categories (including a newly added feminism dimension) adapted from the BasqBBQ dataset. We validate an LLM-as-a-Judge framework against expert human evaluations on two NLG tasks (story continuation and generative QA), achieving strong agreement (0.78 on bias presence and 0.92 on bias directionality). We then scale this approach to ten additional tasks and five models. Results show that bias levels vary markedly across tasks and depend more on model family than on model size: Llama-based models exhibit higher and less consistent bias (45–50%), whereas GPT-4o and the Gemma-based Kimu-9B remain substantially fairer (≤20%). Our findings highlight the need for task-aware, language-specific frameworks to assess social bias in generative LLMs.

Keywords: Large Language Models, Social Bias, Basque, Natural Language Generation, Benchmarking, Manual Evaluation, LLM-as-a-Judge