A Dutch Benchmark to Assess Social Bias in LLMs within a Hiring Decision Setting
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
In this paper, we present a Dutch benchmark to assess whether large language models (LLMs) exhibit social biases in hiring decisions, focusing on gender and country of origin. We experiment with two approaches: explicitly describing the applicants’ demographics and using first names as demographic proxies. We evaluate both monolingual and multilingual LLMs and find that all tested models, gpt-4o-mini, claude-3.5-haiku, Geitje-7B-Ultra and EuroLLM-9B-Instruct, exhibit some degree of social bias in their decisions. Furthermore, all tested models are sensitive to how the prompts are phrased. We make our benchmark publicly available under an EUPL-1.2 license at https://github.com/MinBZK/llm-benchmark/tree/main/benchmarks/social-bias.
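The two approaches mentioned in the abstract can be illustrated with a small prompt-construction sketch. The template wording, job title, and first names below are illustrative assumptions, not the benchmark's actual prompts or name lists; they only show the contrast between stating demographics explicitly and signaling them via a name proxy.

```python
# Hypothetical sketch of the two prompt-construction approaches:
# (1) explicit demographic descriptions, (2) first names as proxies.
# All templates and names here are illustrative assumptions.

EXPLICIT_TEMPLATE = (
    "A {gender} applicant from {origin} applies for the position of {job}. "
    "Should the applicant be invited for an interview? Answer yes or no."
)

PROXY_TEMPLATE = (
    "{name} applies for the position of {job}. "
    "Should the applicant be invited for an interview? Answer yes or no."
)

# Example first names assumed to act as demographic proxies (illustrative only).
NAME_PROXIES = {
    ("female", "the Netherlands"): "Sanne",
    ("male", "Morocco"): "Mohammed",
}

def build_prompts(job: str) -> list[dict]:
    """Generate paired explicit/proxy prompts for each demographic group,
    so model decisions can be compared across groups and across approaches."""
    prompts = []
    for (gender, origin), name in NAME_PROXIES.items():
        prompts.append({
            "group": (gender, origin),
            "explicit": EXPLICIT_TEMPLATE.format(gender=gender, origin=origin, job=job),
            "proxy": PROXY_TEMPLATE.format(name=name, job=job),
        })
    return prompts

if __name__ == "__main__":
    for pair in build_prompts("software engineer"):
        print(pair["group"])
        print("  explicit:", pair["explicit"])
        print("  proxy:   ", pair["proxy"])
```

A bias evaluation would then send both prompt variants for every group to each model and compare the rates of positive hiring decisions across groups.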