RuBIN: A Russian Benchmark for Evaluating LLMs with Cultural Insights
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Understanding culture-specific knowledge is essential for developing language models that perform reliably across diverse social and linguistic settings. This work examines both methodological and practical aspects of evaluating culture-specific knowledge in large language models, with particular attention to the multiple-choice question answering format as a tool for identifying and measuring such knowledge. An analysis of existing benchmarks reveals several limitations, including insufficient cultural sensitivity and the presence of uninformative distractor options. In response, the RuBIN benchmark is introduced: a dataset of questions based on phrases that are widely known in Russian culture. The paper describes the process of selecting and filtering culturally relevant topics, generating plausible distractors with LLMs, and annotating and testing the benchmark for cross-linguistic robustness. RuBIN helps identify weaknesses of current LLMs in transferring cultural knowledge and can serve as a tool for further adapting these models to diverse linguistic and cultural contexts.