Benchmarking Mathematical Reasoning in a Low-Resource Language: Structured Prompting and Evaluation in Basque
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large Language Models (LLMs) have shown impressive performance on tasks requiring complex reasoning, but most evaluations focus on English and other high-resource languages. This work investigates how well LLMs perform mathematical reasoning in low-resource languages, using Basque as our primary case study. To support this analysis, we introduce MASEU, a benchmark designed to evaluate reasoning in Basque across arithmetic, algebraic, and logical tasks. We use this dataset to address three key questions: 1) how well do LLMs support Basque in reasoning tasks, 2) to what extent can including English in prompts improve results, and 3) what is the effect of continued pretraining in Basque? To explore these questions, we apply prompting strategies adapted for mathematical reasoning, building on chain-of-thought (CoT) prompting and one of its successors, DUP prompting, in both zero-shot and few-shot settings. Our experiments provide insights into how multilingual models handle reasoning tasks in underrepresented languages.
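As a rough illustration of the two prompting strategies named above, the following Python sketch builds a zero-shot CoT prompt and the three DUP stages (extract the core question, list the problem-solving information, then solve) for a Basque word problem. The template wording, the placeholder markers, and the example question are illustrative assumptions, not the exact prompts used in this work.

```python
def cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: append a reasoning trigger to the question."""
    return f"{question}\nLet's think step by step."


def dup_prompts(question: str, core: str = "<CORE>", info: str = "<INFO>") -> list:
    """DUP-style three-stage prompting. In practice, each stage is sent to the
    model in turn, with <CORE> and <INFO> replaced by the model's answers to
    the previous stages. Stage wording here is an illustrative assumption."""
    return [
        # Stage 1: reveal the core question.
        f"{question}\nPlease extract the core question: "
        "what is the problem ultimately asking?",
        # Stage 2: extract the problem-solving information.
        f"{question}\nNote: the core question is: {core}\n"
        "List the information in the problem needed to solve it.",
        # Stage 3: solve using both.
        f"{question}\nHint: the core question is: {core}\n"
        f"Relevant information: {info}\n"
        "Solve the problem step by step.",
    ]


if __name__ == "__main__":
    # Basque example (illustrative): "Miren has 3 apples and buys 5 more.
    # How many apples does she have now?"
    q = "Mirenek 3 sagar ditu eta 5 gehiago erosten ditu. Zenbat sagar ditu orain?"
    print(cot_prompt(q))
    for stage in dup_prompts(q):
        print("---")
        print(stage)
```

A cross-lingual variant of either strategy (question 2 above) would simply swap the English instruction strings for Basque ones, or mix the two within a single prompt.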