How Well Do Large Language Models Reason in Under-Resourced Languages? Evidence from Vietnamese
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Abstract
Despite advances in Large Language Models (LLMs), reasoning benchmarks remain centered on high-resource languages, leaving languages like Vietnamese under-evaluated. In this study, we address this gap by evaluating four models, PhoGPT (native), Vistral and VBD-Llama (adapted), and Llama-2 (English-centric), on commonsense and arithmetic reasoning. Because Vietnamese benchmarks for these tasks are lacking, we adapt two analogy datasets from English to Vietnamese and construct two sequence datasets, covering a range of structural complexity and difficulty levels. We evaluate diverse prompting strategies, including Chain-of-Thought, role-playing guidance, cross-lingual prompting, and few-shot learning. Our results reveal a baseline proficiency in analogical and arithmetic reasoning among the models, with Vistral and Llama-2 outperforming the other models on multiple tasks. The effects of Chain-of-Thought and contextual guidance are limited in Vietnamese, while cross-lingual prompting and few-shot learning show promising performance improvements. These findings underscore the feasibility of adapting benchmarks to less-resourced languages and provide insights into the strengths and weaknesses of Vietnamese LLMs, suggesting directions for model improvement.