On LLM Prompting Techniques for Arabic Language Arithmetic Reasoning

The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks

Abstract

Math word problems (MWPs) require complex reasoning to extract mathematical relationships from textual descriptions. While Large Language Models (LLMs) have shown remarkable performance on English mathematical reasoning tasks, their effectiveness on Arabic MWPs remains largely unexplored. This paper introduces three Arabic datasets (AGSM8K, Qudurat, and ArabicMWPs) and evaluates six LLMs using three prompting techniques: Manual Chain-of-Thought (CoT), Zero-shot CoT, and Self-consistency. Performance is assessed using accuracy and BERTScore (precision, recall, and F1-score). GPT-4o with Self-consistency achieves the highest accuracy, 97.65% on AGSM8K. The same configuration also obtains a BERTScore precision of 71.94%, a recall of 71.31%, and an F1-score of 71.50%. The Arabic-specific LLM ALLaM achieves 84.41% accuracy on ArabicMWPs and 43.97% on AGSM8K. We further conduct fine-tuning experiments on Arabic mathematical data. This work addresses a critical gap in Arabic mathematical reasoning resources and provides insights for developing Arabic-capable AI systems. Our results suggest that prompt engineering combined with LLMs is a strong approach to solving Arabic mathematical problems, with potential applications in education and scientific research.
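
As a rough illustration of how Self-consistency and accuracy scoring can fit together, the sketch below takes several sampled completions for one problem, extracts the final number from each, and keeps the majority answer. It is a minimal sketch under stated assumptions, not the evaluation code used in this paper: the function names (extract_answer, self_consistent_answer, accuracy) are hypothetical, the rule "the last number in the completion is the answer" is an assumption, and the completions are assumed to have already been sampled from an LLM.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumption: answers may use Arabic-Indic digits; map them to ASCII so both forms compare equal.
ARABIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number found in a completion, or None if there is none."""
    # Normalize digits and drop ASCII thousands separators before matching.
    text = completion.translate(ARABIC_DIGITS).replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def self_consistent_answer(completions: List[str]) -> Optional[str]:
    """Majority vote over the final answers of several sampled reasoning chains."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: List[Optional[str]], references: List[str]) -> float:
    """Fraction of problems whose predicted number equals the reference number."""
    correct = sum(
        p is not None and float(p) == float(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical sampled completions for a single problem.
samples = [
    "نجمع ١٠ و ٨ إذن الإجابة هي ١٨",
    "10 + 8 = 18, so the answer is 18",
    "The answer is 20",
]
voted = self_consistent_answer(samples)
```

Normalizing digits before voting matters here because two chains that agree numerically would otherwise be counted as different answers when one is written with Arabic-Indic digits and the other with ASCII digits.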