Consistency of LLMs under Comparative Statements in Mathematical Reasoning Tasks
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models (LLMs) have the potential to significantly expand access to quality education through applications such as mathematics tutoring. However, a key challenge is that student writing often contains redundancies, and prior research has shown that LLMs can be sensitive to such irrelevant information. This raises a critical research question: How consistent are LLMs when faced with extraneous comparative statements? To address this, we propose a systematic framework for evaluating LLM consistency. Our approach combines template-based and model-based methods to generate comparative statements (e.g., "One of the apples was tastier than average") and insert them into mathematical reasoning problems. Because the pipeline is systematic and automated, it enables rigorous assessment across models and datasets. In experiments on the GSM8K, AQuA, and Hendrycks MATH benchmarks with a suite of open-source LLMs, we report two key results. First, LLM accuracy can drop by over 30% when these statements are present. Second, we uncover a trade-off between the diversity of the generated statements and the magnitude of the performance drop: less diverse, more repetitive perturbations lead to greater accuracy degradation.
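To make the template-based half of the perturbation pipeline concrete, the sketch below shows one way an extraneous comparative statement could be generated from a template and inserted into a GSM8K-style problem. The templates, the entity slot, and the insertion rule (after the first sentence) are illustrative assumptions, not the authors' exact implementation.

```python
import random

# Hypothetical comparative-statement templates with an entity slot
# (assumed for illustration; the paper's templates may differ).
TEMPLATES = [
    "One of the {entity}s was tastier than average.",
    "Some of the {entity}s were heavier than the others.",
    "The first {entity} looked nicer than the second one.",
]

def insert_comparative_statement(question: str, entity: str, seed: int = 0) -> str:
    """Insert an irrelevant comparative statement after the problem's first sentence."""
    rng = random.Random(seed)
    statement = rng.choice(TEMPLATES).format(entity=entity)
    sentences = question.split(". ")
    # Keep the opening sentence, then splice in the extraneous statement.
    perturbed = sentences[0] + ". " + statement
    if len(sentences) > 1:
        perturbed += " " + ". ".join(sentences[1:])
    return perturbed

if __name__ == "__main__":
    q = ("Maria bought 5 apples and 3 oranges. Each apple costs $2 and each "
         "orange costs $1. How much did she spend in total?")
    print(insert_comparative_statement(q, entity="apple"))
```

The perturbed problem retains the same answer as the original, so any change in model output can be attributed to the inserted statement rather than to a change in the underlying arithmetic.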