Evaluating the Effect of Question Wording Variations on Answer Consistency in Large Language Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large Language Models (LLMs) sometimes generate inconsistent answers when asked semantically equivalent questions expressed in different wordings. Such inconsistency can degrade task performance or lead to excessive agreement with users. This study investigates how question wording influences the answer consistency of LLMs, focusing on binary Yes/No questions. We design four paraphrasing patterns, namely synonym substitution, antonym substitution, the addition of agreement-seeking expressions, and the strengthening of agreement-seeking expressions, and evaluate their impact on model outputs. Experiments with multiple open-source and commercial LLMs show that many models become more likely to answer "Yes" when agreement-seeking expressions are included, and that they are particularly vulnerable to antonym substitution. Our analysis further suggests that some of these tendencies are already present in pretrained models and are not fully removed by post-training. We also provide insights into which factors are likely, and which are unlikely, to contribute to improving consistency. By providing a systematic evaluation framework, this work highlights the need to account for wording-induced biases in the development and deployment of LLMs.