Reasoning Graph-Structured Question Answering: Datasets and Insights from LLM Benchmarking
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large Language Models (LLMs) have shown remarkable success in multi-hop question answering (M-QA), owing to their advanced reasoning capabilities. However, the influence of reasoning structures on their performance remains underexplored, primarily due to the lack of M-QA datasets that explicitly encode the reasoning pathways underlying each question-answer pair. To address this gap, we introduce the Graph Reasoning-Structured Question Answering dataset (GRS-QA), which provides both semantic contexts and reasoning structures for each QA pair. Unlike existing M-QA datasets, GRS-QA explicitly captures intricate reasoning pathways through reasoning graphs, where nodes correspond to textual contexts and edges denote logical flows. Using GRS-QA, we systematically evaluate LLM performance across varying context structures, prompting styles, and data domains. Our empirical analysis reveals that LLM performance varies with reasoning structure, context, and prompting style, indicating differing abilities to leverage graph-structured knowledge. Notably, providing explicit reasoning guidance proves more effective than supplying contextual information alone.
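To make the data model concrete, below is a minimal sketch of one reasoning-graph QA instance. The field names and schema are illustrative assumptions rather than the dataset's actual format: nodes hold supporting textual contexts, and directed edges encode the logical flow between reasoning steps.

```python
# Illustrative sketch of a reasoning-graph QA instance (assumed schema,
# not the official GRS-QA format): nodes carry textual contexts, and
# directed edges carry the logical flow between reasoning steps.
from dataclasses import dataclass, field

@dataclass
class ReasoningGraph:
    question: str
    answer: str
    nodes: dict[str, str]  # node id -> supporting context sentence
    edges: list[tuple[str, str]] = field(default_factory=list)  # (src, dst)

    def hops(self) -> int:
        """Number of logical-flow edges in the reasoning pathway."""
        return len(self.edges)

# A hypothetical two-hop example: answering requires chaining two facts.
example = ReasoningGraph(
    question="In which country was the author of 'Norwegian Wood' born?",
    answer="Japan",
    nodes={
        "n1": "'Norwegian Wood' was written by Haruki Murakami.",
        "n2": "Haruki Murakami was born in Kyoto, Japan.",
    },
    edges=[("n1", "n2")],
)
print(example.hops())  # -> 1 edge, i.e., a two-hop chain over two contexts
```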