MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations. A robust benchmark dataset must satisfy three key criteria: quality, ensuring complete and reliable ground truth (GT) responses; diversity, expanding semantic coverage to prevent overfitting; and difficulty, capturing the complexity of reasoning based on hops and the distribution of supporting evidence. However, current benchmarks lack a systematic approach to defining and controlling query difficulty at a fine-grained level. To address this, we propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that systematically controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries. Our fine-grained difficulty estimation formula exhibits a strong correlation with the overall performance metrics of a RAG system, validating its effectiveness in assessing both retrieval and answer generation capabilities. By ensuring high-quality, diverse, and difficulty-controlled queries, our approach enhances RAG evaluation and benchmarking capabilities. This work contributes to the development of more reliable, efficient, and adaptable AI-driven research assistants, facilitating advancements in document-based reasoning and retrieval tasks.