LongTailQA: Benchmarking LLMs and RAG Models on Disambiguated Long-Tail Entities
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large Language Models (LLMs) struggle to memorize long-tail facts. Retrieval-Augmented Generation (RAG) models achieve better performance on long-tail Question Answering (QA) by offloading factual recall to external knowledge sources. We demonstrate that popular QA benchmarks such as PopQA, WITQA, and EntityQA contain significant entity ambiguity, with 8-30% of long-tail questions referencing entities with non-unique names. This ambiguity confounds evaluation and obscures true model capabilities. To enable robust benchmarking, we disambiguate these questions using the Wikipedia knowledge graph and develop LongTailQA, an improved QA benchmark that mitigates entity ambiguity in long-tail entity questions. We evaluate a range of recent LLMs and RAG models, such as Self-RAG and InstructRAG, and investigate the impact of retriever quality and retrieval depth on QA performance. We observe that: (i) disambiguation improves model accuracy by up to 24.7%, (ii) RAG models benefit significantly more from disambiguation than vanilla LLMs, (iii) simply increasing retrieval depth does not improve RAG performance, and (iv) RAG models achieve high accuracy when provided with perfect information, highlighting the need to filter noisy documents during retrieval. The LongTailQA benchmark facilitates robust evaluation of long-tail knowledge recall and RAG system effectiveness. We make the codebase and datasets publicly available at https://github.com/williamx854/LongTailQA-Benchmark.