Detecting Hallucinations in Authentic LLM–Human Interactions
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most are artificially constructed, either through deliberate hallucination induction or through simulated interactions, rather than derived from genuine LLM–human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that arise in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM–human interactions. To construct AuthenHallu, we select and annotate samples from genuine LLM–human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query–response pairs in our benchmark, and that this proportion rises sharply to 60.0% in challenging domains such as 'Math & Number Problems'. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient for real-world scenarios. The data and code are publicly available at https://github.com/TAI-HAMBURG/AuthenHallu.