HotelCheckSpan: A Benchmark Dataset for LLM Faithfulness
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Hallucinations are among the most persistent and challenging issues in large language model (LLM) outputs. This is especially true in domains that mix objective and subjective content, such as hotel descriptions, which are intended to serve as enticing advertisements. Distinguishing between factual errors and interpretive exaggeration is often subtle, complicating both human and automated evaluation. To address this, we present HotelCheckSpan, the first span-level faithfulness dataset for the hotel domain. Each example aggregates one or more hotel descriptions, and the corresponding summaries carry human annotations with three error types: Incorrect, Misleading, and Not Checkable. By marking the precise spans where errors occur, the dataset captures fine-grained information about the nature of hallucinations and factual inconsistencies. In addition to human annotations, we collect span-level judgments from multiple LLMs, enabling direct human–model comparisons. Our analysis shows that inter-annotator agreement varies substantially across aggregation levels: example-level agreement can mask subtle span-level disagreements, while soft and hard F1 variants highlight discrepancies in both span placement and error categorization. HotelCheckSpan provides a benchmark for studying ambiguity and disagreement, validating automatic faithfulness metrics, and evaluating LLMs as judges, offering a rich resource for research on faithfulness, subjectivity, and annotation practices in mixed-content domains.
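To make the distinction between the two agreement variants concrete, the sketch below implements one common reading of hard (exact-match) and soft (overlap-weighted) span F1 in Python. The span representation, the label-matching rule, and the length normalization here are illustrative assumptions, not necessarily the exact definitions used in the paper.

```python
from typing import List, Tuple

# Illustrative span representation: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]

def _overlap(a: Span, b: Span) -> int:
    """Character overlap between two spans; zero unless labels agree."""
    if a[2] != b[2]:
        return 0
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def hard_f1(pred: List[Span], gold: List[Span]) -> float:
    """Exact-match F1: a prediction counts only if boundaries and label agree."""
    if not pred or not gold:
        return 0.0
    tp = sum(1 for p in pred if p in gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def soft_f1(pred: List[Span], gold: List[Span]) -> float:
    """Overlap-based F1: each span earns partial credit for its best
    same-label overlap, normalized by its own length."""
    if not pred or not gold:
        return 0.0
    precision = sum(
        max(_overlap(p, g) for g in gold) / (p[1] - p[0]) for p in pred
    ) / len(pred)
    recall = sum(
        max(_overlap(g, p) for p in pred) / (g[1] - g[0]) for g in gold
    ) / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical example: the first prediction overlaps a gold span but misses
# its left boundary; the second matches boundaries but uses the wrong label.
gold = [(10, 25, "Incorrect"), (40, 55, "Misleading")]
pred = [(12, 25, "Incorrect"), (40, 55, "Not Checkable")]
print(hard_f1(pred, gold))  # 0.0 — no exact boundary+label match
print(soft_f1(pred, gold))  # ~0.46 — partial credit for the boundary mismatch
```

Under this reading, hard F1 drops to zero for any boundary or label disagreement, while soft F1 still rewards approximate span placement, which is why the two variants expose different kinds of annotator disagreement.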