Can LLMs Understand Punchlines? LLMs' Narrative Understanding Evaluation with Short-shorts
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
In this study, we constructed a narrative comprehension benchmark using the works of Shinichi Hoshi to examine the extent to which Large Language Models (LLMs) can understand twist endings, or punchlines, in short-short stories. Specifically, story endings were categorized into six types (e.g., Revelation, Apocalypse, and Sarcasm), and a classification task was designed in which LLMs were prompted with the story text and asked to select the appropriate ending category. We collected human annotations from eight native Japanese speakers to establish a reference benchmark. Experimental comparisons were conducted across multiple LLMs (GPT-4, Claude, Gemini, and Grok), assessing their performance against human judgments at both the metric level and the discourse level. The results revealed that although certain models approached human performance on specific categories, overall accuracy remained notably lower than the human baseline. Through quantitative and qualitative analyses, this study highlights the challenges LLMs face in capturing narrative subtleties such as irony, implication, and emotional reversal. The proposed benchmark provides a novel framework for evaluating narrative understanding and the deeper semantic reasoning capabilities of LLMs.
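To make the task setup concrete, the following is a minimal sketch of the ending-category classification and its accuracy-based evaluation against human annotations. All identifiers, the prompt wording, and the three placeholder category names are illustrative assumptions, not the authors' released code or exact taxonomy (only Revelation, Apocalypse, and Sarcasm are named in the abstract).

```python
# Hypothetical sketch of the classification task described in the abstract.
# The category list, prompt text, and aggregation of annotations are assumptions.

# Three categories are named in the abstract; the remaining three are placeholders.
CATEGORIES = ["Revelation", "Apocalypse", "Sarcasm",
              "CategoryD", "CategoryE", "CategoryF"]


def build_prompt(story_text: str) -> str:
    """Ask a model to pick exactly one ending category for a short-short story."""
    options = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Read the following short-short story and classify its ending "
        f"(punchline) into exactly one of these categories:\n{options}\n\n"
        f"Story:\n{story_text}\n\n"
        "Answer with the category name only."
    )


def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    """Proportion of model predictions matching the human reference labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


if __name__ == "__main__":
    # Toy example: gold labels would come from the eight annotators
    # (e.g., by majority vote, which is an assumption here).
    preds = ["Sarcasm", "Revelation", "Apocalypse"]
    gold = ["Sarcasm", "Revelation", "Sarcasm"]
    print(f"accuracy = {accuracy(preds, gold):.2f}")  # accuracy = 0.67
```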