Localizing Events in Space: Comparing Humans and AI Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Understanding how Large Language Models (LLMs) and Text-to-Image models (T2Is) acquire and apply implicit spatial knowledge remains an open challenge. In this paper, we present a novel dataset and evaluation framework designed to probe event localization capabilities in humans, LLMs, and T2Is. Our dataset comprises 134 sentence pairs derived from Flickr30k captions, in which explicit location information is systematically removed via Abstract Meaning Representation (AMR) parsing and manual refinement. Using this dataset, we analyze the effects of location ablation on spatial reasoning across human annotators, LLMs, and T2Is. Results show that while humans maintain robust location inferences after ablation, LLMs exhibit degraded performance, particularly for polysemous verbs. T2Is demonstrate similar limitations, often generating visually inconsistent spatial contexts when locative cues are missing. Our findings highlight the gap between humans and both LLMs and T2Is in recovering implicit situational knowledge, and they suggest directions for improving spatial reasoning in multimodal AI systems. This dataset contribution serves as a proof of concept for the systematic evaluation of implicit spatial reasoning and paves the way for larger-scale studies.
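To make the location-ablation step concrete, the sketch below shows one possible implementation using the penman library; this is an illustrative assumption, as the abstract does not name the AMR tooling used, and the example sentence and frame labels are hypothetical.

```python
# Illustrative sketch (not the paper's pipeline): remove :location
# edges from an AMR graph using the penman library.
import penman
from penman.graph import Graph

def ablate_location(amr_string: str) -> str:
    """Return the AMR string with :location edges and their
    (assumed shallow) location subtrees removed."""
    g = penman.decode(amr_string)
    # Variables that appear as targets of a :location edge.
    loc_targets = {t[2] for t in g.triples if t[1] == ':location'}
    # Keep every triple that is neither a :location edge nor rooted
    # at a removed location node.
    kept = [t for t in g.triples
            if t[1] != ':location' and t[0] not in loc_targets]
    return penman.encode(Graph(kept))

# Hypothetical caption: "A man plays guitar in the park."
amr = """
(p / play-01
   :ARG0 (m / man)
   :ARG1 (g / guitar)
   :location (p2 / park))
"""
print(ablate_location(amr))
# Encodes the event without its explicit locative cue.
```

In practice the paper also applies manual refinement after parsing, so a sketch like this would at most approximate the automatic half of the ablation procedure.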