
Can Video LLMs See Through Illusions? Video-Illusion QA Benchmark Dataset

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2s4rwea9k5ji

Abstract

Recent advances in multimodal learning have sparked growing interest in how large vision-language models interpret optical illusions. While the behavior of image LLMs (models that handle a single image and text, but not video input) on visual illusion images has been actively explored, research on their video counterparts remains limited. Video LLMs, which process sequential frames, are gaining prominence in areas such as robotics and autonomous driving. Understanding how they handle visual illusions over time is crucial for safety and may also reveal their potential as computational models of human cognition. To address this gap, we present the Video-Illusion QA Benchmark (VILQA), a novel video question answering (QA) benchmark composed mainly of carefully curated illusion videos that exhibit temporally driven perceptual phenomena. To the best of our knowledge, VILQA is the largest and most comprehensive benchmark for temporally driven visual illusions. We evaluate several video LLMs on this benchmark from multiple perspectives. Some models perceived visual illusions in a way similar to the general human experience, and some resisted illusions even more effectively than humans. The constructed dataset is available at https://github.com/SDS-NLP/VILQA.

Details

Paper ID
lrec2026-main-730
Pages
pp. 9291-9300
BibKey
ohira-etal-2026-can
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Souto Ohira
  • Tosho Hirasawa
  • Mamoru Komachi
