VideoEvent: Leveraging Relevance and LLMs for Video Question Answering
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We propose VideoEvent, a lightweight, training-free framework for Video Question Answering (VQA) with large language models (LLMs). Although several training-free VQA methods have been proposed, they often neglect the temporal dependencies between frames or clips, treating them as isolated units, and they rely on complex or resource-intensive components. To address these limitations while preserving performance and simplicity, VideoEvent segments an input video into question-relevant temporal events and selectively supplements them with low-level visual cues such as background and object layout. Our method selects semantically relevant time spans and retrieves one representative background frame to enrich the LLM prompt. This design minimizes reliance on additional tools and reduces inference cost, making the framework well suited to practical deployment. Experimental results on EgoSchema and NExT-QA show that VideoEvent reduces inference cost by up to 30% while maintaining state-of-the-art accuracy, and that its background module improves accuracy by 1–3% across multiple frameworks.
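To make the pipeline described in the abstract concrete, the following is a minimal sketch of the event-selection and background-retrieval flow under stated assumptions: `embed_text` (here a toy bag-of-characters encoder), the `top_k` value, and the midpoint-based frame choice are hypothetical placeholders for illustration, not the paper's actual models or settings.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float   # event start time (seconds)
    end: float     # event end time (seconds)
    caption: str   # short textual description of the event

def embed_text(text: str) -> list[float]:
    """Hypothetical text encoder: a toy bag-of-characters vector so the
    sketch runs end to end; a real system would use a sentence embedder."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_relevant_events(events: list[Event], question: str, top_k: int = 3) -> list[Event]:
    """Keep the top-k events whose captions are most similar to the question,
    then restore temporal order so event dependencies are preserved."""
    q = embed_text(question)
    scored = sorted(events, key=lambda e: cosine(embed_text(e.caption), q), reverse=True)
    return sorted(scored[:top_k], key=lambda e: e.start)

def pick_background_timestamp(events: list[Event]) -> float:
    """Stand-in for background retrieval: the midpoint of the earliest
    selected event. The actual module selects one representative frame
    capturing scene background and object layout."""
    first = min(events, key=lambda e: e.start)
    return (first.start + first.end) / 2.0

def build_prompt(events: list[Event], question: str, bg_description: str) -> str:
    """Assemble the enriched LLM prompt: background cue, ordered events, question."""
    lines = [f"Background: {bg_description}"]
    lines += [f"[{e.start:.0f}s-{e.end:.0f}s] {e.caption}" for e in events]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

if __name__ == "__main__":
    events = [
        Event(0, 12, "a person enters the kitchen"),
        Event(12, 30, "the person chops vegetables on a board"),
        Event(30, 45, "the person stirs a pot on the stove"),
    ]
    question = "What does the person chop on the board?"
    relevant = select_relevant_events(events, question, top_k=2)
    t = pick_background_timestamp(relevant)
    print(build_prompt(relevant, question,
                       bg_description=f"frame at {t:.0f}s shows a kitchen scene"))
```

Because relevance scoring happens once per question and only one background frame is retrieved, the prompt stays short, which is the source of the inference-cost savings the abstract reports.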