LREC 2026 Main Conference

VDAct 2.0: Scaling Video-Grounded Dialogue for Event-driven Activity Understanding with LLM-Assisted Filtering

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4vcv4ncvs6xx

Abstract

We present VDAct 2.0, an enhanced benchmark for video-grounded dialogue that builds upon the original VDAct by expanding dialogue coverage and introducing a scalable LLM-assisted filtering pipeline to ensure high-quality, grounded QA pairs. VDAct 2.0 comprises 6,356 human-annotated dialogues with a total of 63,958 turns, grounded in 2,975 household activity videos, with undesirable dialogue turns systematically identified and removed. To achieve this, we design a trigger-based quality framework and calibrate a panel of high-agreement LLMs with humans in the loop, enabling scalable QA-turn-level filtering. We benchmark a wide range of pretrained and fine-tuned models, both open-source and proprietary, across standard text generation metrics and LLM-based evaluations. The results highlight both recent advances and remaining challenges in video-grounded dialogue modeling, positioning VDAct 2.0 as a high-fidelity testbed for evaluating and advancing multimodal reasoning in interactive settings.

Details

Paper ID
lrec2026-main-208
Pages
pp. 2653-2666
BibKey
imrattanatrai-etal-2026-vdact
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Wiradee Imrattanatrai
  • Masaki Asada
  • Kimihiro Hasegawa
  • Ken Fukuda
  • Teruko Mitamura
