LREC-COLING 2024 Main

Can We Learn Question, Answer, and Distractors All from an Image? A New Task for Multiple-choice Visual Question Answering

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5cipz8usy2zu

Abstract

Multiple-choice visual question answering (MC VQA) requires selecting the correct answer from a set of candidate options that includes distractors, given a question and an image. This setting has attracted wide interest from the fields of visual question answering, visual question generation, and visual distractor generation. However, these fields have remained largely separate, and how to jointly generate meaningful questions, correct answers, and challenging distractors is still unexplored. In this paper, we introduce a novel task, Visual Question-Answer-Distractors Generation (VQADG), which bridges this research gap and can serve as a cornerstone for improving existing VQA models. For the VQADG task, we present a novel framework that combines a vision-and-language model, which encodes the given image and generates question-answer-distractor (QAD) triples jointly, with contrastive learning, which enforces consistency among the generated question, answer, and distractors. Empirical evaluations on the benchmark dataset validate the performance of our model on the VQADG task.
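The abstract does not specify the exact form of the contrastive objective, and the paper's code is not reproduced here. As a hedged illustration only, the consistency idea — pulling the generated answer's representation toward the question's while pushing the distractors' representations away — can be sketched as an InfoNCE-style loss over embeddings (all function names, the temperature value, and the loss form below are assumptions, not the authors' implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def contrastive_consistency_loss(q_emb, a_emb, d_embs, tau=0.1):
    """InfoNCE-style sketch: treat the (question, answer) pair as the
    positive and the (question, distractor) pairs as negatives, so the
    loss is small when the answer aligns with the question and the
    distractors do not."""
    pos = math.exp(cosine(q_emb, a_emb) / tau)
    negs = sum(math.exp(cosine(q_emb, d) / tau) for d in d_embs)
    return -math.log(pos / (pos + negs))
```

Under this sketch, a well-formed QAD triple (answer aligned with the question, distractors dissimilar to it) yields a near-zero loss, while a triple whose "answer" is inconsistent with the question is penalized.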

Details

Paper ID
lrec2024-main-0254
Pages
pp. 2852-2863
BibKey
ding-etal-2024-learn
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Wenjian Ding
  • Yao Zhang
  • Jun Wang
  • Adam Jatowt
  • Zhenglu Yang

Links