Can Multimodal LLMs Generate Pedagogical Questions?
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Educational materials frequently combine text, diagrams, tables, and charts to convey complex concepts. Understanding such materials often requires reasoning across modalities rather than relying solely on textual descriptions. In educational contexts, the main challenge lies in assessing the relevance and quality of the questions themselves. This raises a key issue: what defines a good question in a specialized learning environment? By comparison, evaluating answers is a more conventional task, although it requires criteria consistent with the targeted educational level. To the best of our knowledge, the use of LLMs for assessing the pedagogical relevance of questions remains unexplored. This gap highlights the need to define pedagogical relevance more clearly and to investigate the consistency of LLM judgments, as well as their alignment with human evaluations. We introduce a new multimodal QA dataset in the education domain. To reduce the need for extensive human annotation, we leverage LLMs, jointly with human annotation, to help design questions on educational material. Unlike most multimodal QA corpora, we focus on questions that a teacher could ask in class and that require reasoning over different parts of the document to be answered. Results show that while LLM-as-a-judge is an efficient framework, many problems can arise, and aligning predictions with human annotators remains difficult for complex criteria.