From Generation to Evaluation: A Resource for Error-Categorized Question Generation from Video Transcripts
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
A key challenge in automated question generation is producing questions that are grammatically correct, error-free, and contextually relevant. While large language models already handle this well, smaller models that can run on consumer-grade hardware face greater difficulties. Another obstacle is the lack of large, high-quality datasets, particularly of educational video transcripts, which limits the diversity and applicability of training data. Moreover, current evaluation methods rely either on strict comparison to a "ground truth", which undervalues valid but unmatched questions, or on expert judgments, which do not scale; neither provides insight into the nature of the errors. In this paper, we introduce a dataset of real-life educational video transcripts and investigate the question-generation capabilities of small language models by assessing their output against predefined error categories. Building on this assessment, we present a novel approach to automatic quality assessment that classifies generated questions into these predefined error categories. We show that questions generated by small language models remain prone to errors. Our proposed classification approach outperforms the baselines and matches GPT-5, reaching an accuracy of 72%.
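To make the error-category classification described above concrete, the following is a minimal, hypothetical sketch of a zero-shot classifier over generated questions. It is not the paper's actual method: the category names, the facebook/bart-large-mnli model choice, and the classify_question helper are all illustrative assumptions.

# Hypothetical sketch: assign an error category to a generated question.
# The category set below is a placeholder; the paper defines its own
# predefined error categories, which are not reproduced here.
from transformers import pipeline

ERROR_CATEGORIES = [  # placeholder labels, assumed for illustration
    "grammatically incorrect",
    "not answerable from the transcript",
    "contextually irrelevant",
    "error-free",
]

# Any NLI-based zero-shot model would do for this sketch.
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
)

def classify_question(transcript_excerpt: str, question: str) -> str:
    """Return the most likely error category for a generated question,
    judged against the transcript excerpt it was generated from."""
    sequence = f"Transcript: {transcript_excerpt}\nQuestion: {question}"
    result = classifier(sequence, candidate_labels=ERROR_CATEGORIES)
    return result["labels"][0]  # labels come back sorted by score

if __name__ == "__main__":
    excerpt = "Photosynthesis converts light energy into chemical energy."
    question = "What does photosynthesis converts light into?"
    print(classify_question(excerpt, question))  # likely a grammar error

In practice, a supervised classifier fine-tuned on category-annotated questions would be a more faithful stand-in for the proposed approach; the zero-shot pipeline is used here only to keep the sketch self-contained.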