Development of Serbian QA Datasets through Prompt-Based Generation and Human Validation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models (LLMs) that answer questions, fulfill diverse user requests, and function as chatbots rely heavily on extensive datasets. For Serbian, however, there is a significant lack of high-quality datasets in question-answer (QA) format. To address this, we extracted a portion of the SQuAD-sr dataset, which, to the best of our knowledge, is the largest QA dataset in Serbian and contains over 87k samples. Although SQuAD-sr is a valuable resource, it was translated using an adapted Translate-Align-Retrieve method and contains errors and terminological inaccuracies. In this work, we systematically reviewed and corrected more than 7k samples from SQuAD-sr, improving the reliability and quality of the reviewed portion. We call this corrected subset SQuAD-sr-md. These corrections are important for training accurate and robust QA models in Serbian. We also introduce an additional QA dataset of 74k samples, generated with LLMs from encyclopedia articles, Wikipedia pages, and scientific paper abstracts. We name this dataset SerbianQA-Gen.