Comparing Traditional and LLM-based Approaches for Automated Scoring of Dutch Writing Products
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This research examines several traditional and recent approaches for the automated grading of Dutch texts written by adolescent L1 speakers. We relied on a proprietary dataset of human-scored texts. Following recent paradigms in NLP research, we compared training a feature-based model to fine-tuning both mono- and multilingual BERT-based models and generative large language models; the latter were also prompted directly in a zero-shot setting. The results reveal that the feature-based and BERT-based approaches are promising for the task at hand and even complementary, although there is still room for improvement. The error analysis demonstrates that the generative models not only make more classification errors, but that these errors are also more problematic. We therefore conclude that generative LLMs in particular are not directly employable in this educational context.