Quadratic Weighted Kappa Is Not Enough for Evaluating Automated Essay Scoring Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Quadratic Weighted Kappa (QWK) has been the standard evaluation metric in Automated Essay Scoring (AES) research for over two decades. Despite repeated criticisms of its limitations, the community has largely continued to rely on QWK without adopting alternative metrics. This study aims to encourage a shift toward more suitable evaluation practices by systematically examining QWK’s behavior under three key conditions: dataset size, class imbalance, and score range. Using both a publicly available AES dataset and carefully synthesized datasets, we demonstrate scenarios in which QWK produces unstable or misleading results. Our findings highlight the need for more robust evaluation practices and point to alternative metrics, particularly variants of Gwet’s AC2, that offer greater reliability across a variety of conditions.
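For readers unfamiliar with the metric, QWK compares the observed disagreement between two raters against the disagreement expected under independence, penalizing errors quadratically by their distance on the score scale. The following pure-Python sketch is illustrative only (it is not code from this paper; the function name and rating-range handling are our own assumptions):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, min_rating=None, max_rating=None):
    """QWK between two equal-length lists of integer ratings (illustrative sketch)."""
    if min_rating is None:
        min_rating = min(min(a), min(b))
    if max_rating is None:
        max_rating = max(max(a), max(b))
    k = max_rating - min_rating + 1  # number of score categories
    n = len(a)

    # Observed agreement matrix O[i][j]: count of essays rated i by rater a, j by rater b.
    O = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        O[x - min_rating][y - min_rating] += 1

    # Marginal histograms used to form the chance-expected matrix.
    ha = Counter(x - min_rating for x in a)
    hb = Counter(y - min_rating for y in b)

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2) if k > 1 else 0.0  # quadratic weight
            e = ha[i] * hb[j] / n  # expected count under rater independence
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den if den else 1.0
```

Perfect agreement yields 1.0, chance-level agreement yields 0.0, and systematic disagreement can drive the value negative; `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")` computes the same quantity.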