Quadratic Weighted Kappa Is Not Enough for Evaluating Automated Essay Scoring Models

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3co8wwdqqyf6

Abstract

Quadratic Weighted Kappa (QWK) has been the standard evaluation metric in Automated Essay Scoring (AES) research for over two decades. Despite repeated criticisms highlighting its limitations, the community has largely continued to rely on QWK without adopting alternative metrics. This study aims to encourage a shift toward more suitable evaluation practices by systematically examining QWK’s behavior under three key conditions: dataset size, class imbalance, and score range. Using both a publicly available AES dataset and carefully synthesized datasets, we demonstrate scenarios where QWK produces unstable or misleading results. Our findings highlight the need for more robust evaluation practices and point to alternative metrics, particularly variants of Gwet’s AC2, that offer greater reliability across a variety of conditions.
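As a rough illustration of the class-imbalance concern raised in the abstract, the sketch below computes QWK from its standard definition on two toy score lists. It is not the paper's experimental setup: the function name `quadratic_weighted_kappa` and the toy human/system scores are assumptions made up for this example. Both toy cases have 90% exact agreement, yet the skewed one collapses to a QWK of zero.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater_a)
    scores = range(min_score, max_score + 1)
    observed = Counter(zip(rater_a, rater_b))  # joint counts O[i, j]
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in scores:
        for j in scores:
            # The usual 1 / (k - 1)^2 normalizer cancels in the ratio.
            weight = (i - j) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_a[i] * hist_b[j] / n
    if den == 0.0:
        # Undefined when both raters assign one and the same score throughout.
        return float("nan")
    return 1.0 - num / den


# Toy data (hypothetical): ten essays scored on a 1-3 scale.
# Balanced case: one disagreement out of ten.
balanced_human = [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]
balanced_system = [1, 1, 1, 1, 3, 3, 3, 3, 3, 3]

# Skewed case: also one disagreement out of ten, but nearly all essays
# share one score and the system always predicts that score.
skewed_human = [3, 3, 3, 3, 3, 3, 3, 3, 3, 1]
skewed_system = [3] * 10

balanced_qwk = quadratic_weighted_kappa(balanced_human, balanced_system, 1, 3)
skewed_qwk = quadratic_weighted_kappa(skewed_human, skewed_system, 1, 3)
print(f"balanced: {balanced_qwk:.2f}, skewed: {skewed_qwk:.2f}")
```

In the skewed case the observed weighted disagreement equals the disagreement expected by chance from the marginals, so the ratio is one and QWK drops to zero even though nine of ten scores match exactly.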

Details

Paper ID
lrec2026-main-348
Pages
pp. 4447-4456
BibKey
albatarni-etal-2026-quadratic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Salam Albatarni

  • Tamer Elsayed
