
Cross-Corpus CEFR Classification through Artificial Learners Perplexities

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2bvpcczgao2t

Abstract

The complexity of neural methods for automatic proficiency assessment often sacrifices interpretability and robustness. This paper presents a competitive alternative for CEFR classification using optimized statistical models with a novel perplexity-based feature engineering pipeline. We introduce LLM-derived perplexity features as a proxy for how unexpected a learner’s word choices are: native model perplexity measures unexpectedness relative to native language use, while Artificial Learner model perplexity quantifies it relative to a specific proficiency level. While recent work favors end-to-end neural architectures, we demonstrate that traditional pipelines enhanced with these interpretable perplexity features can achieve comparable performance on established benchmarks. We evaluate two transfer scenarios: zero-shot (trained on EFCAMDAT, tested on external corpora) and 90-10 split (same features, in-domain classifier training). On KUPA-KEYS, perplexity features achieve RMSE 0.707 (zero-shot) and 0.660 (90-10 split), outperforming fine-tuned BERT and prompt-based LLMs. On CELVA-SP, zero-shot perplexity shows limited generalization (RMSE 1.437 vs. the LLM’s 1.016), but statistical models close this gap in the 90-10 split (RMSE 0.872). Across all three evaluation datasets, perplexity-based models achieve the best average macro F1 in the 90-10 split (0.446 vs. 0.287 for BERT and 0.175 for prompting), demonstrating that interpretable features paired with domain-adapted classifiers provide the most robust cross-domain representations. We contribute: (1) state-of-the-art KUPA-KEYS results with interpretable models, (2) the first comprehensive CELVA-SP benchmark, and (3) evidence that feature-level transfer outperforms both end-to-end fine-tuning and zero-shot prompting.
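The abstract's core idea, scoring a learner text by its perplexity under both a native-language model and a proficiency-level ("Artificial Learner") model, and using the pair as classifier features, can be illustrated with a deliberately minimal sketch. The paper uses LLM-derived perplexities; the add-one-smoothed unigram models, the toy training sentences, and the function names below are all illustrative assumptions, not the authors' pipeline.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Fit an add-one-smoothed unigram model (stand-in for an LLM)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab)

def perplexity(prob, text):
    """Per-token perplexity of `text` under probability function `prob`."""
    toks = text.lower().split()
    nll = -sum(math.log(prob(tok)) for tok in toks) / len(toks)
    return math.exp(nll)

# Hypothetical corpora: native writing vs. writing by A2-level learners.
native_lm = train_unigram(["the results were quite surprising",
                           "we went to the park yesterday"])
learner_a2_lm = train_unigram(["i go park yesterday",
                               "he have two dog"])

# Two-dimensional feature vector for a downstream statistical classifier:
# unexpectedness w.r.t. native usage, and w.r.t. a specific CEFR level.
essay = "i go to park"
features = [perplexity(native_lm, essay), perplexity(learner_a2_lm, essay)]
```

In the paper's setting, `features` (computed from real LMs, one per proficiency level) would feed an interpretable statistical classifier rather than an end-to-end network, which is what enables the feature-level transfer evaluated in the zero-shot and 90-10 scenarios.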

Details

Paper ID
lrec2026-main-062
Pages
pp. 828-837
BibKey
stearns-etal-2026-cross
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Bernardo Stearns
  • John P. McCrae
  • Thomas Gaillat
