LREC 2026 workshop

Conplext 1.0: A Multilingual Lexical Complexity Prediction Dataset for L2 Learning

Proceedings of the 2nd Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt! 2026)

DOI:10.63317/4wbqzig7hgaw

Abstract

This paper presents Conplext 1.0, a multilingual dataset designed for lexical complexity prediction in the context of second language (L2) learning. The resource covers 3,901 sentence contexts for 1,000 vocabulary items across five languages (English, French, Spanish, Swedish, and Dutch), each aligned with Common European Framework of Reference (CEFR) proficiency levels. Contexts were generated using a generative large language model and subsequently filtered for pedagogical suitability. A large-scale best–worst scaling (BWS) annotation experiment is being conducted with L2 learners to derive continuous, learner-informed lexical complexity values. The resulting dataset enables the development of context-aware word difficulty models that account for variation across both languages and learning stages. In addition to its primary use in lexical complexity prediction, Conplext provides valuable opportunities for research in word sense disambiguation, generative model evaluation, and adaptive language learning applications. By integrating computational and educational perspectives, this work advances the study of lexical difficulty in multilingual language learning environments.
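As an illustration of how best–worst scaling judgments can be turned into continuous values, the following is a minimal sketch of the common counting procedure, where an item's score is the number of times it was chosen as "best" minus the number of times it was chosen as "worst", divided by the number of times it was shown. The abstract does not say which scoring procedure Conplext uses, so this method, the function name `bws_scores`, the tuple layout, and the toy words are all assumptions made for illustration. Here "best" stands for "most complex" and "worst" for "least complex".

```python
from collections import defaultdict


def bws_scores(annotations):
    """Counting-based BWS score per item: (#best - #worst) / #appearances.

    `annotations` is an iterable of (items_shown, best_item, worst_item).
    This is a generic sketch, not the procedure used by the dataset authors.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, best_item, worst_item in annotations:
        for item in items:
            seen[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    # Scores fall in [-1, 1]; higher means judged more complex more often.
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical toy judgments: (items shown, most complex, least complex).
toy = [
    (("cat", "ubiquitous", "house", "negotiate"), "ubiquitous", "cat"),
    (("cat", "ubiquitous", "river", "negotiate"), "ubiquitous", "cat"),
    (("house", "river", "negotiate", "cat"), "negotiate", "house"),
]
scores = bws_scores(toy)
```

In a setting like the one described above, each item would be a word in a specific sentence context rather than a bare word, so the same vocabulary item could receive different scores in different contexts.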

Details

Paper ID
lrec2026-ws-determit-03
Pages
pp. 22-32
BibKey
alfter-etal-2026-conplext
Editors
Giorgio Maria Di Nunzio, Federica Vezzani, Liana Ermakova, Hosein Azarbonyad, Jaap Kamps
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 2nd Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt! 2026)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • David Alfter

  • Jasper Degraeuwe

Links