
Translation as Augmentation: Effect of Translated Data on Assessment of Difficulty

Proceedings of the 2nd Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt! 2026)

DOI:10.63317/3mmqujpgyua5

Abstract

Reliable Text Difficulty Assessment is a prerequisite for valid text simplification workflows and personalized learning applications. However, the development of robust assessment models is severely hindered by a critical bottleneck: the scarcity of expert-annotated corpora containing fine-grained difficulty levels (e.g., CEFR), particularly for lower-resource languages. This paper addresses this data scarcity problem in the context of a low-resource European language. We propose a cross-lingual data augmentation strategy that leverages machine translation to transfer labeled resources from high-resource languages to the target low-resource language. We train BERT-based regression models to predict difficulty scores and investigate whether synthetic, translated data can effectively supplement native training sets. Our experiments demonstrate that augmenting scarce native data with machine-translated corpora significantly improves the accuracy of difficulty estimation, offering a viable solution for languages lacking extensive expert annotations.
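The augmentation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `LabeledText` type, the `augment_with_translations` helper, the numeric difficulty scale, and the stand-in `toy_translate` function are all hypothetical names introduced here. The sketch also assumes that a difficulty label carries over unchanged when a text is translated, which the abstract does not state. A real pipeline would replace `toy_translate` with an actual machine translation system and pass the combined set to a regression model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LabeledText:
    """A text with a numeric difficulty score (hypothetical scale)."""
    text: str
    difficulty: float
    synthetic: bool = False  # True when the text came from machine translation


def augment_with_translations(
    native: List[LabeledText],
    foreign: List[LabeledText],
    translate: Callable[[str], str],
) -> List[LabeledText]:
    """Return native examples plus translated foreign examples.

    Assumption: the source-language difficulty label is reused as-is
    for the translated text.
    """
    translated = [
        LabeledText(translate(ex.text), ex.difficulty, synthetic=True)
        for ex in foreign
    ]
    return list(native) + translated


def toy_translate(text: str) -> str:
    """Placeholder for a real MT system; only tags the text."""
    return f"[translated] {text}"


# Scarce native data in the target language (toy example).
native = [LabeledText("native sentence", 2.0)]
# Labeled data from a high-resource language (toy examples).
foreign = [
    LabeledText("The cat sat.", 1.0),
    LabeledText("Notwithstanding the caveats, the claim holds.", 5.0),
]

train_set = augment_with_translations(native, foreign, toy_translate)
```

Keeping a `synthetic` flag on each example makes it straightforward to compare a model trained on native data alone against one trained on the combined set, which is the comparison the abstract describes.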

Details

Paper ID
lrec2026-ws-determit-05
Pages
pp. 42-50
BibKey
wu-etal-2026-translation
Editors
Giorgio Maria Di Nunzio, Federica Vezzani, Liana Ermakova, Hosein Azarbonyad, Jaap Kamps
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 2nd Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt! 2026)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Yiheng Wu
  • Jue Hou
  • Roman Yangarber
