LREC 2026 workshop

Align and Shine: Building High-quality Sentence-aligned Corpora for Multilingual Text Simplification

Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

DOI:10.63317/55pt8xqgkge6

Abstract

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on collecting and processing crowd-sourced simplification data to construct a corpus suitable for both training and testing text simplification systems in five languages (Catalan, English, French, Italian and Spanish). We describe methods for deriving sentence-level alignments from document-level data. The resulting dataset of aligned sentence pairs is publicly available.

Details

Paper ID
lrec2026-ws-bucc-08
Pages
pp. 62-71
BibKey
hilasacasanchez-etal-2026-align
Editors
Reinhard Rapp, Ayla Rigouts Terryn, Serge Sharoff, Pierre Zweigenbaum
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Luis Kenji Hilasaca Sanchez

  • Nouran Khallaf

  • Serge Sharoff
