Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/32huzuuokpfr

Abstract

This paper presents an approach to multilingual alignment for medieval languages, focusing on the prior step of "phrase" segmentation. It outlines the challenges posed by historical data and describes different strategies for segmenting texts in multiple languages. It releases a gold-standard segmentation corpus based on various literary and historical works from the late Middle Ages in Europe. This corpus consists of texts in seven medieval languages (French, Castilian, Catalan, Portuguese, Latin, Italian, English). Several architectures are tested with both in-domain and out-of-domain evaluation sets.

Details

Paper ID
lrec2026-main-072
Pages
pp. 936-946
BibKey
ing-etal-2026-phrase
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Lucence Ing

  • Matthias Gille Levenson

  • Carolina Macedo

Links