LREC 2026 workshop

Rebelòt: Datasets and Token-Level Language Identification for Lombard-Italian-English Code-Mixing

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/4yids37agyxu

Abstract

Lombard is an endangered and under-resourced Gallo-Italic language variety that coexists with Standard Italian. As with other language varieties of Italy, code-switching and code-mixing are common between Lombard and Italian in everyday conversation and, online, with English as well. This linguistic complexity, together with the lack of a unified written standard, poses challenges for Natural Language Processing tools. We introduce Rebelòt, a novel multi-domain, token-level annotated dataset for Lombard-Italian-English code-mixing. Furthermore, we develop and evaluate three variants of a token-level Language Identification (LID) tool based on a pre-trained encoder architecture, fine-tuned using both authentic data from our corpus and synthetically generated code-mixed text. Our evaluation demonstrates that the best model variant achieves an accuracy of over 0.99 on token-level prediction and substantially outperforms widely used off-the-shelf LID baselines at the sentence level.

Details

Paper ID
lrec2026-ws-sigul-25
Pages
pp. 253-262
BibKey
signoroni-etal-2026-rebelòt
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Edoardo Signoroni

  • Emma Bednaříková

  • Pavel Rychly
