Rebelòt: Datasets and Token-Level Language Identification for Lombard-Italian-English Code-Mixing
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Abstract
Lombard is an endangered and under-resourced Gallo-Italic language variety that coexists with Standard Italian. As with other language varieties of Italy, code-switching and code-mixing are common between Lombard and Italian in everyday conversation and, online, with English as well. This linguistic complexity, together with the lack of a unified written standard, poses challenges for Natural Language Processing tools. We introduce Rebelòt, a novel multi-domain, token-level annotated dataset for Lombard-Italian-English code-mixing. Furthermore, we develop and evaluate three variants of a token-level Language Identification (LID) tool based on a pre-trained encoder architecture, fine-tuned using both authentic data from our corpus and synthetically generated code-mixed text. Our evaluation shows that the best model variant achieves an accuracy of over 0.99 on token-level prediction and substantially outperforms widely used off-the-shelf LID baselines at the sentence level.