A Joint Detection Framework for Latvian Loanwords and Calques Using Monolingual Data
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Lexical borrowing is pervasive among languages in extensive cultural contact, yet its automatic detection remains challenging for low-resource languages, especially for calques. Existing methods depend heavily on bilingual resources and focus almost exclusively on phonological loanwords, leaving structural borrowing phenomena such as calques largely unaddressed by automated tools. This paper proposes a novel joint binary classification pipeline based solely on monolingual data and mBERT, and introduces the first large-scale annotated Latvian borrowing dataset, with over 3,000 manually labeled entries across three categories: loanwords, calques, and local words. The pipeline adopts a staged decision process grounded in language contact theory: it first separates surface-level loanwords and then tackles the more ambiguous calque category. Experiments show that our semi-supervised strategy with pseudo-labeling achieves a macro-F1 of 0.854 on an external test set, outperforming both a direct three-way classifier and a GPT-4o zero-shot baseline. These results establish a performance benchmark for the previously unaddressed task of automatic borrowing detection in Latvian and provide empirical tools for borrowing detection in resource-scarce contexts.
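As a rough illustration of the staged decision process and the macro-F1 metric mentioned above, the following minimal sketch chains two binary decisions into a three-way label. It is not the authors' implementation: the scorer functions, the 0.5 threshold, the placeholder tokens, and all function names are assumptions made for illustration, and in practice each scorer would presumably be a fine-tuned mBERT classifier.

```python
from typing import Callable, Dict, List

# Hypothetical stage scorer: maps a word to a probability in [0, 1].
# In a real system this would wrap a trained binary classifier.
StageScorer = Callable[[str], float]

LABELS = ["loanword", "calque", "local"]


def staged_label(word: str,
                 loanword_score: StageScorer,
                 calque_score: StageScorer,
                 threshold: float = 0.5) -> str:
    """Stage 1 separates loanwords; stage 2 splits the rest into calque vs. local."""
    if loanword_score(word) >= threshold:
        return "loanword"
    if calque_score(word) >= threshold:
        return "calque"
    return "local"


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for the two binary classifiers (placeholder tokens, not real data).
toy_loan: Dict[str, float] = {"w1": 0.9, "w2": 0.1, "w3": 0.2, "w4": 0.3}
toy_calque: Dict[str, float] = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

words = ["w1", "w2", "w3", "w4"]
gold = ["loanword", "calque", "local", "local"]
pred = [staged_label(w, toy_loan.get, toy_calque.get) for w in words]
score = macro_f1(gold, pred, LABELS)
```

In this toy run the second-stage scorer misses the one calque, so the calque class contributes an F1 of zero to the average; this shows why macro-F1 is sensitive to performance on the smaller, harder category.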