
A Joint Detection Framework for Latvian Loanwords and Calques Using Monolingual Data

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2f6e35y4dgkn

Abstract

Lexical borrowing is pervasive across languages with extensive cultural contact, yet its automatic detection remains challenging for low-resource languages, especially regarding calques. Existing methods depend heavily on bilingual resources and focus almost exclusively on phonological loanwords, leaving structural borrowing phenomena like calques largely unaddressed by automated tools. This paper proposes a novel joint binary classification pipeline based solely on monolingual data and mBERT, introducing the first large-scale annotated Latvian borrowing dataset with over 3,000 manually labeled entries across three categories: loanwords, calques, and local words. The pipeline adopts a staged decision process grounded in language contact theory, separating surface-level loanwords before tackling the more ambiguous calque category. Experiments demonstrate that our semi-supervised strategy with pseudo-labeling achieves a macro-F1 of 0.854 on an external test set, outperforming both a direct three-way classifier and a GPT-4o zero-shot baseline. These results establish a performance benchmark for the previously unaddressed task of automatic borrowing detection in Latvian, providing empirical tools for borrowing detection in resource-scarce contexts.
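The staged decision process and the macro-F1 metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two mBERT-based binary classifiers are replaced by arbitrary callables, and the names `staged_label`, `macro_f1`, `is_loanword`, and `is_calque` are hypothetical, introduced here only for illustration.

```python
from typing import Callable, Sequence

# The three categories named in the abstract.
LABELS = ("loanword", "calque", "local")


def staged_label(
    word: str,
    is_loanword: Callable[[str], bool],  # stage 1: stand-in for a binary classifier
    is_calque: Callable[[str], bool],    # stage 2: stand-in for a binary classifier
) -> str:
    """Chain two binary decisions into a three-way label.

    Stage 1 filters out loanwords; only the remaining words
    reach the calque-versus-local decision in stage 2.
    """
    if is_loanword(word):
        return "loanword"
    return "calque" if is_calque(word) else "local"


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = LABELS,
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted items contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With placeholder tokens rather than real Latvian data, `staged_label("w1", lambda w: w == "w1", lambda w: w == "w2")` returns `"loanword"`, because the first stage fires before the second is consulted. Macro averaging weights each of the three classes equally, so a small class such as calques affects the score as much as a large one.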

Details

Paper ID
lrec2026-main-798
Pages
pp. 10157-10167
BibKey
zhang-etal-2026-joint
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Yelingyun Zhang

  • Atis Kapenieks

  • Marina Platonova

Links