Parallel Sentence Filtering for Low-Resource Language Pairs: A Case Study for Upper Sorbian, German, and Czech
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Parallel corpora for low-resource languages are scarce, and automatic approaches to mining sentence pairs can yield noisy datasets; parallel sentence filtering therefore aims to retain only genuine translations. We study two language pairs, Upper Sorbian–German and Czech–German, to represent both low and high availability of data resources. To evaluate filtering performance, we generate synthetic datasets by combining existing parallel corpora with synthetic non-parallel pairs, notably created through five types of local semantic changes on the German side, such as negation or modality transformations. We represent sentences using three multilingual language models, XLM-R, Glot500m, and LaBSE, and train classifiers for the task. All three model representations led to worse filtering quality when pairs were altered more subtly, for example by antonym replacement. We nevertheless observed that a language model pre-trained on the language under consideration achieves more robust classification performance when sentence pairs are more ambiguous. We also evaluated a cross-lingual approach in which the classifier is trained on the Czech–German pair and then applied to the Upper Sorbian–German pair. Such language transfer paves the way for filtering other low-resource language pairs in the future.
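To make the filtering setup concrete, the following is a minimal illustrative sketch of the general approach described above: embedding the two sides of each candidate pair with a multilingual sentence encoder and training a binary parallel/non-parallel classifier. The LaBSE checkpoint name, the toy sentence pairs, the concatenation-plus-difference feature construction, and the logistic-regression classifier are assumptions for illustration only, not the exact pipeline used in the paper.

```python
# Illustrative sketch (not the paper's exact pipeline): embed sentence pairs
# with LaBSE and train a binary parallel/non-parallel classifier.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: (source, target, label) triples, label 1 = parallel.
pairs = [
    ("Dobre ranje!", "Guten Morgen!", 1),            # parallel pair
    ("Dobre ranje!", "Ich habe keinen Hunger.", 0),   # non-parallel pair
]

model = SentenceTransformer("sentence-transformers/LaBSE")

# Encode both sides of every candidate pair.
src_emb = model.encode([p[0] for p in pairs], normalize_embeddings=True)
tgt_emb = model.encode([p[1] for p in pairs], normalize_embeddings=True)

# Simple pair representation: concatenation plus element-wise absolute difference.
features = np.hstack([src_emb, tgt_emb, np.abs(src_emb - tgt_emb)])
labels = np.array([p[2] for p in pairs])

clf = LogisticRegression(max_iter=1000).fit(features, labels)

# At filtering time, keep only the candidate pairs the classifier accepts.
print(clf.predict(features))
```

In practice, the choice of pair representation and classifier can matter as much as the encoder itself; the sketch above only fixes one plausible combination so the overall train-then-filter workflow is easy to follow.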