Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing "Noise" in Large Textual Data

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/249h3r9tiw52

Abstract

This work introduces a dialect-aware text filtering framework to pre-process, clean, and enhance large text corpora, creating variety-specific sub-corpora for neglected language varieties. We apply our framework to Kurdish, a language with rich dialectal diversity that poses significant challenges for Natural Language Processing because of its low-resource status and the noisy nature of the available text corpora. Leveraging lexicographic features, we assign multi-language labels to text instances and synthesize over 130 dialect-specific corpora from large "noisy" datasets containing unlabeled mixtures of Kurdish varieties. To our knowledge, this is the largest collection of dialect-specific Kurdish NLP resources to date. This work contributes to the foundations of low-resource language technology, especially dialect-specific NLP applications. Specifically, we advance research on Kurdish languages by providing insights into the linguistic relationships among Kurdish varieties.

Details

Paper ID
lrec2026-main-116
Pages
pp. 1505-1519
BibKey
schuler-etal-2026-dialectal
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Christian Schuler
  • Raman Ahmad
  • Ānrán Wáng
  • Daniil Gurgurov
  • Timo Baumann
  • Simon Ostermann
  • Josef van Genabith

Links