
MHE: Code-Mixed Corpora for Similar Language Identification

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2mhx6ooap8dt

Abstract

This paper introduces a new Magahi-Hindi-English (MHE) code-mixed data-set for similar language identification (SMLID), where Magahi is a less-resourced minority language. The corpus provides language ID labels at two levels: word and sentence. It is the first Magahi-Hindi-English code-mixed data-set for the similar language identification task. We also discuss the complexity of the data-set and provide a few baselines for the language identification task.

Details

Paper ID
lrec2022-main-366
Pages
pp. 3425-3433
BibKey
rani-etal-2022-mhe
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Priya Rani
  • John P. McCrae
  • Theodorus Fransen
