LREC-COLING 2024 workshop

Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities

Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/3ss6s3apwxbp

Abstract

Addressing tasks in Natural Language Processing requires access to sufficient, high-quality data. However, working with languages that have limited resources poses a significant challenge due to the absence of established methodologies, frameworks, and collaborative efforts. This paper briefly outlines the challenges associated with standardization in data creation, focusing on Indian languages, which are often categorized as low-resource languages. Additionally, potential solutions and the importance of standardized procedures for low-resource language data are proposed. Furthermore, the critical role of standardized protocols in corpus creation and their impact on research is highlighted. Lastly, this paper concludes by defining what constitutes a corpus.

Details

Paper ID
lrec2024-ws-wildre-08
Pages
pp. 54-58
BibKey
dongare-2024-creating
Publisher
European Language Resources Association (ELRA) and ICCL
Workshop
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Date
20 May 2024 - 25 May 2024

Authors

    Pratibha Dongare

Links