Back to Main Conference 2022
LREC 2022main

GujMORPH - A Dataset for Creating Gujarati Morphological Analyzer

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/3sqfi2xnmjs2

Abstract

Computational morphology deals with the processing of a language at the word level. A morphological analyzer is a key linguistic word-level tool that returns all the constituent morphemes and their grammatical categories associated with a particular word form. For the highly inflectional and low resource languages, the creation of computational morphology-related tools is a challenging task due to the unavailability of underlying key resources. In this paper, we discuss the creation of an annotated morphological dataset- GujMORPH for the Gujarati - an indo-aryan language. For the creation of this dataset, we studied language grammar, word formation rules, and suffix attachments in depth. This dataset contains 16,527 unique inflected words along with their morphological segmentation and grammatical feature tagging information. It is a first of its kind dataset for the Gujarati language and can be used to develop morphological analyzer and generator models. The dataset is annotated in the standard Unimorph schema and evaluated on the baseline system. We also describe the tool used to annotate the data in the standard format. The dataset is released publicly along with the library. Using this library, the data can be obtained in a format that can be directly used to train any machine learning model.

Details

Paper ID
lrec2022-main-767
Pages
pp. 7088-7095
BibKey
baxi-bhatt-2022-gujmorph
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 25 June 2022

Authors

  • JB

    Jatayu Baxi

  • BB

    Brijesh Bhatt

Links