POS Tagging in Low-Resource Maithili Language: Specific Challenges and Nuances

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

Abstract

Part-of-Speech (POS) tagging is a key step in Natural Language Processing (NLP), laying the groundwork for more advanced syntactic and semantic tasks. Although Maithili is an Indo-Aryan language with a rich literary tradition and official recognition in India, computational resources for it remain very limited. This paper describes the creation of a corpus of 25,000 sentences drawn from the health, tourism, and administration domains, annotated with the hierarchical tagset currently used for Maithili. It also indicates that standard tagsets, typically adapted from English or Hindi, fail to capture the linguistic nuances of Maithili. This underscores the need for a dedicated tagging framework that accounts for characteristics such as vocative particles, verbal nuances, and honorific complexities.

Keywords: Parts of Speech, Natural Language Processing, Maithili, annotation