Is Literal Annotation Enough? Building an Annotation Framework for Metonymic Named Entities in Marathi

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

Abstract

Named Entity Recognition (NER) has been a core task of natural language processing (NLP) since the Message Understanding Conferences (MUCs). Data annotation plays a crucial role in this task. However, existing annotation studies often rely on the literal sense of entities. Such annotations can lead to inconsistencies when resolving the ambiguity introduced by figurative tropes such as metonymy. For example, in "India won the series", "India" refers to a sports team rather than a geographic location. Understanding such non-literal senses is crucial for NLP applications such as Question Answering and Information Extraction. To address this gap, this study presents an annotation framework and detailed guidelines for annotating metonymic readings of named entities in Marathi, an Indo-Aryan language spoken in the central-western region of India. The study uses a news corpus drawn from various domains and presents a two-tiered annotation framework for annotating conventional metonymies in Marathi. It then describes the application of this framework to a corpus of 1,279 Marathi sentences. The results show the inadequacy of literal-only annotation: 53.6% of named entity spans have metonymic readings. The study contributes to resource development for low-resource languages that share similar linguistic structures and cultural contexts. The paper describes the framework with the necessary examples, discusses the challenges encountered, and concludes with directions for future work.