FiNERVINER: Fine-grained Named Entity Recognition for Vulnerable Languages of India's North Eastern Region
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Named entity recognition (NER), particularly fine-grained NER (FgNER), extracts domain-specific entity information for Natural Language Processing (NLP) applications such as knowledge base construction and relation extraction. Manual annotation of relevant data is expensive, while distant supervision often produces noisy data. Moreover, resources for coarse-grained and fine-grained NER in Indian languages, particularly in the vulnerable languages of India’s North Eastern Region, remain scarce. This work creates such a resource for three vulnerable languages: <i>Bodo/Boro (brx)</i>, <i>Manipuri/Meitei (mni)</i>, and <i>Mizo/Lushai (lus)</i>, which are official languages in three Indian states and are spoken by more than six million people across five countries in South and Southeast Asia. We use annotation projection from high-resource FgNER datasets, relying on source-to-target parallel corpora and a projection tool built on a multilingual encoder. The dataset comprises over 198k sentences, 282k entities, and 2.8M tokens in each low-resource language. Our analyses validate the dataset’s high quality. We further explore zero-shot and cross-lingual settings, examining the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset, expert detector models, the agentic tool, and the interactive web application are available as open-source resources at: <url>https://hf.co/collections/prachuryyaIITG/finerviner</url>.
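As an illustration of the general idea behind annotation projection, the following minimal sketch transfers BIO labels from a source sentence to its target translation through word alignments. It is not the projection tool described above: the function names `extract_spans` and `project_labels`, the placeholder entity types `PER` and `LOC`, the toy alignment, and the rule of labelling the contiguous target range between the first and last aligned token are all assumptions made for this example. In practice the alignment pairs would come from an aligner, for instance one built on a multilingual encoder.

```python
from typing import Dict, List, Tuple


def extract_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) for each BIO entity span."""
    spans: List[Tuple[int, int, str]] = []
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, lab in enumerate(labels + ["O"]):
        inside = lab.startswith("I-") and etype == lab[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return spans


def project_labels(
    src_labels: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project source BIO labels onto target tokens via (src, tgt) index pairs."""
    src_to_tgt: Dict[int, List[int]] = {}
    for s, t in alignment:
        src_to_tgt.setdefault(s, []).append(t)

    tgt_labels = ["O"] * tgt_len
    for start, end, etype in extract_spans(src_labels):
        tgt_idx = [t for s in range(start, end) for t in src_to_tgt.get(s, [])]
        if not tgt_idx:
            continue  # entity has no aligned target token; drop it
        lo, hi = min(tgt_idx), max(tgt_idx)
        if any(lab != "O" for lab in tgt_labels[lo:hi + 1]):
            continue  # assumption: skip spans that would overlap an earlier one
        tgt_labels[lo] = "B-" + etype
        for j in range(lo + 1, hi + 1):
            tgt_labels[j] = "I-" + etype
    return tgt_labels


# Toy example: four source tokens, four target tokens, with the last two
# target positions swapped relative to the source (hypothetical alignment).
src_labels = ["B-PER", "I-PER", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (2, 3), (3, 2)]
projected = project_labels(src_labels, alignment, tgt_len=4)
print(projected)
```

Labelling the whole range between the first and last aligned target token keeps projected entities contiguous even when the aligner misses a word in the middle, at the cost of occasionally absorbing an unrelated token; other projection heuristics are equally possible.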