MELD: Melding Diverse Multilingual and Multi-Domain Datasets for Named Entity Recognition Evaluation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Zero-shot Named Entity Recognition (NER) has gained prominence for information extraction across diverse domains without being limited to a single, fixed tag set. However, existing NER resources vary widely in data format, licensing terms, annotation schemes, and availability, making it difficult to systematically evaluate the generalization capabilities of zero-shot NER models. Prior efforts to aggregate datasets with broad domain coverage have largely focused on a small subset of languages, and they often do not make transparent how the datasets were processed from their sources. This paper introduces MELD, a comprehensive multilingual and multi-domain data collection designed to address these gaps. MELD integrates 60 NER datasets spanning 194 languages, 14 domains, and 601 normalized entity types. Whereas previously introduced multilingual NER datasets are mainly silver-standard, MELD contains gold-standard annotations for 60 languages. All data processing steps are fully open-source and reproducible, facilitating future extensions and ensuring long-term accessibility. Although MELD is primarily designed for zero-shot evaluation, it also provides training and development splits in a single, consistent format to support future research in few-shot and supervised NER settings.