JMedWiC: A Japanese Word-in-Context Dataset in the Medical Domain
Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026
Abstract
We release JMedWiC, a Japanese Word-in-Context (WiC) dataset tailored to the medical domain. Word sense disambiguation is difficult because the meaning of a word varies with its context; to evaluate it, previous research has developed WiC datasets, which ask whether a target word has the same sense in two given contexts. In the medical domain, misinterpreting word senses can hinder accurate comprehension of medical information, yet no Japanese WiC dataset specialized for this domain currently exists. Moreover, existing WiC datasets have been constructed from lexical resources with sense inventories, such as WordNet and UMLS, and such resources are not sufficiently developed for Japanese. We therefore construct a Japanese medical-domain WiC dataset without relying on lexical resources: context pairs are automatically extracted from a large-scale corpus, and sense-identity labels for the target words are annotated manually.
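To make the task format concrete, the following is a minimal sketch of how a WiC instance and its binary sense-identity label could be represented. The class name, field names, and the use of accuracy as the metric are assumptions made for illustration, not the released JMedWiC schema, and the two English items are invented placeholders rather than dataset entries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WiCInstance:
    """One Word-in-Context item (hypothetical layout, not the JMedWiC format)."""
    target: str        # the target word shared by both contexts
    context_a: str     # first context containing the target
    context_b: str     # second context containing the target
    same_sense: bool   # gold label: True if the sense is the same in both


def accuracy(gold: List[WiCInstance], predictions: List[bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if len(gold) != len(predictions):
        raise ValueError("gold and predictions must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(inst.same_sense == pred for inst, pred in zip(gold, predictions))
    return correct / len(gold)


# Invented placeholder items for illustration only; not taken from JMedWiC.
examples = [
    WiCInstance(
        target="discharge",
        context_a="The patient was ready for discharge on day three.",
        context_b="Discharge was planned after the wound healed.",
        same_sense=True,
    ),
    WiCInstance(
        target="discharge",
        context_a="The patient was ready for discharge on day three.",
        context_b="Purulent discharge was observed at the incision site.",
        same_sense=False,
    ),
]
```

A system under evaluation would output one boolean per instance, for example `accuracy(examples, [True, False])`, so that scoring needs only the gold labels and no sense inventory.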