LREC 2022 Main Conference

JADE: Corpus for Japanese Definition Modelling

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/382gfozm5yjs

Abstract

This study presents JADE, a corpus for Japanese definition modelling, a technique that automatically generates a definition for a given target word or phrase. Definition modelling is crucial for practical applications that assist language learning and education, as well as for those that support reading documents in unfamiliar domains. Although corpora for developing definition modelling techniques have been actively created, they are mostly limited to English. In this study, a corpus for Japanese, named JADE, was created following a previous study that mines an online encyclopedia. JADE provides about 630k sets of targets, their definitions, and usage examples as contexts for about 41k unique targets, which is sufficiently large to train neural models. The targets include both words and phrases, and they cover diverse domains and topics. A pre-trained sequence-to-sequence model and the state-of-the-art definition modelling method were also benchmarked on JADE to support future development of the technique in Japanese. The JADE corpus has been released and is available online.

Details

Paper ID
lrec2022-main-743
Pages
pp. 6884-6888
BibKey
huang-etal-2022-jade
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 to 25 June 2022

Authors

  • Han Huang

  • Tomoyuki Kajiwara

  • Yuki Arase

Links