
Extraction of Unknown Words Using the Probability of Accepting the Kanji Character Sequence as One Word

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/29fatzdo2i8v

Abstract

In this paper, we propose a method for extracting unknown words composed of two or three kanji characters from Japanese text. In general, an unknown word composed of kanji characters is segmented into other words by morphological analysis, and the appearance probability of each segmented word is small. Using these features, we define a measure for accepting a two- or three-character kanji sequence as an unknown word. In addition, unknown words exhibit certain segmentation patterns. By applying our measure to kanji character sequences that have these patterns, we can extract unknown words. In the experiment, the F-measure for extraction of unknown words composed of two and three kanji characters was about 0.7 and 0.4, respectively. Our method does not need the frequency of a word in the training corpus to judge whether it is an unknown word. It therefore has the advantage that low-frequency unknown words can also be extracted.
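The idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual formula, so this is not the authors' measure. It is a toy stand-in which assumes that a candidate is scored by the summed negative log-probability of the segments that morphological analysis produced, so that sequences split into individually rare words score higher. The names `acceptance_score` and `extract_candidates`, the probability table, the `floor` value, and the threshold are all hypothetical.

```python
import math


def is_kanji(ch):
    # Rough check against the CJK Unified Ideographs block (assumption:
    # this range is adequate for illustration).
    return "\u4e00" <= ch <= "\u9fff"


def acceptance_score(segments, word_prob, floor=1e-6):
    # Toy score, NOT the paper's measure: sum of -log p(segment).
    # Rare segments contribute large values; unseen segments use `floor`.
    return sum(-math.log(word_prob.get(seg, floor)) for seg in segments)


def extract_candidates(segmented_sequences, word_prob, threshold):
    # Each item is the list of segments the analyzer produced for one
    # contiguous kanji sequence.
    found = []
    for segments in segmented_sequences:
        surface = "".join(segments)
        if len(segments) < 2:
            continue  # the analyzer already treats it as one word
        if len(surface) not in (2, 3):
            continue  # the abstract covers only two or three characters
        if not all(is_kanji(ch) for ch in surface):
            continue
        if acceptance_score(segments, word_prob) >= threshold:
            found.append(surface)
    return found


# Invented probabilities, for illustration only.
word_prob = {"東": 1e-2, "京": 1e-2, "電": 1e-5, "脳": 1e-5}
sequences = [["電", "脳"], ["東", "京"], ["東京"]]
candidates = extract_candidates(sequences, word_prob, threshold=15.0)
```

With these invented numbers, only the sequence whose segments are both rare passes the threshold. Note that the score uses the probabilities of the segments rather than the corpus frequency of the candidate itself, which is the property the abstract highlights.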

Details

Paper ID
lrec2000-main-059
Pages
N/A
BibKey
shinnou-ikeya-2000-extraction
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 - 2 June 2000

Authors

  • Hiroyuki Shinnou

  • Masanori Ikeya
