Back to Main Conference 2010
LREC 2010main

Predicting Morphological Types of Chinese Bi-Character Words by Machine Learning Approaches

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/3w58m7wyzv5h

Abstract

This paper presented an overview of Chinese bi-character words’ morphological types, and proposed a set of features for machine learning approaches to predict these types based on composite characters’ information. First, eight morphological types were defined, and 6,500 Chinese bi-character words were annotated with these types. After pre-processing, 6,178 words were selected to construct a corpus named Reduced Set. We analyzed Reduced Set and conducted the inter-annotator agreement test. The average kappa value of 0.67 indicates a substantial agreement. Second, Bi-character words’ morphological types are considered strongly related with the composite characters’ parts of speech in this paper, so we proposed a set of features which can simply be extracted from dictionaries to indicate the characters’ “tendency” of parts of speech. Finally, we used these features and adopted three machine learning algorithms, SVM, CRF, and Naïve Bayes, to predict the morphological types. On the average, the best algorithm CRF achieved 75% of the annotators’ performance.

Details

Paper ID
lrec2010-main-274
Pages
N/A
BibKey
huang-etal-2010-predicting
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • TH

    Ting-Hao Huang

  • LK

    Lun-Wei Ku

  • HC

    Hsin-Hsi Chen

Links