LREC 2016 Main Conference

Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/4vs4ouxceqsy

Abstract

Treebanks are important resources for researchers in natural language processing, speech recognition, theoretical linguistics, and related fields. To strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has been built. However, the quality of this treebank is not satisfactory and is a possible cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges of the Vietnamese language and how we address them in developing annotation guidelines. We also present our methods to improve the quality of the annotation guidelines and to ensure annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.

Details

Paper ID
lrec2016-main-243
Pages
pp. 1532-1539
BibKey
nguyen-etal-2016-challenges
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Quy Nguyen
  • Yusuke Miyao
  • Ha Le
  • Ngan Nguyen
