Word Segmentation of Vietnamese Texts: a Comparison of Approaches

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/4apye64ze8jv

Abstract

We present a comparison of three word segmentation systems for Vietnamese. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often arise. Beyond presenting the tested systems, we also propose a standard definition for word segmentation in Vietnamese and introduce a reference corpus developed to evaluate this task. The results confirm that segmentation can be handled relatively well by automatic means, although a solution is still needed for out-of-vocabulary words.

Details

Paper ID
lrec2008-main-355
Pages
N/A
BibKey
dinh-etal-2008-word
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28-30 May 2008

Authors

  • Quang Thắng Đinh

  • Hồng Phương Lê

  • Thị Minh Huyền Nguyễn

  • Cẩm Tú Nguyễn

  • Mathias Rossignol

  • Xuân Lương Vũ

Links