LREC 2018 main

A Taxonomy for In-depth Evaluation of Normalization for User Generated Content

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/5cb9nkjw53km

Abstract

In this work we present a taxonomy of error categories for lexical normalization, which is the task of translating user-generated content to canonical language. We annotate a recent normalization dataset to test the practical use of the taxonomy and reach near-perfect agreement. This annotated dataset is then used to evaluate how an existing normalization model performs on the different categories of the taxonomy. The results of this evaluation reveal that some of the problematic categories only include minor transformations, whereas most regular transformations are solved quite well.

Details

Paper ID
lrec2018-main-109
Pages
N/A
BibKey
van-der-goot-etal-2018-taxonomy
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Rob van der Goot

  • Rik van Noord

  • Gertjan van Noord

Links