Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/48557dayhbxb

Abstract

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both “vocalized” and “unvocalized” forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the “real-world” situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent “real-world” data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.

Details

Paper ID
lrec2008-main-361
Pages
N/A
BibKey
maamouri-etal-2008-diacritic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28-30 May 2008

Authors

  • Mohamed Maamouri
  • Seth Kulick
  • Ann Bies
