Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Abstract
This paper presents the annotation guidelines developed as part of an effort to create a large-scale, manually diacritized corpus covering various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe the issues encountered while training the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and report on the quality of the resulting annotations.