Guidelines and Framework for a Large Scale Arabic Diacritized Corpus

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/55fzrvtc3pvd

Abstract

This paper presents the annotation guidelines developed as part of an effort to create a large-scale, manually diacritized corpus covering various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how we address them. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.

Details

Paper ID
lrec2016-main-577
Pages
pp. 3637-3643
BibKey
zaghouani-etal-2016-guidelines
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Wajdi Zaghouani
  • Houda Bouamor
  • Abdelati Hawwari
  • Mona Diab
  • Ossama Obeid
  • Mahmoud Ghoneim
  • Sawsan Alqahtani
  • Kemal Oflazer

Links