
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/2k45bgq8m7ru

Abstract

We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main annotation protocols: in-lab gold annotation and crowdsourced annotation. We developed a web-based annotation tool to facilitate management of the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibit code switching between Modern Standard Arabic and Egyptian Dialectal Arabic across three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.

Details

Paper ID: lrec2016-main-669
Pages: pp. 4228-4235
BibKey: diab-etal-2016-creating
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-9517408-9-1
Conference: Tenth International Conference on Language Resources and Evaluation
Location: Portorož, Slovenia
Date: 23 May 2016 to 28 May 2016

Authors

  • Mona Diab
  • Mahmoud Ghoneim
  • Abdelati Hawwari
  • Fahad AlGhamdi
  • Nada AlMarwani
  • Mohamed Al-Badrashiny
