
Multi-Dialect Arabic POS Tagging: A CRF Approach

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/2ctharew6nhx

Abstract

This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects, along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets per dialect and train/test/development splits for 5-fold cross validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross-dialect and joint-dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we train a joint model that tags all four dialects with an average accuracy of 89.3%.
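As a rough illustration of the 5-fold setup mentioned in the abstract, the sketch below partitions 350 tweet identifiers into train/development/test sets for each fold. The abstract does not state the split proportions, so the choice here (one fold for test, the next fold for development, the remaining three for training) is an assumption, and `five_fold_splits` is a hypothetical helper, not part of the released data or code.

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Yield (train, dev, test) lists for each of k folds.

    Assumption: fold i is the test set, fold i+1 (wrapping around) is the
    development set, and the remaining folds form the training set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin into k folds of (near-)equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        dev_index = (i + 1) % k
        test = folds[i]
        dev = folds[dev_index]
        train = [
            item
            for j, fold in enumerate(folds)
            if j not in (i, dev_index)
            for item in fold
        ]
        yield train, dev, test


# 350 placeholder tweet IDs for one dialect (hypothetical identifiers).
tweet_ids = list(range(350))
splits = list(five_fold_splits(tweet_ids))
```

Under these assumptions each fold holds 70 tweets, so every split has 210 training, 70 development, and 70 test tweets, and each tweet appears in exactly one test set across the five folds.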

Details

Paper ID
lrec2018-main-015
Pages
N/A
BibKey
darwish-etal-2018-multi
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Kareem Darwish
  • Hamdy Mubarak
  • Ahmed Abdelali
  • Mohamed Eldesouki
  • Younes Samih
  • Randah Alharbi
  • Mohammed Attia
  • Walid Magdy
  • Laura Kallmeyer

Links