TArC: Tunisian Arabish Corpus, First complete release

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/5eqrikasc4rn

Abstract

In this paper we present the final result of a project focused on Tunisian Arabic encoded in Arabizi, the Latin-based writing system for digital conversations. The project led to the realization of two integrated and independent tools: a linguistic corpus and a neural network architecture created to annotate the former with various levels of linguistic information (code-switching classification, transliteration, tokenization, POS-tagging, lemmatization). We discuss the choices made in terms of computational and linguistic methodology and the strategies adopted to improve our results. We report on the experiments performed in order to outline our research path. Finally, we explain the reasons why we believe in the potential of these tools for both computational and linguistic research.

Details

Paper ID
lrec2022-main-121
Pages
pp. 1125-1136
BibKey
gugliotta-dinarelli-2022-tarc
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Elisa Gugliotta

  • Marco Dinarelli

Links