LREC 2014, main conference

A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/2x7os8jjo9dx

Abstract

In this paper we describe an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method proposed in this paper for automatically generating a phonetic dictionary is rule based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the grapheme-to-phoneme mapping is around 9%.
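The general shape of a rule-based dictionary generator with an exception lexicon can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the rules, the exception entry, the Latin-letter toy words, and the phoneme symbols are all hypothetical, and the error rate is assumed here to be the fraction of words whose full predicted pronunciation differs from a reference pronunciation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy rules mapping a grapheme sequence to one phoneme.
# They are NOT the pronunciation rules defined in the paper.
RULES: List[Tuple[str, str]] = [
    ("sh", "S"),
    ("a", "a"),
    ("b", "b"),
    ("h", "h"),
    ("k", "k"),
    ("s", "s"),
    ("t", "t"),
]

# Hypothetical exception lexicon: words whose pronunciation the rules miss.
EXCEPTIONS: Dict[str, List[str]] = {
    "tkt": ["t", "i", "k", "t"],
}

# Try longer grapheme sequences first so that "sh" wins over "s" + "h".
_SORTED_RULES = sorted(RULES, key=lambda rule: -len(rule[0]))


def g2p(word: str) -> List[str]:
    """Return a phoneme list: exception lexicon first, then the rules."""
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phones: List[str] = []
    i = 0
    while i < len(word):
        for graphemes, phone in _SORTED_RULES:
            if word.startswith(graphemes, i):
                phones.append(phone)
                i += len(graphemes)
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return phones


def build_dictionary(words: List[str]) -> Dict[str, List[str]]:
    """Generate a pronunciation dictionary for a word list."""
    return {word: g2p(word) for word in words}


def word_error_rate(predicted: Dict[str, List[str]],
                    reference: Dict[str, List[str]]) -> float:
    """Fraction of reference words whose predicted pronunciation is wrong."""
    if not reference:
        return 0.0
    errors = sum(1 for w, ref in reference.items() if predicted.get(w) != ref)
    return errors / len(reference)


if __name__ == "__main__":
    # Hypothetical reference pronunciations; "hat" deliberately disagrees
    # with the toy rules to produce one error.
    reference = {
        "shab": ["S", "a", "b"],
        "kas": ["k", "a", "s"],
        "tkt": ["t", "i", "k", "t"],
        "hat": ["h", "a", "T"],
    }
    predicted = build_dictionary(list(reference))
    print(word_error_rate(predicted, reference))
```

Checking the exception lexicon before applying any rule keeps irregular words out of the rule set, so the rules can stay small and general.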

Details

Paper ID
lrec2014-main-385
Pages
pp. 306-310
BibKey
masmoudi-etal-2014-corpus
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 to 31 May 2014

Authors

  • Abir Masmoudi

  • Mariem Ellouze Khmekhem

  • Yannick Estève

  • Lamia Hadrich Belguith

  • Nizar Habash

Links