Collection and Evaluation of Broadcast News Data for Arabic

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/2pmqwv2r3dun

Abstract

This paper presents a general methodology for acquiring broadcast news data from the web and segmenting it automatically. We show that, starting from a relatively small corpus of about 10 hours, it is possible to automatically segment about 30 hours of data. This step is important because manual segmentation of broadcast news data is generally tedious and time-consuming. In addition to the data collection proposal, we describe the development of an initial recognition system. We also present an automatic procedure for creating vowelizations for Arabic words. This is important because most available Arabic transcriptions lack vowelization, which is crucial for creating phonetic transcriptions. The initial system achieves an error rate of 36%.

Details

Paper ID
lrec2004-main-170
Pages
N/A
BibKey
afify-emam-2004-collection
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 to 28 May 2004

Authors

  • Mohamed Afify
  • Ossama Emam
