Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/2qw28eurrigi

Abstract

Broadcast news is a very rich source of Language Resources that has been exploited to develop and assess a large set of Human Language Technologies. Some examples include systems to: automatically produce text transcriptions of spoken data; identify the language of a text; translate a text from one language to another; identify topics in the news and retrieve all stories discussing a target topic; retrieve stories directly from the broadcast audio; and extract summaries of the content of news stories. BNSC is a broadcast news speech corpus developed in the framework of the European-funded project Network of Data Centres (NetDC). The corpus contains more than 20 hours of Arabic news recordings in Modern Standard Arabic. The news was recorded over a period of 3 months and transcribed in Arabic script. The project was carried out in cooperation with the LDC (Linguistic Data Consortium), which has produced a similar corpus of its Voice of America Arabic in the United States. This paper presents the BNSC corpus production from data collection to final product.

Details

Paper ID
lrec2004-main-521
Pages
N/A
BibKey
choukri-etal-2004-network
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 to 28 May 2004

Authors

  • Khalid Choukri

  • Mahtab Nikkhou

  • Niklas Paulsson
