Back to Main Conference 2010
LREC 2010main

A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/2j6oncoqazjj

Abstract

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST’s acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.

Details

Paper ID
lrec2010-main-452
Pages
N/A
BibKey
liu-etal-2010-large
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • YL

    Yi Liu

  • PF

    Pascale Fung

  • YY

    Yongsheng Yang

  • DD

    Denise DiPersio

  • MG

    Meghan Glenn

  • SS

    Stephanie Strassel

  • CC

    Christopher Cieri

Links