Phonetically Balanced Code-Mixed Speech Corpus for Hindi-English Automatic Speech Recognition

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

This paper presents the development of a phonetically balanced read-speech corpus of code-mixed Hindi-English. Phonetic balance was achieved by selecting sentences that contain triphones whose frequency falls below a predefined threshold. The assumption behind the compulsory inclusion of such rare units is that high-frequency triphones will inevitably be included along with them. Under this criterion, the Pearson correlation coefficient between the phonetically balanced corpus and a large code-mixed reference corpus was 0.996. The data for the corpus were extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions within a Hindi matrix sentence. Statistics on the phone and triphone distributions are presented to show graphically the phonetic similarity between the reference corpus and the corpus sampled with our method.
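The selection idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy phone sequences, the threshold value, and the choice to correlate raw triphone counts over the reference inventory are all assumptions made for illustration. The abstract does not specify these details.

```python
from collections import Counter
from math import sqrt


def triphones(phones):
    """Return the overlapping phone triples of one sentence (a list of phones)."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_counts(sentences):
    """Count triphone occurrences over a list of phone sequences."""
    counts = Counter()
    for phones in sentences:
        counts.update(triphones(phones))
    return counts


def select_rare(sentences, threshold):
    """Keep sentences containing at least one triphone whose count in the
    reference corpus is below `threshold` (hypothetical parameter)."""
    ref = triphone_counts(sentences)
    return [s for s in sentences
            if any(ref[t] < threshold for t in triphones(s))]


def pearson(ref_counts, sample_counts):
    """Pearson correlation of two count tables over the reference inventory.
    Triphones missing from the sample are counted as zero (an assumption)."""
    keys = sorted(ref_counts)
    x = [ref_counts[k] for k in keys]
    y = [sample_counts.get(k, 0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant distribution")
    return cov / sqrt(var_x * var_y)


# Toy reference corpus: each sentence is already a sequence of phones.
reference = [
    ["k", "a", "m", "a", "l"],
    ["k", "a", "m", "a", "r"],
    ["k", "a", "m", "a"],
]
selected = select_rare(reference, threshold=2)
r = pearson(triphone_counts(reference), triphone_counts(selected))
```

In this toy corpus the third sentence contains only frequent triphones and is dropped, yet those frequent triphones still appear in the selected set through the other two sentences, which is the assumption the abstract describes.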

Details

Paper ID

lrec2018-main-235

Pages

N/A

DOI

10.63317/3xo52xmbi9zq

BibKey

pandey-etal-2018-phonetically

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Ayushi Pandey
Brij Mohan Lal Srivastava
Rohit Kumar
Bhanu Teja Nellore
Kasi Sai Teja
Suryakanth V. Gangashetty