Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/4bj95jkoywmc

Abstract

This paper addresses the problem of Hindi compound word splitting and its relevance to developing a good-quality phonetizer for Hindi speech synthesis. The constituents of a Hindi compound word are not separated by a space or hyphen. Hence, most existing compound-splitting algorithms cannot be applied to Hindi. We propose a new technique for the automatic extraction of compound words from a Hindi corpus. In preliminary tests, the algorithm split 92 to 96% of the input compound words; of these splits, around 83 to 87% were correct. We suggest a few modifications to improve the accuracy of the splits. Finally, we observe an improvement of 1.6% in Hindi Grapheme-to-Phoneme (G2P) conversion as a result of using a phonetized compound word lexicon created by this technique.

Details

Paper ID
lrec2004-main-298
Pages
N/A
BibKey
deepa-etal-2004-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 to 28 May 2004

Authors

  • S.R. Deepa
  • Kalika Bali
  • A.G. Ramakrishnan
  • Partha Pratim Talukdar
