Text Normalization Infrastructure that Scales to Hundreds of Language Varieties

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/57m3aa8gyujx

Abstract

We describe the automated multi-language text normalization infrastructure that prepares textual data to train language models used in Google’s keyboards and speech recognition systems, across hundreds of language varieties. Training corpora are sourced from various types of data sets, and the text is then normalized using a sequence of hand-written grammars and learned models. These systems need to scale to hundreds or thousands of language varieties in order to meet product needs. Frequent data refreshes, privacy considerations and simultaneous updates across such a high number of languages make manual inspection of the normalized training data infeasible, while there is ample opportunity for data normalization issues. By tracking metrics about the data and how it was processed, we are able to catch internal data processing issues and external data corruption issues that can be hard to notice using standard extrinsic evaluation methods. Showing the importance of paying attention to data normalization behavior in large-scale pipelines, these metrics have highlighted issues in Google’s real-world speech recognition system that have caused significant, but latent, quality degradation.
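The idea of tracking metrics about the data and how it was processed can be illustrated with a minimal sketch. It is not the authors' implementation: the step functions, metric names, and the `tolerance` threshold are all hypothetical, chosen only to show how a sequence of normalization steps might report statistics that are then compared against a previous run.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical normalization steps; the paper's grammars and learned
# models are not reproduced here.
def collapse_whitespace(line: str) -> str:
    return re.sub(r"\s+", " ", line).strip()

def lowercase(line: str) -> str:
    return line.lower()

def normalize_corpus(
    lines: List[str], steps: List[Callable[[str], str]]
) -> Tuple[List[str], Dict[str, float]]:
    """Apply each step in order and collect simple per-run metrics."""
    output: List[str] = []
    changed = 0
    dropped = 0
    for line in lines:
        result = line
        for step in steps:
            result = step(result)
        if not result:
            dropped += 1  # line became empty after normalization
            continue
        if result != line:
            changed += 1
        output.append(result)
    total = len(lines)
    metrics = {
        "changed_fraction": changed / total if total else 0.0,
        "dropped_fraction": dropped / total if total else 0.0,
    }
    return output, metrics

def flag_regressions(
    current: Dict[str, float], baseline: Dict[str, float], tolerance: float = 0.2
) -> List[str]:
    """Return the metrics whose absolute change from the baseline exceeds tolerance."""
    return [
        name
        for name, value in current.items()
        if abs(value - baseline.get(name, 0.0)) > tolerance
    ]
```

In this sketch, a sudden jump in `dropped_fraction` between data refreshes would be flagged for review without anyone having to inspect the normalized text itself, which is the property the abstract emphasizes.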

Details

Paper ID
lrec2018-main-216
Pages
N/A
BibKey
chua-etal-2018-text
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Mason Chua
  • Daan van Esch
  • Noah Coccaro
  • Eunjoon Cho
  • Sujeet Bhandari
  • Libin Jia