Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages

Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

DOI:10.63317/2cmdkvxjwawv

Abstract

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. This system allows us to rapidly scale technologies such as ASR and TTS to more languages, including low-resource languages.
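
As a rough illustration of the idea in the abstract, the sketch below shows how a language-independent verbalizer template might be parameterized by per-language data, with questionnaire answers supplemented from a second source. It is plain Python rather than an FST, and every name in it (`LanguageData`, `build_language_data`, `verbalize_decimal`, the field names) is hypothetical and not taken from the paper. Two assumptions are made purely for the example: questionnaire answers take priority over supplementary data, and numbers are read digit by digit, where a real system would use an induced number grammar.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LanguageData:
    """Hypothetical per-language record; field names are illustrative only."""
    digits: Dict[str, str]   # written digit -> spoken word
    decimal_separator: str   # written symbol, e.g. "." or ","
    decimal_word: str        # spoken form of the separator


def build_language_data(questionnaire: dict, supplementary: dict) -> LanguageData:
    """Merge two sources into one record.

    Assumption for this sketch: questionnaire answers override
    supplementary data, which only fills the gaps.
    """
    merged = {**supplementary, **questionnaire}
    return LanguageData(**merged)


def verbalize_decimal(token: str, lang: LanguageData) -> str:
    """Language-independent template filled in by language-specific data.

    Simplification: both parts of the number are read digit by digit.
    """
    integer, _, fraction = token.partition(lang.decimal_separator)
    words = [lang.digits[d] for d in integer]
    if fraction:
        words.append(lang.decimal_word)
        words.extend(lang.digits[d] for d in fraction)
    return " ".join(words)


ENGLISH = build_language_data(
    questionnaire={"decimal_separator": ".", "decimal_word": "point"},
    supplementary={
        "digits": dict(zip(
            "0123456789",
            "zero one two three four five six seven eight nine".split(),
        ))
    },
)

GERMAN = build_language_data(
    questionnaire={"decimal_separator": ",", "decimal_word": "Komma"},
    supplementary={
        "digits": dict(zip(
            "0123456789",
            "null eins zwei drei vier fünf sechs sieben acht neun".split(),
        ))
    },
)

print(verbalize_decimal("3.14", ENGLISH))  # three point one four
print(verbalize_decimal("3,14", GERMAN))   # drei Komma eins vier
```

The same `verbalize_decimal` template serves both languages; only the data record changes, which is the property that would let such a system scale to a new language by collecting data rather than writing new rules.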

Details

Paper ID

lrec2020-ws-sltu-30

Pages

pp. 218-225

BibKey

ritchie-etal-2020-data

Editors

N/A

Publisher

European Language Resources Association (ELRA)

ISSN

N/A

ISBN

N/A

Location

Marseille, France

Date

11 - 16 May 2020

Authors

Sandy Ritchie
Eoin Mahon
Kim Heiligenstein
Nikos Bampounis
Daan van Esch
Christian Schallhart
Jonas Mortensen
Benoit Brard