The Greedy Algorithm and its Application to the Construction of a Continuous Speech Database

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

Databases containing varied linguistic features can be build by condensing large corpora; in this work we need to cover a set of phonetic units with a minimal set of natural phonetic sentences. With this aim in view we compare three set covering methods: the greedy method, its inverse which we call the spitting method, and the pair exchange method. Each method is defined with several criteria guiding the selection of sentences; they relate to the number of units of the sentences, to their length, and to the rareness of their units. A first experiment shows that pair exchange method doesn't guarantee a total covering. Greedy and spitting methods performances are comparable; nevertheless greedy is a bit better and above all less time-consuming. Applying spitting method to a greedy cover increases performance by removing about 10% redundancy. So does pair exchange method, but it is more time-consuming. Most of the criteria guiding selections are sensitive to the sentences length. Criteria performances obtained for a total covering are not necessarily transposable to a partial covering.