Title

The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual Romance corpus

Authors

Emanuela Cresti (Dipartimento di Italianistica, Università di Firenze, Piazza Savonarola 1, 50125 Firenze, Italy)

Massimo Moneglia (Dipartimento di Italianistica, Università di Firenze, Piazza Savonarola 1, 50125 Firenze, Italy)

Fernanda Bacelar do Nascimento (Centro de Linguística da Universidade de Lisboa, Complexo Interdisciplinar, Av. Gama Pinto, 2, 1649-003 Lisboa, Portugal)

Antonio Moreno Sandoval (Laboratorio de Lingüística Informática, Departamento de Lingüística, Universidad Autónoma de Madrid, Carretera de Colmenar Viejo Km 15, Cantoblanco, 28049 Madrid, Spain)

Jean Veronis (Description Linguistique Informatisée sur Corpus, Université de Provence, 29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1, France)

Philippe Martin (Pitch Instruments France, 24, Rue Las Cases, 75005, France)

Khalid Choukri (European Language Distribution Agency, European Language Association Agency (ELDA), 55-57, Rue Brillat-Savarin, 75013 Paris, France)

Valerie Mapelli (Istituto Trentino di Cultura, Trento, Istituto Trentino di Cultura (Centro per la ricerca scientifica e tecnologica), 38050 Povo, Trento, Italy)

Daniele Falavigna (Istituto Trentino di Cultura, Trento, Istituto Trentino di Cultura (Centro per la ricerca scientifica e tecnologica), 38050 Povo, Trento, Italy)

Antonio Cid (Instituto Cervantes, Oficina del Español en la Sociedad de la Información, Libreros, 23, 28801 Alcalá de Henares - Madrid, Spain)

Claude Blum (Editions Honoré CHAMPION, 7, Quai Malaquais, 75006 Paris, France)

Session

SO1: Large Projects-Initiatives For Speech Corpora

Abstract

C-ORAL-ROM is a multilingual corpus of spontaneous speech of around 1,200,000 words representing the four main Romance languages: French, Italian, Portuguese and Spanish. The resource will be delivered in standard textual format, aligned to the audio source in a multimedia edition. C-ORAL-ROM aims to ensure both a sufficient representation of spontaneous speech variation within each language resource and comparability among the four resources with respect to a defined set of variation parameters. The multimedia design of C-ORAL-ROM allows simultaneous alignment and full appreciation of the acoustic information through the speech software WINPITCHCORPUS. The storage of spoken language resources is based on the identification of utterances in the four corpora through perceptually relevant prosodic properties. In C-ORAL-ROM all textual information is tagged simultaneously for prosodic parsing and utterance limits. Each prosodic unit corresponding to an utterance is easily and directly aligned to its acoustic counterpart, thus ensuring a natural text-sound correspondence and the definition of a database of possible speech acts in the four Romance languages.

Keywords

Multilingual Romance corpus

Full Paper

290.pdf