
A Fully Annotated Corpus of Russian Speech

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/4pvjxjnacnxx

Abstract

The paper introduces CORPRES, a fully annotated Russian speech corpus developed at the Department of Phonetics, St. Petersburg State University, as the result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data: pitch marks, phonetic events, narrow and wide phonetic transcription, and orthographic and prosodic transcription. The precise phonetic transcription of the data makes the corpus an especially valuable resource for both research and development purposes. The corpus contains 528,458 running words and 60 hours of speech, 7.5 hours from each speaker. Of the corpus, 40% was manually segmented and fully annotated on all six levels. The remaining 60% was partly annotated, with labels for pitch periods and phonetic events; orthographic, prosodic and ideal phonetic transcriptions for this part were generated and stored as text files. The fully annotated part covers all speakers and all speaking styles included in the corpus. The paper describes the CORPRES design and annotation principles, gives an overall description of the data, and discusses possible uses of the corpus.

Details

Paper ID
lrec2010-main-188
Pages
N/A
BibKey
skrelin-etal-2010-fully
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 to 23 May 2010

Authors

  • Pavel Skrelin
  • Nina Volskaya
  • Daniil Kocharov
  • Karina Evgrafova
  • Olga Glotova
  • Vera Evdokimova
