Back to Main Conference 2024
LREC-COLING 2024main

Gos 2: A New Reference Corpus of Spoken Slovenian

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2mn9knnmuddv

Abstract

This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.

Details

Paper ID
lrec2024-main-0691
Pages
pp. 7825-7830
BibKey
verdonik-etal-2024-gos
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • DV

    Darinka Verdonik

  • KD

    Kaja Dobrovoljc

  • TE

    Tomaž Erjavec

  • NL

    Nikola Ljubešić

Links