LREC-COLING 2024, Main Conference

Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/4mjtasw2xir7

Abstract

We present the Khan Academy Corpus, totalling 10,122 hours in 87,394 recordings across 29 languages, where 43% of the recordings (4,252 hours) have human-written subtitles. The subtitle texts cover a total of 137 languages. The dataset was collected from open-access Khan Academy lectures and benefits from their manual transcripts and the manual translations of those transcripts. It can serve in the creation or evaluation of multilingual speech recognition or translation systems and features a diverse set of subject domains.

Details

Paper ID
lrec2024-main-0851
Pages
pp. 9743-9752
BibKey
duriskova-etal-2024-khan
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 to 25 May 2024

Authors

  • Dominika Ďurišková

  • Daniela Jurášová

  • Matúš Žilinec

  • Eduard Šubert

  • Ondřej Bojar

Links