
KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/4f55ejcfp5wh

Abstract

Word frequencies are integral in linguistic studies, showing strong correlations with speakers’ cognitive abilities and other important linguistic parameters including the Age of Acquisition (AoA). However, the formulation of credible Korean word frequency norms has been obstructed by the lack of expansive speech data and a reliable part-of-speech (POS) tagger. In this study, we unveil Korean word frequency norms (KoFREN), derived from large-scale spontaneous speech corpora (41 million words) that include a balanced representation of gender and age. We employed a machine learning-powered POS tagger, showcasing accuracy on par with human annotators. Our frequency norms correlate significantly with external studies’ lexical decision time (LDT) and AoA measures. KoFREN also aligns with English counterparts sourced from SUBTLEX_US, an English word frequency measure that has been frequently used in the literature. KoFREN is poised to facilitate research in spontaneous Contemporary Korean and can be utilized in many fields, including clinical studies of Korean patients.

Details

Paper ID
lrec2024-main-0866
Pages
pp. 9926-9931
BibKey
kim-etal-2024-kofren
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Jin-seo Kim
  • Anna Seo Gyeong Choi
  • Sunghye Cho

Links