Writing System and Speaker Metadata for 2,800+ Language Varieties

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/3nc6ij9wmtgt

Abstract

We describe an open-source dataset providing metadata for more than 2,800 language varieties used in the world today. Specifically, the dataset provides the attested writing system(s) for each of these varieties, as well as an estimated speaker count for each. The dataset was developed through internal research and has been used for analyses of language technologies. It is the largest publicly available, machine-readable resource with writing system and speaker information for the world’s languages. We analyze the distribution of languages and writing systems in our data and compare it to their representation in current NLP. We hope the availability of this data will catalyze research on under-represented languages.

Details

Paper ID
lrec2022-main-538
Pages
pp. 5035-5046
BibKey
van-esch-etal-2022-writing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Daan van Esch
  • Tamar Lucassen
  • Sebastian Ruder
  • Isaac Caswell
  • Clara Rivera
