LREC 2022 (main conference)

TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/3fxdn2ys5q9d

Abstract

We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing. The TeDDi sample currently features 89 languages based on the typological diversity sample in the World Atlas of Language Structures. It consists of more than 20k texts and is accompanied by open-source corpus processing tools. The aim of TeDDi is to facilitate text-based quantitative analysis of linguistic diversity. We describe in detail the TeDDi sample, how it was created, its data availability, and its added value for NLP and linguistic research.

Details

Paper ID
lrec2022-main-123
Pages
pp. 1150-1158
BibKey
moran-etal-2022-teddi
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Steven Moran

  • Christian Bentz

  • Ximena Gutierrez-Vasques

  • Olga Pelloni

  • Tanja Samardzic

Links