HomeLREC 2020WorkshopsSLTUlrec2020-ws-sltu-15
Back to SLTU 2020
LREC 2020workshop

Text Corpora and the Challenge of Newly Written Languages

Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

DOI:10.63317/3kvp3xraiycr

Abstract

Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.

Details

Paper ID
lrec2020-ws-sltu-15
Pages
pp. 111-120
BibKey
millour-fort-2020-text
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Location
undefined, undefined
Date
11 May 2020 16 May 2020

Authors

  • AM

    Alice Millour

  • KF

    Karën Fort

Links