Croatian Error-Annotated Corpus of Non-Professional Written Language

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

In this paper, the authors present the Croatian corpus of non-professional written language. The corpus consists of two subcorpora: the clinical subcorpus, comprising written texts produced by speakers with various types of language disorders, and the healthy speakers subcorpus. Through this composition and its levels of annotation, it offers an opportunity for different lines of research. The authors present the corpus structure, describe the sampling methodology, explain the levels of annotation, and give some basic statistics. On the basis of data from the corpus, existing language technologies for Croatian are adapted for use in a platform that facilitates text production for speakers with language disorders. To this end, several analyses of the corpus data and a basic evaluation of the developed technologies are presented.

Details

Paper ID

lrec2016-main-513

Pages

pp. 3220-3226

DOI

10.63317/2a5j6mxjotom

BibKey

stefanec-etal-2016-croatian

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Vanja Štefanec
Nikola Ljubešić
Jelena Kuvač Kraljević
