SzegedKoref: A Hungarian Coreference Corpus

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

In this paper we introduce SzegedKoref, a Hungarian corpus in which coreference relations are manually annotated. For annotation, we selected texts from the Szeged Treebank, the largest Hungarian treebank, which is manually annotated at several linguistic layers. The corpus contains approximately 55,000 tokens and 4,000 sentences. Given its size, the corpus can be used to train and test machine-learning-based coreference resolution systems, which we would like to implement in the near future. We present the annotated texts, describe the annotated categories of anaphoric relations, report on the annotation process, and offer several examples of each annotated category. We also discuss two linguistic phenomena that are important characteristics of Hungarian coreference relations: phonologically empty pronouns and pronouns referring to subordinate clauses.

Details

Paper ID

lrec2018-main-061

Pages

N/A

DOI

10.63317/5hghr677923d

BibKey

vincze-etal-2018-szegedkoref

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Veronika Vincze
Klára Hegedűs
Alex Sliz-Nagy
Richárd Farkas
