Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

This paper presents work in progress on a multilayered, syntactically and semantically annotated text corpus for Latvian. The broad application area we address is natural language understanding (NLU); the more specific applications are abstractive text summarization and knowledge base population, which the project's industrial partner, the Latvian information agency LETA, requires for automating various media monitoring processes. Both the multilayered corpus and the downstream applications are anchored in cross-lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet, PropBank and Abstract Meaning Representation (AMR). In this paper, we focus in particular on the consecutive annotation of the treebank and framebank layers. We also draw links to the ultimate AMR layer and to the auxiliary named entity and coreference annotation layers. Since we are aiming at a medium-sized yet general-purpose corpus for a less-resourced language, an important aspect we consider is the variety and balance of the corpus in terms of genres, authors and lexical units.

Details

Paper ID

lrec2018-main-714

Pages

N/A

DOI

10.63317/3rob6twf9o8r

BibKey

gruzitis-etal-2018-creation

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

79-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7-12 May 2018

Authors

Normunds Gruzitis
Lauma Pretkalnina
Baiba Saulite
Laura Rituma
Gunta Nespore-Berzkalne
Arturs Znotins
Peteris Paikens
