QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

This work presents parallel corpora automatically annotated with several NLP tools, covering lemmatization and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference resolution. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is the common language across all the parallel corpora, with translations in five languages: Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, and report annotation statistics for each language. These new resources are freely available and are intended to support research on semantic processing for machine translation and cross-lingual transfer.

Details

Paper ID

lrec2016-main-483

Pages

pp. 3023-3030

DOI

10.63317/5d9kswtyumzu

BibKey

otegi-etal-2016-qtleap

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23-28 May 2016

Authors

Arantxa Otegi
Nora Aranberri
Antonio Branco
Jan Hajič
Martin Popel
Kiril Simov
Eneko Agirre
Petya Osenova
Rita Pereira
João Silva
Steven Neale
