QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/5d9kswtyumzu

Abstract

This work presents parallel corpora automatically annotated with several NLP tools, covering lemmatization and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference resolution. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is common to all the parallel corpora, with translations in five languages: Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, and report annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

Details

Paper ID
lrec2016-main-483
Pages
pp. 3023-3030
BibKey
otegi-etal-2016-qtleap
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Arantxa Otegi
  • Nora Aranberri
  • Antonio Branco
  • Jan Hajič
  • Martin Popel
  • Kiril Simov
  • Eneko Agirre
  • Petya Osenova
  • Rita Pereira
  • João Silva
  • Steven Neale