Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

This paper presents how a state-of-the-art SMT system is enriched with extra in-domain parallel corpora extracted from Wikipedia. We collect corpora from parallel titles and from parallel fragments in comparable Wikipedia articles. We carried out an evaluation with two objectives: assessing the quality of the extracted data and measuring the improvement due to domain adaptation. We think this approach can be very useful for languages with a limited amount of parallel corpora, where in-domain data is crucial to improving the performance of MT systems. In experiments on the Spanish-English language pair, the approach improves a baseline trained on the Europarl corpus by more than 2 BLEU points when translating in the Computer Science domain.

Details

Paper ID

lrec2016-main-351

Pages

pp. 2209-2213

DOI

10.63317/372zhzs7is3v

BibKey

labaka-etal-2016-domain

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Gorka Labaka
Iñaki Alegria
Kepa Sarasola
