Exploiting a Large Strongly Comparable Corpus

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for the improvement of data-driven machine translation systems for this language pair. Competing approaches were explored for the alignment of comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.

Details

Paper ID

lrec2016-main-560

Pages

pp. 3523-3529

DOI

10.63317/52vj3re8w7sk

BibKey

etchegoyhen-etal-2016-exploiting

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23-28 May 2016

Authors

Thierry Etchegoyhen
Andoni Azpeitia
Naiara Pérez
