Back to Main Conference 2012
LREC 2012main

The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/2q4x6wm48kzn

Abstract

In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus called ALLEGRA contains press releases automatically gathered from the website of the cantonal administration of Grisons. Texts have been preprocessed and aligned with a current state-of-the-art sentence aligner. The corpus is one of the first of its kind, and can be of great interest, particularly for the creation of natural language processing resources and tools for Romansh. We illustrate the use of such a trilingual resource for automatic induction of bilingual lexicons, which is a real challenge for under-represented languages. We induce a bilingual lexicon for German-Romansh by phrase alignment and evaluate the resulting entries with the help of a reference lexicon. We then show that the use of the third language of the corpus ― Italian ― as a pivot language can improve the precision of the induced lexicon, without loss in terms of quality of the extracted pairs.

Details

Paper ID
lrec2012-main-400
Pages
pp. 2890-2896
BibKey
scherrer-cartoni-2012-trilingual
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21 May 2012 27 May 2012

Authors

  • YS

    Yves Scherrer

  • BC

    Bruno Cartoni

Links