Alignment-based Profiling of Europarl Data in an English-Swedish Parallel Corpus

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/2gtpvb75tntx

Abstract

This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same parallel corpus. We first describe our method for comparison, which is based on manually reviewed word alignments. We investigate the relative frequencies of different types of correspondence, including null alignments, many-to-one correspondences and crossings. In addition, both halves of the parallel corpus have been annotated with morpho-syntactic information. The syntactic annotation uses labelled dependency relations. Thus, we can see how different types of correspondence are distributed across parts of speech and compute correspondences at the structural level. Although two of the other subcorpora contain fiction, the Europarl part turns out to have the highest proportion of many types of restructuring, including additions, deletions, long-distance reorderings and dependency reversals. We explain this by the fact that the majority of Europarl segments are parallel translations rather than source texts and their translations.
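
As a rough illustration of the kind of measures the abstract mentions, the sketch below counts null alignments, many-to-one correspondences and crossings for a single aligned sentence pair. It is not the paper's implementation: the function name `profile_alignment`, the representation of links as (source index, target index) pairs, the definition of a crossing as two links whose source and target orders disagree, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def profile_alignment(n_src, n_tgt, links):
    """Count correspondence types for one sentence pair.

    n_src, n_tgt: number of source and target tokens.
    links: iterable of (src_index, tgt_index) pairs, 0-based (assumed format).
    """
    links = sorted(set(links))
    src_degree = Counter(s for s, _ in links)
    tgt_degree = Counter(t for _, t in links)

    # Tokens on either side that take part in no link.
    null_src = sum(1 for i in range(n_src) if src_degree[i] == 0)
    null_tgt = sum(1 for j in range(n_tgt) if tgt_degree[j] == 0)

    # Target tokens linked to more than one source token, and the reverse.
    many_to_one = sum(1 for j in range(n_tgt) if tgt_degree[j] > 1)
    one_to_many = sum(1 for i in range(n_src) if src_degree[i] > 1)

    # Two links cross when their source order and target order disagree.
    crossings = sum(
        1
        for (s1, t1), (s2, t2) in combinations(links, 2)
        if s1 < s2 and t1 > t2
    )

    return {
        "null_src": null_src,
        "null_tgt": null_tgt,
        "many_to_one": many_to_one,
        "one_to_many": one_to_many,
        "crossings": crossings,
        # Relative frequency of unaligned tokens over both sides.
        "null_rate": (null_src + null_tgt) / (n_src + n_tgt),
    }


# Toy sentence pair (invented data, not taken from the corpus).
stats = profile_alignment(4, 4, {(0, 0), (1, 2), (2, 1), (3, 1)})
print(stats)
```

Summing such counts over all segments of a subcorpus and dividing by the number of tokens or links would give relative frequencies that can be compared across subcorpora, which is the kind of comparison the abstract describes.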

Details

Paper ID
lrec2010-main-129
Pages
N/A
BibKey
ahrenberg-2010-alignment
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17-23 May 2010

Authors

  • Lars Ahrenberg
