
NP Alignment in Bilingual Corpora

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/2xe94d56x8i3

Abstract

Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than for words. We created a simple gold standard for English-Hungarian from Orwell’s 1984 (chosen because it already exists in manually verified POS-tagged format in many languages, thanks to the Multext and Multext-East projects) by manually verifying the automatically generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximal NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results are highly impacted by the quality of the NP chunking, we tested our alignment algorithms both on real-world (machine-obtained) chunkings, where results are in the .35 range for the baseline algorithm, which propagates GIZA++ word alignments to the NP level, and on idealized (manually obtained) chunkings, where the baseline reaches .4 and our current system reaches .64.
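The abstract says only that the baseline propagates GIZA++ word alignments to the NP level and that results are reported as F-measure. The Python sketch below is one plausible reading of that idea, not the authors' implementation. The function names, the end-exclusive span format, and the majority-vote rule with ties broken toward the lower target index are all assumptions made for illustration.

```python
from collections import Counter


def propagate_to_nps(word_links, src_nps, tgt_nps):
    """Align NP spans by voting over word-alignment links (assumed rule).

    word_links: iterable of (src_token_index, tgt_token_index) pairs,
                e.g. read from a word aligner's output
    src_nps, tgt_nps: lists of (start, end) token spans, end exclusive
    Returns a set of (src_np_index, tgt_np_index) pairs.
    """
    def span_of(token, spans):
        # Index of the span containing the token, or None if outside all NPs.
        for i, (start, end) in enumerate(spans):
            if start <= token < end:
                return i
        return None

    # Count word links connecting each (source NP, target NP) pair.
    votes = Counter()
    for s, t in word_links:
        i, j = span_of(s, src_nps), span_of(t, tgt_nps)
        if i is not None and j is not None:
            votes[(i, j)] += 1

    # Each source NP takes the target NP with the most links;
    # sorting makes ties deterministic (lowest target index wins).
    best = {}
    for (i, j), n in sorted(votes.items()):
        if i not in best or n > best[i][1]:
            best[i] = (j, n)
    return {(i, j) for i, (j, _) in best.items()}


def f_measure(predicted, gold):
    """Balanced F-measure over sets of NP alignment pairs."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example with invented indices: two NPs on each side.
links = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 3)}
src_spans = [(0, 3), (4, 6)]
tgt_spans = [(0, 2), (3, 4)]
aligned = propagate_to_nps(links, src_spans, tgt_spans)
```

In this toy example the link `(3, 2)` is ignored because neither token lies inside an NP. The sketch also makes visible why chunking quality matters so much: a wrong span boundary changes which links count toward which NP pair.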

Details

Paper ID
lrec2010-main-364
Pages
N/A
BibKey
recski-etal-2010-np
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 to 23 May 2010

Authors

  • Gábor Recski
  • András Rung
  • Attila Zséder
  • András Kornai
