Heuristic Word Alignment with Parallel Phrases
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)
Abstract
We present a heuristic method for word alignment, the task of identifying corresponding words in parallel text. The method is based on parallel phrases extracted from manually word-aligned sentence pairs. Word alignment is performed by matching parallel phrases to new sentence pairs and adding the word links from each parallel phrase to the words in the matching sentence segment. Experiments on an English-Swedish parallel corpus showed that the heuristic phrase-based method produced word alignments with high precision but low recall. To improve recall, phrases were generalized by replacing words with part-of-speech categories. The generalization improved recall, but at the expense of precision. Two filtering strategies were investigated to prune the large set of generalized phrases. Finally, the phrase-based method was compared to statistical word alignment with GIZA++. We found that although statistical alignments based on large datasets will outperform phrase-based word alignment, a combination of phrase-based and statistical word alignment outperformed pure statistical alignment in terms of Alignment Error Rate (AER).
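To make the matching step concrete, the following is a minimal Python sketch of how a stored parallel phrase might be matched against a new sentence pair and its word links projected onto the sentence. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `ParallelPhrase`, `find_spans`, `align_with_phrases`, and `aer` are hypothetical, tokens are assumed to be (word, part-of-speech) pairs, a phrase item is assumed to match a token if it equals either the word or the tag, and every matching position contributes links. The `aer` function uses the commonly used definition of Alignment Error Rate over sure and possible gold links, which may differ in detail from the evaluation setup used here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag); tag set is an assumption
Link = Tuple[int, int]    # (source index, target index)


@dataclass(frozen=True)
class ParallelPhrase:
    """Hypothetical container for a phrase pair with phrase-internal links."""
    src: Tuple[str, ...]      # each item is a word or a POS category
    tgt: Tuple[str, ...]
    links: FrozenSet[Link]    # indices are relative to the phrase start


def find_spans(tokens: List[Token], pattern: Tuple[str, ...]) -> List[int]:
    """Return every start offset where the pattern matches the token sequence.

    A pattern item matches a token if it equals the word or the POS tag
    (assumption: this is how generalized phrases are matched).
    """
    m = len(pattern)
    starts = []
    for k in range(len(tokens) - m + 1):
        window = tokens[k:k + m]
        if all(item in (word, tag) for item, (word, tag) in zip(pattern, window)):
            starts.append(k)
    return starts


def align_with_phrases(src: List[Token], tgt: List[Token],
                       phrases: List[ParallelPhrase]) -> Set[Link]:
    """Project the links of every matching phrase onto the sentence pair."""
    links: Set[Link] = set()
    for phrase in phrases:
        for s in find_spans(src, phrase.src):
            for t in find_spans(tgt, phrase.tgt):
                links.update((s + i, t + j) for i, j in phrase.links)
    return links


def aer(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Alignment Error Rate: 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    possible = possible | sure          # sure links count as possible links
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / denom


# Illustrative data only; not taken from the corpus used in the experiments.
lexical = ParallelPhrase(("red", "house"), ("röda", "huset"),
                         frozenset({(0, 0), (1, 1)}))
generalized = ParallelPhrase(("ADJ", "house"), ("ADJ", "huset"),
                             frozenset({(0, 0), (1, 1)}))

src_sentence = [("the", "DET"), ("big", "ADJ"), ("house", "NOUN")]
tgt_sentence = [("det", "DET"), ("stora", "ADJ"), ("huset", "NOUN")]

lexical_links = align_with_phrases(src_sentence, tgt_sentence, [lexical])
generalized_links = align_with_phrases(src_sentence, tgt_sentence, [generalized])
```

In this toy example the fully lexical phrase does not match the new sentence pair, while the generalized phrase does, which mirrors the recall gain described above. Because a category matches any word carrying that tag, generalized phrases can also fire in segments that are not translations of each other, which is one way the loss in precision can arise.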