Computing Semantic Similarity for Aligning Bilingual Semi-parallel Texts: A Case Study

Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

Abstract

Semi-parallel texts are versions of the same text that have been edited to some extent by authors, translators, or others. They are especially relevant in the social sciences and in literary genres. In this paper, we consider the bilingual (English/German) variant of the alignment problem. The philosopher Hannah Arendt, for example, wrote political essays that often exist in multiple versions and in both languages. She repeatedly modified her texts, added or deleted parts, and framed topics differently depending on the target audience. For researchers to explore the history of such material in detail and at scale, automatic alignment (i.e., finding the best match between semantically similar sentences) is a valuable preprocessing step. We compare the performance of a range of methods for this task, based on computing semantic similarity. We present the results and conduct a qualitative error analysis to identify recurring sources of error.
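
To make the notion of alignment by semantic similarity concrete, the following is a minimal sketch and not one of the methods compared in this paper. It assumes that each sentence has already been mapped to a vector in a shared cross-lingual space. The toy vectors, the function names `cosine` and `align`, and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.5):
    """For each source sentence, return the index of the most similar
    target sentence, or None if no target reaches the threshold.

    The threshold is an illustrative assumption: it lets a sentence stay
    unaligned, as happens when a passage was added or deleted in one version.
    """
    alignment = []
    for u in src_vecs:
        scores = [cosine(u, v) for v in tgt_vecs]
        if not scores:
            alignment.append(None)
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        alignment.append(best if scores[best] >= threshold else None)
    return alignment


# Hypothetical toy embeddings: three English sentences, two German sentences.
english = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
german = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0]]

result = align(english, german)
print(result)  # [1, 0, None]
```

In this toy case the first two English sentences are matched to reordered German counterparts, while the third has no sufficiently similar partner and remains unaligned. A greedy best match of this kind ignores sentence order and allows several source sentences to select the same target, so it should be read only as an illustration of the task.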