Computing Semantic Similarity for Aligning Bilingual Semi-parallel Texts: A Case Study

Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

DOI:10.63317/37kekuueqcz6

Abstract

Semi-parallel texts are versions of the same text that have been edited to some extent by authors, translators, or others. They are especially relevant in the social sciences and in literary genres. In this paper, we consider the bilingual (English/German) variant of the problem. The philosopher Hannah Arendt, for example, wrote political essays that often exist in multiple versions and in both languages. She repeatedly modified her texts, added or deleted parts, and framed topics differently depending on the target audience. For researchers to explore the history of such material in detail and at scale, automatic alignment (i.e., finding the best match of semantically similar sentences) is a valuable preprocessing step. We compare the performance of a range of methods for this task, all based on computing semantic similarity. We present the results and conduct a qualitative error analysis to identify recurring sources of error.
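
To illustrate what "finding the best match of semantically similar sentences" can look like in practice, here is a minimal sketch. It is not the method evaluated in the paper: it assumes each sentence has already been mapped to a vector by some multilingual encoder, and it uses cosine similarity with a hypothetical cut-off `threshold`. The toy vectors and the names `cosine`, `align`, `english_vecs`, and `german_vecs` are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.5):
    """For each source sentence, pick the most similar target sentence.

    Returns (source_index, target_index) pairs. The target index is None
    when no target exceeds the threshold, which models sentences that
    were added or deleted in one version.
    """
    alignment = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = None, threshold
        for j, v in enumerate(tgt_vecs):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        alignment.append((i, best_j))
    return alignment


# Toy stand-ins for sentence embeddings (hypothetical values, not real model output).
english_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
german_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0]]

print(align(english_vecs, german_vecs))
```

With these toy vectors the first two English sentences are matched to the second and first German sentence respectively, and the third stays unaligned. A greedy best-match like this allows several source sentences to map to the same target and ignores sentence order; it is meant only to make the task definition concrete.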

Details

Paper ID
lrec2026-ws-bucc-03
Pages
pp. 9-19
BibKey
frenzel-etal-2026-computing
Editors
Reinhard Rapp, Ayla Rigouts Terryn, Serge Sharoff, Pierre Zweigenbaum
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Steffen Frenzel

  • Maximilian Krupop

  • Manfred Stede
