COME-ALPs: Coreference Annotation with MErging Heuristics Using ALignment-based Projection in Parallel Corpora
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Multilingual parallel datasets annotated with discourse phenomena such as coreference are a rare resource. Such datasets are useful for evaluating models on NLP tasks that take long-range context into account, as shown by the large body of literature published in the last few years on, e.g., Context-Aware Neural Machine Translation (CA-NMT). Inspired by resources published in previous work, in this paper we propose an automated procedure for annotating multilingual parallel data with coreference. Using accurate alignment and coreference annotation tools, we project the annotation from English data, where tools are typically more accurate, to one or more target languages. We apply consistency constraints to obtain more accurate annotations on both the source and target sides. Using our procedure, we generated two new resources that can be used for evaluating CA-NMT models. The first is built from the well-known TED Talks data released for the IWSLT17 shared task, where we project the annotation from English to target languages as diverse as French, German and Chinese. The second is derived from the WMT24 shared task and consists of news-domain data in the same set of target languages. We release these resources, together with the code framework for applying our annotation procedure, to the community.
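To make the idea of alignment-based projection concrete, the following is a minimal sketch and not the released framework. It assumes coreference chains are given as inclusive token spans on the English side and that word alignments are given as (source index, target index) pairs. The names `project_span` and `project_clusters`, and the rule of keeping a chain only when at least two of its mentions survive projection, are hypothetical stand-ins for the tools and consistency constraints mentioned above.

```python
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def project_span(span: Span, alignment: Set[Tuple[int, int]]) -> Optional[Span]:
    """Map a source span to the smallest target span covering its aligned tokens."""
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None  # no aligned token: the mention cannot be projected
    return (min(targets), max(targets))


def project_clusters(
    clusters: Dict[int, List[Span]], alignment: Set[Tuple[int, int]]
) -> Dict[int, List[Span]]:
    """Project every coreference chain from the source to the target side."""
    projected: Dict[int, List[Span]] = {}
    for cluster_id, spans in clusters.items():
        mapped = [p for p in (project_span(s, alignment) for s in spans) if p is not None]
        # Assumed consistency constraint (illustrative only): a chain is kept
        # only if at least two of its mentions could be projected.
        if len(mapped) >= 2:
            projected[cluster_id] = mapped
    return projected


# Toy example: "The cat sat because it was tired"
#           -> "Le chat s'est assis parce qu'il était fatigué"
alignment = {(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)}
clusters = {0: [(0, 1), (4, 4)]}  # "The cat" <- "it"
result = project_clusters(clusters, alignment)
print(result)
```

In this toy example the chain linking "The cat" and "it" is mapped to "Le chat" (tokens 0 to 1) and "qu'il" (token 5). A real pipeline would obtain the chains and alignments from automatic tools and would need to handle noisy or missing alignments more carefully.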