
Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/24puen8nstzh

Abstract

This paper investigates two complementary paradigms for predicting machine translation quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into machine translation (MT) workflows is reshaping the research landscape, yet the impact of this shift on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine machine translation post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and positional heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our analysis yields three primary findings: (1) On the source side, the predictive power of difficulty metrics is highly contingent on the reference metric used; features that strongly correlate with COMET (e.g., segment length, neural predictors) show much weaker correlation with TER. (2) On the candidate side, we find a significant mismatch between QE model rankings and final human-adjudicated quality, and further show that modern QE metrics are significantly more aligned with the quality of traditional neural MT outputs than with those from general-purpose LLMs. (3) While we confirm a statistically significant positional bias in document-level LLM translation (i.e., the tendency for translation quality to degrade for segments occurring later in a document), its practical impact on translation quality appears to be negligible.
These findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
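The abstract's core measurement — agreement between a predictor's ranking and a gold-standard ranking via Kendall's rank correlation — can be sketched as follows. This is an illustrative example, not the authors' code: the scores below are hypothetical, and TER is negated so that higher values mean better quality for both rankings.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

# Hypothetical QE scores and negated TER (higher = better) for nine
# translation hypotheses of one source segment.
qe_scores = [0.81, 0.74, 0.90, 0.66, 0.79, 0.85, 0.71, 0.88, 0.62]
neg_ter   = [-0.12, -0.30, -0.08, -0.41, -0.22, -0.15, -0.35, -0.10, -0.45]

print(f"Kendall tau = {kendall_tau(qe_scores, neg_ter):.3f}")  # → 0.944
```

A tau near 1 means the QE model orders the nine hypotheses almost exactly as the gold metric does; in practice `scipy.stats.kendalltau` handles ties and significance testing.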

Details

Paper ID
lrec2026-main-688
Pages
pp. 8740-8755
BibKey
marmonier-etal-2026-hindsight
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Malik Marmonier
  • Benoît Sagot
  • Rachel Bawden
