LREC 2026 (main conference)

ParaCLEAN: Improving Translation Quality through Systematic Parallel Data Cleaning

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/36e3vfurjna4

Abstract

Parallel corpora often contain significant noise, particularly in low-resource settings where both collected and synthetic data are combined. We present ParaCLEAN, a modular pipeline for cleaning parallel data that integrates embeddings-based filtering, language identification, deduplication, and normalisation. Experiments on Catalan to Japanese translation demonstrate that ParaCLEAN improves data quality and downstream MT performance. Ablation studies highlight the contribution of each step. ParaCLEAN is lightweight, reproducible, and extensible for diverse language pairs.
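As an illustration of the modular design the abstract describes, the following is a minimal sketch of how cleaning steps might be chained, not ParaCLEAN's actual implementation. All function names here are hypothetical. The script check is only a crude stand-in for a real language identifier, and the similarity scorer is assumed to be supplied by the caller (for example, cosine similarity between multilingual sentence embeddings); the abstract does not specify which models or thresholds the authors use.

```python
import re
import unicodedata
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                      # (source sentence, target sentence)
Step = Callable[[List[Pair]], List[Pair]]   # every cleaning step has this shape


def normalise(pairs: List[Pair]) -> List[Pair]:
    """Apply Unicode NFKC normalisation and collapse runs of whitespace."""
    def clean(text: str) -> str:
        return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()
    return [(clean(src), clean(tgt)) for src, tgt in pairs]


def drop_empty(pairs: List[Pair]) -> List[Pair]:
    """Discard pairs where either side is empty."""
    return [(src, tgt) for src, tgt in pairs if src and tgt]


def deduplicate(pairs: List[Pair]) -> List[Pair]:
    """Keep the first occurrence of each exact (source, target) pair."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept


def has_japanese(text: str) -> bool:
    """Crude script check: any hiragana, katakana, or CJK ideograph."""
    return re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", text) is not None


def script_filter(pairs: List[Pair]) -> List[Pair]:
    """Assumed stand-in for language identification on a Catalan-Japanese
    corpus: the source must contain no Japanese script, the target must."""
    return [(s, t) for s, t in pairs if not has_japanese(s) and has_japanese(t)]


def similarity_filter(score: Callable[[str, str], float], threshold: float) -> Step:
    """Build a step that keeps pairs whose caller-supplied score
    meets or exceeds the threshold."""
    def step(pairs: List[Pair]) -> List[Pair]:
        return [(s, t) for s, t in pairs if score(s, t) >= threshold]
    return step


def run_pipeline(pairs: List[Pair], steps: List[Step]) -> List[Pair]:
    """Apply each cleaning step in order."""
    for step in steps:
        pairs = step(pairs)
    return pairs
```

Because every step shares one signature, steps can be reordered or removed individually, which is also how an ablation over the pipeline could be set up: run `run_pipeline` with one step left out of the list and compare the resulting corpus.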

Details

Paper ID
lrec2026-main-527
Pages
pp. 6630-6640
BibKey
mash-etal-2026-paraclean
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Audrey Mash

  • Ella Paulina Bohman

  • Maite Melero

Links