LREC 2026 (main conference)

ACAData: Parallel Dataset of Academic Data for Machine Translation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4fkj9gvuqsdd

Abstract

We present ACAData, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-Train, which contains approximately 1.5 million human-generated paragraph pairs across 12 languages, and ACAD-Bench, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its usefulness, we fine-tune two Large Language Models (LLMs) on ACAD-Train and benchmark them on ACAD-Bench against specialized machine-translation systems, general-purpose open-weight LLMs, and several large-scale proprietary models. Experimental results show that fine-tuning on ACAD-Train improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for the 7B and 2B models, respectively, while also improving long-context translation in the general domain by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models in the academic translation domain. By releasing ACAD-Train, ACAD-Bench, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.

Details

Paper ID
lrec2026-main-671
Pages
pp. 8498-8519
BibKey
lacunza-etal-2026-acadata
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Iñaki Lacunza

  • Javier Garcia Gilabert

  • Francesca De Luca Fornaciari

  • Javier Aula-Blasco

  • Aitor Gonzalez-Agirre

  • Maite Melero

  • Marta Villegas
