Back to Main Conference 2024
LREC-COLING 2024main

Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/496bh35o3ybj

Abstract

The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation (NMT). Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the “Align-to-Distill” (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module (AAM) in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De→Dsb and WMT-2014 En→De, respectively, compared to Transformer baselines.The code and data are available at https://github.com/ncsoft/Align-to-Distill.

Details

Paper ID
lrec2024-main-0064
Pages
pp. 722-732
BibKey
jin-etal-2024-align
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • HJ

    Heegon Jin

  • SS

    Seonil Son

  • JP

    Jemin Park

  • YK

    Youngseok Kim

  • HN

    Hyungjong Noh

  • YL

    Yeonsoo Lee

Links