
Sequence Reducible Holdout Loss for Language Model Pretraining

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI: 10.63317/4r6d7kkqzhmh

Abstract

Data selection techniques, which adaptively select datapoints inside the training loop, have demonstrated empirical benefits in reducing the number of gradient steps needed to train neural models. However, these techniques have so far largely been applied to classification. In this work, we study their applicability to language model pretraining, a highly time-intensive task. We propose a simple modification to an existing data selection technique (reducible hold-out loss training) to adapt it to the sequence losses typical in language modeling. We experiment on both autoregressive and masked language modeling, and show that applying data selection to pretraining offers notable benefits, including a 4.3% reduction in the total number of steps and a 21.5% average reduction in steps to reach an intermediate target perplexity over the course of pretraining an autoregressive language model. Further, language models trained with data selection demonstrate significantly better generalization on out-of-domain datasets, with a 7.9% reduction in the total number of steps and a 23.2% average reduction in steps to an intermediate target perplexity.
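
As an informal illustration of the idea described in the abstract, the sketch below shows one way sequence-level reducible holdout loss scoring and batch selection could be implemented for an autoregressive language model. It assumes PyTorch with HuggingFace-style causal LM outputs (a .logits attribute), averages per-token losses over each sequence, and keeps the highest-scoring fraction of a candidate batch; the function names, the averaging choice, and the keep_fraction parameter are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def sequence_reducible_holdout_loss(train_model, holdout_model, input_ids, attention_mask):
    # Score each candidate sequence by (current training-model loss) - (holdout-model loss).
    # Per-token cross-entropy is averaged over non-padding tokens of each sequence;
    # this aggregation is an assumption for illustration.
    def per_sequence_loss(model):
        with torch.no_grad():
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # Shift for next-token prediction (autoregressive LM).
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        shift_mask = attention_mask[:, 1:].float()
        token_loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="none",
        ).view(shift_labels.shape)
        return (token_loss * shift_mask).sum(dim=1) / shift_mask.sum(dim=1).clamp(min=1)

    return per_sequence_loss(train_model) - per_sequence_loss(holdout_model)

def select_batch(train_model, holdout_model, candidate_batch, keep_fraction=0.5):
    # Keep the top keep_fraction of candidate sequences by reducible holdout loss.
    scores = sequence_reducible_holdout_loss(
        train_model, holdout_model,
        candidate_batch["input_ids"], candidate_batch["attention_mask"],
    )
    k = max(1, int(keep_fraction * scores.size(0)))
    top = torch.topk(scores, k).indices
    return {key: val[top] for key, val in candidate_batch.items()}

In such a setup, a large candidate batch would be scored each step and only the selected subset used for the gradient update, trading extra forward passes for fewer total training steps.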

Details

Paper ID
lrec2024-main-1281
Pages
pp. 14705-14716
BibKey
thirukovalluru-etal-2024-sequence
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Raghuveer Thirukovalluru
  • Nicholas Monath
  • Bhuwan Dhingra
  • Sam Wiseman
