Sentence Selection Strategies for Distilling Word Embeddings from BERT

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

Abstract

Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyse a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.

Details

Paper ID

lrec2022-main-277

Pages

pp. 2591-2600

DOI

10.63317/4e4c7ea2pzee

BibKey

wang-etal-2022-sentence

Editors

Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-38-2

Conference

Thirteenth Language Resources and Evaluation Conference

Location

Marseille, France

Date

20 - 25 June 2022

Authors

Yixiao Wang
Zied Bouraoui
Luis Espinosa Anke
Steven Schockaert
