
Sentence Selection Strategies for Distilling Word Embeddings from BERT

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/4e4c7ea2pzee

Abstract

Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyse a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.

Details

Paper ID
lrec2022-main-277
Pages
pp. 2591-2600
BibKey
wang-etal-2022-sentence
Editors
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 - 25 June 2022

Authors

  • Yixiao Wang

  • Zied Bouraoui

  • Luis Espinosa Anke

  • Steven Schockaert
