
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/3eqtu58tsqay

Abstract

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb.
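As a rough illustration of the Byte-Pair Encoding step that underlies BPEmb, the sketch below learns BPE merge operations from a toy word-frequency table. This is not the authors' implementation; the function name and the toy vocabulary are illustrative assumptions.

```python
import re
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge operations from a word-frequency dict.

    words: dict mapping space-separated symbol sequences to counts,
    e.g. {"l o w": 5}. Returns the list of learned merge pairs.
    (Illustrative sketch, not the BPEmb codebase.)
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace the most frequent pair with its merged symbol,
        # matching only at whole-symbol boundaries.
        merged = " ".join(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(merged) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq
                 for word, freq in vocab.items()}
    return merges
```

At application time, the learned merges are replayed in order on new text, so any string can be segmented into subword units without language-specific tokenization.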

Details

Paper ID
lrec2018-main-473
Pages
N/A
BibKey
heinzerling-strube-2018-bpemb
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7–12 May 2018

Authors

  • Benjamin Heinzerling
  • Michael Strube
