Finely Tuned, 2 Billion Token Based Word Embeddings for Portuguese
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Abstract
A distributional semantics model, also known as word embeddings, is a major asset for any language, as research results consistently reported in the literature show that it is instrumental in improving the performance of a wide range of applications and processing tasks for that language. In this paper, we describe the development of an advanced distributional model for Portuguese, with the largest vocabulary and the best evaluation scores published so far. This model was made possible by new language resources we recently developed: a much larger training corpus than previously available and a more sophisticated evaluation supported by new, more fine-grained evaluation tasks and data sets. We also indicate how the new language resource reported here is being distributed and where it can be obtained for free under a highly permissive license.