LREC 2020 workshop

Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data

Proceedings of the 4th Workshop on Computational Approaches to Code Switching

DOI:10.63317/5jgmpd9mdfwe

Abstract

Code-mixed text is abundant, especially on social media, and poses a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e., embeddings trained on code-mixed text. Because training embeddings requires large corpora of code-mixed text, we describe a method for synthesizing a code-mixed corpus, grounded in the literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.

Details

Paper ID
lrec2020-ws-calcs-4
Pages
pp. 26-35
BibKey
rizal-stymne-2020-evaluating
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Date
11–16 May 2020

Authors

  • Arra’Di Nur Rizal

  • Sara Stymne

Links