Back to Main Conference 2022
LREC 2022main

Improving Large-scale Language Models and Resources for Filipino

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/4rmdofoy9bhf

Abstract

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across three classification tasks with varying difficulty.

Details

Paper ID
lrec2022-main-703
Pages
pp. 6548-6555
BibKey
cruz-cheng-2022-improving
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 25 June 2022

Authors

  • JC

    Jan Christian Blaise Cruz

  • CC

    Charibeth Cheng

Links