LREC-COLING 2024 (Main)

FLOR: On the Effectiveness of Language Adaptation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/395bxjhot8he

Abstract

Large language models have amply proven their great capabilities, both in downstream tasks and real-life settings. However, low- and mid-resource languages do not have access to the necessary means to train such models from scratch, and often have to rely on multilingual models despite being underrepresented in the training data. For the particular case of the Catalan language, we show that continued pre-training with vocabulary adaptation is a better alternative for getting the most out of already pre-trained models, even if these have not seen any Catalan data during their pre-training phase. We curate a 26B-token corpus and use it to further pre-train BLOOM, giving rise to the FLOR models. We perform an extensive evaluation to assess the effectiveness of our method, obtaining consistent gains across Catalan and Spanish tasks. The models, training data, and evaluation framework are made freely available under permissive licenses.
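For readers unfamiliar with the approach, the sketch below illustrates what vocabulary adaptation before continued pre-training can look like using the Hugging Face transformers stack. The base checkpoint name, corpus path, vocabulary size, and mean-based embedding initialization are illustrative assumptions, not the exact recipe used for FLOR.

```python
# Minimal sketch of vocabulary adaptation prior to continued pre-training.
# Assumptions: Hugging Face transformers, an example BLOOM checkpoint, a
# hypothetical corpus file, and a mean-init strategy for unseen tokens.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_name = "bigscience/bloom-7b1"  # example base checkpoint, not necessarily FLOR's
old_tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# 1) Train a new tokenizer on the target-language corpus (hypothetical file path).
def corpus_iterator():
    with open("catalan_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line

new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=50_257)

# 2) Build new input embeddings: copy vectors for tokens shared with the old
#    vocabulary, initialize the remaining rows with the mean of the old embeddings.
old_emb = model.get_input_embeddings().weight.data
mean_vec = old_emb.mean(dim=0)
new_emb = mean_vec.unsqueeze(0).repeat(len(new_tokenizer), 1).clone()
for token, new_id in new_tokenizer.get_vocab().items():
    old_id = old_tokenizer.convert_tokens_to_ids(token)
    if old_id is not None and old_id != old_tokenizer.unk_token_id:
        new_emb[new_id] = old_emb[old_id]

# Resize the (tied) embedding matrices and load the adapted weights.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_emb)

# 3) Continue pre-training on the target-language corpus with the standard
#    causal language modeling objective (training loop omitted here).
```

Re-using the embeddings of tokens shared between the two vocabularies preserves knowledge from the original model, while the training budget is spent on continued pre-training over the target-language corpus.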

Details

Paper ID
lrec2024-main-0650
Pages
pp. 7377-7388
BibKey
da-dalt-etal-2024-flor
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Severino Da Dalt

  • Joan Llop

  • Irene Baucells

  • Marc Pàmies

  • Yishi Xu

  • Aitor Gonzalez-Agirre

  • Marta Villegas
