LREC 2022 Main

gaBERT — an Irish Language Model

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI: 10.63317/2c485c2jd2yz

Abstract

The BERT family of neural language models has become highly popular due to its ability to provide rich, context-sensitive token encodings of text sequences that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare gaBERT to multilingual BERT (mBERT) and the monolingual Irish WikiBERT, and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary sizes and choices of subword tokenisation model affect downstream performance. We compare the results of fine-tuning gaBERT and mBERT on the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.

Details

Paper ID
lrec2022-main-511
Pages
pp. 4774-4788
BibKey
barry-etal-2022-gabert
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20–25 June 2022

Authors

  • James Barry
  • Joachim Wagner
  • Lauren Cassidy
  • Alan Cowap
  • Teresa Lynn
  • Abigail Walsh
  • Mícheál J. Ó Meachair
  • Jennifer Foster

Links