Introducing the National Corpus of Irish Project

Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

Abstract

This paper introduces the National Corpus of Irish, an initiative to develop a large national corpus of written and spoken contemporary Irish as well as related specialised corpora. The newly-compiled corpora will be hosted at corpas.ie, in what will become a hub for corpus-based research on the Irish language. Users will be able to search the corpora and download data generated during the project from the corpas.ie website and appropriate third-party repositories. Corpus 1 will be a balanced general-purpose corpus containing c.155m words. Corpus 2 will be a written corpus consisting of c100m words. Corpus 3 will be a spoken corpus containing 6.5m words. Corpus 4 will be a monitor corpus with a target size of 1m words per year from 2000 onwards. Token, lemma, and n-gram frequency lists will be published at regular intervals on the project website, and language models will be published there and on other appropriate platforms during the course of the project. This paper focuses on the background and crucial scoping stage of the project, and examines user needs as identified in a survey of potential users.