Introducing the National Corpus of Irish Project

Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

DOI:10.63317/2c2f29whcwqz

Abstract

This paper introduces the National Corpus of Irish, an initiative to develop a large national corpus of contemporary written and spoken Irish, together with related specialised corpora. The newly compiled corpora will be hosted at corpas.ie, which will become a hub for corpus-based research on the Irish language. Users will be able to search the corpora and to download data generated during the project from the corpas.ie website and from appropriate third-party repositories. Corpus 1 will be a balanced general-purpose corpus of c.155m words. Corpus 2 will be a written corpus of c.100m words. Corpus 3 will be a spoken corpus of 6.5m words. Corpus 4 will be a monitor corpus with a target size of 1m words per year from 2000 onwards. Token, lemma, and n-gram frequency lists will be published at regular intervals on the project website, and language models will be published there and on other appropriate platforms over the course of the project. This paper focuses on the background to the project and on its crucial scoping stage, and examines user needs as identified in a survey of potential users.

Details

Paper ID
lrec2022-ws-cltw-14
Pages
pp. 99-103
BibKey
o-meachair-etal-2022-introducing
Publisher
European Language Resources Association (ELRA)
Workshop
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022
Date
20-25 June 2022

Authors

  • Mícheál Ó Meachair

  • Úna Bhreathnach

  • Gearóid Ó Cleircín

Links