LREC 2018 main

Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/483te4x2sztc

Abstract

As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical part-of-speech (POS) taggers trained over large datasets have become as robust as, or better than, their rule-based counterparts. However, for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger. Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an open-source dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints to prune the number of possible tags until the most appropriate tag for a given token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh and present an evaluation of the performance of the tagger using a manually checked test corpus of 611 Welsh sentences.
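The lookup-then-prune idea described in the abstract can be illustrated with a minimal sketch. This is not CyTag's code, rule set, or tagset: the lexicon entries, the tag names (`VERB`, `NOUN`, `PREP`, `PARTICLE`, `VERBNOUN`), and the single constraint below are all assumptions made for illustration, and the dictionary is a toy stand-in rather than Eurfa's actual format. The example uses the Welsh word "yn", which can be read as a preposition or as a particle preceding a verbnoun.

```python
# Toy lexicon: hypothetical entries and tag names, NOT Eurfa's real format.
LEXICON = {
    "mae": {"VERB"},
    "ci": {"NOUN"},
    "yn": {"PREP", "PARTICLE"},   # ambiguous token
    "rhedeg": {"VERBNOUN"},
}


def candidate_tags(tokens):
    """Step 1: look up every possible tag for each token."""
    return [set(LEXICON.get(tok.lower(), {"UNKNOWN"})) for tok in tokens]


def remove_prep_before_verbnoun(tokens, readings):
    """Step 2: one illustrative REMOVE-style constraint (an assumption,
    not a rule taken from CyTag). Drop the PREP reading of 'yn' when the
    next token is unambiguously a verbnoun. A token's last remaining
    reading is never removed."""
    for i, tok in enumerate(tokens[:-1]):
        if tok.lower() == "yn" and readings[i + 1] == {"VERBNOUN"}:
            if len(readings[i]) > 1:
                readings[i].discard("PREP")
    return readings


def tag(tokens):
    """Look up candidate tags, then prune them with the constraint."""
    readings = candidate_tags(tokens)
    readings = remove_prep_before_verbnoun(tokens, readings)
    return list(zip(tokens, readings))


if __name__ == "__main__":
    for token, tags in tag(["Mae", "ci", "yn", "rhedeg"]):
        print(token, sorted(tags))
```

In a real Constraint Grammar setup the constraints would be written in the parser's own rule language and applied by the parser; the Python function here only mimics the effect of a single such rule.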

Details

Paper ID
lrec2018-main-623
Pages
N/A
BibKey
neale-etal-2018-leveraging
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Steven Neale

  • Kevin Donnelly

  • Gareth Watkins

  • Dawn Knight
