SUMMARY : Session P22-W

Title: Open Source Corpus Analysis Tools for Malay
Authors: T. Baldwin, S. Awab
Abstract: Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.
Keywords: sentence tokeniser, lemmatiser, Malay
Full paper: Open Source Corpus Analysis Tools for Malay