Open Source Corpus Analysis Tools for Malay

Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006)

DOI:10.63317/2wrhtstg7hyt

Abstract

Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital advancement of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K-word sample of Malay text.

Details

Paper ID
lrec2006-main-418
Pages
N/A
BibKey
baldwin-awab-2006-open
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-2-4
Conference
Fifth International Conference on Language Resources and Evaluation
Location
Genoa, Italy
Date
24 May 2006 to 26 May 2006

Authors

  • Timothy Baldwin

  • Su’ad Awab
