Designing and Evaluating a Russian Tagset
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)
Abstract
This paper reports the principles behind the design of a tagset covering Russian morphosyntactic phenomena, the modifications made to the core tagset, and its evaluation. The tagset is based on the MULTEXT-East framework, and the design decisions aimed to balance the parameters that matter to linguists against the feasibility of detecting and disambiguating them automatically. The final tagset contains about 500 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set that can be shared with other researchers.
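The accuracy figure above is presumably a per-token measure: the share of tokens whose automatically assigned tag matches the gold-standard tag. The abstract does not describe the evaluation procedure, so the following is only a minimal sketch under that assumption. The function name `tagging_accuracy` and the example tag strings are hypothetical; the tags merely imitate the positional MULTEXT-East style and are not taken from the paper or from the Russian National Corpus.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # A tag counts as correct only if the whole string matches,
    # i.e. every morphosyntactic attribute is right.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative tags only (assumed, not from the paper's data).
gold_tags = ["Ncmsn", "Vmip3s", "Ncfsa", "Sp"]
predicted_tags = ["Ncmsn", "Vmip3s", "Ncfsn", "Sp"]  # third tag has the wrong case

accuracy = tagging_accuracy(gold_tags, predicted_tags)
print(f"accuracy = {accuracy:.2f}")
```

Exact-match scoring is the strictest choice for a large tagset: a tag that gets the part of speech right but a single attribute wrong, such as case, still counts as an error.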