Back to Main Conference 2004
LREC 2004main

Part-of-Speech Annotation of Biology Research Abstracts

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/4kpxeameezza

Abstract

A part-of-speech (POS) tagged corpus was built on research abstracts in biomedical domain with the Penn Treebank scheme. As consistent annotation was difficult without domain-specific knowledge we made use of the existing term annotation of the GENIA corpus. A list of frequent terms annotated in the GENIA corpus was compiled and the POS of each constituent of those terms were determined with assistance from domain specialists. The POS of the terms in the list are pre-assigned, then a tagger assigns POS to remaining words preserving the pre-assigned POS, whose results are corrected by human annotators. We also modified the PTB scheme slightly. An inter-annotator agreement tested on new 50 abstracts was 98.5%. A POS tagger trained with the annotated abstracts was tested against a gold-standard set made from the interannotator agreement. The untrained tagger had the accuracy of 83.0%. Trained with 2000 annotated abstracts the accuracy rose to 98.2%. The 2000 annotated abstracts are publicly available.

Details

Paper ID
lrec2004-main-322
Pages
N/A
BibKey
tateisi-tsujii-2004-part
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 28 May 2004

Authors

  • YT

    Yuka Tateisi

  • JT

    Jun-ichi Tsujii

Links