Building a Biomedical Full-Text Part-of-Speech Corpus Semi-Automatically

Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

DOI:10.63317/4tw9qc3ixsd7

Abstract

This paper presents a method for semi-automatically building a corpus of full-text English-language biomedical articles annotated with part-of-speech tags. The outcomes are a semi-automatic procedure for creating a large silver-standard corpus of 5 million part-of-speech-annotated sentences drawn from a large collection of full-text biomedical articles, and a robust, easy-to-use software tool for investigating the differences between two tagged datasets. The corpus is built by running two part-of-speech taggers designed for biomedical abstracts, followed by human dispute resolution wherever the two taggers disagree on the tag for a token. The software tool facilitates dispute resolution by organizing and presenting the disputed tags. The corpus and all of the software implemented for this study are publicly available.
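As an illustration of the disagreement-detection step described in the abstract, the following is a minimal sketch, not the authors' tool. It assumes both taggers produce one tag per token over the same tokenization; the function names `find_disputes` and `group_disputes`, the example sentence, and the tags are all hypothetical.

```python
from collections import defaultdict


def find_disputes(tokens, tags_a, tags_b):
    """Return (index, token, tag_a, tag_b) for every token the taggers disagree on.

    Assumes both taggers used the same tokenization (an assumption of this sketch).
    """
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("token and tag sequences must have the same length")
    return [
        (i, tok, a, b)
        for i, (tok, a, b) in enumerate(zip(tokens, tags_a, tags_b))
        if a != b
    ]


def group_disputes(disputes):
    """Group disputed tokens by the (tag_a, tag_b) pair so that a human
    reviewer can inspect each kind of disagreement together."""
    groups = defaultdict(list)
    for i, tok, a, b in disputes:
        groups[(a, b)].append((i, tok))
    return dict(groups)


# Hypothetical example sentence with Penn Treebank-style tags.
tokens = ["The", "protein", "binds", "DNA", "in", "vitro"]
tags_a = ["DT", "NN", "VBZ", "NN", "FW", "FW"]    # output of tagger A (made up)
tags_b = ["DT", "NN", "VBZ", "NNP", "IN", "FW"]   # output of tagger B (made up)

disputes = find_disputes(tokens, tags_a, tags_b)
grouped = group_disputes(disputes)
```

Grouping by tag pair is one plausible way to organize disputes for review: a reviewer who settles one kind of disagreement can see every token it affects in one place.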

Details

Paper ID
lrec2022-ws-law-16
Pages
pp. 129-138
BibKey
elder-etal-2022-building
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
Date
20 June 2022 to 25 June 2022

Authors

  • Nicholas Elder
  • Robert E. Mercer
  • Sudipta Singha Roy
