
Midas Loop: A Prioritized Human-in-the-Loop Annotation for Large Scale Multilayer Data

Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

DOI:10.63317/3trcjahjw5kb

Abstract

Large-scale annotation of rich multilayer corpus data is expensive and time-consuming, motivating approaches that integrate high-quality automatic tools with active learning in order to prioritize human labeling of hard cases. A related challenge in such scenarios is the concurrent management of automatically annotated and human-annotated data, particularly where different subsets of the data have been corrected for different types of annotation and with different levels of confidence. In this paper, we present [REDACTED], a collaborative, version-controlled online annotation environment for multilayer corpus data which includes integrated provenance and confidence metadata for each piece of information at the document, sentence, token and annotation level. We present a case study on improving annotation quality in an existing multilayer parse bank of English called AMALGUM, focusing on active learning in corpus preprocessing, at the surprisingly challenging level of sentence segmentation. Our results show improvements to state-of-the-art sentence segmentation and a promising workflow for getting “silver” data to approach gold-standard quality.

Details

Paper ID
lrec2022-ws-law-13
Pages
pp. 103-110
BibKey
gessler-etal-2022-midas
Publisher
European Language Resources Association (ELRA)
Workshop
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
Date
20 June 2022 to 25 June 2022

Authors

  • Luke Gessler
  • Lauren Levine
  • Amir Zeldes
