LREC 2010 Tutorial Description

STATISTICAL MODELS OF THE ANNOTATION PROCESS

OVERVIEW

In this tutorial, we will cover a range of basic models for the linguistic annotation process. We will focus on the structure of the models and the forms of inference they support, with examples drawn from a broad range of natural language annotation tasks. Along with the examples, we will demonstrate how to carry out the statistical analyses using open source statistics packages. By the end of the tutorial, students should know how to analyze their own annotation data.

EXAMPLE STATISTICS

For example, simple agreement or chance-adjusted kappa statistics may be computed from paired classification annotations. Because a range of natural language phenomena, from part-of-speech tags to sentiment, may be coded as classification problems, kappa statistics are widely applicable. Classical kappa calculations support hypothesis testing for a kind of chance-adjusted agreement; that is, they attempt to answer the question of how much better than chance the agreement between a pair of annotators is.

The main focus of this tutorial will be on richer generative models of the annotation process. In the simplest case, we will model annotator accuracy against a known gold standard. Unsurprisingly, the hypothesis that all annotators have the same accuracy is easily rejected, so we model each annotator's accuracy individually. Using these inferred accuracies, we will be able to estimate confidence in labels for examples where there is no gold standard, to draw inferences about whether one annotator is better than another, and to draw inferences about the prevalence of the various annotations in the data population.

For a binary classification problem, sensitivity and specificity are defined as accuracy on the positive and negative cases respectively. Modeling annotators as having only an overall accuracy assumes their sensitivities and specificities are the same; in other words, that there is no annotator bias toward positive or negative examples. This hypothesis is also easily rejected. By modeling sensitivity and specificity separately, we are able to assess individual annotators' biases toward the positive or negative category. Sensitivity and specificity may also be generalized to multinomial (more than two outcomes) problems such as topic assignment or named entity chunking.

Even without gold standard data in hand, given enough overlap in annotations among annotators, we may still infer annotator accuracies along with the prevalence of the various categories in the data population. We may use maximum likelihood estimates or infer complete Bayesian posteriors from either informative or non-informative priors.

We will also consider hierarchical models that add parameters characterizing the population of annotator sensitivities and specificities in terms of average annotator performance and inter-annotator variation. Hierarchical models may be used to infer accuracy priors by assigning them diffuse hyperpriors. The inferred accuracy priors characterize the entire population of annotators, and hence the difficulty and reproducibility of the annotation task itself.
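To make this kind of inference concrete, here is a minimal sketch, assuming Python with numpy; the description above mentions only open source statistics packages, so the language, function name, and variable names are illustrative rather than the tutorial's actual materials. It computes an EM-style maximum likelihood estimate of per-annotator sensitivities and specificities and of category prevalence from binary annotations without a gold standard, essentially the classical Dawid-and-Skene setup with per-annotator sensitivity and specificity.

import numpy as np

def em_binary_annotations(labels, n_iter=50, eps=1e-6):
    """EM for binary annotations without a gold standard (illustrative sketch).

    labels: (n_items, n_annotators) array of 0/1 annotations; for
    simplicity this sketch assumes every annotator labels every item.
    Returns estimated prevalence, per-annotator sensitivities and
    specificities, and the posterior probability each item is positive.
    """
    n_items, n_annotators = labels.shape
    # Initialize the soft "true" labels with each item's vote proportion.
    p_pos = labels.mean(axis=1)
    for _ in range(n_iter):
        # M-step: re-estimate prevalence, sensitivities, and
        # specificities from the current soft labels.
        prevalence = np.clip(p_pos.mean(), eps, 1 - eps)
        sens = (p_pos[:, None] * labels).sum(axis=0) / p_pos.sum()
        spec = ((1 - p_pos[:, None]) * (1 - labels)).sum(axis=0) / (1 - p_pos).sum()
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # E-step: recompute each item's posterior probability of being
        # positive given all of its annotations.
        log_pos = np.log(prevalence) + (
            labels * np.log(sens) + (1 - labels) * np.log(1 - sens)).sum(axis=1)
        log_neg = np.log(1 - prevalence) + (
            (1 - labels) * np.log(spec) + labels * np.log(1 - spec)).sum(axis=1)
        p_pos = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return prevalence, sens, spec, p_pos

The posterior probabilities returned here play the role of an inferred gold standard. A fully Bayesian version would place priors, possibly hierarchical ones, on the sensitivities and specificities and report posteriors rather than point estimates, which is the direction the hierarchical models described above take.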
As anyone who has annotated data is well aware, not all instances are created equal. Some are easy to annotate and some are much harder. Simpler models assume all examples are equally difficult, but this hypothesis is also easily rejected. Just as we may infer annotator accuracies given enough overlap in annotation, we may infer item difficulty given enough annotators per instance. Item difficulty lends itself to the same kind of hierarchical modeling as annotator accuracy, allowing us to infer properties of the difficulty of the population of examples.

Time permitting, we will discuss non-categorical annotation tasks, such as ordinal annotation (e.g., the height of a vowel in phonemic annotation, or ranking query result documents for information retrieval) and scalar annotation (e.g., the degree of stress in an intonation annotation, or the degree of positive sentiment in a statement).

APPLICATIONS

While the models of annotation accuracy and item difficulty, along with their population parameters, are of interest in themselves, they also lead to a variety of useful applications. Applications to corpus creation and coding standard design include active learning, providing feedback to correct annotator bias and inaccuracy, and estimating confidence in gold-standard labels. At both learning and evaluation time, standard performance measures may be generalized to the probabilistic labelings inferred from a multiply annotated corpus.

EXAMPLES

We will consider case studies where the full annotation data is publicly available, including binary decisions for textual entailment, multi-way decisions for word-sense disambiguation, open-ended decisions for morphological stemming, span detection for named entity recognition, and a variety of coreference classification and linkage tasks. The examples were all crowdsourced on the web, with multiple untrained annotators per item but little control over which items were annotated by which annotators. For many of the examples, gold-standard data was also generated in the traditional way, and we compare the gold standard inferred from the crowdsourced annotations with the existing gold standards.

INSTRUCTOR BIOS

Bob Carpenter received a Ph.D. in cognitive science from the University of Edinburgh. He has since worked as a computational linguistics professor at Carnegie Mellon University, a speech and language researcher at Lucent Bell Labs, and a researcher and software developer at SpeechWorks. He's now a software architect and research scientist at Alias-i, where he develops and maintains the LingPipe suite of natural language processing software. Over the past two years, he has published reports and software on hierarchical Bayesian models of annotation data.

Massimo Poesio received a Ph.D. in computer science from the University of Rochester. He has since worked as a researcher at the University of Edinburgh's Centre for Cognitive Science. He's now jointly appointed as a reader in computer science at the University of Essex and a professor of computer science at the University of Trento. He has worked on corpus annotation since the 1999 MATE project, continuing with the GNOME and ARRAU projects. More recently, he has worked on the ANAWIKI online coreference annotation project. Last year, he co-authored a detailed survey of inter-annotator agreement statistics for the journal Computational Linguistics.