LREC 2010 Tutorial Description

STATISTICAL MODELS OF THE ANNOTATION PROCESS

OVERVIEW

In this tutorial, we will cover a range of basic models for the linguistic annotation process. We will focus on the structure of the models and the forms of inference they support, with examples drawn from a broad range of natural language annotation tasks. Along with the examples, we will demonstrate how to carry out the statistical analyses using open source statistics packages. By the end of the tutorial, students should know how to analyze their own annotation data.

EXAMPLE STATISTICS

For example, simple agreement or chance-adjusted kappa statistics may be computed from paired classification annotations. Because a range of natural language phenomena, from part-of-speech tags to sentiment, may be coded as classification problems, kappa statistics are widely applicable. Classical kappa calculations support hypothesis testing for a kind of chance-adjusted agreement; that is, they attempt to answer the question of how much better than chance the agreement between a pair of annotators is.

The main focus of this tutorial will be on richer generative models of the annotation process. In the simplest case, we will model annotator accuracy against a known gold standard. Unsurprisingly, the hypothesis that all annotators have the same accuracy is easily rejected, so we model each annotator's accuracy individually. Using these inferred accuracies, we will be able to estimate confidence in labels for examples where there is no gold standard, to draw inferences about whether one annotator is better than another, and to draw inferences about the prevalence of the various annotations in the data population.

For a binary classification problem, sensitivity and specificity are defined as accuracy on the positive and negative cases respectively. Modeling annotators as having only an overall accuracy assumes their sensitivities and specificities are the same; in other words, that there is no annotator bias toward positive or negative examples. This hypothesis is also easily rejected. By modeling sensitivity and specificity separately, we are able to assess individual annotators' biases toward the positive or negative category. Sensitivity and specificity may also be generalized to multinomial (more than two outcomes) problems such as topic assignment or named entity chunking.

Even without gold standard data in hand, given enough overlap in annotations among annotators, we may still infer annotator accuracies along with the prevalence of the various categories in the data population. We may use maximum likelihood estimates or infer complete Bayesian posteriors from either informative or non-informative priors.

We will also consider hierarchical models that add parameters characterizing the population of annotator sensitivities and specificities in terms of average annotator performance and inter-annotator variation. Hierarchical models may be used to infer accuracy priors by assigning them diffuse hyperpriors. The inferred accuracy priors characterize the entire population of annotators, and hence the difficulty and reproducibility of the annotation task itself.
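To make this kind of inference concrete, here is a minimal sketch, assuming Python with numpy; the description above mentions only open source statistics packages, so the language, function name, and variable names are illustrative rather than the tutorial's actual materials. It computes an EM-style maximum likelihood estimate of per-annotator sensitivities and specificities and of category prevalence from binary annotations without a gold standard, essentially the classical Dawid-and-Skene setup with per-annotator sensitivity and specificity.

import numpy as np

def em_binary_annotations(labels, n_iter=50, eps=1e-6):
    """EM for binary annotations without a gold standard (illustrative sketch).

    labels: (n_items, n_annotators) array of 0/1 annotations; for
    simplicity this sketch assumes every annotator labels every item.
    Returns estimated prevalence, per-annotator sensitivities and
    specificities, and the posterior probability each item is positive.
    """
    n_items, n_annotators = labels.shape
    # Initialize the soft "true" labels with each item's vote proportion.
    p_pos = labels.mean(axis=1)
    for _ in range(n_iter):
        # M-step: re-estimate prevalence, sensitivities, and
        # specificities from the current soft labels.
        prevalence = np.clip(p_pos.mean(), eps, 1 - eps)
        sens = (p_pos[:, None] * labels).sum(axis=0) / p_pos.sum()
        spec = ((1 - p_pos[:, None]) * (1 - labels)).sum(axis=0) / (1 - p_pos).sum()
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # E-step: recompute each item's posterior probability of being
        # positive given all of its annotations.
        log_pos = np.log(prevalence) + (
            labels * np.log(sens) + (1 - labels) * np.log(1 - sens)).sum(axis=1)
        log_neg = np.log(1 - prevalence) + (
            (1 - labels) * np.log(spec) + labels * np.log(1 - spec)).sum(axis=1)
        p_pos = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return prevalence, sens, spec, p_pos

The posterior probabilities returned here play the role of an inferred gold standard. A fully Bayesian version would place priors, possibly hierarchical ones, on the sensitivities and specificities and report posteriors rather than point estimates, which is the direction the hierarchical models described above take.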
As anyone who has annotated data is well aware, not all instances are created equal. Some are easy to annotate and some are much harder. Simpler models assume all examples are equally difficult, but this hypothesis is also easily rejected. Just as we may infer annotator accuracies given enough overlap in annotation, we may infer item difficulty given enough annotators per instance. Item difficulty lends itself to the same kind of hierarchical modeling as annotator accuracy, allowing us to infer properties of the difficulty of the population of examples.

Time permitting, we will discuss non-categorical annotation tasks, such as ordinal annotation (e.g., the height of a vowel in phonemic annotation, or ranking query result documents for information retrieval) and scalar annotation (e.g., the degree of stress in an intonation annotation, or the degree of positive sentiment in a statement).

APPLICATIONS

While the models of annotation accuracy and item difficulty, along with their population parameters, are of interest in themselves, they also lead to a variety of useful applications. Applications to corpus creation and coding standard design include active learning, providing feedback to correct annotator bias and inaccuracy, and estimating confidence in gold-standard labels. At both learning and evaluation time, standard performance measures may be generalized to the probabilistic labelings inferred from a multiply annotated corpus.

EXAMPLES

We will consider case studies where the full annotation data is publicly available, including binary decisions for textual entailment, multi-way decisions for word-sense disambiguation, open-ended decisions for morphological stemming, span detection for named entity recognition, and a variety of coreference classification and linkage tasks. The examples were all crowdsourced on the web, with multiple untrained annotators per item but little control over which items were annotated by which annotators. For many of the examples, gold-standard data was also generated in the traditional way, and we compare the gold standard inferred from the crowdsourced annotations with the existing gold standards.

INSTRUCTOR BIOS

Bob Carpenter received a Ph.D. in cognitive science from the University of Edinburgh. He has since worked as a computational linguistics professor at Carnegie Mellon University, a speech and language researcher at Lucent Bell Labs, and a researcher and software developer at SpeechWorks. He's now a software architect and research scientist at Alias-i, where he develops and maintains the LingPipe suite of natural language processing software. Over the past two years, he has published reports and software on hierarchical Bayesian models of annotation data.

Massimo Poesio received a Ph.D. in computer science from the University of Rochester. He has since worked as a researcher at the University of Edinburgh's Centre for Cognitive Science. He's now jointly appointed as a reader in computer science at the University of Essex and a professor of computer science at the University of Trento. He has worked on corpus annotation since the 1999 MATE project, continuing with the GNOME and ARRAU projects. More recently, he has worked on the ANAWIKI online coreference annotation project. Last year, he co-authored a detailed survey of inter-annotator agreement statistics for the journal Computational Linguistics.