LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora
Authors Strassel Stephanie (Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104, USA, strassel@ldc.upenn.edu)
Graff David (Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104, USA, graff@ldc.upenn.edu)
Martey Nii (Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104, USA, nmartey@ldc.upenn.edu)
Cieri Christopher (Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104, USA, ccieri@ldc.upenn.edu)
Keywords  
Session Session SP2 - Spoken Language Resources Issues from Construction to Validation
Full Paper 212.ps, 212.pdf
Abstract The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television, and radio sources in English and Mandarin Chinese. The most recent TDT corpus, TDT3, added two tasks: story link detection and first story detection. Annotation of the TDT corpora involved a large staff of annotators who produced millions of human judgements. As with any large corpus creation effort, quality assurance and inter-annotator consistency were major concerns. This paper reports the quality control measures adopted by the LDC during the creation of the TDT corpora, presents the techniques used to evaluate and improve the consistency of human annotators across all annotation tasks, and discusses aspects of project administration that were designed to enhance annotation consistency.
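
Note: the abstract does not state which consistency measures the paper reports. As a minimal illustrative sketch only, one standard way to quantify inter-annotator consistency on a labeling task is a chance-corrected agreement coefficient such as Cohen's kappa; the Python below uses hypothetical annotator judgements and is not drawn from the paper.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators' label sequences.
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the two annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently,
    # according to his or her own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judgements: two annotators deciding whether each story is on topic.
ann1 = ["on", "on", "off", "on", "off", "off", "on", "off"]
ann2 = ["on", "off", "off", "on", "off", "on", "on", "off"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")  # kappa = 0.50

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which makes it a common summary statistic for the kind of multi-judge consistency monitoring the abstract describes.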