Unexpected Productions May Well be Errors
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
We present a method for detecting annotation errors in treebanks. It assumes that errors surface as unexpected small tree fragments. We compute statistics over configurations of these fragments using a standard statistical test. We then use the test result and the distributional characteristics of the configurations as features for a machine-learned classifier that flags unseen configurations as likely errors. Evaluation shows that the resulting list of error candidates is reliable regardless of corpus size, annotation quality, and target language.
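
The first step described in the abstract, scoring fragment configurations with a statistical test, can be illustrated with a minimal sketch. The abstract names neither the test nor the exact shape of a fragment, so everything below is an assumption made for illustration: a fragment is taken to be a (parent label, child label sequence) production, the test is Pearson's chi-square on a 2x2 contingency table, and the function names `chi_square_2x2`, `score_productions`, and `error_candidates` are hypothetical. The machine-learning step is left out; the scores computed here would only be some of its input features.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def score_productions(productions):
    """Score each (parent, children) pair observed in the treebank.

    ASSUMPTION: a configuration is a parent label paired with its child
    label sequence; the paper may define fragments differently.
    Returns {(parent, children): (observed, expected, chi_square)}.
    """
    pair_counts = Counter(productions)
    parent_counts = Counter(parent for parent, _ in productions)
    child_counts = Counter(children for _, children in productions)
    n = len(productions)
    scores = {}
    for (parent, children), a in pair_counts.items():
        b = parent_counts[parent] - a   # same parent, other children
        c = child_counts[children] - a  # same children, other parent
        d = n - a - b - c               # everything else
        expected = (a + b) * (a + c) / n  # count expected under independence
        scores[(parent, children)] = (a, expected, chi_square_2x2(a, b, c, d))
    return scores


def error_candidates(productions, threshold=3.84):
    """Pairs seen less often than expected, with a large test statistic.

    3.84 is the 5% critical value of chi-square with one degree of freedom.
    ASSUMPTION: a simple threshold stands in for the learned classifier.
    """
    scores = score_productions(productions)
    return sorted(
        key
        for key, (observed, expected, chi2) in scores.items()
        if observed < expected and chi2 > threshold
    )


# Toy treebank: (DT NN) almost always appears under NP, once under VP.
toy_productions = (
    [("NP", ("DT", "NN"))] * 8
    + [("VP", ("VB", "NP"))] * 8
    + [("VP", ("DT", "NN"))]
)
print(error_candidates(toy_productions))
```

On this toy data the single `VP` over `DT NN` production is the only pair that occurs less often than independence would predict, so it is the only candidate returned. Chi-square is known to be unreliable for very small expected counts, which is one reason a learned classifier over several features may be preferable to a bare threshold.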