
Comparison of Some Automatic and Manual Methods for Summary Evaluation Based on the Text Summarization Challenge 2

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/4uhkaa65ad5n

Abstract

In this paper, we compare several automatic and manual methods for summary evaluation. An essential requirement for an evaluation measure is how well it recognizes slight differences in the quality of computer-produced summaries. From this viewpoint, we examined 'evaluation by revision' using data from the Text Summarization Challenge 2 (TSC2). Evaluation by revision is a manual method that was first used in TSC2, and its effectiveness had not previously been tested. First, we compared evaluation by revision with ranking evaluation, a manual method used in both TSC1 and TSC2, by checking the gaps in edit distance from 0 to 1 at 0.1 intervals. To investigate the effectiveness of evaluation by revision, we also tested three automatic methods (content-based evaluation, BLEU, and RED) and compared their results with those of evaluation by revision for reference. As a result, we found that evaluation by revision is effective at recognizing slight differences between computer-produced summaries. Second, we assessed content-based evaluation, BLEU, and RED using evaluation by revision, and compared the effectiveness of the three automatic methods. We found that RED is superior to the others in some of these examinations.
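The abstract compares methods on an edit-distance scale from 0 to 1. As a minimal sketch of how such a normalized score could be computed, the following assumes word-level tokens and normalization by the longer sequence; the paper's exact revision-cost scheme may differ.

```python
# Sketch of a normalized edit distance on a 0-1 scale.
# Word-level tokenization and normalization by the length of the
# longer sequence are assumptions, not the paper's exact setup.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(summary, revision):
    """Scale the raw distance into [0, 1] (0 = identical summaries)."""
    a, b = summary.split(), revision.split()
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))
```

A score near 0 means the revised summary required few edits, i.e. the computer-produced summary was already close to an acceptable one.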

Details

Paper ID
lrec2004-main-116
Pages
N/A
BibKey
nanba-okumura-2004-comparison
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26–28 May 2004

Authors

  • Hidetsugu Nanba

  • Manabu Okumura
