TSix: A Human-involved-creation Dataset for Tweet Summarization

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

We present a new dataset for tweet summarization. The dataset includes six events collected from Twitter from October 10 to November 9, 2016. Our dataset has two prominent properties. First, human-annotated gold-standard references allow extractive summarization methods to be evaluated correctly. Second, tweets are assigned to sub-topics divided by consecutive days, which facilitates incremental tweet stream summarization methods. To demonstrate the potential usefulness of our dataset, we compare several well-known summarization methods. Experimental results indicate that, among extractive approaches, hybrid term frequency -- document term frequency obtains competitive results in terms of ROUGE scores. The analysis also shows that polarity is an implicit factor of tweets in our dataset, suggesting that it can be exploited as a component alongside tweet content quality in the summarization process.

Details

Paper ID

lrec2018-main-506

Pages

N/A

DOI

10.63317/39q4zq3o8xmu

BibKey

nguyen-etal-2018-tsix

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Minh-Tien Nguyen
Dac Viet Lai
Huy-Tien Nguyen
Le-Minh Nguyen
