A Large Self-Annotated Corpus for Sarcasm

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating sarcasm detection systems. The corpus contains 1.3 million sarcastic statements, 10 times more than any previous dataset, and many times more non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated (sarcasm is labeled by the author, not by an independent annotator) and is provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

Details

Paper ID

lrec2018-main-102

Pages

N/A

DOI

10.63317/2ifwj74dgdbw

BibKey

khodak-etal-2018-large

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7–12 May 2018

Authors

Mikhail Khodak
Nikunj Saunshi
Kiran Vodrahalli
