LREC 2018 main

A Large Self-Annotated Corpus for Sarcasm

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/2ifwj74dgdbw

Abstract

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements, 10 times more than any previous dataset, and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated (sarcasm is labeled by the author, not an independent annotator) and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

Details

Paper ID
lrec2018-main-102
Pages
N/A
BibKey
khodak-etal-2018-large
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Mikhail Khodak
  • Nikunj Saunshi
  • Kiran Vodrahalli
