TOCP: A Dataset for Chinese Profanity Processing

Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

DOI:10.63317/2a2o9nq6eqy4

Abstract

This paper introduces TOCP, a large dataset of Chinese profanity. The dataset contains natural sentences collected from social media sites, the profane expressions that appear in those sentences, and rephrasing suggestions that preserve the meaning of each expression in a less offensive way. We propose several baseline systems that use neural network models to test this benchmark. We train embedding models on a profanity-related dataset and propose several profanity-related features. Our baseline systems achieve an F1-score of 86.37% in profanity detection and an accuracy of 77.32% in profanity rephrasing.

Details

Paper ID
lrec2020-ws-trac-02
Pages
pp. 6-12
BibKey
yang-lin-2020-tocp
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Date
11-16 May 2020

Authors

  • Hsu Yang

  • Chuan-Jie Lin

Links