Leveraging Comparable Toxicity Lexicons in Prompt Instructions for Multilingual Text Detoxification

Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

Abstract

To mitigate the prevalence of toxic language on social media, various NLP approaches have been proposed for automatic text detoxification. However, the potential of toxic expression lexicons as a comparable cross-lingual resource to guide this process remains largely unexplored. In this work, we investigate how such resources can be used to inform multilingual language models about what should and should not be considered toxic. We evaluate four models under two settings, zero-shot prompting and fine-tuning, to assess the impact of incorporating toxic expressions in prompt instructions, including in cross-lingual transfer scenarios. Our results show that both zero-shot prompting and fine-tuning benefit considerably from adding toxic expressions to prompt instructions during training and/or inference. Our findings demonstrate that comparable, lightweight, language-specific toxic expression lexicons constitute an effective mechanism for injecting explicit information about lexical toxicity into multilingual language models.
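As a rough illustration of what adding toxic expressions to a prompt instruction could look like, the sketch below matches a language-specific lexicon against the input and lists the matched entries in the instruction. The lexicon entries, the instruction wording, and the names `TOXIC_LEXICONS`, `find_toxic_expressions`, and `build_detox_prompt` are hypothetical and chosen for illustration only. Listing only the expressions found in the input, rather than the whole lexicon, is likewise an assumption and not necessarily the setup evaluated in this work.

```python
import re

# Hypothetical toy lexicons keyed by language code (mild placeholder entries,
# not taken from any real toxicity lexicon).
TOXIC_LEXICONS = {
    "en": ["idiot", "stupid", "shut up"],
    "es": ["idiota", "estúpido"],
}


def find_toxic_expressions(text, lexicon):
    """Return the lexicon entries that occur in `text` as whole words or phrases."""
    hits = []
    for expr in lexicon:
        # Lookarounds avoid matching inside longer words (e.g. "idiot" in "idiotic").
        pattern = r"(?<!\w)" + re.escape(expr) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(expr)
    return hits


def build_detox_prompt(text, lang):
    """Build a detoxification prompt, naming matched toxic expressions if any."""
    lexicon = TOXIC_LEXICONS.get(lang, [])
    hits = find_toxic_expressions(text, lexicon)
    # Assumed instruction wording; the actual prompt template may differ.
    instruction = (
        "Rewrite the following text so that it is no longer toxic "
        "while preserving its meaning."
    )
    if hits:
        instruction += (
            " The following expressions are considered toxic: "
            + ", ".join(hits)
            + "."
        )
    return f"{instruction}\nText: {text}\nDetoxified text:"


if __name__ == "__main__":
    print(build_detox_prompt("You are an idiot, shut up!", "en"))
```

The same prompt string could in principle be sent to a model at inference time (zero-shot) or used as the input side of fine-tuning examples. When no lexicon exists for a language, the sketch falls back to the plain instruction.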