LREC 2022, main conference

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2anx3vqhsxdm

Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language, including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.

Details

Paper ID
lrec2022-main-063
Pages
pp. 590-602
BibKey
muhammad-etal-2022-naijasenti
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Shamsuddeen Hassan Muhammad
  • David Ifeoluwa Adelani
  • Sebastian Ruder
  • Ibrahim Sa’id Ahmad
  • Idris Abdulmumin
  • Bello Shehu Bello
  • Monojit Choudhury
  • Chris Chinenye Emezue
  • Saheed Salahudeen Abdullahi
  • Anuoluwapo Aremu
  • Alípio Jorge
  • Pavel Brazdil
