SentiMalti: A Maltese Sentiment Analysis Dataset and Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present SentiMalti, a new Maltese social media sentiment resource and accompanying baselines. We scrape user‑generated content from YouTube, Reddit, and Facebook, then apply a Maltese‑aware preprocessing pipeline (cleaning, personally identifiable information anonymisation, sentence splitting, and sentence‑level language filtering) to retain Maltese sentences while tolerating realistic code‑switching. The resulting crowdsourced dataset contains 2,327 sentences annotated for positive (39%), negative (31%), and neutral (30%) sentiment. We integrate prior Maltese datasets to create a combined benchmark of 3,772 instances. We evaluate fine‑tuned encoder models (BERTu, Glot500) and few‑shot prompting with instruction‑tuned multilingual LLMs (Aya‑101, Gemma 2 Instruct 9B). On the full test set, five‑shot Aya‑101 attains 68.65 macro‑F1, closely followed by a fine‑tuned BERTu at 68.36 macro‑F1. Error analysis reveals complementary strengths: BERTu better separates polarised classes, while Aya‑101 tends to over‑predict the neutral class. We release the dataset splits, code, and a fine‑tuned BERTu model to facilitate further work in Maltese NLP and sentiment analysis.
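For readers unfamiliar with the metric reported above, macro‑F1 is the unweighted mean of the per‑class F1 scores, so the positive, negative, and neutral classes count equally regardless of their frequency. The following is a minimal sketch of that computation, not the evaluation code released with the dataset; the function name `macro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The three sentiment classes named in the abstract.
LABELS = ("positive", "negative", "neutral")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores, as a percentage."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[g] += 1      # class g was missed

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(labels)


# Hypothetical toy data: a system that over-predicts the neutral class.
gold = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "neutral", "neutral", "neutral"]
score = macro_f1(gold, pred)
```

In this toy case every neutral sentence is recovered, but the two spurious neutral predictions lower the precision of the neutral class and the recall of the other two, which is how over‑prediction of one class depresses the macro average.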