HuNeBR: A Multitask Benchmark to Evaluate LLMs’ Understanding of Northeastern Brazilian Portuguese Humor
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Abstract
Humor recognition is a major challenge in Natural Language Processing (NLP) due to its subtle and context-dependent nature. Despite advances, Large Language Models (LLMs) still struggle with this task, especially in Brazilian Portuguese, where no dedicated benchmarks exist. This paper presents HuNeBR, a new benchmark of 475 annotated humorous texts from Northeastern Brazilian comedians. The benchmark evaluates LLMs on three tasks: identifying punchlines, classifying texts into eight comic styles, and explaining humor. This is the first benchmark to evaluate LLMs on the in-depth interpretation of humorous texts in Brazilian Portuguese, going beyond the binary tasks of traditional humor benchmarks. Both general-purpose and Portuguese-specialized LLMs were evaluated under zero-shot and few-shot settings. The findings indicate that LLMs perform very well at identifying punchlines, show inconsistent results in classifying comic styles, and produce humor interpretations that mostly align with human judgments. Among the models assessed, general-purpose multilingual systems like GPT-4 and Gemini 2.5 Flash achieved the top overall performance, whereas Sabiá 3.1, a model specialized in Brazilian Portuguese, demonstrated competitive results across all three tasks, highlighting the value of locally trained models in capturing linguistic and cultural subtleties.
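To make the zero-shot versus few-shot distinction concrete, the following is a minimal sketch of how a comic-style classification prompt might be assembled and scored. It is not the authors' evaluation code: the `HumorItem` record layout, the `build_prompt` and `accuracy` helpers, the prompt wording, and the label names are all hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HumorItem:
    """Hypothetical record layout; the real HuNeBR schema may differ."""
    text: str        # the humorous text
    punchline: str   # annotated punchline
    style: str       # one of the eight comic-style labels (placeholder names here)


def build_prompt(item: HumorItem, labels: Sequence[str],
                 shots: Sequence[HumorItem] = ()) -> str:
    """Build a style-classification prompt; an empty `shots` means zero-shot."""
    lines = ["Classify the comic style of the text. Options: " + ", ".join(labels)]
    for ex in shots:  # few-shot: prepend solved examples before the query
        lines.append(f"Text: {ex.text}\nStyle: {ex.style}")
    lines.append(f"Text: {item.text}\nStyle:")
    return "\n\n".join(lines)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Under these assumptions, calling `build_prompt(item, labels)` yields a zero-shot prompt, while passing a few annotated items as `shots` yields the few-shot variant. The model's returned labels can then be compared against the annotations with `accuracy`.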