Findings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)

Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)

Abstract

This paper presents the findings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026), held as part of LREC 2026. South Asia is one of the most linguistically diverse regions in the world, yet its languages remain severely underrepresented in language resources and technologies, particularly in the era of large language models (LLMs). The workshop brought together research addressing key challenges in this space, including data scarcity, morphological complexity, code-mixing, script diversity, and the lack of culturally grounded evaluation benchmarks. It received 57 submissions covering a wide range of languages, tasks, and modalities, including both widely spoken languages (e.g., Bengali, Hindi, Tamil, and Urdu) and extremely low-resource and endangered languages such as Burushaski, Limbu, and Nepal Bhasha (Newari). Several contributions introduce arguably first-of-their-kind resources and benchmarks for these languages, spanning both text and speech domains and focusing on linguistically informed and culturally grounded data creation. In addition to the main track, the workshop hosted a shared task on Multimodal Hate and Sentiment Understanding in Low-Resource Memes for Nepali, which attracted strong community participation. The shared task results highlight the effectiveness of multimodal approaches while also revealing persistent challenges in modelling culturally nuanced and low-resource data. Across the accepted papers and the shared task, key insights include the central role of high-quality data, the limitations of current multilingual models in low-resource settings, and the need for culturally aware and data-centric approaches. Overall, CHiPSAL 2026 demonstrates the growing momentum in South Asian language processing and highlights the importance of sustained, community-driven efforts to build inclusive and representative language technologies.