AI Safety Lost in Translation: Evaluating the Effectiveness of English-Italian Cross-Lingual LLM Safety Alignment
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large Language Models (LLMs) have been shown to be vulnerable to a range of bias and safety issues, prompting the development of new safety alignment techniques. In this paper, we investigate the degree to which such techniques improve safety in a non-English language, specifically Italian, both when safety training data in that language is available and when it is not. We evaluate standard mitigation techniques and assess cross-lingual safety transfer by comparing English-only and bilingual Supervised Fine-Tuning (SFT) on several open-source small LLMs: Qwen3, Llama3.2, and Gemma3. Results confirm a significant cross-lingual safety gap, with most models performing worse in Italian. We find that while prompt engineering is generally effective, the impact of SFT is highly inconsistent: English-only SFT occasionally fails to transfer safety improvements to Italian and even degrades the performance of some models, while bilingual SFT repeatedly underperforms other mitigation methods. These findings demonstrate that safety alignment does not always generalize across languages and models, and that standard mitigation strategies can have unpredictable effects. We therefore highlight the critical need for language-specific evaluation and dedicated multilingual safety research to ensure AI is developed equitably and safely for a global audience.