Helpful or Harmful? The Dual Role of Linguistic Features in LLM-Based Dialectal Machine Translation

The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks

Abstract

Large Language Models (LLMs) have shown promising results in dialectal machine translation, yet the impact of explicit linguistic features remains underexplored. This paper examines whether part-of-speech (POS) tags and diacritization help or hinder LLM-based translation between Algerian dialect (Darija) and Modern Standard Arabic (MSA). Using a linguistically enriched subset of the PADIC dataset, we conduct bidirectional experiments across several frontier and open-weight LLMs, evaluated with automatic metrics and human judgments of adequacy and fluency. Results reveal a dual and asymmetric effect: diacritics can improve adequacy in the MSA → Algerian dialect direction, while POS tags and forced diacritization often introduce noise, especially for Algerian dialect → MSA translation. We further observe a mismatch between traditional overlap-based metrics and human evaluation, suggesting limitations in current evaluation practices. Overall, explicit linguistic augmentation does not consistently benefit LLM-based dialectal translation and must be applied cautiously.
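To make the notion of explicit linguistic augmentation concrete, the following is a minimal sketch of how a source sentence might be presented to an LLM with or without POS tags and diacritics. It is an illustration only: the function names, the `token/TAG` annotation format, and the prompt wording are assumptions introduced here, not the prompts or preprocessing used in the experiments. The diacritic range (U+064B to U+0652, plus U+0670) covers the common Arabic tashkeel marks.

```python
import re
from typing import List, Optional

# Arabic diacritic marks (tashkeel): U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics, leaving the undiacritized letters."""
    return DIACRITICS.sub("", text)


def build_prompt(
    tokens: List[str],
    pos_tags: Optional[List[str]] = None,
    keep_diacritics: bool = True,
    source: str = "Algerian dialect",
    target: str = "Modern Standard Arabic",
) -> str:
    """Assemble a translation prompt with optional linguistic features.

    The prompt wording is hypothetical, not the one used in the paper.
    """
    if pos_tags is not None and len(pos_tags) != len(tokens):
        raise ValueError("pos_tags must align one-to-one with tokens")

    if not keep_diacritics:
        tokens = [strip_diacritics(t) for t in tokens]

    if pos_tags is None:
        sentence = " ".join(tokens)
        note = ""
    else:
        # Annotate each token as token/TAG (an assumed format).
        sentence = " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, pos_tags))
        note = " Each source token is followed by its part-of-speech tag."

    return (
        f"Translate the following sentence from {source} to {target}."
        f"{note}\nSource: {sentence}\nTranslation:"
    )


if __name__ == "__main__":
    # A single fully diacritized token, written with escapes for portability.
    tokens = ["\u0643\u064E\u062A\u064E\u0628\u064E"]
    print(build_prompt(tokens, pos_tags=["VERB"], keep_diacritics=False))
```

Toggling `pos_tags` and `keep_diacritics` independently yields the kind of input conditions the abstract contrasts: plain text, POS-augmented text, and diacritized text.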