Fine-Tashkeel at KSAA-2026: A Comprehensive Evaluation of Seq2Seq and Multimodal Approaches for Automatic Diacritization of Arabic Speech Dictation

The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks

Abstract

This paper presents the Fine-Tashkeel system for Task 2 of the KSAA-2026 Shared Task on Automatic Diacritization of Speech Dictation. Diacritizing speech-derived Arabic text is challenging because of dialectal variation, morphological ambiguity, and the absence of acoustic cues in text-only pipelines. Our approach treats diacritization as a character-level sequence-to-sequence task that maps undiacritized text directly to its fully diacritized form. We evaluate 18 models spanning text-only, ASR-augmented, and fine-tuned configurations and find that text-only Seq2Seq approaches outperform off-the-shelf multimodal models, a gap we attribute to task mismatch in generic ASR systems rather than to an inherent limitation of audio. Our best submission, which used zero-shot inference without task-specific training, achieved a Diacritic Error Rate (DER) of 10.56%, a Word Error Rate (WER) of 34.47%, and a Sentence Error Rate (SER) of 79.88%, ranking 5th out of 7 teams. Per-nationality error analysis reveals substantial dialectal variation (Egyptian 3.70% vs. Algerian 13.73% DER), and diagnostic analysis confirms that case endings and vowel ambiguity are the primary bottlenecks. Code and evaluation scripts are publicly available.
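To make the task framing and the three metrics concrete, the following is a minimal sketch, not the shared task's official scorer or the authors' released code. It assumes that diacritics are the Unicode marks U+064B to U+0652 (tanween, short vowels, shadda, sukun), that a hypothesis never alters the reference's base letters, and that every non-whitespace base character counts toward DER; official DER conventions differ on such details. The function names `strip_diacritics`, `letters_with_marks`, and `error_rates` are hypothetical.

```python
"""Hypothetical sketch: seq2seq input construction plus DER/WER/SER scoring.

Assumptions (not from the paper): diacritics are U+064B..U+0652, hypotheses
keep the reference base letters, and each non-whitespace base character
counts toward DER.
"""

# Tanween, fatha, damma, kasra, shadda, sukun (U+064B..U+0652 inclusive).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, producing the seq2seq source string."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def letters_with_marks(word: str) -> list[tuple[str, str]]:
    """Pair each base character with its diacritics, sorted so that
    shadda+fatha and fatha+shadda compare as equal."""
    pairs: list[list[str]] = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch  # attach the mark to the preceding letter
        else:
            pairs.append([ch, ""])
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def error_rates(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Return DER, WER, and SER (percentages) over parallel sentence lists."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    char_total = char_err = word_total = word_err = sent_err = 0
    for ref, hyp in zip(refs, hyps):
        if strip_diacritics(ref).split() != strip_diacritics(hyp).split():
            raise ValueError("hypothesis must keep the reference base text")
        sentence_wrong = False
        for ref_word, hyp_word in zip(ref.split(), hyp.split()):
            r_pairs = letters_with_marks(ref_word)
            h_pairs = letters_with_marks(hyp_word)
            wrong = sum(r != h for r, h in zip(r_pairs, h_pairs))
            char_total += len(r_pairs)
            char_err += wrong
            word_total += 1
            word_err += wrong > 0  # a word is wrong if any letter is wrong
            sentence_wrong = sentence_wrong or wrong > 0
        sent_err += sentence_wrong  # a sentence is wrong if any word is wrong
    return {
        "DER": 100 * char_err / char_total,
        "WER": 100 * word_err / word_total,
        "SER": 100 * sent_err / len(refs),
    }
```

Under these assumptions, a model is trained on pairs `(strip_diacritics(y), y)` for each diacritized sentence `y`. Because a single wrong case ending marks a whole word and sentence as wrong, WER and SER grow much faster than DER, which is consistent with the gap between the reported DER and SER.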