Abjad AI at KSAA-2026 Shared Task 2: Grouped Speech Conditioning for Arabic Diacritization

The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks

Abstract

We describe Abjad AI’s submission to KSAA-2026 Shared Task 2 on automatic diacritization of Arabic speech dictation. The task requires generating fully diacritized text given speech audio and an undiacritized transcript. Because text-only diacritization cannot resolve ambiguities that are recoverable from the acoustic signal, we propose conditioning a character-level encoder-only Transformer (CATT) (Alasmary et al., 2024) on speech representations. We introduce grouped speech conditioning, which downsamples speech encoder features into a small set of pooled tokens concatenated to the text input, enabling efficient fusion without architectural changes to CATT. We train with a two-phase schedule that first freezes the text encoder, then fine-tunes the full model. Our best system, using Whisper-small (Radford et al., 2022) features with five grouped tokens, achieves a Diacritization Error Rate (DER) of 6.60 and a Word Error Rate (WER) of 18.66 (without case endings, including no-diacritic) on the official test set. Notably, we find that Whisper-small consistently outperforms Whisper-large-v3, suggesting that compact speech representations better suit this fusion setting. This is an extended and revised version of our previous work (Ghannam et al., 2025).
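As a rough illustration of the grouped conditioning idea, the sketch below pools a sequence of speech frame vectors into a fixed number of tokens and concatenates them with the character embeddings. It is a minimal sketch under stated assumptions, not the submitted system: the function names `group_pool` and `condition_text` are hypothetical, mean pooling over contiguous groups is one possible downsampling choice, the pooled tokens are assumed to already share the text embedding dimension (no projection layer is shown), and placing them before the text tokens is an arbitrary ordering.

```python
from typing import List

Vector = List[float]


def group_pool(frames: List[Vector], num_groups: int = 5) -> List[Vector]:
    """Mean-pool T frame vectors into at most `num_groups` contiguous groups.

    Assumption: mean pooling over equal-width contiguous spans; the pooling
    operator used in the actual system may differ.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")

    total = len(frames)
    dim = len(frames[0])
    groups = min(num_groups, total)  # never create an empty group

    pooled: List[Vector] = []
    for g in range(groups):
        start = g * total // groups
        end = (g + 1) * total // groups
        chunk = frames[start:end]
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def condition_text(
    text_embeddings: List[Vector],
    frames: List[Vector],
    num_groups: int = 5,
) -> List[Vector]:
    """Concatenate pooled speech tokens with the character embeddings.

    Assumption: speech tokens come first and have the same dimension as the
    text embeddings; the text encoder itself is left unchanged.
    """
    return group_pool(frames, num_groups) + list(text_embeddings)


if __name__ == "__main__":
    speech = [[float(i), 0.0] for i in range(10)]  # 10 toy frames, dim 2
    chars = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # 3 toy character embeddings
    fused = condition_text(chars, speech, num_groups=5)
    print(len(fused))  # 5 speech tokens + 3 character tokens
```

Because the speech signal is reduced to a handful of extra input positions, the sequence seen by the text encoder grows by only `num_groups` tokens regardless of the audio length.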