Dynamic Layer Selection for Efficient Tone Recognition in Self-Supervised Speech Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Low-resource tonal languages present significant challenges to speech processing technologies due to limited training data and the critical role of pitch variation in expressing meaning. This paper applies established weighted layer combination methods to tone recognition in such languages, with a specific focus on Yoruba and Yemba. Building on our previous work with Wav2vec 2.0 representations and the weighted-sum methodology from Yang et al. (2024), we investigate layer specialisation in the SSA-HuBERT self-supervised speech model for tonal tasks. Our systematic analysis reveals significant performance differences across layers, with middle layers generally outperforming both lower and upper layers on tone recognition tasks. While typical approaches use only the output of the final layer, our experiments show that weighted layer combination reduces tone error rate (TER) by 20.4% and 15.8% relative to the last layer for Yoruba and Yemba, respectively. Beyond these performance improvements, our approach yields dramatic computational efficiency gains, reducing the resources required by over 90% compared to evaluating each layer separately. Analysis of the learned layer weights reveals language-specific patterns: Yoruba favours middle layers, while Yemba gives more weight to early layers. These results provide valuable insights into how tonal information is encoded in self-supervised speech models, and demonstrate a practical application of established layer combination methods in low-resource language contexts.
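The weighted layer combination described above is commonly realised as a softmax-normalised, learnable weighting over the model's per-layer hidden states. A minimal sketch of this idea, using NumPy and illustrative shapes (13 layers, 50 frames, 768 dimensions are assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a weight vector."""
    e = np.exp(w - w.max())
    return e / e.sum()

def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer representations with softmax-normalised weights.

    hidden_states: array of shape (num_layers, time, dim)
    layer_logits:  learnable vector of shape (num_layers,),
                   trained jointly with the downstream tone classifier
    """
    weights = softmax(layer_logits)                       # (num_layers,)
    return np.tensordot(weights, hidden_states, axes=1)   # (time, dim)

# Toy example with assumed dimensions (not the paper's actual setup).
rng = np.random.default_rng(0)
h = rng.standard_normal((13, 50, 768))  # 13 layers, 50 frames, 768 dims
logits = np.zeros(13)                   # uniform weights before training
out = weighted_layer_sum(h, logits)
print(out.shape)  # (50, 768)
```

With zero-initialised logits the weights are uniform, so the output is simply the mean over layers; during training the logits shift towards whichever layers carry the most tonal information, which is how the language-specific weight patterns reported above can be read off after convergence.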