Paper Information
Development of Burushaski Speech - English Text Translation Dataset
Paper Fields
Burushaski is a language isolate spoken in northern Pakistan with a predominantly oral tradition, limited standardized orthography, and virtually no existing speech technology infrastructure. These characteristics make conventional text-centric NLP pipelines unsuitable and position speech data collection as the primary scientific challenge. This paper introduces an audio-first, linguistically informed methodology for developing a Burushaski–English speech translation resource. Rather than prioritizing model architecture, we focus on principled corpus design tailored to the language’s morphological complexity, ergative-absolutive alignment, and four-gender agreement system. The dataset combines structured elicitation targeting high-frequency and morphologically diverse constructions, functional and formulaic speech, and oral narratives that capture discourse-level phenomena. We describe the design of a custom data collection application, a community-embedded crowdsourcing strategy, and a translation-aligned workflow for generating parallel speech–English data. The resulting pilot corpus comprises approximately 10 hours of curated audio from 42 speakers across controlled and naturalistic settings. While we present the results of preliminary translation experiments using Whisper, the primary contribution of this work is methodological: a scalable framework for speech-first corpus development in morphologically rich, under-resourced, and predominantly oral languages. We argue that for languages lacking stable orthography and large textual corpora, data design, not model selection, constitutes the central research problem.