Back to Main Conference 2012
LREC 2012main

New language resources for the Pashto language

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/2363vwa3ru4b

Abstract

This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly a monolingual text corpus of 100 million words is produced. Secondly a 100 hours speech database is recorded and manually transcribed. Finally a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto with a special focus on Machine Translation.

Details

Paper ID
lrec2012-main-490
Pages
pp. 2917-2922
BibKey
mostefa-etal-2012-new
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21 May 2012 27 May 2012

Authors

  • DM

    Djamel Mostefa

  • KC

    Khalid Choukri

  • SB

    Sylvie Brunessaux

  • KB

    Karim Boudahmane

Links