
Resource Creation for Training and Testing of Transliteration Systems for Indian Languages

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/3wyw9iep4zxt

Abstract

Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, such as Baraha, Akshara and Quillpad, use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also for the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.

Details

Paper ID
lrec2010-main-121
Pages
N/A
BibKey
b-etal-2010-resource
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 to 23 May 2010

Authors

  • Sowmya V. B.

  • Monojit Choudhury

  • Kalika Bali

  • Tirthankar Dasgupta

  • Anupam Basu
