Back to Main Conference 2010
LREC 2010main

Language Identification of Short Text Segments with N-gram Models

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/5d3whbd4ovb2

Abstract

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.

Details

Paper ID
lrec2010-main-193
Pages
N/A
BibKey
vatanen-etal-2010-language
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • TV

    Tommi Vatanen

  • JV

    Jaakko J. Väyrynen

  • SV

    Sami Virpioja

Links