Back to Main Conference 2010
LREC 2010main

Learning Language Identification Models: A Comparative Analysis of the Distinctive Features of Names and Common Words

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/3pic6wjon22h

Abstract

The intuition and basic hypothesis that this paper explores is that names are more characteristic of their language than common words are, and that a single name can have enough clues to confidently identify its language where random text of the same length wouldn't. To test this hypothesis, n-gramm modelling is used to learn language models which identify the language of isolated names and equally short document fragments. As the empirical results corroborate the prior intuition, an explanation is sought for the higher accuracy at which the language of names can be identified. The results of the application of these models, as well as the models themselves, are quantitatively and qualitatively analysed and a hypothesis is formed about the explanation of this difference. The conclusions derived are both technologically useful in information extraction or text-to-speech tasks, and theoretically interesting as a tool for improving our understanding of the morphology and phonology of the languages involved in the experiments.

Details

Paper ID
lrec2010-main-313
Pages
N/A
BibKey
konstantopoulos-2010-learning
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • SK

    Stasinos Konstantopoulos

Links