Back to Main Conference 2016
LREC 2016main

Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/2sfruoxcckcu

Abstract

In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us.

Details

Paper ID
lrec2016-main-474
Pages
pp. 2971-2975
BibKey
park-etal-2016-classifying
Editors
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23 - 28 May 2016

Authors

  • SP

    SoHyun Park

  • AF

    Afsaneh Fazly

  • AL

    Annie Lee

  • BS

    Brandon Seibel

  • WZ

    Wenjie Zi

  • PC

    Paul Cook

Links