LREC 2020 workshop

Automatic Term Extraction from Newspaper Corpora: Making the Most of Specificity and Common Features

Proceedings of the 6th International Workshop on Computational Terminology

DOI:10.63317/3viiidr5frr6

Abstract

The first step of any terminological work is to set up a reliable, specialized corpus composed of documents written by specialists and then to apply automatic term extraction (ATE) methods to this corpus in order to retrieve a first list of potential terms. The experiment we describe in this paper differs quite drastically from this usual process, since we apply ATE to unspecialized corpora. The corpus used for this study was built from newspaper articles retrieved from the Web using a short list of keywords. The general intuition on which this research is based is that ATE-based corpus comparison techniques can be used to capture both similarities and dissimilarities between corpora. The latter are exploited through a termhood measure and the former through word embeddings. Our initial results were validated manually and show that combining a traditional ATE method that focuses on dissimilarities between corpora with newer methods that exploit similarities (more specifically, distributional features of candidates) leads to promising results.
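
The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the smoothed log-ratio score stands in for whatever termhood measure the paper uses, the two-dimensional vectors stand in for trained word embeddings, and the function names, toy corpora, and thresholds are all hypothetical.

```python
import math
from collections import Counter


def log_ratio_termhood(target_tokens, reference_tokens):
    """Score target words by a smoothed relative-frequency log-ratio.

    Assumption: this simple score only stands in for a termhood measure;
    it rewards words that are more frequent in the target corpus than in
    the reference corpus (a dissimilarity between corpora).
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    vocab_size = len(set(target) | set(reference))
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scores = {}
    for word, count in target.items():
        # Add-one smoothing keeps words unseen in the reference corpus finite.
        p_target = (count + 1) / (n_target + vocab_size)
        p_reference = (reference[word] + 1) / (n_reference + vocab_size)
        scores[word] = math.log2(p_target / p_reference)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_candidates(scores, vectors, seed_threshold, sim_threshold):
    """Keep high-termhood seeds, then add words distributionally close to a seed."""
    seeds = {w for w, s in scores.items() if s >= seed_threshold}
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in seeds:
            continue
        if any(cosine(vec, vectors[s]) >= sim_threshold
               for s in seeds if s in vectors):
            expanded.add(word)
    return seeds, expanded


# Toy corpora and hand-made vectors, invented purely for illustration.
target_corpus = "the emissions rise as carbon emissions and warming rise".split()
reference_corpus = "the market and the game as the rise of the team".split()
toy_vectors = {
    "emissions": [0.9, 0.1],
    "carbon": [0.8, 0.2],
    "warming": [0.85, 0.15],
    "rise": [0.1, 0.9],
    "the": [0.0, 1.0],
}

scores = log_ratio_termhood(target_corpus, reference_corpus)
seeds, expanded = expand_candidates(scores, toy_vectors,
                                    seed_threshold=1.5, sim_threshold=0.95)
print(sorted(seeds))
print(sorted(expanded))
```

In this toy setting the frequency comparison alone retains only the most corpus-specific word, while the similarity step recovers candidates that behave like it but fall below the termhood threshold. Both thresholds are arbitrary here and would need tuning on real data.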

Details

Paper ID
lrec2020-ws-computerm-01
Pages
pp. 1-7
BibKey
drouin-etal-2020-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 6th International Workshop on Computational Terminology
Date
11–16 May 2020

Authors

  • Patrick Drouin

  • Jean-Benoît Morel

  • Marie-Claude L’Homme

Links