Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/5b99urno4q42

Abstract

Sublanguages are varieties of language that form “subsets” of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

Details

Paper ID
lrec2014-main-531
Pages
pp. 1714-1718
BibKey
temnikova-etal-2014-sublanguage
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 to 31 May 2014

Authors

  • Irina Temnikova
  • William A. Baumgartner Jr.
  • Negacy D. Hailu
  • Ivelina Nikolova
  • Tony McEnery
  • Adam Kilgarriff
  • Galia Angelova
  • K. Bretonnel Cohen
