Summary of the paper

Title Quality Indicators of LSP Texts ― Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Authors Jakob Halskov, Dorte Haltrup Hansen, Anna Braasch and Sussi Olsen
Abstract This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 million tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and “poor” LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of ‘best practice’ for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators remains to be thoroughly tested on more than just two domains.
Topics Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval, Other
Bibtex @InProceedings{HALSKOV10.505,
  author = {Jakob Halskov and Dorte Haltrup Hansen and Anna Braasch and Sussi Olsen},
  title = {Quality Indicators of LSP Texts ― Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }
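As a rough illustration of the threshold-based filtering the abstract describes, the sketch below computes two of the indicator types it names (word length and out-of-vocabulary items), rejects a text when both fall below a threshold, and scores the rejections with precision and recall. This is not the authors' system: the function names, the threshold values, the direction of each comparison, and the rule that both indicators must fall below their thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical cut-offs; the paper induces its values from a reference corpus.
    min_mean_word_length: float
    min_oov_ratio: float


def mean_word_length(tokens):
    """Average token length in characters (0.0 for an empty text)."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def oov_ratio(tokens, vocabulary):
    """Share of tokens not found in a general-language vocabulary."""
    if not tokens:
        return 0.0
    oov = sum(1 for t in tokens if t.lower() not in vocabulary)
    return oov / len(tokens)


def should_reject(tokens, vocabulary, thresholds):
    """Assumed rule: reject a text when both indicators fall below their thresholds."""
    return (
        mean_word_length(tokens) < thresholds.min_mean_word_length
        and oov_ratio(tokens, vocabulary) < thresholds.min_oov_ratio
    )


def precision_recall(predicted, gold):
    """Precision and recall for the positive class (here: texts to filter out)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    vocab = {"the", "cat", "sat", "on", "mat"}
    limits = Thresholds(min_mean_word_length=4.0, min_oov_ratio=0.1)
    general = ["the", "cat", "sat", "on", "the", "mat"]
    special = ["myocardial", "infarction", "requires", "thrombolysis"]
    print(should_reject(general, vocab, limits))
    print(should_reject(special, vocab, limits))
```

Under this reading, a precision of 100% means every rejected text was correctly rejected, while a recall of 55% means that roughly half of the texts that should have been rejected were caught.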