The Reuters Corpus Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/5aeuc8zmytsy

Abstract

Reuters, the global information, news and technology group, has for the first time made large quantities of archived Reuters news stories available free of charge for use by research communities around the world. The Reuters Corpus Volume 1 (RCV1) includes over 800,000 news stories - typical of the annual English-language news output of Reuters. This paper describes the origins of RCV1, the motivations behind its creation, and how it differs from previous corpora. In addition, we discuss the system of category coding, whereby each story is annotated for topic, region and industry sector. We also discuss the process by which these codes were applied, and examine the issues involved in maintaining quality and consistency of coding in an operational, commercial environment.

Details

Paper ID
lrec2002-main-080
Pages
N/A
BibKey
rose-etal-2002-reuters
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Third International Conference on Language Resources and Evaluation
Location
Las Palmas, Spain
Date
29 May 2002 to 31 May 2002

Authors

  • Tony Rose

  • Mark Stevenson

  • Miles Whitehead

Links