Publishing the Trove Newspaper Corpus

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/4wdwnbotn6o5

Abstract

The Trove Newspaper Corpus is derived from the digital archive of newspaper text held by the National Library of Australia (NLA). The corpus is a snapshot of the NLA collection, taken in 2015 and made available for language research as part of the Alveo Virtual Laboratory; it contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large-scale processing of the newspaper text in a cloud-based environment.

Details

Paper ID
lrec2016-main-715
Pages
pp. 4520-4525
BibKey
cassidy-2016-publishing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Steve Cassidy
