The WeSearch Corpus, Treebank, and Treecache – A Comprehensive Sample of User-Generated Content

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/2bbysorymv2u

Abstract

We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We describe the format of the syntacto-semantic annotations found in this resource and present initial parsing results for these data, as well as some reflections following a first round of treebanking.

Details

Paper ID
lrec2012-main-454
Pages
pp. 1829-1835
BibKey
read-etal-2012-wesearch
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21 May 2012 to 27 May 2012

Authors

  • Jonathon Read

  • Dan Flickinger

  • Rebecca Dridan

  • Stephan Oepen

  • Lilja Øvrelid
