Back to Main Conference 2010
LREC 2010main

Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/4mn5bc2no7ja

Abstract

Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also prove interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by automatically mining Wikipedia’s revision history. The WiCoPaCo corpus focuses on local modifications made by human revisors and include various types of corrections (such as spelling error or typographical corrections) and rewritings, which can be categorized broadly into meaning-preserving and meaning-altering revisions. We present an initial hand-built typology of these revisions, but the resource allows for any possible annotation scheme. We discuss the main motivations for building such a resource and describe the main technical details guiding its construction. We also present applications and data analysis on French and report initial results on spelling error correction and morphosyntactic rewriting. The WiCoPaCo corpus can be freely downloaded from http://wicopaco.limsi.fr.

Details

Paper ID
lrec2010-main-571
Pages
N/A
BibKey
max-wisniewski-2010-mining
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • AM

    Aurélien Max

  • GW

    Guillaume Wisniewski

Links