LREC 2008 main

Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/2id75yrzanby

Abstract

Traditional authorship attribution models extract normalized counts of lexical elements such as nouns, common words, and punctuation, and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a “bag-of-words”, and the order of words and their position relative to other words are largely ignored. We propose a new method of feature extraction which quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Testing carried out on blog corpora indicates that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, by adding complexity estimates to a model using ratios, we were able to increase the F-measure by 5.2% to 11.8%.
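
The contrast between a ratio feature and a complexity feature can be illustrated with a minimal sketch. The abstract does not say which Kolmogorov complexity estimator the authors use, so the zlib compression ratio below is an assumption standing in for it, and the function names (`indicator_string`, `complexity_estimate`, `ratio_feature`) are hypothetical. The idea shown is to mark each token position with 1 if it holds the lexical element of interest and 0 otherwise, then measure how compressible that sequence is.

```python
import random
import zlib


def indicator_string(tokens, is_element):
    """Mark each token position: '1' if the token is in the lexical class, else '0'."""
    return "".join("1" if is_element(tok) else "0" for tok in tokens)


def ratio_feature(bits):
    """Bag-of-words style feature: fraction of positions holding the element."""
    return bits.count("1") / len(bits) if bits else 0.0


def complexity_estimate(bits):
    """Compressed size over raw size.

    Assumption: zlib is used here only as a crude, computable proxy for
    Kolmogorov complexity; the paper's actual estimator may differ.
    """
    raw = bits.encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    # Two synthetic sequences with the same ratio of elements (0.5)
    # but different placement: strictly alternating vs. shuffled.
    regular = "10" * 500
    chars = list(regular)
    random.Random(0).shuffle(chars)
    shuffled = "".join(chars)

    for name, bits in (("regular", regular), ("shuffled", shuffled)):
        print(name, ratio_feature(bits), round(complexity_estimate(bits), 3))
```

Both sequences give the same ratio, so a ratio feature cannot tell them apart, while the compression-based estimate is lower for the regular sequence than for the shuffled one. In an attribution setting, one such estimate per lexical class could be appended to the ratio features before training a classifier, which is the kind of combination the abstract reports on.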

Details

Paper ID
lrec2008-main-031
Pages
N/A
BibKey
spracklin-etal-2008-using
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28 May 2008 to 30 May 2008

Authors

  • Leanne Spracklin

  • Diana Inkpen

  • Amiya Nayak

Links