LREC 2022main

The Subject Annotations of the Danish Parliament Corpus (2009-2017) - Evaluated with Automatic Multi-label Classification

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/4isqf2cpmwfg

Abstract

This paper addresses the semi-automatic annotation of subjects, also called policy areas, in the Danish Parliament Corpus (2009-2017) v.2. The corpus has recently been made available through the CLARIN-DK repository, the Danish node of the European CLARIN infrastructure. The paper also contains an analysis of the subjects in the corpus and a description of multi-label classification experiments conducted to verify the consistency of the subject annotation and the utility of the corpus for training classifiers on this type of data. The analysis of the corpus comprises an investigation of how often the parliament members addressed each subject and of the relation between the subjects and the gender of the speaker. The classification experiments show that classifiers can determine the two co-occurring subjects of the speeches from the agenda titles with a performance similar to that of human annotators. Moreover, a multilayer perceptron achieved an F1-score of 0.68 on the same task when trained on bag-of-words vectors obtained from the speeches’ lemmas. This is an improvement of more than 0.6 with respect to the baseline, a majority classifier that accounts for the frequency of the classes. The result is promising given the high number of subject combinations (186) and the skewness of the data.

Details

Paper ID
lrec2022-main-153
Pages
pp. 1428-1436
BibKey
navarretta-haltrup-hansen-2022-subject
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Costanza Navarretta

  • Dorte Haltrup Hansen

Links