A Bayesian Topic Model for Human-Evaluated Interpretability

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

Abstract

One desiderata of topic modeling is to produce interpretable topics. Given a cluster of document-tokens comprising a topic, we can order the topic by counting each word. It is natural to think that each topic could easily be labeled by looking at the words with the highest word count. However, this is not always the case. A human evaluator can often have difficulty identifying a single label that accurately describes the topic as many top words seem unrelated. This paper aims to improve interpretability in topic modeling by providing a novel, outperforming interpretable topic model Our approach combines two previously established subdomains in topic modeling: nonparametric and weakly-supervised topic models. Given a nonparametric topic model, we can include weakly-supervised input using novel modifications to the nonparametric generative model. These modifications lay the groundwork for a compelling setting—one in which most corpora, without any previous supervised or weakly-supervised input, can discover interpretable topics. This setting also presents various challenging sub-problems of which we provide resolutions. Combining nonparametric topic models with weakly-supervised topic models leads to an exciting discovery—a complete, self-contained and outperforming topic model for interpretability.

Resources

Details

Paper ID

lrec2022-main-674

Pages

pp. 6271-6279

DOI

10.63317/3osvfrcf4y9r

BibKey

wood-etal-2022-bayesian

Editors

Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis2020

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

79-10-95546-38-2

Conference

Thirteenth Language Resources and Evaluation Conference

Location

Marseille, France

Date

20 - 25 June 2022

Authors

JW
Justin Wood
CA
Corey Arnold
WW
Wei Wang

Links

URL

DOI