TweetCaT: a tool for building Twitter corpora of smaller languages

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/3rx4a5m9weub

Abstract

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages. Using the Twitter search API and a set of seed terms, the tool identifies users tweeting in the language of interest, together with their friends and followers. By running the tool for 235 days, we tested it on the task of collecting two monitor corpora, one for Croatian and Serbian and the other for Slovene, thus also creating new and valuable resources for these languages. We also describe a post-processing step on the collected corpora that filters out users who tweet predominantly in a foreign language, thus further cleaning the corpora. Finally, an experiment on discriminating between Croatian and Serbian Twitter users is reported.
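
The post-processing step mentioned in the abstract can be illustrated with a minimal sketch. This is not TweetCaT's actual code: the function name `filter_users`, the 0.5 threshold, and the toy language identifier are all assumptions made for illustration, since the abstract does not say how language is detected or what counts as "predominantly".

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def filter_users(
    tweets_by_user: Dict[str, List[str]],
    identify_language: Callable[[str], str],
    target_languages: Set[str],
    min_share: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, List[str]]:
    """Keep users whose share of target-language tweets exceeds min_share."""
    kept = {}
    for user, tweets in tweets_by_user.items():
        if not tweets:
            continue  # nothing to judge this user by
        counts = Counter(identify_language(t) for t in tweets)
        in_target = sum(counts[lang] for lang in target_languages)
        if in_target / len(tweets) > min_share:
            kept[user] = tweets
    return kept


def toy_identify_language(text: str) -> str:
    """Hypothetical stand-in for a real language identifier."""
    slovene_hints = {"je", "ni", "tudi", "zelo"}
    words = set(text.lower().split())
    return "sl" if words & slovene_hints else "other"


users = {
    "a": ["danes je zelo lepo", "to ni res"],
    "b": ["this is great", "hello world", "to je to"],
}
kept = filter_users(users, toy_identify_language, {"sl"})
print(sorted(kept))
```

In practice the toy identifier would be replaced by a proper language identification model, and the threshold would be tuned on manually inspected users.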

Details

Paper ID
lrec2014-main-642
Pages
pp. 2279-2283
BibKey
ljubesic-etal-2014-tweetcat
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26-31 May 2014

Authors

  • Nikola Ljubešić

  • Darja Fišer

  • Tomaž Erjavec

Links