MirasText: An Automatically Generated Text Corpus for Persian

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

Natural Language Processing is one of the most important fields of artificial intelligence. The rapid growth of digital content has made this field both practical and challenging. Unlike less-resourced languages such as Persian, dominant languages like English have several text corpora that can be used for NLP applications. In this paper we present MirasText, an automatically generated text corpus for the Persian language. Over 250 Persian websites were crawled, and several fields such as content, description, keywords, and title were extracted to generate MirasText. Topic modeling and language modeling are used to validate the generated corpus. MirasText has over 2.8 million documents and over 1.4 billion tokens, which to our knowledge makes it the largest Persian corpus currently available.
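The field extraction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the class `PageFieldExtractor`, the helper `extract_fields`, the sample page, and the choice of `<p>` text as the "content" field are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class PageFieldExtractor(HTMLParser):
    """Hypothetical extractor: collects title, meta description/keywords, and <p> text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": "", "content": ""}
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            # <meta name="description" content="..."> style tags
            self.fields[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data
        elif self._in_paragraph:
            self.paragraphs[-1] += data


def extract_fields(html):
    """Return a dict with the four fields for one crawled page (assumed layout)."""
    parser = PageFieldExtractor()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    fields["title"] = fields["title"].strip()
    # Assumption: the main content is the text of all non-empty <p> elements.
    fields["content"] = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return fields


# Made-up sample page, not taken from the corpus.
sample = """<html><head><title>خبر نمونه</title>
<meta name="description" content="توضیح کوتاه">
<meta name="keywords" content="ایران, فناوری">
</head><body><p>این یک متن نمونه است.</p><p>جمله دوم.</p></body></html>"""

record = extract_fields(sample)
token_count = len(record["content"].split())  # naive whitespace tokenization
print(record["title"], token_count)
```

Whitespace splitting is only a rough stand-in for tokenization; a real Persian pipeline would likely need normalization and a proper tokenizer, which this sketch does not attempt.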

Details

Paper ID

lrec2018-main-188

Pages

N/A

DOI

10.63317/4mhrs6r7ch29

BibKey

sabeti-etal-2018-mirastext

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Behnam Sabeti
Hossein Abedi Firouzjaee
Ali Janalizadeh Choobbasti
S.H.E. Mortazavi Najafabadi
Amir Vaheb
