Sudachi: a Japanese Tokenizer for Business

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

This paper presents Sudachi, a Japanese tokenizer, and its accompanying language resources for business use. Tokenization, or morphological analysis, is a fundamental technology for processing Japanese text, especially in industrial applications. However, Japanese tokenization faces many obstacles, such as inconsistent token units across resources, notation variations, discontinued maintenance of resources, and various issues with existing tokenizer implementations. To improve this situation, we develop a new tokenizer and dictionary with features such as multi-granular output for different purposes and normalization of notation variations. In addition, we plan to maintain our software and resources continuously over the long term as part of the company's business. We release the resulting tokenizer software and language resources to the public as open source software. They are available at https://github.com/WorksApplications/Sudachi.

Details

Paper ID

lrec2018-main-355

Pages

N/A

DOI

10.63317/2hsnuu4nxsnq

BibKey

takaoka-etal-2018-sudachi

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Kazuma Takaoka
Sorami Hisamoto
Noriko Kawahara
Miho Sakamoto
Yoshitaka Uchida
Yuji Matsumoto
