LREC 2020 Workshop

Automatic Myanmar Image Captioning using CNN and LSTM-Based Language Model

Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

DOI:10.63317/2yb2xc3mygbn

Abstract

An image captioning system involves both a computer vision module and a natural language processing (NLP) module: the computer vision module detects salient objects or extracts image features, while the NLP module generates syntactically and semantically correct captions. Although many image caption datasets, such as Flickr8k, Flickr30k and MSCOCO, are publicly available, most of them are captioned in English; there is no image caption corpus for the Myanmar language. In this work, a Myanmar image caption corpus is manually built for part of the Flickr8k dataset. Furthermore, a generative merge model based on a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network is applied to Myanmar image captioning, and two conventional feature extraction models, the Visual Geometry Group (VGG) OxfordNet 16-layer and 19-layer networks, are compared. The performance of the system is evaluated on the Myanmar image caption corpus using BLEU scores and 10-fold cross-validation.
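The merge architecture mentioned above fuses two streams: a CNN image feature vector (e.g. from a VGG16/VGG19 fully connected layer) and the hidden state of an LSTM language model, combined in a shared embedding space before a softmax over the vocabulary. Below is a minimal NumPy sketch of one decoding step under that scheme; the dimensions, random weights, and the fixed projection standing in for the trained LSTM are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
FEAT_DIM = 4096   # VGG16/VGG19 fc-layer feature size
EMBED_DIM = 256   # shared embedding size at the merge point
VOCAB = 1000      # toy vocabulary size

# Hypothetical weights standing in for trained layers.
W_img = rng.normal(0, 0.01, (FEAT_DIM, EMBED_DIM))   # image projection
W_txt = rng.normal(0, 0.01, (EMBED_DIM, EMBED_DIM))  # stand-in for the LSTM
W_out = rng.normal(0, 0.01, (EMBED_DIM, VOCAB))      # output (softmax) layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def merge_step(img_feat, caption_state):
    """One decoding step of a merge model: project both modalities into a
    common space, fuse them additively, and score the next word."""
    img_emb = np.tanh(img_feat @ W_img)        # image branch
    txt_emb = np.tanh(caption_state @ W_txt)   # language-model branch
    merged = img_emb + txt_emb                 # the "merge" (additive fusion)
    return softmax(merged @ W_out)             # next-word distribution

img_feat = rng.normal(size=FEAT_DIM)        # pretend CNN feature vector
caption_state = rng.normal(size=EMBED_DIM)  # pretend LSTM hidden state
probs = merge_step(img_feat, caption_state)
print(probs.shape, float(probs.sum()))
```

At inference time such a step would be applied repeatedly, feeding the sampled (or argmax) word back into the language model until an end-of-caption token is produced.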

Details

Paper ID
lrec2020-ws-sltu-19
Pages
pp. 139-143
BibKey
pa-pa-aung-etal-2020-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Location
N/A
Date
11–16 May 2020

Authors

  • San Pa Pa Aung

  • Win Pa Pa

  • Tin Lay Nwe

Links