Extracting French-Japanese Word Pairs from Bilingual Corpora based on Transliteration Rules

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

Previous work has shown that transliteration rules are highly useful for extracting Japanese Katakana and English word pairs. For Japanese-French pairs, however, the method is not guaranteed to work, because very few Japanese Katakana words are borrowed directly from French. In this paper we show that Japanese Katakana and French word pairs can be extracted by transliteration from loosely aligned Japanese-French bilingual corpora. The method applies all existing transliteration rules to each mora unit in a Katakana word, and extracts as its translation the French word that matches or partially matches one of the resulting transliteration candidates. For instance, if `グラフ' occurs in the Japanese part of a bilingual corpus, we generate transliteration candidates such as <graf>, <graphe>, <gulerph>,... and identify similar words in the French part of the corpus. The method performed reasonably well, achieving 80% precision at 20% recall. We also observed that Japanese-English transliteration rules worked well for extracting Katakana-French word pairs.
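The candidate generation and matching steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rule table RULES is a tiny hand-written stand-in for a real transliteration rule set, the input word グラフ (gurafu) is a hypothetical example, each Katakana character is treated as one mora (which ignores small kana such as ャ), and partial matching is approximated with the similarity ratio from Python's standard difflib module rather than with whatever measure the paper itself uses.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical rule table: each mora maps to several possible Latin spellings.
# A real rule set would be far larger; these entries are illustrative only.
RULES = {
    "グ": ["g", "gu"],
    "ラ": ["ra", "la"],
    "フ": ["f", "fe", "ph", "phe"],
}


def candidates(katakana):
    """Return every string obtained by choosing one spelling per mora."""
    # Simplifying assumption: one Katakana character equals one mora.
    options = [RULES[mora] for mora in katakana]
    return {"".join(parts) for parts in product(*options)}


def best_match(katakana, french_words, threshold=0.8):
    """Return the French word most similar to any candidate, or None.

    Similarity is difflib's ratio in [0, 1]; the threshold is an
    arbitrary illustrative value, not one taken from the paper.
    """
    best_word, best_score = None, 0.0
    for cand in candidates(katakana):
        for word in french_words:
            score = SequenceMatcher(None, cand, word.lower()).ratio()
            if score > best_score:
                best_word, best_score = word, score
    return best_word if best_score >= threshold else None


if __name__ == "__main__":
    print(sorted(candidates("グラフ")))
    print(best_match("グラフ", ["table", "graphe", "courbe"]))
```

In this sketch the candidate set grows as the product of the number of rules per mora, so a full rule set would probably need pruning or indexing before being compared against a large French vocabulary.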