International Journal of Computational Linguistics & Chinese Language Processing
Vol. 15, No. 2, June 2010


Title:
A Punjabi to Hindi Machine Transliteration System

Author:
Gurpreet Singh Josan, and Gurpreet Singh Lehal

Abstract:
Transliteration is the general choice for handling named entities and out-of-vocabulary words in machine translation (MT) applications. Transliteration (or forward transliteration) is the process of mapping source language phonemes or graphemes into target language approximations; the reverse process is called back transliteration. This paper presents a novel approach to improve Punjabi to Hindi transliteration by combining a basic character-to-character mapping approach with rule-based and Soundex-based enhancements. Experimental results show that our approach improves the word accuracy rate and average Levenshtein distance across the various categories by a large margin.

Keywords:
Transliteration, Punjabi, Hindi, Soundex Approach, Rule based Approach, Word Accuracy Rate
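
The two evaluation measures named in the abstract above, word accuracy rate and average Levenshtein distance, can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch assumes the standard definitions (exact-match fraction and unit-cost edit distance); the abstract does not say how the authors computed them. Strings are compared per Unicode code point, which for Devanagari may differ from a comparison per written syllable.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def _check(outputs: Sequence[str], references: Sequence[str]) -> None:
    if not outputs or len(outputs) != len(references):
        raise ValueError("need two non-empty lists of equal length")


def word_accuracy_rate(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of system outputs that exactly match their reference."""
    _check(outputs, references)
    return sum(o == r for o, r in zip(outputs, references)) / len(outputs)


def average_levenshtein(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Mean edit distance between each output and its reference."""
    _check(outputs, references)
    return sum(levenshtein(o, r) for o, r in zip(outputs, references)) / len(outputs)
```

A higher word accuracy rate and a lower average Levenshtein distance both indicate better transliteration output.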


Title:
A Posteriori Individual Word Language Models for Vietnamese Language

Author:
Le Quan Ha, Tran Thi Thu Van, Hoang Tien Long, Nguyen Huu Tinh, Nguyen Ngoc Tham, and Le Trong Ngoc

Abstract:
It is shown that the enormous improvement in the size of disk storage space in recent years can be used to build individual word-domain statistical language models, one for each significant word of a language that contributes to the context of the text. Each of these word-domain language models is a precise domain model for the relevant significant word; when combined appropriately, they provide a highly specific domain language model for the language following a cache, even a short cache. Our individual word probability and frequency models have been constructed and tested in the Vietnamese and English languages. For English, we employed the Wall Street Journal corpus of 40 million English word tokens; for Vietnamese, we used the QUB corpus of 6.5 million tokens. Our testing methods used a priori and a posteriori approaches. Finally, we explain the adjustment of a previously exaggerated prediction of the potential power of a posteriori models. Accurate improvements in perplexity for 14 kinds of individual word language models have been obtained in tests: (i) between 33.9% and 53.34% for Vietnamese and (ii) between 30.78% and 44.5% for English, over a baseline global trigram weighted average model. For both languages, the best a posteriori model is the a posteriori weighted frequency model, with a 44.5% perplexity improvement for English and a 53.34% perplexity improvement for Vietnamese. In addition, five Vietnamese a posteriori models were tested and achieved word-error-rate (WER) reductions of 9.9% to 16.8% over a Katz trigram model using the same Vietnamese speech decoder.

Keywords:
A Posteriori, Stop Words, Individual Word Language Models, Frequency Models
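
As a reading aid for the perplexity figures quoted in the abstract above, the following minimal sketch computes perplexity from per-token probabilities and a relative improvement over a baseline. It assumes the standard definition of perplexity and assumes that "improvement" means relative reduction with respect to the baseline; the abstract states neither formula, and the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of a test text, given the probability the model
    assigned to each token: exp(-(1/N) * sum(log p_i))."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def relative_improvement(baseline_pp: float, model_pp: float) -> float:
    """Percentage reduction in perplexity relative to the baseline
    (assumed meaning of 'improvement'; not stated in the abstract)."""
    return 100.0 * (baseline_pp - model_pp) / baseline_pp
```

Under this assumed reading, a model that halves the baseline perplexity would be reported as a 50% improvement.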


Title:
Improving the Template Generation for Chinese Character Error Detection with Confusion Sets

Author:
Yong-Zhi Chen, Shih-Hung Wu, Ping-che Yang, and Tsun Ku

Abstract:
In this paper, we propose a system that automatically generates templates for detecting Chinese character errors. We first collect the confusion sets for each high-frequency Chinese character. Error types include pronunciation-related errors and radical-related errors. With the help of the confusion sets, our system generates possible error patterns in context, which are then used as detection templates. Combined with a word segmentation module, our system generates more accurate templates. The experimental results show that the precision approaches 95%. Such a system should not only help teachers grade and check student essays, but also effectively help students learn how to write.

Keywords:
Template Generation, Template Mining, Chinese Character Error
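
The idea of turning confusion sets into detection templates, as outlined in the abstract above, can be sketched as follows. This is a toy illustration, not the authors' system: the two-entry confusion table, the function names, and the plain substring matching are all assumptions, and the word segmentation module mentioned in the abstract is omitted.

```python
from typing import Dict, List, Set, Tuple

# Toy confusion sets (illustrative only): correct character -> likely miswritings.
CONFUSION_SETS: Dict[str, Set[str]] = {
    "已": {"己"},  # visually similar characters
    "再": {"在"},  # same pronunciation (zai)
}


def generate_templates(word: str,
                       confusion_sets: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (erroneous_form, correct_form) pairs obtained by replacing
    one character of the word at a time with a confusable character."""
    templates = []
    for i, ch in enumerate(word):
        for wrong in sorted(confusion_sets.get(ch, ())):
            templates.append((word[:i] + wrong + word[i + 1:], word))
    return templates


def detect(text: str,
           templates: List[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Find every occurrence of an erroneous form in the text and
    return (position, erroneous_form, suggested_correction)."""
    hits = []
    for wrong, right in templates:
        start = text.find(wrong)
        while start != -1:
            hits.append((start, wrong, right))
            start = text.find(wrong, start + 1)
    return sorted(hits)
```

For instance, the correct word 已經 yields the template 己經, which would flag the miswritten sentence 我己經到了 and suggest 已經.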


Title:
Annotating Phonetic Component of Chinese Characters Using Constrained Optimization and Pronunciation Distribution

Author:
Chia-Hui Chang, Shu-Yen Lin, Shu-Ying Li, Meng-Feng Tsai, Shu-Ping Li, Hsiang-Mei Liao, Chih-Wen Sun, and Norden E. Huang

Abstract:
Generally speaking, Chinese characters are graphic characters that do not allow immediate pronunciation unless they are accompanied by Mandarin phonetic symbols (zhuyin) or another phonetic notation (e.g., a romanization system). In fact, about 80 to 90 percent of Chinese characters are pictophonetic characters, which are composed of a phonetic component and a semantic component. Therefore, even if one has not seen a character before, one can make a logical guess at its pronunciation and meaning from its phonetic and semantic components. In order to analyze such relations, we start by analyzing the characteristics of phonetic components. We found two interesting features that can be used to automatically identify the phonetic components of Chinese characters: one is pronunciation similarity, the other is pronunciation distribution. Experiments show that these two methods have high accuracy (90.8% and 98.1% for 9593 pictophonetic characters) in predicting the phonetic components of pictophonetic characters. These methods can save a lot of time and effort during the early stage of annotating phonetic components.

Keywords:
Picto-phonetic compounds, phonetic component, pronunciation similarity, pronunciation distribution, optimization
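
The pronunciation-similarity feature mentioned in the abstract above can be illustrated with a deliberately simple sketch: among a character's components, pick the one whose pronunciation is closest to that of the whole character. The scoring rule (one point each for a matching initial and a matching final, tones ignored), the data layout, and the function names are assumptions made for illustration; the abstract does not describe the authors' similarity measure or their constrained optimization.

```python
from typing import Dict, Tuple

# A pronunciation is modelled as (initial, final); tone is ignored (assumption).
Pron = Tuple[str, str]


def similarity(a: Pron, b: Pron) -> int:
    """Score 0-2: one point for the same initial, one for the same final."""
    return int(a[0] == b[0]) + int(a[1] == b[1])


def guess_phonetic_component(char_pron: Pron,
                             components: Dict[str, Pron]) -> str:
    """Return the component whose pronunciation best matches the character's.
    Ties are broken in favour of the component listed first."""
    if not components:
        raise ValueError("character has no components")
    return max(components, key=lambda c: similarity(char_pron, components[c]))


# Example: 媽 (ma) is composed of 女 (nü) and 馬 (ma).
ma_components = {"女": ("n", "ü"), "馬": ("m", "a")}
best = guess_phonetic_component(("m", "a"), ma_components)
```

Here 馬 shares both initial and final with 媽, so it is selected as the phonetic component, while 女 is left as the semantic component.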

