International Journal of Computational Linguistics & Chinese Language Processing
Vol. 11, No. 3, September 2006

An Approach to Using the Web as a Live Corpus for Spoken Transliteration Name Access
Ming-Shun Lin, Chia-Ping Chen, and Hsin-Hsi Chen
[pdf | html]

An Empirical Study of Word Error Minimization Approaches for Mandarin Large Vocabulary Continuous Speech Recognition
Jen-Wei Kuo, Shih-Hung Liu, Hsin-Min Wang, and Berlin Chen
[pdf | html]

Sense Extraction and Disambiguation for Chinese Words from Bilingual Terminology Bank
Ming-Hong Bai, Keh-Jiann Chen, and Jason S. Chang
[pdf | html]

A Probe into Ambiguities of Determinative-Measure Compounds
Shih-Min Li, Su-Chu Lin, Chia-Hung Tai, and Keh-Jiann Chen
[pdf | html]

Tonal Errors of Japanese Students Learning Chinese: A Study of Disyllabic Words
Ke-Jia Chang, Li-Mei Chen, and Nien-Chen Lee
[pdf | html]

Performance Analysis and Visualization of Machine Translation Evaluation
Jianmin Yao, Yunqian Qu, Qiang Lv, Qiaoming Zhu, and Jing Zhang
[pdf | html]

Title:
An Approach to Using the Web as a Live Corpus for Spoken Transliteration Name Access

Author:
Ming-Shun Lin, Chia-Ping Chen, and Hsin-Hsi Chen

Abstract:
Recognizing transliteration names is challenging due to their flexible formulation and lexical coverage. In our approach, we employ the Web as a giant corpus. The patterns extracted from the Web are used as a live dictionary to correct speech recognition errors. The plausible character strings recognized by an automatic speech recognition (ASR) system are treated as query terms and submitted to Google. The top N snippets are entered into PAT trees, and the terms with the highest scores are selected. Our experiments show that the ASR model with a recovery mechanism achieves a 21.54% performance improvement over the ASR-only model at the character level. The recall rate improves from 0.20 to 0.42, and the mean reciprocal rank (MRR) from 0.07 to 0.31. To collect transliteration names, we propose a named entity (NE) ontology generation engine, called the XNE-Tree engine, which produces relational named entities from a given seed. The engine incrementally extracts named entities that co-occur frequently with the seed. A total of 7,642 named entities in the ontology were generated from 100 seeds. When the bi-character language model is combined with the NE ontology, the ASR recall rate and MRR improve to 0.48 and 0.38, respectively.
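
The abstract above reports recall and MRR without defining them. The following is a minimal sketch of how the two metrics are commonly computed over ranked candidate lists. It assumes that "recall" means the fraction of spoken names whose correct form appears among the top N candidates; the paper may define it differently. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def reciprocal_rank(candidates: Sequence[str], target: str) -> float:
    """Return 1/rank of the first correct candidate, or 0.0 if it is absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists: Sequence[Sequence[str]],
                         targets: Sequence[str]) -> float:
    """Average the reciprocal rank over all queries."""
    if not targets:
        return 0.0
    total = sum(reciprocal_rank(c, t) for c, t in zip(ranked_lists, targets))
    return total / len(targets)


def recall_at_n(ranked_lists: Sequence[Sequence[str]],
                targets: Sequence[str], n: int) -> float:
    """Fraction of queries whose target appears in the top-n candidates
    (an assumed definition of the recall rate)."""
    if not targets:
        return 0.0
    hits = sum(1 for c, t in zip(ranked_lists, targets) if t in c[:n])
    return hits / len(targets)


# Hypothetical recognizer output: one ranked candidate list per spoken name.
ranked = [["a", "b", "c"], ["x", "y"], ["p", "q"]]
targets = ["b", "x", "z"]

print(mean_reciprocal_rank(ranked, targets))  # (1/2 + 1 + 0) / 3 = 0.5
print(recall_at_n(ranked, targets, n=3))      # 2 of 3 targets found
```

Because MRR rewards placing the correct name near the top of the list, it can rise even when recall stays flat, which is why the two numbers are reported side by side.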

Keyword:

Title:
An Empirical Study of Word Error Minimization Approaches for Mandarin Large Vocabulary Continuous Speech Recognition

Author:
Jen-Wei Kuo, Shih-Hung Liu, Hsin-Min Wang, and Berlin Chen

Abstract:
This paper presents an empirical study of word error minimization approaches for Mandarin large vocabulary continuous speech recognition (LVCSR). First, the minimum phone error (MPE) criterion, which is one of the most popular discriminative training criteria, is extensively investigated for both acoustic model training and adaptation in a Mandarin LVCSR system. Second, the word error minimization (WEM) criterion, used to rescore N-best word strings, is appropriately modified for a Mandarin LVCSR system. Finally, a series of speech recognition experiments is conducted on the MATBN Mandarin Chinese broadcast news corpus. The experimental results demonstrate that the MPE training approach reduces the character error rate (CER) by 12% for a system initially trained with the maximum likelihood (ML) approach. Meanwhile, for unsupervised acoustic model adaptation, MPE-based linear regression (MPELR) adaptation outperforms conventional maximum likelihood linear regression (MLLR) in terms of CER reduction. When the WEM decoding approach is used for N-best rescoring, a slight performance gain over the conventional maximum a posteriori (MAP) decoding method is also observed.

Keyword:
Broadcast News, Continuous Speech Recognition, Discriminative Training, Minimum Phone Error, Word Error Minimization
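
The results in this abstract are stated in terms of character error rate (CER). As a minimal sketch, CER is usually computed as the Levenshtein (edit) distance between the reference and recognized character strings, divided by the reference length. The function names and the sample sentence below are illustrative assumptions, not taken from the paper or from the MATBN corpus.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions
    needed to turn hyp into ref, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution and one deletion in six characters.
reference = "今天天氣很好"
hypothesis = "今天天器很"
print(edit_distance(reference, hypothesis))  # 2
print(character_error_rate(reference, hypothesis))
```

Mandarin systems are typically scored on characters rather than words because word boundaries are not marked in the text, so a character-level measure avoids penalizing segmentation differences.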

Title:
Sense Extraction and Disambiguation for Chinese Words from Bilingual Terminology Bank

Author:
Ming-Hong Bai, Keh-Jiann Chen, and Jason S. Chang

Abstract:
Using lexical semantic knowledge to solve natural language processing problems has become increasingly popular in recent years. Because semantic processing relies heavily on lexical semantic knowledge, the construction of lexical semantic databases has become urgent. WordNet is currently the best-known English semantic knowledge database, and many studies of word sense disambiguation adopt it as a standard. Because of the success of WordNet, there is a trend to construct WordNets in different languages. In this paper, we propose a methodology for constructing a Chinese WordNet by extracting information from a bilingual terminology bank. We first developed a word-to-word alignment algorithm to extract English-Chinese translation-equivalent word pairs. Then, the algorithm disambiguates word senses and maps Chinese word senses to WordNet synsets. In the word-to-word alignment experiment, the alignment algorithm achieves an F-score of 98.4%. In the word sense disambiguation experiment, the extracted senses cover 36.89% of WordNet synsets, and the three proposed disambiguation rules achieve accuracies of 80%, 83%, and 87%, respectively.

Keyword:
Word Alignment, Word Sense Disambiguation, WordNet, EM Algorithm, Sense Tagging

Title:
A Probe into Ambiguities of Determinative-Measure Compounds

Author:
Shih-Min Li, Su-Chu Lin, Chia-Hung Tai, and Keh-Jiann Chen

Abstract:
This paper further probes the ambiguity problems in the automatic identification of determinative-measure compounds (DMs) in Chinese and develops sets of rules to identify DMs and their parts of speech. Chinese DMs are known to be identifiable by regular expressions. DM rule matching helps resolve word segmentation ambiguities, and parts of speech help improve sense recognition and part-of-speech (POS) tagging. In this paper, an in-depth analysis based on corpus data was conducted. Through analyses of error identification and disambiguation of DMs, the authors classified three types of ambiguities: word segmentation, sense, and POS ambiguities. DM rules are necessary complements to dictionaries and help resolve word segmentation ambiguities by applying resolution principles and segmentation models. Sense and POS ambiguities are also expected to be resolved by different approaches during postprocessing.

Keywords:
Ambiguities, Word Segmentation Ambiguities, Sense Ambiguities, Part-of-Speech Ambiguities, Determinative-Measure Compounds
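
The abstract notes that Chinese DMs are identifiable by regular expressions. The sketch below illustrates the idea with a deliberately tiny, hypothetical rule: an optional determinative, a run of numerals, and a measure word. The character inventories and the example sentences are assumptions made for illustration and are not the authors' rule set.

```python
import re

# Hypothetical, heavily abbreviated inventories (not the paper's rules).
DETERMINATIVES = "這那每某各"
NUMERALS = "一二兩三四五六七八九十百千萬零0-9"
MEASURES = "個本張隻條位次件分"

# A DM is either (optional determinative + numerals) or a bare determinative,
# followed by a single measure word.
DM_PATTERN = re.compile(
    rf"(?:[{DETERMINATIVES}]?[{NUMERALS}]+|[{DETERMINATIVES}幾])[{MEASURES}]"
)


def find_dms(text: str) -> list[str]:
    """Return every substring that matches the toy DM pattern."""
    return DM_PATTERN.findall(text)


print(find_dms("我買了三本書和這兩張票"))  # ['三本', '這兩張']

# Pattern matching alone over-generates: here 十分 is the adverb "very",
# not "ten points", so the match would need later disambiguation.
print(find_dms("他十分高興"))  # ['十分']
```

The second call shows why rule matching is only a first step: the surface pattern fires on a string that is not a DM in context, which is the kind of case that a separate disambiguation stage has to handle.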

Title:
Tonal Errors of Japanese Students Learning Chinese: A Study of Disyllabic Words

Author:
Ke-Jia Chang, Li-Mei Chen, and Nien-Chen Lee

Abstract:
For foreign learners, managing tone is the greatest challenge in learning Chinese. Their inability to distinguish different tones stems from the phonological system of their native language. The accent in standard Japanese (Tokyo dialect) is distributed in the pitch change within each syllable, and the first syllable must be the opposite of the second in accent. This study investigated the discrepancy between the tonal production of Japanese students learning Chinese and that of native Chinese speakers. The two Japanese students in this study made the most frequent mistakes in reading Chinese disyllabic words when the first syllable was tone 2 or tone 3, and the tonal errors were mostly found in disyllabic words with tone combinations of 2-1, 2-4, and 3-4. We also found that in Group B (2-1, 2-2, 2-3, 2-4), whatever the original tones were, the two subjects always mispronounced them as 2-3. This is primarily attributed to the fact that, in Japanese, only one pitch peak is allowed in disyllabic compounds.

Keyword:
Japanese Students Learning Chinese, Disyllabic Words, Tonal Errors

Title:
Performance Analysis and Visualization of Machine Translation Evaluation

Author:
Jianmin Yao, Yunqian Qu, Qiang Lv, Qiaoming Zhu, and Jing Zhang

Abstract:
Automatic translation evaluation is popular in the development of MT systems, but further research is needed on better evaluation methods and on the selection of an appropriate evaluation suite. This paper attempts an in-depth analysis of the performance of MT evaluation methods. Difficulty, discriminability, and reliability characteristics are proposed and tested in experiments. Visualization of the evaluation scores, which is more intuitive, is proposed as a way to inspect translation quality and is shown to be a natural way to combine different evaluation methods.

Keyword:
Machine Translation, Performance, Analysis, Visualization, Clustering, Natural Language Processing