International Journal of Computational Linguistics & Chinese Language Processing
Vol. 10, No. 3, September 2005

Using Lexical Constraints to Enhance the Quality of Computer-Generated Multiple-Choice Cloze Items
Chao-Lin Liu, Chun-Hung Wang and Zhao-Ming Gao

Collocational Translation Memory Extraction Based on Statistical and Linguistic Information
Thomas C. Chuang, Jia-Yan Jian, Yu-Chia Chang and Jason S. Chang

Detecting Emotions in Mandarin Speech
Tsang-Long Pao, Yu-Te Chen, Jun-Heng Yeh and Wen-Yuan Liao

Modeling Pronunciation Variation for Bi-Lingual Mandarin/Taiwanese Speech Recognition
Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-Chin Chiang and Chun-Nan Hsu

Chinese Word Segmentation by Classification of Characters
Chooi-Ling Goh, Masayuki Asahara and Yuji Matsumoto

The Design and Construction of the PolyU Shallow Treebank
Ruifeng Xu, Qin Lu, Yin Li and Wanyin Li

Title:
Using Lexical Constraints to Enhance the Quality of Computer-Generated Multiple-Choice Cloze Items

Author:
Chao-Lin Liu, Chun-Hung Wang and Zhao-Ming Gao

Abstract:
Multiple-choice cloze items constitute a prominent tool for assessing students' competency in using the vocabulary of a language correctly. Without a proper estimation of students' competency in using vocabulary, it will be hard for a computer-assisted language learning system to provide course material tailored to each individual student's needs. Computer-assisted item generation allows the creation of large-scale item pools and further supports Web-based learning and assessment. With the abundant text resources available on the Web, one can create cloze items that cover a wide range of topics, thereby achieving usability, diversity and security of the item pool. One can apply keyword-based techniques like concordancing that extract sentences from the Web, and retain those sentences that contain the desired keyword to produce cloze items. However, such techniques fail to consider the fact that many words in natural languages are polysemous so that the recommended sentences typically include a non-negligible number of irrelevant sentences. In addition, a substantial amount of labor is required to look for those sentences in which the word to be tested really carries the sense of interest. We propose a novel word sense disambiguation-based method for locating sentences in which designated words carry specific senses, and apply generalized collocation-based methods to select distractors that are needed for multiple-choice cloze items. Experimental results indicated that our system was able to produce a usable cloze item for every 1.6 items it returned.

Keyword:
Computer-assisted language learning, Computer-assisted item generation, Advanced authoring systems, Natural language processing, Word sense disambiguation, Collocations, Selectional preferences

Title:
Collocational Translation Memory Extraction Based on Statistical and Linguistic Information

Author:
Thomas C. Chuang, Jia-Yan Jian, Yu-Chia Chang and Jason S. Chang

Abstract:
In this paper, we propose a new method for extracting bilingual collocations from a parallel corpus to provide phrasal translation memories. The method integrates statistical and linguistic information to achieve effective extraction of bilingual collocations. The linguistic information includes parts of speech, chunks, and clauses. The method involves first obtaining an extended list of English collocations from a very large monolingual corpus, then identifying the collocations in a parallel corpus, and finally extracting translation equivalents of the collocations based on word alignment information. Experimental results indicate that phrasal translation memories can be effectively used for computer-assisted language learning (CALL) and computer-assisted translation (CAT).

Keyword:
Bilingual Collocation Extraction, Collocational Translation Memory, Collocational Concordancer

Title:
Detecting Emotions in Mandarin Speech

Author:
Tsang-Long Pao, Yu-Te Chen, Jun-Heng Yeh and Wen-Yuan Liao

Abstract:
The importance of automatically recognizing emotions in human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. In this paper, a Mandarin speech-based emotion classification method is presented. Five primary human emotions (anger, boredom, happiness, neutral and sadness) are investigated. Combining different feature streams to obtain a more accurate result is a well-known statistical technique. For speech emotion recognition, we combined 16 LPC coefficients, 12 LPCC components, 16 LFPC components, 16 PLP coefficients, 20 MFCC components and jitter as the basic features to form the feature vector. Two corpora were employed. The recognizer presented in this paper is based on three classification techniques: LDA, K-NN and HMMs. Results show that the selected features are robust and effective for emotion recognition in the valence and arousal dimensions of the two corpora. Using the HMM-based emotion classification method, an average accuracy of 88.7% was achieved.

Keyword:
Mandarin, emotion recognition, LPC, LFPC, PLP, MFCC

Title:
Modeling Pronunciation Variation for Bi-Lingual Mandarin/Taiwanese Speech Recognition

Author:
Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-Chin Chiang and Chun-Nan Hsu

Abstract:
In this paper, a bi-lingual large vocabulary speech recognition experiment based on the idea of modeling pronunciation variations is described. The two languages under study are Mandarin Chinese and Taiwanese (Min-nan). These two languages are basically mutually unintelligible, and they have many words with the same Chinese characters and the same meanings, although they are pronounced differently. Observing the bi-lingual corpus, we found five types of pronunciation variations for Chinese characters. A one-pass, three-layer recognizer was developed that includes a combination of bi-lingual acoustic models, an integrated pronunciation model, and a tree-structure based searching net. The recognizer's performance was evaluated under three different pronunciation models. The results showed that the character error rate with integrated pronunciation models was better than that with pronunciation models, using either the knowledge-based or the data-driven approach. The relative frequency ratio was also used as a measure to choose the best number of pronunciation variations for each Chinese character. Finally, the best character error rates in Mandarin and Taiwanese testing sets were found to be 16.2% and 15.0%, respectively, when the average number of pronunciations for one Chinese character was 3.9.

Keywords:
Bi-lingual, One-pass ASR, Pronunciation Modeling

Title:
Chinese Word Segmentation by Classification of Characters

Author:
Chooi-Ling Goh, Masayuki Asahara and Yuji Matsumoto

Abstract:
During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method to solve the segmentation problem. First, we use a dictionary-based approach to segment the text. We apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the difference between FMM and BMM, and the context, we apply a classification method based on Support Vector Machines to re-assign the word boundaries. In so doing, we use the output of a dictionary-based approach, and then apply a machine-learning-based approach to solve the segmentation problem. Experimental results show that our model can achieve an F-measure of 99.0 for overall segmentation, given the condition that there are no unknown words in the text, and an F-measure of 95.1 if unknown words exist.

Keyword:
Chinese, word segmentation, segmentation ambiguity, unknown word, maximum matching algorithm, support vector machines
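
Illustration (not from the paper): the dictionary-based first step described in the abstract above, forward and backward Maximum Matching, can be sketched roughly as follows. The function names, the toy dictionary, and the example sentence are assumptions chosen for illustration. The SVM-based re-assignment of word boundaries is not shown; the sketch only shows where FMM and BMM disagree, which is the signal the abstract says the classifier builds on.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, dictionary):
    """Greedy right-to-left segmentation: take the longest dictionary word ending at each position."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Hypothetical toy dictionary and sentence containing a segmentation ambiguity.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
TEXT = "研究生命起源"

fmm = forward_max_match(TEXT, DICTIONARY)
bmm = backward_max_match(TEXT, DICTIONARY)
print("FMM:", fmm)
print("BMM:", bmm)
print("Disagree:", fmm != bmm)
```

On this toy input the two passes segment the sentence differently, and a disagreement of this kind marks a region where a character-level classifier could be asked to decide the boundaries.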

Title:
The Design and Construction of the PolyU Shallow Treebank

Author:
Ruifeng Xu, Qin Lu, Yin Li and Wanyin Li

Abstract:
This paper presents the design and construction of the PolyU Treebank, a manually annotated Chinese shallow treebank. The PolyU Treebank is based on shallow annotation, where only partial syntactic structures within sentences are annotated. Guided by the Phrase-Standard Grammar proposed by Peking University, the PolyU Treebank has been designed and constructed to provide a large amount of annotated data containing shallow syntactic information and limited semantic information for use in natural language processing (NLP) research. This paper describes the relevant design principles, annotation guidelines, and implementation issues, including the achievement of high-quality annotation through the use of a well-designed annotation workflow and effective post-annotation checking tools. Currently, the PolyU Treebank consists of a one-million-word annotated corpus and has been used in a number of NLP research projects with promising results.

Keyword:
Shallow Treebank, Shallow Parsing, Corpus Annotation, Natural Language Processing