International Journal of Computational Linguistics & Chinese Language Processing
Vol. 14, No. 3, September 2009


Title:
Modeling Taiwanese POS Tagging Using Statistical Methods and Mandarin Training Data

Author:
Un-Gian Iunn, Jia-hung Tai, Kiat-Gak Lau, Cheng-yan Kao, and Keh-jiann Chen

Abstract:
In this paper, we introduce a POS tagging method for Taiwan Southern Min. We use the more than 62,000 entries of the Taiwanese-Mandarin dictionary and 10 million words of Mandarin training data to tag Taiwanese. The literary written Taiwanese corpora have both Romanized script and Han-Romanization mixed script, and include prose, novels, and dramas. We follow the tagset drawn up by CKIP.
We developed a word alignment checker to assist with word alignment between the two scripts. For each Taiwanese word, our method searches the Taiwanese-Mandarin dictionary for corresponding Mandarin candidate words, selects the most suitable Mandarin word using an HMM probabilistic model estimated from the Mandarin training data, and tags the word using an MEMM classifier.
We achieve an accuracy of 91.6% on Taiwanese POS tagging and analyze the errors. We also obtain some preliminary Taiwanese training data.

Keywords:
Taiwan Southern Min, POS Tagging, Written Taiwanese, Hidden Markov Model, Maximum Entropy Markov Model.
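
The candidate-selection step described in the abstract above can be illustrated with a minimal sketch. It assumes a bigram transition model over Mandarin words and treats all dictionary candidates for a Taiwanese word as equally likely; the abstract does not specify either detail, so the function name `select_mandarin`, its parameters, and the `floor` smoothing value are hypothetical and not the authors' implementation.

```python
def select_mandarin(candidates, bigram_logp, unigram_logp, floor=-20.0):
    """Pick one Mandarin word per Taiwanese token with Viterbi decoding.

    candidates   : list of lists; candidates[i] holds the dictionary
                   candidates for the i-th Taiwanese word (hypothetical input)
    bigram_logp  : dict mapping (previous, current) to a log-probability
    unigram_logp : dict mapping a word to a log-probability
    floor        : assumed log-probability for unseen events
    """
    if not candidates:
        return []
    # best[i][w] = (best score of a path ending in w, previous word on it)
    best = [{w: (unigram_logp.get(w, floor), None) for w in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for w in candidates[i]:
            layer[w] = max(
                (best[i - 1][p][0] + bigram_logp.get((p, w), floor), p)
                for p in candidates[i - 1]
            )
        best.append(layer)
    # Trace back from the best-scoring final word.
    word = max(best[-1], key=lambda w: best[-1][w][0])
    path = [word]
    for i in range(len(candidates) - 1, 0, -1):
        word = best[i][word][1]
        path.append(word)
    return path[::-1]
```

In a full pipeline, the selected Mandarin sequence would then be passed to a separately trained tagger; that step is not sketched here.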


Title:
A Thesaurus-Based Semantic Classification of English Collocations

Author:
Chung-Chi Huang, Kate H. Kao, Chiung-Hui Tseng and Jason S. Chang

Abstract:
Researchers have developed many computational tools for extracting collocations for both second language learners and lexicographers. Unfortunately, the very large number of collocates returned by these tools usually overwhelms language learners. In this paper, we introduce a thesaurus-based semantic classification model that automatically learns semantic relations for classifying adjective-noun (A-N) and verb-noun (V-N) collocations into different thesaurus categories. Our model is based on an iterative random walk over a weighted graph derived from an integrated knowledge source that combines word senses in WordNet with the semantic categories of a thesaurus. We conduct an experiment on a set of collocations whose collocates involve varying levels of abstractness in the collocation usage box of the Macmillan English Dictionary. Experimental evaluation with a collection of 150 multiple-choice questions, commonly used as a similarity benchmark in the TOEFL synonym test, shows that the imposed thesaurus structure helps enhance collocation production for L2 learners. As a result, our methodology may improve the effectiveness of state-of-the-art collocation reference tools for language understanding and learning, as well as for lexicography.

Keywords:
Collocations, Semantic Classification, Semantic Relations, Random Walk Algorithm, Meaning Access Index, WordNet.
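
As a rough illustration of a random walk over a weighted graph, the sketch below propagates probability mass from a seed word to the nodes it is connected to. It uses a random walk with restart, which is an assumption: the abstract above does not say which variant the authors use, and the function name, the graph layout, and the `restart` value are hypothetical.

```python
def random_walk(graph, seed, restart=0.15, iterations=50):
    """Approximate the stationary distribution of a walk restarted at `seed`.

    graph : dict mapping node -> {neighbor: edge weight} (hypothetical layout)
    seed  : node where the walk starts and to which it restarts
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs} | {seed}
    prob = {n: 0.0 for n in nodes}
    prob[seed] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in prob.items():
            nbrs = graph.get(node, {})
            total = sum(nbrs.values())
            if total == 0:
                # No outgoing edges: send all mass back to the seed.
                nxt[seed] += mass
                continue
            # Move along edges in proportion to their weights.
            for nbr, weight in nbrs.items():
                nxt[nbr] += (1 - restart) * mass * weight / total
            # Restart with the remaining probability.
            nxt[seed] += restart * mass
        prob = nxt
    return prob
```

Under these assumptions, a collocate would be assigned to the thesaurus category node that ends up with the highest probability.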


Title:
Automatic Recognition of Cantonese-English Code-Mixing Speech

Author:
Joyce Y. C. Chan, Houwei Cao, P. C. Ching, and Tan Lee

Abstract:
Code-mixing is a common phenomenon in bilingual societies. It refers to the intra-sentential switching between two languages in a spoken utterance. This paper presents the first study on automatic recognition of Cantonese-English code-mixing speech, which is common in Hong Kong. The study starts with the design and compilation of code-mixing speech and text corpora. The problems of acoustic modeling, language modeling, and language boundary detection are investigated. Subsequently, a large-vocabulary code-mixing speech recognition system is developed based on a two-pass decoding algorithm. For acoustic modeling, it is shown that cross-lingual acoustic models are more appropriate than language-dependent models. The language models used are character tri-grams, in which the embedded English words are grouped into a small number of classes. Language boundary detection is done either by exploiting the phonological and lexical differences between the two languages or by using the result of cross-lingual speech recognition. The language boundary information is used to re-score the hypothesized syllables or words in the decoding process. The proposed code-mixing speech recognition system attains accuracies of 56.4% and 53.0% for the Cantonese syllables and English words in code-mixing utterances, respectively.

Keywords:
Automatic Speech Recognition, Code-mixing, Acoustic Modeling, Language Modeling.


Title:
Corpus, Lexicon, and Construction: A Quantitative Corpus Approach to Mandarin Possessive Construction

Author:
Cheng-Hsien Chen

Abstract:
Taking the Mandarin Possessive Construction (MPC) as an example, the present study investigates the relation between lexicon and constructional schemas using a quantitative corpus linguistic approach. We argue that the wide use of raw frequency distributions in traditional corpus linguistic studies may undermine the validity of the results and reduce the possibility of interdisciplinary communication. Furthermore, several methodological issues in traditional corpus linguistics are discussed. To mitigate the impact of these issues, we utilize phylogenic hierarchical clustering to identify semantic classes of the possessor NPs, thereby reducing the subjectivity in categorization that most traditional corpus linguistic studies suffer from. It is hoped that our rigorous methodology may have far-reaching implications for theory in usage-based approaches to language and cognition.

Keywords:
Discourse-functional Grammar, Construction Grammar, Quantitative Corpus Linguistics, Possession, Clustering.

