International Journal of Computational Linguistics & Chinese Language Processing
Vol. 10, No. 1, March 2005

 

• Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription

   Berlin Chen, Jen-Wei Kuo and Wen-Hung Tsai

   [pdf | html]

• Reduced N-Grams for Chinese Evaluation

   Le Quan Ha, R. Seymour, P. Hanna and F. J. Smith

   [pdf | html]

• Automated Alignment and Extraction of a Bilingual Ontology for Cross-Language Domain-Specific Applications

   Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen and Liang-Chih Yu

   [pdf | html]

• Chinese Main Verb Identification: From Specification to Realization

   Bing-Gong Ding, Chang-Ning Huang and De-Gen Huang

   [pdf | html]

• Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

   Thomas C. Chuang and Kevin C. Yeh

   [pdf | html]

• Similarity Based Chinese Synonym Collocation Extraction

   Wanyin Li, Qin Lu and Ruifeng Xu

   [pdf | html]



Title:

Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription

Author:

Berlin Chen, Jen-Wei Kuo and Wen-Hung Tsai

Abstract:

This article investigates the use of several lightly supervised and data-driven approaches to Mandarin broadcast news transcription. With the special structural properties of the Chinese language taken into consideration, a fast acoustic look-ahead technique for estimating the unexplored part of a speech utterance is integrated into lexical tree search to improve search efficiency. This technique is used in conjunction with the conventional language model look-ahead technique. Then, a verification-based method for automatic acoustic training data acquisition is proposed to make use of large amounts of untranscribed speech data. Finally, two alternative strategies for language model adaptation are studied with the goal of achieving accurate language model estimation. With the above approaches, the overall system was found in experiments to yield an 11.88% character error rate when applied to Mandarin broadcast news collected in Taiwan.
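The language-model look-ahead that the authors combine with their acoustic look-ahead is a standard lexical-tree technique: each node of the pronunciation prefix tree carries the best language-model score of any word still reachable below it, so weak paths can be pruned before a word ends. A minimal sketch, assuming a toy lexicon and unigram probabilities (neither is from the paper's actual models):

```python
def lm_lookahead(lexicon, lm_prob):
    """For every phone prefix in the lexical tree, record the best unigram
    probability of any word still reachable through that prefix."""
    scores = {}
    for word, phones in lexicon.items():
        p = lm_prob[word]
        for i in range(len(phones) + 1):
            prefix = phones[:i]
            scores[prefix] = max(scores.get(prefix, 0.0), p)
    return scores

# Toy lexicon: two syllables sharing the initial phone "b".
lexicon = {"bao": ("b", "au"), "bei": ("b", "ei")}
scores = lm_lookahead(lexicon, {"bao": 0.3, "bei": 0.6})
# At the shared prefix ("b",), the search can already apply the score 0.6.
```

The paper's acoustic look-ahead plays the complementary role of estimating the unexplored acoustic portion of the utterance; the two estimates are combined during lexical tree search.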

Keyword:

acoustic look-ahead, lightly supervised acoustic model training, language model adaptation, Mandarin broadcast news


Title:

Reduced N-Grams for Chinese Evaluation

Author:

Le Quan Ha, R. Seymour, P. Hanna and F. J. Smith

Abstract:

Theoretically, a language model improves as the n-gram size increases from 3 to 5 or higher. However, as the n-gram size increases, the number of parameters, the amount of computation, and the storage requirements all grow very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach previously developed by O'Boyle and Smith [1993] can be applied. A reduced n-gram language model, called a reduced model, can efficiently store the phrase histories of an entire corpus, whatever their length, within feasible storage limits. Another advantage of reduced n-grams is that they are usually semantically complete. In our experiments, the reduced n-gram creation method, i.e. the O'Boyle-Smith reduced n-gram algorithm, was applied to a large Chinese corpus. The Chinese reduced n-gram Zipf curves are presented here and compared with previously obtained conventional Chinese n-grams. The Chinese reduced model reduced perplexity by 8.74% and the language model size by a factor of 11.49. This paper is the first attempt to model Chinese reduced n-grams, and may provide important insights for Chinese linguistic research.
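The Zipf curves the abstract refers to can be reproduced for any tokenized corpus by ranking n-gram frequencies from highest to lowest; a minimal sketch (the tokenized sample below is invented, not drawn from the paper's corpus):

```python
from collections import Counter

def zipf_curve(tokens, n):
    """(rank, frequency) pairs for the n-grams of a token list;
    plotting them on log-log axes gives the Zipf curve."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return list(enumerate(sorted(grams.values(), reverse=True), start=1))

# Toy segmented Chinese text standing in for a real corpus.
tokens = "我 爱 北京 我 爱 上海 我 爱 北京".split()
curve = zipf_curve(tokens, 2)
```

A reduced model would rank variable-length, semantically complete phrases instead of fixed-size n-grams, but the rank-frequency plotting step is the same.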

Keyword:

Reduced n-grams, reduced n-gram algorithm / identification, reduced model, Chinese reduced n-grams, Chinese reduced model


Title:

Automated Alignment and Extraction of a Bilingual Ontology for Cross-Language Domain-Specific Applications

Author:

Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen and Liang-Chih Yu

Abstract:

This paper presents a novel approach to ontology alignment and domain ontology extraction from two existing knowledge bases: WordNet and HowNet. These two knowledge bases are automatically aligned to construct a bilingual ontology based on the co-occurrence of words in a bilingual parallel corpus. The bilingual ontology achieves greater structural and semantic coverage by drawing on the two complementary knowledge bases. For domain-specific applications, a domain-specific ontology is further extracted from the bilingual ontology using the island-driven algorithm and a domain-specific corpus. Finally, domain-dependent terminology, along with axioms relating the domain terms defined in a medical encyclopedia, is integrated into the domain-specific ontology. In addition, a metric based on a similarity measure is proposed for ontology evaluation. For evaluation purposes, experiments were conducted comparing the automatically constructed ontology with a benchmark ontology built by ontology engineers and experts. The experimental results show that the constructed bilingual domain-specific ontology largely coincided with the benchmark ontology. When the approach was applied to the medical domain, the experimental results show that it outperformed the synonym expansion approach to web search.
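The co-occurrence statistic that drives the alignment can be illustrated with simple counts over sentence-aligned pairs; a toy sketch (the mini-corpus and candidate words are invented, not WordNet or HowNet data):

```python
from collections import Counter

def cooccurrence(parallel_pairs):
    """Count (Chinese word, English word) co-occurrences over
    sentence-aligned bilingual text."""
    counts = Counter()
    for zh_tokens, en_tokens in parallel_pairs:
        for z in zh_tokens:
            for e in en_tokens:
                counts[(z, e)] += 1
    return counts

pairs = [
    (["醫生", "說"], ["doctor", "said"]),
    (["醫生", "來"], ["doctor", "came"]),
]
counts = cooccurrence(pairs)
# "醫生" co-occurs with "doctor" in both sentence pairs, so "doctor" wins.
best = max(["said", "came", "doctor"], key=lambda e: counts[("醫生", e)])
```

The paper aligns whole sense entries rather than surface words, but the underlying evidence is this kind of parallel-corpus co-occurrence.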

Keyword:

Ontology, island driven algorithm, cross language application, WordNet, HowNet


Title:

Chinese Main Verb Identification: From Specification to Realization

Author:

Bing-Gong Ding, Chang-Ning Huang and De-Gen Huang

Abstract:

Main verb identification is the task of automatically identifying the predicate verb in a sentence. It is useful for many applications in Chinese natural language processing. Although most studies have focused on the model used to identify the main verb, the definition of the main verb should not be overlooked. In designing our specification, we found many complicated issues that still need to be resolved, since they haven't been well discussed in previous works. Thus, the first novel aspect of our work is that we carefully design a specification for annotating the main verb and investigate various complicated cases. We hope this discussion will help to uncover the difficulties involved in this problem. Secondly, we present an approach to main verb identification based on chunk information, which leads to better results than an approach based on part-of-speech alone. Finally, based on careful observation of the studied corpus, we propose new local and contextual features for main verb identification. Following our specification, we annotated a corpus and then used a Support Vector Machine (SVM) to integrate all the features we propose. Our model, trained on the annotated corpus, achieved a promising F-score of 92.8%. Furthermore, we show that main verb identification can improve the performance of the Chinese Sentence Breaker, one application of main verb identification, by 2.4%.
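Chunk-based features of the kind the abstract describes can be sketched as follows; the chunk labels, feature names, and example sentence are illustrative simplifications, not the paper's actual feature set:

```python
def verb_candidate_features(chunks, i):
    """Local and contextual features for the head verb of chunk i.
    `chunks` is a list of (chunk_label, head_word) pairs; the feature
    names here are invented for illustration."""
    label, head = chunks[i]
    return {
        "head": head,
        "chunk": label,
        "prev_chunk": chunks[i - 1][0] if i > 0 else "BOS",
        "next_chunk": chunks[i + 1][0] if i + 1 < len(chunks) else "EOS",
        # A crude contextual cue: is this the last verb phrase in the sentence?
        "is_last_vp": label == "VP" and all(l != "VP" for l, _ in chunks[i + 1:]),
    }

# "他 决定 离开 公司" chunked as NP VP VP NP; chunks 1 and 2 are verb candidates.
chunks = [("NP", "他"), ("VP", "决定"), ("VP", "离开"), ("NP", "公司")]
feats = verb_candidate_features(chunks, 1)
```

Feature dictionaries like this would then be vectorized and fed to the SVM classifier, which scores each verb candidate as main verb or not.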

Keyword:

Chinese Main Verb Identification, Text Analysis, Natural Language Processing, SVM


Title:

Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

Author:

Thomas C. Chuang and Kevin C. Yeh

Abstract:

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written in two disparate languages, such as Chinese-English. It is possible to use cognates on top of the length-based approach to increase the alignment accuracy. However, cognates do not exist between two disparate languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of exploiting the statistically ordered matching of punctuation marks in two languages to achieve high-accuracy sentence alignment. We have experimented with an implementation of the proposed method on parallel corpora, the Chinese-English Sinorama Magazine Corpus and Scientific American Magazine articles, with satisfactory results. Compared with the length-based method, the proposed method exhibits better precision rates in our experimental results. Highly promising improvement was observed when both the punctuation-based and length-based methods were adopted within a common statistical framework. We also demonstrate that the method can be applied to other language pairs, such as English-Japanese, with minimal additional effort.
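The intuition behind punctuation-based alignment can be sketched by comparing the ordered punctuation sequences of a candidate sentence pair; the fixed mapping table below is a hypothetical stand-in for the statistical punctuation correspondences the paper actually estimates:

```python
import difflib

# Hypothetical Chinese-to-English punctuation mapping; the paper learns
# such correspondences statistically rather than from a fixed table.
PUNCT_MAP = {"，": ",", "。": ".", "？": "?", "！": "!", "：": ":", "；": ";"}
ENGLISH_PUNCT = set(",.?!:;")

def punct_seq(sentence):
    """The ordered punctuation sequence of a sentence, normalized to ASCII."""
    return "".join(PUNCT_MAP.get(ch, ch) for ch in sentence
                   if ch in PUNCT_MAP or ch in ENGLISH_PUNCT)

def punct_score(zh, en):
    """Similarity of the punctuation sequences of a candidate sentence pair."""
    return difflib.SequenceMatcher(None, punct_seq(zh), punct_seq(en)).ratio()
```

A pair whose punctuation sequences match in order scores 1.0, while a mismatched pairing scores lower; in the paper this evidence is combined with length-based scores in a common statistical framework.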

Keyword:

Sentence Alignment, Cognate Alignment, Machine Translation


Title:

Similarity Based Chinese Synonym Collocation Extraction

Author:

Wanyin Li, Qin Lu and Ruifeng Xu

Abstract:

Collocation extraction systems based on pure statistical methods suffer from two major problems. The first is their relatively low precision and recall rates. The second is their difficulty in dealing with sparse collocations. In order to improve performance, both statistical and lexicographic approaches should be considered. This paper presents a new method for extracting synonymous collocations using semantic information, obtained by calculating word similarities from HowNet. We have successfully extracted synonymous collocations which normally cannot be extracted using lexical statistics. Our evaluation, conducted on a 60 MB tagged corpus, shows that we can extract synonymous collocations that occur with very low frequency, and that the improvement in the recall rate is close to 100%. In addition, compared with a collocation extraction system based on the Xtract system for English, our algorithm improves the precision rate by about 44%.
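The filtering step behind such an approach can be sketched as follows; the similarity table is a toy stand-in for a HowNet-based similarity function, with invented word pairs and scores:

```python
# Toy stand-in for a HowNet-based similarity function; the pairs and
# scores below are invented for illustration.
SIM = {("买", "购买"): 0.9, ("车", "汽车"): 0.85, ("买", "踢"): 0.1}

def similarity(w1, w2):
    if w1 == w2:
        return 1.0
    return SIM.get((w1, w2)) or SIM.get((w2, w1)) or 0.0

def synonym_collocations(seed, candidates, threshold=0.7):
    """Keep a candidate word pair when both of its words are sufficiently
    similar to the corresponding words of the seed collocation."""
    head, dep = seed
    return [(h, d) for h, d in candidates
            if similarity(head, h) >= threshold and similarity(dep, d) >= threshold]

# Seed (买, 车) "buy car": (购买, 汽车) "purchase automobile" passes;
# (踢, 车) "kick car" is filtered out.
found = synonym_collocations(("买", "车"), [("购买", "汽车"), ("踢", "车")])
```

Because acceptance depends on semantic similarity rather than corpus frequency, a synonymous collocation can be extracted even when it is rare or unseen, which is how the approach addresses sparseness.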

Keyword:

Lexical Statistics, Synonymous Collocations, Similarity, Semantic Information