International Journal of Computational Linguistics & Chinese Language Processing
Vol. 10, No. 2, June 2005


Title:
Automatic Segmentation and Labeling for Mandarin Chinese Speech Corpora for Concatenation-based TTS

Author:
Cheng-Yuan Lin, Jyh-Shing Roger Jang and Kuan-Ting Chen

Abstract:
Precise phone/syllable boundary labeling of the utterances in a speech corpus plays an important role in constructing a corpus-based TTS (text-to-speech) system. However, automatic labeling based on Viterbi forced alignment does not always produce satisfactory results. Moreover, a suitable labeling method for one language does not necessarily produce desirable results for another language. Hence in this paper, we propose a new procedure for refining the boundaries of utterances in a Mandarin speech corpus. This procedure employs different sets of acoustic features for four different phonetic categories. In addition, a new scheme is proposed to deal with the "periodic voiced + periodic voiced" case, which produced most of the segmentation errors in our experiment. Several experiments were conducted to demonstrate the feasibility of the proposed approach.

Keyword:
speech assessment methods phonetic alphabet, speech corpus, sequential forward selection, k-nearest neighbor rule, leave-one-out, speaker-adapted model, context-dependent hidden Markov model (HMM)


Title:
The Formosan Language Archive: Linguistic Analysis and Language Processing

Author:
Elizabeth Zeitoun and Ching-Hua Yu

Abstract:
In this paper, we deal with the linguistic analysis approach adopted in the Formosan Language Corpora, one of the three main information databases included in the Formosan Language Archive, and the language processing programs that have been built upon it. We first discuss problems related to the transcription of different language corpora. We then deal with annotation rules and standards. We go on to explain the linguistic identification of clauses, sentences and paragraphs, and the computer programs used to obtain an alignment of words, glosses and sentences in Chinese and English. We finally show how we try to cope with analytic inconsistencies through programming. This paper is a complement to Zeitoun et al. [2003] in which we provided an overview of the whole architecture of the Formosan Language Archive.

Keyword:
Formosan languages, Formosan Language Archive, corpora, linguistic analysis, language processing


Title:
Mandarin Topic-oriented Conversations

Author:
Shu-Chuan Tseng

Abstract:
This paper describes the collection and processing of a pilot speech corpus annotated in dialogue acts. The Mandarin Topic-oriented Conversational Corpus (MTCC) consists of annotated transcripts and sound files of conversations between two familiar persons. Particular features of spoken Mandarin, such as discourse particles and paralinguistic sounds, are taken into account in the orthographical transcription. In addition, the dialogue structure is annotated using an annotation scheme developed for topic-specific conversations. Using the annotated materials, we present the results of a preliminary analysis of dialogue structure and dialogue acts. Related transcription tools and web query applications are also introduced in this paper.

Keyword:
Taiwan Mandarin, dialogue act, speech corpus


Title:
MATBN: A Mandarin Chinese Broadcast News Corpus

Author:
Hsin-Min Wang, Berlin Chen, Jen-Wei Kuo and Shih-Sian Cheng

Abstract:
The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and report on some preliminary statistical analysis and speech recognition evaluation results.

Keywords:
broadcast news, corpus, speech recognition, Mandarin Chinese, transcription, annotation


Title:
TAICAR-The Collection and Annotation of an In-Car Speech Database Created in Taiwan

Author:
Hsien-Chang Wang, Chung-Hsien Yang, Jhing-Fa Wang, Chung-Hsien Wu and Jen-Tzung Chien

Abstract:
This paper describes a project that aims to create a Mandarin speech database for the automobile setting (TAICAR). A group of researchers from several universities and research institutes in Taiwan have participated in the project. The goal is to generate a corpus for the development and testing of various speech-processing techniques. There are six recording sites in this project. Various words, sentences, and spontaneous queries uttered in the vehicular navigation setting have been collected in this project. A preliminary corpus of utterances from 192 speakers was created from utterances generated in different vehicles. The database contains more than 163,000 files, occupying 16.8 gigabytes of disk space.

Keyword:
TAICAR, in-car speech, speech database, multi-channel recording, corpus collection and annotation


Title:
Design and Development of a Bilingual Reading Comprehension Corpus

Author:
Kui Xu and Helen Meng

Abstract:
This paper describes our initial attempt to design and develop a bilingual reading comprehension corpus (BRCC). RC is a task that conventionally evaluates the reading ability of an individual. An RC system can automatically analyze a passage of natural language text and generate an answer for each question based on information in the passage. The RC task can be used to drive advancements of natural language processing (NLP) technologies imparted in automatic RC systems. Furthermore, an RC system presents a novel paradigm of information search, when compared to the predominant paradigm of text retrieval in search engines on the Web. Previous works on automatic RC typically involved English-only language learning materials (Remedia and CBC4Kids) designed for children/students, which included stories, human-authored questions, and answer keys. These corpora are important for supporting empirical evaluation of RC performance. In the present work, we attempted to utilize RC as a driver for NLP techniques in both English and Chinese. We sought parallel English and Chinese learning materials and incorporated annotations deemed relevant to the RC task. We measured the comparative levels of difficulty among the three corpora by means of the baseline bag-of-words (BOW) approach. Our results show that the BOW approach achieves better RC performance in BRCC (67%) when compared to Remedia (29%) and CBC4Kids (63%). This reveals that BRCC has the highest degree of word overlap between questions and passages among the three corpora, which artificially simplifies the RC task. This result suggests that additional effort should be devoted to authoring questions with various grades of difficulty in order for BRCC to better support RC research across the English and Chinese languages.

Keyword:
bilingual, reading comprehension, corpus


Title:
A Chinese Term Clustering Mechanism for Generating Semantic Concepts of a News Ontology

Author:
Chang-Shing Lee, Yau-Hwang Kuo, Chia-Hsin Liao and Zhi-Wei Jian

Abstract:
In order to efficiently manage and use knowledge, ontology technologies are widely applied to various kinds of domain knowledge. This paper proposes a Chinese term clustering mechanism for generating semantic concepts of a news ontology. We utilize the parallel fuzzy inference mechanism to infer the conceptual resonance strength of a Chinese term pair. There are four input fuzzy variables, consisting of a Part-of-Speech (POS) fuzzy variable, Term Vocabulary (TV) fuzzy variable, Term Association (TA) fuzzy variable, and Common Term Association (CTA) fuzzy variable, and one output fuzzy variable, the Conceptual Resonance Strength (CRS), in the mechanism. In addition, the CKIP tool is used in Chinese natural language processing tasks, including POS tagging, refining tagging, and stop word filtering. The fuzzy compatibility relation approach to the semantic concept clustering is also proposed. Simulation results show that our approach can effectively cluster Chinese terms to generate the semantic concepts of a news ontology.

Keyword:
Ontology, Chinese Natural Language Processing, Fuzzy Inference, Feature Selection, Concept Clustering


