International Journal of Computational Linguistics & Chinese Language Processing
Vol. 2, No. 1, February 1997


Title:
Computational Tools and Resources for Linguistic Studies

Author:
Yu-Ling Una Hsu, Jing-Shin Chang and Keh-Yih Su

Abstract:
This paper presents several computational tools and available resources that facilitate linguistic studies. For each tool, we demonstrate why it is useful and how it can be used for research, and we give linguistic examples for illustration. First, a search tool, Key Word in Context (KWIC), is introduced. This tool can automatically extract linguistically significant patterns from large corpora and help linguists discover syntagmatic generalizations. Second, Dynamic Clustering and Hierarchical Clustering are introduced for identifying natural clusters of words or phrases according to their distribution. Third, statistical measures of the degree of cohesion and correlation among linguistic units are presented. These measures can help linguists identify the boundaries of lexical units. Fourth, tools for aligning parallel texts at the word, sentence and structure levels are presented for linguists who do comparative studies of different languages. Fifth, we introduce Sequential Forward Selection (SFS) and Classification and Regression Tree (CART) for automatic rule ordering. Finally, some available electronic Chinese resources are described as a reference for interested readers.

Keyword:
extraction, clustering, cohesion, alignment, Chinese corpora, electronic dictionary
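
As a rough illustration of the Key Word in Context idea mentioned in the abstract above, the following minimal sketch lists each occurrence of a keyword together with a few tokens of left and right context. It is not the authors' tool: the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example, and the regular-expression tokenizer only suits whitespace-delimited text (Chinese text would first need word segmentation).

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, keyword, right) triples for every hit of `keyword`.

    `width` is the number of context tokens kept on each side.
    Tokenization is a simple lowercase word split (an assumption of
    this sketch, not a feature of any particular KWIC system).
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample text, used only to show the output layout.
sample = "The cat sat on the mat. The dog sat on the log."
for left, kw, right in kwic(sample, "sat", width=2):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>12} | {kw} | {right}")
```

Aligning the keyword in one column is what makes recurring patterns to its left and right easy to spot by eye.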


Title:
Measuring Relationship among Dialects: DOC and Related Resources

Author:
Chin-Chuan Cheng

Abstract:
This paper is a synthesis of past studies on the measurement of dialect relationships. The phonological data of 17 Chinese dialects that were computerized in the late 1960s have been utilized for measurements of dialect distance. In addition, a file of over 6,400 lexical variants in 18 dialects was used to quantify dialect affinity. The paper first explains the nature, the organization, and the coding of these files. A series of steps then illustrates how the phonological file was processed to derive the information needed for the calculation of correlation coefficients. The coefficients are treated as indices of dialect affinity. The dialects are then grouped by applying the average linkage method of cluster analysis to the coefficients. The appropriateness of the correlation method to the data is then discussed. Recent work on the calculation of dialect mutual intelligibility is presented to indicate the future direction of research.

Keyword:
Chinese dialects, measurements of affinity, measurements of mutual intelligibility, comparative dialectology
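
The general procedure named in the abstract above (correlation coefficients as affinity indices, followed by average linkage clustering) can be sketched as follows. This is only an illustration under stated assumptions: the four profiles `d1` to `d4` are invented numbers, not data from the dialect files, and converting a coefficient r into a distance as 1 - r is a choice made for this sketch rather than something the abstract specifies.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    Distance between two items is 1 - r (an assumption of this sketch);
    distance between two clusters is the mean over all member pairs.
    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    dist = {frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
            for a, b in combinations(profiles, 2)}

    def avg(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical feature counts for four unnamed varieties (invented data).
profiles = {
    "d1": [10, 8, 1, 0, 5],
    "d2": [9, 9, 2, 1, 4],
    "d3": [1, 0, 9, 10, 5],
    "d4": [2, 1, 8, 9, 6],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.3f}")
```

Reading the merge list from top to bottom gives the same information as a dendrogram: early merges join the most closely correlated profiles, and the last merge joins the most distant groups.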


Title:
MAT -- A Project to Collect Mandarin Speech Data Through Telephone Networks in Taiwan

Author:
Hsiao-Chuan Wang

Abstract:
A cooperative project, called Polyphone, was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to generate a speech corpus for the development of Mandarin-based speech technology and products. The speech data were collected at eight recording stations through telephone networks. The speakers were chosen so as to reflect the distribution of gender, dialect, educational level, and place of residence in Taiwan's population. A preliminary Mandarin speech database of 800 speakers has been produced. The final goal is to generate a speech database of at least 5,000 speakers.

Keyword:
Mandarin speech, Speech database, Speech I/O systems assessment, Telephone network


Title:
A Synchronous Chinese Language Corpus from Different Speech Communities: Construction and Applications

Author:
Benjamin K. T'sou, Hing-Lung Lin, Godfrey Liu, Terence Chan, Jerome Hu, Ching-hai Chew, and John K.P. Tse

Abstract:
Like other languages such as English, Spanish and Arabic, Chinese is used by a large number of speakers in distinct speech communities which, despite sharing a common language, vary in interesting ways. A systematic study of such linguistic variation is invaluable for appreciating the diversity and richness of the underlying cultures. This paper describes Project LIVAC (Linguistic Variation in Chinese Communities), which focuses on the development of a Chinese corpus based on data taken concurrently at regular intervals from multiple Chinese speech communities. The resulting database and computerized concordance, drawn from a corpus of approximately 20 million words with uniform time reference points extending across two years, enable linguists and social scientists to undertake meaningful qualitative and quantitative comparative analyses of the development of linguistic and cultural variation. To facilitate these studies, a framework for integrating the corpus with specific corpus analysis applications is proposed. Based on this framework, a prototype retrieval system has been designed and implemented; it supports longitudinal studies of word and concept distribution, as well as of lexical and other linguistic variation.


Title:
A Survey of Full-text Data Bases and Related Techniques for Chinese Ancient Documents in Academia Sinica (中央研究院古籍全文資料庫的發展概要)

Author:
Hsieh Ching-Chun, Lin Shih (謝清俊, 林晰)

Abstract:
This paper presents a survey of the full-text data bases and related text processing techniques developed for Chinese ancient documents at Academia Sinica over the past 12 years. Five institutes (namely the Institute of History and Philology, the Institute of Taiwan History, the Institute of Literature and Philosophy, the Institute of Information Science and the Institute of Modern History) and the Computing Center of Academia Sinica have actively participated in this long-range project since 1984. In addition, the Archival Library of National History has participated in developing the database of the Ching Dynasty. Since 1995, collaboration projects have been launched with other universities to produce more digital texts, including London University in England; Stanford University and Michigan University in the USA; Chinese University in Hong Kong; and Chung-Cheng University, Chung-San University and National Taiwan Normal University in Taiwan. The total character count of the on-line full-text data bases is now over 115 million, and data bases containing more than 80 million further characters are forthcoming. In this report, we also survey some important techniques that have been developed, including the structure of the full-text database, the ways of handling missing characters, the management of data entry jobs, and the development of the markup system. Finally, the status of some ongoing related research projects is summarized as a perspective on the future development of digital Chinese ancient documents.

中央研究院利用計算機處理古籍已有十二年,其中以全文資料庫的發展最受矚目,目前上線的全文資料庫文總字數已超過一億一仟萬字,其所用的技術則全由院內同仁自行開發。參與製作資料庫的共有五所:史語所、臺史所、資訊所、近史所、文哲所,以及本院計算中心,總統府國史館亦積極參與清史資料庫之開發。1995年開始,有些大學與本院發展合作關係共享古籍資料,包括國內的中山、中正、師大各大學,國外的倫敦大學、史丹佛大學、密西根大學、香港中文大學等。本文首先介紹各全文資料庫的發展現況,其次介紹自行開發的相關技術,包括:全文資料庫的結構、文章的標誌系統、資料登錄之管理、缺字造字之管理以及目前各單位相關的研究發展計劃等。

Keyword:
Full-text Database, Markup, Full-text Search, HTML, CTP, Font Database, Content Index


Title:
Historical Corpora for Synchronic and Diachronic Linguistics Studies (建構一個以共時與歷時語言研究為導向的歷史語料庫)

Author:
Pei-chuan Wei, P.M. Thompson, Cheng-hui Liu, Chu-Ren Huang, Chaofen Sun (魏培泉, 譚樸森, 劉承慧, 黃居仁, 孫朝奮)

Abstract:
The Academia Sinica Ancient Chinese Corpus is designed for linguistic research. The corpus contains ancient texts selected for their usefulness in grammatical and lexical studies, as well as an inspection program with keyword searching, statistics, and collocation functions. The corpus is divided into three subcorpora according to stages of grammatical development, so that both synchronic and diachronic studies can be performed on them. Their current sizes are as follows: a. Old Chinese subcorpus (from pre-Qin to Pre-Han): 5,128,068 characters. b. Middle Chinese subcorpus (from Late Han to the Six Dynasties): 8,101,662 characters. c. Early Mandarin Chinese subcorpus (from Tang to Ching): 4,406,381 characters. A large portion of the texts in the Old Chinese subcorpus (4,497,051 characters) has been textually classified and marked up according to source book, author, text genre, etc. A substantial part (520,794 characters) of the same subcorpus has also been segmented into words, which have in turn been tagged with parts of speech. The results of these two tasks form the basis of our Old Chinese Lexical Database.

中央研究院古漢語語料庫是為古漢語語言研究而構建的。這個語料庫不但具有大量的可作為古漢語語法及詞彙研究的電子文獻,而且擁有可以對文獻的語詞進行檢索、統計、搭配的多功能程式。以語法的發展為準,這個語料庫又分作上古漢語、中古漢語、近代漢語等三個次語料庫,相信這樣的劃分對古漢語的共時或歷時的研究都是頗為便益的。 現在上古漢語語料庫中有相當數量的文獻已經依據其原典、作者、文體等等完成了分類及標注的工作,其中又有不少文獻已經做了斷詞,在已斷詞的文獻中又有幾部古籍已完成詞類的標記。這些斷詞以及詞類標記的成果現已構成我們上古漢語詞彙庫的基礎。

Keyword:
corpus, lexical database, part-of-speech, mark-up, tagging, Old Chinese, Middle Chinese, Early Mandarin Chinese.