International Journal of Computational Linguistics & Chinese Language Processing
                                                                                          Vol. 19, No. 1, March 2014


Title:
A Novel Approach for Handling Unknown Word Problem in Chinese-Vietnamese Machine Translation

Author:
Phuoc Tran, and Dien Dinh

Abstract:
For languages in which spaces do not mark word boundaries, such as Chinese and Vietnamese, word segmentation is always the first task in a statistical machine translation (SMT) system. Word segmentation improves translation quality, but it also produces many unknown words (UKWs) in the target translation. In this paper, we present a novel approach to translating UKWs. Based on the meaning relationship between Chinese and Vietnamese, we build a model based on the meanings of the characters that form a UKW and then translate the UKW through the model. Experiments show that our method significantly improves the performance of SMT.

Keywords:
Chinese-Vietnamese SMT, Unknown Word, Sino-Vietnamese, Pure-Vietnamese, SVBUT Model, PVBUT Model


Title:
Joint Learning of Entity Linking Constraints Using a Markov-Logic Network

Author:
Hong-Jie Dai, Richard Tzong-Han Tsai, and Wen-Lian Hsu

Abstract:
Entity linking (EL) is the task of linking a textual named entity mention to a knowledge base entry. Traditional approaches have addressed the problem by dividing the task into separate stages: entity recognition/classification, entity filtering, and entity mapping, in which different constraints are used to improve the system's performance. Nevertheless, these constraints are executed separately and cannot be used interactively. In this paper, we propose an integrated solution to the task based on a Markov logic network (MLN). We show how the stage decision can be formulated and combined in an MLN. We conducted experiments on the biomedical EL task, gene mention linking (GML), and compared our model's performance with those of two other GML approaches. Our experimental results provide the first comprehensive GML evaluations from three different perspectives: article-wide precision/recall/F-measure (PRF), instance-based PRF, and question answering accuracy. This paper also provides formal definitions of all of the above EL tasks. Experimental results show that our method outperforms the baseline and state-of-the-art systems under all three evaluation schemes.

Keywords:
Entity Linking, Entity Disambiguation, Markov Logic Network, Gene Normalization


Title:
Linking Databases using Matched Arabic Names

Author:
Tarek El-Shishtawy

Abstract:
In this paper, a new hybrid algorithm that combines both token-based and character-based approaches is presented. The basic Levenshtein approach has also been extended to a token-based distance metric. The distance metric is enhanced to set the proper granularity level behavior of the algorithm. It smoothly maps a threshold of misspelling differences at the character level and the importance of token-level errors in terms of token position and frequency.

Using a large Arabic dataset, the experimental results show that the proposed algorithm successfully overcomes many types of errors, such as typographical errors, omission or insertion of middle name components, omission of non-significant popular names, and character variations across writing styles. When compared with other classical algorithms, using the same dataset, the proposed algorithm was found to increase the minimum success level of the best tested lower limit algorithm (Soft TFIDF) from 69% to about 80%, while achieving an upper accuracy level of 99.67%.

Keywords:
Name Matching, Record Linkage, Data Integration, Arabic NLP, Information Retrieval
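
The idea in the abstract above of extending Levenshtein distance from characters to tokens can be illustrated with a minimal sketch. This is not the author's algorithm: it omits the token position and frequency weighting the abstract mentions, and the function names, the 0.34 misspelling threshold, and the transliterated example names are all assumptions chosen for illustration.

```python
def char_levenshtein(a: str, b: str) -> int:
    """Classic character-level edit distance (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_sub_cost(t1: str, t2: str, char_threshold: float = 0.34) -> float:
    """Token substitution cost: 0 if the tokens differ only by a small
    misspelling (normalized character distance within the assumed
    threshold), otherwise 1."""
    if not t1 and not t2:
        return 0.0
    norm = char_levenshtein(t1, t2) / max(len(t1), len(t2))
    return 0.0 if norm <= char_threshold else 1.0


def token_levenshtein(name1: str, name2: str, char_threshold: float = 0.34) -> float:
    """Edit distance over whitespace-separated tokens, where token
    substitution is judged by character-level similarity."""
    x, y = name1.split(), name2.split()
    prev = [float(j) for j in range(len(y) + 1)]
    for i, tx in enumerate(x, start=1):
        curr = [float(i)]
        for j, ty in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1.0,                                  # drop a token
                curr[j - 1] + 1.0,                              # insert a token
                prev[j - 1] + token_sub_cost(tx, ty, char_threshold),
            ))
        prev = curr
    return prev[-1]
```

With these assumptions, a misspelled token ("mohamed" vs. "mohammed") costs nothing, while an omitted middle name component costs one token edit. A real system would weight each token edit by its position and frequency rather than using a flat cost of 1.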


Title:
On the Use of Speech Recognition Techniques to Identify Bird Species

Author:
Wei-Ho Tsai, and Yu-Zhi Xue

Abstract:
Wild bird watching has become a popular leisure activity in recent years. Very often, people can see birds or hear their sounds, but have no idea what kind of bird species they are seeing. To help people learn to identify bird species from their sounds, we apply speech recognition techniques to build an automatic bird sound identification system. In this system, two acoustic cues are used for analysis: timbre and pitch. In the timbre-based analysis, Mel-Frequency Cepstral Coefficients (MFCCs) are used to characterize the bird sound. Then, we use Gaussian Mixture Models to represent the MFCCs as a set of parameters. In the pitch-based analysis, we convert bird sounds from their waveform representations into a sequence of MIDI notes. Then, bigram models are used to capture the dynamic change information of the notes. We chose the top ten common bird species in the Taipei urban area to examine our system. Experiments conducted using audio data collected from commercial CDs and websites show that the timbre-based system, the pitch-based system, and their combination achieve 71.1%, 72.1%, and 75.0% bird sound identification accuracy, respectively.

Keywords:
Bird Species Identification, Bigram Model, Gaussian Mixture Model, Pitch, Timbre
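
The pitch-based step in the abstract above, in which bigram models capture how MIDI notes change over time, can be sketched as follows. This is only an illustration under stated assumptions: the class name, the add-one smoothing, the 128-note vocabulary, and the toy note sequences are not taken from the paper, and the waveform-to-MIDI conversion is assumed to have been done already.

```python
import math
from collections import Counter


class NoteBigramModel:
    """Bigram model over MIDI note numbers with add-one smoothing (assumed)."""

    def __init__(self, sequences, vocab_size=128):
        self.vocab_size = vocab_size          # MIDI defines note numbers 0..127
        self.pair_counts = Counter()          # counts of (previous, current) notes
        self.context_counts = Counter()       # counts of the previous note alone
        for seq in sequences:
            for prev, curr in zip(seq, seq[1:]):
                self.pair_counts[(prev, curr)] += 1
                self.context_counts[prev] += 1

    def log_likelihood(self, seq):
        """Sum of smoothed log P(current note | previous note) over the sequence."""
        total = 0.0
        for prev, curr in zip(seq, seq[1:]):
            num = self.pair_counts[(prev, curr)] + 1
            den = self.context_counts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def identify(seq, models):
    """Return the species whose bigram model scores the note sequence highest."""
    return max(models, key=lambda name: models[name].log_likelihood(seq))


# Toy training data; the species labels and note sequences are hypothetical.
models = {
    "species_a": NoteBigramModel([[60, 62, 64, 62, 60]] * 3),
    "species_b": NoteBigramModel([[70, 70, 72, 70, 70]] * 3),
}
```

Calling `identify([60, 62, 64], models)` picks the model that has seen those note transitions. One plausible way to combine this with the timbre-based system would be a weighted sum of the two log-likelihoods, though the abstract does not specify the combination rule.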

