Master Thesis

Generating Chinese Sentences: A Corpus-Based Approach

Yue-Shi Lee

Department of Computer Science and Information Engineering

National Taiwan University

Taipei, Taiwan, R.O.C.


Machine translation (MT) is a long-established research field. Conventional rule-based MT systems still suffer from several problems, and sentence generation is one of the major burdens and gaps in such systems.

    In this thesis, we focus on the Chinese sentence generation problem and propose corpus-based approaches to deal with it. In testing, Markov-based and word-association-based language models are used to select a suitable candidate. In training, forward and backward training models are used to produce two different types of training tables. The BDC corpus, consisting of 7,010 sentences (about 50,000 words), is used as training data. These sentences are segmented and syntactically tagged to provide constraints for the word-association-based language models. A portion of the 7,010 sentences is used as testing sentences.
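As an illustration of the forward and backward training tables, the following is a minimal sketch and not the thesis's actual implementation. It assumes a bigram (first-order Markov) model with add-one smoothing and a brute-force search over word orders. The function names (train_tables, log_score, best_order) and the three-sentence toy corpus of romanized tokens are hypothetical; they are not taken from the BDC corpus.

```python
import math
from collections import defaultdict
from itertools import permutations

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def train_tables(sentences):
    """Build a forward table (prev -> next) and a backward table (next -> prev)."""
    forward = defaultdict(lambda: defaultdict(int))
    backward = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        padded = [START] + list(words) + [END]
        for prev, nxt in zip(padded, padded[1:]):
            forward[prev][nxt] += 1
            backward[nxt][prev] += 1
    return forward, backward


def log_score(words, table, vocab_size, reverse=False):
    """Log-probability of a word order under one table (add-one smoothing, an assumption)."""
    padded = [START] + list(words) + [END]
    if reverse:
        # Read the sentence right to left so it matches the backward table.
        padded = padded[::-1]
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = table.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + vocab_size))
    return total


def best_order(bag, table, vocab_size, reverse=False):
    """Pick the highest-scoring permutation; feasible only for short sentences."""
    return list(max(permutations(bag),
                    key=lambda p: log_score(p, table, vocab_size, reverse)))


# Toy segmented corpus (hypothetical).
corpus = [["wo", "xihuan", "shu"], ["ta", "xihuan", "shu"], ["wo", "kan", "shu"]]
forward, backward = train_tables(corpus)
vocab_size = len({w for s in corpus for w in s}) + 2  # words plus the two markers

print(best_order(["shu", "wo", "xihuan"], forward, vocab_size))
print(best_order(["shu", "wo", "xihuan"], backward, vocab_size, reverse=True))
```

Enumerating every permutation grows factorially with sentence length, so a practical system would need a constrained search; the sketch only shows how the two tables score the same candidate from opposite directions.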

    Various language models are formulated. These models take different issues into account: constraints (word/word or POS/POS linear relations), word importance, and distance. The experimental results show that the word association language model with distance and the approximate n-gram Markov language model are the two most powerful and useful language models for practical systems, because they achieve a higher correct rate with fewer parameters.
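The idea of a word association model with distance can be sketched as follows. The abstract does not give the scoring formula, so this is only an assumed form: ordered co-occurrence counts of word pairs within a sentence, each divided by the distance between the two positions. The names train_association and association_score and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def train_association(sentences):
    """Count ordered word pairs (earlier word, later word) within each sentence."""
    pair_counts = Counter()
    for words in sentences:
        for i, j in combinations(range(len(words)), 2):
            pair_counts[(words[i], words[j])] += 1
    return pair_counts


def association_score(words, pair_counts):
    """Sum pair associations, down-weighted by distance (1 / (j - i) is an assumption)."""
    score = 0.0
    for i, j in combinations(range(len(words)), 2):
        score += pair_counts[(words[i], words[j])] / (j - i)
    return score


# Toy segmented corpus (hypothetical).
corpus = [["wo", "xihuan", "shu"], ["ta", "xihuan", "shu"], ["wo", "kan", "shu"]]
pair_counts = train_association(corpus)

for candidate in (["wo", "xihuan", "shu"], ["shu", "xihuan", "wo"]):
    print(candidate, association_score(candidate, pair_counts))
```

Replacing the word tokens with POS tags in the same two functions would give a POS/POS variant of the constraint under the same assumptions.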

    Finally, a parser is integrated into the system to provide linguistic constraints and to speed up execution. In addition, we propose a corpus-based lexical choice system that selects a suitable lexical item, and we integrate it into the system. This scheme can easily be extended to multilingual machine translation systems.