Journal Papers

Building a Bracketed Corpus Using φ² Statistics

Yue-Shi Lee and Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University

Taipei, Taiwan, R.O.C.


Treebank-based research is active in many natural language applications. However, building a large-scale treebank is laborious and tedious. This paper proposes two versions of a probabilistic chunker to help develop a bracketed corpus. The basic version partitions part-of-speech sequences into chunk sequences, which form a partially bracketed corpus. The recursive version applies the chunking actions recursively to generate a fully bracketed corpus. The training corpus is not a treebank; it is tagged with part-of-speech information only. The experimental results show that the probabilistic chunker achieves a correct rate of more than 92% in producing a partially bracketed corpus, and it also gives very encouraging results in generating a fully bracketed corpus. Moreover, this simple but effective design can be extended to other applications.
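As a rough illustration of the basic version, the sketch below estimates an association score between adjacent part-of-speech tags from a tagged-only corpus and cuts a tag sequence wherever the score falls below a threshold. It is a minimal sketch under stated assumptions, not the authors' chunker: the abstract does not give the formulation, so the score here is the standard 2x2 contingency-table statistic, (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)), and the thresholding rule, the function names `phi2_table` and `chunk`, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter


def phi2_table(tagged_sentences):
    """Build a phi-squared scorer for adjacent tag pairs.

    ASSUMPTION: uses the standard 2x2 contingency-table statistic;
    the paper's exact formulation is not given in the abstract.
    """
    pair = Counter()   # count of (t1, t2) as adjacent tags
    left = Counter()   # count of t1 in the left position
    right = Counter()  # count of t2 in the right position
    total = 0
    for tags in tagged_sentences:
        for t1, t2 in zip(tags, tags[1:]):
            pair[(t1, t2)] += 1
            left[t1] += 1
            right[t2] += 1
            total += 1

    def phi2(t1, t2):
        a = pair[(t1, t2)]      # t1 followed by t2
        b = left[t1] - a        # t1 followed by something else
        c = right[t2] - a       # something else followed by t2
        d = total - a - b - c   # neither
        denom = (a + b) * (a + c) * (b + d) * (c + d)
        if denom == 0:
            return 0.0
        return (a * d - b * c) ** 2 / denom

    return phi2


def chunk(tags, phi2, threshold):
    """Split a tag sequence wherever adjacent-tag association is weak.

    ASSUMPTION: a simple threshold rule, chosen only for illustration.
    """
    if not tags:
        return []
    chunks = [[tags[0]]]
    for prev, cur in zip(tags, tags[1:]):
        if phi2(prev, cur) >= threshold:
            chunks[-1].append(cur)   # strong association: extend chunk
        else:
            chunks.append([cur])     # weak association: start new chunk
    return chunks
```

Because the statistic is squared, it does not distinguish positive from negative association; a practical chunker would likely need an extra check for that. The recursive version is not covered by this sketch.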