Journal Papers

A Corpus-Based Language Model for Topic Identification

Hsin-Hsi Chen, Kuang-Hua Chen and Yue-Shi Lee

Department of Computer Science and Information Engineering

National Taiwan University

Taipei, Taiwan, R.O.C.

Abstract

This paper proposes a corpus-based language model for discourse analysis. We analyze the association of noun-noun and noun-verb pairs in the LOB corpus. The word association norms are based on three factors: (1) word importance, (2) pair occurrence, and (3) distance. They are trained at the paragraph level for noun-noun pairs and at the sentence level for noun-verb pairs. Under the topic coherence postulate, the nouns with stronger connectivity to the other nouns and verbs in the discourse form the preferred topic set. These collocational semantics are used to identify the topics of paragraphs, to discuss topic shift phenomena, and to abstract the topics of a text.