A Corpus-Based Language Model for Topic Identification
Hsin-Hsi Chen, Kuang-Hua Chen and Yue-Shi Lee
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
Abstract
This paper proposes a corpus-based language model for discourse analysis. We analyze the
association of noun-noun and noun-verb pairs in the LOB corpus. The word association norms are based
on three factors: (1) word importance, (2) pair occurrence, and (3) distance. They are trained at
the paragraph level for noun-noun pairs and at the sentence level for noun-verb pairs. Under the topic
coherence postulate, the nouns with the strongest connectivity to the other nouns and verbs
in the discourse form the preferred topic set. Collocational semantics is then used to identify the topics of
paragraphs, to discuss topic shift phenomena, and to abstract the topics of texts.
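
The following Python sketch illustrates how an association model built on the three factors named above might be trained and then used to pick a topic noun. It is not the formulation used in this paper: the importance weight (an inverse-frequency-style logarithm), the way the factors are combined (product of importances divided by distance, summed over occurrences), the distance units, and the names `train` and `pick_topic` are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import log


def train(paragraphs):
    """Accumulate illustrative association strengths from tagged text.

    paragraphs: list of paragraphs; a paragraph is a list of sentences;
    a sentence is a list of (word, tag) pairs with tag "N" (noun) or "V" (verb).
    Returns (nn, nv): noun-noun and noun-verb strength tables.
    """
    # Factor 1, word importance (ASSUMED form): words that occur in fewer
    # paragraphs receive a larger weight.
    df = defaultdict(int)
    for para in paragraphs:
        for w in {w for sent in para for w, _ in sent}:
            df[w] += 1
    n = len(paragraphs)
    imp = {w: log(1 + n / c) for w, c in df.items()}

    nn = defaultdict(float)  # key: alphabetically sorted (noun, noun)
    nv = defaultdict(float)  # key: (noun, verb)
    for para in paragraphs:
        # Noun-noun pairs are collected at the paragraph level.
        # Factor 3, distance (ASSUMED unit): sentences apart, plus one.
        nouns = [(w, i) for i, sent in enumerate(para)
                 for w, t in sent if t == "N"]
        for (a, i), (b, j) in combinations(nouns, 2):
            if a != b:
                # Factor 2, pair occurrence: every co-occurrence adds a term.
                nn[tuple(sorted((a, b)))] += imp[a] * imp[b] / (abs(i - j) + 1)
        # Noun-verb pairs are collected at the sentence level.
        # Distance here is ASSUMED to be the gap in token positions.
        for sent in para:
            for p, (a, ta) in enumerate(sent):
                for q, (b, tb) in enumerate(sent):
                    if ta == "N" and tb == "V":
                        nv[(a, b)] += imp[a] * imp[b] / abs(p - q)
    return nn, nv


def pick_topic(para, nn, nv):
    """Return the noun with the largest summed strength to the other
    nouns and verbs of the paragraph, or None if it has no nouns."""
    nouns = {w for sent in para for w, t in sent if t == "N"}
    verbs = {w for sent in para for w, t in sent if t == "V"}
    scores = {a: 0.0 for a in nouns}
    for a in nouns:
        for b in nouns:
            if a != b:
                scores[a] += nn.get(tuple(sorted((a, b))), 0.0)
        for v in verbs:
            scores[a] += nv.get((a, v), 0.0)
    return max(scores, key=scores.get) if scores else None
```

Given a tagged corpus, `nn, nv = train(corpus)` builds the two tables and `pick_topic(paragraph, nn, nv)` returns a single candidate topic noun. Returning the top few nouns rather than one would correspond more closely to a preferred topic set.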