Document Clustering with Cluster Refinement and Model Selection Capabilities

1Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

Document Clustering with Cluster Refinement and Model Selection Capabilities

Advisor : Dr. Hsu

Presenter : Shu-Ya Li

Authors : Xin Liu, Yihong Gong, Wei Xu,

　　 Shenghuo Zhu

2002 . SIGIR . Page(s) : 191 - 198


N.Y.U.S.T.

I. M.Outline

Motivation

Objective

Method

Experimental Result

Conclusion

Personal Opinions


N.Y.U.S.T.

I. M.Motivation

The problems and limitations:

1) The user must formulate the query using the keywords.

2) Traditional text search engines is a narrowly specified search for documents matching the user’s query.

3) Traditional search engine returns hundreds, or even thousands of hits.


N.Y.U.S.T.

I. M.Objective

We propose a document clustering method that strives to achieve:

1) a high accuracy of document clustering

2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)


N.Y.U.S.T.

I. M.Method

Feature Set Term frequencies (TF)

Name entities (NE)

Term pairs (TP)

The documents reporting the Clinton-Lewinsky scandal

The common name entities ： ”Clinton”, ”Lewinsky”, ”Ken Starr”, ”Linda Tripp”, etcThe word pairs ： ”grand jury”, ”independent counsel”, ”supreme court”


N.Y.U.S.T.

I. M.Method - self-refinement process

Apply the iterative voting scheme to refine the document clusters.

1) GMM + EM algorithm

GMM

EM algorithm


N.Y.U.S.T.

I. M.Method - self-refinement process

2) Identify discriminative features F = {f1, f2, . . . , fΛ} along with cluster labels S = {σ1, σ2, . . . , σΛ}

Define the discriminative feature metric DFM(fi)

3)

4) Compare the new document cluster set with C.

The result converges →terminate the process Otherwise →set C to the new cluster set, and go to Step 2.


N.Y.U.S.T.

I. M.Method - Model Selection

measure the similarity between C and C’

The model selection algorithm1) Guess the possible number of document clusters from the data range (Rl,Rh).

2) Set k = Rl.

3) Cluster the document corpus into k clusters.

4) Compute between each pair of the results, and take the average on all the .

5) If k < Rh, k = k + 1, go to Step 3; otherwise, go to Step 6.

6) Select the k which yields the largest average .


N.Y.U.S.T.

I. M.

1) GMM + EM algorithm

2)

3)

Experimental Result - Document Clustering Evaluation

ABC+CNN-01-13-18-32-48-70-71-77-86[ 新聞機構 - 新聞事件類別 - 報導次數 ]


N.Y.U.S.T.

I. M.Experimental Result - Model Selection Evaluation

Compared with the BIC-based model selection method


N.Y.U.S.T.

I. M.Conclusion

To accurately cluster the given document corpus by using the GMM Model together with EM algorithm.

The model selection capability has been achieved by guessing a value C for the number of clusters N.


N.Y.U.S.T.

I. M.Personal Opinions

Advantage

1) high accuracy of document clustering

2) the model selection capability

Drawback …

Application …

Documents

Document Clustering with Cluster Refinement and Model Selection Capabilities