12
1 Intelligent Database Systems Lab 國國國國國國國國 National Yunlin University of Science and T echnology Document Clustering with Cluster Refinement and Model Selection Capabilities Advisor : Dr. Hsu Presenter : Shu-Ya Li Authors : Xin Liu, Yihong Gong, W ei Xu, Shenghuo Zhu 2002 . SIGIR . Page(s) : 191 - 198

Document Clustering with Cluster Refinement and Model Selection Capabilities

Embed Size (px)

DESCRIPTION

Document Clustering with Cluster Refinement and Model Selection Capabilities. Advisor : Dr. Hsu Presenter : Shu-Ya Li Authors : Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu. 2002 . SIGIR . Page(s) : 191 - 198. Outline. Motivation Objective Method Experimental Result - PowerPoint PPT Presentation

Citation preview

1Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

Document Clustering with Cluster Refinement and Model Selection Capabilities

Advisor : Dr. Hsu

Presenter : Shu-Ya Li

Authors : Xin Liu, Yihong Gong, Wei Xu,

   Shenghuo Zhu

2002 . SIGIR . Page(s) : 191 - 198

2Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Outline

Motivation

Objective

Method

Experimental Result

Conclusion

Personal Opinions

3Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Motivation

The problems and limitations:

1) The user must formulate the query using the keywords.

2) Traditional text search engines is a narrowly specified search for documents matching the user’s query.

3) Traditional search engine returns hundreds, or even thousands of hits.

4Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Objective

We propose a document clustering method that strives to achieve:

1) a high accuracy of document clustering

2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)

5Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Method

Feature Set Term frequencies (TF)

Name entities (NE)

Term pairs (TP)

The documents reporting the Clinton-Lewinsky scandal

The common name entities : ”Clinton”, ”Lewinsky”, ”Ken Starr”, ”Linda Tripp”, etcThe word pairs : ”grand jury”, ”independent counsel”, ”supreme court”

6Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Method - self-refinement process

Apply the iterative voting scheme to refine the document clusters.

1) GMM + EM algorithm

GMM

EM algorithm

7Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Method - self-refinement process

2) Identify discriminative features F = {f1, f2, . . . , fΛ} along with cluster labels S = {σ1, σ2, . . . , σΛ}

Define the discriminative feature metric DFM(fi)

3)

4) Compare the new document cluster set with C.

The result converges →terminate the process Otherwise →set C to the new cluster set, and go to Step 2.

8Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Method - Model Selection

measure the similarity between C and C’

The model selection algorithm1) Guess the possible number of document clusters from the data range (Rl,Rh).

2) Set k = Rl.

3) Cluster the document corpus into k clusters.

4) Compute between each pair of the results, and take the average on all the .

5) If k < Rh, k = k + 1, go to Step 3; otherwise, go to Step 6.

6) Select the k which yields the largest average .

9Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

1) GMM + EM algorithm

2)

3)

Experimental Result - Document Clustering Evaluation

ABC+CNN-01-13-18-32-48-70-71-77-86[ 新聞機構 - 新聞事件類別 - 報導次數 ]

10Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experimental Result - Model Selection Evaluation

Compared with the BIC-based model selection method

11Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Conclusion

To accurately cluster the given document corpus by using the GMM Model together with EM algorithm.

The model selection capability has been achieved by guessing a value C for the number of clusters N.

12Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Personal Opinions

Advantage

1) high accuracy of document clustering

2) the model selection capability

Drawback …

Application …