Incorporating User Provided Constraints into Document Clustering Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi Department of Computer Science Wayne State University Detroit, MI48202 {chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu


Incorporating User Provided Constraints into Document Clustering

Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi

Department of Computer Science

Wayne State University

Detroit, MI 48202

{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu


Outline

• Introduction

• Overview of related work

• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering

• Theoretical result for SS-NMF

• Experiments and results

• Conclusion


What is clustering?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized


Document Clustering

• Grouping of text documents into meaningful clusters in an unsupervised manner.

[Figure: example document clusters labeled Government, Science, and Arts]


Unsupervised Clustering Example

[Figure: scatter plot of unlabeled data points]


Semi-supervised clustering: problem definition

• Input:
  – A set of unlabeled objects
  – A small amount of domain knowledge (labels or pairwise constraints)

• Output:
  – A partitioning of the objects into k clusters

• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge


Semi-Supervised Clustering

• Depending on the domain knowledge given:
  – Users provide class labels (seeded points) a priori for some of the documents
  – Users know which few documents are related (must-link) or unrelated (cannot-link)

[Figure: examples of seeded points, must-link pairs, and cannot-link pairs]
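The two kinds of domain knowledge are related: a set of seeded labels implies a set of pairwise constraints. The following is a minimal Python sketch of that conversion. The function name and the choice to represent seeds as a mapping from document index to class label are illustrative assumptions, not something taken from the slides.

```python
from itertools import combinations

def constraints_from_seeds(seed_labels):
    """Derive must-link / cannot-link pairs from seeded labels.

    seed_labels: dict mapping a document index to its class label
    (hypothetical layout chosen for this sketch).
    """
    must_link, cannot_link = [], []
    for (i, yi), (j, yj) in combinations(sorted(seed_labels.items()), 2):
        if yi == yj:
            must_link.append((i, j))     # same label: the pair belongs together
        else:
            cannot_link.append((i, j))   # different labels: the pair must be separated
    return must_link, cannot_link

seeds = {0: "Government", 3: "Government", 5: "Science"}
ml, cl = constraints_from_seeds(seeds)
print("must-link:", ml)
print("cannot-link:", cl)
```

The reverse direction does not hold in general: pairwise constraints say which documents go together, but not which class they belong to.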


Why semi-supervised clustering?

• Large amounts of unlabeled data exist
  – More is being produced all the time

• Labels are expensive to generate
  – Usually requires human intervention

• Use human input to provide labels for some of the data
  – Improve existing naive clustering methods
  – Use labeled data to guide clustering of unlabeled data
  – End result is a better clustering of data

• Potential applications
  – Document/word categorization
  – Image categorization
  – Bioinformatics (gene/protein clustering)


Outline

• Introduction

• Overview of related work

• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering

• Theoretical work for SS-NMF

• Experiments and results

• Conclusion


Clustering Algorithm

• Document hierarchical clustering
  – Bottom-up, agglomerative
  – Top-down, divisive

• Document partitioning (flat clustering)
  – K-means
  – Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.

• Document clustering based on graph model


Semi-supervised Clustering Algorithm

• Semi-supervised clustering with labels (partial label information is given):
  – SS-Seeded-Kmeans (Sugato Basu, et al. ICML 2002)
  – SS-Constraint-Kmeans (Sugato Basu, et al. ICML 2002)

• Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given):
  – SS-COP-Kmeans (Wagstaff et al. ICML 2001)
  – SS-HMRF-Kmeans (Sugato Basu, et al. ACM SIGKDD 2004)
  – SS-Kernel-Kmeans (Brian Kulis, et al. ICML 2005)
  – SS-Spectral-Normalized-Cuts (X. Ji, et al. ACM SIGIR 2006)


Overview of K-means Clustering

• K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.

• Objective function: locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

$$\sum_{h=1}^{k} \sum_{x_i \in f_h} \|x_i - \mu_h\|^2$$

where $\mu_h$ is the center of cluster $f_h$.

• Algorithm: initialize k cluster centers randomly, then repeat until convergence:
  – Cluster assignment step: assign each data point $x_i$ to the cluster $f_h$ whose center is closest to $x_i$
  – Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster
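The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration, assuming squared Euclidean distance and initial centers drawn at random from the data points. The function name, the stopping test, and the toy data are choices made for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize the k centers with distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Cluster assignment step: nearest center by squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points assigned to each cluster.
        new_centers = centers.copy()
        for h in range(k):
            members = X[labels == h]
            if len(members):
                new_centers[h] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sum of squared distances to the assigned centers (the objective).
    objective = ((X - centers[labels]) ** 2).sum()
    return labels, centers, objective

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers, objective = kmeans(X, k=2)
print(labels, objective)
```

Because the objective is only locally minimized, different initializations can give different partitions on less well-separated data.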


Semi-supervised Kernel K-means (SS-KK) [Brian Kulis, et al. ICML 2005]

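• Semi-supervised kernel k-means objective:

$$J_{SS\text{-}KK} = \sum_{h=1}^{k} \sum_{i \in X_h} \|\phi(d_i) - \mu_h\|^2 - \sum_{(d_i,d_j) \in C_{ML}\ \text{s.t.}\ y_i = y_j} w_{ij} + \sum_{(d_i,d_j) \in C_{CL}\ \text{s.t.}\ y_i = y_j} w_{ij}$$

where $\phi$ is the kernel mapping function, $\mu_h$ is the centroid of cluster $X_h$, $y_i$ is the cluster label of $d_i$, and $w_{ij}$ is the cost of violating the constraint between two points.

  – First term: kernel k-means objective function
  – Second term: reward function for satisfying must-link constraints
  – Third term: penalty function for violating cannot-link constraints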


Overview of Spectral Clustering

• Spectral clustering is a graph-theoretic clustering algorithm

Weighted graph G = (V, E, A)

Minimize between-cluster similarities (weights: $A_{ij}$)


Spectral Normalized Cuts

• Minimize the similarity between the two clusters:

Balance weights:

Cluster indicator:

• Graph partition becomes:

• Solution is eigenvector of:
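The formulas on this slide did not survive extraction, so the following sketch falls back on the standard two-way normalized-cut relaxation rather than the slide's own notation. It assumes the partition is read off the sign of the second-smallest generalized eigenvector of $(D - A)q = \lambda D q$, where D is the diagonal degree matrix of A. The function name and toy matrix are illustrative.

```python
import numpy as np

def two_way_normalized_cut(A):
    """Split a weighted graph in two using the relaxed normalized cut."""
    d = A.sum(axis=1)                      # node degrees (balance weights)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian D^{-1/2} (D - A) D^{-1/2}.
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    # Map the second-smallest eigenvector back to the generalized problem.
    q = d_inv_sqrt * eigvecs[:, 1]
    return (q > 0).astype(int)

# Toy similarity matrix: two blocks with weak between-block similarity.
A = np.full((6, 6), 0.05)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
labels = two_way_normalized_cut(A)
print(labels)
```

Which block receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful.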


Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji, et al. ACM SIGIR 2006]

• Semi-supervised spectral learning algorithm:
  – First term: spectral normalized cut objective function
  – Second term: reward function for satisfying must-link constraints
  – Third term: penalty function for violating cannot-link constraints


Outline

• Introduction

• Related work

• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
  – NMF review
  – Model formulation and algorithm derivation

• Theoretical result for SS-NMF

• Experiments and results

• Conclusion


Non-negative Matrix Factorization (NMF)

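• NMF decomposes a matrix into two parts (D. Lee et al., Nature 1999):

$$X \approx F G^T, \qquad \min \|X - F G^T\|^2$$

• Symmetric NMF for clustering (C. Ding et al. SIAM ICDM 2005):

$$A \approx G S G^T, \qquad \min \|A - G S G^T\|^2$$

Example:

$$
\begin{bmatrix}
0.3172 & 0.3148 & 0.2568 & 0.2640 & 0.2650 \\
0.3148 & 0.3244 & 0.2055 & 0.2090 & 0.2038 \\
0.2568 & 0.2055 & 0.7202 & 0.7411 & 0.8311 \\
0.2640 & 0.2090 & 0.7411 & 0.7822 & 0.8749 \\
0.2650 & 0.2038 & 0.8311 & 0.8749 & 1.0000
\end{bmatrix}
\approx
\begin{bmatrix}
0.0348 & 0.5476 \\
0.0005 & 0.5355 \\
0.5256 & 0.3698 \\
0.5538 & 0.3765 \\
0.6449 & 0.3672
\end{bmatrix}
\times
\begin{bmatrix}
2.0402 & 0 \\
0 & 1.0735
\end{bmatrix}
\times
\begin{bmatrix}
0.0348 & 0.0005 & 0.5256 & 0.5538 & 0.6449 \\
0.5476 & 0.5355 & 0.3698 & 0.3765 & 0.3672
\end{bmatrix}
$$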
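The numeric example on this slide can be checked directly. The sketch below assumes the values read from the slide are a 5×5 similarity matrix A, a 5×2 cluster indicator G, and a 2×2 diagonal matrix S. It multiplies the factors back together and assigns each document to the column of G with the largest entry.

```python
import numpy as np

# Values as read from the slide's example (A ~ G S G^T).
A = np.array([
    [0.3172, 0.3148, 0.2568, 0.2640, 0.2650],
    [0.3148, 0.3244, 0.2055, 0.2090, 0.2038],
    [0.2568, 0.2055, 0.7202, 0.7411, 0.8311],
    [0.2640, 0.2090, 0.7411, 0.7822, 0.8749],
    [0.2650, 0.2038, 0.8311, 0.8749, 1.0000],
])
G = np.array([
    [0.0348, 0.5476],
    [0.0005, 0.5355],
    [0.5256, 0.3698],
    [0.5538, 0.3765],
    [0.6449, 0.3672],
])
S = np.diag([2.0402, 1.0735])

approx = G @ S @ G.T                  # reconstruction of A from the factors
max_err = np.abs(A - approx).max()    # largest entry-wise deviation
labels = G.argmax(axis=1)             # cluster = column with the largest entry

print("max |A - G S G^T| =", round(float(max_err), 4))
print("cluster labels:", labels)
```

The reconstruction is approximate rather than exact, and the labels put the first two documents in one cluster and the last three in the other, which matches the block structure visible in A.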


SS-NMF

• Incorporate prior knowledge into an NMF-based framework for document clustering.

• Users provide pairwise constraints:
  – Must-link constraints $C_{ML}$: $(d_i, d_j) \in C_{ML}$ means the two documents $d_i$ and $d_j$ must belong to the same cluster.
  – Cannot-link constraints $C_{CL}$: $(d_i, d_j) \in C_{CL}$ means the two documents $d_i$ and $d_j$ must belong to different clusters.

• Constraints are defined by an associated violation cost matrix W:
  – $W_{reward}$: cost of violating the constraint between documents $d_i$ and $d_j$ if a constraint $(d_i, d_j) \in C_{ML}$ exists.
  – $W_{penalty}$: cost of violating the constraint between documents $d_i$ and $d_j$ if a constraint $(d_i, d_j) \in C_{CL}$ exists.


SS-NMF Algorithm

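• Define the objective function of SS-NMF:

$$J_{SS\text{-}NMF} = \min_{S \ge 0,\ G \ge 0} \|\tilde{A} - G S G^T\|^2$$

where

$$\tilde{A} = A + W_{reward} - W_{penalty}$$

$$W_{reward} = \{ w_{ij} \mid (d_i, d_j) \in C_{ML}\ \text{s.t.}\ y_i = y_j \}$$

$$W_{penalty} = \{ w_{ij} \mid (d_i, d_j) \in C_{CL}\ \text{s.t.}\ y_i = y_j \}$$

and $y_i$ is the cluster label of $d_i$.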
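A small sketch of how the constrained similarity matrix might be assembled from a similarity matrix and two constraint lists. It assumes a single uniform weight w for every constraint and symmetric updates; both are simplifications chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def constrained_similarity(A, must_link, cannot_link, w=1.0):
    """Return A + W_reward - W_penalty for the given pairwise constraints."""
    A_tilde = np.asarray(A, dtype=float).copy()
    for i, j in must_link:        # reward: raise similarity of must-link pairs
        A_tilde[i, j] += w
        A_tilde[j, i] += w
    for i, j in cannot_link:      # penalty: lower similarity of cannot-link pairs
        A_tilde[i, j] -= w
        A_tilde[j, i] -= w
    return A_tilde

A = np.full((3, 3), 0.5)
A_tilde = constrained_similarity(A, must_link=[(0, 1)], cannot_link=[(0, 2)], w=0.2)
print(A_tilde)
```

With a large enough penalty the adjusted matrix can contain negative entries, which a non-negative factorization has to handle in some way (for example by clipping).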


Summary of SS-NMF Algorithm
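The algorithm box for this slide is not present in the extracted text. As a stand-in, here is a hedged sketch of an iterative solver for $\min \|\tilde{A} - G S G^T\|^2$ with $G, S \ge 0$. It assumes square-root multiplicative updates of the kind commonly used for non-negative tri-factorization; these may differ from the authors' exact update rules. The function name, the initialization, the clipping of negative entries, and the toy data are all choices made for this sketch.

```python
import numpy as np

def ss_nmf(A_tilde, k, n_iter=300, seed=0, eps=1e-9):
    """Sketch: factor A_tilde ~ G S G^T with non-negative G (n x k) and S (k x k)."""
    rng = np.random.default_rng(seed)
    # Assumption: clip negative entries (possible after cannot-link penalties)
    # so the multiplicative updates keep G and S non-negative.
    A_tilde = np.maximum(np.asarray(A_tilde, dtype=float), 0.0)
    n = A_tilde.shape[0]
    G = rng.uniform(0.1, 1.0, size=(n, k))
    S = np.eye(k) + 0.1
    objective = [np.linalg.norm(A_tilde - G @ S @ G.T) ** 2]
    for _ in range(n_iter):
        AGS = A_tilde @ G @ S
        G = G * np.sqrt(AGS / (G @ (G.T @ AGS) + eps))
        GtG = G.T @ G
        S = S * np.sqrt((G.T @ A_tilde @ G) / (GtG @ S @ GtG + eps))
        objective.append(np.linalg.norm(A_tilde - G @ S @ G.T) ** 2)
    return G, S, objective

# Toy data: two blocks of three documents, one must-link and one cannot-link pair.
A = np.full((6, 6), 0.1)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
A_tilde = A.copy()
A_tilde[0, 1] += 0.5; A_tilde[1, 0] += 0.5   # must-link (0, 1)
A_tilde[2, 3] -= 0.5; A_tilde[3, 2] -= 0.5   # cannot-link (2, 3)

G, S, objective = ss_nmf(A_tilde, k=2)
labels = G.argmax(axis=1)                    # axis with the largest projection
print(labels, objective[0], objective[-1])
```

Stopping after a fixed number of iterations still yields a usable G, and each step uses only matrix products.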


Outline

• Introduction

• Overview of related work

• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering

• Theoretical result for SS-NMF

• Experiments and results

• Conclusion


Algorithm Correctness and Convergence

Based on constrained optimization theory and an auxiliary function, we can prove the following for SS-NMF:

1. Correctness: the solution converges to a local minimum

2. Convergence: the iterative algorithm converges

(Details in papers [1], [2])

[1] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)

[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, to appear, 2008.


SS-NMF: General Framework for Semi-supervised Clustering

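$$J_{SS\text{-}KK} = \sum_{h=1}^{k} \sum_{i \in X_h} \|\phi(d_i) - \mu_h\|^2 - \sum_{(d_i,d_j) \in C_{ML}\ \text{s.t.}\ y_i = y_j} w_{ij} + \sum_{(d_i,d_j) \in C_{CL}\ \text{s.t.}\ y_i = y_j} w_{ij}$$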

Proof: (1)

(2)

(3)

Orthogonal Symmetric Semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)!


Advantages of SS-NMF

Clustering Indicator

• SS-KK: hard clustering; exactly orthogonal

• SS-SNC: the derived latent semantic space is orthogonal; no direct relationship between the singular vectors and the clusters

• SS-NMF: soft clustering; maps the documents into a non-negative latent semantic space which may not be orthogonal; the cluster label can be determined by the axis with the largest projection value

Time Complexity

• SS-KK: iterative algorithm

• SS-SNC: requires solving a computationally expensive constrained eigen-decomposition

• SS-NMF: iterative algorithm that can return a partial answer at intermediate stages of the solution by specifying a fixed number of iterations; simple basic matrix computation that is easily deployed over a distributed computing environment when dealing with large document collections


Outline

• Introduction

• Overview of related work

• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering

• Theoretical result for SS-NMF

• Experiments and results
  – Artificial toy data
  – Real data

• Conclusion


Experiments on Toy Data

1. Artificial toy data: consisting of two natural clusters


Results on Toy Data (SS-KK and SS-NMF)

Table (right): difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data

• Hard clustering: each object belongs to a single cluster

• Soft clustering: each object is probabilistically assigned to clusters


Results on Toy Data (SS-SNC and SS-NMF)

(a) Data distribution in the SS-SNC subspace of the first two singular vectors. There is no relationship between the axes and the clusters.

(b) Data distribution in the SS-NMF subspace of the two column vectors of G. The data points from the two clusters are distributed along the two axes.


Time Complexity Analysis

Figure (above): computational speed comparison for SS-KK, SS-SNC, and SS-NMF ($O(tkn^2)$)


Experiments on Text Data

2. Summary of data sets[1] used in the experiments.

[1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz

• Evaluation metric:

$$\frac{1}{n} \sum_{i=1}^{n} \delta(y_i, \hat{y}_i)$$

where n is the total number of documents in the experiment, $\delta$ is the delta function that equals one if $\hat{y}_i = y_i$, $\hat{y}_i$ is the estimated label, and $y_i$ is the ground truth.
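A short sketch of how such an accuracy score might be computed. The formula is not in the extracted slide, so this assumes accuracy is the fraction of documents whose estimated label matches the ground truth. Because cluster ids are arbitrary, it also assumes the predicted ids are first mapped to true ids by the best one-to-one mapping, found here by brute force (practical only for small k). The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, k):
    """Fraction of matching labels under the best one-to-one cluster-id mapping.

    Assumes both label lists use the ids 0 .. k-1.
    """
    n = len(y_true)
    best_hits = 0
    for perm in permutations(range(k)):   # perm[p] = true id assigned to predicted id p
        hits = sum(1 for t, p in zip(y_true, y_pred) if perm[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / n

y_true = [0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0]
acc = clustering_accuracy(y_true, y_pred, k=2)
print(acc)
```

For larger k, an assignment solver would replace the loop over permutations.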


Results on Text Data (Compare with Unsupervised Clustering)

• (1) Comparison with unsupervised clustering approaches:

Note: SS-NMF adds 3% constraints


Results on Text Data (Before Clustering and After Clustering)

(a) Typical document-document matrix before clustering

(b) Document-document similarity matrix after clustering with SS-NMF (k=2)

(c) Document-document similarity matrix after clustering with SS-NMF (k=5)


Results on Text Data (Clustering with Different Constraints)

Table (left): comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents


Results on Text Data (Compare with Semi-supervised Clustering)

• (2) Comparison with SS-KK and SS-SNC

(a) Graft-Phos (b) England-Heart (c) Interest-Trade


Results on Text Data (Compare with Semi-supervised Clustering)

• Comparison with SS-KK and SS-SNC (Fbis2, Fbis3, Fbis4, Fbis5)


Experiments on Image Data

Figure (above): sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses).

3. Image data sets[2] used in the experiments.

[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html


Results on Image Data (Compare with Unsupervised Clustering)

• (1) Comparison with unsupervised clustering approaches:

Table (above): comparison of image clustering accuracy between KK, SNC, NMF, and SS-NMF with only 3% pairwise constraints on the images. It shows that SS-NMF consistently outperforms other well-established unsupervised image clustering methods.


Results on Image Data (Compare with Semi-supervised Clustering)

• (2) Comparison with SS-KK and SS-SNC:

Figure (left): comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.


Results on Image Data (Compare with Semi-supervised Clustering)

• (2) Comparison with SS-KK and SS-SNC:

Figure (left): comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.


Outline

• Introduction

• Related work

• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering

• Theoretical result for SS-NMF

• Experiments and results

• Conclusion


Conclusion

• Semi-supervised clustering:
  – Has many real-world applications
  – Outperforms the traditional clustering algorithms

• The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.

• Many existing semi-supervised clustering algorithms can be extended to achieve multi-type object co-clustering tasks.


Reference

[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, “Deriving Semantics for Image Clustering from Accumulated User Feedbacks”, Proc. of ACM Multimedia, Germany, 2007.

[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)

[3] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear 2008.
