
SoF: Soft-Cluster Matrix Factorization for Probabilistic Clustering

Han Zhao†, Pascal Poupart†, Yongfeng Zhang‡ and Martin Lysy†
†{han.zhao, ppoupart, mlysy}@uwaterloo.ca, ‡[email protected]

†University of Waterloo, ‡Tsinghua University

Introduction

• Probabilistic clustering without making explicit assumptions on data density distributions.

• Axiomatic approach to define 4 properties that the probability of co-clustered pairs of points should satisfy.

• Establish a connection between probabilistic clustering and constrained symmetric Nonnegative Matrix Factorization (NMF).

• Sequential minimization algorithm using a penalty method to convert the constrained optimization problem into an unconstrained problem and solve it using gradient descent.

Figure 1: Synthetic experiments on 6 data sets. Different colors for different clusters found by SoF.


Figure 2: Entropy graph of probabilistic assignment. Brighter colors correspond to higher entropy and hence greater uncertainty in clustering.
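The entropy shown in Figure 2 is presumably the Shannon entropy of each point's soft assignment, $H_i = -\sum_{k=1}^{K} W_{ik} \log W_{ik}$, where $W$ is the soft-assignment matrix introduced below. A minimal sketch of this computation, assuming $W$ is available as a NumPy array whose rows sum to 1:

```python
import numpy as np

def assignment_entropy(W, eps=1e-12):
    """Shannon entropy of each point's soft cluster assignment.

    W : (N, K) matrix whose rows are probability distributions over clusters.
    """
    Wc = np.clip(W, eps, 1.0)              # guard against log(0) on hard assignments
    return -(Wc * np.log(Wc)).sum(axis=1)
```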

Soft-Cluster Matrix Factorization

Co-cluster probability $p_C(v_1, v_2)$ induced by a distance function $L : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ should satisfy the following 4 properties:

• Boundary property: $\forall v_1, v_2 \in \mathbb{R}^d$, if $L(v_1, v_2) = 0$ then $p_C(v_1, v_2) = 1$; if $L(v_1, v_2) \to \infty$ then $p_C(v_1, v_2) \to 0$.

• Symmetry property: $\forall v_1, v_2 \in \mathbb{R}^d$, $p_C(v_1, v_2) = p_C(v_2, v_1)$.

• Monotonicity property: $\forall v_1, v_2, v_3 \in \mathbb{R}^d$, $p_C(v_1, v_2) \le p_C(v_1, v_3) \Leftrightarrow L(v_1, v_2) \ge L(v_1, v_3)$.

• Tail property: $\forall v_1, v_2, v_3 \in \mathbb{R}^d$, $\left| \frac{\partial p_C(v_1, v_2)}{\partial L(v_1, v_2)} \right| \le \left| \frac{\partial p_C(v_1, v_3)}{\partial L(v_1, v_3)} \right| \Leftrightarrow L(v_1, v_2) \ge L(v_1, v_3)$.

Given a distance function $L$, there exists a family of co-cluster probability functions $p_C(v_1, v_2) = e^{-c L(v_1, v_2)}$ which satisfy the boundary, symmetry, monotonicity and tail properties simultaneously for any constant $c > 0$.

Empirical co-cluster probability: $P_{ij} \triangleq p_C(v_i, v_j) = e^{-L(v_i, v_j)}$.
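To see why this family works, note that $|\partial p_C / \partial L| = c\,e^{-cL}$ decays monotonically in $L$, which gives the tail property; boundary, symmetry and monotonicity follow directly from the form of $e^{-cL}$. A minimal sketch of building the empirical matrix $P$, assuming $L$ is squared Euclidean distance (the poster leaves the choice of $L$ open, so this particular $L$ is an illustrative assumption):

```python
import numpy as np

def cocluster_matrix(X, c=1.0):
    """Empirical co-cluster probabilities P_ij = exp(-c * L(v_i, v_j))."""
    sq_norms = (X ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances via the expansion
    # ||v_i - v_j||^2 = ||v_i||^2 + ||v_j||^2 - 2 <v_i, v_j>
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(D, 0.0, out=D)              # clip tiny negatives from round-off
    return np.exp(-c * D)
```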

True co-cluster probability: $\Pr(v_i \sim v_j) = \sum_{k=1}^{K} \Pr(c_i = k \mid v_i) \Pr(c_j = k \mid v_j) = p_{v_i}^T p_{v_j}$.

Let $W_{ij} = \Pr(c_i = j \mid v_i)$. The learning paradigm is

$$\min_{W} \; \|P - WW^T\|_F^2 \quad \text{subject to} \quad W \in \mathbb{R}_+^{N \times K}, \; W \mathbf{1}_K = \mathbf{1}_N,$$

which the penalty method converts into the unconstrained problem

$$\min_{W} \; \|P - WW^T\|_F^2 - \lambda_1 \sum_{ij} \min\{0, W_{ij}\} + \lambda_2 \|W \mathbf{1}_K - \mathbf{1}_N\|_2^2. \qquad (1)$$

Sequential minimization: for each fixed pair $(\lambda_1, \lambda_2)$, solve the unconstrained problem, then increase $(\lambda_1, \lambda_2)$ iteratively until convergence.
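A minimal sketch of this sequential penalty scheme, assuming plain fixed-step gradient descent for the inner solves; the step size, penalty growth factor, and iteration counts below are illustrative assumptions, not settings from the poster:

```python
import numpy as np

def penalized_grad(W, P, lam1, lam2):
    """Gradient of the penalized objective in Eq. (1)."""
    N, K = W.shape
    R = P - W @ W.T                          # symmetric residual
    grad = -4.0 * R @ W                      # d/dW of ||P - W W^T||_F^2
    grad += -lam1 * (W < 0)                  # d/dW of -lam1 * sum_ij min{0, W_ij}
    r = W @ np.ones(K) - np.ones(N)          # row-sum constraint residual
    grad += 2.0 * lam2 * np.outer(r, np.ones(K))
    return grad

def sof(P, K, n_outer=20, n_inner=200, lr=1e-3, growth=2.0, seed=0):
    """Sequential minimization: solve for fixed (lam1, lam2), then increase them."""
    rng = np.random.default_rng(seed)
    N = P.shape[0]
    W = rng.uniform(size=(N, K))
    W /= W.sum(axis=1, keepdims=True)        # start on the probability simplex
    lam1 = lam2 = 1.0
    for _ in range(n_outer):
        for _ in range(n_inner):
            W -= lr * penalized_grad(W, P, lam1, lam2)
        lam1 *= growth                       # tighten both penalties
        lam2 *= growth
    return W
```

The rows of the returned $W$ are the soft assignments $\Pr(c_i = k \mid v_i)$; a hard clustering, when needed, is simply `W.argmax(axis=1)`.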

             Kernel Kmeans          Spectral Clustering             SymNMF                    SoF
Objective    min ||K − WW^T||_F^2   min ||L − WW^T||_F^2            min ||A − WW^T||_F^2      min ||P − WW^T||_F^2
Property     K is S.P.D.            L is graph Laplacian, S.P.D.    A is similarity matrix    P is nonnegative, S.P.D.
Constraint   W^T W = I, W ≥ 0       W^T W = I                       W ≥ 0                     W ≥ 0, W1 = 1

Experiments

Purity
          BL    BT    GL    IRIS  ECO   IMG   DIG
K-means   0.76  0.43  0.56  0.87  0.80  0.70  0.71
S-means   0.76  0.43  0.57  0.92  0.76  0.63  0.71
GMM       0.77  0.46  0.52  0.89  0.81  0.74  0.70
SP-NC     0.78  0.42  0.38  0.86  0.81  0.30  0.11
SP-NJW    0.76  0.40  0.38  0.59  0.80  0.48  0.13
NMF       0.76  0.43  0.58  0.79  0.73  0.57  0.47
RNMF      0.76  0.28  0.52  0.84  0.70  0.56  0.45
GNMF      0.76  0.44  0.55  0.85  0.77  0.71  0.71
SymNMF    0.76  0.47  0.60  0.88  0.81  0.77  0.76
SoF       0.76  0.51  0.64  0.95  0.85  0.75  0.82

Rand Index
          BL    BT    GL    IRIS  ECO   IMG   DIG
K-means   0.60  0.71  0.69  0.86  0.81  0.81  0.91
S-means   0.52  0.76  0.69  0.92  0.78  0.81  0.91
GMM       0.57  0.76  0.64  0.88  0.85  0.83  0.91
SP-NC     0.64  0.74  0.43  0.85  0.81  0.32  0.11
SP-NJW    0.51  0.71  0.53  0.60  0.80  0.74  0.41
NMF       0.56  0.65  0.71  0.81  0.81  0.80  0.86
RNMF      0.59  0.38  0.65  0.85  0.78  0.79  0.86
GNMF      0.60  0.68  0.70  0.85  0.80  0.77  0.91
SymNMF    0.62  0.77  0.70  0.88  0.83  0.83  0.91
SoF       0.64  0.79  0.73  0.93  0.85  0.86  0.94

Accuracy
          BL    BT    GL    IRIS  ECO   IMG   DIG
K-means   0.73  0.34  0.51  0.86  0.60  0.63  0.68
S-means   0.61  0.39  0.52  0.90  0.56  0.61  0.67
GMM       0.66  0.43  0.46  0.86  0.68  0.69  0.67
SP-NC     0.73  0.41  0.37  0.85  0.61  0.28  0.11
SP-NJW    0.56  0.42  0.33  0.56  0.58  0.37  0.12
NMF       0.68  0.39  0.50  0.79  0.66  0.53  0.44
RNMF      0.71  0.27  0.51  0.84  0.59  0.52  0.41
GNMF      0.72  0.36  0.44  0.81  0.56  0.54  0.67
SymNMF    0.76  0.42  0.46  0.87  0.66  0.68  0.67
SoF       0.76  0.48  0.47  0.94  0.74  0.71  0.82

a: The algorithm that achieves the best score is highlighted in bold.
b: The algorithm that achieves the best score and is significantly better than the other algorithms under a Wilcoxon signed-rank test is highlighted in bold blue.

Data set   # Objects   # Attributes   # Classes
BL         748         4              2
BT         106         9              6
GL         214         9              6
IRIS       150         4              3
ECO        336         7              8
IMG        4435        36             6
DIG        10992       16             10

Analysis

• $P$ is nonnegative and S.P.D., hence an approximation to a completely positive matrix.

• The optimization problem is a constrained version of completely positive matrix factorization.

• The optimal solution is not unique, due to label switching; see the identity below.
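A one-line way to see the label-switching non-uniqueness: for any $K \times K$ permutation matrix $\Pi$,

$$(W\Pi)(W\Pi)^T = W \Pi \Pi^T W^T = W W^T,$$

and $W\Pi$ still satisfies $W\Pi \ge 0$ and $(W\Pi)\mathbf{1}_K = \mathbf{1}_N$, so permuting the cluster labels leaves both the objective and the constraints unchanged.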

Conclusion

• A novel probabilistic clustering algorithm derived from a set of properties characterizing the co-cluster probability in terms of pairwise distances.

• A relaxation of completely positive (C.P.) programming, revealing a close relationship between probabilistic clustering and symmetric NMF-based algorithms.

• A sequential minimization framework to find a local minimum solution.
