
Spectral Clustering

Guokun Lai

2016/10


Organization

• Graph Cut

• Fundamental Limitations of Spectral Clustering

• Ng 2002 paper (if we have time)


Notation

• We define an undirected weighted graph G(V, E), where V is the node set of G and E is its edge set. The adjacency matrix is W_ij = E(i, j), with W_ij ≥ 0.

• The degree matrix D ∈ R^{n×n} is a diagonal matrix with D_ii = ∑_{j=1}^{n} W_ij.

• The Laplacian matrix L ∈ R^{n×n} is L = D − W.

• Indicator vector of a cluster: the indicator vector I_C of a cluster C is

I_{C,i} = 1 if v_i ∈ C, 0 otherwise        (1)


Graph Cut

The intuition of clustering is to separate points into different groups according to their similarities. If we try to separate the node set V into two disjoint sets A and B, we define

Cut(A, B) = ∑_{i∈A, j∈B} w_ij

If we split the node set into K disjoint sets, then

Cut(A_1, ···, A_K) = ∑_{i=1}^{K} Cut(A_i, Ā_i)

where Ā_i is the complement set of A_i.
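As a concrete illustration, here is a minimal numpy sketch of these two definitions (the weight matrix W and the index sets are hypothetical inputs, not from any paper):

```python
import numpy as np

def cut(W, A, B):
    """Cut(A, B): total weight of edges running between node sets A and B."""
    return W[np.ix_(A, B)].sum()

def multiway_cut(W, clusters):
    """Cut(A_1, ..., A_K) = sum_i Cut(A_i, complement of A_i)."""
    n = W.shape[0]
    return sum(cut(W, A, np.setdiff1d(np.arange(n), A)) for A in clusters)
```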


Defect of Graph Cut

The simplest idea for clustering the node set V is to find the partition that minimizes the Graph Cut function. But this usually leads to solutions in which some subsets contain only a few nodes.


Normalized Cut

To overcome this defect of the Graph Cut, Shi and Malik proposed a new cost function that regularizes the size of the subsets. First, we define Vol(A) = ∑_{i∈A, j∈V} w(i, j), and we have

NCut(A, B) = cut(A, B)/Vol(A) + cut(A, B)/Vol(B)
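A matching numpy sketch of Vol and NCut, under the same hypothetical inputs as before:

```python
import numpy as np

def vol(W, A):
    """Vol(A) = sum over i in A, j in V of w_ij."""
    return W[A, :].sum()

def ncut(W, A, B):
    c = W[np.ix_(A, B)].sum()
    return c / vol(W, A) + c / vol(W, B)
```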


Relation between NCut and Spectral Clustering

Given a vertex subset A_i ⊆ V, we define the vector f_i = I_{A_i} · (1/√Vol(A_i)). Writing F = [f_1 ··· f_k], we can state the optimization problem as

min_{A_i}  NCut = ∑_{i=1}^{k} f_i^T L f_i = Tr(F^T L F)

s.t.  f_i = I_{A_i} · (1/√Vol(A_i)),  F^T D F = I        (2)
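As a numerical sanity check on this identity, one can verify that f^T L f = Cut(A, Ā)/Vol(A) and f^T D f = 1 for such an f; the graph and cluster below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
W = (W + W.T) / 2                       # symmetric weights
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))
L = D - W

A = np.arange(4)                        # a hypothetical cluster
Ac = np.arange(4, n)                    # its complement
vol_A = W[A, :].sum()
f = np.zeros(n)
f[A] = 1.0 / np.sqrt(vol_A)             # f = I_A / sqrt(Vol(A))

cut_A = W[np.ix_(A, Ac)].sum()
print(np.isclose(f @ L @ f, cut_A / vol_A))   # True
print(np.isclose(f @ D @ f, 1.0))             # matches F^T D F = I
```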


Optimization

Because of the constraint f_i = I_{A_i} · (1/√Vol(A_i)), the optimization problem is NP-hard. So we relax this constraint and allow f_i to range over all of R^n. The optimization problem then becomes

min_F  Tr(F^T L F)

s.t.  F^T D F = I        (3)

The solution F is given by the k smallest eigenvectors of D^{−1}L. Based on F, we recover the clusters A_i with the k-means algorithm.
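A sketch of the whole relaxed procedure, assuming scipy and scikit-learn are available (the function name is ours):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering(W, k):
    D = np.diag(W.sum(axis=1))
    L = D - W
    # The generalized problem L f = lambda D f is equivalent to the
    # eigenproblem of D^{-1} L; take the k smallest eigenvectors.
    _, vecs = eigh(L, D)
    F = vecs[:, :k]
    # Recover the clusters A_i from the rows of F via k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)
```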


Unnormalized Laplacian Matrix

Similar to the above approach, we can prove that the eigenvectors of the unnormalized Laplacian matrix are the relaxed solution for

RatioCut(A, B) = cut(A, B)/|A| + cut(A, B)/|B|

We can set f_i = I_{A_i} · (1/√|A_i|) and get the relaxed optimization problem

min_F  Tr(F^T L F)

s.t.  F^T F = I        (4)
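The only implementation change from the normalized version is that the constraint F^T F = I turns the generalized eigenproblem into an ordinary one; a minimal sketch:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, k):
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = eigh(L)                   # ordinary eigenproblem (F^T F = I)
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs[:, :k])
```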


Approximation

The solution from the spectral method is only an approximation of the Normalized Cut objective, and there is no bound on the gap between them. We can easily construct cases in which the solution of the relaxed problem is very different from that of the original problem.


Experimental Results of the Shi Paper


Organization

• Graph Cut

• Fundamental Limitations of Spectral Clustering

• Ng 2002 paper (if we have time)


Fundamental Limitations of Spectral Clustering

As mentioned above, spectral clustering approximately solves the Normalized Graph Cut objective. But is the Normalized Graph Cut a good criterion in all situations?


Limitation of NCut

The NCut function is more likely to capture the global structure of the graph. But sometimes we may want to extract local features of the graph instead.

For example, the Normalized Graph Cut cannot separate a Gaussian cluster from an adjacent band (the figure on the original slide).


Limitation of Spectral Clustering

Next we analyze the spectral method from the viewpoint of a random walk process. Define the Markov transition matrix M = D^{−1}W, with eigenvalues λ_i and eigenvectors v_i. The random walk on the graph converges to a unique equilibrium distribution π_s. We can then relate the eigenvectors to the "diffusion distance" between points:

∑_j λ_j^{2t} (v_j(x) − v_j(y))² = ||p(z, t|x) − p(z, t|y)||²_{L²(1/π_s)}

So the spectral method captures the dominant patterns of the random walk over the whole graph.
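A minimal sketch of this diffusion distance, using the standard trick that M = D^{−1}W shares its eigenvalues with the symmetric matrix D^{−1/2} W D^{−1/2} (assuming a connected graph with positive degrees, and integer t):

```python
import numpy as np
from scipy.linalg import eigh

def diffusion_distances(W, t):
    """Pairwise diffusion distances after t random-walk steps."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))     # D^{-1/2} W D^{-1/2}, symmetric
    lam, U = eigh(S)                    # same eigenvalues as M = D^{-1} W
    V = U / np.sqrt(d)[:, None]         # columns are eigenvectors v_j of M
    coords = V * lam**t                 # diffusion-map coordinates
    diff = coords[:, None, :] - coords[None, :, :]
    # dist(x, y)^2 = sum_j lam_j^{2t} (v_j(x) - v_j(y))^2
    return np.sqrt((diff**2).sum(axis=-1))
```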


Limitation of Spectral Clustering

But this method fails in situations where the scales of the clusters are very different.


Self-Tuning Spectral Clustering

One way to handle the above case is to accelerate the random walk in low-density areas. We define the affinity between nodes as

A_ij = exp(−d(v_i, v_j)² / (σ_i σ_j))

where σ_i = d(v_i, v_k) and v_k is the k-th nearest neighbor of v_i.
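A minimal sketch of this affinity; the default k = 7 is only an illustrative choice:

```python
import numpy as np
from scipy.spatial.distance import cdist

def self_tuning_affinity(X, k=7):
    """A_ij = exp(-d(v_i, v_j)^2 / (sigma_i * sigma_j))."""
    D = cdist(X, X)                         # pairwise distances d(v_i, v_j)
    sigma = np.sort(D, axis=1)[:, k]        # distance to k-th nearest neighbor
    A = np.exp(-D**2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A
```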


Result of Self-Tuning Spectral Clustering


Failure case


Another solution

The paper proposes to split the graph into two subsets recursively, with a stopping criterion based on the relaxation time of the graph, τ_V = 1/(1 − λ_2).

• If the sizes of the two subsets after splitting are comparable, we expect τ_V >> τ_1 + τ_2.

• Otherwise, we expect max(τ_1, τ_2) >> min(τ_1, τ_2).

If the partition satisfies either condition, we accept the split and continue splitting the subsets; if not, we stop. The paper does not, however, address how to handle the K-cluster problem.
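A sketch of the relaxation-time statistic τ = 1/(1 − λ_2) that the criterion is built on (the accept/reject comparisons above would be applied on top of it):

```python
import numpy as np
from scipy.linalg import eigh

def relaxation_time(W):
    """tau = 1 / (1 - lambda_2) for the random walk M = D^{-1} W."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))       # symmetric, same spectrum as M
    lam = eigh(S, eigvals_only=True)      # ascending order
    return 1.0 / (1.0 - lam[-2])          # lambda_2 = second largest
```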


Tong Zhang 2007 paper

This paper gives an upper bound on the expected error for semi-supervised learning tasks on graphs. Due to the limited time of the presentation, I will only introduce one interesting conclusion of the paper.


S-Normalized Laplacian Matrix

We define the S-normalized Laplacian matrix as

L_S = S^{−1/2} L S^{−1/2}

where S is a diagonal matrix. According to the analysis in this paper, the best choice is S_ii = |C_j|, where |C_j| is the size of the cluster j that contains node i. So this is an approach that aims at the different-scale cluster problem that plain spectral clustering cannot deal with. We can see it is similar to self-tuning spectral clustering: it renormalizes the adjacency matrix as

W̃_ij = W_ij / (√|C_i| √|C_j|)
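Constructing L_S is a one-liner once the diagonal of S is known; a sketch with a hypothetical vector s of diagonal entries:

```python
import numpy as np

def s_normalized_laplacian(W, s):
    """L_S = S^{-1/2} L S^{-1/2}, with s the diagonal of S (s_i = |C_j|)."""
    L = np.diag(W.sum(axis=1)) - W
    r = 1.0 / np.sqrt(s)
    return r[:, None] * L * r[None, :]
```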


S-Normalized Laplacian Matrix

But we do not know |C_j|, so the author proposes a method to compute it approximately. Define K^{−1} = αI + L_S, α ∈ R. In the ideal case, where the graph consists of q disjoint connected components, we can prove that as α → 0,

αK = ∑_{j=1}^{q} (1/|C_j|) v_j v_j^T + O(α)

where v_j is the indicator vector of cluster j. So for a small α we can assume K_ii ∝ 1/|C_j| for i ∈ C_j, and therefore set S_ii ∝ 1/K_ii.
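A sketch of this estimate. Since L_S itself requires the unknown cluster sizes, the unnormalized L is used as a stand-in here (in the ideal disconnected case its null space is also spanned by the cluster indicators), so treat this as an assumption of the sketch:

```python
import numpy as np

def estimate_cluster_size_diag(W, alpha=1e-3):
    """Estimate S_ii ∝ 1/K_ii from K = (alpha*I + L)^{-1},
    with the unnormalized L standing in for L_S."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    K = np.linalg.inv(alpha * np.eye(n) + L)
    return 1.0 / np.diag(K)       # proportional to |C_j| for small alpha
```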


Comparison


Organization

• Graph Cut

• Fundamental Limitations of Spectral Clustering

• Ng 2002 paper (if we have time)


Ng 2002 paper

This paper analyzes the spectral clustering problem using matrix perturbation theory. It obtains an error bound for the spectral clustering algorithm under several assumptions.


Algorithm

• Define the weighted adjacency matrix W and construct the Laplacian matrix L = D^{−1/2} W D^{−1/2}.

• Find x_1, ···, x_k, the k largest eigenvectors of L, and form the matrix X = [x_1 ··· x_k] ∈ R^{n×k}.

• Normalize every row of X to unit length: Y_ij = X_ij / (∑_j X_ij²)^{1/2}.

• Treating each row of Y as a point in R^k, cluster the rows into k clusters via k-means (sketched below).
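A sketch of these four steps, assuming scipy and scikit-learn (the function name is ours):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def njw_spectral_clustering(W, k):
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))        # L = D^{-1/2} W D^{-1/2}
    _, vecs = eigh(L)
    X = vecs[:, -k:]                        # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```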


Ideal Case

Assume the graph G contains K clusters and does not contain any cross-cluster edges. In this case, the Laplacian matrix has exactly K eigenvectors with eigenvalue 1.


Y Matrix of Ideal Case

After running the algorithm on this graph, the Y matrix (shown as a figure on the original slide) has the property that the rows belonging to cluster i are all equal to the i-th row of some rotation matrix R. Hence, the rows of Y naturally cluster into 3 groups.


The general case

In real-world data we do have cross-cluster edges. So the author analyzes the influence of the cross-cluster edges on the Y matrix using matrix perturbation theory.


The general case

Assumption 1

There exists δ > 0 so that, for each cluster i = 1, ···, k, its second largest eigenvalue satisfies λ_2^{(i)} ≤ 1 − δ.

Assumption 2

There is some fixed ε_1 > 0, so that for every i_1, i_2 ∈ {1, ···, k}, i_1 ≠ i_2, we have

∑_{j∈S_{i_1}} ∑_{k∈S_{i_2}} W_jk² / (d_j d_k) ≤ ε_1

where d_j is the degree of node j within its own cluster.

The intuition of this inequality is to limit the weight of the cross-cluster edges relative to the weight of the intra-cluster edges.


The general case

Assumption 3

There is some fixed ε_2 > 0, so that for every j ∈ S_i, we have

∑_{k∉S_i} W_jk² / d_j ≤ ε_2 (∑_{k,l∈S_i} W_kl² / (d_k d_l))^{−1/2}

The intuition of this inequality is likewise to limit the weight of the cross-cluster edges relative to the weight of the intra-cluster edges.

Assumption 4

There is some constant C > 0 so that for every i = 1, ···, k and j = 1, ···, n_i, we have

d_j^{(i)} ≥ (∑_{k=1}^{n_i} d_k^{(i)}) / (C n_i)

The intuition of this inequality is that no point in a cluster should be "too much less" connected than the other points in the same cluster.


The general case

If all of the assumptions hold, set ε = √(k(k−1)ε_1 + k ε_2²). If δ > (2 + √2)ε, then there exist k orthogonal vectors r_1, ···, r_k so that

(1/n) ∑_{i=1}^{k} ∑_{j=1}^{n_i} ||y_j^{(i)} − r_i||₂² ≤ 4C(4 + 2√k)² ε² / (δ − √2 ε)²


Liu’s 2016 paper

Motivation

• The original semi-supervised learning problem on a graph can be formalized as

min_f ∑_i ℓ(f_i, y_i) + f^T L f

• We can enrich the label propagation patterns via a spectral transformation, which is called ST-enhanced semi-supervised learning (a sketch of σ(L) follows below):

min_f ∑_i ℓ(f_i, y_i) + f^T σ(L) f
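A sketch of applying such a spectral transformation σ(L) through the eigendecomposition; the example transform at the end is hypothetical:

```python
import numpy as np
from scipy.linalg import eigh

def spectral_transform(L, sigma):
    """sigma(L) = sum_i sigma(lambda_i) * phi_i phi_i^T."""
    lam, Phi = eigh(L)
    return (Phi * sigma(lam)) @ Phi.T

# e.g. a hypothetical non-decreasing transform:
# L_st = spectral_transform(L, lambda x: x / (1.0 + x))
```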


Spectral Transform

We can write L = ∑_i λ_i φ_i φ_i^T and define θ_i = σ(λ_i)^{−1}, where σ(x) should be a non-decreasing function. Substituting this into the objective function,

min_f C(f; θ) = ∑_{i∈τ} ℓ(f_i, y_i) + γ ∑_{i=1}^{m} θ_i^{−1} ⟨φ_i, f⟩²

where θ_1 ≥ θ_2 ≥ ··· ≥ θ_m ≥ 0.


Joint optimization

We can jointly optimize the eigenvalue weights θ and the label vector f:

min_θ (min_f C(f; θ)) + τ ||θ||_1

We can prove that this function is convex in θ. The optimization process alternates: first, with θ fixed, solve the convex problem over f; after that, optimize θ over its domain.
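A schematic sketch of this alternating scheme, assuming squared loss for ℓ; the θ-step below is the closed-form minimizer of γc_i²/θ_i + τθ_i for fixed f and ignores the ordering constraint θ_1 ≥ ··· ≥ θ_m, so it is only illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def st_ssl_alternating(L, y_lab, labeled, n_iter=20, gamma=1.0, tau=0.1):
    _, Phi = eigh(L)                   # L = sum_i lambda_i phi_i phi_i^T
    m = Phi.shape[1]
    theta = np.ones(m)
    P = Phi[labeled, :]                # eigenvector rows at labeled points
    for _ in range(n_iter):
        # f-step: with f = Phi c we have <phi_i, f> = c_i, so C(f; theta)
        # is a weighted ridge problem with a closed-form solution.
        A = P.T @ P + gamma * np.diag(1.0 / theta)
        c = np.linalg.solve(A, P.T @ y_lab)
        # theta-step: exact minimizer of gamma*c_i^2/theta_i + tau*theta_i.
        theta = np.sqrt(gamma / tau) * np.abs(c) + 1e-12
    return Phi @ c                     # predicted labels f on all nodes
```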


Proof of convexity

We can rewrite the objective using the dual form of C(f; θ), denoted C*(u; θ):

min_θ (max_u C*(u; θ)) + τ ||θ||_1

where C*(u; θ) = −w(−u) − (1/4γ) ∑_i θ_i ⟨φ_i, u⟩², and −w(−u) is the conjugate function of the loss ℓ. For each fixed u, C*(u; θ) is affine in θ, so the objective is the pointwise maximum of a family of convex functions of θ, and hence is still convex in θ.
