
Spectral Clustering

Guokun Lai

2016/10


Organization

- Graph Cut
- Fundamental Limitations of Spectral Clustering
- Ng 2002 paper (if we have time)


Notation

- We define an undirected weighted graph $G(V, E)$, where $V$ is the node set of $G$ and $E$ is its edge set. The adjacency matrix is $W_{ij} = E(i, j)$, with $W_{ij} \ge 0$.

- The degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{i,i} = \sum_{j=1}^{n} W_{i,j}$.

- The Laplacian matrix $L \in \mathbb{R}^{n \times n}$ is $L = D - W$.

- Indicator vector of a cluster: the indicator vector $I_C$ of a cluster $C$ is

$$I_{C,i} = \begin{cases} 1 & v_i \in C \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
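A minimal NumPy sketch of these definitions (the small adjacency matrix is illustrative, not from the slides):

```python
import numpy as np

# Illustrative symmetric weighted adjacency matrix with W_ij >= 0.
W = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.0],
              [0.2, 0.0, 0.0]])

D = np.diag(W.sum(axis=1))       # degree matrix: D_ii = sum_j W_ij
L = D - W                        # unnormalized Laplacian L = D - W

I_C = np.array([1.0, 1.0, 0.0])  # indicator vector of the cluster C = {v_1, v_2}
print(np.linalg.eigvalsh(L))     # L is positive semi-definite; its smallest eigenvalue is 0
```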


Graph Cut

The intuition of clustering is to separate points into different groups according to their similarities. If we split the node set $V$ into two disjoint sets $A$ and $B$, we define

$$\mathrm{Cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$$

If we split the node set into $K$ disjoint sets, then

$$\mathrm{Cut}(A_1, \cdots, A_K) = \sum_{i=1}^{K} \mathrm{Cut}(A_i, \bar{A}_i)$$

where $\bar{A}_i$ is the complement set of $A_i$.
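A small sketch of evaluating the cut value (the 4-node graph and the partition are illustrative, not from the slides):

```python
import numpy as np

def cut(W, A, B):
    """Cut(A, B): total weight of the edges going from node set A to node set B."""
    return W[np.ix_(sorted(A), sorted(B))].sum()

# Illustrative graph: nodes {0, 1} and {2, 3} are only weakly connected.
W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.2, 0.1, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])

A, B = {0, 1}, {2, 3}   # B is the complement of A
print(cut(W, A, B))     # 0.2 + 0.1 = 0.3 (up to floating point)
```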


Defect of Graph Cut

The simplest idea for clustering the node set $V$ is to find the partition that minimizes the Graph Cut function. But this usually leads to degenerate solutions in which one subset contains only a few nodes, because such cuts cross very few edges.


Normalized Cut

To overcome this defect of the Graph Cut, Shi and Malik proposed a new cost function that regularizes the size of the subsets. First, we define $\mathrm{Vol}(A) = \sum_{i \in A, j \in V} w(i, j)$, and we have

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{Vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{Vol}(B)}$$
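A self-contained sketch of the NCut value on the same illustrative graph as before:

```python
import numpy as np

def cut(W, A, B):
    return W[np.ix_(sorted(A), sorted(B))].sum()

def vol(W, A):
    """Vol(A): total degree of the nodes in A."""
    return W[sorted(A), :].sum()

def ncut(W, A, B):
    c = cut(W, A, B)
    return c / vol(W, A) + c / vol(W, B)

W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.2, 0.1, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])

print(ncut(W, {0, 1}, {2, 3}))  # the volume terms penalize cutting off tiny subsets
```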


Relation between NCut and Spectral Clustering

Given a vertex subset $A_i \subseteq V$, we define the vector $f_i = I_{A_i} \cdot \frac{1}{\sqrt{\mathrm{Vol}(A_i)}}$. Then we can write the optimization problem as

$$\min_{A_i} \ \mathrm{NCut} = \frac{1}{2} \sum_{i=1}^{k} f_i^T L f_i = \frac{1}{2} \mathrm{Tr}(F^T L F)$$

$$\text{s.t.} \quad f_i = I_{A_i} \cdot \frac{1}{\sqrt{\mathrm{Vol}(A_i)}}, \qquad F^T D F = I \qquad (2)$$
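A quick sanity check (not on the slides) that the constraint $F^T D F = I$ follows from this choice of $f_i$, using only the definitions above:

```latex
% With f_i = I_{A_i} / sqrt(Vol(A_i)) and the clusters A_i disjoint:
\[
  f_i^T D f_i = \sum_{v \in A_i} \frac{d_v}{\mathrm{Vol}(A_i)} = 1,
  \qquad
  f_i^T D f_j = 0 \quad (i \neq j),
\]
% so F^T D F = I.  The objective uses the standard quadratic-form identity
\[
  f^T L f = \frac{1}{2} \sum_{u,v} w_{uv} (f_u - f_v)^2 .
\]
```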


Optimization

Because of the discrete constraint $f_i = I_{A_i} \cdot \frac{1}{\sqrt{\mathrm{Vol}(A_i)}}$, the optimization problem is NP-hard. So we relax this constraint and allow $f_i$ to take arbitrary values in $\mathbb{R}^n$. Then the optimization problem is

$$\min_{f_i} \ \mathrm{Tr}(F^T L F) \qquad \text{s.t.} \quad F^T D F = I \qquad (3)$$

The solution is given by the $k$ smallest eigenvectors of $D^{-1}L$. Based on $F$, we then recover the clusters $A_i$ with the k-means algorithm.
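A compact sketch of this relaxed procedure (assumes SciPy and scikit-learn, and a graph with no isolated nodes; this is one standard reading of the slide, not the original code):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def ncut_spectral_clustering(W, k):
    """Relaxed NCut: k smallest generalized eigenvectors of L f = lambda D f, then k-means."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # The generalized problem L f = lambda D f has the same eigenvectors as D^{-1} L;
    # eigh returns eigenvalues in ascending order, so the first k columns are the ones we need.
    _, F = eigh(L, D)
    return KMeans(n_clusters=k, n_init=10).fit_predict(F[:, :k])
```

On the small block-structured graph used in the earlier sketches, this should recover the {0, 1} / {2, 3} split for k = 2.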


Unnormalized Laplacian Matrix

Similar to the above approach, we can prove that the eigenvectors of the unnormalized Laplacian matrix give the relaxed solution for

$$\mathrm{RatioCut}(A, B) = \frac{\mathrm{cut}(A, B)}{|A|} + \frac{\mathrm{cut}(A, B)}{|B|}.$$

We can set $f_i = I_{A_i} \cdot \frac{1}{\sqrt{|A_i|}}$ and get the relaxed optimization problem,

$$\min_{f_i} \ \mathrm{Tr}(F^T L F) \qquad \text{s.t.} \quad F^T F = I \qquad (4)$$
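The RatioCut relaxation follows the same pattern in code, just with the plain orthogonality constraint (sketch, same caveats as above):

```python
import numpy as np
from sklearn.cluster import KMeans

def ratiocut_spectral_clustering(W, k):
    """Relaxed RatioCut: k smallest eigenvectors of the unnormalized Laplacian, then k-means."""
    L = np.diag(W.sum(axis=1)) - W
    _, F = np.linalg.eigh(L)     # eigenvalues in ascending order
    return KMeans(n_clusters=k, n_init=10).fit_predict(F[:, :k])
```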


Approximation

The solution from the spectral method is only an approximation for the Normalized Cut objective function, and there is no bound on the gap between them. We can easily construct cases in which the solution of the relaxed problem is very different from that of the original problem.


Experiment Result of Shi paper


Organization

- Graph Cut
- Fundamental Limitations of Spectral Clustering
- Ng 2002 paper (if we have time)


Fundamental Limitations of Spectral Clustering

As mentioned above, spectral clustering approximately solves the Normalized Graph Cut objective function. But is the Normalized Graph Cut a good criterion in all situations?


Limitation of NCut

The NCut function tends to capture the global structure of the graph. But sometimes we may want to extract local features of the graph.

For example, the Normalized Graph Cut cannot separate the Gaussian distribution from the band.


Limitation of Spectral Clustering

Next we analyze the spectral method from the viewpoint of a random walk process. We define the Markov transition matrix $M = D^{-1} W$, with eigenvalues $\lambda_i$ and eigenvectors $v_i$. The random walk on the graph converges to a unique equilibrium distribution $\pi_s$. Then we can find the following relationship between the eigenvectors and the "diffusion distance" between points:

$$\sum_{j} \lambda_j^{2t} \left(v_j(x) - v_j(y)\right)^2 = \| p(z, t \mid x) - p(z, t \mid y) \|^2_{L_2(1/\pi_s)}$$

So we see that the spectral method tries to capture the dominant patterns of the random walk on the whole graph.
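A small numerical illustration of the random-walk view (the graph is the illustrative one from earlier; verifying the full diffusion-distance identity requires careful eigenvector normalization, so this only checks the transition matrix and its equilibrium distribution):

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.2, 0.1, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
d = W.sum(axis=1)

M = W / d[:, None]   # Markov transition matrix M = D^{-1} W (row-stochastic)
pi_s = d / d.sum()   # equilibrium distribution, proportional to the node degrees

print(np.allclose(M.sum(axis=1), 1.0))  # each row of M sums to 1
print(np.allclose(pi_s @ M, pi_s))      # pi_s is stationary: pi_s M = pi_s
```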


Limitation of Spectral Clustering

But this method can fail in situations where the scales of the clusters are very different.


Self-Tuning Spectral Clustering

One way to handle the above case is to accelerate the random walk process in low-density areas. Suppose we define the affinity between nodes as

$$A_{i,j} = \exp\left(-\frac{d(v_i, v_j)^2}{\sigma_i \sigma_j}\right)$$

And $\sigma_i = d(v_i, v_k)$, where $v_k$ is the $k$-th nearest neighbor of $v_i$.
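A sketch of this locally scaled affinity (the data and the choice k = 7 are illustrative; this follows the Zelnik-Manor and Perona style local scaling the slide describes):

```python
import numpy as np

def self_tuning_affinity(X, k=7):
    """A_ij = exp(-d(v_i, v_j)^2 / (sigma_i * sigma_j)), sigma_i = distance to the k-th nearest neighbor."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.sort(dists, axis=1)[:, k]   # column 0 is the distance of each point to itself
    A = np.exp(-dists ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)               # no self-loops
    return A

X = np.random.default_rng(0).normal(size=(50, 2))  # 50 illustrative 2-D points
A = self_tuning_affinity(X, k=7)
```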


Result of Self-Tuning Spectral Clustering


Failure case


Another solution

The paper proposes another solution: split the graph into two subsets recursively, with a stopping criterion based on the relaxation time of the graph, $\tau_V = 1/(1 - \lambda_2)$.

- If the sizes of the two subsets after splitting are comparable, we expect $\tau_V \gg \tau_1 + \tau_2$.
- Otherwise, we expect $\max(\tau_1, \tau_2) \gg \min(\tau_1, \tau_2)$.

If the partition satisfies either condition, we accept the split and continue to divide the subsets; if not, we stop. But the paper does not address how to handle the general $K$-way clustering problem.
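A sketch of computing the relaxation time for a node subset (my reading of the criterion; it assumes the induced subgraph is connected and has no isolated nodes):

```python
import numpy as np

def relaxation_time(W, nodes=None):
    """tau = 1 / (1 - lambda_2) for the random walk on W restricted to `nodes`."""
    if nodes is not None:
        W = W[np.ix_(nodes, nodes)]
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))            # D^{-1/2} W D^{-1/2}: same spectrum as D^{-1} W
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    return 1.0 / (1.0 - lam[1])
```

One would then compare $\tau_V$ for the current graph against $\tau_1 + \tau_2$ (and $\max/\min$) for the two candidate subsets before accepting a split.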


Tong Zhang 2007 paper

This paper gives an upper bound on the expected error for the semi-supervised learning task on graphs. Because of limited presentation time, I will just introduce one interesting conclusion of this paper.


S-Normalized Laplacian Matrix

We define the S-normalized Laplacian matrix as

$$L_S = S^{-1/2} L S^{-1/2}$$

where $S$ is a diagonal matrix. According to the analysis in this paper, the best choice of $S$ is $S_{i,i} = |C_j|$, where $|C_j|$ is the size of the cluster $C_j$ containing node $i$. So this is an approach aimed at the different-cluster-scale problem that plain spectral clustering cannot deal with. We can see that this is similar to self-tuning spectral clustering: it renormalizes the adjacency matrix as

$$W_{ij} \leftarrow \frac{W_{ij}}{\sqrt{|C_i|}\sqrt{|C_j|}}.$$


S-Normalized Laplacian Matrix

But we do not know $|C_j|$, so the author proposes a method to compute it approximately. Define $K^{-1} = \alpha I + L_S$ with $\alpha \in \mathbb{R}$. In the ideal case, where the graph consists of $q$ disjoint connected components, we can prove that as

$$\alpha \to 0, \qquad \alpha K = \sum_{j=1}^{q} \frac{1}{|C_j|} v_j v_j^T + O(\alpha)$$

where $v_j$ is the indicator vector of cluster $j$. So if we have a small $\alpha$, we can assume $K_{i,i} \propto 1/|C_j|$, and we can then set $S_{i,i} \propto 1/K_{i,i}$.
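A numerical sketch of this size estimate in the ideal case (two disconnected cliques; using the unnormalized Laplacian here is my simplifying assumption, since the sizes inside $L_S$ are exactly what we are trying to estimate):

```python
import numpy as np

def disjoint_cliques(sizes):
    """Block-diagonal adjacency matrix of disconnected cliques (the ideal case)."""
    n = sum(sizes)
    W = np.zeros((n, n))
    start = 0
    for s in sizes:
        W[start:start + s, start:start + s] = 1.0
        start += s
    np.fill_diagonal(W, 0.0)
    return W

W = disjoint_cliques([3, 7])
L = np.diag(W.sum(axis=1)) - W
alpha = 1e-3
K = np.linalg.inv(alpha * np.eye(len(W)) + L)  # K^{-1} = alpha * I + L
print(np.round(alpha * np.diag(K), 3))         # ~1/3 on the first block, ~1/7 on the second
```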


Comparison


Organization

- Graph Cut
- Fundamental Limitations of Spectral Clustering
- Ng 2002 paper (if we have time)


Ng 2002 paper

This paper analyzes the spectral clustering problem using matrix perturbation theory. It obtains an error bound for the spectral clustering algorithm under several assumptions.


Algorithm

- Define the weighted adjacency matrix $W$ and construct the Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.

- Find $x_1, \cdots, x_k$, the $k$ largest eigenvectors of $L$, and form the matrix $X = [x_1 \cdots x_k] \in \mathbb{R}^{n \times k}$.

- Normalize every row of $X$ to have unit length: $Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$.

- Treating each row of $Y$ as a point in $\mathbb{R}^k$, cluster the rows into $k$ clusters via K-means.
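A compact sketch of the four steps above (scikit-learn's KMeans is my choice for the last step; the slides only say "via K-means"):

```python
import numpy as np
from sklearn.cluster import KMeans

def ng_jordan_weiss(W, k):
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))                   # L = D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)                       # eigenvalues in ascending order
    X = vecs[:, -k:]                                  # the k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each row to unit length
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```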


Ideal Case

Assume the graph $G$ contains $K$ clusters and has no cross-cluster edges. In this case, the Laplacian matrix $L$ defined above has exactly $K$ eigenvectors with eigenvalue 1.


Y Matrix of Ideal Case

After running the algorithm on this graph, each row of the resulting $Y$ matrix equals one of the rows of some rotation matrix $R$, so the rows of $Y$ naturally cluster into 3 groups.


The general case

In real-world data we have cross-cluster edges, so the author analyzes the influence of the cross-cluster edges on the $Y$ matrix using matrix perturbation theory.


The general case

Assumption 1

There exists $\delta > 0$ such that, for each cluster $i = 1, \cdots, k$, its second largest eigenvalue satisfies $\lambda_2^{(i)} \leq 1 - \delta$.

Assumption 2

There is some fixed $\epsilon_1 > 0$ such that for every $i_1, i_2 \in \{1, \cdots, k\}$, $i_1 \neq i_2$, we have

$$\sum_{j \in S_{i_1}} \sum_{k \in S_{i_2}} \frac{W_{jk}^2}{d_j d_k} \leq \epsilon_1,$$

where $d_i$ is the degree of node $i$ in its cluster.

The intuition of this inequality is to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.


The general case

Assumption 3

There is some fixed $\epsilon_2 > 0$ such that for every $j \in S_i$, we have

$$\sum_{k \notin S_i} \frac{W_{jk}^2}{d_j} \leq \epsilon_2 \left( \sum_{k,l \in S_i} \frac{W_{kl}^2}{d_k d_l} \right)^{-1/2}$$

The intuition of this inequality is also to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.

Assumption 4

There is some constant $C > 0$ such that for every $i = 1, \cdots, k$ and $j = 1, \cdots, n_i$, we have

$$d_j^{(i)} \geq \frac{1}{C n_i} \sum_{k=1}^{n_i} d_k^{(i)}.$$

The intuition of this inequality is that no point in a cluster should be "too much less" connected than the other points in the same cluster.


The general case

If all of the assumptions hold, set $\epsilon = \sqrt{k(k-1)\epsilon_1 + k\epsilon_2^2}$. If $\sigma > (2 + \sqrt{2})\epsilon$, then there exist $k$ orthogonal vectors $r_1, \cdots, r_k$ so that

$$\frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left\| y_j^{(i)} - r_i \right\|_2^2 \leq 4C (4 + 2\sqrt{k})^2 \frac{\epsilon^2}{(\sigma - \sqrt{2}\epsilon)^2}$$


Liu’s 2016 paper

Motivation

- The original semi-supervised learning problem can be formalized as

$$\min_{f} \ \sum_{i} \ell(f_i, y_i) + f^T L f$$

- We can enrich the label-propagation patterns with a spectrum transformation, which is called ST-enhanced semi-supervised learning:

$$\min_{f} \ \sum_{i} \ell(f_i, y_i) + f^T \sigma(L) f$$
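A sketch covering both objectives when $\ell$ is a squared loss on the labeled points (the loss choice is my assumption, since the slides leave $\ell$ generic); with that choice the minimizer has the closed form $(P + Q) f = P y$, where $Q$ is $L$ or $\sigma(L)$ and $P$ is the diagonal 0/1 mask of labeled nodes:

```python
import numpy as np

def propagate_labels(Q, y, labeled):
    """Minimize sum_{i labeled} (f_i - y_i)^2 + f^T Q f, with Q = L or sigma(L).

    P + Q must be invertible, e.g. every connected component should contain a labeled node."""
    P = np.diag(labeled.astype(float))
    return np.linalg.solve(P + Q, P @ y)
```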


Spectral Transform

We can write $L = \sum_i \lambda_i \phi_i \phi_i^T$ and define $\theta_i = \sigma(\lambda_i)^{-1}$, where $\sigma(x)$ should be a non-decreasing function. Substituting this into the objective function,

$$\min_{f} \ C(f; \theta) = \sum_{i \in \tau} \ell(f_i, y_i) + \gamma \sum_{i=1}^{m} \theta_i^{-1} \langle \phi_i, f \rangle^2$$

where $\theta_1 \geq \theta_2 \geq \cdots \geq \theta_m \geq 0$.
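A sketch of evaluating this regularizer for a given transform $\sigma$ (the quadratic transform in the comment is purely illustrative):

```python
import numpy as np

def spectral_penalty(L, f, sigma):
    """f^T sigma(L) f = sum_i theta_i^{-1} <phi_i, f>^2, with theta_i = sigma(lambda_i)^{-1}."""
    lam, phi = np.linalg.eigh(L)   # eigendecomposition L = sum_i lambda_i phi_i phi_i^T
    coeffs = phi.T @ f             # <phi_i, f>
    return float(np.sum(sigma(lam) * coeffs ** 2))

# Illustrative use: sigma(x) = x**2 penalizes high-frequency eigenvectors more strongly.
# spectral_penalty(L, f, sigma=lambda x: x ** 2)
```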


Joint optimization

We can try to jointly optimize the eigenvalue set $\theta$ and the label set $f$, so we have

$$\min_{\theta} \left( \min_{f} C(f; \theta) \right) + \tau \|\theta\|_1$$

We can prove that this function is convex in $\theta$. The optimization process can be described as follows: first, with $\theta$ fixed, we optimize the convex problem over $f$; after that, we optimize $\theta$ over its domain.


Proof of convexity

We can rewrite the objective function using the dual form of $C(f; \theta)$, which is $C^*(u; \theta)$:

$$\min_{\theta} \left( \max_{u} C^*(u; \theta) \right) + \tau \|\theta\|_1$$

where $C^*(u; \theta) = -w(-u) - \frac{1}{4\gamma} \sum_i \theta_i \langle \phi_i, u \rangle^2$, and $-w(-u)$ is the conjugate function of the loss $\ell$. So the objective is the point-wise maximum of a set of functions that are each affine (hence convex) in $\theta$, and therefore it is still convex in $\theta$.
