Unsupervised Clustering


Training samples are not labeled. We may not know:

  • how many classes there are
  • the prior probabilities $P(\omega_i)$
  • the state-conditional densities $p(x \mid \omega_i)$

The goal is the automatic discovery of structure: intuitively, objects in the same class stick together and form clusters.

Unsupervised Clustering (cont.)

Clustering means locating groups (clusters) of samples having similar measurements: given unlabelled samples $\{x_1, x_2, \ldots, x_n\}$, partition them into $c$ clusters $\chi_1, \chi_2, \ldots, \chi_c$.

Similarity Measure

We need a similarity measurement $s(x, x')$, e.g., a distance between $x$ and $x'$:

  • Minkowski metric: $d(x, y) = \left( \sum_{k=1}^{d} |x_k - y_k|^q \right)^{1/q}$, $\; q \ge 1$
  • Mahalanobis distance: $d(x, y) = (x - y)^T \Sigma^{-1} (x - y)$
  • Normalized inner product: $s(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$
  • Commonality (for binary features): $s(x, y) = \frac{x^T y}{d}$
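As a concrete reference, here is a minimal NumPy sketch of the four measures above (the function names are mine, not from the slides; the Mahalanobis form is the squared version shown above):

    import numpy as np

    def minkowski(x, y, q=2):
        # Minkowski metric; q=2 gives the Euclidean distance
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

    def mahalanobis(x, y, cov):
        # (Squared) Mahalanobis distance with covariance matrix cov
        diff = x - y
        return float(diff @ np.linalg.inv(cov) @ diff)

    def normalized_inner_product(x, y):
        # Cosine-style similarity in [-1, 1]
        return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def commonality(x, y):
        # Fraction of shared 1-bits between binary feature vectors
        return float(x @ y) / x.size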

Similarity Measure (cont.)

How do we set the threshold on similarity?

  • Too large: all samples are assigned to a single class
  • Too small: each sample ends up in its own class

How do we properly weight each feature? For example, how do the brightness of a fish and the length of a fish relate? And how should the different measures be scaled (or normalized)?

Axis Scaling

(figures: the same data under different axis scalings, showing how scaling changes the apparent clusters)

Threshold

(figure: the effect of the distance threshold on the groupings found)

Criteria for Clustering

A common criterion function for clustering is the sum of squared error:

$J_e = \sum_{i=1}^{c} \sum_{x \in \chi_i} \|x - m_i\|^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in \chi_i} x$

Equivalently, in terms of pairwise distances within each cluster:

$J_e = \sum_{i=1}^{c} \frac{1}{2 n_i} \sum_{x \in \chi_i} \sum_{x' \in \chi_i} \|x - x'\|^2$
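A small NumPy sketch of $J_e$ (the helper name is mine), summing each cluster's squared distances to its own mean:

    import numpy as np

    def sse_criterion(X, labels):
        # J_e: sum over clusters of squared distances to the cluster mean
        J = 0.0
        for c in np.unique(labels):
            pts = X[labels == c]
            J += float(np.sum((pts - pts.mean(axis=0)) ** 2))
        return J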

Criteria for Clustering (cont.)

Within- and between-group variance: minimize the within-group scatter and maximize the between-group scatter, via the trace or the determinant of the appropriate scatter matrix:

  • trace: square of the scattering radius
  • determinant: square of the scattering volume

$S_W = \frac{1}{n} \sum_{i=1}^{c} \sum_{x \in \chi_i} (x - m_i)(x - m_i)^T$

$S_B = \frac{1}{n} \sum_{i=1}^{c} n_i \, (m_i - m)(m_i - m)^T, \qquad m = \frac{1}{n} \sum_{x} x$
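A short sketch computing both scatter matrices (normalization by $n$ as above; the helper name is mine):

    import numpy as np

    def scatter_matrices(X, labels):
        # Within-group (S_W) and between-group (S_B) scatter matrices
        n, dim = X.shape
        m = X.mean(axis=0)
        S_W = np.zeros((dim, dim))
        S_B = np.zeros((dim, dim))
        for c in np.unique(labels):
            pts = X[labels == c]
            m_i = pts.mean(axis=0)
            centered = pts - m_i
            S_W += centered.T @ centered
            diff = (m_i - m).reshape(-1, 1)
            S_B += len(pts) * (diff @ diff.T)
        return S_W / n, S_B / n

    # Compactness criteria: minimize trace(S_W) or det(S_W), while the
    # between-group counterparts grow.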

Pathology: when class sizes differ, minimizing

$J_e = \sum_{i=1}^{c} \sum_{x \in \chi_i} \|x - m_i\|^2$

can work against the natural grouping:

  • Relaxed constraint: allow the large class to grow into the smaller class
  • Stringent constraint: disallow the large class and split it

K-Means Algorithm (fixed number of clusters)

1. Arbitrarily pick $c$ cluster centers and assign each sample to the nearest center.
2. Compute the sample mean of each cluster.
3. Reassign all samples to the cluster with the nearest mean.
4. Repeat from step 2 if there are changes; otherwise stop.

A runnable sketch of these steps follows below.
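A minimal NumPy implementation of the loop above (assumes no cluster becomes empty during the iterations; the names are mine):

    import numpy as np

    def kmeans(X, c, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Arbitrarily pick c distinct samples as the initial centers
        centers = X[rng.choice(len(X), size=c, replace=False)]
        for _ in range(n_iter):
            # Assign every sample to its nearest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute the sample mean of each cluster
            new_centers = np.array([X[labels == k].mean(axis=0) for k in range(c)])
            if np.allclose(new_centers, centers):
                break  # no change: stop
            centers = new_centers
        return labels, centers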


K-Means

In some sense, k-means is a simplified version of case II of learning a parametric form, where each sample has a posterior membership in every class:

$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i) \, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j) \, \hat{P}(\omega_j)}$

Instead of multiple (fractional) membership, k-means gives the sample entirely to whichever class has the closest mean:

$\hat{P}(\omega_i \mid x_k) = \begin{cases} 1 & \text{if class } \omega_i \text{ has the closest class mean} \\ 0 & \text{otherwise} \end{cases}$

Fuzzy K-Means

In Matlab, you can find k-means (there it is called c-means) under the Fuzzy Logic Toolbox. It implements fuzzy k-means, where the membership function is not restricted to 1 or 0 but can be fractional. A sketch of the standard updates follows.
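A minimal fuzzy k-means (c-means) sketch using the standard alternating updates with fuzzifier $m > 1$ (this implementation is mine, not the toolbox's):

    import numpy as np

    def fuzzy_kmeans(X, c, m=2.0, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # U[i, k]: fractional membership of sample i in cluster k (rows sum to 1)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            Um = U ** m
            # Centers are membership-weighted means
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            # Memberships from inverse relative distances
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            inv = d ** (-2.0 / (m - 1.0))
            U = inv / inv.sum(axis=1, keepdims=True)
        return U, centers

As $m \to 1$, the memberships approach the hard 0/1 assignments of ordinary k-means.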

Iterative K-Means

Iterative refinement of the clusters to minimize

$J_e = \sum_{i=1}^{c} \sum_{x \in \chi_i} \|x - m_i\|^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in \chi_i} x$

Each step moves a single sample $\hat{x}$ from one cluster $\chi_i$ to another $\chi_j$. The means and the per-cluster errors can be updated incrementally:

$m_j^* = m_j + \frac{\hat{x} - m_j}{n_j + 1}, \qquad J_j^* = J_j + \frac{n_j}{n_j + 1} \, \|\hat{x} - m_j\|^2$

$m_i^* = m_i - \frac{\hat{x} - m_i}{n_i - 1}, \qquad J_i^* = J_i - \frac{n_i}{n_i - 1} \, \|\hat{x} - m_i\|^2$

The transfer is advantageous whenever the decrease in $J_i$ exceeds the increase in $J_j$.

1. Select an initial partition of the $n$ samples into clusters and compute $J_e$ and the means $m_1, \ldots, m_c$.
2. Select the next candidate sample $\hat{x}$; let $\chi_i$ be its current cluster.
3. If the current cluster is larger than 1, compute

$\rho_j = \frac{n_j}{n_j + 1} \, \|\hat{x} - m_j\|^2 \;\; (j \ne i), \qquad \rho_i = \frac{n_i}{n_i - 1} \, \|\hat{x} - m_i\|^2$

4. Transfer $\hat{x}$ to $\chi_k$ if $\rho_k \le \rho_j$ for all $j$.
5. Update $J_e$, $m_i$, and $m_k$.
6. If $J_e$ has not changed in $n$ attempts, stop; otherwise go back to 2.
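A sketch of this single-sample transfer loop with the incremental mean updates (assumes every cluster starts non-empty and never drops below one sample; the names are mine):

    import numpy as np

    def iterative_kmeans(X, labels, n_pass=10):
        labels = labels.copy()
        c = labels.max() + 1
        means = np.array([X[labels == k].mean(axis=0) for k in range(c)])
        counts = np.bincount(labels, minlength=c).astype(float)
        for _ in range(n_pass):
            changed = False
            for t, x in enumerate(X):
                i = labels[t]
                if counts[i] <= 1:
                    continue  # never empty a cluster
                d2 = np.sum((means - x) ** 2, axis=1)
                # rho_j = n_j/(n_j+1) * ||x - m_j||^2 for j != i
                rho = counts / (counts + 1.0) * d2
                # rho_i = n_i/(n_i-1) * ||x - m_i||^2 for the current cluster
                rho[i] = counts[i] / (counts[i] - 1.0) * d2[i]
                k = int(rho.argmin())
                if k != i:
                    # Incremental updates of the receiving and donating means
                    means[k] += (x - means[k]) / (counts[k] + 1.0)
                    means[i] -= (x - means[i]) / (counts[i] - 1.0)
                    counts[k] += 1.0
                    counts[i] -= 1.0
                    labels[t] = k
                    changed = True
            if not changed:
                break
        return labels, means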

Hierarchical Clustering

K-means assumes a "flat" description of the data; hierarchical descriptions are frequently more appropriate.

Hierarchical Clustering (cont.)

Samples in the same cluster will remain so at a higher level.

(figure: dendrogram over samples $x_1, \ldots, x_6$, with merge levels 1 through 5 on the vertical axis)

Hierarchical Clustering Options

  • Bottom-up: agglomerative
  • Top-down: divisive

The agglomerative procedure:

1. Let $\hat{c} = n$ and $\chi_i = \{x_i\}$, $i = 1, \ldots, n$.
2. If $\hat{c} \le c$, stop.
3. Find the nearest pair of distinct clusters, say $\chi_i$ and $\chi_j$.
4. Merge $\chi_i$ and $\chi_j$, delete $\chi_j$, and decrement $\hat{c}$ by one.
5. Go to 2.

Hierarchical Clustering Procedure

At a certain step we have $c$ clusters, with sizes $n_i$ and means

$m_i = \frac{1}{n_i} \sum_{x \in \chi_i} x$

Merging two clusters $i$ and $j$ gives

$n_{ij} = n_i + n_j, \qquad m_{ij} = \frac{1}{n_{ij}} \sum_{x \in \chi_i \cup \chi_j} x = \frac{n_i m_i + n_j m_j}{n_i + n_j}$

Hierarchical Clustering Procedure (cont.)

A given criterion function will then increase. For the sum-of-squared-error criterion, the increase caused by merging $\chi_i$ and $\chi_j$ is

$E_{ij} = \sum_{x \in \chi_i \cup \chi_j} \|x - m_{ij}\|^2 - \sum_{x \in \chi_i} \|x - m_i\|^2 - \sum_{x \in \chi_j} \|x - m_j\|^2 = \frac{n_i n_j}{n_i + n_j} \, \|m_i - m_j\|^2$

so $i^*, j^*$ are chosen to minimize this increase:

$(i^*, j^*) = \arg\min_{i,j} \frac{n_i n_j}{n_i + n_j} \, \|m_i - m_j\|^2 = \arg\min_{i,j} d(i, j)$

Merging continues until no pair exists whose merge would increase the criterion function by less than a preset threshold.

Criteria Function

In fact, the criterion function can be chosen in many different ways, resulting in different clustering behaviors:

  • Nearest neighbor (favors elongated classes): $d_{\min}(\chi_i, \chi_j) = \min_{x \in \chi_i, \, y \in \chi_j} \|x - y\|$
  • Furthest neighbor (favors compact classes): $d_{\max}(\chi_i, \chi_j) = \max_{x \in \chi_i, \, y \in \chi_j} \|x - y\|$

A small agglomerative sketch with both options follows.
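A deliberately naive bottom-up sketch with a selectable linkage criterion (fine for small $n$; the helper names are mine):

    import numpy as np

    def agglomerate(X, c_target, linkage="min"):
        # Start with every sample in its own cluster
        clusters = [[i] for i in range(len(X))]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        agg = np.min if linkage == "min" else np.max  # nearest vs. furthest neighbor
        while len(clusters) > c_target:
            # Find the nearest pair of distinct clusters under the chosen linkage
            best, pair = np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    dist = agg(D[np.ix_(clusters[a], clusters[b])])
                    if dist < best:
                        best, pair = dist, (a, b)
            a, b = pair
            clusters[a].extend(clusters[b])  # merge the pair
            del clusters[b]
        return clusters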

  • Starting from every sample in its own cluster
  • Merge them according to some criterion function
  • Until two clusters exist

(figures: the successive merges on example data)

In Reality

With $n$ objects the distance matrix is $n \times n$. For large databases it is computationally expensive to compute and store this matrix. Solution: store only the $k$ nearest clusters for each cluster, reducing the complexity to $k \times n$.

Divisive Clustering

Less often used. One has to be careful: the criterion function usually decreases monotonically (the clusters become purer with each split). A natural grouping is the one where a large drop in impurity occurs.

(figure: impurity vs. iteration, with the natural grouping at the large drop)

ISODATA Algorithm

  • Iterative Self-Organizing Data Analysis Technique
  • A tried-and-true clustering algorithm
  • Dynamically updates the number of clusters
  • Can go in both top-down (split) and bottom-up (merge) directions

Notation:

  • $T$: threshold on the number of samples in a cluster
  • $N_D$: approximate (desired) number of clusters
  • $\sigma_s$: maximum spread for splitting
  • $D_m$: maximum distance for merging
  • $N_{\max}$: maximum number of clusters that can be merged

Algorithm

1. Cluster the existing data into $N_c$ clusters, eliminating any data and classes with fewer than $T$ members (decreasing $N_c$). Exit when the classification of samples no longer changes.
2. If the iteration is odd and $N_c \le N_D / 2$: split clusters whose samples are sufficiently disjoint, increasing $N_c$. If any clusters have been split, go to 1. (If $N_c \ge 2 N_D$, skip splitting and proceed to merging.)
3. Merge any pairs of clusters whose samples are sufficiently close.
4. Go to step 1.

Step 1 (parameter: $T$)

  • Classify the samples according to the nearest mean.
  • Discard samples in clusters with fewer than $T$ samples, decreasing $N_c$.
  • Recompute the sample mean of each cluster.
  • If any sample changed classification, repeat; otherwise exit.

Step 2

Compute for each cluster $k$ its largest per-axis spread and its average distance to the mean:

$\sigma_k^2 = \max_j \frac{1}{N_k} \sum_{\mathbf{y} \in \chi_k} \left( y^{(j)} - \mu_k^{(j)} \right)^2, \qquad d_k = \frac{1}{N_k} \sum_{\mathbf{y} \in \chi_k} \|\mathbf{y} - \mu_k\|$

Compute globally

$d = \frac{1}{N} \sum_{k=1}^{N_c} N_k \, d_k$

Step 2 (cont.) (parameters: $\sigma_s$, $N_D$, $T$)

For each cluster $k$: if

$\sigma_k^2 > \sigma_s^2$ and either ($d_k > d$ and $N_k > 2(T + 1)$) or $N_c \le N_D / 2$,

then split cluster $k$ and set $N_c = N_c + 1$. The two new cluster centers are displaced in opposite directions along the axis of largest variance. A code sketch of this test follows.
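A sketch of the split test for one cluster (parameter names follow the slides; the displacement magnitude is one conventional choice, an assumption on my part):

    import numpy as np

    def maybe_split(pts, mu, d_global, N_c, sigma_s, T, N_D):
        # Largest per-axis spread and mean distance to the center
        sigma2 = float(np.max(np.mean((pts - mu) ** 2, axis=0)))
        d_k = float(np.mean(np.linalg.norm(pts - mu, axis=1)))
        if sigma2 > sigma_s ** 2 and (
            (d_k > d_global and len(pts) > 2 * (T + 1)) or N_c <= N_D / 2
        ):
            axis = int(np.argmax(np.var(pts, axis=0)))  # axis of largest variance
            delta = np.zeros_like(mu)
            delta[axis] = np.sqrt(sigma2)
            return mu - delta, mu + delta  # two centers, displaced oppositely
        return None  # no split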

Step 3

Compute for each pair of clusters

$d_{ij} = \|\mu_i - \mu_j\|$

and sort the distances that are less than $D_m$ from smallest to largest.

Going through the sorted pairs: if $d_{ij} < D_m$, neither cluster $i$ nor $j$ has already been merged in this pass, and the number of merges is below $N_{\max}$, then merge $i$ and $j$, set $N_c = N_c - 1$, and

$\mu' = \frac{1}{N_i + N_j} \left( N_i \mu_i + N_j \mu_j \right)$

Spectral Clustering

  • A graph-theoretical approach
  • (Advanced) linear algebra is really important
  • A field by itself; we cover only the basics here

Graph Notations

  • Undirected graph $G = (V, E)$
  • $V = \{v_1, \ldots, v_n\}$ are the nodes (samples)
  • $E = \{e_{i,j} \mid 0 \le i, j < n\}$ are the edges, with weights (similarities) $w_{i,j}$
  • $W$: adjacency matrix with entries $w_{i,j}$
  • $D$: degree matrix (diagonal) with entries $d_i = \sum_j w_{i,j}$
  • $A$: a subset of the vertices; $\bar{A}$: its complement $V \setminus A$
  • $\mathbb{1}_A = (f_1, \ldots, f_n)^T$, with $f_i = 1$ if $v_i \in A$ and 0 otherwise

More Graph Notations

Size of a subset $A$ of $V$:

  • based on its number of vertices: $|A|$
  • based on its connections: $\mathrm{vol}(A) = \sum_{i \in A} d_i$

Connected component $A$: any two vertices in $A$ can be joined by a path whose intermediate points all lie in $A$, and there is no connection between $A$ and $\bar{A}$.

$A_1, \ldots, A_k$ form a partition if $A_i \cap A_j = \emptyset$ for $i \ne j$ and $A_1 \cup \cdots \cup A_k = V$.

Similarity Definitions

  • RBF (fully connected or thresholded): e.g., a Gaussian kernel gives $w_{i,j} = \exp\left( -\|x_i - x_j\|^2 / (2\sigma^2) \right)$
  • $\varepsilon$-neighborhood graph: connect all points whose pairwise distances are less than $\varepsilon$
  • k-nearest neighbor: $v_i$ and $v_j$ are neighbors if either one is a kNN of the other
  • Mutual k-nearest neighbor: $v_i$ and $v_j$ are neighbors if both are a kNN of the other

Graph Laplacian: L = D - W
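The slides illustrate $L$ on example graphs; here is a small sketch (the helper name is mine) that builds a Gaussian-similarity graph and checks the key property that the number of (near-)zero eigenvalues of $L$ equals the number of connected components:

    import numpy as np

    def graph_laplacian(X, sigma=1.0):
        # Fully connected Gaussian-similarity graph
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        W = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)  # no self-loops
        D = np.diag(W.sum(axis=1))
        return D - W

    # Two far-apart blobs: the cross-blob weights underflow to 0, so the
    # graph has two connected components and L has two zero eigenvalues.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(100, 1, (20, 2))])
    eigvals = np.linalg.eigvalsh(graph_laplacian(X))
    print(int(np.sum(eigvals < 1e-8)))  # -> 2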

One connected component

(figure: a connected example graph; $L$ has a single zero eigenvalue)

Multiple connected components

(figure: a disconnected example graph; $L$ is block-diagonal, with one zero eigenvalue per component)

Normalized Laplacian

$L_{sym} = D^{-1/2} L \, D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \qquad L_{rw} = D^{-1} L = I - D^{-1} W$

Unnormalized Spectral Clustering

Compute $L = D - W$, take the eigenvectors $u_1, \ldots, u_k$ of $L$ belonging to the $k$ smallest eigenvalues as the columns of $U \in \mathbb{R}^{n \times k}$, and cluster the rows of $U$ with k-means.

Normalized Spectral Clustering

The same recipe, but using the generalized eigenproblem $L u = \lambda D u$ (Shi & Malik) or the eigenvectors of $L_{sym}$ with row normalization (Ng, Jordan & Weiss). A pipeline sketch follows.
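A compact sketch covering both variants (it reuses the kmeans sketch from earlier; the Shi-Malik case is solved through the symmetric form):

    import numpy as np

    def spectral_clustering(W, k, normalized=True, seed=0):
        # W: symmetric similarity (adjacency) matrix
        d = W.sum(axis=1)
        L = np.diag(d) - W
        if normalized:
            # D^{-1/2} (D - W) D^{-1/2}: same spectrum as L u = lambda D u
            d_inv_sqrt = 1.0 / np.sqrt(d)
            L = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
        eigvals, U = np.linalg.eigh(L)       # ascending eigenvalues
        emb = U[:, :k]                       # k smallest eigenvectors as features
        if normalized:
            emb = d_inv_sqrt[:, None] * emb  # map z back to y = D^{-1/2} z
        labels, _ = kmeans(emb, k, seed=seed)  # k-means on the embedded rows
        return labels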


Toy Example

  • A random sample of 200 points drawn from 4 Gaussians
  • Similarity based on the Gaussian kernel
  • Graph: fully connected, or 10 nearest neighbors

Min-Cut Formulation

If the edge weight represents the degree of similarity, an optimal bi-partitioning of the graph is one that minimizes the cut (the min-cut problem):

$cut(A, B) = \sum_{u \in A, \, v \in B} w(u, v)$

Intuition: cut the connections between dissimilar samples, where the edge weight (similarity) is small.

(figure: a point set in feature space, bi-partitioned by a cut)

Problem with Min-Cut

Min-cut tends to cut out small regions. In the figure's terms, the total edge weight (green + red + blue) is constant, and min-cut minimizes only the sum of the weights of the blue (cut) edges, with no regard to the green and red within-group edges (half of the picture).

(figure: blue edges cross the cut; green edges lie within A, red edges within B)

Remedy: distribute the total weight such that

  • the sum of the weights of the blue (cut) edges is minimized (max between-group variance)
  • the sums of the weights of the red (and green) within-group edges are maximized (min within-group variance)

Normalized Cut

Penalize cutting out small, isolated clusters:

$Ncut(A, B) = \frac{cut(A, B)}{asso(A, V)} + \frac{cut(A, B)}{asso(B, V)} = \frac{\text{blue}}{\text{total} - \text{red}} + \frac{\text{blue}}{\text{total} - \text{green}}$

where, with $V$ the full vertex set,

$cut(A, B) = \sum_{u \in A, \, v \in B} w(u, v)$  (blue)

$asso(A, V) = \sum_{u \in A, \, v \in V} w(u, v)$  (green + blue = total - red)

$asso(B, V) = \sum_{u \in B, \, v \in V} w(u, v)$  (red + blue = total - green)

(figure: two groups $A$ and $B$)

Normalized Cut (cont.)

$Ncut(A, B) = \frac{cut(A, B)}{asso(A, V)} + \frac{cut(A, B)}{asso(B, V)}$

This penalizes cutting out small, isolated clusters: carving off a small cluster makes both the blue (cut) weight and the cluster's red (or green) within-group weight small, so the corresponding ratio stays large instead of vanishing.

Intuition

$asso(A, V) = \sum_{u \in A, \, v \in V} w(u, v) = \sum_{u \in A, \, v \in A} w(u, v) + \sum_{u \in A, \, v \in B} w(u, v) = assoc(A, A) + cut(A, B)$

$Nassoc(A, B) = \frac{assoc(A, A)}{asso(A, V)} + \frac{assoc(B, B)}{asso(B, V)} = \left( 1 - \frac{cut(A, B)}{asso(A, V)} \right) + \left( 1 - \frac{cut(A, B)}{asso(B, V)} \right) = 2 - Ncut(A, B)$

In the color picture:

$Nassoc(A, B) = \frac{\text{green}}{\text{green} + \text{blue}} + \frac{\text{red}}{\text{red} + \text{blue}}, \qquad Ncut(A, B) = \frac{\text{blue}}{\text{green} + \text{blue}} + \frac{\text{blue}}{\text{red} + \text{blue}}$

Assoc reflects the intra-class connections, which should be maximized; Ncut represents the inter-class connections, which should be minimized. Since $Nassoc = 2 - Ncut$, the two objectives coincide.

Solution

How do you define similarity? Multiple measurements (i.e., a feature vector) can be used, e.g.,

$d(i, j) = \sum_k w_k \, \lvert v_i^{(k)} - v_j^{(k)} \rvert$

How do you find the minimal normalized cut? The solution turns out to be a generalized eigenvalue problem!

Define an indicator vector $\mathbf{x} \in \mathbb{R}^N$ over the $N = |V|$ nodes:

$x_i = \begin{cases} 1 & i \in A \\ -1 & i \in B \end{cases}$

and let

$d_i = \sum_j w(i, j), \qquad D = \mathrm{diag}(d_1, \ldots, d_N), \qquad W = \big[ w(i, j) \big]_{i, j = 1}^{N}, \qquad k = \frac{\sum_{x_i > 0} d_i}{\sum_i d_i}$

With this notation (writing $\mathbf{1}$ for the all-ones vector, and noting that $\frac{1}{2}(\mathbf{1} + \mathbf{x})$ and $\frac{1}{2}(\mathbf{1} - \mathbf{x})$ are the indicator vectors of $A$ and $B$):

$Ncut(A, B) = \frac{cut(A, B)}{asso(A, V)} + \frac{cut(A, B)}{asso(B, V)} = \frac{\sum_{x_i > 0, \, x_j < 0} -w_{ij} x_i x_j}{\sum_{x_i > 0} d_i} + \frac{\sum_{x_i < 0, \, x_j > 0} -w_{ij} x_i x_j}{\sum_{x_i < 0} d_i}$

$= \frac{(\mathbf{1} + \mathbf{x})^T (D - W)(\mathbf{1} + \mathbf{x})}{4 k \, \mathbf{1}^T D \mathbf{1}} + \frac{(\mathbf{1} - \mathbf{x})^T (D - W)(\mathbf{1} - \mathbf{x})}{4 (1 - k) \, \mathbf{1}^T D \mathbf{1}}$

using $d_i = \sum_j w_{ij} = \sum_{x_j > 0} w_{ij} + \sum_{x_j < 0} w_{ij}$, $\; k \, \mathbf{1}^T D \mathbf{1} = \sum_{x_i > 0} d_i$, and $(1 - k) \, \mathbf{1}^T D \mathbf{1} = \sum_{x_i < 0} d_i$.

Setting $\mathbf{y} = (\mathbf{1} + \mathbf{x}) - b(\mathbf{1} - \mathbf{x})$ with $b = \frac{k}{1 - k}$, one can show

$\min_{\mathbf{x}} Ncut(\mathbf{x}) = \min_{\mathbf{y}} \frac{\mathbf{y}^T (D - W) \mathbf{y}}{\mathbf{y}^T D \mathbf{y}}$, subject to $\mathbf{y}^T D \mathbf{1} = 0$

This Rayleigh quotient is minimized through the generalized eigenvalue problem

$(D - W)\mathbf{y} = \lambda D \mathbf{y} \quad \Longleftrightarrow \quad D^{-1/2}(D - W) D^{-1/2} \mathbf{z} = \lambda \mathbf{z}, \qquad \mathbf{z} = D^{1/2} \mathbf{y}$

$D^{-1/2}(D - W)D^{-1/2}$ is a symmetric positive semi-definite matrix:

  • real eigenvalues $\ge 0$
  • orthogonal eigenvectors
  • eigenvector $\mathbf{z}_0 = D^{1/2} \mathbf{1}$ with eigenvalue 0
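A sketch of the two-way Ncut split via the second smallest eigenvector (thresholding $\mathbf{y}$ at 0 is one common choice; Shi & Malik also search over thresholds):

    import numpy as np

    def ncut_bipartition(W):
        # Solve (D - W) y = lambda D y through the symmetric form
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(d)
        L_sym = d_inv_sqrt[:, None] * (np.diag(d) - W) * d_inv_sqrt[None, :]
        eigvals, Z = np.linalg.eigh(L_sym)  # ascending; eigvals[0] ~ 0, z0 = D^{1/2} 1
        y = d_inv_sqrt * Z[:, 1]            # second smallest: y = D^{-1/2} z
        return y >= 0                       # True -> one side of the cut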

Hence the second smallest eigenvector contains the minimal-cut solution (in floating-point form, to be thresholded into a discrete partition). Even though computing all eigenvectors/eigenvalues is expensive, $O(n^3)$, computing a small number of them is not that expensive (Lanczos method). Recursive application of the procedure yields more than two clusters.

Results

(result figures)