Chapter 9 UNSUPERVISED LEARNING: Clustering Part 2
Cios / Pedrycz / Swiniarski / Kurgan


Some key observed features in the operation of a human associative memory:

• information is retrieved/recalled from the memory on the basis of some measure of similarity relating to a key pattern

• memory is able to store and recall representations as structured sequences

• the recalls of information from memory are dynamic and similar to time-continuous physical systems

SOM Clustering


Self-Organizing Feature Maps

In data analysis it is fundamental to:

• capture the topology and probability distribution of pattern vectors

• map pattern vectors from the original high-D space onto the lower-D new feature space (compressed)


• Data compression requires selection of features that best represent data for a specific purpose, e.g., better visual inspection of the data's structure

• Most attractive from the human point of view are visualizations in 2D or 3D

The major difficulty is faithful projection/mapping of data to ensure preservation of the topology present in the original feature space.

Self-Organizing Feature Maps


Topology-preserving mapping should have these properties:

• similar patterns in the original feature space must also be similar in the reduced feature space - according to some similarity criteria

• similarity in the original and the reduced spaces should be of a "continuous nature",

i.e., density of patterns in the reduced feature space should correspond to those in the original space.

Self-Organizing Feature Maps


Several methods were developed for 2D topology-preserving mapping:

• linear projections, such as those based on eigenvectors

• nonlinear projections, such as Sammon's projection

• nonlinear projections, such as SOM neural networks

Self-Organizing Feature Maps


Sammon's Projection

A nonlinear projection used to preserve topological relations between patterns in the original and the reduced spaces by preserving the inter-pattern distances.

Sammon's projection minimizes an error defined in terms of the differences between inter-pattern distances in the original and the reduced feature spaces.

{x_k} -- set of L n-dimensional vectors x_k in the original feature space R^n
{y_k} -- set of L corresponding m-dimensional vectors y_k in the reduced low-dimensional space R^m, with m << n

Distortion measure:

J = [ 1 / Σ_{i,j=1, i≠j}^{L} d(x_i, x_j) ] · Σ_{i,j=1, i≠j}^{L} ( d(x_i, x_j) − d(y_i, y_j) )² / d(x_i, x_j)


Sammon’s Projection

Performs a nonlinear projection, typically onto a 2D plane

Disadvantages:
- it is computationally heavy
- it cannot be used to project new points (points that were not used during training) onto the output plane
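The distortion measure above can be evaluated directly from the two sets of pairwise distances. A minimal numpy sketch (not from the book; the names and the assumption that X and Y store one pattern per row are mine, and duplicate patterns with zero distance are not handled):

```python
import numpy as np

def sammon_stress(X, Y):
    """Sammon distortion J between original patterns X (L x n) and their
    low-dimensional images Y (L x m), using Euclidean inter-pattern distances."""
    L = X.shape[0]
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # original distances
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)  # reduced-space distances
    iu = np.triu_indices(L, k=1)          # each pair (i, j), i < j, counted once
    dx, dy = dX[iu], dY[iu]
    return np.sum((dx - dy) ** 2 / dx) / np.sum(dx)
```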


SOM: Principle

high-dimensional space → low-dimensional space (2D, 3D)


• Developed by Kohonen in 1982

• SOM is an unsupervised learning, topology preserving, projection algorithm

• It uses a feedforward topology

• It is a scaling method projecting data from high-D input space into a lower-D output space

• Similar vectors in the input space are projected onto nearby neurons on the 2D map

Self-Organizing Feature Maps


• The feature map is a layer in which the neurons organize themselves according to the input values

• Each neuron of the input layer is connected to each neuron of the 2D topology/map

• The weights associated with the inputs are used to propagate them to the map neurons

Self-Organizing Feature Maps


• The neurons in a certain area around the winning neuron are also influenced

• SOM reflects the ability of biological neurons to perform global ordering based on local interactions

SOM: Topology and Learning Rule


One iteration of the SOM learning:

1. Present a randomly selected input vector x to all neurons

2. Select the winning neuron, i.e., one whose weight vector is closest to the input vector, according to the chosen similarity measure

3. Adjust the weights of the jth winning neuron and of the neighboring neurons (defined by some neighborhood function)

SOM: Topology and Learning Rule


The jth winning neuron is selected as the one having the minimal distance value:

||x − w_j|| = min_k ||x − w_k||,   k = 1, …, M

The competitive (winner-takes-all / Kohonen) learning rule is used for adjusting the weights:

w_k(t+1) = w_k(t) + h_j(N_j(t), t) (x − w_k(t)),   k = 1, 2, …, M

where (x − w_k(t)) is the degree of similarity between a neuron and its input.

SOM: Topology and Learning Rule
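As a concrete illustration of the two steps above, here is a small numpy sketch (my own; `neighbors` stands for whatever neighborhood set N_j(t) the kernel produces, and `eta` for the current learning rate):

```python
import numpy as np

def winner(x, W):
    """Index of the neuron whose weight vector is closest to the input x.
    W is an (M, n) array holding one weight vector per row."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

def kohonen_update(W, x, neighbors, eta):
    """Competitive (winner-takes-all style) update: every neuron in the winning
    neighborhood is moved toward the input x by a fraction eta of (x - w)."""
    for p in neighbors:                # the neighborhood includes the winner itself
        W[p] += eta * (x - W[p])
    return W
```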


Kohonen also proposed a dot product similarity for selecting the winning neuron:

x^T w_j = max_k ( x^T w_k ),   k = 1, …, M

and the learning rule:

w_i(t+1) = ( w_i(t) + α(t) x ) / || w_i(t) + α(t) x ||,   i ∈ N_j(t)

where N_j(t) is the winning neuron's neighborhood and α(t) (0 < α(t) < ∞) is the decreasing learning rate function.

This formula assures automatic weight normalization to the length of one.

SOM: Topology and Learning Rule


• The neighborhood kernel (of the winning neuron) is a non-increasing function of time and distance

• It defines the region of influence that the input has on the SOM

h_j(N_j(t), t) = h_j( ||r_j − r_i||, t ),   with 0 ≤ h_j(N_j(t), t) ≤ 1

where r_j and r_i are the positions (radii) of the winning neuron j and of neuron i, respectively.

Neighborhood Kernel


• The geometric set of neurons must decrease with the increase of iteration steps (time)

• Convergence of the learning process requires that the radius must decrease with learning time/iteration

This causes global ordering by local interactions and local weight adjustments.

N_j(r(t_1), t_1) ⊇ N_j(r(t_2), t_2) ⊇ N_j(r(t_3), t_3) ⊇ …

r(t_1) > r(t_2) > r(t_3) > …

Neighborhood Kernel


With the neighborhood function fixed, the neighborhood kernel learning function can be defined as:

η(t) = η_max (1 − t/T),   where T is the number of iterations

and

h_j(N_j(t), t) = h_j( ||r_j − r_i||, t ) = η(t) for i ∈ N_j(t), and 0 otherwise

Neighborhood Kernel


Another frequently used neighborhood kernel is a Gaussian function with a radius decreasing with time:

h_j(N_j(t), t) = h_j( ||r_j − r_i||, t ) = η(t) exp( −||r_j − r_i||² / (2σ²(t)) )

where r_j is the position of the winning neuron, r_i the position of the ith neuron, and σ(t) the width of the kernel.

Neighborhood Kernel
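The two kernels above are easy to put into code. A sketch under stated assumptions: the kernel width sigma(t) is taken to shrink linearly with time, which is one common choice but is not prescribed by the slides:

```python
import numpy as np

def learning_rate(t, T, eta_max=0.5):
    """Linearly decreasing learning rate: eta(t) = eta_max * (1 - t/T)."""
    return eta_max * (1.0 - t / T)

def gaussian_kernel(r_j, r_i, t, T, sigma0=3.0, eta_max=0.5):
    """Gaussian neighborhood kernel h_j(N_j(t), t): the influence on neuron i
    decays with its grid distance from the winner j and with time, because the
    kernel width sigma(t) shrinks as training proceeds (assumed linear decay)."""
    sigma_t = sigma0 * (1.0 - t / T) + 1e-3        # assumed width schedule
    d2 = float(np.sum((np.asarray(r_j) - np.asarray(r_i)) ** 2))
    return learning_rate(t, T, eta_max) * np.exp(-d2 / (2.0 * sigma_t ** 2))
```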


Conditions for successful learning of the SOM network:

• the training data set must be large, since self-organization relies on the statistical properties of the data

• proper selection of the neighborhood kernel function assures that only the weights of the winning neuron and its neighborhood neurons are locally adjusted

• the radius of the winning neighborhood, as well as the learning function rate, must monotonically decrease with time

• the amount of weight adjustment for neurons in a winning neighborhood depends on how close they are to the winner

SOM: Algorithm


Given: The 2D network topology consisting of M neurons; training data set of L n-D input vectors; number of iterations T; neighborhood kernel function; learning rate function

1. Set the learning rate to the maximum value of the learning rate function

2. set iteration step t=0

3. randomly select initial values of the weights

4. randomly select the input pattern and present it to the network

5. compute the current learning rate for step t using the given learning rate function

SOM: Algorithm


6. compute the Euclidean distances

|| xi - wk (t) ||, k = 1,…, M

7. select the jth winning neuron

|| xi - wj(t) || = min || xi(t) – wk(t) ||, k = 1,…, M

8. define the winning neighborhood around the winning neuron using the neighborhood kernel

9. adjust the weights of the neurons

wp(t+1) = wp(t) + η(t) (xi – wp(t)),   p ∈ Nj(t)

10. increase t = t + 1; if t > T stop, otherwise go to step 4

Result: Trained SOM network

SOM: Algorithm
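The algorithm on the last two slides can be condensed into a short training loop. This is a minimal sketch, not the book's reference implementation; it uses the Gaussian kernel from the earlier slide, and the linear decay schedules for the learning rate and the kernel width are assumptions:

```python
import numpy as np

def train_som(X, rows, cols, T, eta_max=0.5, sigma0=None, rng=None):
    """Train a rows x cols SOM on the (L, n) data matrix X for T iterations."""
    rng = np.random.default_rng() if rng is None else rng
    sigma0 = max(rows, cols) / 2.0 if sigma0 is None else sigma0
    M, n = rows * cols, X.shape[1]
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.random((M, n))                         # step 3: random initial weights
    for t in range(T):
        x = X[rng.integers(len(X))]                # step 4: random input pattern
        eta = eta_max * (1.0 - t / T)              # step 5: current learning rate
        d = np.linalg.norm(x - W, axis=1)          # step 6: Euclidean distances
        j = int(np.argmin(d))                      # step 7: winning neuron
        sigma = sigma0 * (1.0 - t / T) + 1e-3      # shrinking neighborhood radius
        h = np.exp(-np.sum((grid - grid[j]) ** 2, axis=1) / (2 * sigma ** 2))  # step 8
        W += eta * h[:, None] * (x - W)            # step 9: adjust the weights
    return W, grid                                 # step 10 is the loop bound itself
```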


(Illustration from Kohonen.)


Given: Trained SOM network consisting of the 2D array of M neurons, each neuron receiving an input via its weight vector w. Small training data set consisting of pairs (xi, ci), i=1, 2, …, L, where c is the class label

1. Set i = 1

2. Present input pattern to the network

3. Calculate the distances or the dot products

4. Locate the spatial position of the winning neuron and assign

label c to that neuron

5. Increase i = i + 1 and continue while i ≤ L

Result: Calibrated SOM network

SOM: Interpretation
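A small sketch of this calibration step (my own naming; if several labelled patterns share a winner, the last one seen keeps the label):

```python
import numpy as np

def calibrate_som(W, grid, X_lab, c_lab):
    """Assign class labels to the neurons of a trained SOM: each labelled pattern
    is presented, its winning neuron is located, and that neuron's grid position
    receives the pattern's label."""
    labels = {}
    for x, c in zip(X_lab, c_lab):
        j = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # winner for pattern x
        labels[j] = c
    return {tuple(grid[j]): c for j, c in labels.items()}   # grid position -> label
```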


In practice, we do not optimize the SOM design; instead, we pay attention to these factors:

• Initialization of the weights

Often they are normalized, and can be set equal to the first several input vectors.

If they are not initialized to some input vectors then they should be grouped in one quadrant of the unit circle so that they can unfold to map the distribution of input data.

• Starting value of the learning rate and its decreasing schedule

• Structure of neighborhood kernel and its decreasing schedule

SOM: Issues


SOM: Example

Self-Organizing Feature Map (SOM)

Visualize the structure of highly dimensional data by mapping it onto a low-dimensional (typically two-dimensional) grid of linear neurons.

(Diagram: an input vector x = [x(1), x(2), …] from X feeding a p-row grid of neurons indexed by i and j.)


Clustering and Vector Quantization

(Diagram: encoder maps x to a codebook index i0; decoder recalls the prototype stored in the codebook under that index.)

Encoding: determine the best representative (prototype) of the codebook and store (transmit) its index i0,

i0 = arg min_i ||x − v_i||, where v_i is the ith prototype. Decoding: recall the best prototype given the transmitted index i0.
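The encoder/decoder pair amounts to a nearest-prototype search plus a table lookup. A minimal sketch with an assumed (c, n) codebook array:

```python
import numpy as np

def vq_encode(x, codebook):
    """Encoder: index i0 of the best-matching prototype, i0 = argmin_i ||x - v_i||."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def vq_decode(i0, codebook):
    """Decoder: recall the prototype stored under the transmitted index."""
    return codebook[i0]
```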


Cluster validity

• Using different clustering methods might result in different partitions of X, at each value of c

• Which clusterings are valid?

• It is plausible to expect "good" clusters at more than one value of c (2 ≤ c < n)

• How many clusters do exist in the data?


Cluster validity

• Some symbols: X = {x1, x2, …, xn}

U = [u_ik]: a c-partition of X is a set of (cn) values {u_ik} that can be conveniently arrayed as a c × n matrix. There are three sets of partition matrices.


Cluster Validity

• Classification of validity measures:

Direct measures: Davies-Bouldin index, Dunn's index

Indirect measures for fuzzy clusters: degree of separation, partition coefficient and partition entropy, Xie-Beni index


Cluster validity

• Cluster Error E(U) is associated with any U ∈ Mc; it is the number of vectors in X that are mislabeled by U

• E(U) is an absolute measure of cluster validity when X is labeled, and is undefined when X is not labeled.

E(U) = (1/2) Σ_{k=1}^{n} Σ_{i=1}^{c} ( û_ik − u_ik )²

where Û = [û_ik] is the reference (true) labeling of X.
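Assuming the reconstructed formula above and crisp membership matrices, the count of mislabelled vectors can be computed as follows (a sketch, with Û and U given as (c, n) 0/1 arrays):

```python
import numpy as np

def cluster_error(U_hat, U):
    """E(U): number of mislabelled vectors; each mislabelled vector differs from
    the reference partition U_hat in exactly two entries of its column."""
    return int(np.sum((U_hat - U) ** 2) / 2)
```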


Cluster validity

• Process X at c = 2, 3, …, n – 1

and record the optimal values of some criterion as a function of c

• The most valid clustering is taken as an extremum of this function (or some derivative of it)

Problem: many criterion functions usually have multiple local stationary points at fixed c, and global extrema are not necessarily the “best” c-partitions of the data


Cluster validity

A more formal approach is to pose the validity question in the framework of statistical hypothesis testing

• The major difficulty is that the sampling distribution is not known

• Nonetheless, goodness of fit statistics such as chi-square and Kolmogorov-Smirnov tests have been used


Cluster validity

The global minimum of Jw may suggest the "wrong" 2-clusters.

Example from Bezdek: n = 29 data vectors {xk} ⊂ R²; the "correct" 2-partition of X is shown on the left. The global minimum is hardly an attractive solution.

(The slide compares Jw for the "correct" hard 2-partition Ũ and a competing hard 2-partition Û as a function of the separation s between the two groups; once s is large enough, roughly s > 5.5, the competing partition attains the smaller value of Jw.)


Cluster validity

• Basic question:

what constitutes a “good” cluster ?

• What is a “cluster” anyhow?

The difficulty is that the data X, and every partition of X, are separated by the algorithm generating partition matrix U

(and defining “clusters” in the process)


Cluster validity

• Many of the algorithms ignore this fundamental difficulty and are thus heuristic in nature

• Some heuristic methods measure the amount of fuzziness in U, and presume the least fuzzy partitions to be the most valid


Degree of Separation - Bezdek

The degree of separation between fuzzy sets u1 and u2 is the scalar

Z(U; 2) = 1 − max_{k=1..n} [ min( u_1k, u_2k ) ]

and its generalization from 2 to c clusters is:

Z(U; c) = 1 − max_{k=1..n} [ min_{i=1..c} u_ik ],   U ∈ Mfc


Degree of Separation - Bezdek

Example: c=2 and two different fuzzy 2-partitions of X:

Z(U;2) = Z(V;2) = 0.50

U and V are very different so Z does not distinguish between the two partitions.

(U: the memberships of almost all patterns equal 1/2, i.e., U is nearly maximally fuzzy; V: the memberships of almost all patterns are crisp, i.e., 0 or 1.)


Partition Coefficient - Bezdek

U is a fuzzy c-partition of n data points. The partition coefficient, F, of U is the scalar:

F(U; c) = (1/n) Σ_{k=1}^{n} Σ_{i=1}^{c} u_ik²

The value of F(U;c) depends on all (c × n) elements of U, in contrast to Z(U;c), which depends on just one.


Partition Coefficient - Bezdek

Example: values of F on U and V partitions of X:

F(U; 2) = 0.510
F(V; 2) = 0.990

The value of F gives an accurate indication of the partition, for both the most uncertain and the most certain states.

(U and V as in the degree-of-separation example above.)


Partition Coefficient - Bezdek

(Figure from Bezdek.)


Partition Coefficient - Bezdek

Values of F(U;c) for c = 2, 3, 4, 5, 6 with the norms NE, ND and NM.

F first identifies a primary structure at c* = 2, and then a secondary structure at c = 3.

c    NE     ND     NM
2    0.88   0.89   0.56
3    0.78   0.80   0.46
4    0.71   0.71   0.36
5    0.66   0.68   0.31
6    0.60   0.62   0.37


Partition Entropy - Bezdek

Let {Ai | 1 ≤ i ≤ c} denote a c-partition of the events of any sample space connected with an experiment, and let

pT = (p1,p2, …, pc)

denote a probability vector associated with the {Ai}.

The pair ({Ai}, p) is called a finite probability scheme for the experiment

ith component of p is the probability of event Ai

c is called the length of the scheme

Note: c does NOT indicate the number of clusters


Partition Entropy - Bezdek

Our aim is to find a measure h(p) of the amount of uncertainty associated with each state.

• h(p) should be maximal for p = (1/c, …, 1/c)

• h(p) should be minimal for p = (0, 1, 0, …), i.e., for any statistically certain outcome


Partition Entropy - Bezdek

The entropy of the scheme is defined as:

h(p) = − Σ_{i=1}^{c} p_i log_a(p_i)

with the convention that p_i log_a(p_i) = 0 whenever p_i = 0


Partition Entropy - Bezdek

The partition entropy of any fuzzy c-partition U ∈ Mfc of X, where |X| = n, is, for 1 ≤ c ≤ n:

H(U; c) = −(1/n) Σ_{k=1}^{n} Σ_{i=1}^{c} u_ik log_a(u_ik)


Partition Entropy - Bezdek

Let U ∈ Mfc be a fuzzy c-partition of n data points. Then for 1 ≤ c ≤ n and a ∈ (1, ∞):

0 ≤ H(U;c) ≤ log_a(c)

H(U;c) = 0  ⇔  U ∈ Mco is hard (crisp)

H(U;c) = log_a(c)  ⇔  U = [1/c]


Partition Entropy - Bezdek

Example: entropy for U and V

H(U;c) = 49 loge(2)/51 = 0.665

H(V;c) = loge(2)/51 = 0.013

U is a very uncertain partition

(U and V as in the earlier example: U is nearly maximally fuzzy, V is nearly crisp.)
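Both indices are one-liners over the membership matrix. The sketch below uses hypothetical U and V built to mimic the example (the exact matrices on the slide are not fully recoverable, so the printed values are close to, but not exactly, the ones quoted):

```python
import numpy as np

def partition_coefficient(U):
    """F(U;c) = (1/n) * sum of squared memberships."""
    return float(np.sum(U ** 2) / U.shape[1])

def partition_entropy(U, a=np.e):
    """H(U;c) = -(1/n) * sum of u * log_a(u), with 0*log(0) taken as 0."""
    safe = np.where(U > 0, U, 1.0)                 # log(1) = 0 kills the zero terms
    return float(-np.sum(U * np.log(safe) / np.log(a)) / U.shape[1])

# Hypothetical 2 x 51 partitions in the spirit of the example:
U = np.full((2, 51), 0.5); U[:, -1] = [1.0, 0.0]        # almost maximally fuzzy
V = np.tile([[1.0], [0.0]], (1, 51)); V[:, -1] = 0.5     # almost crisp
print(partition_coefficient(U), partition_entropy(U))    # ~0.51, ~0.68
print(partition_coefficient(V), partition_entropy(V))    # ~0.99, ~0.014
```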


Partition Entropy - Bezdek

We assume that minimization of H(U;c) corresponds to maximizing the amount of information about structural memberships an algorithm extracted from data X.

H(U;c) is used for cluster validity as follows:

Ω_c denotes a finite set of "optimal" U's ⊂ Mfc, for c = 2, 3, …, n−1; the most valid c is taken as

min_{2 ≤ c ≤ n−1} { min_{U ∈ Ω_c} H(U; c) }


Partition Entropy - Bezdek

• Normalized partition entropy:

Reason: the variable ranges make interpretation of the values of Vpc and Vpe difficult, since they are not referenced to a fixed scale. For example:


Partition Entropy - Bezdek

Normalized partition entropy:

H̄(U; c) = H(U; c) / [ 1 − (c/n) ]


Cluster Validity

• Comments on the partition coefficient and partition entropy

Vpc maximizes (and Vpe minimizes) on every crisp c-partition of X. At the other extreme, Vpc takes its unique minimum (and Vpe its unique maximum) at the centroid of Mfc, U = [1/c], the "fuzziest" partition you can get, since it assigns every point in X to all c classes with equal membership values 1/c.


Cluster Validity

• Comments on the partition coefficient and partition entropy

Vpc and Vpe essentially measure the distance of U from being crisp by measuring the fuzziness in the rows of U.

All these two indices really measure is fuzziness relative to partitions that yield other values of the indices.

There are roughly ( ) crisp matrices in Mhcn, and Vpc is constantly 1 on them.


(Figure from Bezdek.)


Prototypes for FEATURE SELECTION (from Bezdek)

Symptom   v1j (Hernia)   v2j (Gallstones)   Absolute difference
1         0.57           0.27               0.30
2         0.98           0.67               0.31
3         0.06           0.93               0.87
4         0.22           0.55               0.33
5         0.17           0.10               0.07
6         0.77           0.84               0.07
7         0.42           0.05               0.37
8         0.39           0.84               0.45
9         0.48           0.04               0.44
10        0.02           0.16               0.14
11        0.12           0.25               0.13


Cluster Errors for FEATURE SELECTION (from Bezdek)

Symptoms used   Cluster Error E(U)
{1-11}          23
{3}             23
{3, 8}          23
{3, 9}          36
{3, 8, 9}       36


Cluster Validity

• Example



Cluster Validity

• Divisive Coefficient (DC)

Example: five objects a, b, c, d, e with the dissimilarity matrix

      a     b     c     d     e
a    0.0   2.0   6.0  10.0   9.0
b    2.0   0.0   5.0   9.0   8.0
c    6.0   5.0   0.0   4.0   5.0
d   10.0   9.0   4.0   0.0   3.0
e    9.0   8.0   5.0   3.0   0.0

Divisive clustering first splits {a,b,c,d,e} into {a,b} and {c,d,e} (diameter 10.0), then {c,d,e} into {c} and {d,e} (diameter 5.0), then {d,e} into {d} and {e} (diameter 3.0), and finally {a,b} into {a} and {b} (diameter 2.0); dendrogram scale: 10.0, 5.0, 3.0, 2.0, 0.0.

DC = (1/n) Σ_{i=1}^{c} l(i)


Cluster Validity

• Divisive Coefficient

For each object i, let d(i) denote the diameter of the last cluster to which it belongs (before being split off as a single object), divided by the diameter of the whole data set. The divisive coefficient (DC) is then given by

DC = (1/n) Σ_{i=1}^{n} d(i)


Cluster Validity

DC = (1/n) Σ_{i=1}^{c} l(i)

l(i) = [ 1 − ( md(j) − md(i) ) / md(0) ] · n(i),   with l(0) = 0

where:
i = current object (either a cluster or a singleton)
j = the cluster from which the current cluster i comes
md(i) = diameter of object i (maximum distance between any two of its sub-objects)
md(0) = diameter of the whole data set
n(i) = number of sub-objects of i
c = number of clusters being included; DC averages l(i) over the clusters of the current partition


Cluster Validity

• Divisive Coefficient (DC)

l(0) = 0
l(1) = 1.5   ({c,d,e}:  l(1) = (1 − (10 − 5)/10) · 3 = 1.5)
l(2) = 0.4   ({a,b}:    l(2) = (1 − (10 − 2)/10) · 2 = 0.4)
l(3) = 1.6   ({d,e})
l(4) = 0.5   ({c})
l(5) = 0.7   ({d})
l(6) = 0.7   ({e})
l(7) = 0.8   ({a})
l(8) = 0.8   ({b})

For the 2-cluster partition {c,d,e}, {a,b}:

DC_1 = ( l(1) + l(2) ) / 2 = (1.5 + 0.4) / 2 = 0.95


Cluster Validity

• Divisive Coefficient (DC)

For the 3-cluster partition {a,b}, {d,e}, {c}:

DC_2 = ( l(2) + l(3) + l(4) ) / 3 = (0.4 + 1.6 + 0.5) / 3 ≈ 0.83


Cluster Validity

• Divisive Coefficient (DC)

For the 4-cluster partition {a,b}, {c}, {d}, {e}:

DC_3 = ( l(2) + l(4) + l(5) + l(6) ) / 4 = (0.4 + 0.5 + 0.7 + 0.7) / 4 = 0.575


Cluster Validity

• Divisive Coefficient (DC)

For the 5-cluster partition {a}, {b}, {c}, {d}, {e}:

DC_4 = ( l(4) + l(5) + l(6) + l(7) + l(8) ) / 5 = (0.5 + 0.7 + 0.7 + 0.8 + 0.8) / 5 = 0.70


Cluster Validity

• Divisive Coefficient (DC)

Summary for the example dendrogram (splits at diameters 10.0, 5.0, 3.0, 2.0):

DC_1 = 0.95
DC_2 = 0.83
DC_3 = 0.575
DC_4 = 0.70
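The whole worked example can be verified with a few lines of code. This sketch hard-codes the reconstructed dissimilarity matrix and computes l(i) and DC for the 2-cluster partition; the helper names are mine:

```python
import numpy as np

names = ['a', 'b', 'c', 'd', 'e']
D = np.array([[ 0.,  2.,  6., 10.,  9.],
              [ 2.,  0.,  5.,  9.,  8.],
              [ 6.,  5.,  0.,  4.,  5.],
              [10.,  9.,  4.,  0.,  3.],
              [ 9.,  8.,  5.,  3.,  0.]])

def diameter(members):
    """md(i): maximum distance between any two members of a cluster."""
    idx = [names.index(m) for m in members]
    return float(D[np.ix_(idx, idx)].max())

def l_value(cluster, parent):
    """l(i) = [1 - (md(parent) - md(cluster)) / md(whole set)] * |cluster|."""
    return (1.0 - (diameter(parent) - diameter(cluster)) / diameter(names)) * len(cluster)

# Partition after the first split: {c,d,e} and {a,b}, both coming from the whole set.
parts = [(['c', 'd', 'e'], names), (['a', 'b'], names)]
DC1 = sum(l_value(c, p) for c, p in parts) / len(parts)
print(DC1)   # (1.5 + 0.4) / 2 = 0.95
```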


Cluster Validity

How to assess the quality of clusters?

How many clusters should be distinguished in the data?

Compactness: expresses how close the elements of a cluster are; quantified in terms of intra-cluster distances.

Separability: expresses how distinct the clusters are; quantified through inter-cluster distances.

Goal: high compactness and high separability.


Cluster Validity: Davies-Bouldin Index

Within-cluster scatter for the ith cluster:

s_i = (1/card(Ω_i)) Σ_{x ∈ Ω_i} ||x − v_i||²

Distance between the prototypes of the clusters:

d_ij = ||v_i − v_j||²

Define the ratio

r_i = max_{j, j≠i} ( s_i + s_j ) / d_ij

and then the sum

r = (1/c) Σ_{i=1}^{c} r_i

The “optimal”, “correct” number of the clusters (c) is the one for which the value of “r” attains its minimum.
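A direct transcription of these formulas (a sketch: X holds the patterns row-wise, labels_ the crisp cluster assignments 0..c-1, V the prototypes; clusters are assumed non-empty):

```python
import numpy as np

def davies_bouldin(X, labels_, V):
    """Average over clusters of the worst (s_i + s_j) / d_ij ratio, with s_i the
    mean squared distance to the prototype and d_ij the squared prototype distance."""
    c = V.shape[0]
    s = np.array([np.mean(np.sum((X[labels_ == i] - V[i]) ** 2, axis=1))
                  for i in range(c)])
    r = 0.0
    for i in range(c):
        r += max((s[i] + s[j]) / np.sum((V[i] - V[j]) ** 2)
                 for j in range(c) if j != i)
    return r / c
```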


Cluster Validity: Dunn Separation Index

Diameter of a cluster:

Δ(Ω_i) = max_{x,y ∈ Ω_i} ||x − y||

Inter-cluster distance:

δ(Ω_i, Ω_j) = min_{x ∈ Ω_i, y ∈ Ω_j} ||x − y||

r = min_{i} min_{j, j≠i} [ δ(Ω_i, Ω_j) / max_k Δ(Ω_k) ]

The preferred number of clusters is the one for which r attains its maximum.
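The same quantities in code (a sketch; `clusters` is assumed to be a list of (n_i, d) arrays, one per cluster):

```python
import numpy as np

def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    def delta(A, B):   # minimum distance between points of two different clusters
        return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
    def diam(A):       # maximum distance between points within one cluster
        return np.max(np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1))
    c = len(clusters)
    max_diam = max(diam(A) for A in clusters)
    min_sep = min(delta(clusters[i], clusters[j])
                  for i in range(c) for j in range(c) if i != j)
    return min_sep / max_diam
```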


Cluster Validity: Xie-Beni Index

r = [ Σ_{k=1}^{N} Σ_{i=1}^{c} u_ik^m ||x_k − v_i||² ] / [ N · min_{i≠j} ||v_i − v_j||² ]

The preferred partition is the one that achieves the lowest value of r.
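A sketch of the ratio above for a fuzzy partition (U is assumed to be the (c, N) membership matrix, V the (c, d) prototypes, m the fuzzifier):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Fuzzy within-cluster compactness divided by N times the smallest squared
    distance between prototypes; lower values indicate a better partition."""
    N, c = X.shape[0], V.shape[0]
    d2 = np.sum((X[None, :, :] - V[:, None, :]) ** 2, axis=-1)   # (c, N) squared distances
    compact = np.sum((U ** m) * d2)
    sep = min(np.sum((V[i] - V[j]) ** 2)
              for i in range(c) for j in range(c) if i != j)
    return compact / (N * sep)
```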


Cluster Validity: Fuzzy Clustering

Partition coefficient:

P1 = (1/N) Σ_{k=1}^{N} Σ_{i=1}^{c} u_ik²

Partition entropy:

P2 = −(1/N) Σ_{k=1}^{N} Σ_{i=1}^{c} u_ik log_a(u_ik)

Sugeno-Takagi:

P3 = Σ_{k=1}^{N} Σ_{i=1}^{c} u_ik^m ( ||x_k − v_i||² − ||v_i − v̄||² )


Random Sampling

(Diagram: the data set is randomly sampled, the sample is clustered, and the resulting prototypes are then clustered again.)

Two-phase clustering: (a) random sampling, and (b) clustering of the prototypes.


References

Anderberg MR. 1973. Cluster Analysis for Applications, Academic Press
Bezdek JC. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press
Devijver PA and Kittler J (eds.). 1987. Pattern Recognition Theory and Applications, Springer-Verlag
Dubes R. 1987. How many clusters are best? - an experiment. Pattern Recognition, 20, 6, 645-663
Duda RO, Hart PE and Stork DG. 2001. Pattern Classification, 2nd edition, J. Wiley
Dunn JC. 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. of Cybernetics, 3, 3, 32-57
Jain AK, Murty MN and Flynn PJ. 1999. Data clustering: a review, ACM Computing Surveys, 31, 3, 264-323
Kaufman L and Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, J. Wiley
Kohonen T. 1995. Self-Organizing Maps, Springer-Verlag
Sammon JW Jr. 1969. A nonlinear mapping for data structure analysis. IEEE Trans. on Computers, C-18, 5, 401-409
Xie XL and Beni G. 1991. A validity measure for fuzzy clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 13, 841-847
Webb A. 2002. Statistical Pattern Recognition, 2nd edition, J. Wiley