Lecture 22: Clustering Analysis

Hao Helen Zhang

Outline

Unsupervised Learning

  Introduction
  Various Learning Problems

Cluster Analysis

  Introduction
  Optimal Clustering

Various Clustering Algorithms

  K-means Algorithm
  K-medoids Algorithm
  Hierarchical Clustering

What Is Unsupervised Learning?

Learning patterns from data without a teacher.

Data: \{x_i\}_{i=1}^n. No y_i's are available.

Various unsupervised learning problems:

  cluster analysis
  dimension reduction
  density estimation (useful for low-dimensional problems)

In contrast to supervised learning, there is no clear measure of success for unsupervised learning.

About Supervised Learning

Learning with a teacher: given a training set \{(x_i, y_i)\}_{i=1}^n, learn the relationship between x and y.

Examples: regression, classification, ...

Once a model f(x) is obtained, one can use it for prediction: \hat{y} = f(x).

Measure of success: accuracy of prediction, e.g., the squared error loss

L(y, \hat{y}) = (y - \hat{y})^2.

"Student": the predicted value \hat{y}_i.
"Teacher": the correct answer and/or an error associated with the student's answer.

Cluster Analysis

Goal: to find multiple regions of the X-space that contain the modes of P(x).

Can we represent P(x) by a mixture of simpler densities, each representing a distinct type or class of observations?

[Figure: scatter plot of simulated two-dimensional data in the (x1, x2) plane, with both axes ranging from -1.5 to 1.5, showing distinct groups of points.]

Dimension Reduction

Goal: to describe the association among variables as a function of a smaller set of "latent" variables.

Principal component analysis (PCA)

Independent component analysis (ICA)

  Used for blind source separation. Assume X = AS, where S is a p-vector containing independent components. The goal is to identify A and S.

Multidimensional scaling (MDS)

Self-organizing maps (SOM)

Principal curves

...

Density Estimation

Goal: estimate the density of X_1, \ldots, X_p.

histogram

Gaussian mixture (provides a crude model)

kernel density estimation, smoothing splines

Challenge: density estimation can be infeasible for high-dimensional problems.

Cluster Analysis Introduction

Goal: group or segment the dataset (a collection of objects) into subsets, such that the objects within each subset are more closely related to one another than those assigned to different subsets.

Each subset is called a cluster.

There are two types of clustering, flat and hierarchical:

Flat clustering divides the dataset into k clusters.

Hierarchical clustering arranges the clusters into a natural hierarchy. It involves successively grouping the clusters themselves, so that at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.

Proximity and Dissimilarity Matrices

Clustering results depend crucially on the measure of similarity or distance between the "points" (objects) to be clustered.

One can use either a similarity or a dissimilarity matrix. Similarities can be converted to dissimilarities using a monotone-decreasing function.

Given two points (objects) x_i and x_{i'}, there are two levels of dissimilarity:

Attribute dissimilarity: d_j(x_{ij}, x_{i'j}) quantifies the dissimilarity in their j-th attribute.

Object dissimilarity: d(x_i, x_{i'}) quantifies the overall dissimilarity between the two points (objects).

Object Dissimilarities

A common choice of attribute dissimilarity is the squared distance

d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2.

The object dissimilarity can be quantified by the sum of the d_j's,

D(x_i, x_{i'}) = \sum_{j=1}^p d_j(x_{ij}, x_{i'j}),

or by a weighted sum

D(x_i, x_{i'}) = \sum_{j=1}^p w_j d_j(x_{ij}, x_{i'j}),  where \sum_{j=1}^p w_j = 1.

General distance-based dissimilarity:

d(x_i, x_{i'}) = L(\|x_i - x_{i'}\|).
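
As a small illustration, here is a sketch of the weighted object dissimilarity in R; the function name obj_dissim and the uniform default weights are my own choices for the example, not from the slides.

# Object dissimilarity as a weighted sum of squared attribute
# dissimilarities, with weights w_j summing to 1 (uniform by default).
obj_dissim <- function(xi, xip, w = rep(1 / length(xi), length(xi))) {
  stopifnot(length(xi) == length(xip), isTRUE(all.equal(sum(w), 1)))
  sum(w * (xi - xip)^2)
}

xi  <- c(1, 2, 3)
xip <- c(2, 0, 3)
obj_dissim(xi, xip)                          # uniform weights: (1 + 4 + 0)/3
obj_dissim(xi, xip, w = c(0.5, 0.25, 0.25))  # weighted: 0.5*1 + 0.25*4 = 1.5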

Correlation-Based Dissimilarity

Alternatively, the following correlation can be used to measure the similarity between subjects:

\rho(x_i, x_{i'}) = \frac{\sum_{j=1}^p (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'})}{\sqrt{\sum_{j=1}^p (x_{ij} - \bar{x}_i)^2 \sum_{j=1}^p (x_{i'j} - \bar{x}_{i'})^2}},

where \bar{x}_i = \sum_{j=1}^p x_{ij} / p is the average over variables (not observations).

If the inputs are standardized, then

\sum_{j=1}^p (x_{ij} - x_{i'j})^2 \propto 2(1 - \rho(x_i, x_{i'})).

In this case, clustering based on correlation (similarity) is equivalent to clustering based on squared distance (dissimilarity).
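
A quick numerical check of this equivalence in R (a sketch on simulated vectors; each observation is standardized over its own p variables, matching the definition of \bar{x}_i above):

# For observations standardized over their variables, the squared
# distance is proportional to 1 - correlation.
set.seed(1)
p   <- 10
std <- function(x) (x - mean(x)) / sd(x)
xi  <- std(rnorm(p))
xip <- std(rnorm(p))

sum((xi - xip)^2) / (1 - cor(xi, xip))  # the constant 2*(p - 1) = 18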

Proximity Matrix

The proximity (or alikeness, or affinity) matrix D = (d_{ik}):

The (i,k)-th element d_{ik} measures the proximity between the i-th and the k-th objects (observations).

D is of size n \times n.

D is typically symmetric; if it is not, (D + D')/2 can be used instead.

Criterion for Optimal Clustering

Each observation is uniquely labeled by an integer i \in \{1, 2, \ldots, n\}.

There are K clusters, indexed by k \in \{1, \ldots, K\}.

Define k = C(i): the i-th observation is assigned to the k-th cluster.

The optimal clustering C^* should minimize the total dissimilarity within clusters, i.e., we want to minimize

W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'}).

Within-Class and Between-Class Dissimilarities

Total dissimilarity:

T = \frac{1}{2} \sum_{i=1}^n \sum_{i'=1}^n d_{ii'} = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \Big( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \Big).

Between-cluster dissimilarity:

B(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}.

Hence W(C) = T - B(C). Since T is a constant that does not depend on the assignment, minimizing W(C) is equivalent to maximizing B(C).
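
A quick numerical check of the decomposition W(C) = T - B(C) in R, for an arbitrary dissimilarity matrix and an arbitrary assignment (a sketch; the data and the assignment are simulated):

set.seed(2)
n <- 30; K <- 3
d <- as.matrix(dist(matrix(rnorm(n * 2), n, 2)))  # symmetric d_ii'
C <- sample(1:K, n, replace = TRUE)               # an arbitrary assignment

same  <- outer(C, C, "==")       # TRUE when i and i' share a cluster
T_tot <- sum(d) / 2              # total dissimilarity T
W     <- sum(d[same]) / 2        # within-cluster dissimilarity W(C)
B     <- sum(d[!same]) / 2       # between-cluster dissimilarity B(C)
all.equal(W, T_tot - B)          # TRUE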

Combinatorial Algorithms

One needs to minimize W over all possible assignments of n points to K clusters.

The number of distinct assignments is

S(n, K) = \frac{1}{K!} \sum_{k=1}^K (-1)^{K-k} \binom{K}{k} k^n.

Enumeration is not feasible for large n and K; already S(19, 4) \approx 10^{10}.

This calls for more efficient algorithms, which may not find the optimal partition but a reasonably good suboptimal one.
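
The formula is easy to evaluate in R, which makes the combinatorial explosion concrete (S here is the Stirling number of the second kind):

# Number of distinct assignments of n points to K clusters.
S <- function(n, K) {
  k <- 1:K
  sum((-1)^(K - k) * choose(K, k) * k^n) / factorial(K)
}
S(10, 4)  # 34105
S(19, 4)  # about 1.1e10 -- already far too many to enumerate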

K-means Method

One of the most popular iterative descent clustering methods.

Works when all variables are quantitative.

Dissimilarity measure: the squared Euclidean distance

d(x_i, x_{i'}) = \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2.

The within-cluster scatter reduces to

W(C) = \sum_{k=1}^K n_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,

where \bar{x}_k is the mean vector of cluster k and n_k is the size of cluster k.

Optimization for K-means Clustering

Optimization problem:

C^* = \arg\min_C \sum_{k=1}^K n_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,

where

\bar{x}_S = \arg\min_m \sum_{i \in S} \|x_i - m\|^2.

This can be solved using an iterative descent algorithm. In fact, we solve the enlarged optimization problem

\min_{C, \{m_k\}_1^K} \sum_{k=1}^K n_k \sum_{C(i)=k} \|x_i - m_k\|^2.

K-means Algorithm

Start with initial guesses for the K cluster centers:

1. (Clustering step) Given the K centroids \{m_1, \ldots, m_K\}, solve for C by assigning each observation to the closest current cluster mean:

   C(i) = \arg\min_k \|x_i - m_k\|^2.

2. (Minimization step) For a given C, update the centers \{m_1, \ldots, m_K\}: set m_k to the mean of cluster k.

3. Iterate steps 1 and 2 until the assignments do not change.

This can be regarded as a top-down procedure for clustering analysis. (See http://en.wikipedia.org/wiki/K-means_clustering.)
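
As a sketch, these two steps can be written directly in base R. This toy implementation (the name my_kmeans is mine) uses random initial centroids and ignores the empty-cluster edge case; in practice one would call the built-in kmeans() shown later.

# Toy K-means: alternate the clustering step and the minimization
# step until the cluster assignments stop changing.
my_kmeans <- function(x, K, max_iter = 100) {
  m <- x[sample(nrow(x), K), , drop = FALSE]  # random initial centroids
  C <- rep(0L, nrow(x))
  for (t in 1:max_iter) {
    # Clustering step: distances from every point to every centroid
    d2 <- as.matrix(dist(rbind(m, x)))[-(1:K), 1:K]^2
    C_new <- max.col(-d2)                     # index of the closest centroid
    if (all(C_new == C)) break
    C <- C_new
    # Minimization step: each centroid becomes its cluster's mean
    # (empty-cluster handling omitted in this sketch)
    for (k in 1:K) m[k, ] <- colMeans(x[C == k, , drop = FALSE])
  }
  list(cluster = C, centers = m)
}

set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0), 20, 2),
           matrix(rnorm(40, mean = 3), 20, 2))
my_kmeans(x, K = 2)$centers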

[Figure 14.6 from Hastie, Tibshirani & Friedman (2001), Chapter 14: successive iterations of the K-means clustering algorithm on simulated data, with panels showing the initial centroids, the initial partition, and the partitions at iterations 2 and 20.]

About K-means Algorithm

K-means often produces good results.

The K-means algorithm is guaranteed to converge, since each of steps 1 and 2 reduces W(C).

The time complexity of K-means is O(tKn), where t is the number of iterations.

K-means finds a local optimum and may miss the global optimum.

Different starting values can lead to different clustering results. One should run the algorithm with many different random choices of starting means and keep the solution with the smallest value of W(C).

K-means does not work on categorical data and does not handle outliers well.
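
In R's built-in kmeans(), this multiple-restart strategy is available through the nstart argument:

set.seed(4)
x   <- matrix(rnorm(200), 100, 2)
fit <- kmeans(x, centers = 3, nstart = 25)  # 25 random starts; best one kept
fit$tot.withinss                            # total within-cluster sum of squares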

How to Choose K

In the K-means algorithm, one needs to specify K in advance.

K is a tuning parameter; an inappropriate choice of K may yield poor results.

Intuitively, one could try different values of K and evaluate W(C) on a test set. However, W(C) generally decreases as K increases, since a larger number of centers fills the feature space more densely and thus lies close to all data points. So cross-validation on a test set does not work here.

A heuristic approach: locate a kink in the plot of W(C) versus K and take that as the optimal K^*.

Alternatively, use the gap statistic (Tibshirani et al. 2001).
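
A minimal sketch of the kink heuristic in R, computing the within-cluster sum of squares over a grid of K; the three-group simulated data here are my own illustration:

set.seed(5)
x <- rbind(matrix(rnorm(100, mean = 0), 50, 2),
           matrix(rnorm(100, mean = 4), 50, 2),
           matrix(rnorm(100, mean = 8), 50, 2))  # three separated groups
Ks <- 1:8
W  <- sapply(Ks, function(K) kmeans(x, centers = K, nstart = 20)$tot.withinss)
plot(Ks, W, type = "b",
     xlab = "Number of clusters K", ylab = "Within-cluster sum of squares")
# The curve drops sharply up to K = 3 and then flattens:
# the kink suggests K* = 3.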

[Figure 14.8 from Hastie, Tibshirani & Friedman (2001), Chapter 14: total within-cluster sum of squares versus the number of clusters K (K = 2 to 10) for K-means clustering applied to the human tumor microarray data.]

Motivations of K-medoids Algorithm

The K-means algorithm:

assumes squared Euclidean distance in the minimization step;

requires all of the inputs to be of quantitative type.

The K-medoids algorithm generalizes K-means in the following ways:

it can use an arbitrarily defined dissimilarity D(x_i, x_{i'});

it can be applied to data described only by a proximity matrix.

Features of K-medoids Algorithm

Main ideas:

The center of each cluster is restricted to be one of the observations assigned to that cluster, so there is no need to explicitly compute cluster centers.

Updating the center of a cluster of size n_k costs O(n_k^2) computation. (For K-means, the corresponding cost is O(n_k).)

Given a set of cluster centers \{i_1, \ldots, i_K\}, obtaining the new assignments

C(i) = \arg\min_{1 \le k \le K} d_{i, i_k}

requires computation proportional to Kn.

Overall, K-medoids is far more computationally intensive than K-means.

K-medoids Algorithm

Start with initial guesses for the K cluster centers:

1. (Clustering step) Given the current centers \{m_1, \ldots, m_K\}, minimize the total error by assigning each observation to the closest center:

   C(i) = \arg\min_{1 \le k \le K} D(x_i, m_k).

2. (Minimization step) For a given C, find the observation in cluster k that minimizes the total distance to the other points in that cluster:

   i_k^* = \arg\min_{i: C(i)=k} \sum_{C(i')=k} D(x_i, x_{i'}).

3. Iterate steps 1 and 2 until the assignments do not change.

This is a heuristic search strategy for solving

\min_{C, \{i_k\}_1^K} \sum_{k=1}^K \sum_{C(i)=k} d_{i, i_k}.
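
A minimal sketch of these steps in base R, working directly from a dissimilarity matrix. The function name my_kmedoids is mine and the empty-cluster edge case is ignored; a refined version of this idea is implemented by pam() in the R package cluster.

# Toy K-medoids: centers are observation indices, and only the
# dissimilarity matrix d is needed (no coordinates, no means).
my_kmedoids <- function(d, K, max_iter = 100) {
  d   <- as.matrix(d)
  med <- sample(nrow(d), K)                  # initial medoids (indices)
  C   <- integer(nrow(d))
  for (t in 1:max_iter) {
    C <- max.col(-d[, med, drop = FALSE])    # assign to the closest medoid
    # New medoid of cluster k: the member with the smallest total
    # dissimilarity to the other members (empty clusters not handled)
    new_med <- sapply(1:K, function(k) {
      idx <- which(C == k)
      idx[which.min(colSums(d[idx, idx, drop = FALSE]))]
    })
    if (all(new_med == med)) break
    med <- new_med
  }
  list(medoids = med, cluster = C)
}

set.seed(6)
x <- rbind(matrix(rnorm(40, mean = 0), 20, 2),
           matrix(rnorm(40, mean = 3), 20, 2))
my_kmedoids(dist(x), K = 2)$medoids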

Hierarchical Clustering

K-means does not give a linear ordering of objects within a cluster. This motivates the idea of hierarchical clustering:

Produce hierarchical representations: the clusters at each level of the hierarchy are created by merging clusters at the next lower level.

At the lowest level, each cluster contains a single observation. At the highest level, there is a single cluster containing all observations.

The dendrogram is a highly informative and interpretable display of the clustering result.

Cutting the dendrogram horizontally at a particular height partitions the data into the disjoint clusters represented by the vertical lines it intersects.
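
A minimal sketch of building and cutting a dendrogram with R's built-in functions (simulated two-group data):

set.seed(7)
x  <- rbind(matrix(rnorm(40, mean = 0), 20, 2),
            matrix(rnorm(40, mean = 4), 20, 2))
hc <- hclust(dist(x), method = "average")  # agglomerative clustering
plot(hc)                                   # display the dendrogram
cutree(hc, k = 2)                          # cut into 2 disjoint clusters
cutree(hc, h = 3)                          # or cut at a given height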

[Figure 14.12 from Hastie, Tibshirani & Friedman (2001), Chapter 14: dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data; the leaves are labeled by sample type (CNS, RENAL, BREAST, NSCLC, OVARIAN, PROSTATE, MELANOMA, LEUKEMIA, COLON, K562, MCF7, UNKNOWN).]

Agglomerative Clustering

Two paradigms:

agglomerative (bottom-up)

divisive (top-down)

Key steps in agglomerative clustering:

Begin with every observation representing a singleton cluster.

At each step, merge the two "closest" clusters into one cluster, reducing the number of clusters by one.

This requires a measure of dissimilarity between two clusters.

Dissimilarity between Two Groups

In hierarchical clustering, it is important to define the dissimilarity between two clusters (groups) G and H:

d(G, H) is a function of the set of pairwise dissimilarities d_{ii'}, where x_i \in G and x_{i'} \in H.

Single linkage: the closest pair,

d_{SL}(G, H) = \min_{i \in G, i' \in H} d_{ii'}.

Complete linkage: the furthest pair,

d_{CL}(G, H) = \max_{i \in G, i' \in H} d_{ii'}.

Group average: the average dissimilarity,

d_{GA}(G, H) = \frac{1}{n_G n_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}.

Group average clustering has a statistical consistency property that single and complete linkage violate.
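
The three linkages correspond directly to the method argument of R's hclust(); a quick comparison sketch on simulated data:

set.seed(8)
d <- dist(matrix(rnorm(60), 30, 2))
hc_sl <- hclust(d, method = "single")    # d_SL: closest pair
hc_cl <- hclust(d, method = "complete")  # d_CL: furthest pair
hc_ga <- hclust(d, method = "average")   # d_GA: group average
par(mfrow = c(1, 3))
plot(hc_ga, main = "Average")
plot(hc_cl, main = "Complete")
plot(hc_sl, main = "Single")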

[Figure 14.13 from Hastie, Tibshirani & Friedman (2001), Chapter 14: dendrograms from agglomerative hierarchical clustering of the human tumor microarray data under average, complete, and single linkage.]

Human Tumor Microarray Data

An example of high-dimensional clustering.

The data form a 6830 × 64 matrix of real numbers.

Each row contains the expression measurements for one gene.

Each column is a sample (labeled by cancer type, e.g., "breast cancer" or "melanoma").

In the next plot, the genes (rows) and samples (columns) are arranged in orderings derived from hierarchical clustering, with the subtree containing the tighter cluster placed to the left (or bottom).

This two-way rearrangement produces a picture that is far more informative than displaying the data with randomly ordered rows and columns.
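
Base R's heatmap() function produces exactly this kind of two-way rearranged display; a sketch on simulated data, since the microarray data themselves are not included here:

# Cluster rows and columns independently, then display the
# reordered matrix (as in Figure 14.14, but on simulated data).
set.seed(9)
x <- matrix(rnorm(200), 20, 10)
heatmap(x, hclustfun = function(d) hclust(d, method = "average"))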

[Figure 14.14 from Hastie, Tibshirani & Friedman (2001), Chapter 14: DNA microarray data; average linkage hierarchical clustering has been applied independently to the rows (genes) and columns (samples), determining the ordering of the rows and columns. The colors range from bright green (negative, underexpressed) to bright red (positive, overexpressed).]

R functions

K-means: function kmeans in package stats.

Hierarchical clustering: function hclust in package stats.
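
A minimal end-to-end sketch using both functions on the same simulated data:

set.seed(10)
x  <- matrix(rnorm(100), 50, 2)
km <- kmeans(x, centers = 3, nstart = 20)   # flat clustering
hc <- hclust(dist(x), method = "complete")  # hierarchical clustering
table(km$cluster, cutree(hc, k = 3))        # cross-tabulate the two partitions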
