Clustering Gene Expression Data
BMI/CS 576
www.biostat.wisc.edu/bmi576/
Colin Dewey
Fall 2010
Gene Expression Profiles
• we'll assume we have a 2D matrix of gene expression measurements
– rows represent genes
– columns represent different experiments, time points, individuals, etc. (what we can measure using one* microarray)
• we'll refer to individual rows or columns as profiles
– a row is a profile for a gene
* Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, or individual.
Expression Profile Example
• rows represent human genes
• columns represent people with leukemia
Expression Profile Example
• rows represent yeast genes
• columns represent time points in a given experiment
Task Definition: Clustering Gene Expression Profiles
• given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent)
• do: organize profiles into clusters such that
– instances in the same cluster are highly similar to each other
– instances from different clusters have low similarity to each other
Motivation for Clustering
• exploratory data analysis
– understanding general characteristics of data
– visualizing data
• generalization
– infer something about an instance (e.g. a gene) based on how it relates to other instances
• everyone else is doing it
The Clustering Landscape
• there are many different clustering algorithms
• they differ along several dimensions
– hierarchical vs. partitional (flat)
– hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
– disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
– deterministic (same clusters produced every time for a given data set) vs. stochastic
– distance (similarity) measure used
Distance/Similarity Measures
• many clustering methods employ a distance (similarity) measure to assess the distance between
– a pair of instances
– a cluster and an instance
– a pair of clusters
• given a distance value, it is straightforward to convert it into a similarity value:

$$\mathrm{sim}(x, y) = \frac{1}{1 + \mathrm{dist}(x, y)}$$

• not necessarily straightforward to go the other way
• we'll describe our algorithms in terms of distances
Distance Metrics
• properties of metrics
– $\mathrm{dist}(x_i, x_j) \geq 0$ (non-negativity)
– $\mathrm{dist}(x_i, x_j) = 0$ if and only if $x_i = x_j$ (identity)
– $\mathrm{dist}(x_i, x_j) = \mathrm{dist}(x_j, x_i)$ (symmetry)
– $\mathrm{dist}(x_i, x_j) \leq \mathrm{dist}(x_i, x_k) + \mathrm{dist}(x_k, x_j)$ (triangle inequality)
• some distance metrics
– Manhattan: $\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$
– Euclidean: $\mathrm{dist}(x_i, x_j) = \sqrt{\sum_e (x_{i,e} - x_{j,e})^2}$
where $e$ ranges over the individual measurements for $x_i$ and $x_j$
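To make these concrete, here is a minimal Python sketch (our own illustration, not from the slides) of the two metrics and the distance-to-similarity conversion above; the function names are ours:

```python
import numpy as np

def manhattan(xi, xj):
    # sum of absolute differences over the individual measurements e
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def similarity(xi, xj, dist=euclidean):
    # the distance-to-similarity conversion from the previous slide
    return 1.0 / (1.0 + dist(xi, xj))
```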
Hierarchical Clustering: A Dendrogram
leaves represent instances (e.g. genes)
height of bar indicates degree of distance within cluster
(figure: a dendrogram; the vertical axis is a distance scale, starting at 0)
Hierarchical Clustering of Expression Data
Hierarchical Clustering
• can do top-down (divisive) or bottom-up (agglomerative)
• in either case, we maintain a matrix of distance (or similarity) scores for all pairs of
– instances
– clusters (formed so far)
– instances and clusters
Bottom-Up Hierarchical Clustering

given: a set $X = \{x_1 \ldots x_n\}$ of instances
for $i$ := 1 to $n$ do
    $c_i := \{x_i\}$    // each object is initially its own cluster, and a leaf in the tree
$C := \{c_1 \ldots c_n\}$
$j := n$
while $|C| > 1$ do
    $j := j + 1$
    $(c_a, c_b) := \operatorname{argmin}_{(c_u, c_v)} \mathrm{dist}(c_u, c_v)$    // find the least distant pair in $C$
    $c_j := c_a \cup c_b$    // create a new cluster for the pair
    add a new node $j$ to the tree, joining $c_a$ and $c_b$
    $C := (C - \{c_a, c_b\}) \cup \{c_j\}$
return tree with root node $j$
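As an illustration (our own sketch, not the course's code), here is a direct Python translation of the algorithm above; it assumes a `cluster_dist(X, cu, cv)` function such as the linkage rules on the next slide, and records the tree as a list of merge tuples:

```python
from itertools import combinations

def agglomerate(X, cluster_dist):
    # each instance starts as its own cluster (cluster id -> list of instance indices)
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        # find the least distant pair of current clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_dist(X, clusters[pair[0]], clusters[pair[1]]))
        # create a new cluster for the pair and record the tree node joining a and b
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((next_id, a, b))
        next_id += 1
    return merges   # the last entry is the root of the tree
```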
Haven't We Already Seen This?
• this algorithm is very similar to UPGMA and neighbor joining; there are some differences
• what the tree represents
– phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
– clustering: inferred tree represents similarity of instances; internal nodes don't represent ancestors
• form of tree
– UPGMA: rooted tree
– neighbor joining: unrooted
– hierarchical clustering: rooted tree
• how distances among clusters are calculated
– UPGMA: average link
– neighbor joining: based on additivity
– hierarchical clustering: various
Distance Between Two Clusters
• the distance between two clusters can be determined in several ways
– single link: distance of two most similar instances
  $\mathrm{dist}(c_u, c_v) = \min\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
– complete link: distance of two least similar instances
  $\mathrm{dist}(c_u, c_v) = \max\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
– average link: average distance between instances
  $\mathrm{dist}(c_u, c_v) = \operatorname{avg}\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
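These three rules might be sketched in Python as follows (our own helper names; `cu` and `cv` are lists of instance indices, and `dist` is an instance-level metric such as `euclidean()` above):

```python
def single_link(X, cu, cv, dist):
    # distance of the two most similar (closest) instances
    return min(dist(X[a], X[b]) for a in cu for b in cv)

def complete_link(X, cu, cv, dist):
    # distance of the two least similar (farthest) instances
    return max(dist(X[a], X[b]) for a in cu for b in cv)

def average_link(X, cu, cv, dist):
    # average distance over all cross-cluster pairs
    return sum(dist(X[a], X[b]) for a in cu for b in cv) / (len(cu) * len(cv))
```

Any of these can be plugged into the sketch above, e.g. `agglomerate(X, lambda X, cu, cv: average_link(X, cu, cv, euclidean))`.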
Complete-Link vs. Single-Link Distances
(figure: between clusters $c_u$ and $c_v$, complete link uses the farthest pair of instances, single link the closest pair)
Updating Distances Efficiently
• if we just merged $c_u$ and $c_v$ into $c_j$, we can determine the distance from $c_j$ to each other cluster $c_k$ as follows
– single link: $\mathrm{dist}(c_j, c_k) = \min\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– complete link: $\mathrm{dist}(c_j, c_k) = \max\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– average link: $\mathrm{dist}(c_j, c_k) = \dfrac{|c_u| \times \mathrm{dist}(c_u, c_k) + |c_v| \times \mathrm{dist}(c_v, c_k)}{|c_u| + |c_v|}$
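A sketch of these constant-time update rules; `d_uk` and `d_vk` stand for the stored distances $\mathrm{dist}(c_u, c_k)$ and $\mathrm{dist}(c_v, c_k)$, and `n_u`, `n_v` for the cluster sizes (the names are ours):

```python
def update_single(d_uk, d_vk):
    # single link: new cluster inherits the smaller of the two distances
    return min(d_uk, d_vk)

def update_complete(d_uk, d_vk):
    # complete link: new cluster inherits the larger of the two distances
    return max(d_uk, d_vk)

def update_average(d_uk, d_vk, n_u, n_v):
    # average link: size-weighted average of the two stored distances
    return (n_u * d_uk + n_v * d_vk) / (n_u + n_v)
```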
Computational Complexity
• the naïve implementation of hierarchical clustering has $O(n^3)$ time complexity, where $n$ is the number of instances
– computing the initial distance matrix takes $O(n^2)$ time
– there are $O(n)$ merging steps
– on each step, we have to update the distance matrix ($O(n)$) and select the next pair of clusters to merge ($O(n^2)$)
Computational Complexity
• for single-link clustering, we can update and pick the next pair in $O(n)$ time, resulting in an $O(n^2)$ algorithm
• for complete-link and average-link we can do these steps in $O(n \log n)$ time, resulting in an $O(n^2 \log n)$ method
• see http://www-csli.stanford.edu/~schuetze/completelink.html for an improved and corrected discussion of the computational complexity of hierarchical clustering
Partitional Clustering
• divide instances into disjoint clusters
– flat vs. tree structure
• key issues
– how many clusters should there be?
– how should clusters be represented?
Partitional Clustering Example
(figure: a dendrogram with two horizontal cut levels; cutting at the higher level results in 2 clusters, cutting at the lower level results in 4 clusters)
Partitional Clustering from a Hierarchical Clustering
• we can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level
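In practice, one way to do this (an illustration, not part of the slides) is with SciPy, which builds the hierarchical clustering with `linkage` and "cuts" the tree with `fcluster`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)            # e.g. 20 profiles with 5 measurements each
Z = linkage(X, method='average')     # average-link hierarchical clustering
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 flat clusters
```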
K-Means Clustering
• assume our instances are represented by vectors of real values
• put $k$ cluster centers in the same space as the instances
• each cluster $j$ is represented by a vector $\vec{f}_j$
• consider an example in which our vectors have 2 dimensions
(figure: instances shown as points, cluster centers as +'s)
K-Means Clustering
• each iteration involves two steps
– assignment of instances to clusters
– re-computation of the means
(figure: two panels showing one iteration: instances are assigned to the nearest centers, then the means are re-computed)
K-Means Clustering: Updating the Means
• for a set of instances that have been assigned to a cluster $c_j$, we re-compute the mean of the cluster as follows:

$$\mu(c_j) = \frac{\sum_{\vec{x}_i \in c_j} \vec{x}_i}{|c_j|}$$
K-Means Clustering

given: a set $X = \{\vec{x}_1 \ldots \vec{x}_n\}$ of instances
select $k$ initial cluster centers $\vec{f}_1 \ldots \vec{f}_k$
while stopping criterion not true do
    for all clusters $c_j$ do    // determine which instances are assigned to this cluster
        $c_j = \{\vec{x}_i \mid \mathrm{dist}(\vec{x}_i, \vec{f}_j) < \mathrm{dist}(\vec{x}_i, \vec{f}_l),\ \forall\, l \neq j\}$
    for all means $\vec{f}_j$ do    // update the cluster center
        $\vec{f}_j = \mu(c_j)$
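A minimal NumPy sketch of this algorithm (our own illustration, assuming Euclidean distances and stopping when the assignments stabilize):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # select k initial cluster centers (here: k random instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break                                  # assignments stable: converged
        assignments = new_assignments
        # re-computation step: each center becomes the mean of its cluster
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = X[assignments == j].mean(axis=0)
    return centers, assignments
```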
K-means Clustering Example

Given the following 4 instances and 2 clusters initialized as shown in the figure: $x_1 = (4, 1)$, $x_2 = (4, 3)$, $x_3 = (6, 2)$, $x_4 = (8, 8)$, with initial centers $f_1 = (3, 2)$ and $f_2 = (6, 4)$. Assume the distance function is

$$\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$$

The distances from each instance to each center are:

$\mathrm{dist}(x_1, f_1) = 2$,  $\mathrm{dist}(x_1, f_2) = 5$
$\mathrm{dist}(x_2, f_1) = 2$,  $\mathrm{dist}(x_2, f_2) = 3$
$\mathrm{dist}(x_3, f_1) = 3$,  $\mathrm{dist}(x_3, f_2) = 2$
$\mathrm{dist}(x_4, f_1) = 11$,  $\mathrm{dist}(x_4, f_2) = 6$

so $x_1$ and $x_2$ are assigned to cluster 1, and $x_3$ and $x_4$ to cluster 2. Re-computing the means:

$$f_1 = \left( \frac{4+4}{2}, \frac{1+3}{2} \right) = (4, 2) \qquad f_2 = \left( \frac{6+8}{2}, \frac{2+8}{2} \right) = (7, 5)$$

The distances to the new means are:

$\mathrm{dist}(x_1, f_1) = 1$,  $\mathrm{dist}(x_1, f_2) = 7$
$\mathrm{dist}(x_2, f_1) = 1$,  $\mathrm{dist}(x_2, f_2) = 5$
$\mathrm{dist}(x_3, f_1) = 2$,  $\mathrm{dist}(x_3, f_2) = 4$
$\mathrm{dist}(x_4, f_1) = 10$,  $\mathrm{dist}(x_4, f_2) = 4$
K-means Clustering Example (Continued)

Now $x_1$, $x_2$, and $x_3$ belong to cluster 1 and $x_4$ to cluster 2. Re-computing the means:

$$f_1 = \left( \frac{4+4+6}{3}, \frac{1+3+2}{3} \right) = (4.67, 2) \qquad f_2 = \left( \frac{8}{1}, \frac{8}{1} \right) = (8, 8)$$

The assignments remain the same, so the procedure has converged.
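As a check, the whole example can be replayed in a few lines of NumPy; the instance coordinates below are read off the slide's figure (they are consistent with all of the Manhattan distances shown above):

```python
import numpy as np

X = np.array([[4., 1.], [4., 3.], [6., 2.], [8., 8.]])   # x1..x4
f = np.array([[3., 2.], [6., 4.]])                        # initial centers f1, f2

for _ in range(2):                                        # two iterations suffice here
    d = np.abs(X[:, None, :] - f[None, :, :]).sum(axis=2) # Manhattan distances
    assign = d.argmin(axis=1)                             # assignment step
    f = np.array([X[assign == j].mean(axis=0) for j in range(2)])  # mean update

print(assign)   # [0 0 0 1]: x1, x2, x3 in cluster 1; x4 in cluster 2
print(f)        # approximately [[4.67, 2.], [8., 8.]]
```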
EM Clustering
• in k-means as just described, instances are assigned to one and only one cluster
• we can do “soft” k-means clustering via an Expectation Maximization (EM) algorithm
– each cluster represented by a distribution (e.g. a Gaussian)
– E step: determine how likely it is that each cluster "generated" each instance
– M step: adjust cluster parameters to maximize likelihood of instances
Representation of Clusters
• in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian

$$N_j(\vec{x}_i) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}} \exp\left[ -\frac{1}{2} (\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j) \right]$$

where
$\vec{\mu}_j$ is the mean of the Gaussian
$\Sigma_j$ is the covariance matrix
(figure: a representation of a Gaussian in a 2-D space)
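For illustration, this density can be evaluated with SciPy's `multivariate_normal`, which implements this formula; the mean, covariance, and query point below are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu_j = np.array([0.0, 0.0])          # cluster mean (assumed values)
sigma_j = np.array([[1.0, 0.3],
                    [0.3, 1.0]])     # covariance matrix (assumed values)
x_i = np.array([0.5, -0.5])          # a query instance

print(multivariate_normal.pdf(x_i, mean=mu_j, cov=sigma_j))   # N_j(x_i)
```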
Cluster Generation
• we model our data points with a generative process
• each point is generated by:
– choosing a cluster $j \in \{1, 2, \ldots, k\}$ by sampling from a probability distribution over the clusters, with $\Pr(\text{cluster } j) = P_j$
– sampling a point from the Gaussian distribution $N_j$
EM Clustering
• the EM algorithm will try to set the parameters of the Gaussians, $\Theta$, to maximize the log likelihood of the data, $X$

$$\log \mathrm{likelihood}(X \mid \Theta) = \log \prod_{i=1}^{n} \Pr(\vec{x}_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} P_j N_j(\vec{x}_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} P_j N_j(\vec{x}_i)$$
EM Clustering
• the parameters of the model, $\Theta$, include the means, the covariance matrix, and sometimes prior weights for each Gaussian
• here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights
EM Clustering: Hidden Variables
• on each iteration of k-means clustering, we had to assign each instance to a cluster
• in the EM approach, we'll use hidden variables to represent this idea
• for each instance $\vec{x}_i$ we have a set of hidden variables $Z_{i1}, \ldots, Z_{ik}$
• we can think of $Z_{ij}$ as being 1 if $\vec{x}_i$ is a member of cluster $j$ and 0 otherwise
EM Clustering: the E-step
• recall that $Z_{ij}$ is a hidden variable which is 1 if $N_j$ generated $\vec{x}_i$ and 0 otherwise
• in the E-step, we compute $h_{ij}$, the expected value of this hidden variable

$$h_{ij} = E(Z_{ij} \mid \vec{x}_i) = \Pr(Z_{ij} = 1 \mid \vec{x}_i) = \frac{P_j N_j(\vec{x}_i)}{\sum_{l=1}^{k} P_l N_l(\vec{x}_i)}$$

(figure: unlike the hard assignment in k-means, an instance receives a soft assignment, e.g. 0.3 to one cluster and 0.7 to another)
EM Clustering: the M-step
• given the expected values $h_{ij}$, we re-estimate the means of the Gaussians and the cluster probabilities

$$\vec{\mu}_j = \frac{\sum_{i=1}^{n} h_{ij} \vec{x}_i}{\sum_{i=1}^{n} h_{ij}} \qquad P_j = \frac{\sum_{i=1}^{n} h_{ij}}{n}$$

• can also re-estimate the covariance matrix if we're varying it
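Putting the E- and M-steps together, here is a minimal EM sketch (our own illustration) under the earlier slide's assumption of a fixed covariance, fitting only the means and the prior weights:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, mu, P, cov, n_iters=20):
    # X: (n, m) instances; mu: k initial mean vectors; P: k prior weights;
    # cov: fixed covariance matrix shared by all k Gaussians
    n, k = len(X), len(mu)
    for _ in range(n_iters):
        # E-step: h[i, j] = P_j N_j(x_i) / sum_l P_l N_l(x_i)
        h = np.column_stack([P[j] * multivariate_normal.pdf(X, mean=mu[j], cov=cov)
                             for j in range(k)])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and cluster priors
        mu = np.array([(h[:, j, None] * X).sum(axis=0) / h[:, j].sum()
                       for j in range(k)])
        P = h.sum(axis=0) / n
    return mu, P, h
```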
EM Clustering Example

Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5

The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:

$$f(x, \mu) = \frac{1}{\sqrt{8\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{2} \right)^2}$$

where $\mu$ denotes the mean (center) of the Gaussian. Initially, we set $P_1 = P_2 = 0.5$.
EM Clustering Example

(plot: the two initial Gaussians, centered at 0 and 2, with the data points marked on the x-axis)

The density of each data point under each Gaussian is:

$f(-4, \mu_1) = \frac{1}{\sqrt{8\pi}}\, e^{-\frac{1}{2} \left( \frac{-4 - 0}{2} \right)^2} = .0269$,  $f(-4, \mu_2) = .0022$
$f(-3, \mu_1) = .0646$,  $f(-3, \mu_2) = .00874$
$f(-1, \mu_1) = .176$,  $f(-1, \mu_2) = .0646$
$f(3, \mu_1) = .0646$,  $f(3, \mu_2) = .176$
$f(5, \mu_1) = .00874$,  $f(5, \mu_2) = .0646$
EM Clustering Example: E Step

$$h_{11} = \frac{\frac{1}{2} f(x_1, \mu_1)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0269}{.0269 + .0022} = 0.924 \qquad h_{12} = \frac{\frac{1}{2} f(x_1, \mu_2)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0022}{.0269 + .0022} = 0.076$$

$$h_{21} = \frac{.0646}{.0646 + .00874} = 0.881 \qquad h_{22} = \frac{.00874}{.0646 + .00874} = 0.119$$

$$h_{31} = \frac{.176}{.176 + .0646} = 0.732 \qquad h_{32} = \frac{.0646}{.176 + .0646} = 0.268$$

$$h_{41} = \frac{.0646}{.0646 + .176} = 0.268 \qquad h_{42} = \frac{.176}{.0646 + .176} = 0.732$$

$$h_{51} = \frac{.00874}{.00874 + .0646} = 0.119 \qquad h_{52} = \frac{.0646}{.00874 + .0646} = 0.881$$
EM Clustering Example: M-step

$$\mu_1 = \frac{\sum_i x_i h_{i1}}{\sum_i h_{i1}} = \frac{-4 \times .924 + -3 \times .881 + -1 \times .732 + 3 \times .268 + 5 \times .119}{.924 + .881 + .732 + .268 + .119} = -1.94$$

$$\mu_2 = \frac{\sum_i x_i h_{i2}}{\sum_i h_{i2}} = \frac{-4 \times .076 + -3 \times .119 + -1 \times .268 + 3 \times .732 + 5 \times .881}{.076 + .119 + .268 + .732 + .881} = 2.73$$

$$P_1 = \frac{\sum_i h_{i1}}{n} = \frac{.924 + .881 + .732 + .268 + .119}{5} = 0.58 \qquad P_2 = \frac{\sum_i h_{i2}}{n} = \frac{.076 + .119 + .268 + .732 + .881}{5} = 0.42$$

(plot: the two Gaussians after updating the means)
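These numbers are easy to check with a short script (a verification sketch, not part of the slides):

```python
import numpy as np

x = np.array([-4., -3., -1., 3., 5.])
mu, P = np.array([0., 2.]), np.array([0.5, 0.5])   # initial means and priors

# densities under each Gaussian (variance 4, i.e. sigma = 2)
f = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / 2.0) ** 2) / np.sqrt(8 * np.pi)

h = P * f
h /= h.sum(axis=1, keepdims=True)                        # E-step: h[i, j]
mu_new = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)    # M-step: new means
P_new = h.mean(axis=0)                                   # M-step: new priors

print(h[:, 0])   # approx [0.924, 0.881, 0.732, 0.268, 0.119]
print(mu_new)    # approx [-1.94, 2.73]
print(P_new)     # approx [0.58, 0.42]
```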
EM Clustering Example

(plots: the mixture before and after the first EM iteration; the Gaussian means have moved apart, toward the two groups of data points)
• here we’ve shown just one step of the EM procedure
• we would continue the E- and M-steps until convergence
EM and K-Means Clustering
• both will converge to a local maximum
• both are sensitive to initial positions (means) of clusters
• have to choose value of k for both
Cross Validation to Select k
• with EM clustering, we can run the method with different values of k, using CV to evaluate each resulting clustering
• to evaluate a clustering, compute the log likelihood of the held-aside data:

$$\sum_i \log \Pr(x_i)$$

• then run the method on all the data once we've picked k
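A sketch of this procedure; `fit_em` and `log_prob` are hypothetical stand-ins for an EM trainer and a held-out scorer, and `X` is assumed to be a NumPy array of instances:

```python
import numpy as np

def cv_select_k(X, k_values, n_folds, fit_em, log_prob):
    # fit_em(X_train, k) -> model;  log_prob(model, X_test) -> sum_i log Pr(x_i)
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    scores = {}
    for k in k_values:
        total = 0.0
        for f in range(n_folds):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            model = fit_em(X[train], k)
            total += log_prob(model, X[test])   # held-aside log likelihood
        scores[k] = total
    return max(scores, key=scores.get)          # k with the best held-out score
```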
Evaluating Clustering Results
• given random data without any “structure”, clustering algorithms will still return clusters
• the gold standard: do clusters correspond to natural categories?
• do clusters correspond to categories we care about? (there are lots of ways to partition the world)
Evaluating Clustering Results
• some approaches
– external validation
  • e.g. do genes clustered together have some common function?
– internal validation
  • how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
– relative validation
  • how does it compare to other clusterings using these criteria?
  • e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?
Comments on Clustering
• there are many different ways to do clustering; we've discussed just a few methods
• hierarchical clusterings may be more informative, but they're more expensive to compute
• clusterings are hard to evaluate in many cases