
Clustering Gene Expression Data

BMI/CS 576

www.biostat.wisc.edu/bmi576/

Colin Dewey

cdewey@biostat.wisc.edu

Fall 2010

Gene Expression Profiles

• we'll assume we have a 2D matrix of gene expression measurements
  – rows represent genes
  – columns represent different experiments, time points, individuals, etc. (what we can measure using one* microarray)

• we'll refer to individual rows or columns as profiles
  – a row is a profile for a gene

* Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, or individual.

Expression Profile Example

• rows represent human genes

• columns represent people with leukemia

Expression Profile Example

• rows represent yeast genes

• columns represent time points in a given experiment

Task Definition: Clustering Gene Expression Profiles

• given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent)

• do: organize profiles into clusters such that

– instances in the same cluster are highly similar to each other

– instances from different clusters have low similarity to each other

Motivation for Clustering

• exploratory data analysis

– understanding general characteristics of data

– visualizing data

• generalization

– infer something about an instance (e.g. a gene) based on how it relates to other instances

• everyone else is doing it

The Clustering Landscape

• there are many different clustering algorithms

• they differ along several dimensions
  – hierarchical vs. partitional (flat)
  – hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
  – disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
  – deterministic (same clusters produced every time for a given data set) vs. stochastic
  – distance (similarity) measure used

Distance/Similarity Measures

• many clustering methods employ a distance (similarity) measure to assess the distance between
  – a pair of instances
  – a cluster and an instance
  – a pair of clusters

• given a distance value, it is straightforward to convert it into a similarity value:

  sim(x, y) = \frac{1}{1 + dist(x, y)}

• not necessarily straightforward to go the other way

• we'll describe our algorithms in terms of distances

Distance Metrics

• properties of metrics
  – dist(x_i, x_j) ≥ 0                                      (non-negativity)
  – dist(x_i, x_j) = 0 if and only if x_i = x_j             (identity)
  – dist(x_i, x_j) = dist(x_j, x_i)                         (symmetry)
  – dist(x_i, x_j) ≤ dist(x_i, x_k) + dist(x_k, x_j)        (triangle inequality)

• some distance metrics
  – Manhattan:  dist(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|
  – Euclidean:  dist(x_i, x_j) = \sqrt{\sum_e (x_{i,e} - x_{j,e})^2}

  where e ranges over the individual measurements for x_i and x_j
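As a side note (not from the original slides), these two metrics are short one-liners in Python; the function names here are just illustrative:

import math

def manhattan(x, y):
    # sum of absolute differences over the measurements
    return sum(abs(xe - ye) for xe, ye in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared differences
    return math.sqrt(sum((xe - ye) ** 2 for xe, ye in zip(x, y)))

# example: two short expression profiles
print(manhattan([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))   # 2.5
print(euclidean([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))   # 1.5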

Hierarchical Clustering: A Dendrogram

• leaves represent instances (e.g. genes)

• height of bar indicates degree of distance within cluster

[figure: a dendrogram, with a distance scale on the vertical axis starting at 0]

Hierarchical Clustering of Expression Data

Hierarchical Clustering

• can do top-down (divisive) or bottom-up (agglomerative)

• in either case, we maintain a matrix of distance (or similarity) scores for all pairs of

– instances

– clusters (formed so far)

– instances and clusters

Bottom-Up Hierarchical Clustering

given: a set X = {x_1 ... x_n} of instances

for i := 1 to n do
    c_i := {x_i}                                 // each object is initially its own cluster, and a leaf in the tree
C := {c_1 ... c_n}
j := n
while |C| > 1 do
    j := j + 1
    (c_a, c_b) := \arg\min_{(c_u, c_v)} dist(c_u, c_v)     // find least distant pair in C
    c_j := c_a ∪ c_b                             // create a new cluster for the pair
    add a new node j to the tree joining c_a and c_b
    C := C − {c_a, c_b} ∪ {c_j}
return tree with root node j
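The following is a minimal Python sketch of this procedure (not the course's reference code); it assumes Manhattan distance between instances and single-link distance between clusters, and returns the tree as nested tuples of instance indices:

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def single_link(c1, c2, instances):
    # cluster distance = distance of the two closest members
    return min(manhattan(instances[i], instances[j]) for i in c1 for j in c2)

def bottom_up_cluster(instances):
    # each instance starts as its own cluster: (member indices, subtree)
    clusters = [((i,), i) for i in range(len(instances))]
    while len(clusters) > 1:
        # find the least distant pair of clusters
        a, b = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: single_link(clusters[p[0]][0], clusters[p[1]][0], instances))
        (ma, ta), (mb, tb) = clusters[a], clusters[b]
        merged = (ma + mb, (ta, tb))          # new internal node joins the pair
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][1]                     # root of the tree

# example: four 2-D profiles forming two tight pairs
print(bottom_up_cluster([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]))   # ((0, 1), (2, 3))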

Haven't We Already Seen This?

• this algorithm is very similar to UPGMA and neighbor joining; there are some differences

• what the tree represents
  – phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
  – clustering: inferred tree represents similarity of instances; internal nodes don't represent ancestors

• form of tree
  – UPGMA: rooted tree
  – neighbor joining: unrooted tree
  – hierarchical clustering: rooted tree

• how distances among clusters are calculated
  – UPGMA: average link
  – neighbor joining: based on additivity
  – hierarchical clustering: various

Distance Between Two Clusters

• the distance between two clusters can be determined in several ways

  – single link: distance of the two most similar instances
      dist(c_u, c_v) = \min\{ dist(a, b) \mid a \in c_u, b \in c_v \}

  – complete link: distance of the two least similar instances
      dist(c_u, c_v) = \max\{ dist(a, b) \mid a \in c_u, b \in c_v \}

  – average link: average distance between instances
      dist(c_u, c_v) = \mathrm{avg}\{ dist(a, b) \mid a \in c_u, b \in c_v \}

Complete-Link vs. Single-Link Distances

[figure: the complete-link and single-link distances between two clusters c_u and c_v]

Updating Distances Efficiently

• if we just merged c_u and c_v into c_j, we can determine the distance from c_j to each other cluster c_k as follows

  – single link:    dist(c_j, c_k) = \min\{ dist(c_u, c_k), dist(c_v, c_k) \}

  – complete link:  dist(c_j, c_k) = \max\{ dist(c_u, c_k), dist(c_v, c_k) \}

  – average link:   dist(c_j, c_k) = \frac{|c_u| \times dist(c_u, c_k) + |c_v| \times dist(c_v, c_k)}{|c_u| + |c_v|}
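As a sketch (my own variable and function names, not from the slides), the merge step can apply these update rules to a distance table instead of recomputing distances from the instances:

def update_distances(dist, sizes, u, v, j, clusters, linkage="average"):
    # dist:  dict mapping frozenset({a, b}) -> distance between clusters a and b
    # sizes: dict mapping cluster id -> number of instances in that cluster
    # clusters u and v were just merged into the new cluster j
    sizes[j] = sizes[u] + sizes[v]
    for k in clusters:
        if k in (u, v, j):
            continue
        du, dv = dist[frozenset((u, k))], dist[frozenset((v, k))]
        if linkage == "single":
            dist[frozenset((j, k))] = min(du, dv)
        elif linkage == "complete":
            dist[frozenset((j, k))] = max(du, dv)
        else:  # average link, weighted by cluster sizes
            dist[frozenset((j, k))] = (sizes[u] * du + sizes[v] * dv) / (sizes[u] + sizes[v])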

Computational Complexity

• the naïve implementation of hierarchical clustering has O(n^3) time complexity, where n is the number of instances

  – computing the initial distance matrix takes O(n^2) time

  – there are O(n) merging steps

  – on each step, we have to update the distance matrix (O(n)) and select the next pair of clusters to merge (O(n^2))

Computational Complexity

• for single-link clustering, we can update and pick the next pair in O(n) time, resulting in an O(n^2) algorithm

• for complete-link and average-link, we can do these steps in O(n log n) time, resulting in an O(n^2 log n) method

• see http://www-csli.stanford.edu/~schuetze/completelink.html for an improved and corrected discussion of the computational complexity of hierarchical clustering

Partitional Clustering

• divide instances into disjoint clusters

– flat vs. tree structure

• key issues

– how many clusters should there be?

– how should clusters be represented?

Partitional Clustering Example

[figure: a dendrogram with two possible cut levels — cutting at one level results in 2 clusters, cutting at the other results in 4 clusters]

Partitional Clustering from a Hierarchical Clustering

• we can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level

K-Means Clustering

• assume our instances are represented by vectors of real values

• put k cluster centers in the same space as the instances

• each cluster j is represented by a vector f_j (its cluster center)

• consider an example in which our vectors have 2 dimensions

[figure: instances and cluster centers plotted in a 2-D space]

K-Means Clustering

• each iteration involves two steps
  – assignment of instances to clusters
  – re-computation of the means

[figure: the assignment step followed by the re-computation of the means]

K-Means Clustering: Updating the Means

• for a set of instances that have been assigned to a cluster c_j, we re-compute the mean of the cluster as follows:

  \mu(c_j) = \frac{\sum_{x_i \in c_j} x_i}{|c_j|}

K-Means Clustering

given: a set X = {x_1 ... x_n} of instances

select k initial cluster centers f_1 ... f_k
while stopping criterion not true do
    for all clusters c_j do
        // determine which instances are assigned to this cluster
        c_j := { x_i | dist(x_i, f_j) < dist(x_i, f_l), ∀ l ≠ j }
    for all means f_j do
        // update the cluster center
        f_j := \mu(c_j)
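A short NumPy sketch of this loop (an illustration, not the course's reference implementation; it assumes Euclidean distance and picks the initial centers by sampling instances):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k distinct instances as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned instances
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stopping criterion: centers unchanged
            break
        centers = new_centers
    return centers, labels

# example: two well-separated groups of 2-D profiles
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(X, k=2))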

K-means Clustering Example

Given the following 4 instances and 2 clusters initialized as shown (see figure). Assume the distance function is

  dist(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|

Distances to the initial centers:

  dist(x_1, f_1) = 2      dist(x_1, f_2) = 5
  dist(x_2, f_1) = 2      dist(x_2, f_2) = 3
  dist(x_3, f_1) = 3      dist(x_3, f_2) = 2
  dist(x_4, f_1) = 11     dist(x_4, f_2) = 6

so x_1 and x_2 are assigned to cluster 1, and x_3 and x_4 to cluster 2. Re-computing the means:

  f_1 = \left( \frac{4 + 4}{2}, \frac{1 + 3}{2} \right) = (4, 2)          f_2 = \left( \frac{6 + 8}{2}, \frac{2 + 8}{2} \right) = (7, 5)

Distances to the updated centers:

  dist(x_1, f_1) = 1      dist(x_1, f_2) = 7
  dist(x_2, f_1) = 1      dist(x_2, f_2) = 5
  dist(x_3, f_1) = 2      dist(x_3, f_2) = 4
  dist(x_4, f_1) = 10     dist(x_4, f_2) = 4

K-means Clustering Example (Continued)

Now x_1, x_2, and x_3 are assigned to cluster 1, and x_4 to cluster 2. Re-computing the means:

  f_1 = \left( \frac{4 + 4 + 6}{3}, \frac{1 + 3 + 2}{3} \right) = (4.67, 2)          f_2 = \left( \frac{8}{1}, \frac{8}{1} \right) = (8, 8)

The assignments remain the same, so the procedure has converged.

EM Clustering

• in k-means as just described, instances are assigned to one and only one cluster

• we can do “soft” k-means clustering via an Expectation Maximization (EM) algorithm

– each cluster represented by a distribution (e.g. a Gaussian)

– E step: determine how likely it is that each cluster “generated” each instance

– M step: adjust cluster parameters to maximize likelihood of instances

Representation of Clusters

• in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian

  N_j(x_i) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\left[ -\frac{1}{2} (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \right]

  where \mu_j is the mean of the Gaussian and \Sigma_j is the covariance matrix

[figure: a representation of a Gaussian in a 2-D space]
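For reference (an aside, not part of the slides), this density is available directly in SciPy:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])             # mean of the Gaussian
sigma = np.array([[4.0, 0.0],         # covariance matrix (diagonal here,
                  [0.0, 4.0]])        #  variance 4 in each dimension)
print(multivariate_normal(mean=mu, cov=sigma).pdf([1.0, -1.0]))   # density at the point (1, -1)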

Cluster generation

• we model our data points with a generative process

• each point is generated by:
  – choosing a cluster j from {1, 2, ..., k} by sampling from a probability distribution over the clusters, with Prob(cluster j) = P_j
  – sampling a point from the Gaussian distribution N_j

EM Clustering

• the EM algorithm will try to set the parameters of the Gaussians, Θ, to maximize the log likelihood of the data, X:

  \log likelihood(X | \Theta) = \log \prod_{i=1}^{n} \Pr(x_i)
                              = \log \prod_{i=1}^{n} \sum_{j=1}^{k} P_j N_j(x_i)
                              = \sum_{i=1}^{n} \log \sum_{j=1}^{k} P_j N_j(x_i)

EM Clustering

• the parameters of the model, Θ, include the means, the covariance matrix, and sometimes prior weights for each Gaussian

• here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights

EM Clustering: Hidden Variables

• on each iteration of k-means clustering, we had to assign each instance to a cluster

• in the EM approach, we'll use hidden variables to represent this idea

• for each instance x_i we have a set of hidden variables Z_{i1}, ..., Z_{ik}

• we can think of Z_{ij} as being 1 if x_i is a member of cluster j and 0 otherwise

EM Clustering: the E-step

• recall that Z_{ij} is a hidden variable which is 1 if N_j generated x_i and 0 otherwise

• in the E-step, we compute h_{ij}, the expected value of this hidden variable:

  h_{ij} = E(Z_{ij} | x_i) = \Pr(Z_{ij} = 1 | x_i) = \frac{P_j N_j(x_i)}{\sum_{l=1}^{k} P_l N_l(x_i)}

[figure: an instance softly assigned to two clusters, with expected values 0.7 and 0.3]

EM Clustering: the M-step

• given the expected values h_{ij}, we re-estimate the means of the Gaussians and the cluster probabilities:

  \mu_j = \frac{\sum_{i=1}^{n} h_{ij} x_i}{\sum_{i=1}^{n} h_{ij}}          P_j = \frac{\sum_{i=1}^{n} h_{ij}}{n}

• we can also re-estimate the covariance matrix if we're varying it
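A compact one-dimensional sketch of these two steps (fixed variance, as assumed above; the function and parameter names are mine). Running it for a single iteration reproduces the numbers in the worked example that follows:

import numpy as np

def em_1d(x, mus, priors, var=4.0, n_iter=20):
    # x: 1-D data; mus, priors: initial means and cluster weights (length k)
    x, mus, priors = (np.asarray(a, dtype=float) for a in (x, mus, priors))
    for _ in range(n_iter):
        # E-step: h[i, j] = P_j N_j(x_i) / sum_l P_l N_l(x_i)
        dens = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        h = priors * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the means and the cluster probabilities
        mus = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)
        priors = h.sum(axis=0) / len(x)
    return mus, priors

mus, priors = em_1d([-4, -3, -1, 3, 5], mus=[0.0, 2.0], priors=[0.5, 0.5], n_iter=1)
print(mus, priors)   # approximately (-1.94, 2.73) and (0.58, 0.42)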

EM Clustering Example

Consider a one-dimensional clustering problem in which the data given are:

  x_1 = -4, x_2 = -3, x_3 = -1, x_4 = 3, x_5 = 5

The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is

  f(x, \mu) = \frac{1}{\sqrt{8\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{2} \right)^2 \right]

where \mu denotes the mean (center) of the Gaussian.

Initially, we set P_1 = P_2 = 0.5.

EM Clustering Example

[figure: the data points plotted along with the initial density curves of Gaussian 1 (mean 0) and Gaussian 2 (mean 2)]

  f(-4, \mu_1) = \frac{1}{\sqrt{8\pi}} e^{-\frac{1}{2}\left(\frac{-4 - 0}{2}\right)^2} = .0269        f(-4, \mu_2) = .0022
  f(-3, \mu_1) = .0646         f(-3, \mu_2) = .00874
  f(-1, \mu_1) = .176          f(-1, \mu_2) = .0646
  f(3, \mu_1)  = .0646         f(3, \mu_2)  = .176
  f(5, \mu_1)  = .00874        f(5, \mu_2)  = .0646

EM Clustering Example: E-step

  h_{11} = \frac{\frac{1}{2} f(x_1, \mu_1)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0269}{.0269 + .0022} = 0.924        h_{12} = \frac{\frac{1}{2} f(x_1, \mu_2)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0022}{.0269 + .0022} = 0.076

  h_{21} = \frac{.0646}{.0646 + .00874} = 0.881        h_{22} = \frac{.00874}{.0646 + .00874} = 0.119

  h_{31} = \frac{.176}{.176 + .0646} = 0.732           h_{32} = \frac{.0646}{.176 + .0646} = 0.268

  h_{41} = \frac{.0646}{.0646 + .176} = 0.268          h_{42} = \frac{.176}{.0646 + .176} = 0.732

  h_{51} = \frac{.00874}{.00874 + .0646} = 0.119       h_{52} = \frac{.0646}{.00874 + .0646} = 0.881

EM Clustering Example: M-step

  \mu_1 = \frac{\sum_i h_{i1} x_i}{\sum_i h_{i1}} = \frac{-4 \times .924 + -3 \times .881 + -1 \times .732 + 3 \times .268 + 5 \times .119}{.924 + .881 + .732 + .268 + .119} = -1.94

  \mu_2 = \frac{\sum_i h_{i2} x_i}{\sum_i h_{i2}} = \frac{-4 \times .076 + -3 \times .119 + -1 \times .268 + 3 \times .732 + 5 \times .881}{.076 + .119 + .268 + .732 + .881} = 2.73

  P_1 = \frac{\sum_i h_{i1}}{n} = \frac{.924 + .881 + .732 + .268 + .119}{5} = 0.58

  P_2 = \frac{\sum_i h_{i2}}{n} = \frac{.076 + .119 + .268 + .732 + .881}{5} = 0.42

[figure: the density curves after updating the means]

EM Clustering Example

[figure: the two Gaussian density curves over the data, before and after the first EM iteration]

• here we've shown just one step of the EM procedure

• we would continue the E- and M-steps until convergence

EM and K-Means Clustering

• both will converge to a local maximum

• both are sensitive to initial positions (means) of clusters

• have to choose value of k for both

Cross Validation to Select k

• with EM clustering, we can run the method with different values of k, and use cross validation to evaluate each clustering

• to evaluate a clustering, compute the log likelihood of the held-aside data:

  \sum_i \log \Pr(x_i)

• then run the method on all of the data once we've picked k
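One way to implement this check, using scikit-learn's Gaussian mixture EM (a library choice of mine, not specified in the slides):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_log_likelihood(X, k, n_folds=5, seed=0):
    # average per-instance log likelihood of the held-aside folds
    scores = []
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X[train_idx])
        scores.append(gm.score(X[test_idx]))   # mean log Pr(x_i) on held-aside data
    return np.mean(scores)

# pick the k whose clustering makes held-aside data most probable,
# then re-run EM on all of the data with that k
X = np.random.default_rng(0).normal(size=(200, 5))
best_k = max(range(2, 8), key=lambda k: cv_log_likelihood(X, k))
final_model = GaussianMixture(n_components=best_k, random_state=0).fit(X)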

Evaluating Clustering Results

• given random data without any “structure”, clustering algorithms will still return clusters

• the gold standard: do clusters correspond to natural categories?

• do clusters correspond to categories we care about? (there are lots of ways to partition the world)

Evaluating Clustering Results

• some approaches
  – external validation
    • e.g. do genes clustered together have some common function?
  – internal validation
    • how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
  – relative validation
    • how does it compare to other clusterings using these criteria?
    • e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?

Comments on Clustering

• there are many different ways to do clustering; we've discussed just a few methods

• hierarchical clusters may be more informative, but they’re more expensive to compute

• clusterings are hard to evaluate in many cases
