
Page 1

Clustering Gene Expression Data

BMI/CS 576

www.biostat.wisc.edu/bmi576/

Colin Dewey

cdewey@biostat.wisc.edu

Fall 2010

Page 2

Gene Expression Profiles

• we'll assume we have a 2D matrix of gene expression measurements
  – rows represent genes
  – columns represent different experiments, time points, individuals, etc. (what we can measure using one* microarray)
• we'll refer to individual rows or columns as profiles
  – a row is a profile for a gene

* Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, or individual.
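To make the layout concrete, here is a minimal NumPy sketch; the gene names and expression values are made up for illustration:

    import numpy as np

    # rows = genes, columns = experiments / time points / individuals
    # (hypothetical yeast gene names and made-up values)
    genes = ["YAL001C", "YAL002W", "YAL003W"]
    X = np.array([[ 0.12, -0.48,  1.03,  0.77],
                  [-1.20,  0.35,  0.40, -0.06],
                  [ 0.88,  0.91, -0.22, -1.35]])

    gene_profile = X[0, :]        # a row: the profile for gene "YAL001C"
    column_profile = X[:, 2]      # a column: the profile for experiment 3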

Page 3

Expression Profile Example

• rows represent human genes

• columns represent people with leukemia

Page 4

Expression Profile Example

• rows represent yeast genes

• columns represent time points in a given experiment

Page 5

Task Definition: Clustering Gene Expression Profiles

• given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent)

• do: organize profiles into clusters such that

– instances in the same cluster are highly similar to each other

– instances from different clusters have low similarity to each other

Page 6

Motivation for Clustering

• exploratory data analysis

– understanding general characteristics of data

– visualizing data

• generalization

– infer something about an instance (e.g. a gene) based on how it relates to other instances

• everyone else is doing it

Page 7

The Clustering Landscape

• there are many different clustering algorithms
• they differ along several dimensions
  – hierarchical vs. partitional (flat)
  – hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
  – disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
  – deterministic (same clusters produced every time for a given data set) vs. stochastic
  – distance (similarity) measure used

Page 8

Distance/Similarity Measures

• many clustering methods employ a distance (similarity) measure to assess the distance between
  – a pair of instances
  – a cluster and an instance
  – a pair of clusters
• given a distance value, it is straightforward to convert it into a similarity value, e.g.

$$\mathrm{sim}(x, y) = \frac{1}{1 + \mathrm{dist}(x, y)}$$

• it is not necessarily straightforward to go the other way
• we'll describe our algorithms in terms of distances
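As a one-line sketch of this conversion:

    def similarity(dist):
        # map a non-negative distance into a similarity in (0, 1]
        return 1.0 / (1.0 + dist)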

Page 9

Distance Metrics

• properties of metrics
  – $\mathrm{dist}(x_i, x_j) \geq 0$ (non-negativity)
  – $\mathrm{dist}(x_i, x_j) = 0$ if and only if $x_i = x_j$ (identity)
  – $\mathrm{dist}(x_i, x_j) = \mathrm{dist}(x_j, x_i)$ (symmetry)
  – $\mathrm{dist}(x_i, x_j) \leq \mathrm{dist}(x_i, x_k) + \mathrm{dist}(x_k, x_j)$ (triangle inequality)
• some distance metrics
  – Manhattan: $\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$
  – Euclidean: $\mathrm{dist}(x_i, x_j) = \sqrt{\sum_e (x_{i,e} - x_{j,e})^2}$

where $e$ ranges over the individual measurements for $x_i$ and $x_j$
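A minimal sketch of both metrics in NumPy, applied to two made-up profiles:

    import numpy as np

    def manhattan(xi, xj):
        # sum of absolute differences over the measurements e
        return np.sum(np.abs(xi - xj))

    def euclidean(xi, xj):
        # square root of the sum of squared differences
        return np.sqrt(np.sum((xi - xj) ** 2))

    xi = np.array([0.12, -0.48, 1.03, 0.77])
    xj = np.array([-1.20, 0.35, 0.40, -0.06])
    print(manhattan(xi, xj), euclidean(xi, xj))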

Page 10

Hierarchical Clustering: A Dendrogram

[dendrogram figure: leaves represent instances (e.g. genes); the height of the bar joining two clusters indicates the degree of distance within the merged cluster; the vertical axis is a distance scale starting at 0]

Page 11

Hierarchical Clustering of Expression Data

Page 12

Hierarchical Clustering

• can do top-down (divisive) or bottom-up (agglomerative)

• in either case, we maintain a matrix of distance (or similarity) scores for all pairs of

– instances

– clusters (formed so far)

– instances and clusters

Page 13

Bottom-Up Hierarchical Clustering

    given: a set X = {x_1 ... x_n} of instances

    for i := 1 to n do
        c_i := {x_i}                       // each object is initially its own cluster, and a leaf in the tree
    C := {c_1 ... c_n}
    j := n
    while |C| > 1 do
        j := j + 1
        (c_a, c_b) := argmin_{(c_u, c_v)} dist(c_u, c_v)    // find the least distant pair in C
        c_j := c_a ∪ c_b                   // create a new cluster for the pair
        add a new node j to the tree, joining a and b
        C := (C − {c_a, c_b}) ∪ {c_j}
    return tree with root node j
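A minimal Python sketch of this procedure, assuming single-link distances and Euclidean distance between instances; it recomputes cluster distances on every pass rather than maintaining the matrix of the pseudocode, so it is slower but easier to read:

    import numpy as np
    from itertools import combinations

    def bottom_up_cluster(X):
        # single-link distance between two clusters of instance indices
        def cluster_dist(cu, cv):
            return min(np.linalg.norm(X[a] - X[b]) for a in cu for b in cv)

        # each instance starts as its own cluster (a leaf in the tree)
        C = {frozenset([i]) for i in range(len(X))}
        merges = []
        while len(C) > 1:
            # find the least distant pair of clusters in C
            ca, cb = min(combinations(C, 2), key=lambda pair: cluster_dist(*pair))
            cj = ca | cb                      # create a new cluster for the pair
            merges.append((ca, cb, cj))       # record the new internal node of the tree
            C = (C - {ca, cb}) | {cj}
        return merges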

Page 14

Haven't We Already Seen This?

• this algorithm is very similar to UPGMA and neighbor joining; there are some differences
• what the tree represents
  – phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
  – clustering: inferred tree represents similarity of instances; internal nodes don't represent ancestors
• form of tree
  – UPGMA: rooted tree
  – neighbor joining: unrooted
  – hierarchical clustering: rooted tree
• how distances among clusters are calculated
  – UPGMA: average link
  – neighbor joining: based on additivity
  – hierarchical clustering: various

Page 15

Distance Between Two Clusters

• the distance between two clusters can be determined in several ways
  – single link: distance of the two most similar instances
    $\mathrm{dist}(c_u, c_v) = \min\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
  – complete link: distance of the two least similar instances
    $\mathrm{dist}(c_u, c_v) = \max\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
  – average link: average distance between instances
    $\mathrm{dist}(c_u, c_v) = \mathrm{avg}\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
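A short sketch of the three linkages, assuming each cluster is a list of NumPy vectors and instance distances are Euclidean:

    import numpy as np

    def pairwise(cu, cv):
        # all instance-to-instance distances between clusters cu and cv
        return [np.linalg.norm(a - b) for a in cu for b in cv]

    def single_link(cu, cv):
        return min(pairwise(cu, cv))

    def complete_link(cu, cv):
        return max(pairwise(cu, cv))

    def average_link(cu, cv):
        return float(np.mean(pairwise(cu, cv)))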

Page 16

Complete-Link vs. Single-Link Distances

[figure: for two clusters $c_u$ and $c_v$, the complete-link distance is measured between the farthest pair of instances across the clusters, while the single-link distance is measured between the closest pair]

Page 17

Updating Distances Efficiently

• if we just merged $c_u$ and $c_v$ into $c_j$, we can determine the distance from $c_j$ to each other cluster $c_k$ as follows
  – single link: $\mathrm{dist}(c_j, c_k) = \min\{\mathrm{dist}(c_u, c_k), \mathrm{dist}(c_v, c_k)\}$
  – complete link: $\mathrm{dist}(c_j, c_k) = \max\{\mathrm{dist}(c_u, c_k), \mathrm{dist}(c_v, c_k)\}$
  – average link: $\mathrm{dist}(c_j, c_k) = \dfrac{|c_u| \times \mathrm{dist}(c_u, c_k) + |c_v| \times \mathrm{dist}(c_v, c_k)}{|c_u| + |c_v|}$
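These update rules need only the old matrix entries, as the sketch below shows; d_uk and d_vk are dist(c_u, c_k) and dist(c_v, c_k), and n_u, n_v are the cluster sizes:

    def updated_dist(d_uk, d_vk, n_u, n_v, link="single"):
        # distance from the merged cluster c_j = c_u ∪ c_v to an existing cluster c_k
        if link == "single":
            return min(d_uk, d_vk)
        if link == "complete":
            return max(d_uk, d_vk)
        if link == "average":
            return (n_u * d_uk + n_v * d_vk) / (n_u + n_v)
        raise ValueError(link)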

Page 18

Computational Complexity

• the naïve implementation of hierarchical clustering has $O(n^3)$ time complexity, where n is the number of instances
  – computing the initial distance matrix takes $O(n^2)$ time
  – there are $O(n)$ merging steps
  – on each step, we have to update the distance matrix and select the next pair of clusters to merge, which takes $O(n^2)$ time

Page 19

Computational Complexity

• for single-link clustering, we can update and pick the next pair in $O(n)$ time, resulting in an $O(n^2)$ algorithm
• for complete-link and average-link we can do these steps in $O(n \log n)$ time, resulting in an $O(n^2 \log n)$ method
• see http://www-csli.stanford.edu/~schuetze/completelink.html for an improved and corrected discussion of the computational complexity of hierarchical clustering

Page 20

Partitional Clustering

• divide instances into disjoint clusters

– flat vs. tree structure

• key issues

– how many clusters should there be?

– how should clusters be represented?

Page 21

Partitional Clustering Example

Page 22

Partitional Clustering from a Hierarchical Clustering

• we can always generate a partitional clustering from a hierarchical clustering by "cutting" the tree at some level

[figure: cutting the dendrogram at a high level results in 2 clusters; cutting at a lower level results in 4 clusters]
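In SciPy this is a pair of calls; a sketch on made-up profiles (the cluster counts mirror the figure):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cut_tree

    X = np.random.rand(10, 4)                      # 10 made-up profiles
    Z = linkage(X, method="average")               # bottom-up clustering, average link
    labels2 = cut_tree(Z, n_clusters=2).ravel()    # cutting high gives 2 clusters
    labels4 = cut_tree(Z, n_clusters=4).ravel()    # cutting lower gives 4 clusters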

Page 23

K-Means Clustering

• assume our instances are represented by vectors of real values
• put k cluster centers in the same space as the instances
• each cluster j is represented by a vector $\vec{f}_j$
• consider an example in which our vectors have 2 dimensions

[figure: instances plotted as points in 2-D, with cluster centers marked by +]

Page 24

K-Means Clustering

• each iteration involves two steps
  – assignment of instances to clusters
  – re-computation of the means

[figure: instances are first assigned to the nearest center, then each center moves to the mean of its assigned instances]

Page 25

K-Means Clustering: Updating the Means

• for the set of instances that have been assigned to a cluster $c_j$, we re-compute the mean of the cluster as follows

$$\mu(c_j) = \frac{\sum_{\vec{x}_i \in c_j} \vec{x}_i}{|c_j|}$$

Page 26

K-Means Clustering

    given: a set X = {x_1 ... x_n} of instances
    select k initial cluster centers f_1 ... f_k
    while stopping criterion not true do
        for all clusters c_j do
            // determine which instances are assigned to this cluster
            c_j = { x_i | dist(x_i, f_j) < dist(x_i, f_l), for all l ≠ j }
        for all means f_j do
            // update the cluster center
            f_j = μ(c_j)
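A minimal NumPy sketch of this loop, using Euclidean distance and breaking ties toward the lower-numbered cluster (the pseudocode's strict inequality leaves ties unassigned):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # select k initial cluster centers (here: k random instances)
        f = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        assign = None
        for _ in range(max_iter):
            # assignment step: each instance joins the cluster with the nearest center
            d = np.linalg.norm(X[:, None, :] - f[None, :, :], axis=2)
            new_assign = d.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                              # stopping criterion: assignments unchanged
            assign = new_assign
            # update step: each center moves to the mean of its assigned instances
            for j in range(k):
                if np.any(assign == j):
                    f[j] = X[assign == j].mean(axis=0)
        return f, assign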

Page 27

K-means Clustering Example

Given the following 4 instances and 2 clusters initialized as shown in the figure (reading the coordinates off the plot: $x_1 = (4,1)$, $x_2 = (4,3)$, $x_3 = (6,2)$, $x_4 = (8,8)$, with initial centers $f_1 = (3,2)$ and $f_2 = (7,3)$). Assume the distance function is $\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$

first iteration, distances to the centers:

    dist(x1, f1) = 2,  dist(x1, f2) = 5
    dist(x2, f1) = 2,  dist(x2, f2) = 3
    dist(x3, f1) = 3,  dist(x3, f2) = 2
    dist(x4, f1) = 11, dist(x4, f2) = 6

so x1 and x2 are assigned to cluster 1, and x3 and x4 to cluster 2; re-computing the means:

    f1 = ((4 + 4)/2, (1 + 3)/2) = (4, 2)
    f2 = ((6 + 8)/2, (2 + 8)/2) = (7, 5)

second iteration, distances to the new centers:

    dist(x1, f1) = 1,  dist(x1, f2) = 7
    dist(x2, f1) = 1,  dist(x2, f2) = 5
    dist(x3, f1) = 2,  dist(x3, f2) = 4
    dist(x4, f1) = 10, dist(x4, f2) = 4

so x1, x2, and x3 are now assigned to cluster 1, and x4 to cluster 2

Page 28

K-means Clustering Example (Continued)

re-computing the means:

    f1 = ((4 + 4 + 6)/3, (1 + 3 + 2)/3) = (4.67, 2)
    f2 = (8/1, 8/1) = (8, 8)

with these centers the assignments remain the same, so the procedure has converged
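A short NumPy check of the arithmetic above; the instance coordinates and initial centers are our reading of the slide's figure (an assumption), while the printed distances and means match the values in the text:

    import numpy as np

    x = np.array([[4., 1.], [4., 3.], [6., 2.], [8., 8.]])   # x1..x4, read off the figure
    f = np.array([[3., 2.], [7., 3.]])                       # initial centers f1, f2

    for step in range(3):
        d = np.abs(x[:, None, :] - f[None, :, :]).sum(axis=2)   # Manhattan distances, shape (4, 2)
        assign = d.argmin(axis=1)                                # nearest center for each instance
        f = np.array([x[assign == j].mean(axis=0) for j in range(2)])
        print(step, d.tolist(), assign.tolist(), f.tolist())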

Page 29

EM Clustering

• in k-means as just described, instances are assigned to one and only one cluster

• we can do “soft” k-means clustering via an Expectation Maximization (EM) algorithm

– each cluster represented by a distribution (e.g. a Gaussian)

– E step: determine how likely it is that each cluster "generated" each instance

– M step: adjust cluster parameters to maximize likelihood of instances

Page 30

Representation of Clusters

• in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian

$$N_j(\vec{x}_i) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_j|^{1/2}} \exp\!\left[ -\frac{1}{2} (\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j) \right]$$

where $\vec{\mu}_j$ is the mean of the Gaussian and $\Sigma_j$ is the covariance matrix

[figure: a Gaussian density in a 2-D space]
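A sketch of evaluating such a density with SciPy, using made-up parameters for a 2-D cluster:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu_j = np.array([0.0, 2.0])             # mean of the Gaussian (made-up)
    Sigma_j = np.array([[4.0, 0.0],         # covariance matrix (made-up)
                        [0.0, 4.0]])
    N_j = multivariate_normal(mean=mu_j, cov=Sigma_j)
    print(N_j.pdf(np.array([1.0, 1.0])))    # density N_j(x_i) at the point (1, 1)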

Page 31

Cluster Generation

• we model our data points with a generative process
• each point is generated by:
  – choosing a cluster $j \in \{1, 2, \ldots, k\}$ by sampling from a probability distribution over the clusters, with Prob(cluster j) = $P_j$
  – sampling a point from the Gaussian distribution $N_j$
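A minimal sketch of this generative process, with made-up mixture parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([0.6, 0.4])                             # Prob(cluster j) = P_j (made-up)
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # cluster means (made-up)
    Sigma = np.eye(2)                                    # shared covariance, for simplicity

    def generate_point():
        j = rng.choice(len(P), p=P)                      # choose a cluster
        return rng.multivariate_normal(mus[j], Sigma)    # sample a point from N_j

    points = np.array([generate_point() for _ in range(5)])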

Page 32

EM Clustering

• the EM algorithm will try to set the parameters of the Gaussians, $\Theta$, to maximize the log likelihood of the data, X

$$\log \mathrm{likelihood}(X \mid \Theta) = \log \prod_{i=1}^{n} \Pr(\vec{x}_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} P_j N_j(\vec{x}_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} P_j N_j(\vec{x}_i)$$
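A sketch of evaluating this objective, computed in log space for numerical stability; X is an (n, m) array and the parameter lists are per cluster:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def log_likelihood(X, P, mus, Sigmas):
        # sum_i log sum_j P_j N_j(x_i)
        log_terms = np.column_stack([
            np.log(P[j]) + multivariate_normal.logpdf(X, mus[j], Sigmas[j])
            for j in range(len(P))
        ])
        return logsumexp(log_terms, axis=1).sum()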

Page 33

EM Clustering

• the parameters of the model, $\Theta$, include the means, the covariance matrix, and sometimes prior weights for each Gaussian
• here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights

Page 34

EM Clustering: Hidden Variables

• on each iteration of k-means clustering, we had to assign each instance to a cluster
• in the EM approach, we'll use hidden variables to represent this idea
• for each instance $\vec{x}_i$ we have a set of hidden variables $Z_{i1}, \ldots, Z_{ik}$
• we can think of $Z_{ij}$ as being 1 if $\vec{x}_i$ is a member of cluster j and 0 otherwise

Page 35

EM Clustering: the E-step

• recall that $Z_{ij}$ is a hidden variable which is 1 if $N_j$ generated $\vec{x}_i$ and 0 otherwise
• in the E-step, we compute $h_{ij}$, the expected value of this hidden variable

$$h_{ij} = E(Z_{ij} \mid \vec{x}_i) = \Pr(Z_{ij} = 1 \mid \vec{x}_i) = \frac{P_j N_j(\vec{x}_i)}{\sum_{l=1}^{k} P_l N_l(\vec{x}_i)}$$

[figure: a soft assignment in which one instance belongs to one cluster with weight 0.3 and to another with weight 0.7]
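A sketch of the E-step as a single vectorized computation:

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, P, mus, Sigmas):
        # weighted densities P_j N_j(x_i) for each instance and cluster: shape (n, k)
        dens = np.column_stack([
            P[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
            for j in range(len(P))
        ])
        # h_ij = P_j N_j(x_i) / sum_l P_l N_l(x_i)
        return dens / dens.sum(axis=1, keepdims=True)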

Page 36

EM Clustering: the M-step

• given the expected values $h_{ij}$, we re-estimate the means of the Gaussians and the cluster probabilities

$$\vec{\mu}_j = \frac{\sum_{i=1}^{n} h_{ij} \vec{x}_i}{\sum_{i=1}^{n} h_{ij}} \qquad P_j = \frac{\sum_{i=1}^{n} h_{ij}}{n}$$

• can also re-estimate the covariance matrix if we're varying it
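A sketch of the M-step updates, where h is the (n, k) matrix of expected values from the E-step:

    import numpy as np

    def m_step(X, h):
        weights = h.sum(axis=0)               # sum_i h_ij, one entry per cluster
        mus = (h.T @ X) / weights[:, None]    # mu_j = sum_i h_ij x_i / sum_i h_ij
        P = weights / len(X)                  # P_j = sum_i h_ij / n
        return mus, P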

Page 37

EM Clustering Example

Consider a one-dimensional clustering problem in which the data given are: x1 = −4, x2 = −3, x3 = −1, x4 = 3, x5 = 5

The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance 4; their density function is:

$$f(x, \mu) = \frac{1}{\sqrt{8\pi}} \exp\!\left[ -\frac{1}{2} \left( \frac{x - \mu}{2} \right)^2 \right]$$

where $\mu$ denotes the mean (center) of the Gaussian.

Initially, we set $P_1 = P_2 = 0.5$

Page 38

EM Clustering Example

[figure: both Gaussian densities (means 0 and 2, variance 4) plotted over the five data points]

    f(−4, μ1) = (1/√(8π)) e^(−(−4−0)²/8) = .0269     f(−4, μ2) = .0022
    f(−3, μ1) = .0646                                f(−3, μ2) = .00874
    f(−1, μ1) = .176                                 f(−1, μ2) = .0646
    f(3, μ1)  = .0646                                f(3, μ2)  = .176
    f(5, μ1)  = .00874                               f(5, μ2)  = .0646

Page 39

EM Clustering Example: E Step

$$h_{i1} = \frac{\frac{1}{2} f(x_i, \mu_1)}{\frac{1}{2} f(x_i, \mu_1) + \frac{1}{2} f(x_i, \mu_2)} \qquad h_{i2} = \frac{\frac{1}{2} f(x_i, \mu_2)}{\frac{1}{2} f(x_i, \mu_1) + \frac{1}{2} f(x_i, \mu_2)}$$

    h11 = .0269 / (.0269 + .0022)   = 0.924     h12 = .0022 / (.0269 + .0022)   = 0.076
    h21 = .0646 / (.0646 + .00874)  = 0.881     h22 = .00874 / (.0646 + .00874) = 0.119
    h31 = .176 / (.176 + .0646)     = 0.732     h32 = .0646 / (.176 + .0646)    = 0.268
    h41 = .0646 / (.0646 + .176)    = 0.268     h42 = .176 / (.0646 + .176)     = 0.732
    h51 = .00874 / (.00874 + .0646) = 0.119     h52 = .0646 / (.00874 + .0646)  = 0.881

Page 40

EM Clustering Example: M-step

$$\mu_1 = \frac{\sum_i h_{i1} x_i}{\sum_i h_{i1}} = \frac{-4 \times .924 + -3 \times .881 + -1 \times .732 + 3 \times .268 + 5 \times .119}{.924 + .881 + .732 + .268 + .119} = -1.94$$

$$\mu_2 = \frac{\sum_i h_{i2} x_i}{\sum_i h_{i2}} = \frac{-4 \times .076 + -3 \times .119 + -1 \times .268 + 3 \times .732 + 5 \times .881}{.076 + .119 + .268 + .732 + .881} = 2.73$$

$$P_1 = \frac{\sum_i h_{i1}}{n} = \frac{.924 + .881 + .732 + .268 + .119}{5} = 0.58 \qquad P_2 = \frac{\sum_i h_{i2}}{n} = \frac{.076 + .119 + .268 + .732 + .881}{5} = 0.42$$

[figure: the Gaussians after the first M-step, re-centered over the data]
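A short NumPy sketch that reproduces the E-step and M-step numbers above:

    import numpy as np

    x = np.array([-4.0, -3.0, -1.0, 3.0, 5.0])
    mus = np.array([0.0, 2.0])
    P = np.array([0.5, 0.5])

    def f(x, mu):
        # the slide's density: a Gaussian with variance 4
        return np.exp(-((x - mu) ** 2) / 8.0) / np.sqrt(8.0 * np.pi)

    # E-step: h_ij = P_j f(x_i, mu_j) / sum_l P_l f(x_i, mu_l)
    dens = P * f(x[:, None], mus[None, :])       # shape (5, 2)
    h = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted means and cluster priors
    new_mus = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)
    new_P = h.sum(axis=0) / len(x)
    print(np.round(h, 3))        # first column ~ [.924, .881, .732, .268, .119]
    print(np.round(new_mus, 2))  # ~ [-1.94, 2.73]
    print(np.round(new_P, 2))    # ~ [0.58, 0.42]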

Page 41

EM Clustering Example

[figures: the Gaussian densities before and after one EM iteration, plotted over the data]

• here we’ve shown just one step of the EM procedure

• we would continue the E- and M-steps until convergence

Page 42

EM and K-Means Clustering

• both will converge to a local maximum

• both are sensitive to initial positions (means) of clusters

• have to choose value of k for both

Page 43

Cross Validation to Select k

• with EM clustering, we can run the method with different values of k, and use cross validation to evaluate each clustering
• for each clustering, compute $\sum_i \log \Pr(x_i)$ on the held-aside data to evaluate it
• then run the method on all the data once we've picked k
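A sketch of this selection loop; fit_em and log_likelihood stand for the EM routines sketched on the earlier slides and are hypothetical names, not a particular library's API:

    import numpy as np

    def cv_log_likelihood(X, k, n_folds=5, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        total = 0.0
        for fold in np.array_split(idx, n_folds):
            train = np.setdiff1d(idx, fold)
            P, mus, Sigmas = fit_em(X[train], k)              # hypothetical: run EM to convergence
            total += log_likelihood(X[fold], P, mus, Sigmas)  # score the held-aside fold
        return total

    # pick the k with the best held-aside score, then refit on all the data
    # best_k = max(range(2, 10), key=lambda k: cv_log_likelihood(X, k))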

Page 44

Evaluating Clustering Results

• given random data without any “structure”, clustering algorithms will still return clusters

• the gold standard: do clusters correspond to natural categories?

• do clusters correspond to categories we care about? (there are lots of ways to partition the world)

Page 45

Evaluating Clustering Results

• some approaches
  – external validation
    • e.g. do genes clustered together have some common function?
  – internal validation
    • how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity? (see the sketch after this list)
  – relative validation
    • how does it compare to other clusterings using these criteria?
    • e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?
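For internal validation, one commonly used index is the silhouette score; a minimal scikit-learn sketch on random stand-in data and labels:

    import numpy as np
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.random((50, 4))                 # made-up profiles
    labels = rng.integers(0, 3, size=50)    # stand-in for a real clustering
    print(silhouette_score(X, labels))      # higher = tighter, better-separated clusters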

Page 46

Comments on Clustering

• there are many different ways to do clustering; we've discussed just a few methods

• hierarchical clusters may be more informative, but they’re more expensive to compute

• clusterings are hard to evaluate in many cases