
Clustering Gene Expression Data

BMI/CS 576

www.biostat.wisc.edu/bmi576/

Colin Dewey

cdewey@biostat.wisc.edu

Fall 2010

Gene Expression Profiles

• we'll assume we have a 2D matrix of gene expression measurements
  – rows represent genes
  – columns represent different experiments, time points, individuals, etc. (what we can measure using one* microarray)

• we'll refer to individual rows or columns as profiles
  – a row is a profile for a gene

* Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, or individual.

Expression Profile Example

• rows represent human genes

• columns represent people with leukemia

Expression Profile Example

• rows represent yeast genes

• columns represent time points in a given experiment

Task Definition: Clustering Gene Expression Profiles

• given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent)

• do: organize profiles into clusters such that

– instances in the same cluster are highly similar to each other

– instances from different clusters have low similarity to each other

Motivation for Clustering

• exploratory data analysis

– understanding general characteristics of data

– visualizing data

• generalization

– infer something about an instance (e.g. a gene) based on how it relates to other instances

• everyone else is doing it

The Clustering Landscape

• there are many different clustering algorithms

• they differ along several dimensions
  – hierarchical vs. partitional (flat)
  – hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
  – disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
  – deterministic (same clusters produced every time for a given data set) vs. stochastic
  – distance (similarity) measure used

Distance/Similarity Measures

• many clustering methods employ a distance (similarity) measure to assess the distance between
  – a pair of instances
  – a cluster and an instance
  – a pair of clusters

• given a distance value, it is straightforward to convert it into a similarity value:

  sim(x, y) = \frac{1}{1 + dist(x, y)}

• not necessarily straightforward to go the other way

• we'll describe our algorithms in terms of distances

Distance Metrics

• properties of metrics
  – dist(x_i, x_j) ≥ 0                                      (non-negativity)
  – dist(x_i, x_j) = 0 if and only if x_i = x_j             (identity)
  – dist(x_i, x_j) = dist(x_j, x_i)                         (symmetry)
  – dist(x_i, x_j) ≤ dist(x_i, x_k) + dist(x_k, x_j)        (triangle inequality)

• some distance metrics
  – Manhattan:  dist(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|
  – Euclidean:  dist(x_i, x_j) = \sqrt{\sum_e (x_{i,e} - x_{j,e})^2}

  where e ranges over the individual measurements for x_i and x_j
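As a side note (not from the original slides), these two metrics are short one-liners in Python; the function names here are just illustrative:

import math

def manhattan(x, y):
    # sum of absolute differences over the measurements
    return sum(abs(xe - ye) for xe, ye in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared differences
    return math.sqrt(sum((xe - ye) ** 2 for xe, ye in zip(x, y)))

# example: two short expression profiles
print(manhattan([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))   # 2.5
print(euclidean([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))   # 1.5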

Hierarchical Clustering: A Dendrogram

• leaves represent instances (e.g. genes)

• height of bar indicates degree of distance within cluster

[figure: a dendrogram, with a distance scale on the vertical axis starting at 0]

Hierarchical Clustering of Expression Data

Hierarchical Clustering

• can do top-down (divisive) or bottom-up (agglomerative)

• in either case, we maintain a matrix of distance (or similarity) scores for all pairs of

– instances

– clusters (formed so far)

– instances and clusters

Bottom-Up Hierarchical Clustering

given: a set X = {x_1 ... x_n} of instances

for i := 1 to n do
    c_i := {x_i}                                 // each object is initially its own cluster, and a leaf in the tree
C := {c_1 ... c_n}
j := n
while |C| > 1 do
    j := j + 1
    (c_a, c_b) := \arg\min_{(c_u, c_v)} dist(c_u, c_v)     // find least distant pair in C
    c_j := c_a ∪ c_b                             // create a new cluster for the pair
    add a new node j to the tree joining c_a and c_b
    C := C − {c_a, c_b} ∪ {c_j}
return tree with root node j
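The following is a minimal Python sketch of this procedure (not the course's reference code); it assumes Manhattan distance between instances and single-link distance between clusters, and returns the tree as nested tuples of instance indices:

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def single_link(c1, c2, instances):
    # cluster distance = distance of the two closest members
    return min(manhattan(instances[i], instances[j]) for i in c1 for j in c2)

def bottom_up_cluster(instances):
    # each instance starts as its own cluster: (member indices, subtree)
    clusters = [((i,), i) for i in range(len(instances))]
    while len(clusters) > 1:
        # find the least distant pair of clusters
        a, b = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: single_link(clusters[p[0]][0], clusters[p[1]][0], instances))
        (ma, ta), (mb, tb) = clusters[a], clusters[b]
        merged = (ma + mb, (ta, tb))          # new internal node joins the pair
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][1]                     # root of the tree

# example: four 2-D profiles forming two tight pairs
print(bottom_up_cluster([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]))   # ((0, 1), (2, 3))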

Haven't We Already Seen This?

• this algorithm is very similar to UPGMA and neighbor joining; there are some differences

• what the tree represents
  – phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
  – clustering: inferred tree represents similarity of instances; internal nodes don't represent ancestors

• form of tree
  – UPGMA: rooted tree
  – neighbor joining: unrooted tree
  – hierarchical clustering: rooted tree

• how distances among clusters are calculated
  – UPGMA: average link
  – neighbor joining: based on additivity
  – hierarchical clustering: various

Distance Between Two Clusters

• the distance between two clusters can be determined in several ways

  – single link: distance of the two most similar instances
      dist(c_u, c_v) = \min\{ dist(a, b) \mid a \in c_u, b \in c_v \}

  – complete link: distance of the two least similar instances
      dist(c_u, c_v) = \max\{ dist(a, b) \mid a \in c_u, b \in c_v \}

  – average link: average distance between instances
      dist(c_u, c_v) = \mathrm{avg}\{ dist(a, b) \mid a \in c_u, b \in c_v \}

Complete-Link vs. Single-Link Distances

[figure: the complete-link and single-link distances between two clusters c_u and c_v]

Updating Distances Efficiently

• if we just merged c_u and c_v into c_j, we can determine the distance from c_j to each other cluster c_k as follows

  – single link:    dist(c_j, c_k) = \min\{ dist(c_u, c_k), dist(c_v, c_k) \}

  – complete link:  dist(c_j, c_k) = \max\{ dist(c_u, c_k), dist(c_v, c_k) \}

  – average link:   dist(c_j, c_k) = \frac{|c_u| \times dist(c_u, c_k) + |c_v| \times dist(c_v, c_k)}{|c_u| + |c_v|}
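As a sketch (my own variable and function names, not from the slides), the merge step can apply these update rules to a distance table instead of recomputing distances from the instances:

def update_distances(dist, sizes, u, v, j, clusters, linkage="average"):
    # dist:  dict mapping frozenset({a, b}) -> distance between clusters a and b
    # sizes: dict mapping cluster id -> number of instances in that cluster
    # clusters u and v were just merged into the new cluster j
    sizes[j] = sizes[u] + sizes[v]
    for k in clusters:
        if k in (u, v, j):
            continue
        du, dv = dist[frozenset((u, k))], dist[frozenset((v, k))]
        if linkage == "single":
            dist[frozenset((j, k))] = min(du, dv)
        elif linkage == "complete":
            dist[frozenset((j, k))] = max(du, dv)
        else:  # average link, weighted by cluster sizes
            dist[frozenset((j, k))] = (sizes[u] * du + sizes[v] * dv) / (sizes[u] + sizes[v])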

Computational Complexity

• the naïve implementation of hierarchical clustering has O(n^3) time complexity, where n is the number of instances

  – computing the initial distance matrix takes O(n^2) time

  – there are O(n) merging steps

  – on each step, we have to update the distance matrix (O(n)) and select the next pair of clusters to merge (O(n^2))

Computational Complexity

• for single-link clustering, we can update and pick the next pair in O(n) time, resulting in an O(n^2) algorithm

• for complete-link and average-link, we can do these steps in O(n log n) time, resulting in an O(n^2 log n) method

• see http://www-csli.stanford.edu/~schuetze/completelink.html for an improved and corrected discussion of the computational complexity of hierarchical clustering

Partitional Clustering

• divide instances into disjoint clusters

– flat vs. tree structure

• key issues

– how many clusters should there be?

– how should clusters be represented?

Partitional Clustering Example

[figure: a dendrogram with two possible cut levels — cutting at one level results in 2 clusters, cutting at the other results in 4 clusters]

Partitional Clustering from a Hierarchical Clustering

• we can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level

K-Means Clustering

• assume our instances are represented by vectors of real values

• put k cluster centers in the same space as the instances

• each cluster j is represented by a vector f_j (its cluster center)

• consider an example in which our vectors have 2 dimensions

[figure: instances and cluster centers plotted in a 2-D space]

K-Means Clustering

• each iteration involves two steps
  – assignment of instances to clusters
  – re-computation of the means

[figure: the assignment step followed by the re-computation of the means]

K-Means Clustering: Updating the Means

• for a set of instances that have been assigned to a cluster c_j, we re-compute the mean of the cluster as follows:

  \mu(c_j) = \frac{\sum_{x_i \in c_j} x_i}{|c_j|}

K-Means Clustering

given: a set X = {x_1 ... x_n} of instances

select k initial cluster centers f_1 ... f_k
while stopping criterion not true do
    for all clusters c_j do
        // determine which instances are assigned to this cluster
        c_j := { x_i | dist(x_i, f_j) < dist(x_i, f_l), ∀ l ≠ j }
    for all means f_j do
        // update the cluster center
        f_j := \mu(c_j)
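A short NumPy sketch of this loop (an illustration, not the course's reference implementation; it assumes Euclidean distance and picks the initial centers by sampling instances):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k distinct instances as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned instances
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stopping criterion: centers unchanged
            break
        centers = new_centers
    return centers, labels

# example: two well-separated groups of 2-D profiles
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(X, k=2))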

K-means Clustering Example

Given the following 4 instances and 2 clusters initialized as shown (see figure). Assume the distance function is

  dist(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|

Distances to the initial centers:

  dist(x_1, f_1) = 2      dist(x_1, f_2) = 5
  dist(x_2, f_1) = 2      dist(x_2, f_2) = 3
  dist(x_3, f_1) = 3      dist(x_3, f_2) = 2
  dist(x_4, f_1) = 11     dist(x_4, f_2) = 6

so x_1 and x_2 are assigned to cluster 1, and x_3 and x_4 to cluster 2. Re-computing the means:

  f_1 = \left( \frac{4 + 4}{2}, \frac{1 + 3}{2} \right) = (4, 2)          f_2 = \left( \frac{6 + 8}{2}, \frac{2 + 8}{2} \right) = (7, 5)

Distances to the updated centers:

  dist(x_1, f_1) = 1      dist(x_1, f_2) = 7
  dist(x_2, f_1) = 1      dist(x_2, f_2) = 5
  dist(x_3, f_1) = 2      dist(x_3, f_2) = 4
  dist(x_4, f_1) = 10     dist(x_4, f_2) = 4

K-means Clustering Example (Continued)

Now x_1, x_2, and x_3 are assigned to cluster 1, and x_4 to cluster 2. Re-computing the means:

  f_1 = \left( \frac{4 + 4 + 6}{3}, \frac{1 + 3 + 2}{3} \right) = (4.67, 2)          f_2 = \left( \frac{8}{1}, \frac{8}{1} \right) = (8, 8)

The assignments remain the same, so the procedure has converged.

EM Clustering

• in k-means as just described, instances are assigned to one and only one cluster

• we can do “soft” k-means clustering via an Expectation Maximization (EM) algorithm

– each cluster represented by a distribution (e.g. a Gaussian)

– E step: determine how likely it is that each cluster “generated” each instance

– M step: adjust cluster parameters to maximize likelihood of instances

Representation of Clusters

• in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian

  N_j(x_i) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\left[ -\frac{1}{2} (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \right]

  where \mu_j is the mean of the Gaussian and \Sigma_j is the covariance matrix

[figure: a representation of a Gaussian in a 2-D space]
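For reference (an aside, not part of the slides), this density is available directly in SciPy:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])             # mean of the Gaussian
sigma = np.array([[4.0, 0.0],         # covariance matrix (diagonal here,
                  [0.0, 4.0]])        #  variance 4 in each dimension)
print(multivariate_normal(mean=mu, cov=sigma).pdf([1.0, -1.0]))   # density at the point (1, -1)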

Cluster generation

• we model our data points with a generative process

• each point is generated by:
  – choosing a cluster j from {1, 2, ..., k} by sampling from a probability distribution over the clusters, with Prob(cluster j) = P_j
  – sampling a point from the Gaussian distribution N_j

EM Clustering

• the EM algorithm will try to set the parameters of the Gaussians, Θ, to maximize the log likelihood of the data, X:

  \log likelihood(X | \Theta) = \log \prod_{i=1}^{n} \Pr(x_i)
                              = \log \prod_{i=1}^{n} \sum_{j=1}^{k} P_j N_j(x_i)
                              = \sum_{i=1}^{n} \log \sum_{j=1}^{k} P_j N_j(x_i)

EM Clustering

• the parameters of the model, Θ, include the means, the covariance matrix, and sometimes prior weights for each Gaussian

• here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights

EM Clustering: Hidden Variables

• on each iteration of k-means clustering, we had to assign each instance to a cluster

• in the EM approach, we'll use hidden variables to represent this idea

• for each instance x_i we have a set of hidden variables Z_{i1}, ..., Z_{ik}

• we can think of Z_{ij} as being 1 if x_i is a member of cluster j and 0 otherwise

EM Clustering: the E-step

• recall that Z_{ij} is a hidden variable which is 1 if N_j generated x_i and 0 otherwise

• in the E-step, we compute h_{ij}, the expected value of this hidden variable:

  h_{ij} = E(Z_{ij} | x_i) = \Pr(Z_{ij} = 1 | x_i) = \frac{P_j N_j(x_i)}{\sum_{l=1}^{k} P_l N_l(x_i)}

[figure: an instance softly assigned to two clusters, with expected values 0.7 and 0.3]

EM Clustering: the M-step

• given the expected values h_{ij}, we re-estimate the means of the Gaussians and the cluster probabilities:

  \mu_j = \frac{\sum_{i=1}^{n} h_{ij} x_i}{\sum_{i=1}^{n} h_{ij}}          P_j = \frac{\sum_{i=1}^{n} h_{ij}}{n}

• we can also re-estimate the covariance matrix if we're varying it
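A compact one-dimensional sketch of these two steps (fixed variance, as assumed above; the function and parameter names are mine). Running it for a single iteration reproduces the numbers in the worked example that follows:

import numpy as np

def em_1d(x, mus, priors, var=4.0, n_iter=20):
    # x: 1-D data; mus, priors: initial means and cluster weights (length k)
    x, mus, priors = (np.asarray(a, dtype=float) for a in (x, mus, priors))
    for _ in range(n_iter):
        # E-step: h[i, j] = P_j N_j(x_i) / sum_l P_l N_l(x_i)
        dens = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        h = priors * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the means and the cluster probabilities
        mus = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)
        priors = h.sum(axis=0) / len(x)
    return mus, priors

mus, priors = em_1d([-4, -3, -1, 3, 5], mus=[0.0, 2.0], priors=[0.5, 0.5], n_iter=1)
print(mus, priors)   # approximately (-1.94, 2.73) and (0.58, 0.42)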

EM Clustering Example

Consider a one-dimensional clustering problem in which the data given are:

  x_1 = -4, x_2 = -3, x_3 = -1, x_4 = 3, x_5 = 5

The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is

  f(x, \mu) = \frac{1}{\sqrt{8\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{2} \right)^2 \right]

where \mu denotes the mean (center) of the Gaussian.

Initially, we set P_1 = P_2 = 0.5.

EM Clustering Example

[figure: the data points plotted along with the initial density curves of Gaussian 1 (mean 0) and Gaussian 2 (mean 2)]

  f(-4, \mu_1) = \frac{1}{\sqrt{8\pi}} e^{-\frac{1}{2}\left(\frac{-4 - 0}{2}\right)^2} = .0269        f(-4, \mu_2) = .0022
  f(-3, \mu_1) = .0646         f(-3, \mu_2) = .00874
  f(-1, \mu_1) = .176          f(-1, \mu_2) = .0646
  f(3, \mu_1)  = .0646         f(3, \mu_2)  = .176
  f(5, \mu_1)  = .00874        f(5, \mu_2)  = .0646

EM Clustering Example: E-step

  h_{11} = \frac{\frac{1}{2} f(x_1, \mu_1)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0269}{.0269 + .0022} = 0.924        h_{12} = \frac{\frac{1}{2} f(x_1, \mu_2)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0022}{.0269 + .0022} = 0.076

  h_{21} = \frac{.0646}{.0646 + .00874} = 0.881        h_{22} = \frac{.00874}{.0646 + .00874} = 0.119

  h_{31} = \frac{.176}{.176 + .0646} = 0.732           h_{32} = \frac{.0646}{.176 + .0646} = 0.268

  h_{41} = \frac{.0646}{.0646 + .176} = 0.268          h_{42} = \frac{.176}{.0646 + .176} = 0.732

  h_{51} = \frac{.00874}{.00874 + .0646} = 0.119       h_{52} = \frac{.0646}{.00874 + .0646} = 0.881

EM Clustering Example: M-step

  \mu_1 = \frac{\sum_i h_{i1} x_i}{\sum_i h_{i1}} = \frac{-4 \times .924 + -3 \times .881 + -1 \times .732 + 3 \times .268 + 5 \times .119}{.924 + .881 + .732 + .268 + .119} = -1.94

  \mu_2 = \frac{\sum_i h_{i2} x_i}{\sum_i h_{i2}} = \frac{-4 \times .076 + -3 \times .119 + -1 \times .268 + 3 \times .732 + 5 \times .881}{.076 + .119 + .268 + .732 + .881} = 2.73

  P_1 = \frac{\sum_i h_{i1}}{n} = \frac{.924 + .881 + .732 + .268 + .119}{5} = 0.58

  P_2 = \frac{\sum_i h_{i2}}{n} = \frac{.076 + .119 + .268 + .732 + .881}{5} = 0.42

[figure: the density curves after updating the means]

EM Clustering Example

[figure: the two Gaussian density curves over the data, before and after the first EM iteration]

• here we've shown just one step of the EM procedure

• we would continue the E- and M-steps until convergence

EM and K-Means Clustering

• both will converge to a local maximum

• both are sensitive to initial positions (means) of clusters

• have to choose value of k for both

Cross Validation to Select k

• with EM clustering, we can run the method with different values of k, and use cross validation to evaluate each clustering

• to evaluate a clustering, compute the log likelihood of the held-aside data:

  \sum_i \log \Pr(x_i)

• then run the method on all of the data once we've picked k
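One way to implement this check, using scikit-learn's Gaussian mixture EM (a library choice of mine, not specified in the slides):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_log_likelihood(X, k, n_folds=5, seed=0):
    # average per-instance log likelihood of the held-aside folds
    scores = []
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X[train_idx])
        scores.append(gm.score(X[test_idx]))   # mean log Pr(x_i) on held-aside data
    return np.mean(scores)

# pick the k whose clustering makes held-aside data most probable,
# then re-run EM on all of the data with that k
X = np.random.default_rng(0).normal(size=(200, 5))
best_k = max(range(2, 8), key=lambda k: cv_log_likelihood(X, k))
final_model = GaussianMixture(n_components=best_k, random_state=0).fit(X)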

Evaluating Clustering Results

• given random data without any “structure”, clustering algorithms will still return clusters

• the gold standard: do clusters correspond to natural categories?

• do clusters correspond to categories we care about? (there are lots of ways to partition the world)

Evaluating Clustering Results

• some approaches
  – external validation
    • e.g. do genes clustered together have some common function?
  – internal validation
    • how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
  – relative validation
    • how does it compare to other clusterings using these criteria?
    • e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?

Comments on Clustering

• there are many different ways to do clustering; we've discussed just a few methods

• hierarchical clusters may be more informative, but they’re more expensive to compute

• clusterings are hard to evaluate in many cases
