Clustering Gene Expression Data
BMI/CS 576
www.biostat.wisc.edu/bmi576/
Colin Dewey
Fall 2010
Gene Expression Profiles
• we'll assume we have a 2D matrix of gene expression measurements
– rows represent genes
– columns represent different experiments, time points, individuals, etc. (what we can measure using one* microarray)
• we'll refer to individual rows or columns as profiles
– a row is a profile for a gene
* Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, or individual.
Expression Profile Example
• rows represent human genes
• columns represent people with leukemia
Expression Profile Example
• rows represent yeast genes
• columns represent time points in a given experiment
Task Definition: Clustering Gene Expression Profiles
• given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent)
• do: organize profiles into clusters such that
– instances in the same cluster are highly similar to each other
– instances from different clusters have low similarity to each other
Motivation for Clustering
• exploratory data analysis
– understanding general characteristics of data
– visualizing data
• generalization
– infer something about an instance (e.g. a gene) based on how it relates to other instances
• everyone else is doing it
The Clustering Landscape
• there are many different clustering algorithms
• they differ along several dimensions
– hierarchical vs. partitional (flat)
– hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
– disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
– deterministic (same clusters produced every time for a given data set) vs. stochastic
– distance (similarity) measure used
Distance/Similarity Measures
• many clustering methods employ a distance (similarity) measure to assess the distance between
– a pair of instances
– a cluster and an instance
– a pair of clusters
• given a distance value, it is straightforward to convert it into a similarity value:

$$\mathrm{sim}(x, y) = \frac{1}{1 + \mathrm{dist}(x, y)}$$

• not necessarily straightforward to go the other way
• we'll describe our algorithms in terms of distances
Distance Metrics
• properties of metrics
– $\mathrm{dist}(x_i, x_j) \geq 0$ (non-negativity)
– $\mathrm{dist}(x_i, x_j) = 0$ if and only if $x_i = x_j$ (identity)
– $\mathrm{dist}(x_i, x_j) = \mathrm{dist}(x_j, x_i)$ (symmetry)
– $\mathrm{dist}(x_i, x_j) \leq \mathrm{dist}(x_i, x_k) + \mathrm{dist}(x_k, x_j)$ (triangle inequality)
• some distance metrics
– Manhattan: $\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$
– Euclidean: $\mathrm{dist}(x_i, x_j) = \sqrt{\sum_e (x_{i,e} - x_{j,e})^2}$
where $e$ ranges over the individual measurements for $x_i$ and $x_j$
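To make these concrete, here is a minimal Python sketch (our own illustration, not from the slides) of the two metrics and the distance-to-similarity conversion above; the function names are ours:

```python
import numpy as np

def manhattan(xi, xj):
    # sum of absolute differences over the individual measurements e
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def similarity(xi, xj, dist=euclidean):
    # the distance-to-similarity conversion from the previous slide
    return 1.0 / (1.0 + dist(xi, xj))
```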
Hierarchical Clustering: A Dendrogram
leaves represent instances (e.g. genes)
height of bar indicates degree of distance within cluster
(figure: a dendrogram; the vertical axis is a distance scale, starting at 0)
Hierarchical Clustering of Expression Data
Hierarchical Clustering
• can do top-down (divisive) or bottom-up (agglomerative)
• in either case, we maintain a matrix of distance (or similarity) scores for all pairs of
– instances
– clusters (formed so far)
– instances and clusters
Bottom-Up Hierarchical Clustering

given: a set $X = \{x_1 \ldots x_n\}$ of instances
for $i$ := 1 to $n$ do
    $c_i := \{x_i\}$    // each object is initially its own cluster, and a leaf in the tree
$C := \{c_1 \ldots c_n\}$
$j := n$
while $|C| > 1$ do
    $j := j + 1$
    $(c_a, c_b) := \operatorname{argmin}_{(c_u, c_v)} \mathrm{dist}(c_u, c_v)$    // find the least distant pair in $C$
    $c_j := c_a \cup c_b$    // create a new cluster for the pair
    add a new node $j$ to the tree, joining $c_a$ and $c_b$
    $C := (C - \{c_a, c_b\}) \cup \{c_j\}$
return tree with root node $j$
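As an illustration (our own sketch, not the course's code), here is a direct Python translation of the algorithm above; it assumes a `cluster_dist(X, cu, cv)` function such as the linkage rules on the next slide, and records the tree as a list of merge tuples:

```python
from itertools import combinations

def agglomerate(X, cluster_dist):
    # each instance starts as its own cluster (cluster id -> list of instance indices)
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        # find the least distant pair of current clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_dist(X, clusters[pair[0]], clusters[pair[1]]))
        # create a new cluster for the pair and record the tree node joining a and b
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((next_id, a, b))
        next_id += 1
    return merges   # the last entry is the root of the tree
```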
Haven't We Already Seen This?
• this algorithm is very similar to UPGMA and neighbor joining; there are some differences
• what the tree represents
– phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
– clustering: inferred tree represents similarity of instances; internal nodes don't represent ancestors
• form of tree
– UPGMA: rooted tree
– neighbor joining: unrooted
– hierarchical clustering: rooted tree
• how distances among clusters are calculated
– UPGMA: average link
– neighbor joining: based on additivity
– hierarchical clustering: various
Distance Between Two Clusters
• the distance between two clusters can be determined in several ways
– single link: distance of two most similar instances
  $\mathrm{dist}(c_u, c_v) = \min\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
– complete link: distance of two least similar instances
  $\mathrm{dist}(c_u, c_v) = \max\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
– average link: average distance between instances
  $\mathrm{dist}(c_u, c_v) = \operatorname{avg}\{\mathrm{dist}(a, b) \mid a \in c_u, b \in c_v\}$
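These three rules might be sketched in Python as follows (our own helper names; `cu` and `cv` are lists of instance indices, and `dist` is an instance-level metric such as `euclidean()` above):

```python
def single_link(X, cu, cv, dist):
    # distance of the two most similar (closest) instances
    return min(dist(X[a], X[b]) for a in cu for b in cv)

def complete_link(X, cu, cv, dist):
    # distance of the two least similar (farthest) instances
    return max(dist(X[a], X[b]) for a in cu for b in cv)

def average_link(X, cu, cv, dist):
    # average distance over all cross-cluster pairs
    return sum(dist(X[a], X[b]) for a in cu for b in cv) / (len(cu) * len(cv))
```

Any of these can be plugged into the sketch above, e.g. `agglomerate(X, lambda X, cu, cv: average_link(X, cu, cv, euclidean))`.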
Complete-Link vs. Single-Link Distances
(figure: between clusters $c_u$ and $c_v$, complete link uses the farthest pair of instances, single link the closest pair)
Updating Distances Efficiently
• if we just merged $c_u$ and $c_v$ into $c_j$, we can determine the distance from $c_j$ to each other cluster $c_k$ as follows
– single link: $\mathrm{dist}(c_j, c_k) = \min\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– complete link: $\mathrm{dist}(c_j, c_k) = \max\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– average link: $\mathrm{dist}(c_j, c_k) = \dfrac{|c_u| \times \mathrm{dist}(c_u, c_k) + |c_v| \times \mathrm{dist}(c_v, c_k)}{|c_u| + |c_v|}$
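A sketch of these constant-time update rules; `d_uk` and `d_vk` stand for the stored distances $\mathrm{dist}(c_u, c_k)$ and $\mathrm{dist}(c_v, c_k)$, and `n_u`, `n_v` for the cluster sizes (the names are ours):

```python
def update_single(d_uk, d_vk):
    # single link: new cluster inherits the smaller of the two distances
    return min(d_uk, d_vk)

def update_complete(d_uk, d_vk):
    # complete link: new cluster inherits the larger of the two distances
    return max(d_uk, d_vk)

def update_average(d_uk, d_vk, n_u, n_v):
    # average link: size-weighted average of the two stored distances
    return (n_u * d_uk + n_v * d_vk) / (n_u + n_v)
```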
Computational Complexity
• the naïve implementation of hierarchical clustering has $O(n^3)$ time complexity, where $n$ is the number of instances
– computing the initial distance matrix takes $O(n^2)$ time
– there are $O(n)$ merging steps
– on each step, we have to update the distance matrix ($O(n)$) and select the next pair of clusters to merge ($O(n^2)$)
Computational Complexity
• for single-link clustering, we can update and pick the next pair in $O(n)$ time, resulting in an $O(n^2)$ algorithm
• for complete-link and average-link we can do these steps in $O(n \log n)$ time, resulting in an $O(n^2 \log n)$ method
• see http://www-csli.stanford.edu/~schuetze/completelink.html for an improved and corrected discussion of the computational complexity of hierarchical clustering
Partitional Clustering
• divide instances into disjoint clusters
– flat vs. tree structure
• key issues
– how many clusters should there be?
– how should clusters be represented?
Partitional Clustering Example
(figure: a dendrogram with two horizontal cut levels; cutting at the higher level results in 2 clusters, cutting at the lower level results in 4 clusters)
Partitional Clustering from a Hierarchical Clustering
• we can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level
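In practice, one way to do this (an illustration, not part of the slides) is with SciPy, which builds the hierarchical clustering with `linkage` and "cuts" the tree with `fcluster`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)            # e.g. 20 profiles with 5 measurements each
Z = linkage(X, method='average')     # average-link hierarchical clustering
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 flat clusters
```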
K-Means Clustering
• assume our instances are represented by vectors of real values
• put $k$ cluster centers in the same space as the instances
• each cluster $j$ is represented by a vector $\vec{f}_j$
• consider an example in which our vectors have 2 dimensions
(figure: instances shown as points, cluster centers as +'s)
K-Means Clustering
• each iteration involves two steps
– assignment of instances to clusters
– re-computation of the means
(figure: two panels showing one iteration: instances are assigned to the nearest centers, then the means are re-computed)
K-Means Clustering: Updating the Means
• for a set of instances that have been assigned to a cluster $c_j$, we re-compute the mean of the cluster as follows:

$$\mu(c_j) = \frac{\sum_{\vec{x}_i \in c_j} \vec{x}_i}{|c_j|}$$
K-Means Clustering

given: a set $X = \{\vec{x}_1 \ldots \vec{x}_n\}$ of instances
select $k$ initial cluster centers $\vec{f}_1 \ldots \vec{f}_k$
while stopping criterion not true do
    for all clusters $c_j$ do    // determine which instances are assigned to this cluster
        $c_j = \{\vec{x}_i \mid \mathrm{dist}(\vec{x}_i, \vec{f}_j) < \mathrm{dist}(\vec{x}_i, \vec{f}_l),\ \forall\, l \neq j\}$
    for all means $\vec{f}_j$ do    // update the cluster center
        $\vec{f}_j = \mu(c_j)$
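A minimal NumPy sketch of this algorithm (our own illustration, assuming Euclidean distances and stopping when the assignments stabilize):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # select k initial cluster centers (here: k random instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break                                  # assignments stable: converged
        assignments = new_assignments
        # re-computation step: each center becomes the mean of its cluster
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = X[assignments == j].mean(axis=0)
    return centers, assignments
```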
K-means Clustering Example

Given the following 4 instances and 2 clusters initialized as shown in the figure: $x_1 = (4, 1)$, $x_2 = (4, 3)$, $x_3 = (6, 2)$, $x_4 = (8, 8)$, with initial centers $f_1 = (3, 2)$ and $f_2 = (6, 4)$. Assume the distance function is

$$\mathrm{dist}(x_i, x_j) = \sum_e |x_{i,e} - x_{j,e}|$$

The distances from each instance to each center are:

$\mathrm{dist}(x_1, f_1) = 2$,  $\mathrm{dist}(x_1, f_2) = 5$
$\mathrm{dist}(x_2, f_1) = 2$,  $\mathrm{dist}(x_2, f_2) = 3$
$\mathrm{dist}(x_3, f_1) = 3$,  $\mathrm{dist}(x_3, f_2) = 2$
$\mathrm{dist}(x_4, f_1) = 11$,  $\mathrm{dist}(x_4, f_2) = 6$

so $x_1$ and $x_2$ are assigned to cluster 1, and $x_3$ and $x_4$ to cluster 2. Re-computing the means:

$$f_1 = \left( \frac{4+4}{2}, \frac{1+3}{2} \right) = (4, 2) \qquad f_2 = \left( \frac{6+8}{2}, \frac{2+8}{2} \right) = (7, 5)$$

The distances to the new means are:

$\mathrm{dist}(x_1, f_1) = 1$,  $\mathrm{dist}(x_1, f_2) = 7$
$\mathrm{dist}(x_2, f_1) = 1$,  $\mathrm{dist}(x_2, f_2) = 5$
$\mathrm{dist}(x_3, f_1) = 2$,  $\mathrm{dist}(x_3, f_2) = 4$
$\mathrm{dist}(x_4, f_1) = 10$,  $\mathrm{dist}(x_4, f_2) = 4$
K-means Clustering Example (Continued)

Now $x_1$, $x_2$, and $x_3$ belong to cluster 1 and $x_4$ to cluster 2. Re-computing the means:

$$f_1 = \left( \frac{4+4+6}{3}, \frac{1+3+2}{3} \right) = (4.67, 2) \qquad f_2 = \left( \frac{8}{1}, \frac{8}{1} \right) = (8, 8)$$

The assignments remain the same, so the procedure has converged.
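As a check, the whole example can be replayed in a few lines of NumPy; the instance coordinates below are read off the slide's figure (they are consistent with all of the Manhattan distances shown above):

```python
import numpy as np

X = np.array([[4., 1.], [4., 3.], [6., 2.], [8., 8.]])   # x1..x4
f = np.array([[3., 2.], [6., 4.]])                        # initial centers f1, f2

for _ in range(2):                                        # two iterations suffice here
    d = np.abs(X[:, None, :] - f[None, :, :]).sum(axis=2) # Manhattan distances
    assign = d.argmin(axis=1)                             # assignment step
    f = np.array([X[assign == j].mean(axis=0) for j in range(2)])  # mean update

print(assign)   # [0 0 0 1]: x1, x2, x3 in cluster 1; x4 in cluster 2
print(f)        # approximately [[4.67, 2.], [8., 8.]]
```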
EM Clustering
• in k-means as just described, instances are assigned to one and only one cluster
• we can do “soft” k-means clustering via an Expectation Maximization (EM) algorithm
– each cluster represented by a distribution (e.g. a Gaussian)
– E step: determine how likely it is that each cluster "generated" each instance
– M step: adjust cluster parameters to maximize likelihood of instances
Representation of Clusters
• in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian

$$N_j(\vec{x}_i) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}} \exp\left[ -\frac{1}{2} (\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j) \right]$$

where
$\vec{\mu}_j$ is the mean of the Gaussian
$\Sigma_j$ is the covariance matrix
(figure: a representation of a Gaussian in a 2-D space)
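For illustration, this density can be evaluated with SciPy's `multivariate_normal`, which implements this formula; the mean, covariance, and query point below are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu_j = np.array([0.0, 0.0])          # cluster mean (assumed values)
sigma_j = np.array([[1.0, 0.3],
                    [0.3, 1.0]])     # covariance matrix (assumed values)
x_i = np.array([0.5, -0.5])          # a query instance

print(multivariate_normal.pdf(x_i, mean=mu_j, cov=sigma_j))   # N_j(x_i)
```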
Cluster Generation
• we model our data points with a generative process
• each point is generated by:
– choosing a cluster $j \in \{1, 2, \ldots, k\}$ by sampling from a probability distribution over the clusters, with $\Pr(\text{cluster } j) = P_j$
– sampling a point from the Gaussian distribution $N_j$
EM Clustering
• the EM algorithm will try to set the parameters of the Gaussians, $\Theta$, to maximize the log likelihood of the data, $X$

$$\log \mathrm{likelihood}(X \mid \Theta) = \log \prod_{i=1}^{n} \Pr(\vec{x}_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} P_j N_j(\vec{x}_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} P_j N_j(\vec{x}_i)$$
EM Clustering
• the parameters of the model, $\Theta$, include the means, the covariance matrix, and sometimes prior weights for each Gaussian
• here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights
EM Clustering: Hidden Variables
• on each iteration of k-means clustering, we had to assign each instance to a cluster
• in the EM approach, we'll use hidden variables to represent this idea
• for each instance $\vec{x}_i$ we have a set of hidden variables $Z_{i1}, \ldots, Z_{ik}$
• we can think of $Z_{ij}$ as being 1 if $\vec{x}_i$ is a member of cluster $j$ and 0 otherwise
EM Clustering: the E-step
• recall that $Z_{ij}$ is a hidden variable which is 1 if $N_j$ generated $\vec{x}_i$ and 0 otherwise
• in the E-step, we compute $h_{ij}$, the expected value of this hidden variable

$$h_{ij} = E(Z_{ij} \mid \vec{x}_i) = \Pr(Z_{ij} = 1 \mid \vec{x}_i) = \frac{P_j N_j(\vec{x}_i)}{\sum_{l=1}^{k} P_l N_l(\vec{x}_i)}$$

(figure: unlike the hard assignment in k-means, an instance receives a soft assignment, e.g. 0.3 to one cluster and 0.7 to another)
EM Clustering: the M-step
• given the expected values $h_{ij}$, we re-estimate the means of the Gaussians and the cluster probabilities

$$\vec{\mu}_j = \frac{\sum_{i=1}^{n} h_{ij} \vec{x}_i}{\sum_{i=1}^{n} h_{ij}} \qquad P_j = \frac{\sum_{i=1}^{n} h_{ij}}{n}$$

• can also re-estimate the covariance matrix if we're varying it
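Putting the E- and M-steps together, here is a minimal EM sketch (our own illustration) under the earlier slide's assumption of a fixed covariance, fitting only the means and the prior weights:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, mu, P, cov, n_iters=20):
    # X: (n, m) instances; mu: k initial mean vectors; P: k prior weights;
    # cov: fixed covariance matrix shared by all k Gaussians
    n, k = len(X), len(mu)
    for _ in range(n_iters):
        # E-step: h[i, j] = P_j N_j(x_i) / sum_l P_l N_l(x_i)
        h = np.column_stack([P[j] * multivariate_normal.pdf(X, mean=mu[j], cov=cov)
                             for j in range(k)])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and cluster priors
        mu = np.array([(h[:, j, None] * X).sum(axis=0) / h[:, j].sum()
                       for j in range(k)])
        P = h.sum(axis=0) / n
    return mu, P, h
```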
EM Clustering Example

Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5

The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:

$$f(x, \mu) = \frac{1}{\sqrt{8\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{2} \right)^2}$$

where $\mu$ denotes the mean (center) of the Gaussian. Initially, we set $P_1 = P_2 = 0.5$.
EM Clustering Example

(plot: the two initial Gaussians, centered at 0 and 2, with the data points marked on the x-axis)

The density of each data point under each Gaussian is:

$f(-4, \mu_1) = \frac{1}{\sqrt{8\pi}}\, e^{-\frac{1}{2} \left( \frac{-4 - 0}{2} \right)^2} = .0269$,  $f(-4, \mu_2) = .0022$
$f(-3, \mu_1) = .0646$,  $f(-3, \mu_2) = .00874$
$f(-1, \mu_1) = .176$,  $f(-1, \mu_2) = .0646$
$f(3, \mu_1) = .0646$,  $f(3, \mu_2) = .176$
$f(5, \mu_1) = .00874$,  $f(5, \mu_2) = .0646$
EM Clustering Example: E Step

$$h_{11} = \frac{\frac{1}{2} f(x_1, \mu_1)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0269}{.0269 + .0022} = 0.924 \qquad h_{12} = \frac{\frac{1}{2} f(x_1, \mu_2)}{\frac{1}{2} f(x_1, \mu_1) + \frac{1}{2} f(x_1, \mu_2)} = \frac{.0022}{.0269 + .0022} = 0.076$$

$$h_{21} = \frac{.0646}{.0646 + .00874} = 0.881 \qquad h_{22} = \frac{.00874}{.0646 + .00874} = 0.119$$

$$h_{31} = \frac{.176}{.176 + .0646} = 0.732 \qquad h_{32} = \frac{.0646}{.176 + .0646} = 0.268$$

$$h_{41} = \frac{.0646}{.0646 + .176} = 0.268 \qquad h_{42} = \frac{.176}{.0646 + .176} = 0.732$$

$$h_{51} = \frac{.00874}{.00874 + .0646} = 0.119 \qquad h_{52} = \frac{.0646}{.00874 + .0646} = 0.881$$
EM Clustering Example: M-step

$$\mu_1 = \frac{\sum_i x_i h_{i1}}{\sum_i h_{i1}} = \frac{-4 \times .924 + -3 \times .881 + -1 \times .732 + 3 \times .268 + 5 \times .119}{.924 + .881 + .732 + .268 + .119} = -1.94$$

$$\mu_2 = \frac{\sum_i x_i h_{i2}}{\sum_i h_{i2}} = \frac{-4 \times .076 + -3 \times .119 + -1 \times .268 + 3 \times .732 + 5 \times .881}{.076 + .119 + .268 + .732 + .881} = 2.73$$

$$P_1 = \frac{\sum_i h_{i1}}{n} = \frac{.924 + .881 + .732 + .268 + .119}{5} = 0.58 \qquad P_2 = \frac{\sum_i h_{i2}}{n} = \frac{.076 + .119 + .268 + .732 + .881}{5} = 0.42$$

(plot: the two Gaussians after updating the means)
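These numbers are easy to check with a short script (a verification sketch, not part of the slides):

```python
import numpy as np

x = np.array([-4., -3., -1., 3., 5.])
mu, P = np.array([0., 2.]), np.array([0.5, 0.5])   # initial means and priors

# densities under each Gaussian (variance 4, i.e. sigma = 2)
f = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / 2.0) ** 2) / np.sqrt(8 * np.pi)

h = P * f
h /= h.sum(axis=1, keepdims=True)                        # E-step: h[i, j]
mu_new = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)    # M-step: new means
P_new = h.mean(axis=0)                                   # M-step: new priors

print(h[:, 0])   # approx [0.924, 0.881, 0.732, 0.268, 0.119]
print(mu_new)    # approx [-1.94, 2.73]
print(P_new)     # approx [0.58, 0.42]
```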
EM Clustering Example

(plots: the mixture before and after the first EM iteration; the Gaussian means have moved apart, toward the two groups of data points)
• here we’ve shown just one step of the EM procedure
• we would continue the E- and M-steps until convergence
EM and K-Means Clustering
• both will converge to a local maximum
• both are sensitive to initial positions (means) of clusters
• have to choose value of k for both
Cross Validation to Select k
• with EM clustering, we can run the method with different values of k, using CV to evaluate each resulting clustering
• to evaluate a clustering, compute the log likelihood of the held-aside data:

$$\sum_i \log \Pr(x_i)$$

• then run the method on all the data once we've picked k
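A sketch of this procedure; `fit_em` and `log_prob` are hypothetical stand-ins for an EM trainer and a held-out scorer, and `X` is assumed to be a NumPy array of instances:

```python
import numpy as np

def cv_select_k(X, k_values, n_folds, fit_em, log_prob):
    # fit_em(X_train, k) -> model;  log_prob(model, X_test) -> sum_i log Pr(x_i)
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    scores = {}
    for k in k_values:
        total = 0.0
        for f in range(n_folds):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            model = fit_em(X[train], k)
            total += log_prob(model, X[test])   # held-aside log likelihood
        scores[k] = total
    return max(scores, key=scores.get)          # k with the best held-out score
```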
Evaluating Clustering Results
• given random data without any “structure”, clustering algorithms will still return clusters
• the gold standard: do clusters correspond to natural categories?
• do clusters correspond to categories we care about? (there are lots of ways to partition the world)
Evaluating Clustering Results
• some approaches
– external validation
  • e.g. do genes clustered together have some common function?
– internal validation
  • how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
– relative validation
  • how does it compare to other clusterings using these criteria?
  • e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?
Comments on Clustering
• there are many different ways to do clustering; we've discussed just a few methods
• hierarchical clusterings may be more informative, but they're more expensive to compute
• clusterings are hard to evaluate in many cases