Microarray Data Analysis
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Oct 13, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Outline
Microarrays
Hierarchical Clustering
K-Means Clustering
Corrupted Cliques Problem
CAST Clustering Algorithm
Motivation: Inferring Gene Functionality
Researchers want to know the functions of newly sequenced genes
Simply comparing the new gene sequences to known DNA sequences often does not reveal the function of the gene
For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes
Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it
Microarrays and Expression Analysis
Microarrays measure the activity (expression level) of the genes under varying conditions/time points
Expression level is estimated by measuring the amount of mRNA for that particular gene
A gene is active if it is being transcribed
More mRNA usually indicates more gene activity
Microarray Experiments
Produce cDNA from mRNA (DNA is more stable)
Attach phosphor to cDNA to see when a particular gene is expressed
Different color phosphors are available to compare many samples at once
Hybridize the cDNA over the microarray
Scan the microarray with a phosphor-illuminating laser
Illumination reveals transcribed genes
Scan the microarray multiple times for the different color phosphors
Microarray Experiments (cont'd)
www.affymetrix.com
Phosphors can be added here instead
Then instead of staining, laser illumination can be used
Interpretation of Microarray Data
Green: expressed only from control
Red: expressed only from experimental cell
Yellow: equally expressed in both samples
Black: NOT expressed in either control or experimental cells
Microarray Data
Microarray data are usually transformed into an intensity matrix (below)
The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related
This is where clustering comes into play

         Time X   Time Y   Time Z
Gene 1     10       8        10
Gene 2     10       0         9
Gene 3      4       8.6       3
Gene 4      7       8         3
Gene 5      1       2         3

Each entry is the intensity (expression level) of the gene at the measured time
Major Analysis Techniques
Single gene analysis: compare the expression levels of the same gene under different conditions
  Main techniques: significance tests (e.g., t-test)
Gene group analysis: find genes that are expressed similarly across many different conditions (co-expressed genes -> co-regulated genes)
  Main techniques: clustering (many possibilities)
Gene network analysis: analyze gene regulation relationships at a large scale
  Main techniques: Bayesian networks (not covered here)
Clustering of Microarray Data
Clusters
Homogeneity and Separation Principles
• Homogeneity: Elements within a cluster are close to each other
• Separation: Elements in different clusters are further apart from each other
• …clustering is not an easy task!
Given these points a clustering algorithm might make two distinct clusters as follows
Bad Clustering
This clustering violates both Homogeneity and Separation principles
Close distances from points in separate clusters
Far distances from points in the same cluster
Good Clustering
This clustering satisfies both Homogeneity and Separation principles
Clustering Methods
Similarity-based (need a similarity function)
  Construct a partition, typically agglomerative/bottom-up
  Typically “hard” clustering
Model-based (latent models, probabilistic or algebraic)
  First compute the model
  Clusters are obtained easily after having a model
  Typically “soft” clustering
Similarity-based Clustering
Define a similarity function to measure similarity between two objects
Common clustering criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity
Challenge: how to combine the two objectives? (many possibilities)
How to construct clusters? Agglomerative hierarchical clustering (greedy)
Need a cutoff (either on the number of clusters or on a score threshold)
Method 1 (Similarity-based):
Agglomerative Hierarchical Clustering
Similarity Measures
Given two gene expression vectors, X={X1, X2, …, Xn} and Y={Y1, Y2, …, Yn}, define a similarity function s(X,Y) to measure how likely the two genes are to be co-regulated/co-expressed
However, “co-expressed” is hard to quantify; biological knowledge would help
In practice, there are many choices:

Euclidean distance:

$$ s(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

Pearson correlation coefficient:

$$ s(X,Y) = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} $$

Better measures are possible (e.g., focusing on a subset of values)
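As a minimal sketch (not from the lecture; the gene vectors are made-up rows from the toy intensity matrix above), the two measures can be computed directly:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson(x, y):
    # Pearson correlation coefficient, following the formula above
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Gene 1 and Gene 2 rows from the intensity matrix above
g1 = [10, 8, 10]
g2 = [10, 0, 9]
print(euclidean(g1, g2))  # sqrt(0 + 64 + 1) ≈ 8.06
print(pearson(g1, g2))
```

Note the two measures can disagree: Pearson correlation looks only at the shape of the expression profile, so two genes with very different absolute levels can still score as highly similar.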
Agglomerative Hierarchical Clustering
Given a similarity function to measure similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
Variations: different ways to compute group similarity based on individual object similarity
Hierarchical Clustering
Hierarchical Clustering Algorithm
1. Hierarchical Clustering (d , n)
2. Form n clusters each with one element
3. Construct a graph T by assigning one vertex to each cluster
4. while there is more than one cluster
5. Find the two closest clusters C1 and C2
6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements
7. Compute distance from C to all other clusters
8. Add a new vertex C to T and connect to vertices C1 and C2
9. Remove rows and columns of d corresponding to C1 and C2
10. Add a row and column to d corresponding to the new cluster C
11. return T
The algorithm takes an n×n matrix d of pairwise distances between points as input.
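The pseudocode above can be sketched in Python. This is a minimal illustration, not the lecture's own code: it uses single-link distances between clusters (any group-distance rule from the next section could be substituted), merges until k clusters remain, and returns the clusters rather than the tree T:

```python
def hierarchical_clustering(d, k):
    """Merge the two closest clusters until only k clusters remain.

    d is an n x n matrix of pairwise distances; returns a list of
    clusters, each a list of original point indices.
    """
    clusters = [[i] for i in range(len(d))]  # n singleton clusters
    while len(clusters) > k:
        # Find the two closest clusters (single link: closest cross pair)
        best = (None, None, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if dist < best[2]:
                    best = (a, b, dist)
        a, b, _ = best
        # Merge C1 and C2 into a new cluster C
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy distance matrix for 4 points: points 0,1 are close, as are 2,3
d = [[0, 1, 9, 8],
     [1, 0, 8, 9],
     [9, 8, 0, 2],
     [8, 9, 2, 0]]
print(hierarchical_clustering(d, 2))  # -> [[0, 1], [2, 3]]
```

This brute-force search over cluster pairs runs in O(n^3) overall; real implementations update the distance matrix incrementally, as the pseudocode's steps 9-10 suggest.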
Different ways to define distances between clusters lead to different clustering methods…
How to Compute Group Similarity?
Given two groups g1 and g2, there are three popular methods:
Single-link algorithm: s(g1,g2) = similarity of the closest pair
Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
Average-link algorithm: s(g1,g2) = average similarity over all pairs
Three Methods Illustrated
[Figure: the single-link, complete-link, and average-link definitions illustrated on two clusters g1 and g2]
Group Similarity Formulas
Given C (new cluster) and C* (any other cluster in the current results)
Given a similarity function s(x,y) (or equivalently, a distance function, d(x,y) as in the book)
Single link:

$$ s_{\max}(C, C^*) = \max_{x \in C,\, y \in C^*} s(x,y) \quad \text{or} \quad d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x,y) $$

Complete link:

$$ s_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} s(x,y) \quad \text{or} \quad d_{\max}(C, C^*) = \max_{x \in C,\, y \in C^*} d(x,y) $$

Average link:

$$ s_{avg}(C, C^*) = \frac{\sum_{x \in C} \sum_{y \in C^*} s(x,y)}{|C|\,|C^*|} \quad \text{or} \quad d_{avg}(C, C^*) = \frac{\sum_{x \in C} \sum_{y \in C^*} d(x,y)}{|C|\,|C^*|} $$
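As an illustration (the helper names and the 1-D toy data are mine, not from the slides), the three group-distance rules can be written as:

```python
def single_link(C, Cstar, d):
    # Distance of the closest pair across the two clusters
    return min(d(x, y) for x in C for y in Cstar)

def complete_link(C, Cstar, d):
    # Distance of the farthest pair across the two clusters
    return max(d(x, y) for x in C for y in Cstar)

def average_link(C, Cstar, d):
    # Average distance over all cross-cluster pairs
    return sum(d(x, y) for x in C for y in Cstar) / (len(C) * len(Cstar))

# 1-D toy example with d(x, y) = |x - y|
dist = lambda x, y: abs(x - y)
C, Cstar = [0, 1], [4, 10]
print(single_link(C, Cstar, dist))    # 3   (pair 1 and 4)
print(complete_link(C, Cstar, dist))  # 10  (pair 0 and 10)
print(average_link(C, Cstar, dist))   # 6.5 ((4+10+3+9)/4)
```

The spread between the three values on the same pair of clusters (3 vs. 10 vs. 6.5) is exactly why the choice of rule changes the resulting clustering.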
Comparison of the Three Methods
Single-link: “loose” clusters; individual decision, sensitive to outliers
Complete-link: “tight” clusters; individual decision, sensitive to outliers
Average-link: “in between”; group decision, insensitive to outliers
Which one is the best? Depends on what you need!
Method 2 (Model-based):
K-Means
A Model-based View of K-Means
Each cluster is modeled by a centroid point
A partition of the data V={v1…vn} with k clusters is modeled by k centroid points X={x1…xk}; Note that a model X uniquely specifies the clustering results
Define an objective function on X and V to measure how good/bad V is as clustering results of X, f(X,V)
Clustering is reduced to the problem of finding a model X* that optimizes the value f(X*,V)
Once X* is found, the clustering results are simply computed according to our model X*
Major tasks: (1) define d(x,v); (2) define f(X,V); (3) Efficiently search for X*
$$ C_i(X) = \{\, v_j \in V \mid \forall x_l \in X,\ l \neq i : d(v_j, x_i) \le d(v_j, x_l) \,\} $$
Squared Error Distortion
● Given a set of n data points V={v1…vn} and a set of k points X={x1…xk}
● Define d(xi,vj) = Euclidean distance
● Define f(X,V) = d(X,V), the squared error distortion:

$$ d(X,V) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2, \qquad d(v_i, X) = \min_{1 \le j \le k} d(v_i, x_j) $$

where d(vi,X) is the distance from vi to the closest point in X
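A small sketch of the squared error distortion defined above (the 2-D toy points and centers are made-up):

```python
import math

def distortion(V, X):
    """Average squared distance from each point in V to its closest center in X."""
    total = 0.0
    for v in V:
        # d(v, X): distance from v to the closest point in X
        closest = min(math.dist(v, x) for x in X)
        total += closest ** 2
    return total / len(V)

V = [(0, 0), (0, 2), (10, 0), (10, 2)]
X = [(0, 1), (10, 1)]
print(distortion(V, X))  # each point is distance 1 from its center -> 1.0
```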
K-Means Clustering Problem: Formulation
● Input: A set, V, consisting of n points and a parameter k
● Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
This optimization problem is NP-hard (its decision version is NP-complete).
K-Means Clustering
Given a distance function
Start with k randomly selected data points
Assume they are the centroids of k clusters
Assign every data point to a cluster whose centroid is the closest to the data point
Recompute the centroid for each cluster
Repeat this process until the distance-based objective function converges
K-Means Clustering: Lloyd Algorithm
1. Lloyd Algorithm
2. Arbitrarily assign the k cluster centers
3. while the cluster centers keep changing
4. Assign each data point to the cluster Ci corresponding to the closest cluster representative (i.e., the center xi) (1 ≤ i ≤ k)
5. After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is

$$ x_i \leftarrow \frac{\sum_{v \in C_i} v}{|C_i|} $$

*This may lead to merely a locally optimal clustering.
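The Lloyd algorithm above can be sketched as follows. This is a minimal illustration, not the lecture's code: initial centers are passed in explicitly (step 2's "arbitrarily assign"), ties and empty clusters are not handled, and the 2-D toy data are mine:

```python
import math

def lloyd(V, centers, iters=100):
    k = len(centers)
    for _ in range(iters):
        # Assign each data point to the closest cluster representative
        clusters = [[] for _ in range(k)]
        for v in V:
            i = min(range(k), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Recompute each representative as its cluster's center of gravity
        new_centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl))
                       for cl in clusters]
        if new_centers == centers:  # centers stopped changing
            return centers
        centers = new_centers
    return centers

V = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(lloyd(V, [(0, 0), (10, 0)]))  # -> [(0.0, 0.5), (10.0, 0.5)]
```

With a worse starting choice, e.g. `[(0, 0), (0, 1)]`, the same data converge to a different, locally optimal set of centers, which is exactly the caveat noted above.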
[Figure: four scatter plots of expression in condition 1 vs. expression in condition 2 (axes 0–5), showing the cluster centers x1, x2, x3 moving across successive iterations of the Lloyd algorithm]
Evaluation of Clustering Results
How many clusters?
  Score threshold in hierarchical clustering (need to have some confidence in the scores)
  Try out multiple numbers, and see which one works best (“best” could mean fitting the data better or discovering some known clusters)
Quality of a single cluster
  A tight cluster is better (but the “tightness” needs to be quantified)
  Prior knowledge about clusters
Quality of a set of clusters
  Quality of the individual clusters
  How well different clusters are separated
In general, evaluation of clustering and choosing the right clustering algorithms are difficult
But the good news is that significant clusters tend to show up regardless of which clustering algorithm is used
What You Should Know
Why clustering microarray data is interesting
How hierarchical agglomerative clustering works
The 3 different ways of computing the group similarity/distance
How k-means works