Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign


Microarray Data Analysis

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Oct 13, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Outline

Microarrays

Hierarchical Clustering

K-Means Clustering

Corrupted Cliques Problem

CAST Clustering Algorithm


Motivation: Inferring Gene Functionality

Researchers want to know the functions of newly sequenced genes

Simply comparing the new gene sequences to known DNA sequences often does not reveal the function of a gene

For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes

Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer function.


Microarrays and Expression Analysis

Microarrays measure the activity (expression level) of the genes under varying conditions/time points

Expression level is estimated by measuring the amount of mRNA for that particular gene

A gene is active if it is being transcribed

More mRNA usually indicates more gene activity


Microarray Experiments

Produce cDNA from mRNA (DNA is more stable)

Attach phosphor to cDNA to see when a particular gene is expressed

Different color phosphors are available to compare many samples at once

Hybridize cDNA over the microarray

Scan the microarray with a phosphor-illuminating laser

Illumination reveals transcribed genes

Scan microarray multiple times for the different color phosphors


Microarray Experiments (cont'd)

www.affymetrix.com

Phosphors can be added here instead

Then instead of staining, laser illumination can be used


Interpretation of Microarray Data

Green: expressed only in the control cells

Red: expressed only in the experimental cells

Yellow: equally expressed in both samples

Black: NOT expressed in either control or experimental cells


Microarray Data

Microarray data are usually transformed into an intensity matrix (below)

The intensity matrix allows biologists to find correlations between different genes (even if their sequences are dissimilar) and to understand how gene functions might be related

Clustering comes into play

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Each entry is the intensity (expression level) of the gene at the measured time
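As a rough illustration of how clustering "comes into play", the sketch below stores the intensity matrix from this slide and derives an n x n matrix of pairwise distances between genes. The dictionary layout, the function name distance_matrix, and the choice of Euclidean distance are assumptions made for the example, not something the slides prescribe.

```python
import math

# Intensity matrix from the slide: one row per gene, one column per time point
# (Time X, Time Y, Time Z).
intensity = {
    "Gene 1": [10, 8, 10],
    "Gene 2": [10, 0, 9],
    "Gene 3": [4, 8.6, 3],
    "Gene 4": [7, 8, 3],
    "Gene 5": [1, 2, 3],
}


def distance_matrix(matrix):
    """Return (genes, d), where d[i][j] is the Euclidean distance
    between the expression vectors of genes i and j."""
    genes = list(matrix)
    d = [[math.dist(matrix[a], matrix[b]) for b in genes] for a in genes]
    return genes, d


genes, d = distance_matrix(intensity)
```

A matrix of this shape is what a distance-based clustering algorithm would typically take as input.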


Major Analysis Techniques

Single gene analysis

Compare the expression levels of the same gene under different conditions

Main techniques: Significance test (e.g., t-test)

Gene group analysis

Find genes that are expressed similarly across many different conditions (co-expressed genes -> co-regulated genes)

Main techniques: Clustering (many possibilities)

Gene network analysis

Analyze gene regulation relationship at a large scale

Main techniques: Bayesian networks (not covered in this lecture)
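For single gene analysis, a significance test compares one gene's measurements under two conditions. The sketch below computes Welch's t statistic, which is one common variant of the t-test; the slides do not say which variant is meant, and the replicate values are hypothetical numbers chosen only to make the example run. Turning the statistic into a p-value needs the t distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples
    (the two samples may have different variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical replicate measurements of one gene under two conditions.
control = [10.0, 11.0, 9.0, 10.0]
treated = [14.0, 15.0, 13.0, 14.0]

t = welch_t(treated, control)
```

A large absolute value of t suggests that the gene's expression level differs between the two conditions.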


Clustering of Microarray Data

Clusters


Homogeneity and Separation Principles

• Homogeneity: Elements within a cluster are close to each other

• Separation: Elements in different clusters are further apart from each other

• …clustering is not an easy task!

Given these points a clustering algorithm might make two distinct clusters as follows


Bad Clustering

This clustering violates both Homogeneity and Separation principles

Points in separate clusters are close to each other

Points in the same cluster are far from each other


Good Clustering

This clustering satisfies both Homogeneity and Separation principles
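One possible way to check the two principles numerically is to compare the largest distance inside any cluster with the smallest distance between clusters. This particular pair of measures, the function name, and the points below are assumptions for illustration; other quantifications are equally reasonable.

```python
import math
from itertools import combinations


def homogeneity_and_separation(clusters):
    """Return (largest within-cluster distance, smallest between-cluster distance).

    Expects at least two clusters, each a list of points (tuples)."""
    within = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    between = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    return within, between


# Hypothetical 2-D points forming two visually obvious groups.
good = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
# The same points, split so that each cluster mixes both groups.
bad = [[(0, 0), (5, 5), (1, 0)], [(0, 1), (5, 6), (6, 5)]]

good_within, good_between = homogeneity_and_separation(good)
bad_within, bad_between = homogeneity_and_separation(bad)
```

For the good clustering the within-cluster value is smaller than the between-cluster value; for the bad clustering the relation is reversed.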


Clustering Methods

Similarity-based (need a similarity function)

Construct a partition

Agglomerative, bottom up

Typically “hard” clustering

Model-based (latent models, probabilistic or algebraic)

First compute the model

Clusters are obtained easily after having a model

Typically “soft” clustering


Similarity-based Clustering

Define a similarity function to measure similarity between two objects

Common clustering criteria: Find a partition to

Maximize intra-cluster similarity

Minimize inter-cluster similarity

Challenge: How to combine them? (many possibilities)

How to construct clusters?

Agglomerative Hierarchical Clustering (Greedy)

Need a cutoff (either by the number of clusters or by a score threshold)


Method 1 (Similarity-based):

Agglomerative Hierarchical Clustering


Similarity Measures

Given two gene expression vectors, X={x1, x2, …, xn} and Y={y1, y2, …, yn}, define a similarity function s(X,Y) to measure how likely the two genes are co-regulated/co-expressed

However, “co-expressed” is hard to quantify

Biology knowledge would help

In practice, many choices

Euclidean distance

$$s(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Pearson correlation coefficient

$$s(X,Y) = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$

Better measures are possible (e.g., focusing on a subset of values…)
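The two measures named on this slide, Euclidean distance and the Pearson correlation coefficient, can be sketched directly from their standard definitions. The function names are illustrative. Note that Euclidean distance shrinks as two genes become more alike, whereas the Pearson coefficient grows toward 1.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient in its computational (sums) form.

    Raises ZeroDivisionError if either vector is constant."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return numerator / denominator


# Two rows of the intensity matrix shown earlier in the lecture.
gene_1 = [10, 8, 10]
gene_2 = [10, 0, 9]

dist_12 = euclidean_distance(gene_1, gene_2)
corr_12 = pearson(gene_1, gene_2)
```

Pearson correlation ignores differences in overall level and scale, so two genes that rise and fall together score high even when their absolute intensities differ.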


Agglomerative Hierarchical Clustering

Given a similarity function to measure similarity between two objects

Gradually group similar objects together in a bottom-up fashion

Stop when some stopping criterion is met

Variations: different ways to compute group similarity based on individual object similarity


Hierarchical Clustering


Hierarchical Clustering Algorithm

1. Hierarchical Clustering (d, n)
2.   Form n clusters each with one element
3.   Construct a graph T by assigning one vertex to each cluster
4.   while there is more than one cluster
5.     Find the two closest clusters C1 and C2
6.     Merge C1 and C2 into new cluster C with |C1| + |C2| elements
7.     Compute distance from C to all other clusters
8.     Add a new vertex C to T and connect to vertices C1 and C2
9.     Remove rows and columns of d corresponding to C1 and C2
10.    Add a row and column to d corresponding to the new cluster C
11.  return T

The algorithm takes an n×n distance matrix d of pairwise distances between points as input.


Different ways to define distances between clusters lead to different clustering methods…
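The Hierarchical Clustering (d, n) pseudocode can be sketched in Python as follows. The sketch assumes single-link distances for step 7 (the distance from the merged cluster to another cluster is the smaller of the two old distances); that choice, the frozenset representation of clusters, and recording T as a list of merges are all assumptions of this example rather than part of the slides.

```python
def hierarchical_clustering(d, n):
    """Agglomerative clustering on an n x n distance matrix d.

    Returns the tree T as a list of merges (C1, C2, distance), in the
    order they happen. Clusters are frozensets of point indices."""
    clusters = [frozenset([i]) for i in range(n)]
    # dist maps an unordered pair of clusters to their current distance.
    dist = {
        frozenset([clusters[i], clusters[j]]): d[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    }
    tree = []
    while len(clusters) > 1:
        # Step 5: find the two closest clusters.
        pair = min(dist, key=dist.get)
        c1, c2 = tuple(pair)
        # Steps 6 and 8: merge them and record the new vertex of T.
        merged = c1 | c2
        tree.append((c1, c2, dist[pair]))
        clusters = [c for c in clusters if c not in (c1, c2)]
        # Steps 7, 9, 10: replace the entries for C1 and C2 by entries for C.
        # Single-link rule (assumed here): keep the smaller old distance.
        for other in clusters:
            dist[frozenset([merged, other])] = min(
                dist.pop(frozenset([c1, other])),
                dist.pop(frozenset([c2, other])),
            )
        del dist[pair]
        clusters.append(merged)
    return tree


# Hypothetical 1-D expression values, used only to build a small matrix.
points = [0, 1, 5, 6, 20]
d = [[abs(a - b) for b in points] for a in points]
tree = hierarchical_clustering(d, len(points))
```

Cutting the list of merges at a distance threshold, or after a fixed number of merges, yields a flat partition of the points.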


How to Compute Group Similarity?

Three Popular Methods:

Given two groups g1 and g2,

Single-link algorithm: s(g1,g2)= similarity of the closest pair

Complete-link algorithm: s(g1,g2)= similarity of the farthest pair

Average-link algorithm: s(g1,g2)= average of similarity of all pairs


Three Methods Illustrated

[Figure: single-link, complete-link, and average-link similarity between two groups g1 and g2]


Group Similarity Formulas

Given C (new cluster) and C* (any other cluster in the current results)

Given a similarity function s(x,y) (or equivalently, a distance function, d(x,y) as in the book)

Single link:

$$s_{\max}(C, C^*) = \max_{x \in C,\, y \in C^*} s(x, y) \quad\text{or}\quad d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$$

Complete link:

$$s_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} s(x, y) \quad\text{or}\quad d_{\max}(C, C^*) = \max_{x \in C,\, y \in C^*} d(x, y)$$

Average link:

$$s_{avg}(C, C^*) = \frac{\sum_{x \in C,\, y \in C^*} s(x, y)}{|C|\,|C^*|} \quad\text{or}\quad d_{avg}(C, C^*) = \frac{\sum_{x \in C,\, y \in C^*} d(x, y)}{|C|\,|C^*|}$$
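The distance forms of the three group measures translate almost word for word into code. In this sketch the clusters are plain lists and d is any pairwise distance function supplied by the caller; the function names and the example values are assumptions for illustration.

```python
from itertools import product


def single_link(C, C_star, d):
    """Distance of the closest pair (x in C, y in C_star)."""
    return min(d(x, y) for x, y in product(C, C_star))


def complete_link(C, C_star, d):
    """Distance of the farthest pair (x in C, y in C_star)."""
    return max(d(x, y) for x, y in product(C, C_star))


def average_link(C, C_star, d):
    """Average distance over all |C| * |C_star| pairs."""
    total = sum(d(x, y) for x, y in product(C, C_star))
    return total / (len(C) * len(C_star))


def absolute_difference(x, y):
    return abs(x - y)


# Hypothetical 1-D clusters.
C = [0, 1]
C_star = [4, 6]

d_min = single_link(C, C_star, absolute_difference)
d_max = complete_link(C, C_star, absolute_difference)
d_avg = average_link(C, C_star, absolute_difference)
```

With a similarity function in place of a distance function, min and max swap roles: single link takes the maximum similarity and complete link the minimum.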


Comparison of the Three Methods

Single-link

“Loose” clusters

Individual decision, sensitive to outliers

Complete-link

“Tight” clusters

Individual decision, sensitive to outliers

Average-link

“In between”

Group decision, insensitive to outliers

Which one is the best? Depends on what you need!


Method 2 (Model-based):

K-Means


A Model-based View of K-Means

Each cluster is modeled by a centroid point

A partition of the data V={v1…vn} with k clusters is modeled by k centroid points X={x1…xk}; Note that a model X uniquely specifies the clustering results

Define an objective function f(X,V) on X and V to measure how good or bad the clustering of V induced by the model X is

Clustering is reduced to the problem of finding a model X* that optimizes the value f(X*,V)

Once X* is found, the clustering results are simply computed according to our model X*

Major tasks: (1) define d(x,v); (2) define f(X,V); (3) Efficiently search for X*

Cluster of centroid x_i:

$$C_i(x_i) = \{\, v_j \in V \mid \forall x_l \in X,\ l \neq i:\ d(v_j, x_i) \le d(v_j, x_l) \,\}$$


Squared Error Distortion

● Given a set of n data points V={v1…vn} and a set of k points X={x1…xk}

● Define d(xi,vj)=Euclidean distance

● Define f(X,V)=d(X,V) (Squared error distortion)

$$d(X, V) = \frac{1}{n}\sum_{i=1}^{n} d(v_i, X)^2$$

$$d(v_i, X) = \min_{1 \le j \le k} d(v_i, x_j)$$

Here d(v_i, X) is the distance from v_i to the closest point in X
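The squared error distortion can be computed in a couple of lines. This is a minimal sketch assuming points are tuples of numbers and d is the Euclidean distance, as on the slide; the function name and the sample points are made up for the example.

```python
import math


def squared_error_distortion(X, V):
    """Mean, over all data points v in V, of the squared Euclidean
    distance from v to its closest center in X."""
    return sum(min(math.dist(v, x) for x in X) ** 2 for v in V) / len(V)


# Hypothetical data: four corners of a 4 x 2 rectangle.
V = [(0, 0), (0, 2), (4, 0), (4, 2)]

two_centers = squared_error_distortion([(0, 1), (4, 1)], V)
one_center = squared_error_distortion([(2, 1)], V)
```

Here two well-placed centers leave every point at distance 1, while a single center in the middle leaves every point at squared distance 5, so the first model has the lower distortion.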


K-Means Clustering Problem: Formulation

● Input: A set, V, consisting of n points and a parameter k

● Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(X,V) over all possible choices of X

This problem is NP-hard (its decision version is NP-complete).


K-Means Clustering

Given a distance function

Start with k randomly selected data points

Assume they are the centroids of k clusters

Assign every data point to a cluster whose centroid is the closest to the data point

Recompute the centroid for each cluster

Repeat this process until the distance-based objective function converges


K-Means Clustering: Lloyd Algorithm

1. Lloyd Algorithm
2.   Arbitrarily assign the k cluster centers
3.   while the cluster centers keep changing
4.     Assign each data point to the cluster Ci(xi) corresponding to the closest cluster representative (i.e., the center xi) (1 ≤ i ≤ k)
5.     After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster, that is, the new cluster representative is

$$x_i' = \frac{\sum_{v \in C_i} v}{|C_i|}$$

*This may lead to merely a locally optimal clustering.
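A minimal Python sketch of the Lloyd algorithm follows. It assumes points are tuples of numbers and Euclidean distance; the optional centers argument, the seed parameter, and the rule of leaving an empty cluster's center where it was are choices made for this example, not details given on the slide.

```python
import math
import random


def lloyd(V, k, centers=None, seed=0):
    """Lloyd's k-means heuristic on a list of points V.

    Returns (centers, clusters). If no initial centers are given,
    k distinct data points are drawn at random."""
    if centers is None:
        centers = random.Random(seed).sample(V, k)
    while True:
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            closest = min(range(k), key=lambda j: math.dist(v, centers[j]))
            clusters[closest].append(v)
        # Update step: each center moves to its cluster's center of gravity.
        # An empty cluster keeps its previous center (assumption of this sketch).
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            return centers, clusters
        centers = new_centers


# Hypothetical data: two small groups of three points each.
V = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd(V, k=2, centers=[(0, 0), (0, 1)])
```

Because the result depends on the starting centers, a common practice is to run the algorithm from several random starts and keep the clustering with the lowest squared error distortion.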

[Figures, four slides: scatter plots of expression in condition 1 (x-axis, 0 to 5) against expression in condition 2 (y-axis, 0 to 5), each marking the cluster centers x1, x2, and x3]

Evaluation of Clustering Results

How many clusters?

Score threshold in hierarchical clustering (need to have some confidence in the scores)

Try out multiple numbers, and see which one works the best (best could mean fitting the data better or discovering some known clusters)

Quality of a single cluster

A tight cluster is better (but we need to quantify the “tightness”)

Prior knowledge about clusters

Quality of a set of clusters

Quality of individual clusters

How well different clusters are separated

In general, evaluating a clustering and choosing the right clustering algorithm are difficult

But the good news is that significant clusters tend to show up regardless of which clustering algorithm is used


What You Should Know

Why clustering microarray data is interesting

How hierarchical agglomerative clustering works

The 3 different ways of computing the group similarity/distance

How k-means works