46
Normalization Review and Cluster Analysis Class web site: http://statwww.epfl.ch/davison/teaching/Microarr ays/ Statistics for Microarrays

Statistics for Microarrays

Embed Size (px)

DESCRIPTION

Statistics for Microarrays. Normalization Review and Cluster Analysis. Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/. Normalization Methods. Global adjustment (e.g. median or mean for a set of genes) - PowerPoint PPT Presentation

Citation preview

Page 1: Statistics for Microarrays

Normalization Review and Cluster Analysis

Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/

Statistics for Microarrays

Page 2: Statistics for Microarrays

Normalization Methods

• Global adjustment (e.g. median or mean for a set of genes)

• Intensity-dependent -- shift M by c=c(A); e.g. c(A) made with lowess

• Within print-tip group normalization to correct for spatial bias

• Both print-tip and intensity-dependent adjustment: perform LOWESS fits to the data within print-tip groups

Page 3: Statistics for Microarrays

Print-tip-group normalization

Page 4: Statistics for Microarrays
Page 5: Statistics for Microarrays

Normalization: Which Spots to use?

The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability.

For example, the use of a global LOWESS approach

can be justified by supposing that, when stratified by mRNA abundance, a) only a minority of genes are expected to be differentially expressed, or b) any differential expression is as likely to be up-regulation as down-regulation.

Pin-group LOWESS requires stronger assumptions:

that one of the above applies within each pin-group. The use of other sets of genes, e.g. control or

housekeeping genes, involve similar assumptions.

Page 6: Statistics for Microarrays

Pre-processed cDNA Gene Expression Data

On p genes for n slides: p is O(10,000), n is O(10-100), but growing,

Genes

Slides

Gene expression level of gene 5 in slide 4

= Log2( Red intensity / Green intensity)

slide 1 slide 2 slide 3 slide 4 slide 5 …

1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.

Page 7: Statistics for Microarrays

Cluster analysis

• Used to find groups of objects when not already known

• “Unsupervised learning”• Associated with each object is a set

of measurements (the feature vector)

• Aim is to identify groups of similar objects on the basis of the observed measurements

Page 8: Statistics for Microarrays

Clustering Gene Expression Data

• Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes

• Can cluster samples (columns), e.g. to identify tumors based on profiles

• Can cluster both rows and columns at the same time

Page 9: Statistics for Microarrays

Clustering Gene Expression Data

• Leads to readily interpretable figures

• Can be helpful for identifying patterns in time or space

• Useful (essential?) when seeking new subclasses of samples

• Can be used for exploratory purposes

Page 10: Statistics for Microarrays

Similarity

• Similarity sij indicates the strength of relationship between two objects i and j

• Usually 0 ≤ sij ≤1

• Correlation-based similarity ranges from –1 to 1

Page 11: Statistics for Microarrays

Problems using correlation

Page 12: Statistics for Microarrays

Dissimilarity and Distance

• Associated with similarity measures sij

bounded by 0 and 1 is a dissimilarity dij = 1 - sij

• Distance measures have the metric property (dij +dik ≥ djk)

• Many examples: Euclidean, Manhattan, etc.

• Distance measure has a large effect on performance

• Behavior of distance measure related to scale of measurement

Page 13: Statistics for Microarrays

Partitioning Methods

• Partition the objects into a prespecified number of groups K

• Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within cluster sums of squares)

• Examples: k-means, partitioning around medoids (PAM), self-organizing maps (SOM), model-based clustering

Page 14: Statistics for Microarrays

Hierarchical Clustering

• Produce a dendrogram• Avoid prespecification of the number

of clusters K • The tree can be built in two distinct

ways: – Bottom-up: agglomerative clustering– Top-down: divisive clustering

Page 15: Statistics for Microarrays

Agglomerative Methods• Start with n mRNA sample (or G gene) clusters• At each step, merge the two closest clusters

using a measure of between-cluster dissimilarity which reflects the shape of the clusters

• Examples of between-cluster dissimilarities: – Unweighted Pair Group Method with Arithmetic Mean

(UPGMA): average of pairwise dissimilarities– Single-link (NN): minimum of pairwise dissimilarities– Complete-link (FN): maximum of pairwise

dissimilarities

Page 16: Statistics for Microarrays

Divisive Methods

• Start with only one cluster• At each step, split clusters into two

parts• Advantage: Obtain the main structure

of the data (i.e. focus on upper levels of dendrogram)

• Disadvantage: Computational difficulties when considering all possible divisions into two groups

Page 17: Statistics for Microarrays

Partitioning vs. Hierarchical

• Partitioning– Advantage: Provides clusters that satisfy

some optimality criterion (approximately)– Disadvantages: Need initial K, long

computation time

• Hierarchical– Advantage: Fast computation

(agglomerative)– Disadvantages: Rigid, cannot correct later

for erroneous decisions made earlier

Page 18: Statistics for Microarrays

Generic Clustering Tasks

• Estimating number of clusters

• Assigning each object to a cluster

• Assessing strength/confidence of cluster assignments for individual objects

• Assessing cluster homogeneity

Page 19: Statistics for Microarrays
Page 20: Statistics for Microarrays

Estimating Number of Clusters

Page 21: Statistics for Microarrays

(BREAK)

Page 22: Statistics for Microarrays

Bittner et al.

It has been proposed (by many) that a

cancer taxonomy can be identified

from gene expression experiments.

Page 23: Statistics for Microarrays

Dataset description

• 31 melanomas (from a variety of tissues/cell lines)

• 7 controls• 8150 cDNAs• 6971 unique genes• 3613 genes ‘strongly detected’

Page 24: Statistics for Microarrays

This is why you need to take logs!

Page 25: Statistics for Microarrays

After logging…

Page 26: Statistics for Microarrays

How many clusters are present?

Page 27: Statistics for Microarrays

‘cluster’

unclustered

Average linkage hierarchical clustering, melanoma only

1- = .54

Page 28: Statistics for Microarrays

Issues in Clustering

• Pre-processing (Image analysis and Normalization)

• Which genes (variables) are used • Which samples are used• Which distance measure is used• Which algorithm is applied• How to decide the number of clusters K

Page 29: Statistics for Microarrays

Issues in Clustering

• Pre-processing (Image analysis and Normalization)

•Which genes (variables) are used

• Which samples are used• Which distance measure is used• Which algorithm is applied• How to decide the number of clusters K

Page 30: Statistics for Microarrays

Filtering Genes

• All genes (i.e. don’t filter any)• At least k (or a proportion p) of the

samples must have expression values larger than some specified amount, A

• Genes showing “sufficient” variation– a gap of size A in the central portion of the

data– a interquartile range of at least B

• Filter based on statistical comparison– t-test– ANOVA– Cox model, etc.

Page 31: Statistics for Microarrays

Issues in Clustering

• Pre-processing (Image analysis and Normalization)

• Which genes (variables) are used

•Which samples are used• Which distance measure is used• Which algorithm is applied• How to decide the number of clusters K

Page 32: Statistics for Microarrays

‘cluster’

unclustered

Average linkage hierarchical clustering, melanoma only

Page 33: Statistics for Microarrays

‘cluster’control

unclustered

Average linkage hierarchical clustering, melanoma & controls

Page 34: Statistics for Microarrays

Issues in clustering

• Pre-processing• Which genes (variables) are used• Which samples are used

•Which distance measure is used

• Which algorithm is applied• How to decide the number of clusters K

Page 35: Statistics for Microarrays

Complete linkage (FN) hierarchical clustering

Page 36: Statistics for Microarrays

Single linkage (NN) hierarchical clustering

Page 37: Statistics for Microarrays

Issues in clustering

• Pre-processing• Which genes (variables) are used• Which samples are used• Which distance measure is used

•Which algorithm is applied• How to decide the number of clusters

K

Page 38: Statistics for Microarrays

Divisive clustering, melanoma only

Page 39: Statistics for Microarrays

Divisive clustering, melanoma & controls

Page 40: Statistics for Microarrays

Partitioning methods K-means and PAM, 2

groupsBittner K-means PAM #

samples

1 1 1 10

111

122

212

018

222

112

121

106

2 2 2 5

Page 41: Statistics for Microarrays

Bittner K-means PAM # samples

1 1 1 11

112

121

221

261

2 2 2 4

22233

23323

31331

12413

3 3 3 3

Page 42: Statistics for Microarrays

Issues in clustering

• Pre-processing• Which genes (variables) are used• Which samples are used• Which distance measure is used• Which algorithm is applied

•How to decide the number of clusters K

Page 43: Statistics for Microarrays

How many clusters K?

• Many suggestions for how to decide this!• Milligan and Cooper (Psychometrika

50:159-179, 1985) studied 30 methods• Some new methods include GAP (Tibshirani

) and clest (Fridlyand and Dudoit)• Applying several methods yielded

estimates of K = 2 (largest cluster has 27 members) to K = 8 (largest cluster has 19 members)

Page 44: Statistics for Microarrays

cluster

unclustered

Average linkage hierarchical clustering, melanoma only

Page 45: Statistics for Microarrays

Summary

• Buyer beware – results of cluster analysis should be treated with GREAT CAUTION and ATTENTION TO SPECIFICS, because…

• Many things can vary in a cluster analysis

• If covariates/group labels are known, then clustering is usually inefficient

Page 46: Statistics for Microarrays

Acknowledgements

IPAM Group, UCLA:Debashis GhoshErin ConlonDirk HolsteSteve HorvathLei LiHenry LuEduardo NevesMarcia SalzanoXianghong Zhao

Others:Sandrine DudoitJane FridlyandJose CorreaTerry SpeedWilliam LemonFred Wright