Some Statistical Issues in Microarray Data Analysis Alex Sánchez Estadística i Bioinformàtica Departament d’Estadística Universitat de Barcelona Unitat

Some Statistical Issues in Microarray Data Analysis

Alex Sánchez

Estadística i Bioinformàtica, Departament d’Estadística, Universitat de Barcelona
Unitat d’Estadística i Bioinformàtica, IR-HUVH

Outline

• Introduction
• Experimental design
• Selecting differentially expressed genes
  – Statistical tests
  – Significance testing
  – Linear models and analysis of variance
  – Multiple testing
• Software for microarray data analysis

Introduction


Microarray experiments: Overview

Why are we talking about statistics?

A microarray experiment is, as its name says, an experiment. That is:
• It has been performed to determine whether some previously stated hypotheses are true or false (although it can also lead to new hypotheses).
• It is subject to errors, which may arise from many sources.

Sources of variability

• Biological heterogeneity in the population
• Specimen collection and handling effects
  – Tumor: surgical biopsy, fine-needle aspiration (FNA)
  – Cell line: culture conditions, confluence level
• Biological heterogeneity within the specimen
• RNA extraction
• RNA amplification
• Fluor labeling
• Hybridization
• Scanning
  – PMT voltage
  – Laser power

(Geschwind, Nature Reviews Neuroscience, 2001)

Categories of variability

Systematic variability
• Amount of RNA in the biopsy
• Efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection

Random variation
• PCR yield
• DNA quality
• Spotting efficiency, spot size
• Cross-hybridization and unspecific hybridization
• Stray signal

Dealing with systematic variability

• Systematic variability has similar effects on many measurements.
• Corrections can be estimated from the data.
• CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability.

Dealing with random variation

• Random variation cannot be explicitly accounted for.
• The usual way to deal with it is to assume some ERROR MODEL, e.g. $e_i \sim N(0, \sigma^2)$.
• Assuming these error models hold:
  – EXPERIMENTAL DESIGN is (must be) used to control the action of random variation.
  – STATISTICAL INFERENCE is (must be) used to extract conclusions in the presence of random variation.

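[Figure: workflow of a microarray study. Biological question → Experimental design → Microarray experiment → Image analysis → Normalization → Quality measurement (Pass / Failed) → Analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation. Part of the workflow carries a marker labelled "Today".]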

Experimental design

Why experimental design?

The objective of experimental design is to make the analysis of the data and the interpretation of the results
• as simple and as powerful as possible,
• given the purpose of the experiment
• and the constraints of the experimental material.

Scientific aims and design choice

The primary focus of the experiment needs to be clearly stated, whether it is
• to identify differentially expressed genes,
• to search for specific gene-expression patterns, or
• to identify phenotypic subclasses.

The aim of the experiment guides the design choice:
• sometimes only one choice is reasonable;
• sometimes different options are available.

Designing microarray experiments

The appropriate design of a microarray experiment must consider
• the design of the array, and
• the allocation of mRNA samples to the slides.

I: Layout of the array

• Which sequences to use
  – cDNAs: selection of cDNA from a library (Riken, NIA, etc.)
  – Affymetrix: PMs and MMs
  – Oligo probe selection (from Operon, Agilent, etc.)
• Control probes
  – What percentage? Where should the controls be placed?
• How many sequences to use
  – Should there be replicate spots within a slide?

II: Allocating samples in slides

• Types of samples
  – Replication: technical vs biological
  – Pooled vs individual samples
• Different design layouts / data analyses
  – Scientific aim of the experiment
  – Efficiency, robustness, extensibility
• Physical limitations (cost)
  – Number of slides
  – Amount of material

Basic principles of experimental design

Apply the following principles to best attain the objectives of experimental design:
• Replication
• Local control or blocking
• Randomization

1. Replication

It is important
• to reduce uncertainty (increase precision): $\mathrm{var}(\bar{X}) = \sigma^2 / n$,
• to obtain sufficient power for the tests,
• as a formal basis for inferential procedures.

Consider different types of replicates:
• Technical
  – Duplicate spots
  – Multiple hybridizations from the same sample
• Biological

Replicate most whatever is expected to vary most!
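The precision gain from replication can be checked numerically. The following is a minimal sketch, assuming replicates are independent draws from N(0, σ²); the function name, the number of simulated experiments and the seed are illustrative choices, not part of the original slides.

```python
import random
from statistics import variance


def variance_of_mean(n, sigma=1.0, n_experiments=20000, seed=1):
    """Empirical variance of the mean of n replicates drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        for _ in range(n_experiments)
    ]
    return variance(means)


if __name__ == "__main__":
    # Compare the simulated variance of the mean with the theoretical sigma^2 / n.
    for n in (1, 2, 4, 8):
        print(n, round(variance_of_mean(n), 3), "theory:", 1.0 / n)
```

Doubling the number of replicates halves the variance of the mean, so the standard error shrinks only with the square root of n.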

Biological vs Technical Replicates

[Figure: variance components in a replicated experiment, labelled $\sigma^2_B$, $\sigma^2_A$ and $\sigma^2_e$. © Nature Reviews & G. Churchill (2002)]

Replication vs Pooling

mRNA from different samples is often combined to form a "pooled sample" or pool. Why?
• If each sample doesn’t yield enough mRNA
• To compensate for an excess of variability?

Statisticians tend not to like it, but pooling may be OK if properly done:
• Combine several samples in each pool.
• Use several pools from different samples.
• Do not use pools when individual information is important (e.g. paired designs).

2. Blocking

• Assume we wish to perform an experiment to compare two treatments.
• The samples or their processing may not be homogeneous: there are blocks.
  – Subjects: male / female
  – Arrays produced in two lots (February, March)
• If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded.
  – Are the observed differences attributable to the treatment effect or to confounding factors?

Confounding block with treatment effects

Two alternative designs to investigate treatment effects:

Awful design                          Balanced design
Sample  Treatment  Sex     Batch      Sample  Treatment  Sex     Batch
1       A          Male    1          1       A          Male    1
2       A          Male    1          2       A          Female  2
3       A          Male    1          3       A          Male    1
4       A          Male    1          4       A          Female  2
5       B          Female  2          5       B          Male    1
6       B          Female  2          6       B          Female  2
7       B          Female  2          7       B          Male    1
8       B          Female  2          8       B          Female  2

• Left: treatment effects are confounded with sex and batch effects.
• Right: treatments are balanced between blocks.
  – The influence of blocks is automatically compensated.
  – Statistical analysis may separate block effects from treatment effects.
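Whether a layout confounds treatment with a blocking factor can be checked by cross-tabulating the two. The sketch below encodes the two designs from the table; the helper names `crosstab` and `is_balanced` are hypothetical, and "balanced" is taken here to mean that every treatment appears equally often in every block level.

```python
from collections import Counter

# Each row is (treatment, sex, batch), copied from the two designs in the table.
awful = [("A", "Male", 1)] * 4 + [("B", "Female", 2)] * 4
balanced = [
    ("A", "Male", 1), ("A", "Female", 2), ("A", "Male", 1), ("A", "Female", 2),
    ("B", "Male", 1), ("B", "Female", 2), ("B", "Male", 1), ("B", "Female", 2),
]


def crosstab(design, block_index):
    """Count samples for every (treatment, block level) combination."""
    return Counter((row[0], row[block_index]) for row in design)


def is_balanced(design, block_index):
    """True if every treatment occurs equally often in every block level."""
    treatments = {row[0] for row in design}
    blocks = {row[block_index] for row in design}
    counts = crosstab(design, block_index)
    cell_counts = {counts.get((t, b), 0) for t in treatments for b in blocks}
    return len(cell_counts) == 1


if __name__ == "__main__":
    for name, design in (("awful", awful), ("balanced", balanced)):
        print(name, "sex:", is_balanced(design, 1), "batch:", is_balanced(design, 2))
```

In the awful design all treatment A samples are male and from batch 1, so the treatment effect cannot be separated from either block effect.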

3. Randomisation

• Randomly assign samples to groups to eliminate unspecific disturbances.
  – Randomly assign individuals to treatments.
  – Randomise the order in which experiments are performed.
• Randomisation is required to ensure the validity of statistical procedures.
• Block what you can and randomize what you cannot.
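"Block what you can and randomize what you cannot" can be sketched as a random assignment of treatments within each block. This is only an illustration: the sample labels, the two treatments and the seed are made up, and the function name is hypothetical.

```python
import random


def randomize_within_blocks(samples_by_block, treatments=("A", "B"), seed=None):
    """Shuffle the samples inside each block, then alternate the treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block, samples in samples_by_block.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for position, sample in enumerate(shuffled):
            assignment[sample] = treatments[position % len(treatments)]
    return assignment


if __name__ == "__main__":
    # Hypothetical subjects, blocked by sex.
    blocks = {"Male": ["m1", "m2", "m3", "m4"], "Female": ["f1", "f2", "f3", "f4"]}
    plan = randomize_within_blocks(blocks, seed=42)
    # The processing order is randomised as well.
    run_order = list(plan)
    random.Random(7).shuffle(run_order)
    for sample in run_order:
        print(sample, plan[sample])
```

Each block receives the same number of samples per treatment, while which individual gets which treatment is left to chance.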

Experimental layout

• How are mRNA samples assigned to arrays?
• The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible.
  – Sometimes there is only one reasonable choice.
  – Sometimes several choices are available.

Example 1: Only one design choice

Case 1: Meaningful biological control (C)
• Samples: liver tissue from 4 mice treated with cholesterol-modifying drugs.
• Question 1: which genes respond differently between the treatments (T) and the control (C)?
• Question 2: which genes respond similarly across two or more treatments relative to the control?
• [Figure: each of T1, T2, T3, T4 is hybridized against the control C.]

Case 2: Use of a universal reference
• Samples: different tumor samples.
• Question: to discover tumor subtypes.
• [Figure: each of T1, T2, ..., Tn-1, Tn is hybridized against a common reference Ref.]

Example 2: A number of different designs are suitable for use

(1) Time course experiments: the design choice depends on the comparisons of interest.

[Figure: four alternative layouts for the time points T1, T2, T3, T4, one of which uses a common reference (Ref).]

How can we decide?

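A-optimality: choose the design that minimizes the variance of the estimates of the effects of interest.

A simple example: direct vs indirect estimates of log(A/B), where each slide yields a log-ratio with variance $\sigma^2$.

            Direct                 Indirect
Layout      A vs B                 A vs R and B vs R
Estimate    average(log(A/B))      log(A/R) - log(B/R)
Variance    $\sigma^2 / 2$         $2\sigma^2$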
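The two variances follow from a short calculation. It is sketched here under assumptions that the slide does not state explicitly: both designs use two slides, the log-ratios from different slides are independent, and each has variance $\sigma^2$. $M_1$ and $M_2$ denote the two direct measurements of $\log(A/B)$.

```latex
\begin{align*}
\text{Direct:}\quad
  & \operatorname{Var}\!\left(\tfrac{1}{2}(M_1 + M_2)\right)
    = \tfrac{1}{4}\left(\sigma^2 + \sigma^2\right)
    = \frac{\sigma^2}{2}, \\
\text{Indirect:}\quad
  & \operatorname{Var}\!\left(\log(A/R) - \log(B/R)\right)
    = \sigma^2 + \sigma^2
    = 2\sigma^2 .
\end{align*}
```

Under these assumptions the direct design is four times as efficient as the indirect one for the comparison of A with B.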

Summary

• Selection of mRNA samples is important.
  – Most important: biological replicates.
  – Technical replicates are also useful, but different.
  – If needed and possible, use pooling wisely.
• The choice of experimental layout is guided by
  – the scientific question,
  – experimental design principles,
  – efficiency and robustness considerations.
• The correspondence between experimental designs, linear models and ANOVA can be exploited to select a model and analyze the data.

Experimental design, Linear Models and Analysis of Variance

• In experimental design the different sources of variability influencing the observed response may be identified.
• These sources can be related to the response using a linear model.
• Analysis of variance can be used to separately estimate and test the relative importance of each source of variability.

Statistical methods to detect differentially expressed genes

Class comparison: Identifying differentially expressed genes

• Identify genes differentially expressed between different conditions such as
  – treatment, cell type, ... (qualitative covariates),
  – dose, time, ... (quantitative covariates),
  – survival, infection time, ...
• Estimate effects/differences between groups, typically using log-ratios, i.e. the difference on the log scale: log(X) - log(Y) = log(X/Y).
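As a small illustration of the log-ratio, the sketch below uses base-2 logarithms (the base that appears later in the swirl example) and made-up intensities; the function name is hypothetical.

```python
import math


def log_ratio(x, y):
    """M = log2(X) - log2(Y) = log2(X / Y), for positive intensities X and Y."""
    return math.log2(x) - math.log2(y)


if __name__ == "__main__":
    # Hypothetical intensities: a two-fold increase and a two-fold decrease.
    print(log_ratio(200.0, 100.0))
    print(log_ratio(100.0, 200.0))
```

On the log scale, up- and down-regulation of the same magnitude are symmetric around zero, which is one reason log-ratios are preferred to raw ratios.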


What is a “significant change”?

Depends on the variability within groups, which may be different from gene to gene.

To assess the statistical significance of differences, conduct a statistical test for each gene.

Different settings for statistical tests

Indirect comparisons: two groups, two samples, unpaired
• E.g. 10 individuals: 5 suffering from diabetes, 5 healthy.
• One sample from each individual.
• Typically: two-sample t-test or similar.

Direct comparisons: two groups, two samples, paired
• E.g. 6 individuals with brain stroke.
• Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2).
• Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions.

Different ways to do the experiment

An experiment may use cDNA arrays ("two-colour") or Affymetrix arrays ("one-colour"). The allocation of conditions to slides changes with the technology used.

Experiment              cDNA (2-colour)                      Affy (1-colour)
10 individuals:         Reference design:                    Comparison design:
Diab (5), Heal (5)      (5) Diab/Ref, (5) Heal/Ref           (5) Diab vs (5) Heal
6 individuals:          6 slides, 1 individual per slide:    12 slides:
Region 1, Region 2      (6) reg1/reg2                        (6) paired differences

"Natural" measures of discrepancy

For direct comparisons in two-colour arrays or paired one-colour arrays:
• Mean (log) ratio: $\bar{R} = \frac{1}{n_T} \sum_{i=1}^{n_T} R_i$ (R or M used interchangeably)
• Classical t-test: $t = \bar{R} / SE$ (SE estimates the standard error of $\bar{R}$)
• Robust t-test: use robust estimates of location and scale

For indirect comparisons in two-colour arrays or direct comparisons in one-colour arrays:
• Mean difference: $\bar{T} - \bar{C} = \frac{1}{n_T} \sum_{i=1}^{n_T} T_i - \frac{1}{n_C} \sum_{i=1}^{n_C} C_i$
• Classical t-test: $t = (\bar{T} - \bar{C}) / \left( s_p \sqrt{1/n_T + 1/n_C} \right)$
• Robust t-test: use robust estimates of location and scale
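The two classical statistics can be written out directly. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the pooled standard deviation $s_p$ is computed under the usual equal-variance assumption.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t(log_ratios):
    """t = mean(R) / SE, with SE = sd(R) / sqrt(n): paired or direct comparisons."""
    n = len(log_ratios)
    return mean(log_ratios) / (stdev(log_ratios) / sqrt(n))


def two_sample_t(treatment, control):
    """Pooled-variance t = (mean(T) - mean(C)) / (s_p * sqrt(1/n_T + 1/n_C))."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var * (1 / n_t + 1 / n_c))


if __name__ == "__main__":
    # Toy values, for illustration only.
    print(one_sample_t([1.0, 2.0, 3.0]))
    print(two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

A robust variant would replace the mean and standard deviation by robust estimates of location and scale, such as the median and the MAD.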

Some Issues

• Can we trust average effect sizes (average difference of means) alone?
• Can we trust the t statistic alone?
• Here is evidence that the answer is no.

Gene   M1     M2     M3      M4     M5     M6     Mean    SD     t
A      2.5    2.7    2.5     2.8    3.2    2      2.61    0.40   16.10
B      0.01   0.05   -0.05   0.01   0      0      0.003   0.03   0.25
C      2.5    2.7    2.5     1.8    20     1      5.08    7.34   1.69
D      0.5    0      0.2     0.1    -0.3   0.3    0.13    0.27   1.19
E      0.1    0.11   0.1     0.1    0.11   0.09   0.10    0.01   33.09

Courtesy of Y.H. Yang

• Averages can be driven by outliers (gene C).
• t statistics can be driven by tiny variances (gene E).
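The summary columns of the table can be recomputed from the six log-ratios of each gene. The sketch below assumes that t is the one-sample statistic mean / (SD / sqrt(n)), which reproduces the tabulated values up to rounding.

```python
from math import sqrt
from statistics import mean, stdev

# Log-ratios M1..M6 for the five genes in the table (courtesy of Y.H. Yang).
genes = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2.0],
    "B": [0.01, 0.05, -0.05, 0.01, 0.0, 0.0],
    "C": [2.5, 2.7, 2.5, 1.8, 20.0, 1.0],
    "D": [0.5, 0.0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}


def summarize(values):
    """Return (mean, standard deviation, one-sample t statistic)."""
    m, s = mean(values), stdev(values)
    return m, s, m / (s / sqrt(len(values)))


if __name__ == "__main__":
    for gene, values in genes.items():
        m, s, t = summarize(values)
        print(f"{gene}: mean={m:.3f} sd={s:.3f} t={t:.2f}")
    # The two rankings disagree about which gene is most interesting.
    print("largest mean:", max(genes, key=lambda g: summarize(genes[g])[0]))
    print("largest t:   ", max(genes, key=lambda g: summarize(genes[g])[2]))
```

Ranking by the mean puts gene C first because of a single value of 20, while ranking by t puts gene E first because its standard deviation is close to zero.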

Variations in t-tests (1)

Let
• $R_g$ = mean observed log ratio for gene g,
• $SE_g$ = standard error of $R_g$ estimated from the data on gene g,
• $SE$ = standard error of $R_g$ estimated from the data across all genes.

Global t-test: $t = R_g / SE$
Gene-specific t-test: $t = R_g / SE_g$

Some pros and cons of t-tests

Test                             Pros                                 Cons
Global t-test: t = R_g/SE        Yields a stable variance estimate    Assumes variance homogeneity; biased if false
Gene-specific: t = R_g/SE_g      Robust to variance heterogeneity     Low power; yields unstable variance estimates (due to few data)

T-tests extensions

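SAM (Tibshirani, 2001):
$S = \frac{R_g}{c + SE_g}$

Regularized t (Baldi, 2001):
$t = \frac{R_g}{\sqrt{\dfrac{v_0 \, SE^2 + (n-1) \, SE_g^2}{v_0 + n - 2}}}$

EB-moderated t (Smyth, 2003):
$t = \frac{R_g}{\sqrt{\dfrac{d_0 \, SE_0^2 + d \, SE_g^2}{d_0 + d}}}$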
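All three extensions stabilise the denominator of the gene-specific t. The sketch below transcribes the three formulas literally; it is not the SAM, Cyber-T or limma implementation. In practice the constants c, v0, d0 and SE0 are estimated from all genes, whereas here they are plain arguments with made-up values.

```python
from math import sqrt


def sam_statistic(r_g, se_g, c):
    """SAM-type statistic: a fudge constant c is added to the standard error."""
    return r_g / (c + se_g)


def regularized_t(r_g, se_g, se_global, n, v0):
    """Regularized t: weighted mix of the global and the gene-specific variance."""
    mixed_var = (v0 * se_global**2 + (n - 1) * se_g**2) / (v0 + n - 2)
    return r_g / sqrt(mixed_var)


def moderated_t(r_g, se_g, se0, d, d0):
    """EB-moderated t: gene-specific variance shrunk towards a prior value se0^2."""
    shrunk_var = (d0 * se0**2 + d * se_g**2) / (d0 + d)
    return r_g / sqrt(shrunk_var)


if __name__ == "__main__":
    # A gene like E in the earlier table: small effect, tiny standard error.
    r_g, se_g = 0.10, 0.003
    print("plain t:    ", r_g / se_g)
    print("SAM S:      ", sam_statistic(r_g, se_g, c=0.1))
    print("moderated t:", moderated_t(r_g, se_g, se0=0.1, d=5, d0=5))
```

With a tiny gene-specific standard error the plain t is very large, while the stabilised statistics stay moderate.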

Up to here: can we generate a list of candidate genes?

With the tools we have, the reasonable steps to generate a list of candidate genes may be:

Gene 1: M11, M12, ..., M1k
Gene 2: M21, M22, ..., M2k
...
Gene G: MG1, MG2, ..., MGk
→ For every gene, calculate Si = t(Mi1, Mi2, ..., Mik), e.g. t-statistics, S, B, ...
→ Statistics of interest: S1, S2, ..., SG
→ ?
→ A list of candidate DE genes

We need an idea of how significant these values are: we would like to assign them p-values.

Significance testing


Nominal p-values

After a test statistic is computed, it is convenient to convert it to a p-value: the probability that the test statistic, say S(X), takes a value equal to or greater than the value observed on the sample, say S(X0), under the assumption that the null hypothesis is true:

$p = P\{ S(X) \ge S(X_0) \mid H_0 \text{ true} \}$

Significance testing

• Test of significance at level $\alpha$: reject the null hypothesis if the p-value is smaller than the significance level.
• This approach has advantages, but it is not free from criticism.
• Genes with p-values falling below a prescribed level may be regarded as significant.

Hypothesis testing overview for a single gene

                               Reported decision
State of nature ("Truth")      H0 is rejected                H0 is accepted
                               (gene is selected)            (gene not selected)
H0 is false (affected)         TP, prob: 1 - β (power)       FN, prob: β (Type II error)
H0 is true (not affected)      FP, P[Rej H0 | H0] ≤ α        TN, prob: ≥ 1 - α
                               (Type I error)

• Sensitivity = TP / [TP + FN]
• Specificity = TN / [TN + FP]
• Positive predictive value = TP / [TP + FP]
• Negative predictive value = TN / [TN + FN]
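The four derived measures are simple ratios of the cell counts. A minimal sketch follows, with made-up counts that are only meant to show the arithmetic; the function name is hypothetical.

```python
def classification_summary(tp, fn, fp, tn):
    """Sensitivity, specificity and predictive values from the four cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical outcome for 1000 genes, 100 of which are truly affected.
    for name, value in classification_summary(tp=80, fn=20, fp=50, tn=850).items():
        print(f"{name}: {value:.3f}")
```

Even with high specificity, the positive predictive value can be modest when few genes are truly affected, which is the typical situation in microarray studies.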

Calculation of p-values

Standard methods for calculating p-values:
(i) refer to a statistical distribution table (Normal, t, F, ...), or
(ii) perform a permutation analysis.

(i) Tabulated p-values

• Tabulated p-values can be obtained for standard test statistics (e.g. the t-test).
• They often rely on the assumption of normally distributed errors in the data.
• This assumption can be checked (approximately) using a
  – histogram,
  – Q-Q plot.

Example

• Golub data: 27 ALL vs 11 AML samples, 3051 genes.
• A t-test yields 1045 genes with p < 0.05.

(ii) Permutation tests

• Based on data shuffling; no distributional assumptions (only that the samples are exchangeable under H0).
• Labels are randomly interchanged between samples.
• p-values for each comparison (gene) are estimated using the permutation distribution of the t-statistics.
• Repeat for every possible permutation, b = 1, ..., B:
  – Permute the n data points for the gene (x). The first n1 are referred to as "treatments", the remaining n2 as "controls".
  – For each gene, calculate the corresponding two-sample t-statistic, $t_b$.
• After all B permutations are done, put $p = \#\{ b : |t_b| \ge |t_{observed}| \} / B$.
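For one gene and small groups, the procedure can be sketched by enumerating every split of the pooled values into n1 "treatments" and n2 "controls". The function names are illustrative; with larger groups a random subset of B permutations would be drawn instead of the full enumeration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (undefined if all values are equal)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))


def permutation_p_value(treatments, controls):
    """Exact two-sided permutation p-value: #{b: |t_b| >= |t_observed|} / B."""
    pooled = list(treatments) + list(controls)
    n1 = len(treatments)
    t_observed = abs(pooled_t(treatments, controls))
    hits = total = 0
    for chosen in combinations(range(len(pooled)), n1):
        chosen_set = set(chosen)
        x = [pooled[i] for i in chosen]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # A small tolerance guards against floating-point ties.
        if abs(pooled_t(x, y)) >= t_observed - 1e-12:
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Toy data: perfectly separated groups of three.
    print(permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]))
```

With three samples per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 2/20 = 0.1: permutation p-values are limited by the number of replicates.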

Permutation tests (2)

Volcano plot: fold change vs log(odds) (1)

[Figure: volcano plot with the regions "Significant change detected" and "No change detected".]

(1) log(odds) is proportional to "-log(p-value)".

Linear models and Analysis of Variance to analyze designed experiments

From experimental design to linear models

• Some weaknesses of the statistical framework presented so far:
  – What to do if the treatment has more than 2 levels?
  – How to deal with more than one treatment or experimental condition?
  – How to deal with nuisance factors such as batch effects, covariates, etc.?
• Most of these can be addressed with an alternative approach: Analysis of Variance.

Multiple testing

How far can we trust the decision?

• The test "Reject H0 if p-value ≤ α" is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold.
• Nothing is guaranteed about P[FN].
  – "Optimal" tests are built trying to minimize this probability.
  – In practical situations it is often high.

What if we wish to test more than one gene at once? (1)

• Consider more than one test at once (assuming independent tests):
  – Two tests, each at the 5% level: the probability of getting at least one false positive is 1 - 0.95 × 0.95 = 0.0975.
  – Three tests: 1 - 0.95^3 = 0.1426.
  – n tests: 1 - 0.95^n, which converges towards 1 as n increases.
• Small p-values don’t necessarily imply significance! We are not controlling the probability of type I error anymore.
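The growth of this probability with the number of tests is easy to tabulate. A minimal sketch, assuming independent tests each carried out at level α (the function name is illustrative):

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100, 1000):
        print(n, round(prob_at_least_one_false_positive(n), 4))
```

With a few thousand genes, as in a typical microarray, at least one false positive is practically certain when every gene is tested at the 5% level.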

What if we wish to test more than one gene at once? (2): a simulation

• Simulation of this process for 6,000 genes with 8 treatments and 8 controls.
• All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation.
• The number of genes falsely rejected will be on average 6000 · α, i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes.

See example
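A simulation along these lines can be sketched with the standard library alone. This is not the example shown in the original talk. To avoid needing a t-distribution function, a gene is rejected when |t| exceeds 2.977, the two-sided 1% critical value of Student's t with 14 degrees of freedom; the seed is arbitrary.

```python
import random
from math import sqrt

# Two-sided 1% critical value of Student's t with 8 + 8 - 2 = 14 degrees of freedom.
T_CRITICAL_1PCT_14DF = 2.977


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    sp2 = ss / (nx + ny - 2)
    return (mx - my) / sqrt(sp2 * (1 / nx + 1 / ny))


def count_false_rejections(n_genes=6000, n_per_group=8, seed=0):
    """Number of null genes with |t| above the 1% critical value."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_genes):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(treated, control)) > T_CRITICAL_1PCT_14DF:
            rejected += 1
    return rejected


if __name__ == "__main__":
    print("falsely rejected genes:", count_false_rejections())
    print("expected on average:   ", 6000 * 0.01)
```

The count varies from run to run but stays close to the expected 60, even though no gene is differentially expressed.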

Multiple testing: Counting errors

                               Decision reported
State of nature ("Truth")      H0 is rejected        H0 is accepted           Total
                               (genes selected)      (genes not selected)
H0 is false (affected)         S                     T                        m - m0
H0 is true (not affected)      V                     U                        m0
Total                          R                     m - R                    m

• V = number of Type I errors (false positives)
• T = number of Type II errors (false negatives)
• R = number of genes selected, so that S = R - V, U = m0 - V and T = (m - m0) - S
• All these quantities could be known if m0 was known.

How does type I error control extend to multiple testing situations?

• Selecting genes with a p-value less than α doesn’t control P[FP] anymore.
• What can be done?
  – Extend the idea of type I error: FWER and FDR are two such extensions.
  – Look for procedures that control the probability of these extended error types, mainly by adjusting the raw p-values.

Two main error rate extensions

• Family-Wise Error Rate (FWER): the probability of at least one false positive.
  FWER = Pr(number of false discoveries > 0) = Pr(V > 0)
• False Discovery Rate (FDR): the expected value of the proportion of false positives among the rejected null hypotheses.
  FDR = E[V/R; R > 0] = E[V/R | R > 0] · P[R > 0]

FDR and FWER controlling procedures

FWER
• Bonferroni: adjusted p-value = min{n × p-value, 1}
• Holm (1979)
• Hochberg (1988)
• Westfall & Young (1993): maxT and minP

FDR
• Benjamini & Hochberg (1995)
• Benjamini & Yekutieli (2001)
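Three of these adjustments are short enough to sketch in full. The code below assumes the usual definitions of the Bonferroni, Holm (step-down) and Benjamini & Hochberg (step-up) adjusted p-values; the function names and the four raw p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Adjusted p = min(m * p, 1): controls the FWER."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment: controls the FWER, never less powerful than Bonferroni."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1); monotonicity is enforced.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment: controls the FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest p-value is multiplied by m / k; monotonicity is enforced.
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
    print("Bonferroni:", bonferroni(raw))
    print("Holm:      ", holm(raw))
    print("BH (FDR):  ", benjamini_hochberg(raw))
```

Genes are then selected by comparing the adjusted p-values with the chosen level; the FDR adjustment is the least conservative of the three.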

Difference between controlling FWER or FDR

FWER: controls the probability of making any false positive at all
• Gives many fewer selected genes (and false positives), but you are likely to miss many truly changed genes.
• Adequate if the goal is to identify a few genes that differ between two groups.

FDR: controls the expected proportion of false positives among the selected genes
• If you can tolerate more false positives, you will get many fewer false negatives.
• Adequate if the goal is to pursue the study further, e.g. to determine functional relationships among genes.

Steps to generate a list of candidate genes revisited (2)

Gene 1: M11, M12, ..., M1k
Gene 2: M21, M22, ..., M2k
...
Gene G: MG1, MG2, ..., MGk
→ For every gene, calculate Si = t(Mi1, Mi2, ..., Mik), e.g. t-statistics, S, B, ...
→ Statistics of interest: S1, S2, ..., SG
→ Assumption on the null distribution: data normality
→ Nominal p-values: P1, P2, ..., PG
→ Adjusted p-values: aP1, aP2, ..., aPG
→ Select genes with adjusted p-values smaller than α
→ A list of candidate DE genes

Example

• Golub data: 27 ALL vs 11 AML samples, 3051 genes.
• Bonferroni adjustment: 98 genes with p_adj < 0.05 (p_raw < 0.000016).

Extensions

• Some issues we have not dealt with:
  – Replicates within and between slides
  – Several effects: use a linear model
  – ANOVA: are the effects equal?
  – Time series: selecting genes for trends
• Different solutions have been suggested for each problem.
• There are still many open questions.

Examples

Example 1: Swirl zebrafish experiment

• Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis.
• As a result, some cell types are reduced and others are expanded.
• A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to wild-type zebrafish.

Example 1: Experimental design

• Each microarray contained 8848 cDNA probes (either genes or EST sequences).
• 4 replicate slides: 2 sets of dye-swap pairs.
• For each pair, the target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3, and the target cDNA of the wild type was labeled using the other dye.

[Figure: Wild type and Swirl joined by arrows in both directions, 2 slides each way (dye swap).]

Example 1. Data analysis

• Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type.
• On a gene-by-gene basis this is a one-sample problem.
• Hypothesis to be tested for each gene: H0: log2(R/G) = 0
• The decision will be based on average log-ratios.
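For a single gene, the one-sample analysis of a dye-swap experiment can be sketched as follows. The sketch assumes that the log-ratios of the swapped slides must change sign so that all four values measure mutant versus wild type; the four log-ratios are invented for illustration and are not taken from the swirl data.

```python
from math import sqrt
from statistics import mean, stdev


def orient_log_ratios(log_ratios, dye_swapped):
    """Flip the sign of log2(R/G) on dye-swapped slides so all slides agree in direction."""
    return [-m if swapped else m for m, swapped in zip(log_ratios, dye_swapped)]


def one_sample_t(values):
    """t statistic for H0: mean log-ratio = 0."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


if __name__ == "__main__":
    # Hypothetical log2(R/G) values for one gene on the 4 slides.
    raw = [1.0, -1.2, 0.8, -1.0]
    swapped = [False, True, False, True]
    oriented = orient_log_ratios(raw, swapped)
    print("oriented log-ratios:", oriented)
    print("mean:", mean(oriented), "t:", round(one_sample_t(oriented), 2))
```

Without the sign flip the four values would average to about zero and the gene would look unchanged, which is why the dye orientation has to be tracked.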

Example 2: Scavenger receptor BI (SR-BI) experiment

• Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
• Transgenic mice with the SR-BI gene overexpressed have low HDL cholesterol levels.
• Goal: to identify genes with altered expression in the livers of transgenic mice overexpressing the SR-BI gene (T) compared to "normal" FVB control mice (C).

Example 2. Experimental design

• 8 treatment mice (Ti) and 8 control mice (Ci).
• 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
• Probes: ~6,000 cDNAs (genes), including 200 related to pathogenicity.

[Figure: T (8 slides) and C (8 slides), each hybridized against the pooled control C*.]


Example 2. Data analysis

• Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C*)) and 8 for control (log(C/C*)).
• On a gene-by-gene basis this is a two-sample problem.
• Hypothesis to be tested for each gene: H0: [log(R1/G) - log(R2/G)] = 0
• The decision will be based on the average difference of log-ratios.

Software for microarray data analysis

Introduction

• Microarray experiments generate huge quantities of data, which have to be stored, managed, visualized, processed, ...
• Many options are available. However:
  – No tool satisfies all users’ needs.
  – There is a trade-off. A tool must be
    – powerful but user friendly,
    – complete but without too many options,
    – flexible but easy to start with and to go further,
    – available, up to date and well documented, but affordable.

So, what you need is "R"?

• R is an open-source system for statistical computation and graphics. It consists of
  – a language,
  – a run-time environment with graphics, a debugger, and access to certain system functions.
• It can be used
  – interactively, through a command language, or
  – by running programs stored in script files.

http://www.r-project.org/

Some pros and cons

Pros
• Powerful; used by statisticians
• Easy to extend
  – by creating add-on packages
  – many already available
• Freely available
• Runs on Unix, Windows and Mac
• Lots of documentation

Cons
• Not very easy to learn
• Command-based
• Documentation sometimes cryptic
• Memory intensive (worse on Windows)
• Slow at times

We believe it is worth the effort!
• If you "just want to do statistical analysis", it is easy to find alternatives.
• If you intend to do microarray data analysis, it is probably one of the best options.

R and Microarrays

• R is a popular tool among statisticians.
• Once they started to work with microarrays, they continued using it
  – to perform the analysis,
  – to implement new tools.
• This very quickly gave rise to lots of free R-based software to analyze microarrays.
• The Bioconductor project groups many (but not all) of these developments.

The Bioconductor project

• Open-source and open-development software project for the analysis and comprehension of genomic data.
• Most early developments are R packages.
• Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html
• It has reached some stability but is still evolving: what is now a standard may not be so in the future.

There's much more than R!

Have a look at "My microarray software comparison": http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
