Some Statistical Issues in Microarray Data Analysis Alex Sánchez Estadística i Bioinformàtica Departament d’Estadística Universitat de Barcelona Unitat d’Estadística i BioinformàticaIR-HUVH


Some Statistical Issues in Microarray Data Analysis

Alex Sánchez
Estadística i Bioinformàtica, Departament d’Estadística, Universitat de Barcelona
Unitat d’Estadística i Bioinformàtica, IR-HUVH


Outline

• Introduction
• Experimental design
• Selecting differentially expressed genes
    - Statistical tests
    - Significance testing
    - Linear models and Analysis of the variance
    - Multiple testing
• Software for microarray data analysis


Introduction


Microarray experiments: Overview


Why are we talking about statistics?

A microarray experiment is, as its name says, an experiment. That is:
• It has been performed to determine whether some previous hypotheses are true or false (although it can also lead to new hypotheses).
• It is subject to errors, which may arise from many sources.


Sources of variability

• Biological heterogeneity in population
• Specimen collection/handling effects
    - Tumor: surgical bx, FNA
    - Cell line: culture condition, confluence level
• Biological heterogeneity in specimen
• RNA extraction
• RNA amplification
• Fluor labeling
• Hybridization
• Scanning
    - PMT voltage
    - laser power

(Geschwind, Nature Reviews Neuroscience, 2001)


Categories of variability

• Systematic variability
    - Amount of RNA in the biopsy
    - Efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection
• Random variation
    - PCR yield
    - DNA quality
    - Spotting efficiency, spot size
    - Cross-/unspecific hybridization
    - Stray signal


Dealing with systematic variability

• Systematic variability has similar effects on many measurements.
• Corrections can be estimated from data.
• CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability.
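As an illustration of "corrections estimated from data", here is a minimal sketch of one very simple calibration: global median normalization of the log-ratios of one slide. It assumes the only systematic effect is a constant shift shared by all spots; the function name and the toy values are illustrative and do not come from the talk.

```python
from statistics import median

def median_normalize(log_ratios):
    """Subtract the slide-wide median so that the typical log-ratio is 0."""
    shift = median(log_ratios)            # correction estimated from the data
    return [m - shift for m in log_ratios]

# Toy slide: every log-ratio is shifted up by about 0.5 (e.g. a dye bias),
# and one gene (index 4) is truly changed.
slide = [0.4, 0.5, 0.6, 0.5, 2.5, 0.45, 0.55]
normalized = median_normalize(slide)
print(normalized)
```

Real normalization methods are usually intensity-dependent; this sketch only shows the principle that the correction itself is computed from the many measurements sharing the same systematic effect.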


Dealing with random variation

• Random variation cannot be explicitly accounted for.
• The usual way to deal with it is to assume some ERROR MODEL (e.g. $e_i \sim N(0, \sigma^2)$).
• Assuming these error models are true:
    - EXPERIMENTAL DESIGN is (must be) used to control the action of random variation.
    - STATISTICAL INFERENCE is (must be) used to extract conclusions in the presence of random variation.


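[Figure: workflow of a microarray study. Boxes: Biological question; Experimental design; Microarray experiment; Image analysis; Quality measurement (Pass / Failed); Normalization; Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation. A "Today" label marks the part covered in this talk.]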


Experimental design


Why experimental design?

The objective of experimental design is to make the analysis of the data and the interpretation of the results
• as simple and as powerful as possible,
• given the purpose of the experiment
• and the constraints of the experimental material.


Scientific aims and design choice

• The primary focus of the experiment needs to be clearly stated, whether it is
    - to identify differentially expressed genes,
    - to search for specific gene-expression patterns,
    - to identify phenotypic subclasses.
• The aim of the experiment guides the design choice.
    - Sometimes only one choice is reasonable.
    - Sometimes different options are available.


Designing microarray experiments

The appropriate design of a microarray experiment must consider
• the design of the array,
• the allocation of mRNA samples to the slides.


I: Layout of the array

• Which sequences to use
    - cDNAs: selection of cDNA from a library (Riken, NIA, etc.)
    - Affymetrix: PMs and MMs
    - Oligo probe selection (from Operon, Agilent, etc.)
    - Control probes: what percentage? Where should controls be put?
• How many sequences to use
• Should there be replicate spots within a slide?


II: Allocating samples in slides

• Types of samples
    - Replication: technical vs biological
    - Pooled vs individual samples
• Different design layout / data analysis:
    - Scientific aim of the experiment
    - Efficiency, robustness, extensibility
• Physical limitations (cost):
    - Number of slides
    - Amount of material


Basic principles of experimental design

Apply the following principles to best attain the objectives of experimental design:
• Replication
• Local control or blocking
• Randomization


1. Replication

• It is important
    - to reduce uncertainty (increase precision): $\operatorname{var}(\bar{X}) = \sigma^2 / n$
    - to obtain sufficient power for the tests
    - as a formal basis for inferential procedures
• Consider different types of replicates
    - Technical: duplicate spots, multiple hybridizations from the same sample
    - Biological
• Replicate most what is expected to vary most!
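The gain in precision from replication can be checked numerically. The following is a small simulation sketch, assuming independent normally distributed measurements with a made-up standard deviation `SIGMA`; it compares the empirical variance of the mean of n replicates with $\sigma^2/n$.

```python
import random
from statistics import pvariance

SIGMA = 2.0              # assumed SD of a single measurement (illustrative value)
rng = random.Random(1)   # fixed seed so the run is reproducible

def variance_of_mean(n, n_sim=5000):
    """Empirical variance of the mean of n independent N(0, SIGMA^2) values."""
    means = [sum(rng.gauss(0, SIGMA) for _ in range(n)) / n for _ in range(n_sim)]
    return pvariance(means)

for n in (1, 2, 4, 8):
    print(n, round(variance_of_mean(n), 2), "theory:", SIGMA**2 / n)
```

Doubling the number of independent replicates halves the variance of the mean; this holds for the source of variation that is actually replicated, which is why biological and technical replicates are not interchangeable.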


Biological vs Technical Replicates

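[Figure: biological vs technical replication, annotated with the variance components $\sigma^2_B$, $\sigma^2_A$ and $\sigma^2_e$. © Nature Reviews & G. Churchill (2002)]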


Replication vs Pooling

• mRNA from different samples is often combined to form a "pooled sample" or pool. Why?
    - If each sample doesn’t yield enough mRNA
    - To compensate for an excess of variability?
• Statisticians tend not to like it, but pooling may be OK if properly done:
    - Combine several samples in each pool
    - Use several pools from different samples
    - Do not use pools when individual information is important (e.g. paired designs)


2. Blocking

• Assume we wish to perform an experiment to compare two treatments.
• The samples or their processing may not be homogeneous: there are blocks.
    - Subjects: male/female
    - Arrays produced in two lots (February, March)
• If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded.
    - Are the observed differences attributable to the treatment effect or to confounding factors?


Confounding block with treatment effects

Two alternative designs to investigate treatment effects:

Awful design                            Balanced design
Sample  Treatment  Sex     Batch        Sample  Treatment  Sex     Batch
1       A          Male    1            1       A          Male    1
2       A          Male    1            2       A          Female  2
3       A          Male    1            3       A          Male    1
4       A          Male    1            4       A          Female  2
5       B          Female  2            5       B          Male    1
6       B          Female  2            6       B          Female  2
7       B          Female  2            7       B          Male    1
8       B          Female  2            8       B          Female  2

• Left: treatment effects are confounded with sex and batch effects.
• Right: treatments are balanced between blocks.
    - The influence of blocks is automatically compensated.
    - Statistical analysis may separate block effects from treatment effects.


3. Randomisation

• Randomly assign samples to groups to eliminate unspecific disturbances.
    - Randomly assign individuals to treatments.
    - Randomise the order in which experiments are performed.
• Randomisation is required to ensure the validity of statistical procedures.
• Block what you can and randomise what you cannot.
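"Block what you can and randomise what you cannot" can be sketched as randomization carried out separately inside each block. The code below is a minimal illustration, assuming two treatments and a single blocking factor; the function name, the seed and the eight hypothetical mice are not part of the talk.

```python
import random

def randomize_within_blocks(samples, block_of, treatments=("A", "B"), seed=0):
    """Assign treatments at random inside each block, so that every block
    receives each treatment equally often (when block sizes allow it)."""
    rng = random.Random(seed)
    blocks = {}
    for s in samples:
        blocks.setdefault(block_of[s], []).append(s)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                    # random order inside the block
        for i, s in enumerate(members):
            assignment[s] = treatments[i % len(treatments)]
    return assignment

# Hypothetical example: 8 mice, blocked by sex.
sex = {1: "M", 2: "F", 3: "M", 4: "F", 5: "M", 6: "F", 7: "M", 8: "F"}
design = randomize_within_blocks(list(sex), sex)
print(design)
```

Whatever the random draw, each sex receives treatment A twice and treatment B twice, so treatment is never confounded with the block.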


Experimental layout

• How are mRNA samples assigned to arrays?
• The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible.
    - Sometimes there is only one reasonable choice.
    - Sometimes several choices are available.


Example 1: Only one design choice

Case 1: Meaningful biological control (C)
• Samples: liver tissue from 4 mice treated with cholesterol-modifying drugs.
• Question 1: which genes respond differently between the treatments (T) and the control (C)?
• Question 2: which genes respond similarly across two or more treatments relative to the control?
[Diagram: T1, T2, T3 and T4, each compared with the common control C.]

Case 2: Use of a universal reference
• Samples: different tumor samples.
• Question: to discover tumor subtypes.
[Diagram: T1, T2, …, Tn-1, Tn, each compared with a common reference Ref.]


Example 2: A number of different designs are suitable for use

(1) Time course experiments
• Design choice depends on the comparisons of interest.
[Diagrams: alternative layouts for four time points T1, T2, T3, T4, one of them using a common reference (Ref).]


How can we decide?

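• A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest.
• A simple example: direct vs indirect estimates
    - Direct (A and B hybridized on the same slide): estimate = average(log(A/B)), variance $\sigma^2/2$
    - Indirect (A and B each hybridized against a reference R): estimate = log(A/R) − log(B/R), variance $2\sigma^2$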
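The two variances can be checked with a short calculation. This is a sketch under assumptions that the slide does not state explicitly: each design uses two slides, and each slide yields one log-ratio with independent error of variance $\sigma^2$.

```latex
\begin{align*}
\text{Direct: }   & \hat{\delta}_{\text{dir}} = \tfrac{1}{2}\,(M_1 + M_2),
                    \qquad M_j = \log(A/B) \text{ measured on slide } j, \\
                  & \operatorname{var}(\hat{\delta}_{\text{dir}})
                    = \tfrac{1}{4}\,(\sigma^2 + \sigma^2) = \sigma^2/2, \\[4pt]
\text{Indirect: } & \hat{\delta}_{\text{ind}} = \log(A/R) - \log(B/R), \\
                  & \operatorname{var}(\hat{\delta}_{\text{ind}})
                    = \sigma^2 + \sigma^2 = 2\sigma^2 .
\end{align*}
```

Under these assumptions, and for the same number of slides, the direct estimate has a variance four times smaller than the indirect one.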


Summary

• Selection of mRNA samples is important
    - Most important: biological replicates
    - Technical replicates are also useful, but different
    - If needed and possible, use pooling wisely
• Choice of experimental layout is guided by
    - The scientific question
    - Experimental design principles
    - Efficiency and robustness considerations
• The correspondence between experimental designs, linear models and ANOVA can be exploited to select a model and analyze the data


Experimental design, Linear Models and Analysis of the Variance

• In experimental design the different sources of variability influencing the observed response may be identified.
• These sources can be related to the response using a linear model.
• Analysis of the variance can be used to separately estimate and test the relative importance of each source of variability.


Statistical methods to detect differentially expressed genes


Class comparison: Identifying differentially expressed genes

• Identify genes differentially expressed between different conditions such as
    - Treatment, cell type, ... (qualitative covariates)
    - Dose, time, ... (quantitative covariates)
    - Survival, infection time, ...
• Estimate effects/differences between groups, probably using log-ratios, i.e. the difference on the log scale: log(X) − log(Y) = log(X/Y)


What is a “significant change”?

Depends on the variability within groups, which may be different from gene to gene.

To assess the statistical significance of differences, conduct a statistical test for each gene.


Different settings for statistical tests

• Indirect comparisons: 2 groups, 2 samples, unpaired
    - E.g. 10 individuals: 5 suffer from diabetes, 5 are healthy
    - One sample from each individual
    - Typically: two-sample t-test or similar
• Direct comparisons: 2 groups, 2 samples, paired
    - E.g. 6 individuals with brain stroke
    - Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2)
    - Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions
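The difference between the two settings can be made concrete with a small sketch that computes both statistics on the same numbers. The data are made up (one gene, six individuals, two regions) and the function names are illustrative; only the classical pooled-variance and paired formulas are used.

```python
from math import sqrt
from statistics import mean, stdev, variance

def two_sample_t(x, y):
    """Classical (pooled-variance) two-sample t statistic for unpaired groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))

def paired_t(x, y):
    """One-sample t statistic computed on the within-individual differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Made-up log-expression values of one gene in 6 individuals.
region1 = [7.1, 8.3, 6.9, 9.0, 7.7, 8.1]   # healthy region
region2 = [7.6, 8.9, 7.3, 9.6, 8.1, 8.7]   # affected region

print("unpaired t:", round(two_sample_t(region2, region1), 2))
print("paired t:  ", round(paired_t(region2, region1), 2))
```

In this toy data set the individuals differ a lot from one another but the within-individual change is very consistent, so the paired statistic is much larger than the unpaired one. Analysing paired data as if they were unpaired discards that information.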


Different ways to do the experiment

• An experiment may use cDNA arrays ("two-colour") or Affymetrix arrays ("one-colour").
• Depending on the technology used, the allocation of conditions to slides changes.

Experiment                          | cDNA (2-colour)                                  | Affy (1-colour)
10 individuals: Diab (5), Heal (5)  | Reference design: (5) Diab/Ref, (5) Heal/Ref     | Comparison design: (5) Diab vs (5) Heal
6 individuals: Region 1, Region 2   | 6 slides, 1 individual per slide: (6) reg1/reg2  | 12 slides: (6) paired differences


“Natural” measures of discrepancy

For direct comparisons in two colour or paired one colour:
• Mean (log) ratio: $\bar{R} = \frac{1}{n_T} \sum_{i=1}^{n_T} R_i$ (R or M used interchangeably)
• Classical t-test: $t = \bar{R} / SE$ ($SE$ estimates the standard error of $\bar{R}$)
• Robust t-test: use robust estimates of location and scale

For indirect comparisons in two colour or direct comparisons in one colour:
• Mean difference: $\bar{T} - \bar{C} = \frac{1}{n_T} \sum_{i=1}^{n_T} T_i - \frac{1}{n_C} \sum_{i=1}^{n_C} C_i$
• Classical t-test: $t = (\bar{T} - \bar{C}) / \left( s_p \sqrt{1/n_T + 1/n_C} \right)$
• Robust t-test: use robust estimates of location and scale


Some Issues

• Can we trust average effect sizes (average difference of means) alone?
• Can we trust the t statistic alone?
• Here is evidence that the answer is no.

Gene   M1     M2     M3     M4     M5     M6     Mean    SD     t
A      2.5    2.7    2.5    2.8    3.2    2      2.61    0.40   16.10
B      0.01   0.05   -0.05  0.01   0      0      0.003   0.03   0.25
C      2.5    2.7    2.5    1.8    20     1      5.08    7.34   1.69
D      0.5    0      0.2    0.1    -0.3   0.3    0.13    0.27   1.19
E      0.1    0.11   0.1    0.1    0.11   0.09   0.10    0.01   33.09

Courtesy of Y.H. Yang

• Averages can be driven by outliers.

• t’s can be driven by tiny variances.
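The Mean, SD and t columns of the table can be recomputed from the six log-ratios of each gene. The sketch below assumes that t is the one-sample statistic mean / (SD / √n), which reproduces the tabulated values up to rounding; the median column is an addition used here only to expose the outlier in gene C.

```python
from math import sqrt
from statistics import mean, median, stdev

genes = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2],
    "B": [0.01, 0.05, -0.05, 0.01, 0, 0],
    "C": [2.5, 2.7, 2.5, 1.8, 20, 1],
    "D": [0.5, 0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}

def one_sample_t(m):
    """t statistic for H0: mean log-ratio = 0."""
    return mean(m) / (stdev(m) / sqrt(len(m)))

print("gene  mean  median  sd  t")
for name, m in genes.items():
    print(name, round(mean(m), 3), round(median(m), 3),
          round(stdev(m), 3), round(one_sample_t(m), 2))
```

Gene C has the largest mean only because of the single value 20 (its median is 2.5), while gene E has by far the largest t although its mean log-ratio is only about 0.1.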


Variations in t-tests (1)

Let
• $R_g$: mean observed log ratio for gene g
• $SE_g$: standard error of $R_g$ estimated from data on gene g
• $SE$: standard error of $R_g$ estimated from data across all genes

Global t-test: $t = R_g / SE$

Gene-specific t-test: $t = R_g / SE_g$


Some pros and cons of the t-tests

Test                                  | Pros                              | Cons
Global t-test: $t = R_g / SE$         | Yields a stable variance estimate | Assumes variance homogeneity; biased if this is false
Gene-specific t-test: $t = R_g / SE_g$ | Robust to variance heterogeneity  | Low power; yields unstable variance estimates (due to few data)


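T-test extensions

• SAM (Tibshirani, 2001): $S = \dfrac{R_g}{c + SE_g}$
• Regularized-t (Baldi, 2001): $t = \dfrac{R_g}{\sqrt{\dfrac{v_0\, SE^2 + (n-1)\, SE_g^2}{v_0 + n - 2}}}$
• EB-moderated t (Smyth, 2003): $t = \dfrac{R_g}{\sqrt{\dfrac{d_0\, SE_0^2 + d\, SE_g^2}{d_0 + d}}}$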
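All three extensions stabilise the denominator of the gene-specific t. The sketch below codes the three formulas as plain functions to show their effect on a gene with a tiny standard error; it is not the implementation of any of the cited packages, and the tuning constants (`c`, `v0`, `d0`, `se0`, `se_global`) are made-up values that in practice are estimated from all genes.

```python
from math import sqrt

def sam_statistic(r_g, se_g, c):
    """SAM-type statistic: a constant c is added to the gene-specific SE."""
    return r_g / (c + se_g)

def regularized_t(r_g, se_g, se_global, n, v0):
    """Regularized t: weighted combination of a global and a gene-specific variance."""
    var = (v0 * se_global**2 + (n - 1) * se_g**2) / (v0 + n - 2)
    return r_g / sqrt(var)

def moderated_t(r_g, se_g, se0, d, d0):
    """Moderated t: prior variance se0^2 (d0 d.f.) combined with SE_g^2 (d d.f.)."""
    var = (d0 * se0**2 + d * se_g**2) / (d0 + d)
    return r_g / sqrt(var)

# A gene like gene E above: small effect, tiny standard error.
r_g, se_g = 0.10, 0.003
print("gene-specific t:", round(r_g / se_g, 1))
print("SAM-type S:     ", round(sam_statistic(r_g, se_g, c=0.1), 2))
print("regularized t:  ", round(regularized_t(r_g, se_g, se_global=0.1, n=6, v0=4), 2))
print("moderated t:    ", round(moderated_t(r_g, se_g, se0=0.1, d=5, d0=4), 2))
```

With these illustrative constants the gene-specific t of about 33 shrinks to a value between 1 and 2 under each extension, so a gene is no longer selected merely because its variance estimate happens to be very small.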


Up to here: can we generate a list of candidate genes?

With the tools we have, the reasonable steps to generate a list of candidate genes may be:

Gene 1: M11, M12, …, M1k
Gene 2: M21, M22, …, M2k
…
Gene G: MG1, MG2, …, MGk
→ For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
→ Statistics of interest: S1, S2, …, SG
→ ?
→ A list of candidate DE genes

• We need an idea of how significant these values are.
• We’d like to assign them p-values.


Significance testing


Nominal p-values

After a test statistic is computed, it is convenient to convert it to a p-value:

The probability that a test statistic, say $S(X)$, takes values equal to or greater than the value taken on the observed sample, say $S(X_0)$, under the assumption that the null hypothesis is true:

$p = P\{ S(X) \ge S(X_0) \mid H_0 \text{ true} \}$


Significance testing

• Test of significance at the level $\alpha$: reject the null hypothesis if your p-value is smaller than the significance level $\alpha$.
• It has advantages but is not free from criticism.
• Genes with p-values falling below a prescribed level may be regarded as significant.


Hypothesis testing overview for a single gene

State of nature ("Truth")   | Reported decision: H0 is rejected (gene is selected) | Reported decision: H0 is accepted (gene not selected) | Row summary
H0 is false (affected)      | TP, prob: $1-\beta$                                   | FN, prob: $\beta$ (Type II error)                      | Sensitivity = TP/[TP+FN]
H0 is true (not affected)   | FP, $P[\text{Rej } H_0 \mid H_0] \le \alpha$ (Type I error) | TN, prob: $1-\alpha$                             | Specificity = TN/[TN+FP]
Column summary              | Positive predictive value = TP/[TP+FP]                | Negative predictive value = TN/[TN+FN]                 |
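The four summary measures in the margins of the table follow directly from the four counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def decision_table_summary(tp, fp, tn, fn):
    """Summary measures computed from the four cells of the decision table."""
    return {
        "sensitivity": tp / (tp + fn),   # affected genes that are selected
        "specificity": tn / (tn + fp),   # unaffected genes that are not selected
        "ppv": tp / (tp + fp),           # selected genes that are truly affected
        "npv": tn / (tn + fn),           # non-selected genes that are truly unaffected
    }

# Made-up counts: 100 affected genes and 900 unaffected genes.
perf = decision_table_summary(tp=80, fp=45, tn=855, fn=20)
print(perf)
```

Even with 95% specificity, more than a third of the selected genes in this toy example are false positives, because unaffected genes greatly outnumber affected ones.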


Calculation of p-values

Standard methods for calculating p-values:

(i) Refer to a statistical distribution table (Normal, t, F, …) or

(ii) Perform a permutation analysis


(i) Tabulated p-values

• Tabulated p-values can be obtained for standard test statistics (e.g. the t-test).
• They often rely on the assumption of normally distributed errors in the data.
• This assumption can be checked (approximately) using a
    - Histogram
    - Q-Q plot


Example

• Golub data: 27 ALL vs 11 AML samples, 3051 genes
• A t-test yields 1045 genes with p < 0.05


(ii) Permutation tests

• Based on data shuffling; no distributional assumptions.
    - Random interchange of labels between samples
    - Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
• Repeat for every possible permutation, b = 1, …, B:
    - Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”.
    - For each gene, calculate the corresponding two-sample t-statistic, $t_b$.
• After all the B permutations are done, put $p = \#\{ b : |t_b| \ge |t_{\text{observed}}| \} / B$
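The steps above can be sketched for a single gene as follows. This minimal version enumerates every distinct way of labelling n1 of the n values as "treatments" (which is feasible only for small samples; with larger samples one would draw B random permutations instead) and uses the pooled-variance t statistic. The data are made up.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_stat(x, y):
    """Two-sample (pooled-variance) t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

def permutation_p_value(treat, ctrl):
    """Two-sided p-value from all relabellings of n1 of the n values as 'treatment'."""
    data = treat + ctrl
    n, n1 = len(data), len(treat)
    t_obs = abs(t_stat(treat, ctrl))
    count = total = 0
    for idx in combinations(range(n), n1):
        chosen = set(idx)
        x = [data[i] for i in idx]
        y = [data[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(t_stat(x, y)) >= t_obs - 1e-12:   # tolerance for floating-point ties
            count += 1
    return count / total

# Made-up expression values for one gene: 4 treatments vs 4 controls.
p = permutation_p_value([5.1, 5.6, 5.9, 6.2], [4.0, 4.3, 4.8, 5.0])
print(p)
```

With 4 + 4 samples there are only 70 distinct relabellings, so the smallest attainable two-sided p-value is 2/70 ≈ 0.029. With few replicates, permutation p-values are therefore coarse.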


Permutation tests (2)


Volcano plot: fold change vs log(odds) (1)

[Figure: volcano plot with regions labelled "Significant change detected" and "No change detected".]

(1) log(odds) is proportional to −log(p-value)


Linear models and Analysis of the Variance to analyze designed experiments


From experimental design to linear models

• Some weaknesses of the statistical framework described so far:
    - What to do if a treatment has more than 2 levels?
    - How to deal with more than one treatment or experimental condition?
    - How to deal with nuisance factors such as batch effects, covariates, etc.?
• Most of this can be solved with an alternative approach: Analysis of the Variance


Multiple testing


How far can we trust the decision?

• The test "Reject H0 if p-value ≤ $\alpha$" is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold $\alpha$.
• Nothing is guaranteed about P[FN].
    - “Optimal” tests are built trying to minimize this probability.
    - In practical situations it is often high.


What if we wish to test more than one gene at once? (1)

• Consider more than one (independent) test at once, with every null hypothesis true:
    - Two tests, each at the 5% level: the probability of getting at least one false positive is $1 - 0.95 \times 0.95 = 0.0975$
    - Three tests: $1 - 0.95^3 = 0.1426$
    - n tests: $1 - 0.95^n$
    - This converges towards 1 as n increases
• Small p-values don’t necessarily imply significance! We are no longer controlling the probability of a type I error.
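The growth of this probability with the number of tests is easy to tabulate. The sketch below assumes independent tests with every null hypothesis true, as in the calculation above; the function name is illustrative.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) for n independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

With the thousands of genes on a typical array, at least one false positive at the 5% level is practically certain.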


What if we wish to test more than one gene at once? (2): a simulation

• Simulation of this process for 6,000 genes with 8 treatments and 8 controls.
• All the gene expression values were simulated i.i.d. from a N(0, 1) distribution, i.e. NOTHING is differentially expressed in our simulation.
• The number of genes falsely rejected will be on average $6000 \cdot \alpha$, i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes.

See example
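A simulation of this kind can be sketched with the standard library alone. The version below is not the example shown in the talk; it assumes a pooled-variance two-sample t statistic and, instead of computing p-values, compares |t| with 2.977, the two-sided 1% critical value of Student's t with 14 degrees of freedom. The seed is arbitrary.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(2024)
N_GENES, N_PER_GROUP = 6000, 8
T_CRIT = 2.977   # two-sided 1% critical value of Student's t, 14 d.f.

def t_stat(x, y):
    """Pooled-variance two-sample t statistic for two groups of equal size."""
    sp2 = (variance(x) + variance(y)) / 2
    return (mean(x) - mean(y)) / sqrt(sp2 * 2 / len(x))

false_positives = 0
for _ in range(N_GENES):
    # Both groups come from the same N(0, 1): no gene is differentially expressed.
    treat = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    ctrl = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    if abs(t_stat(treat, ctrl)) > T_CRIT:
        false_positives += 1

print(false_positives, "of", N_GENES, "null genes declared significant at the 1% level")
```

The count varies from run to run around the expected value 6000 × 0.01 = 60, and every one of these genes is a false positive.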


Multiple testing: Counting errors

State of nature ("Truth")   | Decision: H0 is rejected (genes selected) | Decision: H0 is accepted (genes not selected) | Total
H0 is false (affected)      | $S = m_\alpha - m_\alpha^0$               | $T = (m - m_0) - (m_\alpha - m_\alpha^0)$     | $m - m_0$
H0 is true (not affected)   | $V = m_\alpha^0$                          | $U = m_0 - m_\alpha^0$                        | $m_0$
Total                       | $R = m_\alpha$                            | $m - R = m - m_\alpha$                        | $m$

• V = number of Type I errors (false positives)
• T = number of Type II errors (false negatives)
• All these quantities could be known if $m_0$ was known


How does type I error control extend to multiple testing situations?

• Selecting genes with a p-value less than $\alpha$ doesn’t control P[FP] anymore.
• What can be done?
    - Extend the idea of type I error: FWER and FDR are two such extensions.
    - Look for procedures that control the probability for these extended error types, mainly by adjusting raw p-values.


Two main error rate extensions

• Family Wise Error Rate (FWER)
    - FWER is the probability of at least one false positive
    - FWER = Pr(# of false discoveries > 0) = Pr(V > 0)
• False Discovery Rate (FDR)
    - FDR is the expected value of the proportion of false positives among rejected null hypotheses
    - FDR = E[V/R; R > 0] = E[V/R | R > 0] · P[R > 0]


FDR and FWER controlling procedures

• FWER
    - Bonferroni (adjusted p-value = min{n · p-value, 1})
    - Holm (1979)
    - Hochberg (1988)
    - Westfall & Young (1993): maxT and minP
• FDR
    - Benjamini & Hochberg (1995)
    - Benjamini & Yekutieli (2001)
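Three of these adjustments are simple enough to sketch directly. The functions below are a plain-Python illustration of the Bonferroni, Holm and Benjamini-Hochberg adjusted p-values, written from the usual textbook definitions rather than taken from any package; the raw p-values are made up.

```python
def bonferroni(p):
    """Single-step Bonferroni adjustment (controls the FWER)."""
    m = len(p)
    return [min(m * pi, 1.0) for pi in p]

def holm(p):
    """Step-down Holm adjustment (controls the FWER)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adj, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):              # rank 0 = smallest p-value
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(running_max, 1.0)
    return adj

def benjamini_hochberg(p):
    """Step-up Benjamini-Hochberg adjustment (controls the FDR)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)
    adj, running_min = [0.0] * m, 1.0
    for k, i in enumerate(order):                 # k = 0 is the largest p-value
        rank = m - k                              # 1-based rank in ascending order
        running_min = min(running_min, p[i] * m / rank)
        adj[i] = running_min
    return adj

raw = [0.001, 0.008, 0.039, 0.041, 0.60]          # made-up raw p-values
for name, f in (("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)):
    print(name, [round(a, 5) for a in f(raw)])
```

For the same raw p-values the adjusted values satisfy BH ≤ Holm ≤ Bonferroni, which is why FDR control selects more genes than FWER control at the same nominal level.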


Difference between controlling FWER or FDR

• FWER: controls the probability of having any false positive (aims at no false positives)
    - Gives many fewer genes (fewer false positives), but you are likely to miss many
    - Adequate if the goal is to identify a few genes that differ between two groups
• FDR: controls the expected proportion of false positives among the selected genes
    - If you can tolerate more false positives you will get many fewer false negatives
    - Adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes


Steps to generate a list of candidate genes revisited (2)

Gene 1: M11, M12, …, M1k
Gene 2: M21, M22, …, M2k
…
Gene G: MG1, MG2, …, MGk
→ For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
→ Statistics of interest: S1, S2, …, SG
→ Assumption on the null distribution: data normality
→ Nominal p-values: P1, P2, …, PG
→ Adjusted p-values: aP1, aP2, …, aPG
→ Select genes with adjusted p-values smaller than $\alpha$
→ A list of candidate DE genes


Example

Golub data, 27 ALL vs 11 AML samples, 3051 genes

Bonferroni adjustment: 98 genes with padj< 0.05 (praw < 0.000016)


Extensions

• Some issues we have not dealt with
    - Replicates within and between slides
    - Several effects: use a linear model
    - ANOVA: are the effects equal?
    - Time series: selecting genes for trends
• Different solutions have been suggested for each problem
• Still many open questions


Examples


Ex. 1- Swirl zebrafish experiment

Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis

As a result some cell types are reduced and others are expanded

A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to wild-type zebrafish


Example 1: Experimental design

• Each microarray contained 8848 cDNA probes (either genes or EST sequences).
• 4 replicate slides: 2 sets of dye-swap pairs.
• For each pair, the target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cDNA of the wild type was labeled using the other dye.

[Diagram: Wild type and Swirl hybridized together, 2 slides in each dye orientation.]


Example 1. Data analysis

Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type

On a gene-per-gene basis this is a one-sample problem

Hypothesis to be tested for each gene: H0: mean log2(R/G) = 0

The decision will be based on average log-ratios
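One way to base the decision on average log-ratios is a one-sample t-statistic per gene, after flipping the sign of the log-ratios on the dye-swapped slides so that every value reads as mutant versus wild type. The sketch below, in Python with the standard library only, illustrates this for a single gene. The log-ratio values, the choice of which slides are the swapped ones, and the function names are all assumptions made for illustration; the slides do not say which test statistic was actually used.

```python
import math
import statistics


def orient_dye_swaps(m_values, swapped):
    """Flip the sign of log2(R/G) on dye-swapped slides so that all
    values are expressed as log2(mutant / wild type)."""
    return [-m if is_swapped else m for m, is_swapped in zip(m_values, swapped)]


def one_sample_t(log_ratios):
    """One-sample t-statistic for H0: mean log-ratio = 0.
    Requires at least 2 values and a non-zero standard deviation."""
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n))


# Hypothetical log2(R/G) values for one gene on the 4 slides.
raw_m = [1.2, -0.9, 1.0, -1.1]
# Assumption: slides 2 and 4 are the dye-swapped ones.
swapped = [False, True, False, True]

m = orient_dye_swaps(raw_m, swapped)
t_stat = one_sample_t(m)
print(m, t_stat)
```

With only 4 slides per gene the standard deviation is estimated from very few values, which is one reason the per-gene statistics need to be followed by a multiple-testing adjustment.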


Example 2. Scavenger receptor BI (SR-BI) experiment

Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.

Transgenic mice overexpressing the SR-BI gene have low HDL cholesterol levels.

Goal: To identify genes with altered expression in the livers of transgenic mice overexpressing the SR-BI gene (T) compared to “normal” FVB control mice (C).


Example 2. Experimental design

8 treatment mice (Ti) and 8 control ones (Ci).

16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.

Probes: ~ 6,000 cDNAs (genes), including 200 related to pathogenicity.

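[Figure: design diagram, 8 treatment (T) and 8 control (C) samples, each hybridized against the pooled control reference C*]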


Example 2. Data analysis

Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C*)) and 8 for control (log(C/C*))

On a gene-per-gene basis this is a two-sample problem

Hypothesis to be tested for each gene: H0: [log(R1/G) - log(R2/G)] = 0, where R1 is the treatment sample, R2 the control sample and G the pooled reference C*

Decision will be based on average difference of log ratios
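A decision based on the average difference of log ratios can be sketched as a two-sample t-statistic computed per gene. The example below, in Python with the standard library only, uses the unequal-variance (Welch) form; that choice, the function name and the log-ratio values are assumptions for illustration, since the slides do not state which variant was used.

```python
import math
import statistics


def welch_t(x, y):
    """Two-sample t-statistic (unequal-variance form) for
    H0: mean(x) - mean(y) = 0. Each group needs at least 2 values."""
    mean_diff = statistics.mean(x) - statistics.mean(y)
    # Standard error of the difference between the two group means.
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return mean_diff / se


# Hypothetical log ratios for one gene (not from the Callow et al. data):
# log(T/C*) for the 8 treatment mice and log(C/C*) for the 8 control mice.
treatment = [-1.1, -0.8, -1.3, -0.9, -1.0, -1.2, -0.7, -1.0]
control = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 0.0, -0.1]

t_stat = welch_t(treatment, control)
print(t_stat)
```

Because both groups are measured against the same pooled reference C*, the reference cancels in the difference of means, which is why the comparison reduces to an ordinary two-sample problem.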


Software for microarray data analysis


Introduction

Microarray experiments generate huge quantities of data which have to be stored, managed, visualized, processed …

Many options available. However…

• No tool satisfies all users’ needs
• Trade-off. A tool must be
    • Powerful but user friendly
    • Complete but without too many options
    • Flexible but easy to start with and go further
    • Available, up to date, well documented but affordable


So, what you need is “R”?

R is an open-source system for statistical computation and graphics. It consists of

• A language
• A run-time environment with graphics, a debugger, and access to certain system functions

It can be used

• Interactively, through a command language
• Or by running programs stored in script files


http://www.r-project.org/


Some pros & cons

Pros:

• Powerful, used by statisticians
• Easy to extend: creating add-on packages, many already available
• Freely available
• Unix, Windows & Mac
• Lots of documentation

Cons:

• Not very easy to learn
• Command-based
• Documentation sometimes cryptic
• Memory intensive, worst in Windows
• Slow at times

We believe it is worth the effort!

• If you “just want to do statistical analysis”: easy to find alternatives
• If you intend to do microarray data analysis: probably one of the best options


R and Microarrays

R is a popular tool among statisticians

Once they started to work with microarrays they continued using it

• To perform the analysis
• To implement new tools

This quickly gave rise to a lot of free R-based software to analyze microarrays

The Bioconductor project groups many (but not all) of these developments


The Bioconductor project

Open source and open development software project for the analysis and comprehension of genomic data.

Most early developments were released as R packages.

Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html

It has reached some stability but is still evolving: what is now a standard may not be so in the future.


There's much more than R!

Have a look at "My microarray software comparison":

http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
