Some Statistical Issues in Microarray Data Analysis
Alex Sánchez
Estadística i Bioinformàtica, Departament d’Estadística, Universitat de Barcelona
Unitat d’Estadística i Bioinformàtica, IR-HUVH
Outline
• Introduction
• Experimental design
• Selecting differentially expressed genes
   Statistical tests
   Significance testing
   Linear models and Analysis of the variance
   Multiple testing
• Software for microarray data analysis
Introduction
Microarray experiments: Overview
Why are we talking of statistics?
A microarray experiment is, as its name says, an experiment, that is:
• It has been performed to determine whether some previous hypotheses are true or false (although it can also lead to new hypotheses)
• It is subject to errors which may arise from many sources
Sources of variability
• Biological heterogeneity in population
• Specimen collection / handling effects
   Tumor: surgical bx, FNA
   Cell line: culture condition, confluence level
• Biological heterogeneity in specimen
• RNA extraction
• RNA amplification
• Fluor labeling
• Hybridization
• Scanning: PMT voltage, laser power
(Geschwind, Nature Reviews Neuroscience, 2001)
Categories of variability
Systematic variability
• Amount of RNA in the biopsy
• Efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection
Random variation
• PCR yield
• DNA quality
• Spotting efficiency, spot size
• Cross-/unspecific hybridization
• Stray signal
Dealing with systematic variability
Systematic variability has similar effects on many measurements.
Corrections can be estimated from the data.
CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability.
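As one possible illustration of "estimating a correction from the data", the sketch below centres the log-ratios of a slide on their median. Median centring is only one simple normalization choice and is an assumption here, not a method named in the slides; the function name and the numbers are hypothetical.

```python
from statistics import median

def median_center(log_ratios):
    """Subtract the slide-wide median so the typical log-ratio becomes 0."""
    shift = median(log_ratios)  # correction estimated from the data themselves
    return [m - shift for m in log_ratios]

# Hypothetical log2 ratios from one slide, shifted upwards by a systematic bias
slide = [0.4, 0.6, 0.5, 2.5, 0.3, 0.7, -1.5]
normalized = median_center(slide)
```

The underlying idea is that most genes are not expected to change, so a shift shared by the bulk of the spots is attributed to systematic variability rather than to biology.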
Dealing with random variation
Random variation cannot be explicitly accounted for.
The usual way to deal with it is to assume some ERROR MODEL (e.g. $e_i \sim N(0, \sigma^2)$).
Assuming these error models are true…
• EXPERIMENTAL DESIGN is (must be) used to control the action of random variation
• STATISTICAL INFERENCE is (must be) used to extract conclusions in the presence of random variation
[Figure: workflow of a microarray study. Boxes: Biological question; Experimental design; Microarray experiment; Image analysis; Normalization; Quality measurement (Pass / Failed); Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation. The steps covered today are marked "Today".]
Experimental design
Why experimental design?
The objective of experimental design is to make the analysis of the data and the interpretation of the results
• as simple and as powerful as possible,
• given the purpose of the experiment
• and the constraints of the experimental material.
Scientific aims and design choice
The primary focus of the experiment needs to be clearly stated, whether it is
• to identify differentially expressed genes,
• to search for specific gene-expression patterns, or
• to identify phenotypic subclasses.
The aim of the experiment guides the design choice
• Sometimes only one choice is reasonable
• Sometimes different options are available
Designing microarray experiments
The appropriate design of a microarray experiment must consider
• Design of the array
• Allocation of mRNA samples to the slides
I: Layout of the array
Which sequences to use
• cDNAs: selection of cDNA from a library (Riken, NIA, etc.)
• Affymetrix: PM’s and MM’s
• Oligo probes: selection (from Operon, Agilent, etc.)
• Control probes: what %? Where should controls be put?
How many sequences to use
• Should there be replicate spots within a slide?
II: Allocating samples in slides
Types of samples
• Replication: technical vs biological
• Pooled vs individual samples
Different design layout / data analysis
• Scientific aim of the experiment
• Efficiency, robustness, extensibility
Physical limitations (cost)
• Number of slides
• Amount of material
Basic principles of experimental design
Apply the following principles to best attain the objectives of experimental design
• Replication
• Local control or Blocking
• Randomization
1. Replication
It’s important
• To reduce uncertainty (increase precision): $\mathrm{var}(\bar{X}) = \sigma^2 / n$
• To obtain sufficient power for the tests
• As a formal basis for inferential procedures
Consider different types of replicates
• Technical: duplicate spots, multiple hybridizations from the same sample
• Biological
Replicate most where the variability is expected to be largest!
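Biological vs Technical Replicates
[Figure: biological vs technical replicates, with variance components labelled $\sigma^2_B$, $\sigma^2_A$ and $\sigma^2_e$. © Nature Reviews & G. Churchill (2002)]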
Replication vs Pooling
mRNA from different samples is often combined to form a "pooled sample" or pool. Why?
• If each sample doesn’t yield enough mRNA
• To compensate for an excess of variability?
Statisticians tend not to like it, but pooling may be OK if properly done
• Combine several samples in each pool
• Use several pools from different samples
• Do not use pools when individual information is important (e.g. paired designs)
2. Blocking
Assume we wish to perform an experiment to compare two treatments.
The samples or their processing may not be homogeneous: there are blocks
• Subjects: Male/Female
• Arrays produced in two lots (February, March)
If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded
• Are observed differences attributable to the treatment effect or to confounding factors?
Confounding block with treatment effects
Awful design                         Balanced design
Sample  Treatment  Sex     Batch     Sample  Treatment  Sex     Batch
1       A          Male    1         1       A          Male    1
2       A          Male    1         2       A          Female  2
3       A          Male    1         3       A          Male    1
4       A          Male    1         4       A          Female  2
5       B          Female  2         5       B          Male    1
6       B          Female  2         6       B          Female  2
7       B          Female  2         7       B          Male    1
8       B          Female  2         8       B          Female  2
Two alternative designs to investigate treatment effects
• Left: treatment effects are confounded with the Sex and Batch effects
• Right: treatments are balanced between blocks
   The influence of blocks is automatically compensated
   Statistical analysis may separate block from treatment effect
3. Randomisation
Randomly assigning samples to groups to eliminate unspecific disturbances
• Randomly assign individuals to treatments.
• Randomise the order in which experiments are performed.
Randomisation is required to ensure the validity of statistical procedures.
Block what you can and randomize what you cannot.
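"Block what you can and randomize what you cannot" can be sketched as a randomization carried out separately inside each block. This is a minimal illustration; the function name, the block labels and the sample identifiers are hypothetical.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Return {sample: treatment}, with treatments balanced inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, samples in blocks.items():
        if len(samples) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be split evenly")
        # Same number of labels per treatment, shuffled within the block
        labels = list(treatments) * (len(samples) // len(treatments))
        rng.shuffle(labels)
        for sample, label in zip(samples, labels):
            assignment[sample] = label
    return assignment

# Hypothetical example: 8 samples blocked by sex, two treatments A and B
blocks = {"Male": [1, 3, 5, 7], "Female": [2, 4, 6, 8]}
design = randomize_within_blocks(blocks, ["A", "B"], seed=1)
```

Whatever the seed, every block receives each treatment equally often, so a block effect cannot be mistaken for a treatment effect.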
Experimental layout
How are mRNA samples assigned to arrays?
The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible
• Sometimes there is only one reasonable choice
• Sometimes several choices are available
Example 1: Only one design choice
Case 1: Meaningful biological control (C)
• Samples: liver tissue from 4 mice treated with cholesterol-modifying drugs.
• Question 1: genes that respond differently between the T and the C.
• Question 2: genes that respond similarly across two or more treatments relative to control.
[Figure: treatments T1, T2, T3, T4, each compared with the control C]
Case 2: Use of a universal reference
• Samples: different tumor samples.
• Question: to discover tumor subtypes.
[Figure: samples T1, T2, …, Tn-1, Tn, each compared with the reference Ref]
Example 2: a number of different designs are suitable for use
(1) Time course experiments
Design choice depends on the comparisons of interest.
[Figure: alternative layouts for the time points T1, T2, T3, T4, one of them using a common reference Ref]
How can we decide?
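A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest.
A simple example: direct vs indirect estimates
• Direct (A and B hybridized together): estimate average(log(A/B)), variance $\sigma^2/2$
• Indirect (A and B each hybridized against a reference R): estimate log(A/R) – log(B/R), variance $2\sigma^2$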
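The two variances of the direct and indirect estimates can be recovered with a short calculation. The derivation below is a sketch under assumptions that the slide does not state explicitly: each slide yields one log-ratio with variance $\sigma^2$, slides are independent, and both designs use two slides.

```latex
% Assumptions (not stated on the slide): one log-ratio per slide with
% variance \sigma^2, independent slides, two slides per design.
\begin{align*}
\text{Direct:}\quad
  \operatorname{Var}\!\left[\tfrac{1}{2}\bigl(\log(A/B)_1 + \log(A/B)_2\bigr)\right]
  &= \tfrac{1}{4}\,(\sigma^2 + \sigma^2) = \frac{\sigma^2}{2} \\
\text{Indirect:}\quad
  \operatorname{Var}\!\left[\log(A/R) - \log(B/R)\right]
  &= \sigma^2 + \sigma^2 = 2\sigma^2
\end{align*}
```

Under these assumptions the indirect estimate has four times the variance of the direct one for the same number of slides.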
Summary
Selection of mRNA samples is important
• Most important: biological replicates
• Technical replicates are also useful, but different
• If needed and possible, use pooling wisely
Choice of experimental layout is guided by
• The scientific question
• Experimental design principles
• Efficiency and robustness considerations
The correspondence between experimental designs, linear models and ANOVA can be exploited to select a model and analyze the data.
Experimental design, Linear Models and Analysis of the Variance
In experimental design the different sources of variability influencing the observed response may be identified.
These sources can be related to the response using a linear model.
Analysis of the variance can be used to separately estimate and test the relative importance of each source of variability.
Statistical methods to detect differentially expressed genes
Class comparison: Identifying differentially expressed genes
Identify genes differentially expressed between different conditions such as
• Treatment, cell type, ... (qualitative covariates)
• Dose, time, ... (quantitative covariates)
• Survival, infection time, ...
Estimate effects/differences between groups, probably using log-ratios, i.e. the difference on the log scale: log(X) − log(Y) [= log(X/Y)]
What is a “significant change”?
Depends on the variability within groups, which may be different from gene to gene.
To assess the statistical significance of differences, conduct a statistical test for each gene.
Different settings for statistical tests
Indirect comparisons: two groups, two samples, unpaired
• E.g. 10 individuals: 5 suffer from diabetes, 5 are healthy
• One sample from each individual
• Typically: two-sample t-test or similar
Direct comparisons: two groups, two samples, paired
• E.g. 6 individuals with brain stroke
• Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2)
• Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions
Different ways to do the experiment
An experiment may use cDNA arrays ("two-colour") or Affy arrays ("one-colour").
Depending on the technology used, the allocation of conditions to slides changes.
Experiment                       cDNA (2-col)                       Affy (1-col)
10 indiv.: Diab (5), Heal (5)    Reference design:                  Comparison design:
                                 (5) Diab/Ref, (5) Heal/Ref         (5) Diab vs (5) Heal
6 indiv.: Region 1, Region 2     6 slides, 1 individual per slide:  12 slides:
                                 (6) reg1/reg2                      (6) paired differences
"Natural" measures of discrepancy
For direct comparisons in two colour or paired one colour:
• Mean (log) ratio: $\bar{R} = \frac{1}{n_T} \sum_{i=1}^{n_T} R_i$ (R or M used interchangeably)
• Classical t-test: $t = \bar{R} / SE$ (SE estimates the standard error of $\bar{R}$)
• Robust t-test: use robust estimates of location & scale
For indirect comparisons in two colour or direct comparisons in one colour:
• Mean difference: $\bar{T} - \bar{C} = \frac{1}{n_T} \sum_{i=1}^{n_T} T_i - \frac{1}{n_C} \sum_{i=1}^{n_C} C_i$
• Classical t-test: $t = (\bar{T} - \bar{C}) \,/\, \bigl(s_p \sqrt{1/n_T + 1/n_C}\bigr)$
• Robust t-test: use robust estimates of location & scale
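The two classical statistics above can be sketched in a few lines using only the standard library. The function names and the example numbers are hypothetical; the pooled standard deviation is the usual equal-variance choice for $s_p$, which is an assumption here.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(log_ratios):
    """One-sample t for log-ratios: mean divided by its standard error."""
    n = len(log_ratios)
    return mean(log_ratios) / (stdev(log_ratios) / sqrt(n))

def two_sample_t(treated, control):
    """Classical two-sample t using a pooled standard deviation s_p."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * variance(treated)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var * (1 / n_t + 1 / n_c))

# Hypothetical log2 values for a single gene
t_direct = paired_t([0.8, 1.1, 0.9, 1.2])
t_indirect = two_sample_t([1.0, 1.4, 1.2], [0.2, 0.1, 0.4])
```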
Some Issues
Can we trust average effect sizes (average difference of means) alone?
Can we trust the t statistic alone?
Here is evidence that the answer is no.
Gene  M1    M2    M3     M4    M5    M6    Mean   SD    t
A     2.5   2.7   2.5    2.8   3.2   2     2.61   0.40  16.10
B     0.01  0.05  -0.05  0.01  0     0     0.003  0.03  0.25
C     2.5   2.7   2.5    1.8   20    1     5.08   7.34  1.69
D     0.5   0     0.2    0.1   -0.3  0.3   0.13   0.27  1.19
E     0.1   0.11  0.1    0.1   0.11  0.09  0.10   0.01  33.09
Courtesy of Y.H. Yang
• Averages can be driven by outliers.
• t’s can be driven by tiny variances.
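The summary columns of the table above can be recomputed directly from the six log-ratios of each gene. The sketch below does so and ranks the genes both ways; the variable names are hypothetical and the values are copied from the table.

```python
from math import sqrt
from statistics import mean, stdev

# Log-ratios M1..M6 copied from the table above
genes = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2],
    "B": [0.01, 0.05, -0.05, 0.01, 0, 0],
    "C": [2.5, 2.7, 2.5, 1.8, 20, 1],
    "D": [0.5, 0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}

def summarize(values):
    """Return (mean, SD, one-sample t) for one gene."""
    m, s = mean(values), stdev(values)
    return m, s, m / (s / sqrt(len(values)))

summary = {g: summarize(v) for g, v in genes.items()}
by_mean = sorted(summary, key=lambda g: summary[g][0], reverse=True)
by_t = sorted(summary, key=lambda g: summary[g][2], reverse=True)
```

Ranking by mean puts gene C first because of its single value of 20, while ranking by t puts gene E first because its standard deviation is tiny, which is the point the slide makes.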
Variations in t-tests (1)
Let
• Rg = mean observed log ratio for gene g
• SEg = standard error of Rg estimated from data on gene g
• SE = standard error of Rg estimated from data across all genes
Global t-test: t = Rg/SE
Gene-specific t-test: t = Rg/SEg
Some pros and cons of t-tests
Test                        Pros                               Cons
Global t-test: t = Rg/SE    Yields stable variance estimate    Assumes variance homogeneity; biased if false
Gene-specific: t = Rg/SEg   Robust to variance heterogeneity   Low power; yields unstable variance estimates (due to few data)
T-tests extensions
SAM (Tibshirani, 2001): $S = \dfrac{R_g}{c + SE_g}$
Regularized-t (Baldi, 2001): $t = \dfrac{R_g}{\sqrt{\dfrac{v_0\, SE^2 + (n-1)\, SE_g^2}{v_0 + n - 2}}}$
EB-moderated t (Smyth, 2003): $t = \dfrac{R_g}{\sqrt{\dfrac{d_0\, SE_0^2 + d\, SE_g^2}{d_0 + d}}}$
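All three extensions stabilise the denominator of the t statistic. As a hedged illustration of the first one, the sketch below applies the SAM-style form S = Rg / (c + SEg) to two hypothetical genes. The numbers and the value of c are invented for illustration; the actual SAM procedure chooses its constant from the data, which is not reproduced here.

```python
def sam_like_statistic(r_g, se_g, c):
    """S = R_g / (c + SE_g): the constant c damps genes with a tiny SE_g."""
    return r_g / (c + se_g)

# Hypothetical gene with a small effect and a tiny standard error
small_effect, small_se = 0.10, 0.003
# Hypothetical gene with a large effect and a moderate standard error
large_effect, large_se = 2.61, 0.16

c = 0.1  # hypothetical constant, chosen only for this illustration

t_small = small_effect / small_se          # ordinary gene-specific t
t_large = large_effect / large_se
s_small = sam_like_statistic(small_effect, small_se, c)
s_large = sam_like_statistic(large_effect, large_se, c)
```

With the ordinary t the small-effect gene ranks first; once c is added to the denominator the large-effect gene ranks first.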
So far: Can we generate a list of candidate genes?
With the tools we have, the reasonable steps to generate a list of candidate genes may be:
1. Start from the data for each gene
   Gene 1: M11, M12, …, M1k
   Gene 2: M21, M22, …, M2k
   …
   Gene G: MG1, MG2, …, MGk
2. For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
3. Obtain the statistics of interest S1, S2, …, SG
4. ? → a list of candidate DE genes
We need an idea of how significant these values are: we’d like to assign them p-values.
Significance testing
Nominal p-values
After a test statistic is computed, it is convenient to convert it to a p-value:
The probability that a test statistic, say S(X), takes a value equal to or greater than the one observed on the sample, say S(X0), under the assumption that the null hypothesis is true
p = P{S(X) ≥ S(X0) | H0 true}
Significance testing
Test of significance at the α level:
• Reject the null hypothesis if your p-value is smaller than the significance level α
• It has advantages but is not free from criticism
Genes with p-values falling below a prescribed level may be regarded as significant.
Hypothesis testing overview for a single gene
State of nature ("Truth")   Reported decision: H0 is rejected (gene is selected)   Reported decision: H0 is accepted (gene not selected)
H0 is false (affected)      TP, prob: 1 − β                                         FN, prob: β (Type II error)
H0 is true (not affected)   FP, P[Rej H0 | H0] ≤ α (Type I error)                   TN, prob: 1 − α
• Sensitivity: TP/[TP+FN]
• Specificity: TN/[TN+FP]
• Positive predictive value: TP/[TP+FP]
• Negative predictive value: TN/[TN+FN]
Calculation of p-values
Standard methods for calculating p-values:
(i) Refer to a statistical distribution table (Normal, t, F, …) or
(ii) Perform a permutation analysis
(i) Tabulated p-values
Tabulated p-values can be obtained for standard test statistics (e.g.the t-test)
They often rely on the assumption of normally distributed errors in the data
This assumption can be checked (approximately) using a
• Histogram
• Q-Q plot
Example
Golub data: 27 ALL vs 11 AML samples, 3051 genes.
A t-test yields 1045 genes with p < 0.05.
(ii) Permutation tests
Based on data shuffling. No distributional assumptions.
• Random interchange of labels between samples
• Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
Repeat for every possible permutation, b = 1…B
• Permute the n data points for the gene (x). The first n1 are referred to as "treatments", the second n2 as "controls"
• For each gene, calculate the corresponding two-sample t-statistic, tb
After all the B permutations are done, put p = #{b: |tb| ≥ |tobserved|}/B
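The permutation recipe above can be sketched for a single gene as follows. One deliberate departure is an assumption of this sketch: instead of enumerating every possible permutation, it draws B random relabellings, which is a common approximation when the full set is too large. Function names and defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Classical two-sample t with pooled variance."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))

def permutation_p_value(treated, control, n_perm=2000, seed=0):
    """Two-sided p = #{b: |t_b| >= |t_observed|} / B over random relabellings."""
    rng = random.Random(seed)
    t_obs = abs(two_sample_t(treated, control))
    pooled = list(treated) + list(control)
    n1 = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # The first n1 shuffled values play "treatments", the rest "controls"
        if abs(two_sample_t(pooled[:n1], pooled[n1:])) >= t_obs:
            hits += 1
    return hits / n_perm

# Hypothetical log-expression values for one gene
p_separated = permutation_p_value([2.1, 2.5, 1.9, 2.8, 2.4], [0.3, 0.1, -0.2, 0.4, 0.0])
p_overlapping = permutation_p_value([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

For a genome-wide analysis the same function would be called once per gene, and the resulting p-values would still need a multiple-testing adjustment.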
Volcano plot: fold change vs log(odds) (1)
[Figure: volcano plot separating genes with a significant change detected from genes with no change detected]
(1) log(odds) is proportional to "−log(p-value)"
Linear models and Analysis of the Variance to analyze designed experiments
From experimental design to linear models
Some weaknesses of the statistical framework presented so far
• What to do if a treatment has more than 2 levels?
• How to deal with more than one treatment or experimental condition?
• How to deal with nuisance factors such as batch effects, covariates, etc.?
Most of this can be solved with an alternative approach: Analysis of the Variance.
Multiple testing
How far can we trust the decision?
The test "Reject H0 if p-val ≤ α" is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold α.
Nothing is guaranteed about P[FN]
• "Optimal" tests are built trying to minimize this probability
• In practical situations it is often high
What if we wish to test more than one gene at once? (1)
Consider more than one test at once (assuming independent tests and all null hypotheses true)
• Two tests, each at the 5% level: the probability of getting at least one false positive is now 1 − 0.95 × 0.95 = 0.0975
• Three tests: 1 − 0.95^3 = 0.1426
• n tests: 1 − 0.95^n, which converges towards 1 as n increases
Small p-values don’t necessarily imply significance! We are no longer controlling the probability of a type I error.
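The arithmetic above generalises to any number of tests. The short sketch below evaluates 1 − (1 − α)^n under the same assumption of independent tests with every null hypothesis true; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """1 - (1 - alpha)^n for n independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

# Print the probability for a few illustrative numbers of tests
for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

With thousands of genes, as in a microarray, at least one false positive is practically certain at the 5% level.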
What if we wish to test more than one gene at once? (2): a simulation
Simulation of this process for 6,000 genes with 8 treatments and 8 controls.
All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation.
The number of genes falsely rejected will be, on average, 6000 · α, i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes.
See example
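A simulation along these lines can be sketched with the standard library only. To avoid depending on a statistics package, the sketch compares |t| with 2.977, the two-sided 1% critical value of Student's t with 14 degrees of freedom taken from standard tables; that shortcut, the seed and all names are assumptions of this sketch and not the original example.

```python
import random
from math import sqrt
from statistics import mean, variance

N_GENES, N_PER_GROUP = 6000, 8
# Two-sided 1% critical value of Student's t with 8 + 8 - 2 = 14 d.f.
# (from standard t tables; an assumption of this sketch)
T_CRIT = 2.977

def two_sample_t(x, y):
    """Classical two-sample t with pooled variance."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))

rng = random.Random(2024)
false_positives = 0
for _ in range(N_GENES):
    # Both groups come from the same N(0, 1): no gene is truly changed
    treated = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    if abs(two_sample_t(treated, control)) > T_CRIT:
        false_positives += 1

print(false_positives)  # expected to be in the vicinity of 6000 * 0.01 = 60
```

Every gene counted here is a false positive by construction, since nothing is differentially expressed in the simulated data.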
Multiple testing: Counting errors
State of nature ("Truth")    Decision reported: H0 is rejected (genes selected)   Decision reported: H0 is accepted (genes not selected)   Total
H0 is false (affected)       S                                                     T                                                         m − m0
H0 is true (not affected)    V                                                     U                                                         m0
Total                        R                                                     m − R                                                     m
V = # Type I errors [false positives]
T = # Type II errors [false negatives]
All these quantities could be known if m0 was known
How does type I error control extend to multiple testing situations?
Selecting genes with a p-value less than α doesn’t control for P[FP] anymore.
What can be done?
• Extend the idea of type I error: FWER and FDR are two such extensions
• Look for procedures that control the probability of these extended error types: mainly, adjust raw p-values
Two main error rate extensions
Family Wise Error Rate (FWER)
• FWER is the probability of at least one false positive
• FWER = Pr(# of false discoveries > 0) = Pr(V > 0)
False Discovery Rate (FDR)
• FDR is the expected value of the proportion of false positives among rejected null hypotheses
• FDR = E[V/R; R>0] = E[V/R | R>0]·P[R>0]
FDR and FWER controlling procedures
FWER
• Bonferroni (adj Pvalue = min{n*Pvalue, 1})
• Holm (1979)
• Hochberg (1988)
• Westfall & Young (1993) maxT and minP
FDR
• Benjamini & Hochberg (1995)
• Benjamini & Yekutieli (2001)
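Two of the procedures listed above are short enough to sketch directly: the Bonferroni adjustment, using the formula given in the list, and the Benjamini & Hochberg step-up adjustment. The function names and the raw p-values are hypothetical, and this is an illustrative sketch rather than a reference implementation.

```python
def bonferroni(p_values):
    """FWER control: adjusted p = min(n * p, 1)."""
    n = len(p_values)
    return [min(n * p, 1.0) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: adjusted p_(i) = min over j >= i of n * p_(j) / j."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, n * p_values[i] / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical raw p-values
adj_fwer = bonferroni(raw)
adj_fdr = benjamini_hochberg(raw)
```

Genes are then selected by comparing the adjusted p-values with the chosen level; the FDR-adjusted values are never larger than the Bonferroni ones, so the FDR list is at least as long.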
Difference between controlling FWER or FDR
FWER: controls the probability of any (even one) false positive
• Gives many fewer genes (false positives), but you are likely to miss many
• Adequate if the goal is to identify a few genes that differ between two groups
FDR: controls the proportion of false positives
• If you can tolerate more false positives you will get many fewer false negatives
• Adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes
Steps to generate a list of candidate genes revisited (2)
1. Start from the data for each gene
   Gene 1: M11, M12, …, M1k
   Gene 2: M21, M22, …, M2k
   …
   Gene G: MG1, MG2, …, MGk
2. For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
3. Obtain the statistics of interest S1, S2, …, SG
4. Make an assumption on the null distribution: data normality
5. Compute nominal p-values P1, P2, …, PG
6. Compute adjusted p-values aP1, aP2, …, aPG
7. Select genes with adjusted p-values smaller than α → a list of candidate DE genes
Example
Golub data, 27 ALL vs 11 AML samples, 3051 genes
Bonferroni adjustment: 98 genes with padj< 0.05 (praw < 0.000016)
66
Extensions
Some issues we have not dealt with
• Replicates within and between slides
• Several effects: use a linear model
• ANOVA: are the effects equal?
• Time series: selecting genes for trends
Different solutions have been suggested for each problem.
Still many open questions.
Examples
Ex. 1- Swirl zebrafish experiment
Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis
As a result some cell types are reduced and others are expanded
A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to the wild-type zebrafish
Example 1: Experimental design
Each microarray contained 8848 cDNA probes (either genes or EST sequences).
4 replicate slides: 2 sets of dye-swap pairs.
For each pair, target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cDNA of the wild type was labeled using the other dye.
[Figure: wild type and swirl compared directly, with 2 slides in each dye orientation]
Example 1. Data analysis
Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type.
On a gene-per-gene basis this is a one-sample problem.
Hypothesis to be tested for each gene: H0: log2(R/G) = 0
The decision will be based on average log-ratios.
Example 2. Scavenger receptor BI (SR-BI) experiment
Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
Transgenic mice with the SR-BI gene overexpressed have low HDL cholesterol levels.
Goal: to identify genes with altered expression in the livers of transgenic mice with the SR-BI gene overexpressed (T) compared to "normal" FVB control mice (C).
Example 2. Experimental design
8 treatment mice (Ti) and 8 control ones (Ci).
16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
Probes: ~6,000 cDNAs (genes), including 200 related to pathogenicity.
[Figure: 8 T samples and 8 C samples, each hybridized against the pooled control C*]
Example 2. Data analysis
Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C*)) and 8 for control (log(C/C*)).
On a gene-per-gene basis this is a two-sample problem.
Hypothesis to be tested for each gene: H0: [log(R1/G) − log(R2/G)] = 0
The decision will be based on the average difference of log ratios.
Software for microarray data analysis
Introduction
Microarray experiments generate huge quantities of data which have to be stored, managed, visualized, processed …
Many options are available. However…
• No tool satisfies all users’ needs
• Trade-off. A tool must be
   Powerful but user friendly
   Complete but without too many options
   Flexible but easy to start with and go further
   Available, up to date and well documented, but affordable
So, what you need is "R"?
R is an open-source system for statistical computation and graphics. It consists of
• A language
• A run-time environment with graphics, a debugger, and access to certain system functions
It can be used
• Interactively, through a command language
• Or running programs stored in script files
http://www.r-project.org/
Some pros & cons
Pros
• Powerful, used by statisticians
• Easy to extend: creating add-on packages; many already available
• Freely available: Unix, Windows & Mac
• Lots of documentation
Cons
• Not very easy to learn
• Command-based
• Documentation sometimes cryptic
• Memory intensive (worse on Windows)
• Slow at times
We believe it is worth the effort!
• If you "just want to do statistical analysis": easy to find alternatives
• If you intend to do microarray data analysis: probably one of the best options
R and Microarrays
R is a popular tool among statisticians.
Once they started to work with microarrays they continued using it
• To perform the analysis
• To implement new tools
This very quickly gave rise to lots of free R-based software to analyze microarrays.
The Bioconductor project groups many (but not all) of these developments.
The Bioconductor project
Open source and open development software project for the analysis and comprehension of genomic data.
Most early developments are R packages.
Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html
Has reached some stability but is still evolving: what is now a standard may not be so in the future.
There's much more than R!
Have a look at "My microarray software comparison": http://ihome.cuhk.edu.hk/~b400559/arraysoft.html