Microarray Data Analysis


  • Microarray Data Analysis: statistical methods to detect differentially expressed genes

  • Outline
    The class comparison problem
    Statistical tests
    Calculation of p-values
    Permutation tests
    The volcano plot
    Multiple testing
    Extensions
    Examples

  • Class comparison: identifying differentially expressed genes
    Identify genes whose expression is significantly associated with different conditions:
    Treatment, cell type, ... (qualitative covariates)
    Dose, time, ... (quantitative covariates)
    Survival, infection time, ...
    Estimate effects/differences between groups, typically using log-ratios, i.e. the difference on the log scale: log(X) - log(Y) [= log(X/Y)]

  • What is a significant change?
    It depends on the variability within groups, which may differ from gene to gene.
    To assess the statistical significance of differences, conduct a statistical test for each gene.

  • Different settings for statistical tests
    Indirect comparisons: 2 groups, 2 samples, unpaired
    E.g. 10 individuals: 5 with diabetes, 5 healthy; one sample from each individual
    Typically: two-sample t-test or similar
    Direct comparisons: 2 groups, 2 samples, paired
    E.g. 6 individuals with brain stroke; two samples from each: one from the healthy region (region 1) and one from the affected region (region 2)
    Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions
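
    The two settings above can be sketched with `scipy.stats`; the group sizes and effect sizes below are made-up illustration values, not data from the slides.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Indirect comparison: 5 diabetic vs 5 healthy individuals, one sample each
    # (hypothetical log-expression values for a single gene)
    diabetic = rng.normal(loc=1.0, scale=0.5, size=5)
    healthy = rng.normal(loc=0.0, scale=0.5, size=5)
    t_unpaired, p_unpaired = stats.ttest_ind(diabetic, healthy)

    # Direct comparison: 6 stroke patients, healthy vs affected region of the same brain
    region1 = rng.normal(loc=0.0, scale=0.5, size=6)            # healthy region
    region2 = region1 + rng.normal(loc=0.8, scale=0.3, size=6)  # affected region
    t_paired, p_paired = stats.ttest_rel(region2, region1)

    # The paired t-test is exactly a one-sample t-test on the differences
    t_diff, p_diff = stats.ttest_1samp(region2 - region1, 0.0)
    ```

    The last two calls give identical results, which is why the slides call the paired test a "one sample t-test ... based on the individual differences between conditions".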

  • Different ways to do the experiment
    An experiment may use cDNA arrays (two-colour) or Affymetrix arrays (one-colour).
    Depending on the technology used, the allocation of conditions to slides changes.

  • Natural measures of discrepancy
    For direct comparisons in two-colour, or paired one-colour, experiments.
    For indirect comparisons in two-colour experiments, or direct comparisons in one-colour experiments.

  • Some issues in gene selection
    Gene expression values have peculiarities that have to be dealt with.
    Some are related to small sample sizes:
    Variance instability
    Non-normality of the data
    Others are related to the large number of variables:
    Multiple testing

  • Variance instability
    Can we trust average effect sizes (average differences of means) alone?
    Can we trust the t statistic alone?
    Here is evidence that the answer is no. (Courtesy of Y.H. Yang)

  • Variance instability (1): outliers
    Averages can be driven by outliers.

  • Variance instability (2): tiny variances
    t statistics can be driven by tiny variances.

  • Solutions: adapt t-tests
    Let:
    Rg = mean observed log-ratio for gene g
    SEg = standard error of Rg, estimated from the data on gene g
    SE = standard error of Rg, estimated from the data across all genes
    Global t-test: t = Rg / SE
    Gene-specific t-test: t = Rg / SEg
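
    A minimal sketch of the two statistics on simulated log-ratios. The matrix sizes are arbitrary, and pooling SE as the mean of the per-gene standard errors is just one simple choice for the "across all genes" estimate; the slides do not pin down the pooling rule.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    G, n = 200, 8                              # genes x replicate arrays (hypothetical)
    R = rng.normal(0.0, 1.0, size=(G, n))      # simulated log-ratios

    Rg = R.mean(axis=1)                        # mean observed log-ratio per gene
    SEg = R.std(axis=1, ddof=1) / np.sqrt(n)   # gene-specific standard error
    SE = SEg.mean()                            # one "global" SE pooled across genes

    t_global = Rg / SE     # same denominator for every gene: robust to tiny variances,
                           # but ignores genuine gene-to-gene variance differences
    t_gene = Rg / SEg      # per-gene denominator: can blow up when SEg is tiny
    ```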

  • Some pros and cons of the t-test

  • t-test extensions
    SAM (Tibshirani, 2001)
    Regularized t (Baldi, 2001)
    EB-moderated t (Smyth, 2003)

  • Up to here: can we generate a list of candidate DE genes?
    With the tools we have, the reasonable steps to generate a list of candidate genes may be clear, but:
    We need an idea of how significant these values are.
    We'd like to assign them p-values.

  • Nominal p-values
    After a test statistic is computed, it is convenient to convert it to a p-value: the probability that the test statistic, say S(X), takes values equal to or greater than the observed value, say S(x0), under the assumption that the null hypothesis is true:
    p = P{S(X) >= S(x0) | H0 true}
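
    For a two-sided t-test this tail probability comes straight from the t distribution. The observed statistic and degrees of freedom below are invented for illustration (df = 16 would correspond, e.g., to a two-sample test with 9 + 9 observations).

    ```python
    from scipy import stats

    t_obs = 2.5   # hypothetical observed t statistic
    df = 16       # n1 + n2 - 2 for a two-sample test

    # p = P{|T| >= |t_obs| | H0}: twice the upper tail of the t distribution
    p_two_sided = 2 * stats.t.sf(abs(t_obs), df)
    ```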

  • Significance testing
    Test of significance at the α level: reject the null hypothesis if your p-value is smaller than the significance level.
    Genes with p-values falling below a prescribed level may be regarded as significant.
    This approach has advantages, but it is not free from criticism.

  • Hypothesis testing overview for a single gene

  • Calculation of p-values
    Standard methods for calculating p-values:
    (i) refer to a statistical distribution table (Normal, t, F, ...), or
    (ii) perform a permutation analysis

  • (i) Tabulated p-values
    Tabulated p-values can be obtained for standard test statistics (e.g. the t-test).
    They often rely on the assumption of normally distributed errors in the data.
    This assumption can be checked (approximately) using a histogram or a Q-Q plot.

  • Example
    Golub data: 27 ALL vs 11 AML samples, 3051 genes.
    A t-test yields 1045 genes with p < 0.05.

  • (ii) Permutation tests
    Based on data shuffling: no distributional assumptions.
    Randomly interchange labels between samples.
    Estimate a p-value for each comparison (gene) from the permutation distribution of the t-statistics:
    Repeat for every possible permutation, b = 1, ..., B:
    Permute the n data points for the gene (x); the first n1 are referred to as treatments, the remaining n2 as controls.
    For each gene, calculate the corresponding two-sample t-statistic, tb.
    After all B permutations are done, set p = #{b: |tb| >= |tobserved|} / B.
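
    The recipe above, sketched for a single gene on simulated data. Full enumeration of all C(16,8) = 12,870 label assignments is feasible here, but for simplicity this sketch uses B random permutations instead (a common approximation).

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n1, n2 = 8, 8
    # hypothetical expression values: treatments shifted up by 1 relative to controls
    x = np.concatenate([rng.normal(1.0, 1.0, n1),    # treatments
                        rng.normal(0.0, 1.0, n2)])   # controls

    def t_stat(data):
        # first n1 values play the role of treatments, the rest controls
        return stats.ttest_ind(data[:n1], data[n1:]).statistic

    t_obs = t_stat(x)

    B = 5000
    t_perm = np.empty(B)
    for b in range(B):
        t_perm[b] = t_stat(rng.permutation(x))   # shuffle labels, recompute t

    # p = #{b : |t_b| >= |t_observed|} / B
    p = np.mean(np.abs(t_perm) >= abs(t_obs))
    ```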

  • Permutation tests (2)

  • The volcano plot: fold change vs log(odds)*
    One region of the plot corresponds to "significant change detected", the other to "no change detected".
    * log(odds) is proportional to -log(p-value)
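
    A sketch of the volcano plot's coordinates on simulated data (all sizes, shifts and cutoffs below are illustration choices, not values from the slides): effect size on one axis, -log10(p) on the other, with the "significant change detected" corner selected by thresholding both.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    G, n = 1000, 8
    group1 = rng.normal(0.0, 1.0, (G, n))
    group2 = rng.normal(0.0, 1.0, (G, n))
    group2[:50] += 1.5                # give 50 genes a real shift

    t, p = stats.ttest_ind(group1, group2, axis=1)

    log_fc = group2.mean(axis=1) - group1.mean(axis=1)  # horizontal axis (log scale)
    neg_log_p = -np.log10(p)                            # vertical axis

    # "significant change detected": large effect AND small p-value
    selected = (np.abs(log_fc) > 1.0) & (neg_log_p > -np.log10(0.01))
    ```

    Plotting `neg_log_p` against `log_fc` gives the familiar volcano shape; genes with large fold change but unstable variance, or tiny variance but negligible fold change, fall outside the selected corner.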

  • Multiple testing

  • How far can we trust the decision?
    The test "reject H0 if p-value <= α" is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold:
    P[Reject H0 | H0 true] = P[FP] <= α
    Nothing is guaranteed about P[FN].
    Optimal tests are built by trying to minimize this probability; in practical situations it is often high.

  • What if we wish to test more than one gene at once? (1)
    Consider more than one test at once:
    Two tests, each at the 5% level: the probability of getting at least one false positive is 1 - 0.95 * 0.95 = 0.0975
    Three tests: 1 - 0.95^3 = 0.1426
    n tests: 1 - 0.95^n, which converges towards 1 as n increases
    Small p-values don't necessarily imply significance! We are no longer controlling the probability of a type I error.
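
    The arithmetic above in one line (assuming independent tests):

    ```python
    # Probability of at least one false positive among n independent tests at level alpha
    def family_wise_fp_prob(n, alpha=0.05):
        return 1 - (1 - alpha) ** n

    p2 = family_wise_fp_prob(2)      # 1 - 0.95**2 = 0.0975
    p3 = family_wise_fp_prob(3)      # 1 - 0.95**3 = 0.142625
    p100 = family_wise_fp_prob(100)  # already above 0.99
    ```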

  • What if we wish to test more than one gene at once? (2): a simulation
    Simulate this process for 6,000 genes with 8 treatments and 8 controls.
    All gene expression values are simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in the simulation.
    The number of genes falsely rejected will be on average 6000 * α, i.e. if we reject all genes with a p-value of less than 1%, we will falsely reject around 60 genes.
    See example.
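
    The simulation described above is a few lines of numpy/scipy (the seed is arbitrary; the exact count varies from run to run around the expected 60):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    G = 6000
    treat = rng.normal(0.0, 1.0, (G, 8))
    ctrl = rng.normal(0.0, 1.0, (G, 8))   # same distribution: nothing is DE

    _, p = stats.ttest_ind(treat, ctrl, axis=1)
    false_rejections = int((p < 0.01).sum())   # on average about 6000 * 0.01 = 60
    ```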

  • Multiple testing: counting errors
    V = # type I errors [false positives]
    T = # type II errors [false negatives]
    All these quantities could be known if m0, the number of true null hypotheses, were known.

  • How does type I error control extend to multiple testing situations?
    Selecting genes with a p-value less than α no longer controls P[FP].
    What can be done?
    Extend the idea of type I error: FWER and FDR are two such extensions.
    Look for procedures that control the probability of these extended error types, mainly by adjusting raw p-values.

  • Two main error rate extensions
    Family-Wise Error Rate (FWER): the probability of at least one false positive
    FWER = Pr(# of false discoveries > 0) = Pr(V > 0)
    False Discovery Rate (FDR): the expected proportion of false positives among the rejected null hypotheses
    FDR = E[V/R; R > 0] = E[V/R | R > 0] P[R > 0]

  • FDR- and FWER-controlling procedures
    FWER:
    Bonferroni (adjusted p-value = min{n * p-value, 1})
    Holm (1979)
    Hochberg (1986)
    Westfall & Young (1993): maxT and minP
    FDR:
    Benjamini & Hochberg (1995)
    Benjamini & Yekutieli (2001)
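
    A sketch of the two simplest procedures from the list, on a toy vector of p-values (in practice one would use a library routine such as `statsmodels.stats.multitest.multipletests` rather than hand-rolling this):

    ```python
    import numpy as np

    def bonferroni(p):
        """FWER control: adjusted p-value = min(n * p, 1)."""
        p = np.asarray(p, dtype=float)
        return np.minimum(len(p) * p, 1.0)

    def benjamini_hochberg(p):
        """FDR control: step-up adjustment of Benjamini & Hochberg (1995)."""
        p = np.asarray(p, dtype=float)
        n = len(p)
        order = np.argsort(p)
        scaled = p[order] * n / np.arange(1, n + 1)        # p_(i) * n / i
        # step-up: enforce monotonicity from the largest p-value downwards
        scaled = np.minimum.accumulate(scaled[::-1])[::-1]
        adj = np.empty(n)
        adj[order] = np.minimum(scaled, 1.0)
        return adj

    raw = [0.0001, 0.004, 0.01, 0.03, 0.8]
    p_bonf = bonferroni(raw)          # [0.0005, 0.02, 0.05, 0.15, 1.0]
    p_bh = benjamini_hochberg(raw)    # less conservative than Bonferroni
    ```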

  • Difference between controlling FWER and FDR
    FWER controls for no (0) false positives:
    it gives many fewer genes (false positives), but you are likely to miss many true positives;
    adequate if the goal is to identify the few genes that differ between two groups.
    FDR controls the proportion of false positives:
    if you can tolerate more false positives, you will get many fewer false negatives;
    adequate if the goal is to pursue the study further, e.g. to determine functional relationships among genes.

  • Steps to generate a list of candidate genes, revisited (2)
    To obtain a list of candidate DE genes:
    Compute nominal p-values P1, P2, ..., PG.
    Compute adjusted p-values aP1, aP2, ..., aPG.
    Select the genes with adjusted p-values smaller than α.

  • Example (1)
    Golub data: 27 ALL vs 11 AML samples, 3051 genes.
    Bonferroni adjustment: 98 genes with adjusted p < 0.05 (raw p < 0.000016).

  • Example (2)
    See the examples of testing in the case study found at this link: http://www.ub.edu/stat/docencia/bioinformatica/microarrays/ADM/labs/Virtaneva2002/Ejemplo_AML8.R

  • Extensions
    Some issues we have not dealt with:
    Replicates within and between slides
    Several effects: use a linear model
    ANOVA: are the effects equal?
    Time series: selecting genes for trends
    Different solutions have been suggested for each problem; there are still many open questions.

    One of the things we want to do with our t-statistics is, roughly speaking, to identify the extreme ones.

    It is natural to rank them, but how extreme is extreme? Since the sample sizes here are not too small (two samples of 8 each gives 16 terms in the difference of the means), approximate normality is not an unreasonable expectation for the null marginal distribution.

    Converting the ranked t-statistics into a normal Q-Q plot is a great way to see the extremes: they are the ones that are off the line, at one end or the other. This technique is particularly helpful when we have thousands of values. Of course we can't expect all differentially expressed genes to stand out as extremes: many will be masked by more extreme random variation, which is a big problem in this context.

    For strong control of the FWER at some level α, there are procedures which take m unadjusted p-values and modify them separately, so-called single-step procedures, the Bonferroni adjustment or correction being the simplest and best known. Another is due to Šidák.

    Other, more powerful procedures adjust sequentially, from the smallest p-value to the largest, or vice versa. These are the step-up and step-down methods, and we'll meet a number of these, usually variations on single-step procedures.

    As we will see, there is a bewildering variety of multiple testing procedures. How can we choose which to use? There is no simple answer here, but each can be judged according to a number of criteria:
    Interpretation: does the procedure answer a relevant question for you?
    Type of control: strong, exact or weak?
    Validity: are the assumptions under which the procedure applies clear and definitely or plausibly true, or are they unclear and most probably not true?
    Computability: are the procedure's calculations straightforward to carry out accurately, or is there possibly numerical or simulation uncertainty, or discreteness?