Microarray Statistics

Statistical Analysis of cDNA Microarray Genomics Data

Yuehua Cui

Graduate student

Department of Statistics

December 4th, 2002

Outline of the topics

• Introduction
• Data preprocessing
  – Alignment
  – Background calculation
  – Data transformation
• An example
• Normalization Comparison
• Post Hoc Analysis

Introduction

• The cDNA microarray is a new technique, introduced in 1995 by Schena.
• It quantitatively monitors the expression levels of thousands of genes at a time.
• All the methods and applications here are based on nylon membrane microarrays and can be extended to DNA microarray analysis on other platforms.
• Why normalization:
  – A number of systematic variations can occur during experiments. For example, the samples being compared are hybridized on different nylon membranes. Normalization is needed to remove these sources of variation.
  – Well-normalized data are the foundation of good analysis results.
• Statistical analysis

AtlasImage Data Preprocessing

• Alignment: each gene is represented by two spots. Match these two spots to a schematic representation of the array. The final intensity for the gene is the average of the intensities of the two spots.
• Background calculation
  – External (global): median intensity of the blank space between different panels.
  – User-defined external: median intensity of a user-defined area.
  – Local: median intensity of the space surrounding the gene spot.
• Data transformation:
  – Adjusted intensity = raw intensity − background value
  – Log transformation: log2, log10 or natural log.

[Figure: part of an Atlas nylon membrane array]

*Note: the two spots above or below the white bar represent one gene, i.e. one gene has two spots.
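The preprocessing steps above can be sketched in a few lines. This is only an illustration, not AtlasImage's own code: the function name, the choice to average the duplicate spots before subtracting a single background value, and the handling of non-positive adjusted intensities are all assumptions.

```python
import math

# Assumption: duplicate spots are averaged first, then one background value
# is subtracted; the slides do not fix the order of these two steps.
LOGS = {"log2": math.log2, "log10": math.log10, "ln": math.log}

def adjusted_log_intensity(spot_a, spot_b, background, transform="log2"):
    """Average the two spots of a gene, subtract background, log-transform."""
    raw = (spot_a + spot_b) / 2.0      # final raw intensity of the gene
    adjusted = raw - background        # adjusted = raw - background
    if adjusted <= 0:
        return None                    # at or below background: log undefined
    return LOGS[transform](adjusted)

value = adjusted_log_intensity(70.0, 74.0, 8.0)   # (72 - 8) = 64
print(value)
```

Genes whose adjusted intensity is not positive are returned as `None` here so that they can be filtered out before any ratio is formed.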

An example: RL95 cell line data set

• Each Clontech Stress array contains 234 sequences expressed in response to stress.
• Each insert cDNA is denatured and UV cross-linked to a positively charged membrane.
• Samples are treated with DMSO alone or with BaP (benzo(a)pyrene) dissolved in DMSO, so DMSO is the control and BaP is the treatment.
• DMSO- and BaP-treated samples are hybridized under the same conditions each time. Two membranes are each used three times, one for the DMSO-treated samples and one for the BaP-treated samples.
• The three biological replicates are done with the same membrane(s), so the replicates are correlated.

[Figure: experimental workflow]

• Control RNA sample and test RNA sample
• Reverse transcription with 33P-dCTP to produce radio-labelled cDNA probes
• Hybridization to microarray filters
• Use a Phosphor Imager laser scanner to obtain the density of each spot on the filter.
• Compare densities at each spot to determine if treatment changes gene expression. Compile the subset of differentially expressed genes.

Gene   Control   Test
A      1X        3X
:      :         :
Z      1X        0.5X

Scatter plots of adjusted log intensities for paired experiments of DMSO vs BaP

[Figure: three scatter plots of DMSO vs BaP, one per replicate: log(bap1) vs log(dmso1), log(bap2) vs log(dmso2), log(bap3) vs log(dmso3)]

Normalization

• Global normalization (AtlasImage™)
  – Assumption: given a large enough sample size, the average signal intensity (gene expression level) does not change.
  – Sum method:
    Normalization coefficient $k_j = \dfrac{\sum_{i=1}^{n} (I_{1i} - B_{1i})}{\sum_{i=1}^{n} (I_{2i} - B_{2i})}$
    where $I_{mi}$ = intensity of gene $i$ on array $m$, $m = 1, 2$; $B_{mi}$ = the corresponding background intensity on array $m$; $n$ = number of genes on the array.
  – Problem: the assumption may not be valid, and stronger signals dominate the summation.
  – Median method (robust with respect to outliers):
    Normalization coefficient $k_j = \dfrac{\operatorname{median}(I_{1i} - B_{1i})}{\operatorname{median}(I_{2i} - B_{2i})}$
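The difference between the sum and median coefficients is easy to see on a toy example. The sketch below is not AtlasImage code; the function names and the intensity values are made up for illustration, and array 1 is taken as the reference in the numerator.

```python
from statistics import median

def adjusted(intensities, backgrounds):
    """Background-adjusted intensities I - B, gene by gene."""
    return [i - b for i, b in zip(intensities, backgrounds)]

def sum_coefficient(adj1, adj2):
    """Sum method: ratio of total adjusted intensity, array 1 over array 2."""
    return sum(adj1) / sum(adj2)

def median_coefficient(adj1, adj2):
    """Median method: ratio of median adjusted intensities."""
    return median(adj1) / median(adj2)

# Hypothetical data: one very strong spot on array 1.
array1 = adjusted([15, 25, 35, 1005], [5, 5, 5, 5])   # 10, 20, 30, 1000
array2 = adjusted([24, 44, 64, 204], [4, 4, 4, 4])    # 20, 40, 60, 200

k_sum = sum_coefficient(array1, array2)        # 1060 / 320
k_median = median_coefficient(array1, array2)  # 25 / 50
print(k_sum, k_median)
```

Three of the four genes suggest a coefficient of 0.5, which the median method returns, while the single strong spot pulls the sum coefficient above 3.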

Normalization continued

• Housekeeping gene normalization
  – Housekeeping genes are a set of genes whose expression levels are not affected by the treatment.
  – The normalization coefficient is the ratio $m_C / m_T$, where $m_C$ and $m_T$ are the means of the selected housekeeping genes for control and treatment respectively.
  – Problem: housekeeping genes sometimes change their expression level, in which case the assumption does not hold.
• Trimmed mean normalization (adjusted global method)
  – Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is
    $k_i = \dfrac{m_{C_i}}{m_{T_i}}$
    where $m_{T_i}$ and $m_{C_i}$ are the trimmed means for the $i$th treatment and control respectively.
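A minimal sketch of the trimmed mean coefficient follows. It assumes that 5% of the values are dropped at each end of the sorted data (the slides do not say whether the 5% is per tail or in total), and the function names and data are hypothetical.

```python
def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)           # number dropped at each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def trimmed_mean_coefficient(control, treatment, trim=0.05):
    """k = m_C / m_T computed from the trimmed means."""
    return trimmed_mean(control, trim) / trimmed_mean(treatment, trim)

# Hypothetical intensities: 20 genes, one extreme value in each sample.
control = list(range(1, 20)) + [1000]
treatment = [2 * v for v in control]

k = trimmed_mean_coefficient(control, treatment)
print(trimmed_mean(control), k)
```

With 20 genes, one value is removed from each end, so the extreme value of 1000 no longer influences the mean.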

Normalization continued

• Regression normalization:
  – Fit the linear regression model: $y_i = \alpha + \beta x_i + \varepsilon_i$ ($x_i$: control, $y_i$: treatment log intensity of gene $i$)
  – Assumption: all the genes on the array have the same variance (homogeneity).
  – Test the significance of the intercept $\alpha$. Fit a linear regression without $\alpha$ if it is not significant.
  – Transform the treatment data: $y_i' = (y_i - \hat{\alpha}) / \hat{\beta}$
  – Problems:
    • The assumption may not hold.
    • The trend may be nonlinear (the third replicate of the RL95 data has a slight quadratic trend).
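As a rough illustration of regression normalization, the sketch below fits an ordinary least-squares line of treatment on control log intensities and rescales the treatment values with the fitted intercept and slope. The form of the transformation, the function names and the data are assumptions, and the significance test for the intercept is left out.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (alpha, beta) for y = alpha + beta*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def regression_normalize(x, y):
    """Map treatment values y onto the control scale: (y - alpha) / beta."""
    alpha, beta = fit_line(x, y)
    return [(b - alpha) / beta for b in y]

# Hypothetical log intensities with a purely systematic shift and scale.
log_control = [2.0, 3.0, 4.0, 5.0, 6.0]
log_treatment = [0.5 + 1.2 * v for v in log_control]

normalized = regression_normalize(log_control, log_treatment)
print(normalized)
```

Because the toy treatment data differ from the control only by a linear distortion, the normalized values fall back onto the control values.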

Scatter plot of log intensity before and after regression normalization

[Figure: two rows of three scatter plots of log(bap) vs log(dmso) for replicates 1 to 3; top row before normalization, bottom row "scatter plot after norm"]

Normalization continued

• Rank normalization (this method assumes only a small number of genes will be differentially expressed):
  – $R_{Cj} \pm c$ criterion, $j = 1, \ldots, g$, where $c = g \times 10\%$, $g$ is the total number of genes and $R_{Cj}$ is the rank of gene $j$ in the control.
  – Choose the set of genes that have a similar expression pattern, i.e. $R_{Tj} \in (R_{Cj} \pm c)$, where $R_{Tj}$ is the rank of gene $j$ in the treatment.
  – Normalization coefficient: $k_i = m_{C_i} / m_{T_i}$, where $m_{T_i}$ and $m_{C_i}$ are the means of the selected genes for the $i$th treatment and control respectively.
  – Question: how to choose $c$?
  – Rank invariant genes (Eric Schadt, 2001, Journal of Cellular Biochemistry (supplement) 37:120-125)
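The rank criterion can be sketched as follows. The function names and the data are hypothetical, ties are ranked in order of appearance, and the boundary case where the rank difference equals c is counted as similar; none of these details is specified on the slides.

```python
def ranks(values):
    """Ordinal ranks (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_normalization(control, treatment, fraction=0.10):
    """Select genes whose rank changes by at most c = g * fraction,
    then return k = m_C / m_T over the selected genes."""
    g = len(control)
    c = g * fraction
    rank_c, rank_t = ranks(control), ranks(treatment)
    selected = [j for j in range(g) if abs(rank_t[j] - rank_c[j]) <= c]
    mean_c = sum(control[j] for j in selected) / len(selected)
    mean_t = sum(treatment[j] for j in selected) / len(selected)
    return mean_c / mean_t, selected

# Hypothetical data: treatment is twice the control, except gene 0,
# which is strongly up-regulated.
control = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
treatment = [2 * v for v in control]
treatment[0] = 500

k, selected = rank_normalization(control, treatment)
print(k, selected)
```

Gene 0 jumps from the lowest to the highest rank and is excluded, so the coefficient is computed from the nine rank-stable genes only.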

Normalization continued

• Intensity-dependent normalization (Yang, YH, 2002)
  – Draw an M-A plot to check the data distribution, where
    $M = \log_2 (T/C)$ and $A = \log_2 \sqrt{T \cdot C}$
  – Use the lowess function in R to perform the normalization:
    $\log_2 (T/C) \rightarrow \log_2 (T/C) - c(A) = \log_2 \big(T / (kC)\big)$
    where $c(A)$ is the lowess fit to the M-A plot, so that $k = 2^{c(A)}$.
  – Transform the data by $M' = M - c(A)$.
  – The method is locally nonparametric and is robust to a small number of differentially expressed genes.
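To make the M' = M − c(A) step concrete, here is a simplified stand-in for the lowess fit, written in plain Python. It performs one pass of local linear regression with tricube weights and omits the robustness iterations that R's `lowess` applies by default, so its output will not match R exactly. The function names and the synthetic data are assumptions.

```python
import math

def ma_values(treatment, control):
    """M = log2(T/C) and A = log2(sqrt(T*C)) for each gene."""
    m = [math.log2(t / c) for t, c in zip(treatment, control)]
    a = [0.5 * math.log2(t * c) for t, c in zip(treatment, control)]
    return m, a

def lowess_fit(a, m, f=0.3):
    """Local linear fit c(A) with tricube weights; f is the smoother span."""
    n = len(a)
    span = max(2, math.ceil(f * n))              # points in each neighbourhood
    fitted = []
    for a0 in a:
        distances = sorted(abs(x - a0) for x in a)
        h = distances[span - 1] or 1e-12         # local bandwidth
        w = [(1 - min(abs(x - a0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mean_a = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_m = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_a) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_a) * (y - mean_m) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

# Synthetic example: M drifts linearly with A (an intensity-dependent bias).
a = [2 + 0.4 * i for i in range(20)]
m = [0.2 * x - 1 for x in a]
c_of_a = lowess_fit(a, m, f=0.3)
m_prime = [mi - ci for mi, ci in zip(m, c_of_a)]
print(max(abs(v) for v in m_prime))
```

On this synthetic trend the local fit recovers the bias, so the normalized values M' are centered on zero across the whole range of A.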

M-A plot of DMSO vs BaP (before and after intensity-dependent normalization, f = 0.3)

[Figure: three pairs of M-A plots, one pair per replicate; left panels "M-A plot" (M vs A), right panels "M-A plot after Lowess norm" (M' vs A)]

Conclusion

• Methods can be global or local, parametric or nonparametric.
• There is no unique normalization method for a given data set. The choice depends on the kind of experiment and on what the data look like.
• There are no absolute criteria for normalization. Basically, the normalized log ratios should be centered around 0. Combine normalization with post hoc analysis to choose the best method.

Post Hoc Analysis

• Before analysis
  – Data adjustment: for paired normalization, truncate large ratios first, using quantile criteria (the 1% or 5% and the 95% or 99% quantiles).
  – Parametric tests assume that the data follow a certain distribution.
  – Non-parametric tests do not make such assumptions.
  – Check the validity of the assumptions made for a parametric test and make sure the right test is used.

AtlasImage™ 2-fold criterion

• AtlasImage™ software reports genes with a 2-fold change as up- or down-regulated genes.
• It fails to account for sample variation.
• Low-intensity spots tend to have higher ratios.
• It ignores the fact that a difference of less than 2-fold can also elicit meaningful biological effects.

One sample t-test

• Obtain the normalized log ratio for each pair (control vs treatment). Calculate the mean and SD for each gene.
• Hypothesis: under the null hypothesis, i.e. there is no expression difference, the mean of the log ratio for gene $i$ is 0:
  $T_i / C_i = 1 \Rightarrow \log(T_i / C_i) = 0$
  $H_0: \mu_i = 0 \quad \text{vs} \quad H_1: \mu_i \neq 0$
• The test statistic is
  $t_i = \dfrac{m_i}{sd_i / \sqrt{n}}$
  where $m_i$ and $sd_i$ are the mean and standard deviation of the log ratio for gene $i$ and $n$ is the number of pairs.
• Problems: small sample size; normality assumption; multiple test adjustment.
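A per-gene one sample t statistic can be computed as in the sketch below. The function name and the log ratios are hypothetical; the statistic divides the mean by the standard error sd/√n, and it is referred to a t-distribution with n − 1 degrees of freedom.

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t statistic and degrees of freedom for H0: mean log ratio = 0."""
    n = len(log_ratios)
    standard_error = stdev(log_ratios) / math.sqrt(n)
    return mean(log_ratios) / standard_error, n - 1

# Hypothetical normalized log ratios of one gene in three paired experiments.
t_value, df = one_sample_t([1.0, 2.0, 3.0])
print(t_value, df)
```

With three pairs there are only 2 degrees of freedom, and the two-sided 5% critical value of the t-distribution is about 4.30, so even a consistent two-fold change on average can fail to reach significance. This is the small sample size problem noted above.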

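Two sample t-test

• Obtain the normalized log intensities $Y_{ij}$.
• Let the sample means and variances of the $Y_{ij}$'s for gene $i$ under the two conditions be $\bar{Y}_{i(1)}, \bar{Y}_{i(2)}, S^2_{i(1)}, S^2_{i(2)}$. The test statistic is
  $Z_i = \dfrac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{\sqrt{S^2_{i(1)}/k + S^2_{i(2)}/k}}$
  with df
  $d_i = \dfrac{\left(S^2_{i(1)}/k + S^2_{i(2)}/k\right)^2}{\left(S^2_{i(1)}/k\right)^2/(k-1) + \left(S^2_{i(2)}/k\right)^2/(k-1)}$
  if unequal variance is assumed ($k$ is the number of replicates per condition), and
  $Z_i = \dfrac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{S_p \sqrt{1/n_1 + 1/n_2}}$
  with df $d_i = 2(n-1)$ (for $n_1 = n_2 = n$)
  if equal variance is assumed ($S_p$ is the pooled standard deviation).
• Under the normality assumption for $Y_{ij}$, $Z_i$ approximately has a t-distribution with $d_i$ degrees of freedom.
• Problems: small sample size; normality assumption; multiple test adjustment.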
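Both versions of the two sample statistic can be written compactly. The sketch below is illustrative only: the function names and data are hypothetical, and the code allows different group sizes, of which the equal-replicate case on the slide is a special case.

```python
import math
from statistics import mean, variance

def welch_t(y1, y2):
    """Unequal-variance statistic and its approximate degrees of freedom."""
    n1, n2 = len(y1), len(y2)
    v1, v2 = variance(y1) / n1, variance(y2) / n2
    z = (mean(y1) - mean(y2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return z, df

def pooled_t(y1, y2):
    """Equal-variance statistic with df = n1 + n2 - 2."""
    n1, n2 = len(y1), len(y2)
    sp2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    z = (mean(y1) - mean(y2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return z, n1 + n2 - 2

# Hypothetical log intensities of one gene, three replicates per condition.
treated = [5.0, 6.0, 7.0]
control = [1.0, 2.0, 3.0]
print(welch_t(treated, control))
print(pooled_t(treated, control))
```

When the two sample variances and group sizes are equal, as in this toy example, the two statistics and their degrees of freedom coincide.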

Multiple test adjustment

• Hundreds of genes are tested at the same time. Assume 1000 genes are not differentially expressed. Testing each at a significance level of 0.01 (the false positive rate) means that around 10 genes will nevertheless be called significant.
• Bonferroni correction: we want to make sure that P[at least 1 gene significant from 1000] ≤ 0.05. Consequently, the p-value required for a single gene to be declared significant is P[single gene] ≤ 0.05/1000 = 0.00005.
• The correction is conservative and has lower power.
• Alternative: keep the family-wise error rate (FWER) manageable and try some p-value, say 0.001, as the significance level.
• Westfall and Young's step-down adjusted p-value.
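The arithmetic behind the Bonferroni correction is short enough to check directly. The helper names and the p-values below are made up for illustration.

```python
def expected_false_positives(n_null_genes, alpha):
    """Expected number of null genes called significant at level alpha."""
    return n_null_genes * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-gene p-value cutoff that keeps the family-wise error at family_alpha."""
    return family_alpha / n_tests

def bonferroni_significant(p_values, family_alpha=0.05):
    threshold = bonferroni_threshold(len(p_values), family_alpha)
    return [p <= threshold for p in p_values]

print(expected_false_positives(1000, 0.01))   # about 10 genes
print(bonferroni_threshold(1000))             # 0.05 / 1000

# Hypothetical p-values for 1000 genes.
p_values = [0.00001, 0.001] + [0.5] * 998
flags = bonferroni_significant(p_values)
print(sum(flags))
```

Only the gene with p = 0.00001 passes the corrected cutoff. The gene with p = 0.001, which would pass an unadjusted test easily, is rejected, which illustrates the loss of power.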

Predictive Interval (PI) method

• Use one of the normalization methods discussed above to normalize the data.
• Obtain the average log ratio (ALR), which is centered around zero.
• Use the normal approximation method:
  – Step I: if the maximum or minimum value of ALR is greater than mean + 3·sd or less than mean − 3·sd, treat it as an outlier, delete it from ALR and take it as a differentially expressed gene.
  – Step II: calculate the mean and sd of the remaining genes and repeat Step I.
  – Repeat the above steps until no more outliers exist. Then calculate the 95% predictive interval for the remaining genes. Values outside the PI are significant.
  – The final set of differentially expressed genes includes the outliers detected in Steps I and II and the genes outside the PI.
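The iterative procedure above can be sketched as follows. Two details are assumptions rather than something the slides specify: only the single most extreme value is removed per iteration, and the 95% predictive interval is taken as mean ± 1.96·sd of the remaining genes. Gene names and values are made up.

```python
from statistics import mean, stdev

def predictive_interval_genes(alr, z=1.96):
    """alr maps gene name -> average log ratio.
    Returns (differentially expressed genes, (lower, upper) of the PI)."""
    remaining = dict(alr)
    outliers = []
    while len(remaining) > 2:
        values = list(remaining.values())
        m, s = mean(values), stdev(values)
        # Step I: look at the most extreme remaining value.
        gene = max(remaining, key=lambda name: abs(remaining[name] - m))
        if abs(remaining[gene] - m) <= 3 * s:
            break                          # no more outliers
        outliers.append(gene)              # outlier = differentially expressed
        del remaining[gene]                # Step II: recompute without it
    values = list(remaining.values())
    m, s = mean(values), stdev(values)
    lower, upper = m - z * s, m + z * s
    outside = [g for g, v in remaining.items() if v < lower or v > upper]
    return outliers + outside, (lower, upper)

# Hypothetical ALR values: 20 unchanged genes, one large and one modest change.
alr = {f"g{i}": (0.1 if i % 2 == 0 else -0.1) for i in range(20)}
alr["big"] = 5.0
alr["mid"] = 0.25

genes, interval = predictive_interval_genes(alr)
print(genes, interval)
```

The gene "big" is removed as a 3·sd outlier in the first pass; after the mean and sd are recomputed, "mid" falls just outside the predictive interval and is reported as well.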

Yidong’s algorithm

• Assumptions:
  – There is a constant coefficient of variation $c$ for the entire gene set: $\sigma_{T_k} = c\,\mu_{T_k}$, $\sigma_{C_k} = c\,\mu_{C_k}$.
  – The observed differential expression, $R_k = T_k / C_k$ (ratio of treatment to control intensity at gene $k$), has a sampling distribution dependent only on $c$. $R_k$ is approximately normally distributed.
  – Assume $\mu_{T_k} = m\,\mu_{C_k}$. The density function of $R$ becomes
    $f_R(r; c, m) = \dfrac{1}{m}\, f_R(r/m; c, 1)$
• Use the maximum likelihood method to estimate the constant $c$:
    $\hat{c} = \left[ \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{(R_i - 1)^2}{R_i^2 + 1} \right]^{1/2}$
  and use the EM algorithm to get the final estimates of $c$ and $m$ [iterative update equation for $\hat{m}$ illegible in the source].
• Use the polynomial $y = a_3 c^3 + a_2 c^2 + a_1 c + a_0$ to get the CI.
• Measurement error depends on signal strength.
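The estimate of the coefficient of variation is a one-line computation once the ratios are available. The sketch below covers only the case m = 1 (ratios already normalized); the iterative estimation of m and the polynomial for the confidence limits are not reproduced. The function name and the ratios are hypothetical.

```python
import math

def estimate_cv(ratios):
    """c-hat = sqrt( (1/n) * sum (R_i - 1)^2 / (R_i^2 + 1) ), assuming m = 1."""
    n = len(ratios)
    total = sum((r - 1) ** 2 / (r ** 2 + 1) for r in ratios)
    return math.sqrt(total / n)

print(estimate_cv([1.0, 1.0, 1.0]))   # no variation at all
print(estimate_cv([2.0, 0.5]))        # a 2-fold increase and a 2-fold decrease
```

Each term is unchanged when R is replaced by 1/R, so a two-fold increase and a two-fold decrease contribute equally to the estimate.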

Significant genes list of BaP/DMSO

The first table lists the significant genes found using the PI method; the second lists the genes found using Yidong’s algorithm.

PI method
Gene   dmso      bap       ratio
5H     27.693    44.965    1.62371
7L     42.753    84.959    1.9872
8B     58.043    94.026    1.61993
8F     32.951    57.004    1.72997
8I     50.003    102.417   2.04822
9C     124.219   216.932   1.74637
11O    12.758    19.051    1.49324
18E    53.169    131.328   2.46998
20C    106.946   549.492   5.13801
22H    127.946   66.701    0.52132
23J    31.097    48.815    1.56978
99% CI (0.581621, 1.681510)
95% CI (0.660326, 1.481089)

Yidong’s algorithm
Gene   dmso      bap       ratio
7L     42.753    84.959    1.9872
8F     32.951    57.004    1.72997
8I     50.003    102.417   2.04822
9C     124.219   216.932   1.74637
18E    53.169    131.328   2.46998
20C    106.946   549.492   5.13801
22H    127.946   66.701    0.52132
99% CI (0.48995, 1.68639)
95% CI (0.55649, 1.68269)

Permutation test

• For gene i in each paired experiment, permute the data within the pair to get the permuted sample. Under the assumption that genes do not change their expression pattern between the two conditions of the study, we can permute the data laid out as follows:

Gene   T1    C1    T2    C2    T3    C3
1      X11   Y11   X21   Y21   X31   Y31
.      .     .     .     .     .     .
.      .     .     .     .     .     .
.      .     .     .     .     .     .
g      X1g   Y1g   X2g   Y2g   X3g   Y3g

Permutation test continued

• Get the normalized average log ratio for the original data (ALR) and the permuted data (ALR*).
• Calculate the p-value for gene i:
  $p\text{-value}_i = \dfrac{\#\{j : |ALR^*_j| \ge |ALR_i|\}}{g}$
  where g is the total number of genes.
• Permute the data n times and obtain n p-values for each gene. Then get the mean and sd of the p-values for each gene and calculate the 95% CI.
• If the lower bound is less than 0.05, claim the gene as significant.
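A compact sketch of this permutation scheme follows. Several choices are assumptions: each (T, C) pair is swapped with probability 1/2, log base 2 is used, and the 95% CI is taken as mean ± 1.96·sd/√n of the n p-values. The function names, the seed and the data are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def average_log_ratios(pairs):
    """pairs[j] is a list of (T, C) tuples for gene j; returns ALR per gene."""
    return [mean([math.log2(t / c) for t, c in gene]) for gene in pairs]

def permutation_p_values(pairs, n_perm=200, seed=0):
    """Returns (lower bound, mean p-value, upper bound) for each gene."""
    rng = random.Random(seed)
    g = len(pairs)
    alr = average_log_ratios(pairs)
    runs = [[] for _ in range(g)]
    for _ in range(n_perm):
        # Swap treatment and control within each pair at random.
        permuted = [[(c, t) if rng.random() < 0.5 else (t, c) for t, c in gene]
                    for gene in pairs]
        alr_star = average_log_ratios(permuted)
        for i in range(g):
            count = sum(1 for v in alr_star if abs(v) >= abs(alr[i]))
            runs[i].append(count / g)
    results = []
    for p_list in runs:
        m, s = mean(p_list), stdev(p_list)
        half = 1.96 * s / math.sqrt(len(p_list))
        results.append((m - half, m, m + half))
    return results

# Hypothetical data: gene 0 is up 8-fold in all three pairs,
# genes 1..19 change by only 1% to 19%.
pairs = [[(800.0, 100.0)] * 3]
pairs += [[(100.0 * (1 + 0.01 * j), 100.0)] * 3 for j in range(1, 20)]

results = permutation_p_values(pairs)
print(results[0], results[1])
```

The strongly changed gene gets a mean p-value of at most 0.05, because its original ALR can only be matched by its own permuted value, while a nearly unchanged gene gets a mean p-value close to 1.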

List of significant genes picked up by permutation test

Gene   LB.95       P.mean     UB.95      dmso1   dmso2   dmso3   bap1   bap2   bap3
6I     0.005937    0.00812    0.010302   18      20      15      28     39     87
14K    0.034691    0.040599   0.046506   50      124     7       48     126    127
18E    0.015475    0.01859    0.021705   43      72      38      85     280    75
20C    -4.51E-05   0.000641   0.001327   66      241     36      218    1413   200
22H    3.84E-02    0.044445   0.050442   55      107     224     43     86     101

Significance Analysis of Microarrays (SAM)

• Limitations of parametric tests:
  – Estimation of variance: limited sample size (= few replicates)
  – Normal distribution assumption: the error model is still not clear
  – Multiple testing
• SAM is an Excel add-in that performs a robust method for differential analysis of microarray data. (Method developed and implemented by the Tibshirani group at Stanford; free for academic use.)
• Permutation technique: assuming no difference between conditions, all genes are from the same population.
• False discovery rate: number of falsely called genes divided by the number of genes called differential in the original data.
• Needs a large number of replicates.

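SAM test statistic

$d_i = \dfrac{r_i}{s_i + s_0}$

• $d_i$ = score
• $s_i$ = standard deviation
• $s_0$ = fudge factor

$r_i = \bar{x}_{i1} - \bar{x}_{i2}$

$s_i = \sqrt{ \dfrac{1/n_1 + 1/n_2}{n_1 + n_2 - 2} \left[ \sum_{j \in C_1} (x_{ij} - \bar{x}_{i1})^2 + \sum_{j \in C_2} (x_{ij} - \bar{x}_{i2})^2 \right] }$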
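The score for a single gene can be computed directly from the definitions above. This sketch is not the SAM add-in: the function name and data are hypothetical, and the fudge factor s0 is passed in as a fixed number, whereas SAM chooses it from the data.

```python
import math
from statistics import mean

def sam_score(group1, group2, s0):
    """Returns (d, r, s) with r = mean1 - mean2 and d = r / (s + s0)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    r = m1 - m2
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = math.sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2) * ss)
    return r / (s + s0), r, s

# Hypothetical intensities of one gene in two conditions; s0 is arbitrary.
d, r, s = sam_score([5.0, 6.0, 7.0], [1.0, 2.0, 3.0], s0=1.0)
print(d, r, s)
```

Adding s0 to the denominator keeps genes with very small s, typically low-intensity genes, from receiving very large scores.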

The SAM process

• Perform permutations and compute the test statistics for each permutation.
• Rank the test statistics in ascending order.
• Compute the mean test statistic for each “rank” over all permutations.
• Plot the original “ranked” test statistics versus the mean test statistics from the permutations.
• Define the distance from the mean permuted value that you call significant.
• Compute the false discovery rate for this value.
• Iterate until you get an appropriate FDR.

SAM analysis

Significant genes list

Row   Gene Name   Score(d)     Numerator(r)   Denominator(s+s0)   Fold Change   q-value (%)
201   20C         1.76581104   556            314.8694775         5.86297       8.870599
185   18E         1.64637644   95.66666667    58.10740719         2.87582       8.870599
51    6I          1.58325948   33.66666667    21.26414969         2.90566       8.870599
36    5H          1.22363840   30.66666667    25.06187011         2.31429       8.870599
124   11L         1.06476227   24.33333333    22.8533015          2.04286       8.870599
76    8F          0.96626427   41             42.43145579         2.30851       8.870599
68    7L          0.93100607   65.66666667    70.5330163          2.60163       8.870599
72    8B          0.92740987   61.66666667    66.49343301         2.10119       8.870599
79    8I          0.88594788   78.33333333    88.41754107         2.65493       8.870599

Other Methods and Software

• ANOVA
• Likelihood ratio test
• Bayesian analysis
• GeneSpring, GenePix etc.
• http://www.cs.tcd.ie/Nadia.Bolshakova/softwarelist.html

Conclusion

• Cutoff point determination: set up a critical point to eliminate genes whose intensity is less than this point.
• Statistically significant? There is no unique method to analyze the data. Some methods are better for one data set but may not be good for others. In practice, we have to try different approaches to see which methods work well.
• Biologically significant? For the genes picked up by statistics, we have to be careful in drawing conclusions. Some genes shown to be significant may not be functionally meaningful. Conversely, genes that do not show up as significant may still be biologically significant, especially genes at the borderline in the statistical test.

Acknowledgements

Dept. of Pharmacology & Therapeutics
Dr. Shiverick
Terry Medrano
Renita Handayani

Dept. of Statistics
Dr. Booth

Presentation download: http://www.stat.ufl.edu/~ycui
