Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002

Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments

Presented by Nan Lin

13 October 2002

Introduction to cDNA Microarray Experiment

Single-slide Design
– Two mRNA samples (red/green) on the same slide

Multiple-slide Design
– Two or more types of mRNA on different slides
– Exclude: time-course experiment

Examples of Multiple-slide Design

Apo AI
– Treatment group: 8 mice with the apo AI gene knocked out
– Control group: 8 C57Bl/6 mice
– Cy5: cDNA from each of the 16 mice
– Cy3: pooled cDNA from the 8 control mice

SR-BI
– Treatment group: 8 SR-BI transgenic mice
– Control group: 8 “normal” FVB mice

Microarray Setup
– 6384 spots, 4×4 grids with 19×21 spots in each

Single-slide Methods

Two types
– Based solely on intensity ratio R/G
– Take into account overall transcript abundance measured by R*G

Historical Review
– Fold increase/decrease cut-offs (1995-1996)
– Probabilistic modeling based on distributional assumptions (1997-2000)
– Consider R*G (2000-2001), e.g. Gamma-Gamma-Bernoulli

Summary of Single-slide Methods

Producing a model-dependent rule: drawing two curves in the (R,G) plane
– Power (1 - Type II error rate)
– False positive rate (Type I error rate)

Multiple testing

Replication is needed because gene expression data are too noisy

Image Analysis

“Raw” data: 16-bit TIFF files

Addressing
– Within a batch, important characteristics are similar

Segmentation
– Seeded region growing algorithm

Background adjustment
– Morphological opening (a nonlinear filter)

Software package: Spot in R environment

Single-slide Data Display

Plot log2 R vs. log2 G
– Variation less dependent on absolute magnitude
– Normalization is additive for logged intensities
– Evens out highly skewed distributions
– A more realistic sense of variation

Plot M = log2(R/G) vs. A = [log2(RG)]/2
– More revealing in terms of identifying spot artifacts and for normalization purposes
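The M and A coordinates defined on this slide can be computed per spot as in the minimal sketch below. The function name `ma_values` and the intensity values are hypothetical and chosen only for illustration; the sketch assumes positive, background-corrected intensities.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)        # M: log-ratio of the two channels
    a = 0.5 * math.log2(red * green)  # A: average log-intensity of the spot
    return m, a

# Hypothetical intensities for three spots (not taken from the talk).
spots = [(1200.0, 300.0), (500.0, 500.0), (64.0, 256.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"R={red:7.1f}  G={green:7.1f}  M={m:+.2f}  A={a:.2f}")
```

Plotting M against A amounts to a 45-degree rotation of the log2 R vs. log2 G plot, so an intensity-dependent dye bias shows up as a curve away from the horizontal line M = 0.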

Normalization

Identify and remove sources of systematic variation other than differential expression
– Different labeling efficiencies and scanning properties for Cy3 and Cy5
– Different scanning parameters
– Print-tip, spatial or plate effects

Red intensity is often lower than green intensity

The imbalance between R and G varies
– Across spots and between arrays
– With overall spot intensity A
– With location on the array, plate origin, etc.

An Example: Self-Self Experiment

Normalization (Cont.)

Global normalization
– Subtract mean or median from all intensity log-ratios

More complex normalization
– Robust locally weighted regression
– M = spot intensity A + location + plate origin
– Use print-tip group to represent the spot locations
– log2(R/G) → log2(R/G) − l(A, j)
– l(A, j): lowess fit of M on A within print-tip group j, computed with lowess in R (span 0.2 < f < 0.4)

Control sequences
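Global normalization can be sketched in a few lines. The second function below is a deliberately cruder stand-in for the print-tip correction l(A, j): it subtracts one median per print-tip group and ignores the dependence on A that a lowess fit would capture. Function names and data are hypothetical.

```python
from statistics import median

def global_median_normalize(log_ratios):
    """Subtract the array-wide median from every log-ratio M."""
    center = median(log_ratios)
    return [m - center for m in log_ratios]

def group_median_normalize(log_ratios, print_tips):
    """Subtract each print-tip group's own median from its log-ratios.

    Simplified stand-in for l(A, j): one constant per group j, no lowess.
    """
    by_tip = {}
    for m, tip in zip(log_ratios, print_tips):
        by_tip.setdefault(tip, []).append(m)
    centers = {tip: median(values) for tip, values in by_tip.items()}
    return [m - centers[tip] for m, tip in zip(log_ratios, print_tips)]

# Hypothetical log-ratios with a green bias (mostly negative M).
m_values = [-0.5, -0.25, -0.75, 0.0, -1.0, 1.5]
tips = [1, 1, 1, 2, 2, 2]
print(global_median_normalize(m_values))
print(group_median_normalize(m_values, tips))
```

Using the median rather than the mean keeps a handful of genuinely differentially expressed genes from shifting the correction.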

Apo AI: Normalization

Graphical Display for Test Statistics (I)

Test statistics
– Hj: no association between treatment and the expression level of gene j, j=1,…,m.
– Two-sided alternative
– Two-sample Welch t-statistics
– Replication is essential to assess the variability in the treatment and control groups
– The joint distribution of the test statistics is estimated by a permutation procedure because the actual distribution is not a t-distribution
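For a single gene, the Welch statistic and a permutation p-value can be sketched as follows. This is a minimal per-gene illustration under stated assumptions, not the talk's joint procedure across all m genes: it enumerates every relabeling of the pooled samples, and the function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample Welch t-statistic (variances are not pooled)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_p_value(treatment, control):
    """Two-sided p-value over all relabelings of the pooled samples."""
    pooled = treatment + control
    n, k = len(pooled), len(treatment)
    observed = abs(welch_t(treatment, control))
    hits = total = 0
    for idx in combinations(range(n), k):
        chosen = set(idx)
        x = [pooled[i] for i in idx]
        y = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance so ties with the observed statistic are counted.
        if abs(welch_t(x, y)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical normalized log-ratios for one gene, 4 mice per group.
treatment = [-1.9, -2.3, -2.1, -1.7]
control = [0.1, -0.2, 0.3, 0.0]
print(welch_t(treatment, control))
print(permutation_p_value(treatment, control))
```

With two groups of 8 mice there are 12870 distinct relabelings, so full enumeration is feasible. In this toy example the groups are perfectly separated, so only the observed labeling and its mirror image reach the observed |t|, giving p = 2/70.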

Graphical Display for Test Statistics (II)

Quantile-Quantile plots

Graphical Display for Test Statistics (III)

Plots vs. absolute expression levels

Multiple Hypothesis Testing: Adjusted p-values (I)

P-value: Pj = Pr(|Tj| ≥ |tj| | Hj), j=1,…,m.

Family-wise Type I Error Rate (FWER)
– The probability of at least one Type I error in the family

Strong Control of the FWER
– Control the FWER for any combination of true and false hypotheses

Weak Control of the FWER
– Control the FWER only under the complete null hypothesis that all hypotheses in the family are true
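In symbols, the three definitions can be written as below. The notation is an assumption introduced here, not taken from the slide: V is the number of Type I errors, M_0 is the index set of true hypotheses, and α is the nominal level.

```latex
\begin{align*}
\text{FWER} &= \Pr(V \ge 1), \\
\text{weak control:} \quad
  & \Pr\Bigl(V \ge 1 \Bigm| \textstyle\bigcap_{j=1}^{m} H_j\Bigr) \le \alpha, \\
\text{strong control:} \quad
  & \Pr\Bigl(V \ge 1 \Bigm| \textstyle\bigcap_{j \in M_0} H_j\Bigr) \le \alpha
    \quad \text{for every } M_0 \subseteq \{1, \dots, m\}.
\end{align*}
```

Strong control implies weak control, since the complete null is the special case M_0 = {1, …, m}.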

Multiple Hypothesis Testing: Adjusted p-values (II)

Adjusted p-value for Hj
– P̃j = inf{α: Hj is rejected at FWER = α}
– Hj is rejected at FWER α if P̃j ≤ α

P-value adjustment approaches
– Bonferroni
– Sidak single-step
– Holm step-down
– Westfall and Young step-down minP
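The first three adjustments need only the raw p-values and can be sketched as follows. The Westfall and Young step-down minP procedure is not shown, because it additionally requires the permutation distribution of the p-values. Function names and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Single-step Bonferroni: multiply each p-value by m, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

def sidak_single_step(p_values):
    """Single-step Sidak: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

def holm_step_down(p_values):
    """Holm step-down: running maximum of (m - rank) * p over sorted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[j])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[j] = running_max
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.001, 0.04, 0.02, 0.5]
print(bonferroni(raw))
print(sidak_single_step(raw))
print(holm_step_down(raw))
```

Holm's adjusted p-values are never larger than Bonferroni's, so the step-down procedure rejects at least as many hypotheses at the same FWER.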

Multiple Hypothesis Testing: Estimation of adjusted p-values (I)

Multiple Hypothesis Testing: Estimation of adjusted p-values (II)

Apo AI: Adjusted p-values (I)

Apo AI: Adjusted p-values (II)

Apo AI: Comparison with Single-slide Methods

Discussion

M-A plots

Normalization
– Robust local regression, e.g. lowess

Q-Q plots & Plots vs. absolute expression level

False discovery rate (FDR)

Replication is necessary

Design issues

Factorial experiments

Joint behavior of genes

R package SMA