Biostatistics Case Studies 2010. Session 5: Microarray Statistics. Peter D. Christenson, Biostatistician. http://gcrc.LABioMed.org/Biostat


Page 1: Biostatistics Case Studies 2010

Biostatistics Case Studies 2010

Peter D. Christenson, Biostatistician

http://gcrc.LABioMed.org/Biostat

Session 5: Microarray Statistics

Page 2: Biostatistics Case Studies 2010

Case Study

Page 3: Biostatistics Case Studies 2010

A compound found in red grapes improves the health and lifespan of mice on a high calorie diet.

Page 4: Biostatistics Case Studies 2010

Treatment Groups

Male middle-aged mice (11 months) were randomized to:

1. Standard Diet (SD): known as AIN-93G

N=60; N=5 for gene expression

2. High Calorie (HC): SD + coconut oil → 60% fat.

N=55; N=5 for gene expression

3. Resveratrol (HCR): HC + 22.4 mg resveratrol/kg/day.

N=55; N=4 for gene expression (+1 w/ degraded sample)

Page 5: Biostatistics Case Studies 2010

Outcome: Mortality

• Methods: Survival Analysis.

• The 114-week mortality ratio for HCR/HC = 0.42/0.58 = 0.72, which is a 28% reduction in mortality at 114 weeks.

Paper reports a “hazard ratio”, which is similar at 0.69, with a p-value of 0.02. How does it differ from the mortality ratio?

[Figure: survival curves. By 114 weeks, 0.42 of HCR mice and 0.58 of HC mice had died.]
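The mortality ratio on this slide can be checked in a few lines of Python. The last calculation is only an illustration of how a hazard ratio differs. It assumes proportional hazards, so that survival in HCR equals survival in HC raised to the power HR, and it uses only the two 114-week proportions. The paper's hazard ratio of 0.69 is estimated from the whole survival curves, not this way.

```python
import math

died_hcr = 0.42  # proportion of HCR mice dead by 114 weeks (from the slide)
died_hc = 0.58   # proportion of HC mice dead by 114 weeks (from the slide)

# Mortality (risk) ratio at a single time point.
mortality_ratio = died_hcr / died_hc
reduction = 1 - mortality_ratio

# Assumption for illustration: proportional hazards,
# S_HCR(t) = S_HC(t) ** HR, so HR = ln(S_HCR) / ln(S_HC).
implied_hr = math.log(1 - died_hcr) / math.log(1 - died_hc)

print(f"mortality ratio = {mortality_ratio:.2f}, reduction = {reduction:.0%}")
print(f"hazard ratio implied by the 114-week proportions alone = {implied_hr:.2f}")
```

The mortality ratio compares the proportions dead at one chosen time, while a hazard ratio compares the instantaneous death rates over the whole follow-up period.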

Page 6: Biostatistics Case Studies 2010

Outcome: Agility

What statistical analysis was done here?

Page 7: Biostatistics Case Studies 2010

Outcome: Clinical Markers

What statistical analyses?

Page 8: Biostatistics Case Studies 2010

Other Outcomes

Page 9: Biostatistics Case Studies 2010

Gene Expression in Liver at age 18 months

Fourteen Microarray “Experiments”: each of 5+5+4 mice had a separate array run for ~40,000 genes.

Page 10: Biostatistics Case Studies 2010

Gene Expression Data: 536,872 Numbers

38,348 rows: each a gene

First 2 SD mice shown. The other 12 are at http://www.grc.nia.nih.gov/branches/rrb/dna/index/dnapubs.htm#2

Page 11: Biostatistics Case Studies 2010

Gene Expression Results

How were results for (a) and (b) calculated?

HCR over-expressed, compared to HC.

HCR under-expressed, compared to HC.

Page 12: Biostatistics Case Studies 2010

Microarray Analysis

• How can we analyze these data?

• What are the “experimental units”: mice or genes?

• Consider each gene independently?

• If so, Ns of 4 and 5 seem too small to say much: low power.

• So, maybe combine genes for larger Ns?

• Pair up HCR and HC mice, find the ratio, and average?

• Ratio of the mean for N=4 HCR to the mean for N=5 HC?

• If p<0.05 is used for each gene, expect many false positives among 38,348 genes.

• The SD among only 5 mice could be large just due to differences from array to array, not biological differences, so important genes could be missed.

Page 13: Biostatistics Case Studies 2010

Detectable Effects with N=5 per Group

Suppose we compare the mean of 5 appropriately scaled numbers for a gene’s expression with the mean of 5 in another group, using a t-test. Here SD = sigma.

So, we need a difference of about 2 SD in gene expression to be fairly sure (80% power) of detecting this gene with only N=5+5.

This is a large effect; see next slide.
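The 80% figure can be checked by simulation. The sketch below is not the calculation behind the slide. It assumes normally distributed expression with equal SD in both groups, a pooled-variance two-sample t-test, and a two-sided 5% level (critical value 2.306 for 8 degrees of freedom). The function names are hypothetical.

```python
import math
import random

T_CRIT_DF8 = 2.306  # two-sided 5% critical value of t with 8 df (N=5+5)


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se


def simulated_power(shift_in_sd, n=5, n_sim=10000, seed=1):
    """Fraction of simulated experiments with |t| above the critical value.

    The critical value used here is only valid for n=5 per group.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        group1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group2 = [rng.gauss(shift_in_sd, 1.0) for _ in range(n)]
        if abs(pooled_t(group2, group1)) > T_CRIT_DF8:
            hits += 1
    return hits / n_sim


power_2sd = simulated_power(2.0)  # true difference of 2 SD
power_1sd = simulated_power(1.0)  # true difference of 1 SD
print(f"power at 2 SD: {power_2sd:.2f}; power at 1 SD: {power_1sd:.2f}")
```

Under these assumptions the simulated power for a 2 SD difference should land near the 80% quoted above, and a 1 SD difference is detected far less often.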

Page 14: Biostatistics Case Studies 2010

Detectable Effects with N=5 per Group

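[Figure: relative frequency vs. gene expression for HC and HCR, with the HCR distribution shifted by 2 SD; the effect is marked against the normal range.]

A 2 SD effect corresponds to moving from the 50th to the 97th percentile, about 2/5 of the normal range.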
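The percentile quoted for a 2 SD shift can be checked with the standard normal distribution. This assumes expression within a group is normally distributed, as the figure suggests.

```python
from statistics import NormalDist

# Proportion of the unshifted distribution lying below a point 2 SD above its mean.
percentile_at_2sd = NormalDist(mu=0.0, sigma=1.0).cdf(2.0)
print(f"A 2 SD shift moves the median to the {percentile_at_2sd:.1%} point")
```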

Page 15: Biostatistics Case Studies 2010

Detectable Effects with N=5 per Group

So, how can we try to avoid missing genes that are important, but are not detected with p<0.05?

Recall that p<0.05 corresponds to approximately:

|t| = |effect/SE(effect)| = |Δ/SE(Δ)| = |signal/noise| > 2

where noise is roughly proportional to SD/sqrt(N).

Thus, if increasing N to reduce noise is not possible, we can:

1. Try to reduce SD, or

2. Ignore SD and base the decision for gene selection on the signal, i.e., effect, i.e., mean differential expression, only.
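The signal/noise rule can be made concrete with a small sketch. It assumes two groups with the same SD and the same N per group, so that SE(Δ) = SD*sqrt(2/N), and it uses the approximate threshold of 2 from above. The function names are hypothetical.

```python
import math


def noise(sd, n_per_group):
    """SE of the difference between two group means, equal SD and N assumed."""
    return sd * math.sqrt(2.0 / n_per_group)


def smallest_flagged_difference(sd, n_per_group, threshold=2.0):
    """Smallest observed mean difference that gives |signal/noise| > threshold."""
    return threshold * noise(sd, n_per_group)


for sd in (1.0, 0.5):
    delta = smallest_flagged_difference(sd, n_per_group=5)
    print(f"SD = {sd}: observed difference must exceed {delta:.2f}")
```

This is the size of the observed difference needed to cross the threshold, which is smaller than the true difference needed for 80% power. Halving the SD halves it, which is the motivation for option 1.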

Page 16: Biostatistics Case Studies 2010

Microarray Analysis: 1. Try to reduce SD

Here, SD is the SD among the expressions for 5 mice in a group.

How can we “reduce SD”? Isn’t it natural subject-to-subject heterogeneity, a characteristic of the population?

This SD is among measured expressions, so it includes both array-to-array error and subject-to-subject heterogeneity. (The two are confounded: there is no internal control.)

We try to statistically remove some of the inherent array-to-array error through normalization.

Page 17: Biostatistics Case Studies 2010

Side Point on Microarray Design

1. Single Channel Chip: 1 sample, many probes.

• No replicated measures. This study.

2. Two Channel Chip: 2 samples, possibly fewer probes that are common for both samples.

• One sample may be an internal control.

• The two samples may be matched, e.g.,

• Two conditions, times, etc, for the same subject.

• Twins, littermates, etc, treated differently.

Page 18: Biostatistics Case Studies 2010

Normalization

There are many ways to normalize. They exploit the assumption that most of 1000s of genes will be the same in many subjects. Two common methods:

• Global: All genes in an array are multiplied by the ratio of the (global) mean over all genes for all arrays to the mean over all genes for this array. E.g., if array 1 has mean 1000 and all fourteen arrays together have mean 900, multiply the values on array 1 by 0.90.

• Z-score: Replace expression x by z=(x-mean)/SD, where mean and SD are over genes for this array. Expression becomes the number of SDs from the mean over genes on that array.
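A minimal sketch of the two normalization methods, using two tiny made-up arrays of three genes each. The data, the function names, and the use of the sample SD are assumptions for illustration only.

```python
from statistics import fmean, stdev


def global_normalize(arrays):
    """Scale each array so that its mean equals the mean over all arrays."""
    global_mean = fmean(x for array in arrays for x in array)
    return [[x * global_mean / fmean(array) for x in array] for array in arrays]


def zscore_normalize(array):
    """Replace each value by its distance from the array mean, in SD units."""
    mean, sd = fmean(array), stdev(array)  # sample SD is an assumption
    return [(x - mean) / sd for x in array]


# Hypothetical expression values: rows are arrays (mice), columns are genes.
arrays = [[800.0, 1000.0, 1200.0],   # array mean 1000
          [700.0, 800.0, 900.0]]     # array mean 800

scaled = global_normalize(arrays)    # global mean is 900, so array 1 is multiplied by 0.90
z_first = zscore_normalize(arrays[0])
print(scaled)
print(z_first)
```

After global normalization every array has the same mean. After z-score normalization every array has mean 0 and SD 1, so values are comparable across arrays in SD units.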

Page 19: Biostatistics Case Studies 2010

Microarray Analysis: 2. Ignore SD

Here, SD is the SD among the expressions for 5 mice in a group.

Use an effect measure for each gene, such as the ratio of mean of 4 HCR to the mean of 5 HC, usually standardized to a “normal range” as with z-scores.

Usually select genes with either:

1. Ratio > c or < 1/c, for some c such as 1.5 or 2.

2. A specified number or percent of genes with the largest or smallest ratios.
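Both selection rules can be sketched as follows. The gene names and ratios are invented, and ranking by the absolute log ratio is one assumed way to treat up- and down-regulation symmetrically in the second rule.

```python
import math


def select_by_cutoff(ratios, c=2.0):
    """Rule 1: keep genes whose ratio is above c or below 1/c."""
    return {g for g, r in ratios.items() if r > c or r < 1.0 / c}


def select_most_extreme(ratios, k):
    """Rule 2: keep the k genes with the largest |log ratio| (assumed ranking)."""
    ranked = sorted(ratios, key=lambda g: abs(math.log(ratios[g])), reverse=True)
    return set(ranked[:k])


# Hypothetical HCR/HC mean-expression ratios.
ratios = {"geneA": 2.5, "geneB": 1.1, "geneC": 0.4, "geneD": 0.9, "geneE": 1.6}

by_cutoff_2 = select_by_cutoff(ratios, c=2.0)
by_cutoff_15 = select_by_cutoff(ratios, c=1.5)
top_two = select_most_extreme(ratios, k=2)
print(by_cutoff_2, by_cutoff_15, top_two)
```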

Page 20: Biostatistics Case Studies 2010

Microarray Analysis: This study

Genes selected with both:

1. “Z-Ratio” > 1.5 or < -1.5.

2. The p-value from a z-test comparing the mean z-score of 4 HCR mice to the mean of 5 HC mice is < 0.05.

Raw expression is normalized within each array by z-scores on log(expression).

The Z-Ratio is the difference between the mean z-score of the 4 HCR mice and the mean z-score of the 5 HC mice (which is the numerator for the z-test), divided by the SD of these differences over all genes.
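The Z-Ratio definition can be sketched on made-up data. The four genes and their z-scores are hypothetical, the sample SD is assumed, and the z-test criterion is left out because its details are not given here.

```python
from statistics import fmean, stdev


def z_ratios(z_hcr, z_hc):
    """Difference in mean z-score per gene, divided by the SD of the differences over genes."""
    diffs = {gene: fmean(z_hcr[gene]) - fmean(z_hc[gene]) for gene in z_hcr}
    sd_of_diffs = stdev(list(diffs.values()))  # sample SD is an assumption
    return {gene: d / sd_of_diffs for gene, d in diffs.items()}


# Hypothetical z-scores: 4 HCR mice and 5 HC mice per gene.
z_hcr = {"g1": [2.1, 1.9, 2.0, 2.0], "g2": [0.1, -0.1, 0.0, 0.0],
         "g3": [-0.2, 0.0, -0.1, -0.1], "g4": [0.1, 0.1, 0.2, 0.0]}
z_hc = {gene: [0.0, 0.1, -0.1, 0.0, 0.0] for gene in z_hcr}

ratios = z_ratios(z_hcr, z_hc)
selected = {gene for gene, zr in ratios.items() if zr > 1.5 or zr < -1.5}
print(ratios)
print(selected)
```

With real data the SD is taken over all 38,348 genes, so it is far more stable than in this four-gene toy example.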

Page 21: Biostatistics Case Studies 2010

Microarray Analysis: Gene Hsd3b5

Use raw data to generate results for the most up-regulated gene.

Page 22: Biostatistics Case Studies 2010

Microarray Analysis: Gene Hsd3b5

Raw       Log        Mean     SD      Zscore  Group
13145.2   9.483811   6.54116  1.4847   1.967  SD
14405.2   9.575347   6.51822  1.5039   2.02   SD
22271.5   10.01106   6.50518  1.5105   2.303  SD
12349.9   9.421401   6.41534  1.4934   1.997  SD
14037.6   9.549494   6.70083  1.5341   1.843  SD
261.143   5.565067   6.66922  1.5141  -0.72   HC
341.867   5.834423   6.72307  1.5464  -0.57   HC
329.622   5.797947   6.68663  1.5291  -0.57   HC
368.763   5.910154   6.58712  1.4414  -0.47   HC
418.856   6.037528   6.7602   1.5719  -0.45   HC
9663.86   9.176149   6.65757  1.526    1.639  HCR
8397.5    9.03569    6.56456  1.5104   1.623  HCR
3243.64   8.084451   6.61664  1.4968   0.976  HCR
1226.37   7.111811   6.67217  1.443    0.305  HCR

Page 23: Biostatistics Case Studies 2010

Microarray Analysis: Gene Hsd3b5

Two Sample T-Test for HCR vs. HC on Gene Hsd3b5

Group  N   Mean     SD     SE
HCR    4   1.136   0.634  0.32
HC     5  -0.555   0.107  0.048
Diff       1.691

95% CI for Diff: (1.02, 2.362)

T-Test: T = 5.96, P = 0.0006

Antilog(1.691) ≈ 5.42-fold greater HCR expression

“Z-Ratio” = Diff of mean z-scores/SD = 1.691/0.14 = 11.99

Here, SD=0.14 (rounded) is the SD of these differences over genes.
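The t-test summary can be reproduced from the z-scores listed for the HCR and HC mice. The sketch assumes a pooled-variance two-sample t-test, which appears to match the reported T, and uses the tabulated two-sided 5% critical value of t with 7 degrees of freedom (2.365) for the confidence interval. The p-value is not recomputed.

```python
import math

# Z-scores for gene Hsd3b5, as listed on the previous slide.
z_hcr = [1.639, 1.623, 0.976, 0.305]
z_hc = [-0.72, -0.57, -0.57, -0.47, -0.45]
T_CRIT_DF7 = 2.365  # two-sided 5% critical value of t with 4 + 5 - 2 = 7 df


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


n1, n2 = len(z_hcr), len(z_hc)
diff = mean(z_hcr) - mean(z_hc)

# Pooled variance (assumed) and the SE of the difference in means.
pooled_var = ((n1 - 1) * sample_var(z_hcr) + (n2 - 1) * sample_var(z_hc)) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = diff / se_diff
ci = (diff - T_CRIT_DF7 * se_diff, diff + T_CRIT_DF7 * se_diff)

print(f"HCR: mean {mean(z_hcr):.3f}, SD {math.sqrt(sample_var(z_hcr)):.3f}")
print(f"HC:  mean {mean(z_hc):.3f}, SD {math.sqrt(sample_var(z_hc)):.3f}")
print(f"Diff {diff:.3f}, T {t_stat:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.3f})")
```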

Page 24: Biostatistics Case Studies 2010

Expected Identified Genes among 38,348 Genes using p-values

Suppose the decision rule is to declare a particular gene important if its mean expression in HCR mice differs enough from that for HC mice so that p<0.05:

Significantly less → down-regulated.

Significantly more → up-regulated.

Then the expected number of identified genes among, say, 38,000 that are not affected (false positives) is:

0.05*38,000 = 1900

Thus, confirmatory analyses such as PCR are needed.
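The expected count of false positives can be checked directly, along with one simulated experiment. The simulation assumes that none of the 38,000 genes is affected, that genes are independent, and that p-values are uniformly distributed under the null hypothesis.

```python
import random

alpha = 0.05
n_unaffected_genes = 38000

# Expected number of unaffected genes declared significant by chance.
expected_false_positives = alpha * n_unaffected_genes

# One simulated experiment: each unaffected gene gets a uniform(0, 1) p-value.
rng = random.Random(0)
simulated_false_positives = sum(rng.random() < alpha for _ in range(n_unaffected_genes))

print(f"expected: {expected_false_positives:.0f}, simulated: {simulated_false_positives}")
```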