Gene Expression Index Stat 115 2012


Gene Expression Index

Stat 115

2012


Outline
• Gene expression index
– MAS4, average
– MAS5, Tukey Biweight
– dChip, model based, multi-array
– RMA, model based, multi-array
– Method comparison
• Latin Square spike-in experiment
– Importance of probe mapping

These are perhaps a few of the most popular among the many methods for normalizing and computing expression measures from Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.


cDNA Microarrays

• Fold change: ratio Cy5 / Cy3

• When fold change is negative

Log2(Cy5 / Cy3), genes (rows) by arrays (columns):

Gene   array 1   array 2   array 3   array 4   array 5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...
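A short sketch of the log-ratio computation behind the table above. The function name and the intensity values are made up for illustration. The point is that log2(Cy5 / Cy3) is positive when Cy5 is brighter and negative when Cy3 is brighter, so up- and down-regulation are symmetric around 0.

```python
from math import log2

def log2_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one spot (both intensities must be > 0)."""
    return log2(cy5 / cy3)

# Hypothetical intensities, not taken from the slides.
up = log2_ratio(800, 400)     # Cy5 twice as bright as Cy3
down = log2_ratio(400, 800)   # Cy5 half as bright as Cy3
print(up, down)
```

A 2-fold increase and a 2-fold decrease land at the same distance from 0 on the log scale, which is not true of the raw ratio (2 versus 0.5).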


Affymetrix Microarray Expression Index

• How to summarize probes in a probeset?

Brighter PM usually carries more information, but not always the case (cross-hybridization)


MAS4
• GeneChip® older software: Microarray Suite (MAS) 4.0 uses AvgDiff

$$ AvgDiff = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j) $$

• A: a set of suitable pairs chosen by software
– Remove highest/lowest
– Calculate mean, sd from remaining probes
– Eliminate probes more than 3 sd from mean
• Drawback (naïve algorithm):
– Can omit 30-40% probes
– Can give negative values
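The AvgDiff steps can be sketched as below. This is a minimal illustration, not the Affymetrix implementation: the slide does not say whether the 3-sd filter is applied to all pairs or only to the trimmed ones, so the sketch assumes it is applied to all pairs. The function name and the example intensities are hypothetical.

```python
from statistics import mean, stdev

def avg_diff(pm, mm):
    """Average PM - MM difference over the retained probe pairs (set A)."""
    diffs = [p - m for p, m in zip(pm, mm)]
    # Drop the highest and lowest difference before estimating mean and sd.
    trimmed = sorted(diffs)[1:-1]
    mu, sd = mean(trimmed), stdev(trimmed)
    # Assumption: keep every pair within 3 sd of that trimmed mean.
    kept = [d for d in diffs if abs(d - mu) <= 3 * sd]
    return mean(kept)

# Hypothetical probe set with two outlying pairs.
pm = [120, 150, 130, 900, 140, 135]
mm = [100, 110, 105, 120, 400, 108]
print(avg_diff(pm, mm))

# When MM is brighter than PM for most pairs, the index goes negative.
print(avg_diff([100, 100, 100, 100], [150, 160, 155, 158]))
```

The second call illustrates the drawback listed above: nothing in the average prevents a negative expression value.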


MAS5
• GeneChip® newest version
• CT* (change threshold): a version of MM that is never bigger than PM
– If MM < PM, CT* = MM
– If MM > PM, estimate the typical-case MM for that PM
  • Tukey biweight of MMs with similar PM values (~70% of PM)
– If the typical MMs are also > PM, set CT* slightly below PM, so that PM - CT* stays positive for the log
• Robust weighting to down-weight outliers

$$ signal = \mathrm{TukeyBiweight}\{\log(PM_j - CT^*_j)\} $$
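A minimal sketch of the robust averaging step, assuming the commonly described one-step Tukey biweight: center on the median, scale by the median absolute deviation (MAD), and give zero weight to points far from the center. The tuning constants `c = 5` and `eps = 0.0001` are assumptions and do not appear on the slide; the function names and example values are hypothetical, and the CT* values are taken as already computed.

```python
from math import log2
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)          # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_signal(pm, ct_star):
    """Tukey biweight of log2(PM_j - CT*_j); requires CT*_j < PM_j."""
    return tukey_biweight([log2(p - ct) for p, ct in zip(pm, ct_star)])

# The outlier 100 gets zero weight, so the result stays near the bulk of the data.
print(tukey_biweight([1, 2, 3, 4, 100]))
print(mas5_signal([164, 264, 1064], [100, 200, 1000]))
```

Compared with a plain mean (22 for the first example), the biweight stays between 2 and 3, which is the behavior the "robust weighting" bullet refers to.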


Li & Wong (dChip)
Important observation: relative values of probes within a probeset are very stable across multiple samples.


Model-Based Expression Index

• Look at multiple samples at a time, give different probes a different weight

• Each probe signal is proportional to
– Amount of target sample: θi
– Affinity of specific probe sequence to the target: φj

[Figure: intensities of probes 1, 2, 3 in sample 1 and sample 2]


Li & Wong (dChip)

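• Model

$$ PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij} $$

– θi: concentration in sample i; φj: probe affinity of probe j; εij: error
• Iteratively estimate θi and φj to minimize εij
• Try to minimize the sum of squared errors

                Probe1 (φ1)   Probe2 (φ2)   Probe3 (φ3)   …
Sample1 (θ1)    (PM-MM)11     (PM-MM)12     (PM-MM)13     ...
Sample2 (θ2)    (PM-MM)21     (PM-MM)22     (PM-MM)23     ...
Sample3 (θ3)    (PM-MM)31     (PM-MM)32     (PM-MM)33     ...
…               ...           ...           ...           ...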
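The iterative estimation can be sketched as alternating least squares: fix φ and solve for each θi, then fix θ and solve for each φj, and repeat. This is an illustrative sketch only, not dChip's code. It assumes the identifiability constraint that the squared φj sum to the number of probes, and it omits the outlier detection that dChip adds. The function name and the toy data are hypothetical.

```python
def fit_li_wong(d, n_iter=20):
    """Fit d[i][j] ~ theta[i] * phi[j] by alternating least squares.

    d holds PM - MM differences, one row per sample, one column per probe.
    """
    n_samples, n_probes = len(d), len(d[0])
    phi = [1.0] * n_probes
    for _ in range(n_iter):
        # Given phi, the least-squares theta_i is a phi-weighted average of row i.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(d[i][j] * phi[j] for j in range(n_probes)) / ss_phi
                 for i in range(n_samples)]
        # Given theta, the least-squares phi_j is a theta-weighted average of column j.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(d[i][j] * theta[i] for i in range(n_samples)) / ss_theta
               for j in range(n_probes)]
        # Assumed constraint: sum of phi_j^2 equals the number of probes.
        scale = (n_probes / sum(p * p for p in phi)) ** 0.5
        phi = [p * scale for p in phi]
    # Final theta for the rescaled phi.
    ss_phi = sum(p * p for p in phi)
    theta = [sum(d[i][j] * phi[j] for j in range(n_probes)) / ss_phi
             for i in range(n_samples)]
    return theta, phi

# Noise-free toy data built from made-up concentrations and affinities.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5]
d = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_li_wong(d)
print(theta, phi)
```

Because θ and φ enter only through their product, they are determined only up to a scale factor; the constraint on φ is what pins down the scale of the expression index θi.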


RMA = Robust Multi-chip Analysis

• Irizarry & Speed, 2003

• Eliminates MM probes

• Probe intensity background adjustment

• Quantile normalize the background adjusted PM

• Take Log of PM

• Robust probe summary
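Quantile normalization, one of the steps listed above, can be sketched as follows: sort each array, average the values at each rank across arrays, and give every probe the average for its rank. This is a minimal illustration with a hypothetical function name and toy values; ties are broken by position, which is a simplifying assumption.

```python
def quantile_normalize(arrays):
    """Force every array (list of intensities) to share the same distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda k: a[k])   # probe indices by rank
        normalized = [0.0] * n
        for rank, k in enumerate(order):
            normalized[k] = rank_means[rank]
        result.append(normalized)
    return result

# Two toy arrays with three probes each.
out = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(out)
```

After normalization both arrays contain the same set of values, and only the ranking of probes within each array is preserved.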


RMA Background Subtraction

• Signal + BG = PM

• Signal ~ exponential; BG ~ normal

[Figure: Signal + Noise = Observed]
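Under this model, the background-adjusted value can be taken as the expected signal given the observed PM. The sketch below assumes the signal is exponential with rate `alpha`, the background is normal with mean `mu` and standard deviation `sigma` (not truncated at zero), and that these three parameters are already known; how they are estimated is not covered here. With those assumptions the signal given the observation is a normal distribution truncated at zero, whose mean has a closed form. All names and parameter values are hypothetical.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def expected_signal(observed, mu, sigma, alpha):
    """E[Signal | Observed] for Signal ~ Exp(alpha), BG ~ Normal(mu, sigma^2)."""
    a = observed - mu - sigma ** 2 * alpha
    # Mean of a Normal(a, sigma^2) truncated to positive values.
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Made-up parameters: a dim probe below the background mean and a bright probe.
dim = expected_signal(50, mu=100, sigma=20, alpha=0.01)
bright = expected_signal(1000, mu=100, sigma=20, alpha=0.01)
print(dim, bright)
```

Unlike plain subtraction of a background estimate, this adjustment stays positive even when the observed intensity is below the background mean, so the later log transform is always defined. For observations extremely far below the background the normal CDF can underflow to zero, which this sketch does not guard against.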


RMA Background Subtraction

• BG distribution


Why Log(PM)
• Captures the fact that higher value probes are more variable
• Assume probe noise is comparable on log scale


RMA
• For each probe set, PMij = θi φj, so on the log scale:

$$ \log(PM_{ij}) = \log(\theta_i) + \log(\phi_j) $$

• Fit the model:

$$ \log_2 n(PM_{ij} - bg) = a_i + b_j + \varepsilon_{ij} $$

– ai is expression index, bj is probe effect
– log2 n() stands for logarithm after quantile normalization of n samples
• Iteratively refit ai and bj (similar to dChip)
– Main difference is to minimize error at log PM


RMA model fitting: Median Polish

• For a given probe set with J probe pairs, let yij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.
• Assume yij = μi + αj + eij where α1 + α2 + ... + αJ = 0.
– μi: gene expression of the probe set on GeneChip i
– αj: probe affinity effect for the jth probe in the probe set
– eij: residual
• Perform Tukey’s Median Polish on the matrix of yij values with yij in the ith row and jth column.


An Example (from Dan Nettleton)

Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.

                    Probe
GeneChip    1    2    3    4    5
1           4    3    6    4    7
2           8    1   10    5   11
3           6    2    7    8    8
4           9    4   12    9   12
5           7    5    9    6   10


An Example (continued)

Original matrix, with row medians on the right:

 4   3   6   4   7   |  4
 8   1  10   5  11   |  8
 6   2   7   8   8   |  7
 9   4  12   9  12   |  9
 7   5   9   6  10   |  7

Matrix after removing row medians:

 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3


An Example (continued)

Current matrix, with column medians in the last row:

 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3
--------------------
 0  -5   2   0   3   (column medians)

Matrix after subtracting column medians:

 0   4   0   0   0
 0  -2   0  -3   0
-1   0  -2   1  -2
 0   0   1   0   0
 0   3   0  -1   0


An Example (continued)

Current matrix, with row medians on the right:

 0   4   0   0   0   |  0
 0  -2   0  -3   0   |  0
-1   0  -2   1  -2   | -1
 0   0   1   0   0   |  0
 0   3   0  -1   0   |  0

Matrix after removing row medians:

 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0


An Example (continued)

Current matrix, with column medians in the last row:

 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0
--------------------
 0   1   0   0   0   (column medians)

Matrix after subtracting column medians:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0


An Example (continued)

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.


An Example (continued)

Original matrix:

 4   3   6   4   7
 8   1  10   5  11
 6   2   7   8   8
 9   4  12   9  12
 7   5   9   6  10

Residuals from median polish:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

Matrix of fitted values (original minus residuals), with row means on the right:

 4   0   6   4   7   |  4.2 = μ̂1
 8   4  10   8  11   |  8.2 = μ̂2
 6   2   8   6   9   |  6.2 = μ̂3
 9   5  11   9  12   |  9.2 = μ̂4
 7   3   9   7  10   |  7.2 = μ̂5

The row means are the RMA expression measures for the 5 GeneChips.
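The worked example can be reproduced with a short median polish sketch. It follows the steps of the example (alternately subtract row medians and column medians until all medians are 0, then take row means of the fitted values). It is an illustration with a hypothetical function name, not the code used by any RMA implementation.

```python
from statistics import median

def median_polish(y, max_iter=10):
    """Tukey's median polish; returns (fitted values, residuals) as lists of rows."""
    resid = [[float(v) for v in row] for row in y]
    n_rows, n_cols = len(resid), len(resid[0])
    for _ in range(max_iter):
        # Sweep out row medians.
        row_med = [median(row) for row in resid]
        resid = [[v - row_med[i] for v in resid[i]] for i in range(n_rows)]
        # Sweep out column medians.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        resid = [[resid[i][j] - col_med[j] for j in range(n_cols)] for i in range(n_rows)]
        # Converged once every row and column median is 0.
        if not any(row_med) and not any(col_med):
            break
    fitted = [[y[i][j] - resid[i][j] for j in range(n_cols)] for i in range(n_rows)]
    return fitted, resid

# The probe set from the example: rows are GeneChips, columns are probes.
y = [[4, 3, 6, 4, 7],
     [8, 1, 10, 5, 11],
     [6, 2, 7, 8, 8],
     [9, 4, 12, 9, 12],
     [7, 5, 9, 6, 10]]
fitted, resid = median_polish(y)
expression = [sum(row) / len(row) for row in fitted]
print(expression)
```

For this input the row means of the fitted matrix agree with the values 4.2, 8.2, 6.2, 9.2, 7.2 obtained by hand above.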


Method Comparison Standard
• Spike-ins: introduce markers with known concentration (intensity) to RNA samples
– Should cover a broad range of concentrations
– Run two samples with and without spike-in, see whether algorithm can detect the spike-in (differential expression)
• Dilutions:
– Serial dilutions: 1:2, 1:4, 1:8…
• Latin square spike-in captures both approaches above
• Compare both accuracy qualitatively and expression index quantitatively


Latin Square Spike-ins


Method Comparison of Spike-in

[Figure: four panels (MAS4, MAS5, dChip, RMA); red numbers indicate spiked genes]


Method Comparison Conclusion

• No one uses MAS4 now
• With fold change, RMA > dChip > MAS5
• With p-value, RMA ~ MAS5 > dChip
• MAS 5.0 does a good job on abundant genes
• dChip and RMA do better on less abundant genes
• Affy developed multi-chip model-based PLIER, currently open source, although no documentation
• All five models are implemented in BioConductor (open source R package)


214019_at: CCND1


Probe Mapping in Affymetrix Expression arrays

• Inconsistencies in ~5% of NetAffx probe-to-gene annotations (Perez-Iratxeta et al. 2005).

• Remapping all the probes with documented human transcripts resulted in the redefinition of ~37% of probes in Affy’s newest U133 Plus 2.0 array (Harbig et al. 2005).
– Provide new and better .cdf file for probe mapping

• Evolving gene/transcript definitions can cause ~30% difference in the differentially expressed genes (Dai et al. 2005).


Acknowledgment

• Terry Speed, Rafael Irizarry & group
• Kevin Coombes & Keith Baggerly
• Erick Rouchka
• Wing Wong & Cheng Li
• Mark Reimers
• Erin Conlon
• Larry Hunter
• Zhijin Wu
• Wei Li