Gene Expression Index Stat 115 2012. 2 Outline Gene expression index –MAS4, average –MAS5, Tukey Biweight –dChip, model based, multi-array –RMA, model

Gene Expression Index

Stat 115

2012

2

Outline

• Gene expression index

– MAS4, average

– MAS5, Tukey Biweight

– dChip, model based, multi-array

– RMA, model based, multi-array

– Method comparison

• Latin Square spike-in experiment

– Importance of probe mapping

These are perhaps the most popular few of the many methods for normalizing and computing expression measures from Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.

3

cDNA Microarrays

• Fold change: ratio Cy5 / Cy3

• When fold change is negative

Log2(Cy5 / Cy3)

Genes (rows) by arrays (columns):

Gene   array 1   array 2   array 3   array 4   array 5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...
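A minimal sketch of how one entry of such a table could be computed. The function name and the intensity values are hypothetical, chosen only to show that the log2 scale makes up- and down-regulation symmetric around zero.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical intensities: a fourfold increase and a fourfold decrease
up = log2_ratio(800, 200)
down = log2_ratio(200, 800)
print(up, down)  # same magnitude, opposite sign
```

On the raw ratio scale the same two spots would read 4 and 0.25, which is why the log ratio is the usual table entry.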

4

Affymetrix Microarray Expression Index

• How to summarize probes in a probeset?

Brighter PM usually carries more information, but not always the case (cross-hybridization)

5

MAS4

• GeneChip® older software: Microarray Analysis Software 4.0 uses AvgDiff

$$\mathrm{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$

• A: a set of suitable pairs chosen by software
– Remove highest/lowest
– Calculate mean, sd from remaining probes
– Eliminate probes more than 3 sd from mean

• Drawback (naïve algorithm):
– Can omit 30-40% probes
– Can give negative values
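The AvgDiff recipe on this slide can be sketched as follows. This is a simplified reading of the bullets, not the exact MAS4 implementation: the function name, the `n_sd` parameter, and the example intensities are assumptions for illustration.

```python
import numpy as np

def avg_diff(pm, mm, n_sd=3.0):
    """Average PM - MM over the probe pairs that survive outlier removal."""
    d = np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float)
    # Drop the highest and lowest difference, then take mean and sd of the rest
    trimmed = np.sort(d)[1:-1]
    mean, sd = trimmed.mean(), trimmed.std(ddof=1)
    # The set A: pairs within n_sd standard deviations of that mean
    keep = np.abs(d - mean) <= n_sd * sd
    return d[keep].mean()

# Hypothetical probe set with one extreme pair
pm = [100, 120, 110, 130, 5000]
mm = [50, 60, 55, 70, 40]
print(avg_diff(pm, mm))
```

Because the result is a plain average of differences, a probe set whose MM values exceed its PM values gives a negative AvgDiff, which is the drawback noted above.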

6

MAS5

• GeneChip® newest version

• CT* (change threshold): a version of MM that is never bigger than PM
– If MM < PM, CT* = MM
– If MM > PM, estimate the typical-case MM for that PM
  • Tukey biweight of MMs with similar PM values, ~70% of PM
– If the typical MM is also > PM, set CT* slightly below PM (PM minus a small positive amount)

• Robust weighting to down-weight outliers

$$\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - CT^*_j)\}$$
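A hedged sketch of the robust averaging step. It uses the one-step Tukey biweight (median as center, median absolute deviation as scale); the tuning constant `c = 5`, the small `eps`, the function names, and the choice of log base 2 are assumptions here, not values taken from the slide.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m))          # median absolute deviation
    u = (x - m) / (c * s + eps)           # scaled distance from the median
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def mas5_like_signal(pm, ct):
    """Biweight of log2(PM - CT*); assumes every CT* is strictly below its PM."""
    diff = np.asarray(pm, dtype=float) - np.asarray(ct, dtype=float)
    return tukey_biweight(np.log2(diff))

print(tukey_biweight([1, 1, 1, 1, 100]))  # the value 100 gets zero weight
```

Values close to the median get weight near 1, and the weight falls smoothly to 0 for values far from it, so a single aberrant probe cannot drag the signal.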

7

Li & Wong (dChip)

Important observation: relative values of probes within a probeset are very stable across multiple samples.

8

Model-Based Expression Index

• Look at multiple samples at a time, give different probes a different weight

• Each probe signal is proportional to
– Amount of target sample: θi
– Affinity of specific probe sequence to the target: φj

[Figure: intensities of probes 1, 2, 3 in sample 1 and sample 2]

9

Li & Wong (dChip)

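• Model

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$$

where θi is the concentration in sample i, φj is the affinity of probe j, and εij is the error.

• Iteratively estimate θi and φj to minimize εij

• Try to minimize the sum of squared errors

Data matrix of PM minus MM values (rows: Sample1, Sample2, Sample3, … with concentrations θ1, θ2, θ3, …; columns: Probe1, Probe2, Probe3, … with affinities φ1, φ2, φ3, …):

$$\begin{pmatrix}
(PM-MM)_{11} & (PM-MM)_{12} & (PM-MM)_{13} & \cdots \\
(PM-MM)_{21} & (PM-MM)_{22} & (PM-MM)_{23} & \cdots \\
(PM-MM)_{31} & (PM-MM)_{32} & (PM-MM)_{33} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}$$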
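The iterative fit can be sketched as alternating least squares: hold φ fixed and solve for θ, then hold θ fixed and solve for φ. This is a minimal sketch only. The function name, the iteration count, and the identifiability constraint (sum of φj² equal to the number of probes, as in Li & Wong's paper rather than on this slide) are assumptions, and dChip's outlier detection is left out.

```python
import numpy as np

def fit_li_wong(d, n_iter=50):
    """Fit d[i, j] ~ theta[i] * phi[j] by alternating least squares.

    d is a samples x probes matrix of PM - MM values.
    """
    d = np.asarray(d, dtype=float)
    n_probes = d.shape[1]
    phi = np.ones(n_probes)
    for _ in range(n_iter):
        theta = d @ phi / (phi @ phi)            # best theta for the current phi
        phi = theta @ d / (theta @ theta)        # best phi for the current theta
        phi *= np.sqrt(n_probes / (phi @ phi))   # rescale so sum(phi^2) = J
    theta = d @ phi / (phi @ phi)                # final theta on the rescaled phi
    return theta, phi

# Hypothetical noise-free data: 3 samples, 3 probes
d = np.outer([100.0, 200.0, 400.0], [0.5, 1.0, 1.5])
theta, phi = fit_li_wong(d)
print(theta, phi)
```

Without the rescaling step the model is not identifiable, since multiplying every θi by a constant and dividing every φj by the same constant leaves the fit unchanged.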

10

RMA = Robust Multi-chip Analysis

• Irizarry & Speed, 2003

• Eliminates MM probes

• Probe intensity background adjustment

• Quantile normalize the background adjusted PM

• Take Log of PM

• Robust probe summary
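As an illustration of the quantile normalization step, the sketch below forces every array (column) to share the same distribution: the mean of the sorted columns. The function name and the example matrix are hypothetical, and ties are broken by sort order rather than averaged, which is a simplification.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of x (rows: probes, columns: arrays)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                      # per-array sort order
    ranks = np.argsort(order, axis=0)                  # rank of each probe in its array
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)   # reference distribution
    return mean_quantiles[ranks]

# Hypothetical background-adjusted PM values: 4 probes on 3 arrays
x = [[5, 4, 3],
     [2, 1, 4],
     [3, 4, 6],
     [4, 2, 8]]
print(quantile_normalize(x))
```

After this step each array has the same set of values, and only the ranking of probes within an array differs between arrays.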

11

RMA Background Subtraction

• Signal + BG = PM

• Signal ~ exponential; BG ~ normal

[Figure: Signal + Noise = Observed]
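Under the exponential-signal, normal-background model, the adjusted intensity can be taken as the expected signal given the observed PM. The sketch below uses the common closed form that ignores the upper truncation term. The parameter names `mu`, `sigma`, `alpha` (background mean, background sd, exponential rate) and their example values are assumptions; in practice they are estimated from the data, which is not shown here.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def rma_bg_adjust(pm, mu, sigma, alpha):
    """Expected signal given observed pm, for signal ~ Exp(alpha), BG ~ N(mu, sigma^2)."""
    a = pm - mu - alpha * sigma**2
    z = a / sigma
    # Mean of a normal(a, sigma^2) distribution truncated to positive values
    return a + sigma * norm_pdf(z) / norm_cdf(z)

# Hypothetical parameters; the adjusted value stays positive even when pm < mu
for pm in (50, 100, 200, 1000):
    print(pm, rma_bg_adjust(pm, mu=100, sigma=20, alpha=0.01))
```

Unlike a plain subtraction of the background mean, this adjustment never produces negative values, so the later log transform is always defined.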

12

RMA Background Subtraction

• BG distribution

13

Why Log(PM)

• Captures the fact that higher value probes are more variable

• Assume probe noise is comparable on log scale

14

RMA

• For each probe set, PMij = θi φj, so

$$\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$$

• Fit the model:

$$\log_2 n(PM_{ij} - bg) = a_i + b_j + \varepsilon_{ij}$$

– ai is expression index, bj is probe effect
– log2 n() stands for logarithm after quantile normalization of n samples

• Iteratively refit ai and bj (similar to dChip)
– Main difference is to minimize error at log PM

RMA model fitting: Median Polish

15

• For a given probe set with J probe pairs, let yij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.

• Assume yij = μi + αj + eij where α1 + α2 + ... + αJ = 0.
– μi: gene expression of the probe set on GeneChip i
– αj: probe affinity effect for the jth probe in the probe set
– eij: residual

• Perform Tukey’s Median Polish on the matrix of yij values with yij in the ith row and jth column.

16

An Example (from Dan Nettleton)

Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.

                Probe
GeneChip    1   2   3   4   5
1           4   3   6   4   7
2           8   1  10   5  11
3           6   2   7   8   8
4           9   4  12   9  12
5           7   5   9   6  10

17

An Example (continued)

Original matrix, with row medians on the right:

 4   3   6   4   7   |  4
 8   1  10   5  11   |  8
 6   2   7   8   8   |  7
 9   4  12   9  12   |  9
 7   5   9   6  10   |  7

Matrix after removing row medians:

 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3

18


Current matrix:

 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3

Column medians:

 0  -5   2   0   3

Matrix after subtracting column medians:

 0   4   0   0   0
 0  -2   0  -3   0
-1   0  -2   1  -2
 0   0   1   0   0
 0   3   0  -1   0

19


Current matrix, with row medians on the right:

 0   4   0   0   0   |  0
 0  -2   0  -3   0   |  0
-1   0  -2   1  -2   | -1
 0   0   1   0   0   |  0
 0   3   0  -1   0   |  0

Matrix after removing row medians:

 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0

20


Current matrix:

 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0

Column medians:

 0   1   0   0   0

Matrix after subtracting column medians:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

21


 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.

22


Original matrix:

 4   3   6   4   7
 8   1  10   5  11
 6   2   7   8   8
 9   4  12   9  12
 7   5   9   6  10

Residuals from median polish:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

Matrix of fitted values (original minus residuals), with row means on the right:

 4   0   6   4   7   |  4.2 = μ̂1
 8   4  10   8  11   |  8.2 = μ̂2
 6   2   8   6   9   |  6.2 = μ̂3
 9   5  11   9  12   |  9.2 = μ̂4
 7   3   9   7  10   |  7.2 = μ̂5

The row means are the RMA expression measures for the 5 GeneChips.
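The worked example can be reproduced with a short sketch of the median polish procedure as the slides apply it: alternately remove row and column medians until all are zero, then take row means of the fitted values. The function name, iteration cap, and tolerance are assumptions; library implementations may report the row effects differently.

```python
import numpy as np

def median_polish_expression(y, max_iter=10, tol=1e-8):
    """Return (expression measures, residuals) for a GeneChips x probes matrix."""
    y = np.asarray(y, dtype=float)
    resid = y.copy()
    for _ in range(max_iter):
        row_med = np.median(resid, axis=1, keepdims=True)
        resid -= row_med                      # remove row medians
        col_med = np.median(resid, axis=0, keepdims=True)
        resid -= col_med                      # remove column medians
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break                             # all medians are zero: converged
    fitted = y - resid
    return fitted.mean(axis=1), resid

y = [[4, 3, 6, 4, 7],
     [8, 1, 10, 5, 11],
     [6, 2, 7, 8, 8],
     [9, 4, 12, 9, 12],
     [7, 5, 9, 6, 10]]
expr, resid = median_polish_expression(y)
print(expr)  # one expression measure per GeneChip
```

Because medians rather than means are removed, the large residuals in probe 2 have little influence on the expression measures.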

23

Method Comparison Standard

• Spike-ins: introduce markers with known concentration (intensity) to RNA samples
– Should cover a broad range of concentrations
– Run two samples with and without spike-in, see whether algorithm can detect the spike-in (differential expression)

• Dilutions:
– Serial dilutions: 1:2, 1:4, 1:8…

• Latin square spike-in captures both approaches above

• Compare both accuracy qualitatively and expression index quantitatively

24

Latin Square Spike-ins

25

Method Comparison of Spike-in

[Figure: four panels, one each for MAS4, MAS5, dChip, and RMA. Red numbers indicate spiked genes.]

26

Method Comparison Conclusion

• No one uses MAS4 now

• With fold change, RMA > dChip > MAS5

• With p-value, RMA ~ MAS5 > dChip

• MAS 5.0 does a good job on abundant genes

• dChip and RMA do better on less abundant genes

• Affy developed multi-chip model-based PLIER, currently open source, although no documentation

• All five models are implemented in BioConductor (open source R package)

27

214019_at: CCND1

....

28

29

Probe Mapping in Affymetrix Expression arrays

• Inconsistencies in ~5% of NetAffx probe-to-gene annotations (Perez-Iratxeta et al. 2005).

• Remapping all the probes with documented human transcripts resulted in the redefinition of ~37% of probes in Affy’s newest U133 Plus 2.0 array (Harbig et al. 2005).
– Provide new and better .cdf file for probe mapping

• Evolving gene/transcript definitions can cause ~30% difference in the differentially expressed genes (Dai et al. 2005).

30

Acknowledgment

• Terry Speed, Rafael Irizarry & group
• Kevin Coombes & Keith Baggerly
• Erick Rouchka
• Wing Wong & Cheng Li
• Mark Reimers
• Erin Conlon
• Larry Hunter
• Zhijin Wu
• Wei Li
