Gene Expression Index
Stat 115
2012
Outline
• Gene expression index
– MAS4, average
– MAS5, Tukey Biweight
– dChip, model based, multi-array
– RMA, model based, multi-array
– Method comparison
• Latin Square spike-in experiment
– Importance of probe mapping
These are perhaps the most popular of the many methods for normalizing and computing expression measures from Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.
cDNA Microarrays
• Fold change: ratio Cy5 / Cy3
• When fold change is negative
Log2(Cy5 / Cy3), genes in rows and arrays in columns:
Gene   array 1   array 2   array 3   array 4   array 5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...
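As a small illustration of the log ratio used in the table above, the sketch below computes Log2(Cy5 / Cy3) for a few spots. The function name `log2_ratio` and the intensity values are hypothetical, chosen only to show that a 2-fold increase maps to +1, no change to 0, and a 2-fold decrease to -1.

```python
import math


def log2_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)


# Hypothetical (Cy5, Cy3) intensities: 2-fold up, unchanged, 2-fold down.
examples = [(2000.0, 1000.0), (1500.0, 1500.0), (500.0, 1000.0)]
log_ratios = [log2_ratio(cy5, cy3) for cy5, cy3 in examples]
```

On the log scale, up- and down-regulation of the same magnitude are symmetric around zero, which is why the table contains both positive and negative entries.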
Affymetrix Microarray Expression Index
• How to summarize probes in a probeset?
Brighter PM usually carries more information, but not always the case (cross-hybridization)
MAS4
• GeneChip® older software Microarray Analysis Software 4.0 uses AvgDiff
• A: a set of suitable pairs chosen by software
– Remove highest/lowest
– Calculate mean, sd from remaining probes
– Eliminate probes more than 3 sd from mean
• Drawback (naïve algorithm):
– Can omit 30-40% probes
– Can give negative values
$\text{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$
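The AvgDiff steps listed above can be sketched roughly as follows. This is a minimal sketch, not the MAS4 implementation: it assumes the trimming is applied to the PM - MM differences and that a probe set has at least four probe pairs. The names `avg_diff` and `n_sd` are hypothetical.

```python
import statistics


def avg_diff(pm, mm, n_sd=3.0):
    """Sketch of a MAS4-style AvgDiff for one probe set.

    pm, mm: equal-length lists of PM and MM intensities (at least 4 pairs).
    Assumption: outlier screening is done on the differences PM_j - MM_j.
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    # Remove the single highest and lowest difference before estimating spread.
    trimmed = sorted(diffs)[1:-1]
    mean = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)
    # The set A: pairs whose difference lies within n_sd standard deviations.
    kept = [d for d in diffs if abs(d - mean) <= n_sd * sd]
    return sum(kept) / len(kept)


# Hypothetical probe set with one cross-hybridizing outlier pair (900 vs 100).
value = avg_diff([120, 130, 110, 125, 900, 115], [100] * 6)
# When MM is brighter than PM throughout, the result is negative.
negative = avg_diff([10, 12, 11, 9], [20, 22, 21, 19])
```

The second call illustrates the drawback noted on the slide: nothing prevents the average difference from falling below zero.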
MAS5
• GeneChip® newest version
• CT* (change threshold): a version of MM that is never bigger than PM
– If MM < PM, CT* = MM
– If MM > PM, estimate typical case MM for PM
   • Tukey biweight of MMs with similar PM values, ~70% PM
– If typical MMs > PM, set CT* slightly below PM (PM minus a small positive amount)
• Robust weighting to down weight outliers
$\text{signal} = \text{TukeyBiweight}\{\log(PM_j - CT^*_j)\}$
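The robust weighting can be illustrated with a one-step Tukey biweight, sketched below. It is only a sketch under stated assumptions: the tuning constant `c = 5` and the small `eps = 1e-4` are the commonly cited defaults, the log is taken base 2, and the names `tukey_biweight` and `mas5_like_signal` are hypothetical. The CT* values are taken as given rather than computed.

```python
import math
import statistics


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    m = statistics.median(values)
    # Median absolute deviation from the median, used as the scale.
    s = statistics.median([abs(v - m) for v in values])
    total_w = 0.0
    total_wx = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        # Bisquare weight: 1 at the median, falling to 0 once |u| >= 1.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w


def mas5_like_signal(pm, ct):
    """Tukey biweight of log2(PM_j - CT*_j); assumes PM_j > CT*_j for all j."""
    return tukey_biweight([math.log2(p - c) for p, c in zip(pm, ct)])


# The outlier 100 gets weight 0, so the summary stays near the bulk of the data.
robust = tukey_biweight([1, 2, 3, 4, 100])
```

Compared with the plain mean of the same five values (22), the biweight stays close to the median, which is the behavior the slide describes as down weighting outliers.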
Li & Wong (dChip)
Important observation: relative values of probes within a probeset very stable across multiple samples.
Model-Based Expression Index
• Look at multiple samples at a time, give different probes a different weight
• Each probe signal is proportional to
– Amount of target sample: θi
– Affinity of specific probe sequence to the target: φj
[Figure: signals of probes 1, 2, 3 in sample 1 and sample 2]
Li & Wong (dChip)
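• Model: $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
– θi: concentration in sample i; φj: probe affinity of probe j; εij: error
• Iteratively estimate θi and φj to minimize εij
• Try to minimize the sum of squared errors
• Data matrix of PM − MM values, one row per sample and one column per probe:
                Probe1 (φ1)   Probe2 (φ2)   Probe3 (φ3)   …
Sample1 (θ1)    (PM−MM)11     (PM−MM)12     (PM−MM)13     ...
Sample2 (θ2)    (PM−MM)21     (PM−MM)22     (PM−MM)23     ...
Sample3 (θ3)    (PM−MM)31     (PM−MM)32     (PM−MM)33     ...
…               ...           ...           ...           ...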
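The iterative estimation can be sketched as alternating least squares on the PM - MM matrix. This is a minimal sketch under stated assumptions, not dChip itself: it omits dChip's outlier handling, and it fixes the scale with the constraint that the squared probe affinities sum to the number of probes. The name `fit_li_wong` and the input values are hypothetical.

```python
def fit_li_wong(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y[i][j] holds PM_ij - MM_ij for sample i and probe j.
    Returns (theta, phi) with sum(phi_j ** 2) equal to the number of probes.
    """
    n_samples, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_samples
    for _ in range(n_iter):
        # Given phi, the least-squares estimate of each theta_i.
        phi_sq = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / phi_sq
                 for i in range(n_samples)]
        # Given theta, the least-squares estimate of each phi_j.
        theta_sq = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_samples)) / theta_sq
               for j in range(n_probes)]
        # Rescale so the model is identifiable; the products stay unchanged.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi


# Hypothetical noise-free data: sample 2 has twice the target of sample 1.
y = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
theta, phi = fit_li_wong(y)
```

Because the probe pattern is shared across samples, the ratio of the fitted θ values recovers the 2-fold difference between the two samples.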
RMA = Robust Multi-chip Analysis
• Irizarry & Speed, 2003
• Eliminates MM probes
• Probe intensity background adjustment
• Quantile normalize the background adjusted PM
• Take Log of PM
• Robust probe summary
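The quantile normalization step can be sketched as follows: every array is forced to share the same empirical distribution, taken as the mean of the sorted values at each rank. This is a minimal sketch that breaks ties by position; the name `quantile_normalize` and the input values are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile normalize a list of equal-length lists, one list per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays of the value at each rank.
    ref = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n), key=lambda k: a[k])
        out = [0.0] * n
        for rank, k in enumerate(order):
            out[k] = ref[rank]
        result.append(out)
    return result


# Two hypothetical arrays with three probes each.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step each array contains exactly the same set of values, and only the assignment of values to probes differs between arrays.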
RMA Background Subtraction
• Signal + BG = PM
• Signal ~ exponential; BG ~ normal
[Figure: Signal + Noise = Observed]
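Under the exponential signal plus normal background model, the adjusted value is usually taken to be the expected signal given the observed PM. The sketch below evaluates that conditional expectation in closed form. It assumes the parameters `mu`, `sigma` (background mean and sd) and `alpha` (rate of the exponential signal) are already known, whereas RMA estimates them from the data; the function names are hypothetical.

```python
import math


def _normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rma_bg_adjust(o, mu, sigma, alpha):
    """E[Signal | Observed = o] for Signal ~ Exp(alpha), BG ~ N(mu, sigma^2)."""
    a = o - mu - sigma * sigma * alpha
    b = sigma
    num = _normal_pdf(a / b) - _normal_pdf((o - a) / b)
    den = _normal_cdf(a / b) + _normal_cdf((o - a) / b) - 1.0
    return a + b * num / den


# Hypothetical parameters: background around 100 with sd 10.
bright = rma_bg_adjust(1000.0, mu=100.0, sigma=10.0, alpha=0.01)
dim = rma_bg_adjust(90.0, mu=100.0, sigma=10.0, alpha=0.01)
```

For a bright probe the adjustment is close to subtracting the background mean, while a probe observed below the background mean still receives a small positive value, so the later log transform remains defined.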
RMA Background Subtraction
• BG distribution
Why Log(PM)
• Captures the fact that higher value probes are more variable
• Assume probe noise is comparable on log scale
RMA
• For each probe set, $PM_{ij} = \theta_i \phi_j$, so $\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$
• Fit the model: $\log_2 n(PM_{ij} - bg) = a_i + b_j + \varepsilon_{ij}$
– ai is expression index, bj is probe effect
– log2 n() stands for logarithm after quantile normalization of n samples
• Iteratively refit ai and bj (similar to dChip)
– Main difference is that the error is minimized on the log PM scale
RMA model fitting: Median Polish
• For a given probe set with J probe pairs, let yij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.
• Assume yij = μi + αj + eij where α1 + α2 + ... + αJ = 0.
– μi: gene expression of the probe set on GeneChip i
– αj: probe affinity effect for the jth probe in the probe set
– eij: residual
• Perform Tukey’s Median Polish on the matrix of yij values with yij in the ith row and jth column.
An Example (from Dan Nettleton)
Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.
                 Probe
GeneChip    1    2    3    4    5
   1        4    3    6    4    7
   2        8    1   10    5   11
   3        6    2    7    8    8
   4        9    4   12    9   12
   5        7    5    9    6   10
An Example (continued)
Original matrix, with row medians on the right:
 4   3   6   4   7   |  4
 8   1  10   5  11   |  8
 6   2   7   8   8   |  7
 9   4  12   9  12   |  9
 7   5   9   6  10   |  7
Matrix after removing row medians:
 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3
An Example (continued)
Matrix after removing row medians:
 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3
Column medians:
 0  -5   2   0   3
Matrix after subtracting column medians:
 0   4   0   0   0
 0  -2   0  -3   0
-1   0  -2   1  -2
 0   0   1   0   0
 0   3   0  -1   0
An Example (continued)
Current matrix, with row medians on the right:
 0   4   0   0   0   |  0
 0  -2   0  -3   0   |  0
-1   0  -2   1  -2   | -1
 0   0   1   0   0   |  0
 0   3   0  -1   0   |  0
Matrix after removing row medians:
 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0
An Example (continued)
Current matrix:
 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0
Column medians:
 0   1   0   0   0
Matrix after subtracting column medians:
 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0
An Example (continued)
 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0
All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.
An Example (continued)
Original matrix − residuals from median polish = matrix of fitted values
original matrix        residuals              fitted values          row means
 4   3   6   4   7      0   3   0   0   0      4   0   6   4   7     4.2 = μ̂1
 8   1  10   5  11      0  -3   0  -3   0      8   4  10   8  11     8.2 = μ̂2
 6   2   7   8   8      0   0  -1   2  -1      6   2   8   6   9     6.2 = μ̂3
 9   4  12   9  12      0  -1   1   0   0      9   5  11   9  12     9.2 = μ̂4
 7   5   9   6  10      0   2   0  -1   0      7   3   9   7  10     7.2 = μ̂5
The row means μ̂1, ..., μ̂5 are the RMA expression measures for the 5 GeneChips.
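The worked example can be reproduced with a short median polish routine. The sketch below follows the steps of the example (alternately remove row medians and column medians until all are 0, then take row means of the fitted values). The function name `median_polish` and the `max_iter` cap are assumptions added for illustration.

```python
import statistics


def median_polish(y, max_iter=20):
    """Median polish of matrix y (rows = GeneChips, columns = probes).

    Returns (expression, residuals), where expression[i] is the row mean of
    the fitted values for GeneChip i, as in the worked example.
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    for _ in range(max_iter):
        # Sweep out row medians.
        row_meds = [statistics.median(row) for row in resid]
        resid = [[v - m for v in row] for row, m in zip(resid, row_meds)]
        # Sweep out column medians.
        col_meds = [statistics.median([resid[i][j] for i in range(n_rows)])
                    for j in range(n_cols)]
        resid = [[row[j] - col_meds[j] for j in range(n_cols)] for row in resid]
        # Converged once a full sweep finds only zero medians.
        if all(m == 0 for m in row_meds) and all(m == 0 for m in col_meds):
            break
    fitted = [[y[i][j] - resid[i][j] for j in range(n_cols)]
              for i in range(n_rows)]
    expression = [sum(row) / n_cols for row in fitted]
    return expression, resid


# The probe set from the example: 5 GeneChips (rows) by 5 probes (columns).
example = [
    [4, 3, 6, 4, 7],
    [8, 1, 10, 5, 11],
    [6, 2, 7, 8, 8],
    [9, 4, 12, 9, 12],
    [7, 5, 9, 6, 10],
]
expression, residuals = median_polish(example)
```

Running the routine on the example matrix gives the same residual matrix and the same five expression measures as the slides.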
Method Comparison Standard
• Spike-ins: introduce markers with known concentration (intensity) to RNA samples
– Should cover a broad range of concentrations
– Run two samples with and without spike-in, see whether algorithm can detect the spike-in (differential expression)
• Dilutions:
– Serial dilutions: 1:2, 1:4, 1:8…
• Latin square spike-in captures both approaches above
• Compare both accuracy qualitatively and expression index quantitatively
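To make the Latin square idea concrete, the sketch below builds a cyclic layout in which every spike-in transcript appears at every concentration exactly once across the arrays, and every array contains every concentration exactly once. The concentration series and the name `latin_square_design` are hypothetical and are not the values of the actual Affymetrix experiment.

```python
def latin_square_design(concentrations):
    """Cyclic Latin square: rows are arrays, columns are spike-in transcripts."""
    n = len(concentrations)
    return [[concentrations[(t + a) % n] for t in range(n)] for a in range(n)]


# Hypothetical 2-fold dilution series (arbitrary units), including an absent level.
levels = [0.0, 1.0, 2.0, 4.0]
design = latin_square_design(levels)
```

Comparing any two arrays gives known fold changes for the spiked transcripts (the spike-in idea), and following one transcript across arrays gives a dilution series (the dilution idea).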
Latin Square Spike-ins
Method Comparison of Spike-in
[Figure: four panels (MAS4, MAS 5, dChip, RMA); red numbers indicate spiked genes]
Method Comparison Conclusion
• No one uses MAS4 now
• With fold change, RMA > dChip > MAS5
• With p-value, RMA ~ MAS5 > dChip
• MAS 5.0 does a good job on abundant genes
• dChip and RMA do better on less abundant genes
• Affy developed multi-chip model-based PLIER, currently open source, although no documentation
• All five models are implemented in BioConductor (open source R package)
214019_at: CCND1
Probe Mapping in Affymetrix Expression arrays
• Inconsistencies in ~5% of NetAffx probe-to-gene annotations (Perez-Iratxeta et al. 2005).
• Remapping all the probes with documented human transcripts resulted in the redefinition of ~37% of probes in Affy’s newest U133 Plus 2.0 array (Harbig et al. 2005).
– Provide new and better .cdf file for probe mapping
• Evolving gene/transcript definitions can cause ~30% difference in the differentially expressed genes (Dai et al. 2005).
Acknowledgment
• Terry Speed, Rafael Irizarry & group
• Kevin Coombes & Keith Baggerly
• Erick Rouchka
• Wing Wong & Cheng Li
• Mark Reimers
• Erin Conlon
• Larry Hunter
• Zhijin Wu
• Wei Li