A Distribution-Free Summarization Method for Affymetrix GeneChip Arrays Zhongxue Chen, Monnie McGee, Qingzhong Liu and Richard Scheuermann Dallas Area

A Distribution-Free Summarization Method for Affymetrix GeneChip Arrays

Zhongxue Chen, Monnie McGee, Qingzhong Liu and Richard Scheuermann

Dallas Area Bioinformatics Workshop

August 29, 2006

DAB Workshop 2006 2

A new summarization method

• Distribution Free Weighted (DFW) Summarization

• Use information on variability of probe intensities to summarize Affymetrix data

• Translate variability into weights which allow downweighting of poorly performing probes


Need for Summarization

• Result of unique Affymetrix array structure

• Summarization is necessary to obtain one number for each gene

• All 11 - 20 probes interrogating each gene must be summarized into one expression value


Structure of Affymetrix Arrays

• Probe = sequence of 25 bases

• Probe pair = perfect match (PM) probe and its corresponding mismatch (MM)

• Probe set = 11 to 20 probe pairs interrogating one gene or EST

• Chips contain 6K to 54K probe sets

Image courtesy of Affymetrix


PM and MM

• PM = 25-base probe perfectly complementary to a specific region of a gene

• MM = 25-base probe agreeing with PM apart from the middle base

• Middle base is changed to its Watson-Crick complement (A↔T, G↔C)


DFW

• Transform probe-level intensities to log2 scale for all arrays in experiment

    • Stabilizes the variance (larger intensity → increased variability)

• Arrange arrays in N by R matrix

    • N = total number of PM probes

    • R = total number of arrays for entire experiment

• For each probe set, calculate a weight for each PM probe using Tukey biweight function

• Multiply each probe intensity by its weight and summarize
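As a rough illustration of the first two steps, the sketch below applies the log2 transform and groups the rows of the N by R matrix by probe set. The intensities, the `probe_set_ids` labels and all variable names are made up for illustration, and NumPy is an assumed choice rather than something the slides prescribe.

```python
import numpy as np

# Hypothetical raw PM intensities: N = 4 probes (rows), R = 3 arrays (columns).
raw_pm = np.array([
    [120.0, 150.0, 3000.0],
    [300.0, 280.0, 4100.0],
    [160.0, 170.0,  450.0],
    [210.0, 120.0, 1350.0],
])

# Hypothetical probe-set label for each row of the matrix.
probe_set_ids = np.array(["set_A", "set_A", "set_B", "set_B"])

# N x R matrix on the log2 scale.
log_pm = np.log2(raw_pm)

# Collect the rows that belong to each probe set; weights are computed per set.
probe_sets = {
    set_id: log_pm[probe_set_ids == set_id]
    for set_id in np.unique(probe_set_ids)
}
```

Each value in `probe_sets` is a J by R block (J probes in the set), which is the unit the weighting step works on.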


Calculating Weights

• Calculate range of log intensities for each PM

• Find median of the ranges (M)

• Calculate distance of range to M for each PM (call this distance x = |range − M|)

• Weighting function:

$$w(x) = \left(1 - \left(\frac{x}{\max(x)}\right)^{2}\right)^{2}$$


Probe Weights

• Weight for probe i is given by

$$w_i = \frac{w(x_i)}{\sum_{j=1}^{J} w(x_j)}$$

• J = number of probes in the probe set
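A minimal sketch of the range-based weights for one probe set, assuming x is the absolute distance between each probe's range and M. The function name `range_weights`, the toy data, and the fallback to equal weights when the formula is undefined (all ranges identical, or every w(x) equal to zero) are assumptions added here, not part of the slides.

```python
import numpy as np

def range_weights(log_pm):
    """Normalized Tukey-biweight weights for one probe set.

    log_pm: 2-D array, one row per PM probe, one column per array (log2 scale).
    """
    log_pm = np.asarray(log_pm, dtype=float)
    ranges = log_pm.max(axis=1) - log_pm.min(axis=1)  # range of each PM across arrays
    m = np.median(ranges)                             # M, the median of the ranges
    x = np.abs(ranges - m)                            # distance of each range to M
    x_max = x.max()
    # Assumption: if every range equals M, treat all probes alike.
    w = np.ones_like(x) if x_max == 0 else (1.0 - (x / x_max) ** 2) ** 2
    total = w.sum()
    if total == 0:
        # Assumption: fall back to equal weights when every w(x) is zero.
        return np.full(x.size, 1.0 / x.size)
    return w / total                                  # w_i, sums to 1

# Toy probe set: 3 probes on 3 arrays (made-up log2 intensities).
toy = np.array([
    [5.0, 6.0,  9.0],   # range 4.0
    [7.0, 7.5, 10.5],   # range 3.5 (the median range)
    [8.0, 8.2,  8.5],   # range 0.5, farthest from the median
])
weights = range_weights(toy)
```

In this toy set the third probe lies farthest from the median range, so it receives weight 0, and the probe sitting at the median receives the largest weight.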


More Calculations

• Weighted Range (WR)

    • Range of weighted intensities

• Weighted Standard Deviation (WSD)

• Transformed Intensity Values (TIV)

    • Standardizes measures between DEGs and non-DEGs

$$\text{ExpValue} = \min(\text{int}) + (\text{TIV})\left(WR^{m}\, WSD^{n}\right)$$

m and n should be positive integers


Example

Probe  Array 1  Array 2  Array 3  Array 4  Array 5  Array 6  range  x     w(x)  w_i   SD    w_i(SD)
PM1    5.8      6.2      5.9      9.5      10.1     9.2      4.3    0.45  0.86  0.32  2.02  0.30
PM2    8.2      7.9      7.8      11.7     12.0     10.7     4.2    0.35  0.91  0.34  1.97  0.35
PM3    7.3      7.4      8.1      8.8      7.9      9.5      2.2    1.65  0     0     0.85  0
PM4    7.7      6.9      7.4      10.4     9.3      8.5      3.5    0.35  0.91  0.34  1.31  0.35

M = 3.85, max(x) = 1.65

Weighted Intensities: 7.26 7.02 7.06 10.55 10.47 9.47

Transformed Intensities (TI): 0.07 0 0.01 1 0.98 0.69

Weighted Range (WR): 10.55 - 7.02 = 3.53

Weighted SD (WSD): 1.75

Expression values (m=3, n=1): 7.28 7.02 7.06 10.87 10.78 9.69
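The intermediate numbers in this example can be checked with a short sketch. It assumes the SD column is the sample standard deviation across arrays and that the w_i(SD) column comes from applying the same biweight weighting to the SDs instead of the ranges; both assumptions reproduce the printed values to two decimals. The helper name `biweight_weights` is hypothetical, and the sketch stops before the final expression values.

```python
import numpy as np

# Log2 intensities from the example: rows PM1..PM4, columns arrays 1..6.
pm = np.array([
    [5.8, 6.2, 5.9,  9.5, 10.1,  9.2],
    [8.2, 7.9, 7.8, 11.7, 12.0, 10.7],
    [7.3, 7.4, 8.1,  8.8,  7.9,  9.5],
    [7.7, 6.9, 7.4, 10.4,  9.3,  8.5],
])

def biweight_weights(stat):
    """Normalized weights from the distance of each value to the median."""
    x = np.abs(stat - np.median(stat))
    w = (1.0 - (x / x.max()) ** 2) ** 2
    return w / w.sum()

ranges = pm.max(axis=1) - pm.min(axis=1)   # 4.3, 4.2, 2.2, 3.5
sds = pm.std(axis=1, ddof=1)               # sample SD of each PM across arrays

w_range = biweight_weights(ranges)         # the w_i column
w_sd = biweight_weights(sds)               # the w_i(SD) column

weighted = w_range @ pm                    # weighted intensity for each array
wr = weighted.max() - weighted.min()       # weighted range (WR)
ti = (weighted - weighted.min()) / wr      # transformed intensities, scaled to [0, 1]
wsd = w_sd @ sds                           # weighted SD (WSD)

print(np.round(weighted, 2), np.round(ti, 2), round(wsd, 2))
```

Computed from unrounded weights, WR comes out at about 3.54; the 3.53 shown above is the difference of the rounded weighted intensities.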


Why Weight?

• Some PMs may have poor behavior

    • Give small or 0 weight to “poor” PM

• Use information across arrays

    • Assess quality of PM based on overall behavior

• SD of range provides information for detecting differentially expressed genes


Probe Performance

Poorly performing probes


Comparison Data Sets

• Affymetrix Latin Square Spike-In Experiments

    • Two experiments: on HGU-95Av2 platform and HGU-133A platform

    • HGU-95 experiment has 14 transcripts spiked in at concentrations from 0 to 1024 pM (59 arrays)

    • HGU-133 experiment has 42 transcripts spiked in, in triplicate, at concentrations from 0 to 512 pM (42 arrays)

    • McGee and Chen (2006) report 22 more spike-ins

• “GoldenSpike” Experiment (Choe et al., 2005)

    • Six arrays (3 experiment, 3 control) on DrosGenome1 Chip

    • 1309 transcripts spiked in at known fold differences (from 1.2 to 4)

    • 2551 transcripts included at the same concentration


Comparison Methods

• ROC curves, AUC values and CPU time

• Competitors:

    • Robust Multichip Average (RMA): Bolstad, 2004; Irizarry et al., 2003

    • Gene Chip RMA (GCRMA): Wu et al., 2004

    • MAS 5.0, PLIER: Affymetrix, 2002, 2005

    • Model-Based Expression Index (MBEI): Li & Wong, 2001a,b

    • Factor Analysis for Robust Array Summarization (FARMS): Hochreiter et al., 2006
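For readers unfamiliar with the ROC/AUC criterion, here is a generic sketch of how an ROC curve and its area can be computed when the true spike-in status of every probe set is known. It is not necessarily the procedure used for the comparisons in this talk. The function name `roc_auc`, the scores and the labels are hypothetical, and tied scores are not given any special treatment.

```python
import numpy as np

def roc_auc(scores, is_spike_in):
    """ROC curve and trapezoidal AUC; a higher score means 'more likely DEG'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_spike_in, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # rank from highest to lowest score
    labels = labels[order]
    # Rates after calling the top k probe sets differentially expressed, k = 0..n.
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~labels) / (~labels).sum()))
    # Area under the piecewise-linear curve (trapezoidal rule).
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc

# Hypothetical scores for six probe sets; 1 marks a true spike-in.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
truth = [1, 1, 0, 1, 0, 0]
fpr, tpr, auc = roc_auc(scores, truth)
```

Here 8 of the 9 (spike-in, non-spike-in) pairs are ranked correctly, so `auc` is 8/9; a method that ranks every spike-in above every non-spike-in would reach 1.0.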


HGU-95 dataset


HGU-133 dataset (64 spike-ins)


“Preferred” Method

• Choe et al. tested dozens of combinations of background correction, normalization, and summarization methods

• Preferred = the “best performing” method (according to DEGs obtained by CyberT; Baldi & Long, 2001)

• MAS 5.0 background correction, quantile normalization, median polish summarization, and a second, expression-level normalization using a LOESS procedure


GoldenSpike Data (FC = 1.2)

[Figure: ROC curves, True Positive Rate vs. False Positive Rate (both axes from 0.0 to 1.0), for RMA, RMA NoBG, GCRMA, MAS5, DCHIP, PLIER, FARMS, DFW and PREFF]


Overall Area Under the Curve

Method     HGU-95 (a)  HGU-133 (a)  Choe (b)
DFW        1.00        1.00         0.85
FARMS      0.91        0.95         0.83
GCRMA      0.69        0.57         0.88
RMA        0.60        0.63         0.77
RMA-noBG   0.65        0.63         0.82
MAS 5      0.05        0.06         0.39
MBEI       0.26        0.40         0.76
PLIER      0.03        0.20         0.50

(a) From Affycomp II competition: 16 spike-ins for HGU-95, 42 spike-ins for HGU-133

(b) All spike-ins


Computation Speed (in seconds)


Method     HGU-95  HGU-133  Choe
DFW        112     150      68
FARMS      132     198      280
GCRMA      214     210      78
RMA        342     388      150
RMA-noBG   299     353      147
MAS 5      953     1064     130
MBEI       869     833      269
PLIER      321     239      17


Further Comparisons

• Affycomp II Competition

    • Cope et al., 2004; Irizarry et al., 2006

    • For HGU-95 spike-in data, uses 16 spike-ins

    • For HGU-133 spike-in data, uses 42 spike-ins

    • http://affycomp.biostat.jhsph.edu/AFFY2/TABLES.hgu/0.html

• SMU Technical Report: http://www.smu.edu/statistics/TechReports/TR344.pdf

• Monnie McGee’s website: http://faculty.smu.edu/mmcgee


References

Affymetrix, Inc. (2002) Statistical algorithms description document.

Affymetrix, Inc. (2005) Technical note: guide to probe logarithmic intensity error (PLIER) estimation.

Baldi, P. and Long, A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509-519.

Bolstad, B.M. (2004) Low level analysis of high-density oligonucleotide array data: background, normalization and summarization [dissertation]. Department of Statistics, University of California at Berkeley.

Choe, S.E. et al. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol., 6, R16.1-R16.6.

Cope, L.M. et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323-331.

Hochreiter, S. et al. (2006) A new summarization method for Affymetrix probe level data. Bioinformatics, 22, 943-949.

Irizarry, R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.

Irizarry, R.A. et al. (2006) Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22, 789-794.

Li, C. and Wong, W.H. (2001a) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Nat. Acad. Sci., 98, 31-36.

Li, C. and Wong, W.H. (2001b) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol., 2, research0032.1-0032.11.

McGee, M. and Chen, Z. (2006) New spiked-in probe sets for the Affymetrix HG-U133A Latin Square experiment. COBRA Preprint Series, Article 5.

Wu, Z. et al. (2004) A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99, 909-917.
