ViaLogy Lien Chung Jim Breaux, Ph.D. SoCalBSI 2004 “Improvements to Microarray Analytical Methods and Development of Differential Expression Toolkit”


Funded by the National Science Foundation and the National Institutes of Health

Outline of Talk

Background
  Affymetrix GeneChips
  ViaLogy and Microarray Analysis

Accelerating Low Level Analysis Algorithms
  Quantile Normalization
  Median Polish

Differential Expression Toolkit
  Significance Analysis of Microarrays (SAM)

Future Direction

Affymetrix GeneChip® Microarrays

A useful tool for measuring the mRNA expression levels of thousands of genes in a biological sample

Low Level Analysis

  Signal detection: convert fluorescence to signal
  Normalization: reduce unwanted variation across chips
  Summarization: reduce the 11-20 probe intensities of each gene to a single value

Internet Resources

BioConductor
  An open source and open development software project for the analysis and comprehension of genomic data
  A collection of analysis packages implemented in the R language
  Packages used: affy, siggenes

R Project
  Open source language and environment for statistical computing and graphics
  Pros: built-in mathematical functions, supports graphics
  Cons: computationally slow

ViaLogy’s Low Level Analysis (Part 1)

Microarray image (pixel intensity) → VMAxS: signal detection via “Active Signal Processing” → CEL report (feature level signal)

CEL report → Normalization (Quantile Normalization) → Summarization (Median Polish)

Project 1: Recode RMA as a C interface from R
  Specific to ViaLogy’s input files
  Introduce a way to deal with zero values
  Break up the process into individual functions

ViaLogy’s Low Level Analysis (Part 2)

Robust Multi-array Average (RMA)
  Written in R and C (affy package)
  Specific to Affymetrix input files
  Has no special handling of zero values
  Slow run time in R

Irizarry, R. et al. (2003)

Quantile Normalization
  There is significant variation in the distribution of intensity values across arrays
  Transforms the distribution of probe intensities to be the same across arrays
  The final distribution is the average of each quantile across chips

Bolstad et al. (2003)

[Figure: density of log intensities across arrays]

Quantile Normalization cont’d

Original matrix:

  1    5    3    5
  2    1    6    7
  3    2    2    6
  4    6    1    8

Step 1. Sort each column of the original matrix:

  1    1    1    5
  2    2    2    6
  3    5    3    7
  4    6    6    8

Step 2. Take the average across each row:

  2
  3
  4.5
  6

Step 3. Set each value to its corresponding row average:

  2    2    2    2
  3    3    3    3
  4.5  4.5  4.5  4.5
  6    6    6    6

Step 4. Unsort the columns of the matrix to their original order:

  2    4.5  4.5  2
  3    2    6    4.5
  4.5  3    3    3
  6    6    2    6

Bolstad et al. (2003)
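The sort, average, substitute, and unsort steps of this example can be sketched in a few lines of Python. This is only a minimal illustration, not the C code written for this project: the function name `quantile_normalize` is hypothetical, the matrix is assumed to be a list of rows with one column per array, and ties within a column are simply broken by original row order, which may differ from the tie handling added in the project.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows (columns = arrays)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # order[j][k] is the original row index of the k-th smallest value in column j.
    # sorted() is stable, so tied values keep their original row order (an assumption).
    order = [sorted(range(n_rows), key=col.__getitem__) for col in columns]

    # Average across each row of the column-sorted matrix.
    row_means = [
        sum(columns[j][order[j][k]] for j in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]

    # Put each row average back at the position its value came from ("unsort").
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for k, i in enumerate(order[j]):
            result[i][j] = row_means[k]
    return result


if __name__ == "__main__":
    example = [
        [1, 5, 3, 5],
        [2, 1, 6, 7],
        [3, 2, 2, 6],
        [4, 6, 1, 8],
    ]
    for row in quantile_normalize(example):
        print(row)
```

Run on the 4 x 4 matrix from the slide, the sketch reproduces the final matrix of the example, with row averages 2, 3, 4.5 and 6.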

Median Polish

Summarization step used in RMA
Fits a linear model to the data for each probe set across all microarrays
Greatly reduces variability for genes expressed at lower levels

Tukey, J. (1977)

Irizarry, R. (2003)

11-20 features per gene → 1 expression value per gene

Quantile Normalization and Median Polish in C

Steps Involved . . .

  Read the literature on Quantile Normalization and Median Polish
  Use the existing R and C code as the foundation for my code
  Add functionality to deal with ties and zeroes
  Test the code for accuracy of the algorithm

Results . . .

For ~20,000 genes, 30 arrays:

            QUANTILE NORMALIZATION    MEDIAN POLISH
  R code    11 min 53 secs            4 min 43 secs
  C code    10 secs                   20 secs

To Recap . . .

CEL file → Normalization (Quantile Normalization) → Summarization (Median Polish) → Differential Expression Toolkit (Project 2)

Significance Analysis of Microarrays (SAM)

Calculate a statistic (d-score) for each gene.

$d_i;\ i = 1, 2, \ldots, p$

Order the d-scores.

$d_{(1)} \le d_{(2)} \le \ldots \le d_{(p)}$

Create B sets of random permutations of group labels. For each permutation b, calculate d-scores for all genes and order them.

$d^{b}_{(1)} \le d^{b}_{(2)} \le \ldots \le d^{b}_{(p)}$

From the B sets of ordered statistics, find the expected order statistics.

$\bar{d}_{(i)} = \frac{1}{B} \sum_{b=1}^{B} d^{b}_{(i)}$

Plot observed d-scores vs. expected d-scores and evaluate significant genes based on a user-defined threshold (Δ).

Tusher et al. (2001)

SAM Example

                Group 1             Group 2                    ordered
            1     2     3       4     5     6      d-score    d-score
  Gene 1   1.1   0.3   0.4     2.1   1.6   1.3      -1.5       -1.5
  Gene 2   0.1   1.2   0.5     1.5  -0.3  -0.3       0.3       -0.2
  Gene 3   0.7  -0.2   1.3    -0.3  -0.5   1.5       0.4        0.3
  Gene 4  -0.9   1.4   0.6    -0.6   1.0   1.3      -0.2        0.4
  Gene 5   1.5   0.8   1.0    -0.7   0.3  -0.8       1.6        1.6

Observed d-scores

$d_i = \frac{\bar{x}_2 - \bar{x}_1}{s_i + s_0}$

  $\bar{x}_1$, $\bar{x}_2$: mean of each group
  $s_i$: standard deviation
  $s_0$: fudge factor (constant)
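As a rough illustration of the d-score for a single gene in a two-group comparison, the sketch below divides the difference of the group means by a standard deviation plus the fudge factor. Several details are assumptions rather than part of the slide: the function name `d_score` is hypothetical, the standard deviation is the pooled form given by Tusher et al. (2001), the sign convention is group 2 minus group 1, and `s0` is passed in by the caller instead of being estimated from the data as SAM does.

```python
from math import sqrt


def d_score(values, labels, s0=0.0):
    """d-score of one gene; labels are 1 or 2, one per sample."""
    g1 = [v for v, g in zip(values, labels) if g == 1]
    g2 = [v for v, g in zip(values, labels) if g == 2]
    n1, n2 = len(g1), len(g2)
    mean1 = sum(g1) / n1
    mean2 = sum(g2) / n2

    # Pooled standard deviation (assumed form, after Tusher et al. 2001).
    sum_sq = sum((v - mean1) ** 2 for v in g1) + sum((v - mean2) ** 2 for v in g2)
    s = sqrt((1.0 / n1 + 1.0 / n2) * sum_sq / (n1 + n2 - 2))

    # The fudge factor s0 keeps genes with a tiny s from getting huge scores.
    return (mean2 - mean1) / (s + s0)


if __name__ == "__main__":
    labels = [1, 1, 1, 2, 2, 2]
    print(d_score([1.0, 2.0, 3.0, 5.0, 6.0, 7.0], labels))
    print(d_score([1.0, 2.0, 3.0, 5.0, 6.0, 7.0], labels, s0=1.0))
```

A larger `s0` shrinks every score toward zero, with the strongest effect on genes whose standard deviation is small. The d-scores shown in the slide's table appear to be illustrative, so this sketch is not expected to reproduce them.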

SAM Example (cont’d)

Permutation # i

                Group 1             Group 2                    ordered
            5     2     4       1     6     3      d-score    d-score
  Gene 1   1.1   0.3   0.4     2.1   1.6   1.3       0.3       -1.2
  Gene 2   0.1   1.2   0.5     1.5  -0.3  -0.3       0.9       -0.2
  Gene 3   0.7  -0.2   1.3    -0.3  -0.5   1.5      -0.2        0.3
  Gene 4  -0.9   1.4   0.6    -0.6   1.0   1.3       0.5        0.5
  Gene 5   1.5   0.8   1.0    -0.7   0.3  -0.8      -1.2        0.9

Ordered d-scores

  Permutation #1    Permutation #2    …    Permutation #B    Avg d-scores
      -1.2              -0.6                   -0.2              0.5
      -0.2              -0.3                   -0.1              0.8
       0.3               0.1                    1.0              1.3
       0.5               0.2                    1.2              1.2
       0.9               1.6                    1.3              0.6

Expected d-scores (the Avg d-scores column)
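The permutation step can be sketched as follows: shuffle the group labels, score every gene, sort the scores, and average the sorted scores position by position over all permutations. This is a simplified illustration and not the siggenes or ViaLogy implementation. The names `expected_order_statistics` and `mean_difference` are hypothetical, the plain difference of group means is only a stand-in for the real d-score, and the `seed` parameter exists just to make the example repeatable.

```python
import random


def mean_difference(values, labels):
    """Stand-in statistic (an assumption): mean of group 2 minus mean of group 1."""
    g1 = [v for v, g in zip(values, labels) if g == 1]
    g2 = [v for v, g in zip(values, labels) if g == 2]
    return sum(g2) / len(g2) - sum(g1) / len(g1)


def expected_order_statistics(data, labels, score, n_perm, seed=0):
    """Average the i-th smallest score over n_perm random label permutations."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of samples to groups
        ordered = sorted(score(row, shuffled) for row in data)
        for i, d in enumerate(ordered):
            totals[i] += d
    return [t / n_perm for t in totals]


if __name__ == "__main__":
    # Expression values from the SAM example slide (5 genes, 6 samples).
    data = [
        [1.1, 0.3, 0.4, 2.1, 1.6, 1.3],
        [0.1, 1.2, 0.5, 1.5, -0.3, -0.3],
        [0.7, -0.2, 1.3, -0.3, -0.5, 1.5],
        [-0.9, 1.4, 0.6, -0.6, 1.0, 1.3],
        [1.5, 0.8, 1.0, -0.7, 0.3, -0.8],
    ]
    labels = [1, 1, 1, 2, 2, 2]
    print(expected_order_statistics(data, labels, mean_difference, n_perm=100))
```

Because every permutation contributes an already sorted list, the averaged values come out in non-decreasing order, which is what allows them to be plotted against the ordered observed d-scores.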

SAM Example (cont’d)

SAM Implementation

Siggenes (BioConductor)
  R language (slow)
  Too many options

C interface from R
  Faster run time
  Specific to ViaLogy’s input files and functionalities

Steps
  Read the SAM literature and understand the algorithm
  Go through the Siggenes source code
  Write C code, taking out unnecessary steps and adding additional functionalities

For a data set of ~7000 genes, 8 arrays:

  SAM in R              ~60 seconds
  C interface from R    ~5 seconds

Input to SAM

Results in R

Future Direction

1. SAM
   Implementation for other study types such as “paired” and “one-class”
   Procedures for dealing with zeros

2. Differential Expression Toolkit
   Evaluate other, more accurate and efficient methods

References

Journals

  Irizarry, R. et al. (2003). “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” Biostatistics.
  Bolstad et al. (2003). “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance,” Bioinformatics.
  Tukey, J. (1977). “Exploratory Data Analysis.”
  Tusher et al. (2001). “Significance analysis of microarrays applied to the ionizing radiation response,” PNAS.

Websites

  www.bioconductor.org
  www.r-project.org
  www-stat.stanford.edu/~tibs/SAM/

Acknowledgements

SoCalBSI Members
  Prof. Jamil Momand
  Prof. Sandra Sharp
  Prof. Wendie Johnston
  Prof. Nancy Warter-Perez
  Jacqueline Heras
  Fellow Interns

Jim Breaux, Ph.D.

Sandeep Gulati, Ph.D.

Robin Hill

Juan Guitterez

Vijay Daggumati

Other Employees

National Science Foundation & National Institutes of Health

Median Polish Cont’d

and so on… until the sum of the “residuals” of the matrix is small

The probe set summary for each gene is computed from the row effect and column effect determined by Median Polish

Tukey, J. (1977)
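The sweep described here can be sketched in Python as alternately subtracting row medians and column medians while accumulating the row, column, and overall effects. This is a minimal sketch and not the project's C code. It assumes rows are probes, columns are arrays, and values are already on the log scale. The names `median_polish` and `probeset_summary`, the parameters `max_iter` and `eps`, and the stopping rule (stop once the sum of absolute residuals no longer changes appreciably) are all illustrative choices.

```python
from statistics import median


def median_polish(data, max_iter=10, eps=0.01):
    """Return (overall, row_effects, col_effects, residuals) for a list of rows."""
    resid = [[float(v) for v in row] for row in data]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    old_sum = 0.0
    for _ in range(max_iter):
        # Row sweep: move each row median out of the residuals into the row effect.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m

        # Column sweep: move each column median into the column effect.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m

        # Assumed stopping rule: residual total is zero or has stopped changing.
        new_sum = sum(abs(v) for row in resid for v in row)
        if new_sum == 0 or abs(new_sum - old_sum) < eps * new_sum:
            break
        old_sum = new_sum
    return overall, row_eff, col_eff, resid


def probeset_summary(data):
    """One expression value per array: overall effect plus that array's column effect."""
    overall, _, col_eff, _ = median_polish(data)
    return [overall + c for c in col_eff]


if __name__ == "__main__":
    # Three probes (rows) measured on three arrays (columns), purely additive data.
    probes = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
    print(probeset_summary(probes))
```

For purely additive input like the small matrix in the usage example, a single pass leaves all residuals at zero; real probe data generally needs several passes.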
