ViaLogy
Lien Chung
Jim Breaux, Ph.D.
SoCalBSI 2004
“Improvements to Microarray Analytical Methods and Development of Differential Expression Toolkit”
Funded by the National Science Foundation and the National Institutes of Health
Outline of Talk
Background: Affymetrix GeneChips; ViaLogy and Microarray Analysis
Accelerating Low Level Analysis Algorithms: Quantile Normalization; Median Polish
Differential Expression Toolkit: Significance Analysis of Microarrays (SAM)
Future Direction
Affymetrix GeneChip® Microarrays
Useful tool to measure the level of mRNA expression of thousands of genes in a biological
sample
Low Level Analysis
Signal detection: convert fluorescence to signal
Normalization: reduce unwanted variation across chips
Summarization: reduce the 11-20 probe intensities of each gene to a single value
Internet Resources
BioConductor
An open source, open development software project for the analysis and comprehension of genomic data
A collection of analysis packages implemented in the R language
Packages used: affy, siggenes
R Project
Open source language and environment for statistical computing and graphics
Pros: built in mathematical functions, supports graphics
Cons: computationally slow
ViaLogy’s Low Level Analysis (Part 1)
VMAxS: Microarray image (pixel intensity) -> Signal detection via “Active Signal Processing” -> CEL report (feature level signal)
CEL report -> Normalization (Quantile Normalization) -> Summarization (Median Polish)
Project 1: Recode RMA as a C interface from R
Specific to ViaLogy’s input files
Introduce a way to deal with zero values
Break up process into individual functions
ViaLogy’s Low Level Analysis (Part 2)
Robust Multi-Chip Analysis (RMA)
Written in R and C (affy package)
Specific to Affymetrix input files only
Has no special handling of zero values
Slow run time in R
Irizarry, R. et al. (2003)
Quantile Normalization
Significant variation in the distribution of intensity values across arrays
Transforms the distribution of probe intensities to be the same across arrays
Final distribution is the average of each quantile across chips
Bolstad et al. (2003)
[Figure: density vs. log intensities]
Quantile Normalization cont’d
Original matrix (one column per array):
1  5  3  5
2  1  6  7
3  2  2  6
4  6  1  8

Step 1. Sort each column of the original matrix:
1  1  1  5
2  2  2  6
3  5  3  7
4  6  6  8

Step 2. Take the average across each row:
2
3
4.5
6

Step 3. Set each value to its corresponding row average:
2    2    2    2
3    3    3    3
4.5  4.5  4.5  4.5
6    6    6    6

Step 4. Unsort the columns of the matrix to their original order:
2    4.5  4.5  2
3    2    6    4.5
4.5  3    3    3
6    6    2    6
Bolstad et al. (2003)
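The four steps of the worked example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C code described in this talk: the function name `quantile_normalize` is hypothetical, the matrix is a plain list of rows (one column per array), and ties within a column are simply broken by original row order rather than handled specially.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows (columns = arrays)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: for each column, record which row holds the smallest value,
    # the second smallest, and so on (ties keep their original row order).
    orders = [sorted(range(n_rows), key=lambda i, c=col: c[i]) for col in columns]
    sorted_cols = [[col[i] for i in order] for col, order in zip(columns, orders)]

    # Step 2: average across each row of the sorted matrix.
    rank_means = [sum(sc[r] for sc in sorted_cols) / n_cols for r in range(n_rows)]

    # Steps 3 and 4: give every value the mean of its rank, written back
    # into the value's original position.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# The 4 x 4 matrix from the worked example above.
example = [
    [1, 5, 3, 5],
    [2, 1, 6, 7],
    [3, 2, 2, 6],
    [4, 6, 1, 8],
]
normalized = quantile_normalize(example)
```

After normalization every column contains the same set of values (2, 3, 4.5, 6), only in a different order, which is what makes the intensity distributions identical across arrays.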
Median Polish
Summarization step used in RMA
Fits a linear model to the data for each probe set across all microarrays
Greatly reduces variability for genes expressed at lower levels
Tukey, J. (1977)
Irizarry, R. (2003)
11-20 features per gene -> 1 expression value per gene
Quantile Normalization and Median Polish in C
Steps Involved . . .
Read literature on Quantile Normalization and Median Polish
Use R and C code as foundation for my code
Add functionality to deal with ties and zeroes
Test code for accuracy of the algorithm
Results . . .
For ~20,000 genes, 30 arrays:
          QUANTILE NORMALIZATION   MEDIAN POLISH
R code    11 min 53 secs           4 min 43 secs
C code    10 secs                  20 secs
To Recap . . .
CEL file -> Normalization (Quantile Normalization) -> Summarization (Median Polish) -> Differential Expression Toolkit (Project 2)
Significance Analysis of Microarrays (SAM)
1. Calculate a statistic (d-score) for each gene: $d_i,\; i = 1, 2, \ldots, p$
2. Order the d-scores: $d_{(1)} \le d_{(2)} \le \ldots \le d_{(p)}$
3. Create B sets of random permutations of group labels. For each permutation b, calculate d-scores for all genes and order them: $d^{b}_{(1)} \le d^{b}_{(2)} \le \ldots \le d^{b}_{(p)}$
4. From the B sets of ordered statistics, find the expected order statistics: $\bar{d}_{(i)} = \frac{1}{B} \sum_{b=1}^{B} d^{b}_{(i)}$
5. Plot observed d-scores vs. expected d-scores and evaluate significant genes based on a user-defined threshold (Δ)
Tusher et al. (2001)
SAM Example
Observed d-scores
          Group 1            Group 2
Sample    1     2     3      4     5     6      d-score   ordered d-score
Gene 1    1.1   0.3   0.4    2.1   1.6   1.3    -1.5      -1.5
Gene 2    0.1   1.2   0.5    1.5   -0.3  -0.3   0.3       -0.2
Gene 3    0.7   -0.2  1.3    -0.3  -0.5  1.5    0.4       0.3
Gene 4    -0.9  1.4   0.6    -0.6  1.0   1.3    -0.2      0.4
Gene 5    1.5   0.8   1.0    -0.7  0.3   -0.8   1.6       1.6
$d_i = \frac{\bar{x}_2 - \bar{x}_1}{s_i + s_0}$
$\bar{x}_1, \bar{x}_2$ = mean of group 1 and of group 2 (for gene i)
$s_i$ = stand. dev
$s_0$ = fudge factor (constant)
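As a rough illustration of the d-score formula, the sketch below computes it for one gene. The slide only labels $s_i$ as a standard deviation; the pooled form used here follows the definition in Tusher et al. (2001) and is an assumption about what was intended. The function name `d_score` and the default `s0=0.1` are hypothetical, and the numbers this produces are not meant to reproduce the d-scores shown in the example table.

```python
import math


def d_score(group1, group2, s0=0.1):
    """Relative difference d = (mean2 - mean1) / (s + s0) for a single gene."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2

    # Sum of squared deviations of every sample from its own group mean.
    ss = (sum((x - mean1) ** 2 for x in group1)
          + sum((x - mean2) ** 2 for x in group2))

    # Assumed form of s_i: pooled standard error of the difference in means.
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))

    # s0 (the fudge factor) keeps d from blowing up when s is close to zero.
    return (mean2 - mean1) / (s + s0)


# Gene 1 from the example table: samples 1-3 are Group 1, samples 4-6 are Group 2.
gene1 = [1.1, 0.3, 0.4, 2.1, 1.6, 1.3]
d_gene1 = d_score(gene1[:3], gene1[3:])
```

Swapping the two groups flips the sign of the score, so the sign only records which group has the higher mean.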
SAM Example (cont’d)
Permutation # i
          Group 1            Group 2
Sample    5     2     4      1     6     3      d-score   ordered d-score
Gene 1    1.1   0.3   0.4    2.1   1.6   1.3    0.3       -1.2
Gene 2    0.1   1.2   0.5    1.5   -0.3  -0.3   0.9       -0.2
Gene 3    0.7   -0.2  1.3    -0.3  -0.5  1.5    -0.2      0.3
Gene 4    -0.9  1.4   0.6    -0.6  1.0   1.3    0.5       0.5
Gene 5    1.5   0.8   1.0    -0.7  0.3   -0.8   -1.2      0.9

Expected d-scores
Ordered d-scores   Permutation #1   Permutation #2   …   Permutation #B   Avg d-scores
                   -1.2             -0.6                 -0.2             0.5
                   -0.2             -0.3                 -0.1             0.8
                   0.3              0.1                  1.0              1.3
                   0.5              0.2                  1.2              1.2
                   0.9              1.6                  1.3              0.6
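The permutation step can be sketched as follows. This is a minimal illustration, not the siggenes code or the C implementation from this talk: the helper names, the number of permutations (`n_perm=200`), the fudge factor (`s0=0.1`), the fixed random seed, and the pooled form of the standard deviation (taken from Tusher et al. 2001) are all assumptions. Group labels are shuffled, d-scores are recomputed and sorted for each shuffle, and the sorted lists are averaged position by position.

```python
import math
import random


def d_score(values, labels, s0):
    """d-score of one gene, given its values and the group label of each sample."""
    g1 = [v for v, g in zip(values, labels) if g == 1]
    g2 = [v for v, g in zip(values, labels) if g == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    # Assumed pooled standard error (see the lead-in).
    s = math.sqrt((1.0 / len(g1) + 1.0 / len(g2)) * ss / (len(g1) + len(g2) - 2))
    return (m2 - m1) / (s + s0)


def ordered_d_scores(data, labels, s0=0.1):
    """d-scores for all genes under one labeling, sorted ascending."""
    return sorted(d_score(row, labels, s0) for row in data)


def expected_d_scores(data, labels, n_perm=200, s0=0.1, seed=0):
    """Average the ordered d-scores over n_perm random label permutations."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute the group labels, not the data
        for k, d in enumerate(ordered_d_scores(data, shuffled, s0)):
            totals[k] += d
    return [t / n_perm for t in totals]


# The five genes and six samples from the example tables.
DATA = [
    [1.1, 0.3, 0.4, 2.1, 1.6, 1.3],
    [0.1, 1.2, 0.5, 1.5, -0.3, -0.3],
    [0.7, -0.2, 1.3, -0.3, -0.5, 1.5],
    [-0.9, 1.4, 0.6, -0.6, 1.0, 1.3],
    [1.5, 0.8, 1.0, -0.7, 0.3, -0.8],
]
LABELS = [1, 1, 1, 2, 2, 2]

observed = ordered_d_scores(DATA, LABELS)
expected = expected_d_scores(DATA, LABELS)
```

Pairing `observed[k]` with `expected[k]` gives the points for the observed vs. expected plot; genes whose points fall far from the diagonal are the candidates that the threshold Δ is applied to.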
SAM Example (cont’d)
SAM Implementation
Siggenes (BioConductor)
R language (slow)
Too many options
C interface from R
Faster run time
Specific to ViaLogy’s input files and functionalities
Read SAM literature and understand algorithm
Go through Siggenes source code
Write C code, taking out unnecessary steps and adding additional functionalities
For a data set of ~7000 genes, 8 arrays:
SAM in R: ~60 seconds
C interface from R: ~5 seconds
Input to SAM
Results in R
Future Direction
1. SAM
   Implementation for other study types such as “paired” and “one-class”
   Procedures for dealing with zeros
2. Differential Expression Toolkit
   Evaluate other more accurate and efficient methods
References
Journals
Irizarry, R. et al. (2003). “Exploration, normalization, and summaries of high density oligonucleotide array probe level data.” Biostatistics.
Bolstad et al. (2003). “A comparison of normalization methods for high density oligonucleotide array data based on bias and variance.” Bioinformatics.
Tukey, J. (1977). “Exploratory Data Analysis.”
Tusher et al. (2001). “Significance analysis of microarrays applied to the ionizing radiation response.” PNAS.
Websites
www.bioconductor.org
www.r-project.org
www-stat.stanford.edu/~tibs/SAM/
Acknowledgements
SoCalBSI Members
Prof. Jamil Momand
Prof. Sandra Sharp
Prof. Wendie Johnston
Prof. Nancy Warter-Perez
Jacqueline Heras
Fellow Interns
Jim Breaux, Ph.D.
Sandeep Gulati, Ph.D.
Robin Hill
Juan Guitterez
Vijay Daggumati
Other Employees
National Science Foundation & National Institutes of Health
Median Polish Cont’d
and so on… until the sum of the “residuals” of the matrix is small
The probeset summary for each gene is computed by taking into account the row effect and column effect that are determined by Median Polish
Tukey, J. (1977)
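The alternating row and column sweeps can be sketched as below. This is a minimal illustration of Tukey's median polish under stated assumptions, not the C code from this talk: the function name `median_polish`, the iteration cap, and the stopping rule (stop once the sum of absolute residuals is zero or has nearly stopped changing) are choices made here. Rows are assumed to be the probes of one probe set and columns the arrays, with values already on the log scale.

```python
from statistics import median


def median_polish(data, max_iter=10, eps=0.01):
    """Fit data[i][j] ~ overall + row_eff[i] + col_eff[j] by repeated median sweeps."""
    n_rows, n_cols = len(data), len(data[0])
    resid = [list(map(float, row)) for row in data]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    old_sum = 0.0

    for _ in range(max_iter):
        # Row sweep: move each row's median out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift

        # Column sweep: move each column's median out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift

        # Assumed stopping rule: residual sum is zero or barely changing.
        new_sum = sum(abs(v) for row in resid for v in row)
        if new_sum == 0 or abs(new_sum - old_sum) < eps * new_sum:
            break
        old_sum = new_sum

    return overall, row_eff, col_eff, resid


# Hypothetical probe set: 3 probes (rows) measured on 3 arrays (columns).
probe_set = [
    [2.0, 4.0, 6.0],
    [3.0, 5.0, 7.0],
    [4.0, 6.0, 8.0],
]
overall, row_eff, col_eff, resid = median_polish(probe_set)

# One summary value per array: overall level plus that array's column effect.
expression = [overall + c for c in col_eff]
```

Here the row effects play the role of probe effects and the column effects the role of array effects, so the per-array summary uses the overall level and the column effect while the probe effects are set aside.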