Detection and Compensation of Cross-Hybridization in DNA Microarray Data
Jim Huang (1)
Joint work with Quaid Morris (1), Tim Hughes (2) and Brendan Frey (1)
(1) Probabilistic and Statistical Inference Group, University of Toronto
(2) Banting & Best Department of Medical Research, University of Toronto
Description and Applications of DNA Microarrays
Microarrays consist of a 2-D array of probes, each with a short DNA sequence attached. These sequences are called oligonucleotide sequences.
The output of each probe is approximately proportional to the amount of DNA from a given tissue that binds to the probe; the data for each probe form an N-dimensional expression profile vector, where N is the number of tissues used on the array.
DNA microarrays can be used to measure the level of gene expression across these N tissues.
Hybridization and cross-hybridization
The binding of two complementary DNA strands is called hybridization.
Ideally, an oligonucleotide probe binds only to the DNA sequence for which it was designed and to which it is complementary.
However, many DNA sequences are similar to one another, so a sequence can also bind to probes that were not designed for it.
This phenomenon is called cross-hybridization.
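[Figure: Hybridization pairs AGCTAGGAT with its exact complement TCGATCCTA. Cross-hybridization pairs the similar sequence ATCTAGAAT with the same strand TCGATCCTA. In each pair, one strand is the oligonucleotide probe and the other is DNA from the tissue sample.]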
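The difference between the two pairings in the figure can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `count_mismatches` and the variable names are hypothetical, the strands are compared position by position as drawn, and strand orientation is ignored for simplicity.

```python
# Watson-Crick pairing rules for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def count_mismatches(strand: str, partner: str) -> int:
    """Count positions where `partner` is not the complement of `strand`.

    Assumption: both strands are aligned position by position, as in the
    figure; 5'/3' orientation is not modeled here.
    """
    if len(strand) != len(partner):
        raise ValueError("strands must have the same length")
    return sum(COMPLEMENT[s] != p for s, p in zip(strand, partner))


# Sequences taken from the figure.
shared_strand = "TCGATCCTA"    # appears in both pairings
exact_partner = "AGCTAGGAT"    # hybridization
similar_partner = "ATCTAGAAT"  # cross-hybridization

print(count_mismatches(shared_strand, exact_partner))    # 0
print(count_mismatches(shared_strand, similar_partner))  # 2
```

The similar sequence differs from the exact complement at only two of nine positions, which illustrates why it may still bind.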
The trouble with cross-hybridization
With cross-hybridization, each probe can signal the presence of sequences other than the one it was designed for.
This skews the observed data away from the expected data.
[Figure: the observed expression profile vector (cross-hybridized) drawn as a sum involving the expected expression profile vector (no cross-hybridization).]
Detecting cross-hybridization (1)
To test whether cross-hybridization is affecting the gene expression data, we perform a BLAST sequence match on all oligonucleotide probe sequences used on the microarray.
Many probes are matched with sequences for which they were not specifically designed.
Detecting cross-hybridization (2)
We compute the Pearson correlation coefficient ρ between the expression profiles of BLAST-matched probe sequences and between the profiles of randomly paired probes.
Approximately 33% of the BLAST-matched probes have ρ > 0.95, whereas only 2% of randomly paired probes have ρ > 0.95.
This difference between the two distributions indicates that cross-hybridization has a significant impact on the observed gene expression data.
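The comparison described above can be sketched in Python as follows. The function names `pearson` and `fraction_above` are hypothetical, and the profiles at the bottom are made-up toy numbers used only to show the call pattern; they are not data from the study.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_u * var_v)


def fraction_above(pairs, threshold=0.95):
    """Fraction of profile pairs whose correlation exceeds `threshold`."""
    return sum(pearson(u, v) > threshold for u, v in pairs) / len(pairs)


# Toy expression profiles over N = 4 tissues (illustrative values only).
matched_pairs = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]),
    ([5.0, 1.0, 2.0, 0.5], [0.2, 3.0, 1.0, 2.5]),
]
random_pairs = [
    ([1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.5, 0.5]),
    ([0.3, 0.9, 0.1, 0.7], [2.0, 1.0, 2.5, 0.2]),
]

print("matched:", fraction_above(matched_pairs))
print("random: ", fraction_above(random_pairs))
```

Applied to the real data, the first fraction would correspond to the roughly 33% figure for BLAST-matched probes and the second to the roughly 2% figure for randomly paired probes.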
Compensating for cross-hybridization
We model the observed, cross-hybridized expression profile vector x as a matrix product of a hybridization matrix Λ and an unobserved expression profile vector z in which there is no cross-hybridization.
The elements λij of the Λ matrix are set as parameterized functions of the Gibbs free energy ΔGij between probes i and j.
To compensate for cross-hybridization, we use a generalized Expectation-Maximization algorithm in which we solve for z and Λ iteratively.
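To make the model x = Λz concrete, the sketch below builds a toy hybridization matrix, generates an observed vector from a known z, and then recovers z again. It is not the generalized Expectation-Maximization algorithm from the slides: it assumes Λ is already known and the data are noise-free, so recovering z reduces to solving a linear system. It also assumes that, for a single tissue, x and z hold one value per probe. The helper names `mat_vec` and `solve` and all numbers are hypothetical.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product: one output value per matrix row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("hybridization matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


# Toy hybridization matrix for three probes (assumed values): probes 1 and 2
# cross-hybridize with weight 0.3, probe 3 does not cross-hybridize.
Lambda = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
z_true = [2.0, 0.0, 1.0]         # expression without cross-hybridization
x_obs = mat_vec(Lambda, z_true)  # probe 2 reads nonzero despite z = 0
z_est = solve(Lambda, x_obs)     # compensation with Lambda held fixed

print("observed x:", x_obs)
print("recovered z:", z_est)
```

In the approach described above, Λ is not known in advance; its entries are parameterized through ΔGij and are re-estimated in alternation with z.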