48
MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics/ University of Pittsburgh © 2005 all contents, except those reproduced from copyrighted sources and reproduced with permission and citation. All reproduced content is protected by the original copyright owners. No content may be reproduced in physical or electronic form without permission. All rights reserved.

MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

MICROARRAY DATA ANALYSIS  

Mark Kon, Boston UniversityPresentation of joint work with Dr. James Lyons-Weiler

Centers for Pathology and Oncology Informatics/University of Pittsburgh

 © 2005 all contents, except those reproduced from copyrighted sources and reproduced

with permission and citation. All reproduced content is protected by the original copyright owners. No content may be reproduced in physical or electronic form without permission. All rights reserved.

Page 2: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

I. Microarray Technologies

Page 3: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics
Page 4: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

We are at the cusp of a wave…

But: we can be swallowed

up by data!

Page 5: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

“Probe”: single-stranded DNA with a defined identity tethered to a solid medium “Target”: the labeled DNA or RNA

Page 6: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Two Main Types of Microarrays

cDNA arrays: spotted onto surface

oligonucleotide arrays: created on surface

Page 7: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

microarray chip

population ofcDNA

Page 8: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Target labelling

Cyanine dyes (Cy3/Cy5)

Page 9: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Custom cDNA Arrays • Competitive hybridization,

typically uses 2-dye labelling

(Cy3, Cy5)• RT-PCR to amplify targets

Page 10: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics
Page 11: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Disadvantages

• Spot effects (identifying spot location)• Channel differences

– labeling efficiencies of the cDNA targets– Nonlinearities– Intensity-related heteroscedasticity – residuals

may be input-dependent– Confounding: sometimes many other variables

Page 12: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

From: "Data Analysis Tools for DNA microarrays" by S. Draghici, published byChapman and Hall/CRC Press.

Page 13: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Proposed solutions

• Adaptive normalization (intensity-specific, location-specific; locally weighted regression methods)

• Dye-flip or dye-swap experiments (twice the cost!)

Page 14: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Dye Strategies

• Custom: Typically 2 dye (but not necessarily!)

• Affymetrix, Amersham: Single Dye

• Agilent and Mergen:Two-dye

Page 15: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Oligonucleotide Arrays

• Lockhart et al. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13):1675-80.

Photolithography; 25-mers, 16 probes per probe set

e.g., Affymetrix/Agilent

Page 16: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Spotted Arrays• Amersham: CodeLink™ System:

Oligonucleotides • Mergen: Oligonucleotides• cDNA spotted technology• 3-D surface (contact for covalent attachment

of probes)

Next two slides; © 2003, Phillip Stafford and Peng Liu. Ch. 15, Microarray technology, comparison, statistical analysis and experimental design, IN: Microarray Methods and Applications: Nuts & Bolts (G. Hardiman, ed., DNA Press)

Page 17: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics
Page 18: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Affymetrix Agilent Amersham Mergen

Human liver vs. human heart: 3904/22,283 (18%)Human liver vs. human liver: 3875/22,283 (17%)Human heart vs. human heart: 4026/22,283 (18%)

Human liver vs. human heart: 6963/14,159 (49%)Human liver vs. human liver: 5129/14,159 (36%)Human heart vs. human heart: 1204/18,006 (6%)

Human liver vs. human heart: 8572/11,904 (72%)Human liver vs. human liver: 2811/11,904 (24%)Human heart vs. human heart: 3515/11,904 (30%)

Heart replicates Heart replicates Heart replicates Heart replicates

Heart:Liver Heart:Liver Heart:Liver Heart:Liver

Human liver vs. human heart: 2595/9970 (26%)Human liver vs. human liver: 318/9778 (3%)Human heart vs. human heart: 454/9772 (5%)

Page 19: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Golden Rule

• A technological fix to a problem is always preferred to a statistical fix.

Page 20: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Predictive Genomics, Biology, Medicine

• Learning theory: SLT – what is it?• Parametric statistics – small number of parameters –

appropriate to small amounts of data• Ex. Find mean m and standard deviation s for a

normal distribution from sample data.• Nonparametric statistics – large number of

parameters – appropriate to large amounts of data• Ex. Neural Network, RBF network, support vector

machine

Page 21: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Genomics: Current interests:• New algorithms for classification of and prediction from

microarray gene expression data.• Genome: about 50,000 genes• Gene expression in cell reflects physiological factors and

processes. • Discovery of patterns in gene expression data: major

computational challenge. • Includes genome and genetic regulation and expression

information. • Information important in diagnosing physiological factors, for

example:– nature of disease, e.g. tumor – state and prognosis for a genetically inherited disease

Page 22: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Technology:• New, error-prone - statistical analysis must tease apart errors as

well as many physiological factors present. Current methods of classification may not be as effective or accurate as they can be.

• Understanding physiological correlates of gene expression (hence protein expression) promises to provide insight into conditions and diseases whose etiologies have been difficult to understand, e.g.:

• autism• multiple sclerosis• muscular dystrophy• propensities for cancers and arteriosclerosis, • Alzheimer’s disease

Preliminary results have been obtained in these areas.

Page 23: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Purpose of project: work on aspects of such an approach.

• Our work involves modeling, simulation, and algorithm based approaches to classification and prediction of cell physiology from microarray information.

Page 24: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Major aspect: deal with numerical simulations and their complexity.

• Emphasize accuracy of statistical models• Computed algorithm discovery methods will search

for algorithms appropriate to models. • Subarray cocluster patterns (patterns occurring for

subsets of genes and of the population). • Computational demands require the high

performance resources of Center for Computational Science at BU

• Statistical models of microarray experiments: Gene Expression Data Simulator (GEDS) at University of Pittsburgh

Page 25: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Numerical Simulations and their complexity

• Error of classification, prediction algorithms calculated with Monte Carlo simulations on GEDS

• Algorithms for discovery of subarray coclusters, testing sparse data for underlying distribution families, extending regression-attraction algorithm.

• Will also develop local numerical algorithms for "customized" predictions for individuals from microarrays.

Page 26: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Collaboration: • Boston University (Mathematics and Statistics,

Microarray Resource at the Medical School, and Center for Computational Science; Bioinformatics Program

• University of Massachusetts Lowell (Mathematics and Statistics)

• University of Pittsburgh Medical School (Medical School microarray core laboratory; UPCI Cancer Biomarkers Laboratory, PittArray core laboratory)

• Ben Gurion University in Israel (BGU Human Molecular Genetics Lab, Computer Science Department’s Bioinformatics Program)

Page 27: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Goals: answer questions -

• Can computer implementations of microarray models be used to improve them?

• Can model and parameter determination be accomplished computationally?

• Can statistical algorithms to solve the models be tested, developed, and improved on such models?

Page 28: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Goal: answer questions -• Can statistical methods improve the yield of microarray

information for small numbers of subjects?• Can sub-patterns (patterns in subsets of the genome and

population) in microarray data be verified, discovered, and used?

• What are maximal levels of information which can be obtained from gene expression information? Can we obtain probabilities that a queried patient belongs to a given trained group together with confidence bounds?

• What can simulation of the genetic expression profile of cancer cells reveal about potential responses to therapies?

Page 29: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Statistical methods work better when they incorporate biological models as a priori information.

• Strategy: divide translation of physiological models into algorithms into three parts:

• biological modeling• statistical modeling• algorithm development• From biology to statistical modeling: two way

process• biology statistical model simulated

biological data (with scalable microarray simulation).

Page 30: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Algorithm simulation

After statistical model is decided on: find algorithms which solve model – given complexity of good algorithms, we will use Monte Carlo to gauge efficiency via microarray simulator.

Page 31: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Further study:

Automated algorithm development via search methods within algorithm classes

Page 32: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Co-regulation of genes: • Many new methods (e.g. graph-theoretic

methods of cataloguing coregulation from published literature)

• Will study automated methods of incrementing statistical models with information, and incorporating models into simulator. Simulator will allow testing models, via comparisons of simulator and biological data. Such objective tests of models do not presently exist.

Page 33: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

More specific aims:• 1: Develop tests of statistical models of microarrays

through comparisons with computational simulations, and develop new models and methodologies on this basis, and discover and modify algorithms for these models.

• 2: Develop optimal robust classification algorithms for microarrays based on models, with probability estimates of classification membership and confidence bounds, and develop statistical methods for reducing patient sample sizes necessary.

Page 34: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

More specific aims• 3: Test classification algorithms and improve

statistical properties through Monte Carlo simulation of accuracies, and use (low and eventually high dimensional) search techniques to find better algorithms.

• 4: Study search algorithms for discovering subarrays containing patterns not visible in full arrays.

• 5: Test and apply these methodologies to existing cancer databases for differentiating cancer gene expression information.

• 6: Apply these methods to develop software for practitioners using microarrays.

Page 35: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Outcomes: new tools for differentiating microarray clusters will be available

• Information based on clustering of interacting genes and sub-populations affected by them will be obtainable from microarray analysis.

• More accurate statistical models of microarrays will be implementable and testable using microarray simulator under development at the University of Pittsburgh.

• Classification algorithms with tunable parameters (appropriate to different biological models) will be available, with class probability estmates and confidence bounds.

Page 36: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Outcomes: new tools for differentiating microarray clusters will be available

• Applications of the above techniques to the development of diagnostic tools for differentiating cancer gene expression information will be developed

• Open source software implementing this work will be available.

Page 37: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Outcomes:• Emphasize implementable algorithms for

diagnosis, classification, and prediction. • Differentiation of cancer gene expression

profiles has the potential to greatly improve the use of cancer therapies.

• Extend also to psychiatric drugs in which responsiveness to therapies seems often to be individual parameter.

Page 38: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Approach: separation of model from algorithm.

• Modeling: biological problem • Once model is found, finding algorithm which decides

which class (e.g., metastatic or non-metastatic tumors) microarray comes from becomes a purely statistical and computational; notions of complexity and optimality then become appropriate and well-defined.

• Correspondingly, errors from classification algorithms can be broken down into two parts: model error and algorithmic error.

Page 39: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Separation of model from algorithm:• Model error: biology not correctly modeled• Algorithmic error: correct statistical model exists, but

classification algorithms developed for model have associated error making them worse than optimal algorithms.

• Jim Lyons-Weiler and team (Pittsburgh) have developed microarray simulation tool (located at http://bioinformatics.upmc.edu/GE2/index.html), in which model can be adjusted, and algorithms can be simulated.

• More complicated algorithms Claudio Rebbi, director of BU’s Center for Scientific Computation.

Page 40: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Gene Expression Data Simulator:• Fig. 1. A Case vs.

Control Pattern of Inheritance Model in Detail. Between group correlations specified by DrAB and within group correlations are specified by DrA for group A and DrB for group B.

Page 41: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Outcome of jittering process

Page 42: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Outcome of jittering process• FIG. 2. Outcome of a jittering process to produce correlations between

two arbitrary samples i and j (1,000 genes). Each biplot represents expression levels for i and j drawn from two gamma distribution shape parameter values (1= skewed; 20 = normal) over the range of expected correlation between i and j (determined by Drij) to demonstrate the type of data that can be generated by the Gene Expression Data Simulator. In jittering, random genes are selected to be changed stochastically by a maximum amount v1. The between-sample correlation is measured, and if the target r is achieved, jittering stops. If not, then another gene is selected to be changed. The process continues until the target correlation is achieved.

• Example of the bivariate output. Modeled expression intensities were generated for two samples for three levels of Dr under two gamma distribution shapes (Fig. 2). This result demonstrates that the simulator can approximate very well biologically realistic data sets with stochastic error and the desired correlation.

Page 43: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Sample output:

• Example of the bivariate output. Modeled expression intensities were generated for two samples for three levels of Dr under two gamma distribution shapes (Fig. 2). This result demonstrates that the simulator can approximate very well biologically realistic data sets with stochastic error and the desired correlation.

Page 44: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Sample output:

Page 45: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

PREDICTIVE MEDICINE:

• Cancer markers: Size of tumor, past historical information, patient biomarkers, genomic information

• Microarray markup language biomarker markup language (need for NIH-approved standardized language)

Page 46: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

PREDICTIVE MEDICINE:

• Goal: database into which all kinds of information can be integrated.

• Inference engine: dichotomy – mini-engine and meta-engine (boosting and bagging algorithms)

Page 47: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Medical applications: patient state is time dependent

• x = uncontrolled variables (e.g., cancer etiology, individual biomarkers and genetic markers)

• y = controlled variables (patient treatment, drugs administered, etc.)

• z = (x,y)

• z(t+1) = f(z(t))

Page 48: MICROARRAY DATA ANALYSIS Mark Kon, Boston University Presentation of joint work with Dr. James Lyons-Weiler Centers for Pathology and Oncology Informatics

Other connections:• Learning: Discover the function f(t) from databases

of examples• Control theory: how to adjust y(t) (controlled

variables) so that disease history z(t) progresses as well as possible?

• Financial mathematics – algorithms there also apply to control theory aspects here

• Stochastic differential equations• dx/dt = B’(t) + b(x)• C. Rebbi: simulations (Matlab nlinfit program

suffices) – psychiatric data, simulated cancer data