Systems Approaches to Disease Stratification Nathan Price Introduction to Systems Biology Short Course August 20, 2012


Systems Approaches to Disease Stratification

Nathan Price

Introduction to Systems Biology Short Course

August 20, 2012


Goals and Motivation

Currently, most diagnoses are based on symptoms and visual features (pathology, histology)

However, many diseases appear deceptively similar, but are, in fact, distinct entities from the molecular perspective

Drive towards personalized medicine

Outline

Molecular signature classifiers: main issues
• Signal to noise
• Small sample size issues
• Error estimation techniques
• Phenotypes and sample heterogeneity
• Example study

Advanced topics
• Network-based classification
• Importance of broad disease context


Molecular signature classifiers

Overall strategy


Molecular signatures for diagnosis

The goals of molecular classification of tumors:
• Identify subpopulations of cancer
• Inform choice of therapy

Generally, a set of microarray experiments is used with
• ~100 patient samples
• ~10^4 transcripts (genes)

This very small number of samples relative to the number of transcripts is a key issue
• Feature selection & model selection
• Small sample size issues dominate
• Error estimation techniques

Also, the microarray platform used can have a significant effect on results


Randomness

Expression values have randomness arising from both biological and experimental variability.

Design, performance evaluation, and application of classifiers must take this randomness into account.


Three critical issues arise…

Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population?

How does one estimate the error of a designed classifier when data is limited?

Given a large set of potential variables, such as the large number of expression levels provided by each microarray, how does one select a set of variables as the input vector to the classifier?

Small sample issues

Our task is to predict future events
• Thus, we must avoid overfitting
• It is easy (if the model is complicated enough) to fit the data we have
• Simplicity of the model is vital when data is sparse and the number of possible relationships is large
• This is exactly the case in virtually all microarray studies, including ours

In the clinic
• At the end, we want a test that can easily be implemented and actually benefit patients

Error estimation and variable selection

An error estimator may be unbiased but have a large variance, and therefore often be low.

This can produce a large number of gene sets and classifiers with low error estimates.

For a small sample, one can end up with thousands of gene sets for which the error estimate from the sample data is near zero!
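To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every gene is independent noise, so a two-gene rule of the form "gene A > gene B" agrees with each sample's label with probability 1/2. The function name and the choice of 10^4 genes are illustrative assumptions, not values taken from a specific study.

```python
from math import comb


def expected_perfect_pairs(n_genes: int, m_samples: int) -> float:
    """Expected number of gene pairs with zero apparent error on pure noise.

    A pair counts as "perfect" if the ordering of its two genes (or the
    reversed ordering) matches the class label in every one of the m samples.
    For noise, that happens with probability 2 * (1/2)**m per pair.
    """
    n_pairs = comb(n_genes, 2)           # number of candidate two-gene rules
    p_perfect = 2.0 ** (1 - m_samples)   # chance that one noise pair fits perfectly
    return n_pairs * p_perfect


if __name__ == "__main__":
    for m in (10, 20, 30):
        print(m, expected_perfect_pairs(10_000, m))
```

Under these assumptions, roughly 10^5 noise pairs fit 10 samples perfectly by chance, and about 95 still do with 20 samples, which is why a near-zero error on the sample data says little by itself.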


Overfitting

Complex decision boundary may be unsupported by the data relative to the feature-label distribution.

Relative to the sample data, a classifier may have small error; but relative to the feature-label distribution, the error may be severe!

Classification rule should not cut up the space in a manner too complex for the amount of sample data available.

Overfitting: example of KNN rule

[Figure: k-nearest-neighbor classification of a test sample with k = 3, shown for sample sizes N = 30 and N = 90]
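For readers who have not seen the rule written out, a minimal k-nearest-neighbor sketch follows. The toy points and labels are hypothetical and only illustrate the majority vote with k = 3; they are not the data from the slide.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data in two dimensions.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["A", "A", "A", "B", "B", "B"]

if __name__ == "__main__":
    print(knn_predict(points, labels, (0.15, 0.15)))
    print(knn_predict(points, labels, (0.95, 1.0)))
```

Because the decision boundary is defined entirely by the training points, it can become very irregular when N is small, which is the overfitting risk this slide illustrates.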

Example: How to identify appropriate models

(regression… but the issues are the same)

$y = f(x) + n$, where $n$ is noise

learn f from data


Linear…


Quadratic…


Piecewise linear interpolation…


Which one is best?


Cross-validation


Simple: just choose the classifier with the best cross-validation error

But… (there is always a but)
• we are training on even less data, so the classifier design is worse
• if sample size is small, test set is small and error estimator has high variance
• so we may be fooling ourselves into thinking we have a good classifier…


LOOCV (leave-one-out cross validation)

[Figure: LOOCV results for the three candidate fits, with mean square errors of 2.12, 0.96, and 3.33; the fit with the lowest error is marked "best"]
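The comparison can be sketched in a few lines. The example below runs leave-one-out cross-validation for a linear fit, a quadratic fit, and piecewise linear interpolation. The data are hypothetical (a quadratic trend plus small fixed perturbations), so the errors it prints will not reproduce the values on the slide; the helper names are likewise illustrative.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    size = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [[sum(x ** (i + j) for x in xs) for j in range(size)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(size)]
    for col in range(size):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def poly_model(degree):
    def fit(xs, ys):
        coeffs = polyfit(xs, ys, degree)
        return lambda x: sum(c * x ** i for i, c in enumerate(coeffs))
    return fit


def piecewise_linear(xs, ys):
    """Join-the-dots interpolation; extrapolates from the nearest end segment."""
    pts = sorted(zip(xs, ys))

    def predict(x):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x <= x1:
                break
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return predict


def loocv_mse(xs, ys, fit):
    """Hold out each point once, train on the rest, average the squared errors."""
    errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append((model(xs[i]) - ys[i]) ** 2)
    return sum(errors) / len(errors)


# Hypothetical data: quadratic trend plus small fixed perturbations.
noise = [0.05, -0.08, 0.03, -0.02, 0.09, -0.06, 0.01, -0.04, 0.07, -0.05]
xs = [float(x) for x in range(10)]
ys = [0.5 * x * x - 2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

scores = {
    "linear": loocv_mse(xs, ys, poly_model(1)),
    "quadratic": loocv_mse(xs, ys, poly_model(2)),
    "piecewise linear": loocv_mse(xs, ys, piecewise_linear),
}

if __name__ == "__main__":
    for name, mse in scores.items():
        print(name, mse)
```

On this toy data the quadratic model has the lowest held-out error: the linear model underfits, while the interpolation fits every training point exactly and still predicts held-out points worse.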

Estimating Error on Future Cases

Methodology
• Best case: have an independent test set
• Resampling techniques
• Use cross validation to estimate accuracy on future cases
• Feature selection and model selection must be within loop to avoid overly optimistic estimates

[Figure: the data set is split into a training set and a test set, with NO information passage between them. Resampling: shuffled repeatedly into training and test sets.]

Average performance on test set provides estimate for behavior on future cases

Can be MUCH different than behavior on training set
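A minimal sketch of keeping feature selection inside the loop is shown below. Everything here is an illustrative assumption: the data are made up, the folds are fixed by index rather than shuffled, and the "classifier" is a single-feature nearest-class-mean rule chosen only to keep the example short. The point is structural: `fit` is called on the training fold only, so the held-out samples never influence which feature is selected.

```python
def select_and_train(rows, labels):
    """Select one feature using the TRAINING rows only, then classify by the
    nearer class mean on that feature. Labels are assumed to be 0 or 1."""
    def class_mean(j, c):
        vals = [r[j] for r, y in zip(rows, labels) if y == c]
        return sum(vals) / len(vals)

    n_features = len(rows[0])
    # Feature selection: largest absolute difference between class means.
    best = max(range(n_features),
               key=lambda j: abs(class_mean(j, 1) - class_mean(j, 0)))
    m0, m1 = class_mean(best, 0), class_mean(best, 1)
    return lambda row: 1 if abs(row[best] - m1) < abs(row[best] - m0) else 0


def cross_validated_accuracy(rows, labels, n_folds, fit):
    """k-fold cross-validation; selection and training both happen per fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(rows)) if i % n_folds == fold]
        train_idx = [i for i in range(len(rows)) if i % n_folds != fold]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(model(rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


# Hypothetical data: feature 0 separates the classes, features 1 and 2 do not.
rows = [
    [1.0, 3.0, 2.0], [5.0, 2.9, 2.1], [1.2, 3.1, 1.9], [4.8, 3.0, 2.0],
    [0.9, 2.8, 2.2], [5.1, 3.2, 1.8], [1.1, 3.0, 2.1], [4.9, 2.9, 2.0],
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

accuracy = cross_validated_accuracy(rows, labels, n_folds=4, fit=select_and_train)

if __name__ == "__main__":
    print(accuracy)
```

Selecting the feature once on the full data set and then cross-validating only the final rule would leak information from the test folds into the design, which is the source of the overly optimistic estimates mentioned above.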

Classification methods
• k-nearest neighbor
• Support vector machine (SVM)
• Linear, quadratic
• Perceptrons, neural networks
• Decision trees
• k-Top Scoring Pairs
• Many others


Molecular signature classifiers

Example Study

Diagnosing similar cancers with different treatments

Challenge in medicine: diagnosis, treatment, prevention of disease suffer from lack of knowledge

Gastrointestinal Stromal Tumor (GIST) and Leiomyosarcoma (LMS)
• morphologically similar, hard to distinguish using current methods
• different treatments, correct diagnosis is critical
• studying genome-wide patterns of expression aids clinical diagnosis

Goal: Identify molecular signature that will accurately differentiate these two cancers

[Figure: GIST patient vs. LMS patient]


Relative Expression Reversal Classifiers

Find a classification rule as follows: IF gene A > gene B THEN class1, ELSE class2

The classifier is chosen by finding the most accurate and robust rule of this type from all possible pairs in the dataset

If needed, a set of classifiers of the above form can be used, with final classification resulting from a majority vote (k-TSP)
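A minimal sketch of the single-pair version of this rule follows. It scores each gene pair by how differently the ordering "gene A > gene B" occurs in the two classes and keeps the best pair. The toy expression values and the names `fit_tsp` and `predict_tsp` are hypothetical, ties between genes are ignored for simplicity, and the secondary tie-breaking used in the published method is omitted.

```python
from itertools import combinations


def fit_tsp(samples, labels, class1, class2):
    """Return (score, a, b): the gene pair whose ordering best separates the
    classes, oriented so that "gene a > gene b" votes for class1."""
    def frac_greater(i, j, c):
        rows = [s for s, y in zip(samples, labels) if y == c]
        return sum(s[i] > s[j] for s in rows) / len(rows)

    best = None
    for i, j in combinations(range(len(samples[0])), 2):
        delta = frac_greater(i, j, class1) - frac_greater(i, j, class2)
        # Flip the pair if the ordering i > j is more typical of class2
        # (exact only when there are no ties between the two genes).
        a, b = (i, j) if delta >= 0 else (j, i)
        if best is None or abs(delta) > best[0]:
            best = (abs(delta), a, b)
    return best


def predict_tsp(sample, a, b, class1, class2):
    """IF gene a > gene b THEN class1, ELSE class2."""
    return class1 if sample[a] > sample[b] else class2


# Hypothetical expression values: rows are samples, columns are genes.
samples = [
    [9.0, 2.0, 5.0], [8.0, 3.0, 1.0], [7.5, 1.0, 9.0],   # class1
    [2.0, 8.0, 5.5], [1.0, 9.0, 0.5], [3.0, 7.0, 8.0],   # class2
]
labels = ["class1"] * 3 + ["class2"] * 3

score, gene_a, gene_b = fit_tsp(samples, labels, "class1", "class2")

if __name__ == "__main__":
    print(score, gene_a, gene_b)
    print(predict_tsp([6.0, 4.0, 0.0], gene_a, gene_b, "class1", "class2"))
```

Because the rule only compares two values measured within the same sample, rescaling that sample by any positive factor leaves the prediction unchanged.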

• Geman, D., et al. Stat. Appl. Genet. Mol. Biol., 3, Article 19, 2004
• Tan et al., Bioinformatics, 21:3896-904, 2005

Rationale for k-TSP

Based on concept of relative expression reversals

Advantages
• Does not require data normalization
• Does not require population-wide cutoffs or weighting functions
• Has reported accuracies in literature comparable to SVMs, PAM, other state-of-the-art classification methods
• Results in classifiers that are easy to implement
• Designed to avoid overfitting: $\frac{n^2}{2} \ll 2^m$, where n = number of genes, m = number of samples
• For the example I will show, this equation yields: 10^9 << 10^20

Diagnostic Marker Pair

[Figure: log-log scatter plot of OBSCN expression and C9orf65 expression, both axes running from 10^1 to 10^5. Clinicopathological diagnosis: X = GIST, O = LMS. The two regions of the plot are labeled "Classified as GIST" and "Classified as LMS".]

Accuracy on data = 99%

Predicted accuracy on future data (LOOCV) = 98%

• Price, N.D. et al, PNAS 104:3414-9 (2007)

RT-PCR Classification Results

100% Accuracy
• 19 independent samples
• 20 samples from microarray study, including previously indeterminate case

[Figure: OBSCN-c9orf65 bar chart of the difference of Ct average for each sample (vertical axis from -20 to 10), with LMS and GIST groups marked and Sample 79 and Sample 62 called out]

• Price, N.D. et al, PNAS 104:3414-9 (2007)

Comparative biomarker accuracies

[Figure: log-log scatter plot of c-kit gene expression against the 2-gene relative expression classifier (OBSCN expression / C9orf65 expression). GIST = X, LMS = O.]

• Price, N.D. et al, PNAS 104:3414-9 (2007)


Kit Protein Staining of GIST-LMS

Top Row – GIST Positive Staining

Bottom Row – GIST negative staining

• Price, N.D. et al, PNAS 104:3414-9 (2007)

Blue arrows - GIST Red arrows - LMS

Accuracy as a classifier ~ 87%.

A few general lessons

Choosing markers based on relative expression reversals of gene pairs has proven to be very robust with high predictive accuracy in sets we have tested so far
• Simple and independent of normalization
• Easy to implement clinical test ultimately
• All that’s needed is RT-PCR on two genes

Advantages of this approach may be even more applicable to proteins in the blood
• Each decision rule requiring the measurement of the relative concentration of 2 proteins


Network-based classification


• Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:140 (2007)

Can modify feature selection methods based on networks

Can improve performance (not always)

Generally improves biological insight by integrating heterogeneous data

Shown to improve prediction of breast cancer metastasis (complex phenotype)

Rationale: Differential Rank Conservation (DIRAC)

Networks or pathways inform best targets for therapies
• Cancer is a multi-genic disease

Analyze high-throughput data to identify aspects of the genome-scale network that are most affected

Initial version uses a priori defined gene sets
• BioCarta, KEGG, GO, etc.

Differential rank conservation (DIRAC) for studying
• Expression rank conservation for pathways within a phenotype
• Pathways that discriminate well between phenotypes

[Figure: the OBSCN vs. C9orf65 diagnostic marker pair scatter plot (accuracy on data = 99%, predicted accuracy on future data by LOOCV = 98%)]

Price, N.D. et al, PNAS, 2007

Eddy, J.A. et al, PLoS Computational Biology (2010)

Differential Rank Conservation

[Figure: gene rank orderings within pathways, compared across pathways in a phenotype and across phenotypes for a pathway.
Tightly regulated pathway (highest conservation): g4 g1 g2 g3 | g4 g1 g2 g3 | g4 g2 g1 g3 | g4 g1 g2 g3
Weakly regulated pathway (lowest conservation): g5 g6 g8 g7 | g5 g7 g8 g6 | g8 g6 g7 g5 | g5 g8 g6 g7
Shuffled pathway ranking between phenotypes (GIST vs. LMS): g4 g1 g2 g3 | g2 g3 g1 g4]
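The idea of rank conservation can be sketched as follows. This is one plausible reading of the quantities described in Eddy et al. (2010), not the authors' code: for one pathway, take the majority ordering of every gene pair across the samples of a phenotype (a rank template), score each sample by the fraction of pairs that match it, and average those scores. The expression values are hypothetical and the function names are illustrative.

```python
from itertools import combinations


def pairwise_orderings(expr):
    """One boolean per gene pair (i, j) with i < j: is gene i below gene j?"""
    return [expr[i] < expr[j] for i, j in combinations(range(len(expr)), 2)]


def rank_template(samples):
    """Majority ordering of each gene pair across samples (ties count as False)."""
    orderings = [pairwise_orderings(s) for s in samples]
    return [sum(col) * 2 > len(samples) for col in zip(*orderings)]


def rank_matching_score(expr, template):
    """Fraction of gene pairs in one sample whose ordering matches the template."""
    own = pairwise_orderings(expr)
    return sum(o == t for o, t in zip(own, template)) / len(template)


def rank_conservation_index(samples):
    """Average matching score of the samples against their own template."""
    template = rank_template(samples)
    return sum(rank_matching_score(s, template) for s in samples) / len(samples)


# Hypothetical expression of a 4-gene pathway in 4 samples of one phenotype.
tight_pathway = [[3, 2, 1, 4], [3, 2, 1, 4], [2, 3, 1, 4], [3, 2, 1, 4]]
weak_pathway = [[4, 3, 1, 2], [4, 1, 3, 2], [1, 3, 2, 4], [4, 2, 1, 3]]

tight_index = rank_conservation_index(tight_pathway)
weak_index = rank_conservation_index(weak_pathway)

if __name__ == "__main__":
    print(tight_index, weak_index)
```

A pathway whose genes keep nearly the same ordering in every sample scores close to 1, while a pathway whose ordering is shuffled from sample to sample scores lower. Only orderings within a sample are used, so no normalization across samples is needed.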

Visualizing global network rank conservation

Network name    Num. genes    µR
GS              6             1.000
FOSB            4             0.981
AKAP13          7             0.955
AGPCR           11            0.955
RNA             8             0.948
CACAM           12            0.947
NDKDYNAMIN      17            0.946
ETC             8             0.946
SET             11            0.945
ALTERNATIVE     8             0.847
ALK             34            0.845
LAIR            14            0.844
PITX2           16            0.840
METHIONINE      5             0.839
IL5             10            0.833
STEM            15            0.829
ION             5             0.806
CYTOKINE        21            0.805
IL18            6             0.763
LEPTIN          8             0.728
…               …             …

Average rank conservation across all 248 networks: 0.903


Global regulation of networks across phenotypes

Highest rank conservation

Lowest rank conservation

Tighter network regulation: normal prostate

Looser network regulation: primary prostate cancer

Loosest network regulation: metastatic prostate cancer

Eddy et al, PLoS Computational Biology, (2010)



Differential rank conservation of the MAPK network


DIRAC classification is comparable to other methods

Cross validation accuracies in prostate cancer

Differential Rank Conservation (DIRAC): Key Features

Independent of data normalization

Independent of genes/proteins outside of network

Can show massive/complete perturbations
• Unlike Fisher’s exact test (e.g. GO enrichment)

Measures the “shuffling” of the network in terms of the hierarchy of expression of the components
• Distinct from enrichment or GSEA

Provides a distinct mathematical classifier to yield measurement of predictive accuracy on test data
• Stronger than p-value for determining signal

Code for the method can be found at our website: http://price.systemsbiology.net

• Eddy et al, PLoS Computational Biology, (2010)


Global Analysis of Human Disease

Importance of broad context to disease diagnosis


The envisioned future of blood diagnostics


Next generation molecular disease-screening

Why global disease analyses are essential

Organ-specificity: separating signal from noise

Hierarchy of classification
• Context-independent classifiers: based on organ-specific markers
• Context-dependent classifiers: based on excellent markers once organ-specificity defined

Provide context for how disease classifiers should be defined

Provide broad perspective into how separable diseases are and if disease diagnosis categories seem appropriate

GLOBAL ANALYSIS OF DISEASE-PERTURBED TRANSCRIPTOMES IN THE HUMAN BRAIN

Example case study

Multidimensional scaling plot of brain disease data

[Figure legend: AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, normal]

Identification of Structured Signatures And Classifiers (ISSAC)

• At each class in the decision tree, a test sample is either allowed to pass down the tree for further classification or rejected (i.e. 'does not belong to this class') and thus unable to pass
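The pass-or-reject behavior at each class of a decision tree can be illustrated with a generic traversal. This is not the ISSAC implementation: the `Node` class, the gene names, and the acceptance rules are all hypothetical, and simple two-gene comparisons are used as node tests only to keep the example short.

```python
class Node:
    """One class in the tree, with a test that accepts or rejects a sample."""

    def __init__(self, name, accepts, children=()):
        self.name = name
        self.accepts = accepts          # function: sample -> bool
        self.children = list(children)


def classify(node, sample):
    """Return the deepest class that accepts the sample, or None if rejected here."""
    if not node.accepts(sample):
        return None                     # rejected: cannot pass down this branch
    for child in node.children:
        label = classify(child, sample)
        if label is not None:
            return label
    return node.name                    # accepted here, by no child below


# Hypothetical tree with made-up gene names and rules.
root = Node("any sample", lambda s: True, [
    Node("class 1", lambda s: s["gene_a"] > s["gene_b"], [
        Node("class 1a", lambda s: s["gene_c"] > s["gene_d"]),
    ]),
    Node("class 2", lambda s: s["gene_b"] > s["gene_a"]),
])

if __name__ == "__main__":
    print(classify(root, {"gene_a": 5, "gene_b": 1, "gene_c": 4, "gene_d": 3}))
```

A sample that is rejected by a node never reaches that node's descendants, so the markers used deeper in the tree only need to work within the context established higher up.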

Accuracy on randomly split test sets

[Figure: bar chart of classification accuracy (%), on a 0 to 100 scale, for the classes AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal/control. Bar labels as extracted: 100, 90, 84.2, 97.6, 99.0, 98.2, 81.8, 100, 94.7.]

Average accuracy of all class samples: 93.9 %


The challenge of ‘Lab Effects’

Sample heterogeneity issues in personalized medicine


Independent hold-out trials for 18 GSE datasets

[Figure: bar chart of accuracy (0% to 100%) for each held-out dataset: EPN (GSE16155), EPN (GSE21687), GBM (GSE4412), GBM (GSE4271), GBM (GSE8692), GBM (GSE9171), GBM (GSE4290), MDL (GSE10327), MDL (GSE12992), MNG (GSE4780), MNG (GSE9438), MNG (GSE16581), OLG (GSE4412), OLG (GSE4290), PA (GSE5675), PA (GSE12907), Normal (GSE3526), Normal (GSE7307)]


Leave-batch-out validation shows impact of other batch effects
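A minimal sketch of how leave-batch-out splits can be generated is given below. Each sample carries a batch label (for example the lab or dataset it came from), and every batch is held out in turn so that the classifier is always tested on a batch it never saw in training. The batch names and the function name are hypothetical.

```python
def leave_batch_out_splits(batches):
    """Yield (held_out_batch, train_indices, test_indices), one per distinct batch."""
    for held_out in sorted(set(batches)):
        train = [i for i, b in enumerate(batches) if b != held_out]
        test = [i for i, b in enumerate(batches) if b == held_out]
        yield held_out, train, test


# Hypothetical batch label for each of six samples.
batches = ["lab_A", "lab_A", "lab_B", "lab_C", "lab_B", "lab_C"]
splits = list(leave_batch_out_splits(batches))

if __name__ == "__main__":
    for held_out, train, test in splits:
        print(held_out, train, test)
```

Unlike a random split, this never places samples from the same batch on both sides, so batch-specific artifacts cannot inflate the accuracy estimate.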

Take home messages

There is tremendous promise in high-throughput approaches to identify biomarkers

Significant challenges remain to their broad success

Integrative systems approaches that link together data very broadly are essential

If training set is representative of population, there are robust signals in the data and excellent accuracy is possible

Forward designs and partnering closely with clinical partners are essential, as is standardization of data collection and analysis

Summary

Molecular signature classifiers provide a promising avenue for disease stratification

Machine-learning approaches are key
• Goal is optimal prediction of future data
• Must avoid overfitting
• Model complexity
• Feature selection & model selection

Technical challenges
• Measurement platforms

Network-based classification

Global disease context is key

Lab and batch effects critical to overcome

Sampling of heterogeneity for some diseases now sufficient to achieve stability in classification accuracies

Price Lab Members
Seth Ament, PhD
Daniel Baker
Matthew Benedict
Julie Bletz, PhD
Victor Cassen
Sriram Chandrasekaran
Nicholas Chia, PhD (now Asst. Prof. at Mayo Clinic)
John Earls
James Eddy
Cory Funk, PhD
Pan Jun Kim, PhD (now Asst. Prof. at POSTECH)
Alexey Kolodkin, PhD
Charu Gupta Kumar, PhD
Ramkumar Hariharan, PhD
Ben Heavner, PhD
Piyush Labhsetwar
Andrew Magis
Caroline Milne
Shuyi Ma
Beth Papanek
Matthew Richards
Areejit Samal, PhD
Vineet Sangar, PhD
Bozenza Sawicka
Evangelos Simeonidis
Jaeyun Sung
Chunjing Wang

Funding
• NIH / National Cancer Institute - Howard Temin Pathway to Independence Award
• NSF CAREER
• Department of Energy
• Energy Biosciences Institute (BP)
• Department of Defense (TATRC)
• Luxembourg-ISB Systems Medicine Program
• Roy J. Carver Charitable Trust Young Investigator Award
• Camille Dreyfus Teacher-Scholar Award

Collaborators
Don Geman (Johns Hopkins)
Wei Zhang (MD Anderson)

Acknowledgments
Nathan D. Price Research Laboratory
Institute for Systems Biology, Seattle, WA | University of Illinois, Urbana-Champaign, IL