84
Gene Expression Profiling Illustrated Using BRB-ArrayTools

Gene Expression Profiling Illustrated Using BRB-ArrayTools

  • Upload
    liesel

  • View
    44

  • Download
    3

Embed Size (px)

DESCRIPTION

Gene Expression Profiling Illustrated Using BRB-ArrayTools. Good Microarray Studies Have Clear Objectives. Class Comparison (gene finding) Find genes whose expression differs among predetermined classes Class Prediction - PowerPoint PPT Presentation

Citation preview

Page 1: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Gene Expression Profiling Illustrated Using BRB-ArrayTools

Page 2: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Good Microarray Studies Have Clear Objectives

• Class Comparison (gene finding)– Find genes whose expression differs among predetermined

classes

• Class Prediction– Prediction of predetermined class using information from

gene expression profile• Response vs no response

• Class Discovery– Discover clusters of specimens having similar expression

profiles– Discover clusters of genes having similar expression profiles

Page 3: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Class Comparison and Class Prediction

• Not clustering problems

• Supervised methods

Page 4: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Levels of Replication

• Technical replicates– RNA sample divided into multiple aliquots

• Biological replicates– Multiple subjects – Multiple animals– Replication of the tissue culture experiment

Page 5: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• Comparing classes or developing classifiers requires independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates

Page 6: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Microarray Platforms

• Single label arrays– Affymetrix GeneChips

• Dual label arrays– Common reference design– Other designs

Page 7: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Common Reference Design

A1

R

A2 B1 B2

R R R

RED

GREEN

Array 1 Array 2 Array 3 Array 4

Ai = ith specimen from class A

R = aliquot from reference pool

Bi = ith specimen from class B

Page 8: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide.

• The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure.

• The reference is not the object of comparison.

Page 9: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 10: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• The relative measure of expression (e.g. log ratio) will be compared among biologically independent samples from different classes for class comparison and as the predictor variables for class prediction

• Dye swap technical replicates of the same two rna samples are not necessary with the common reference design.

Page 11: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• For two-label direct comparison designs for comparing two classes, dye bias is of concern.

• It is more efficient to balance the dye-class assignments for independent biological specimens than to do dye swap technical replicates

Page 12: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Balanced Block Design

A1

A2

B2 A3 B4

B1 B3 A4

RED

GREEN

Array 1 Array 2 Array 3 Array 4

Ai = ith specimen from class A

Bi = ith specimen from class B

Page 13: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Class ComparisonGene Finding

Page 14: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 15: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

t-test Comparisons of Gene Expression for gene j

(1) (2)

1 21/ 1/j j

j

j

m mt

s n n

Page 16: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Estimation of Within-Class Variance

• Estimate separately for each gene– Limited degrees-of-freedom (precision) unless number of

samples is large– Gene list dominated by genes with small fold changes and

small variances

• Assume all genes have same variance– Poor assumption

• Random (hierarchical) variance model– Wright G.W. and Simon R. Bioinformatics19:2448-2455,2003 – Variances are independent samples from a common distribution; Inverse gamma

distribution used

– Results in exact F (or t) distribution of test statistics with increased degrees of freedom for error variance

– For any normal linear model

2j

Page 17: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Randomized Variance t-test

• Pr(-2=x) = xa-1exp(-x/b)/(a)ba

1 2

1 2

1 1

ˆ ˆi i

n n

t

2 12

ˆ( 2) 2

2 2pooledn b

n a

Page 18: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Controlling for Multiple Comparisons

• Bonferroni type procedures control the probability of making any false positive errors

• Overly conservative for the context of DNA microarray studies

Page 19: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Simple Procedures

• Control expected the number of false discoveries to u– Conduct each of k tests at level u/k– e.g. To limit of 10 false discoveries in 10,000

comparisons, conduct each test at p<0.001 level

• Control FDR– Benjamini-Hochberg procedure

Page 20: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

False Discovery Rate (FDR)

• FDR = Expected proportion of false discoveries among the genes declared differentially expressed

Page 21: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

If you analyze n probe sets and select as “significant” the k genes

whose p ≤ p*

• FDR = number of false positives divided by number of positives ~ n p* / k

Page 22: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Multivariate Permutation Procedures(Korn et al. Stat Med 26:4428,2007)

Allows statements like:FD Procedure: We are 95% confident that the

(actual) number of false discoveries is no greater than 5.

FDP Procedure: We are 95% confident that the (actual) proportion of false discoveries does not exceed .10.

Page 23: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Multivariate Permutation Tests

• Distribution-free – even if they use t statistics

• Preserve/exploit correlation among tests by permuting each profile as a unit

• More effective than univariate permutation tests especially with limited number of samples

Page 24: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Additional Procedures

• “SAM” - Significance Analysis of Microarrays

Page 25: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 26: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Class Prediction

Page 27: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Components of Class Prediction

• Feature (gene) selection– Which genes will be included in the model

• Select model type – E.g. Diagonal linear discriminant analysis,

Nearest-Neighbor, …

• Fitting parameters (regression coefficients) for model– Selecting value of tuning parameters

Page 28: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Feature Selection

• Genes that are differentially expressed among the classes at a significance level (e.g. 0.01) – The level is selected only to control the number of genes in the

model– For class comparison false discovery rate is important– For class prediction, predictive accuracy is important

Page 29: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Complex Gene Selection

• Small subset of genes which together give most accurate predictions– Genetic algorithms

• Little evidence that complex feature selection is useful in microarray problems

Page 30: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Linear Classifiers for Two Classes

( )

vector of log ratios or log signals

features (genes) included in model

weight for i'th feature

decision boundary ( ) > or < d

i ii F

i

l x w x

x

F

w

l x

Page 31: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Linear Classifiers for Two Classes• Fisher linear discriminant analysis

– Requires estimating correlations among all genes selected for model

• Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated

• Compound covariate predictor (Radmacher) and Golub’s method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers

(1) (2) 1( ) 'w x x S

Page 32: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Linear Classifiers for Two Classes

• Compound covariate predictor

Instead of for DLDA

(1) (2)

ˆi i

ii

x xw

(1) (2)

2ˆi i

ii

x xw

Page 33: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Linear Classifiers for Two Classes

• Support vector machines with inner product kernel are linear classifiers with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector

Page 34: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Support Vector Machine

2i

i

( )

j

minimize w

subject to ' 1

where y 1 for class 1 or 2.

jjy w x b

Page 35: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Other Linear Methods

• Perceptrons

• Principal component regression

• Supervised principal component regression

• Partial least squares

• Stepwise logistic regression

Page 36: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Other Simple Methods

• Nearest neighbor classification

• Nearest k-neighbors

• Nearest centroid classification

• Shrunken centroid classification

Page 37: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Nearest Neighbor Classifier

• To classify a sample in the validation set, determine it’s nearest neighbor in the training set; i.e. which sample in the training set is its gene expression profile is most similar to.– Similarity measure used is based on genes

selected as being univariately differentially expressed between the classes

– Correlation similarity or Euclidean distance generally used

• Classify the sample as being in the same class as it’s nearest neighbor in the training set

Page 38: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Nearest Centroid Classifier

• For a training set of data, select the genes that are informative for distinguishing the classes

• Compute the average expression profile (centroid) of the informative genes in each class

• Classify a sample in the validation set based on which centroid in the training set it’s gene expression profile is most similar to.

Page 39: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

When p>>n The Linear Model is Too Complex

• It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.

• Why consider more complex models?

Page 40: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Other Methods

• Top-scoring pairs– Claim that it gives accurate prediction with few

pairs because pairs of genes are selected to work well together

• Random Forest– Very popular in machine learning community– Complex classifier

Page 41: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• Comparative studies indicate that linear methods and nearest neighbor type methods often work as well or better than more complex methods for microarray problems because they avoid over-fitting the data.

Page 42: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 43: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

When There Are More Than 2 Classes

• Nearest neighbor type methods

• Decision tree of binary classifiers

Page 44: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Decision Tree of Binary Classifiers

• Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and S2

• Develop a binary classifier for distinguishing the composite classes S1 and S2

• Compute the cross-validated classification error for distinguishing S1 and S2

• Repeat the above steps for all possible partitions in order to find the partition S1and S2 for which the cross-validated classification error is minimized

• If S1and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1and S2 to optimally partition each of them

Page 45: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 46: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 47: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Evaluating a Classifier

• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data– Goodness of fit vs prediction accuracy

Page 48: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Simulation Training Validation

1

2

3

4

5

6

7

8

9

10

p=7.0e-05

p=0.70

p=4.2e-07

p=0.54

p=2.4e-13

p=0.60

p=1.3e-10

p=0.89

p=1.8e-13

p=0.36

p=5.5e-11

p=0.81

p=3.2e-09

p=0.46

p=1.8e-07

p=0.61

p=1.1e-07

p=0.49

p=4.3e-09

p=0.09

Page 49: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Class Prediction

• A set of genes is not a classifier• Testing whether analysis of independent data results in

selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier

Page 50: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 51: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 52: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• Hazard ratios and statistical significance levels are not appropriate measures of prediction accuracy

• A hazard ratio is a measure of association– Large values of HR may correspond to small improvement in

prediction accuracy• Kaplan-Meier curves for predicted risk groups within

strata defined by standard prognostic variables provide more information about improvement in prediction accuracy

• Time dependent ROC curves within strata defined by standard prognostic factors can also be useful

Page 53: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 54: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Time-Dependent ROC Curve

• Sensitivity vs 1-Specificity

• Sensitivity = prob{M=1|S>T}

• Specificity = prob{M=0|S<T}

Page 55: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Pr{ | 1}Pr{ 1}Pr{ 1| }

Pr{ }

Pr{ | 0}Pr{ 0}Pr{ 0 | }

Pr{ }

S T M MM S T

S T

S T M MM S T

S T

Page 56: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Validation of a Predictor

• Internal validation– Re-substitution estimate

• Very biased

– Split-sample validation– Cross-validation

• Independent data validation

Page 57: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Simulation Training Validation

1

2

3

4

5

6

7

8

9

10

p=7.0e-05

p=0.70

p=4.2e-07

p=0.54

p=2.4e-13

p=0.60

p=1.3e-10

p=0.89

p=1.8e-13

p=0.36

p=5.5e-11

p=0.81

p=3.2e-09

p=0.46

p=1.8e-07

p=0.61

p=1.1e-07

p=0.49

p=4.3e-09

p=0.09

Page 58: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Split-Sample Evaluation

• Split your data into a training set and a test set– Randomly (e.g. 2:1)– By center

• Training-set– Used to select features, select model type, determine

parameters and cut-off thresholds• Test-set

– Withheld until a single model is fully specified using the training-set.

– Fully specified model is applied to the expression profiles in the test-set to predict class labels.

– Number of errors is counted

Page 59: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Leave-one-out Cross Validation

• Leave-one-out cross-validation simulates the process of separately developing a model on one set of data and predicting for a test set of data not used in developing the model

Page 60: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Leave-one-out Cross Validation

• Omit sample 1– Develop multivariate classifier from scratch on

training set with sample 1 omitted– Predict class for sample 1 and record whether

prediction is correct

Page 61: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Leave-one-out Cross Validation

• Repeat analysis for training sets with each single sample omitted one at a time

• e = number of misclassifications determined by cross-validation

• Subdivide e for estimation of sensitivity and specificity

Page 62: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• Cross validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation.

• With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.

• The cross-validated estimate of misclassification error is an estimate of the prediction error for model fit using specified algorithm to full dataset

Page 63: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Prediction on Simulated Null Data

Generation of Gene Expression Profiles

• 14 specimens (Pi is the expression profile for specimen i)

• Log-ratio measurements on 6000 genes

• Pi ~ MVN(0, I6000)

• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?

Prediction Method

• Compound covariate prediction (discussed later)

• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.

Page 64: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Number of misclassifications

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Pro

po

rtio

n o

f sim

ula

ted

da

ta s

ets

0.00

0.05

0.10

0.90

0.95

1.00

Cross-validation: none (resubstitution method)Cross-validation: after gene selectionCross-validation: prior to gene selection

Page 65: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Permutation Distribution of Cross-validated Misclassification Rate of a

Multivariate Classifier• Randomly permute class labels and repeat the

entire cross-validation• Re-do for all (or 1000) random permutations of

class labels• Permutation p value is fraction of random

permutations that gave as few misclassifications as e in the real data

Page 66: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Gene-Expression Profiles in Hereditary Breast Cancer

• Breast tumors studied:7 BRCA1+ tumors8 BRCA2+ tumors7 sporadic tumors

• Log-ratios measurements of 3226 genes for each tumor after initial data filtering

cDNA MicroarraysParallel Gene Expression Analysis

RESEARCH QUESTIONCan we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

Page 67: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

BRCA1

g

# of

significant genes

# of misclassified

samples (m)

% of random permutations with

m or fewer misclassifications

10-2 182 3 0.4 10-3 53 2 1.0 10-4 9 1 0.2

Page 68: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

BRCA2

g

# of significant

genes

m = # of misclassified elements

(misclassified samples)

% of random permutations with m

or fewer misclassifications

10-2 212 4 (s11900, s14486, s14572, s14324) 0.8 10-3 49 3 (s11900, s14486, s14324) 2.2 10-4 11 4 (s11900, s14486, s14616, s14324) 6.6

Page 69: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Classification of BRCA2 Germline Mutations

Classification Method LOOCV Prediction Error

Compound Covariate Predictor 14%

Fisher LDA 36%

Diagonal LDA 14%

1-Nearest Neighbor 9%

3-Nearest Neighbor 23%

Support Vector Machine

(linear kernel)

18%

Classification Tree 45%

Page 70: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 71: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Simulated Data40 cases, 10 genes selected from 5000

Method Estimate Std Deviation

True .078

Resubstitution .007 .016

LOOCV .092 .115

10-fold CV .118 .120

5-fold CV .161 .127

Split sample 1-1 .345 .185

Split sample 2-1 .205 .184

.632+ bootstrap .274 .084

Page 72: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 73: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Simulated Data40 cases

Method Estimate Std Deviation

True .078

10-fold .118 .120

Repeated 10-fold .116 .109

5-fold .161 .127

Repeated 5-fold .159 .114

Split 1-1 .345 .185

Repeated split 1-1 .371 .065

Page 74: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Common Problems With Internal Classifier Validation

• Pre-selection of genes using entire dataset

• Failure to consider optimization of tuning parameter part of classification algorithm– Varma & Simon, BMC Bioinformatics 2006

• Erroneous use of predicted class in regression model

Page 75: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

• If you use cross-validation estimates of prediction error for a set of algorithms indexed by a tuning parameter and select the algorithm with the smallest cv error estimate, you do not have an unbiased estimate of the prediction error for the selected model

Page 76: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Does an Expression Profile Classifier Predict More Accurately Than Standard

Prognostic Variables?

• Not an issue of which variables are significant after adjusting for which others or which are independent predictors– Predictive accuracy and inference are different

• The two classifiers can be compared by ROC analysis as functions of the threshold for classification

• The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables

Page 77: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Does an Expression Profile Classifier Predict More Accurately Than Standard

Prognostic Variables?

• Some publications fit logistic model to standard covariates and the cross-validated predictions of expression profile classifiers

• This is valid only with split-sample analysis because the cross-validated predictions are not independent

log ( ) ( | )i iit p y x i z

Page 78: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Survival Risk Group Prediction

• Evaluate individual genes by fitting single variable proportional hazards regression models to log signal or log ratio for gene

• Select genes based on p-value threshold for single gene PH regressions

• Compute first k principal components of the selected genes

• Fit PH regression model with the k pc’s as predictors. Let b1 , …, bk denote the estimated regression coefficients

• To predict for case with expression profile vector x, compute the k supervised pc’s y1 , …, yk and the predictive index = b1 y1 + … + bk yk

Page 79: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Survival Risk Group Prediction

• LOOCV loop:– Create training set by omitting i’th case

• Develop supervised pc PH model for training set• Compute cross-validated predictive index for i’th

case using PH model developed for training set• Compute predictive risk percentile of predictive

index for i’th case among predictive indices for cases in the training set

Page 80: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Survival Risk Group Prediction

• Plot Kaplan Meier survival curves for cases with cross-validated risk percentiles above 50% and for cases with cross-validated risk percentiles below 50%– Or for however many risk groups and

thresholds is desired

• Compute log-rank statistic comparing the cross-validated Kaplan Meier curves

Page 81: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Survival Risk Group Prediction

• Repeat the entire procedure for all (or large number) of permutations of survival times and censoring indicators to generate the null distribution of the log-rank statistic– The usual chi-square null distribution is not valid

because the cross-validated risk percentiles are correlated among cases

• Evaluate statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution

Page 82: Gene Expression Profiling  Illustrated Using BRB-ArrayTools

Survival Risk Group Prediction

• Other approaches to survival risk group prediction have been published

• The supervised pc method is implemented in BRB-ArrayTools

• BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables

Page 83: Gene Expression Profiling  Illustrated Using BRB-ArrayTools
Page 84: Gene Expression Profiling  Illustrated Using BRB-ArrayTools