Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics [email protected]


Page 1: Application of Class Discovery and Class Prediction Methods to Microarray Data

Application of Class Discovery and Class Prediction Methods to Microarray Data

Kellie J. Archer, Ph.D.
Assistant Professor
Department of Biostatistics
[email protected]

Page 2: Application of Class Discovery and Class Prediction Methods to Microarray Data

Basis of Cancer Diagnosis

• Pathologist makes an interpretation based upon a compendium of knowledge which may include
– Morphological appearance of the tumor
– Histochemistry
– Immunophenotyping
– Cytogenetic analysis
– etc.

Page 3: Application of Class Discovery and Class Prediction Methods to Microarray Data
Page 4: Application of Class Discovery and Class Prediction Methods to Microarray Data

Clinically Distinct DLBCL Subgroups

Page 5: Application of Class Discovery and Class Prediction Methods to Microarray Data

Improved Cancer Diagnosis: Identify sub-classes

• Divide morphologically similar tumors into different groups based on response.

• Application of microarrays: Characterize molecular variations among tumors by monitoring gene expression

• Goal: microarrays will lead to more reliable tumor classification and sub-classification (therefore, more appropriate treatments will be administered resulting in improved outcomes)

Page 6: Application of Class Discovery and Class Prediction Methods to Microarray Data

Distinguishing two types of acute leukemia (AML vs. ALL)

• Golub, T.R. et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537.

• http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi (near bottom of page)

Page 7: Application of Class Discovery and Class Prediction Methods to Microarray Data

Distinguishing AML vs. ALL

• 38 bone marrow (BM) samples (27 childhood ALL, 11 adult AML) were hybridized to Affymetrix GeneChips
– GeneChip included 6,817 human genes.
– Affymetrix MAS 4.0 software was used to perform image analysis.
– MAS 4.0 Average Difference expression summary method was applied to the probe level data to obtain probe set expression summaries.
– Scaling factor was used to normalize the GeneChips.
– Samples were required to meet quality control criteria.

Page 8: Application of Class Discovery and Class Prediction Methods to Microarray Data

Distinguishing AML vs. ALL

• Class comparison
– Neighborhood analysis

• Class prediction
– Weighted voting

Page 9: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery: Distinguishing AML vs. ALL

• The mean of a random variable X is a measure of central location of the density of X.

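• The variance of a random variable X is a measure of spread or dispersion of the density of X.

• $\mathrm{Var}(X) = E[(X-\mu)^2]$, estimated from a sample of size $n$ by $s^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2/(n-1)$

• Standard deviation: $\sigma(X) = \sqrt{\mathrm{Var}(X)}$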
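As a small numerical sketch of these definitions, the snippet below computes the sample mean, the sample variance with the n-1 divisor, and the standard deviation. The eight values are the ALL expression values for probe set 208680_at from the t-test table later in the deck; the variable names are illustrative only.

```python
import math
import statistics

# ALL expression values for probe set 208680_at (taken from the t-test slide)
all_values = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]

n = len(all_values)
mean = sum(all_values) / n

# Sample variance: sum of squared deviations divided by (n - 1)
variance = sum((x - mean) ** 2 for x in all_values) / (n - 1)
std_dev = math.sqrt(variance)

# The standard library returns the same estimates
assert math.isclose(variance, statistics.variance(all_values))
assert math.isclose(std_dev, statistics.stdev(all_values))

print(f"mean = {mean:.1f}, s^2 = {variance:.3f}, s = {std_dev:.2f}")
```

The mean and variance agree with the 2038.5 and 7051.284 reported for ALL on the probe set slide.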

Page 10: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery: Distinguishing AML vs. ALL

• For each gene, compute the log of the expression values. For a given gene g and each class k = 1, 2 (the two leukemia types, AML and ALL):
– Let $\mu_k(g)$ represent the mean log expression value;
– Let $\sigma_k(g)$ represent the stdev log expression value.

Page 11: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery: Distinguishing AML vs. ALL

Illustration using ALL AML example.xls

Page 12: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery: Distinguishing AML vs. ALL

• For each gene, compute a relative class separation (quasi-correlation measure) as follows

$$P(g,c) = \frac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$$

• Define neighborhoods of radius r about classes 1 and 2 such that P(g,c) > r or P(g,c) < -r. r was chosen to be 0.3
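The relative class separation, P(g,c) = (μ1(g) - μ2(g)) / (σ1(g) + σ2(g)), can be sketched in a few lines. This is a minimal illustration, not the original analysis code: the function names are hypothetical, class 1 is assumed to be ALL and class 2 AML, and the natural log is used (the ratio is the same for any log base, since numerator and denominator rescale by the same constant). The input values are those of probe set 208680_at from the t-test slide.

```python
import math
import statistics

def relative_class_separation(class1, class2):
    """P(g,c) = (mu1 - mu2) / (sigma1 + sigma2), computed on log expression."""
    log1 = [math.log(x) for x in class1]
    log2 = [math.log(x) for x in class2]
    mu1, mu2 = statistics.mean(log1), statistics.mean(log2)
    sigma1, sigma2 = statistics.stdev(log1), statistics.stdev(log2)
    return (mu1 - mu2) / (sigma1 + sigma2)

def in_neighborhood(p, r=0.3):
    """True if the gene lies in the neighborhood of either class."""
    return p > r or p < -r

# Assumption: class 1 = ALL, class 2 = AML (probe set 208680_at values)
all_expr = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_expr = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

p = relative_class_separation(all_expr, aml_expr)
print(f"P(g,c) = {p:.3f}, in neighborhood: {in_neighborhood(p)}")
```

Swapping the two classes flips the sign of P(g,c), which is why neighborhoods are defined on both sides (P > r and P < -r).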

Page 13: Application of Class Discovery and Class Prediction Methods to Microarray Data

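Aside

• This differs from Pearson’s correlation and is therefore not confined to the [-1,1] interval

$$\rho_{1,2} = \frac{\sum_{i=1}^{G}(g_{1i}-\bar{g}_{1.})(g_{2i}-\bar{g}_{2.})}{\sqrt{\sum_{i=1}^{G}(g_{1i}-\bar{g}_{1.})^2\,\sum_{i=1}^{G}(g_{2i}-\bar{g}_{2.})^2}}$$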
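A toy comparison can make the point concrete. In the sketch below, the data are made up, the values are assumed to be on the log scale already, and Pearson's correlation is computed between the expression values and a 0/1 class indicator (one possible way to set up the comparison, chosen here for illustration). The relative class separation is far outside [-1,1], while Pearson's correlation stays within it.

```python
import math
import statistics

def relative_class_separation(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2); inputs assumed already log-transformed."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2))

def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical gene: well separated classes with little within-class spread
class1 = [10.0, 10.1, 9.9]
class2 = [1.0, 1.1, 0.9]
indicator = [1, 1, 1, 0, 0, 0]  # 1 = class 1, 0 = class 2

p = relative_class_separation(class1, class2)
rho = pearson(class1 + class2, indicator)
print(f"relative class separation = {p:.1f}, Pearson correlation = {rho:.4f}")
```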

Page 14: Application of Class Discovery and Class Prediction Methods to Microarray Data

Aside

Illustration using Correlation.xls

Page 15: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery: Distinguishing AML vs. ALL

• A permutation test was used to calculate whether the observed number of genes in a neighborhood was significantly higher than expected.

Page 16: Application of Class Discovery and Class Prediction Methods to Microarray Data

Permutation based methods

• Permutation based adjusted p-values
– Under the complete null, the joint distribution of the test statistics can be estimated by permuting the columns of the gene expression matrix
– Permuting entire columns creates a situation in which membership to the Class 1 and Class 2 groups is independent of gene expression but preserves the dependence structure between genes

Page 17: Application of Class Discovery and Class Prediction Methods to Microarray Data

Permutation based methods

Example Permutations

Chip i   Expression (gene g)   Diagnosis   Perm 1   Perm 2   Perm 3   …   Perm B
1        2013.7                ALL         AML      ALL      ALL      …   AML
2        2141.9                ALL         ALL      AML      ALL      …   AML
3        2040.2                ALL         ALL      ALL      AML      …   ALL
4        1973.3                ALL         ALL      ALL      ALL      …   ALL
5        2162.2                ALL         ALL      ALL      ALL      …   AML
6        1994.8                ALL         ALL      ALL      ALL      …   ALL
7        1913.3                ALL         ALL      ALL      ALL      …   AML
8        2068.7                ALL         ALL      ALL      ALL      …   ALL
9        1974.6                AML         ALL      ALL      ALL      …   AML
10       2027.6                AML         AML      AML      AML      …   AML
11       1914.8                AML         AML      AML      AML      …   AML
12       1955.8                AML         AML      AML      AML      …   ALL
13       1963                  AML         AML      AML      AML      …   ALL
14       2025.5                AML         AML      AML      AML      …   ALL
15       1865.1                AML         AML      AML      AML      …   AML
16       1922.4                AML         AML      AML      AML      …   ALL

Page 18: Application of Class Discovery and Class Prediction Methods to Microarray Data

Permutation based methods

• Permutation algorithm for the bth permutation, b=1,…,B
– 1) Permute the n labels of the data matrix X
– 2) Compute relative class separation P(g1,c)b,…, P(gp,c)b for each gene gi.

• The permutation distribution of the relative class separation P(g,c) for gene gi, i=1,…,p is given by the empirical distribution of P(gi,c)1,…, P(gi,c)B.
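The two steps of the permutation algorithm can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure used in the original study: the expression matrix is made up, values are assumed to be log-transformed already, labels 1 and 2 stand for the two classes, and the two neighborhoods (P > r and P < -r) are counted together via |P| > r as a simplification. Shuffling the label vector is equivalent to permuting the columns of the matrix.

```python
import random
import statistics

def relative_class_separation(values, labels):
    """P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) for one gene."""
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    class2 = [v for v, lab in zip(values, labels) if lab == 2]
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2))

def permutation_distribution(X, labels, B, seed=0):
    """Return B rows; row b holds P(g_1,c)_b, ..., P(g_p,c)_b."""
    rng = random.Random(seed)
    permuted = list(labels)
    rows = []
    for _ in range(B):
        rng.shuffle(permuted)  # step 1: permute the n class labels
        # step 2: recompute the relative class separation for every gene
        rows.append([relative_class_separation(gene, permuted) for gene in X])
    return rows

# Hypothetical log-expression matrix: p = 3 genes (rows), n = 8 samples (columns)
X = [
    [7.9, 8.1, 8.0, 8.2, 6.9, 7.1, 7.0, 7.2],
    [5.1, 4.8, 5.3, 5.0, 5.2, 4.9, 5.1, 5.0],
    [6.0, 6.4, 6.1, 6.3, 6.9, 7.2, 7.0, 7.1],
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
r = 0.3

observed = [relative_class_separation(gene, labels) for gene in X]
n_observed = sum(abs(p) > r for p in observed)

perm = permutation_distribution(X, labels, B=1000)
n_permuted = [sum(abs(p) > r for p in row) for row in perm]

# Share of permutations with at least as many genes in the neighborhoods
p_value = sum(k >= n_observed for k in n_permuted) / len(n_permuted)
print(n_observed, p_value)
```

Because whole label vectors are shuffled, every gene sees the same reassignment within a permutation, which keeps the dependence between genes intact.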

Page 19: Application of Class Discovery and Class Prediction Methods to Microarray Data

Distinguishing AML vs. ALL

• Class comparisons using neighborhood analysis revealed that approximately 1,100 genes were more highly correlated with class (AML or ALL) than would be expected by chance.

Page 20: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Prediction: Distinguishing AML vs. ALL

• For a set of informative genes, each expression value xi votes for either ALL or AML, depending on whether it is closer to μALL or μAML
– Let μALL represent the mean expression value for ALL
– Let μAML represent the mean expression value for AML

$$v_i = x_i - \frac{\mu_{AML} + \mu_{ALL}}{2}$$

• Informative genes were the n/2 genes with the largest P(g,c) and the n/2 genes with the smallest P(g,c)

• Golub et al. chose n = 50

Page 21: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Prediction: Distinguishing AML vs. ALL

• wi is a weighting factor that reflects how well the gene is correlated with class distinction; wivi is the weighted vote

• For each sample, the weighted votes for each class are summed to get VALL and VAML

• The sample is assigned to the class with the higher total, provided the Prediction Strength (PS) > 0.3 where

PS = (Vwin – Vlose)/ (Vwin + Vlose)
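A compact sketch of weighted voting with the prediction strength rule is given below. It rests on several assumptions that are not spelled out on the slides: the vote is v = x - (μALL + μAML)/2, the weight is w = P(g,c) with class 1 taken to be ALL (so a positive weighted vote counts toward ALL and a negative one toward AML), and a sample that does not clear the threshold is labeled "uncertain". The gene parameters and samples are invented for illustration.

```python
def predict(sample, genes, ps_threshold=0.3):
    """Weighted voting over informative genes.

    sample: expression values x_i, one per informative gene
    genes:  (mu_all, mu_aml, w) per gene, with w = P(g,c) and class 1 = ALL
    Returns (label, prediction_strength).
    """
    v_all = v_aml = 0.0
    for x, (mu_all, mu_aml, w) in zip(sample, genes):
        v = x - (mu_all + mu_aml) / 2   # distance from the class midpoint
        vote = w * v                    # weighted vote
        if vote > 0:
            v_all += vote               # positive votes count toward ALL
        else:
            v_aml += -vote              # negative votes count toward AML
    v_win, v_lose = max(v_all, v_aml), min(v_all, v_aml)
    if v_win + v_lose == 0:
        return "uncertain", 0.0
    ps = (v_win - v_lose) / (v_win + v_lose)
    winner = "ALL" if v_all > v_aml else "AML"
    return (winner if ps > ps_threshold else "uncertain"), ps

# Hypothetical informative genes: (mu_ALL, mu_AML, w)
genes = [(8.0, 6.0, 1.0), (5.0, 7.0, -1.0), (7.0, 6.5, 0.4)]

clear_sample = [7.8, 5.2, 6.9]       # every gene votes ALL
ambiguous_sample = [7.8, 6.8, 6.75]  # votes roughly cancel out

print(predict(clear_sample, genes))
print(predict(ambiguous_sample, genes))
```

The first sample is assigned to ALL with a prediction strength of 1, while the second falls below the 0.3 threshold and is left unassigned.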

Page 22: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Prediction: Distinguishing AML vs. ALL

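Worksheet for the weighted votes (25 informative genes per group; cells are blank on the slide):

Gene g, P(g,c) > 0.3     w = P(g,c)   v   w*v
1, 2, 3, 4, 5, …, 25
Sum(w*v) = VALL

Gene g, P(g,c) < -0.3    w = P(g,c)   v   w*v
1, 2, 3, 4, 5, …, 25
Sum(w*v) = VAML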

Page 23: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Prediction: Distinguishing AML vs. ALL

• Checking model adequacy
– Cross-validation of training dataset
– Applied model to an independent dataset of 34 samples

Page 24: Application of Class Discovery and Class Prediction Methods to Microarray Data

Class Discovery

• Determine whether the samples can be divided based only on gene expression without regard to the class labels
– Self-organizing maps

Page 25: Application of Class Discovery and Class Prediction Methods to Microarray Data

Hypothesis Testing

• The hypothesis that two means μ1 and μ2 are equal is called a null hypothesis, commonly abbreviated H0.

• This is typically written as H0: μ1 = μ2

• Its antithesis is the alternative hypothesis, HA: μ1 ≠ μ2

Page 26: Application of Class Discovery and Class Prediction Methods to Microarray Data

Hypothesis Testing

• A statistical test of hypothesis is a procedure for assessing the compatibility of the data with the null hypothesis.
– The data are considered compatible with H0 if any discrepancy from H0 could readily be due to chance (i.e., sampling error).
– Data judged to be incompatible with H0 are taken as evidence in favor of HA.

Page 27: Application of Class Discovery and Class Prediction Methods to Microarray Data

Hypothesis Testing

• If the sample means calculated are identical, we would suspect the null hypothesis is true.

• Even if the null hypothesis is true, we do not really expect the sample means to be identically equal because of sampling variability.

• We would feel comfortable concluding H0 is true if the difference in the sample means does not exceed a couple of standard errors.

Page 28: Application of Class Discovery and Class Prediction Methods to Microarray Data

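T-test

• In testing H0: μ1 = μ2 against HA: μ1 ≠ μ2, note that we could have restated the null hypothesis as

H0: μ1 - μ2 = 0 and HA: μ1 - μ2 ≠ 0

• To carry out the t-test, the first step is to compute the test statistic and then compare the result to a t-distribution with the appropriate degrees of freedom (df)

$$t_g = \frac{(\bar{y}_1-\bar{y}_2)-0}{SE_{\bar{y}_1-\bar{y}_2}} = \frac{\bar{y}_1-\bar{y}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

$$df = \frac{\left(SE_1^2+SE_2^2\right)^2}{\frac{SE_1^4}{n_1-1}+\frac{SE_2^4}{n_2-1}}, \qquad SE_k = \frac{s_k}{\sqrt{n_k}}$$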
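The test statistic and the degrees of freedom can be computed directly from the summary statistics of the two groups. The function below is a minimal sketch of the unequal-variance (Welch) form, t = (ȳ1 - ȳ2) / sqrt(s1²/n1 + s2²/n2) with the Satterthwaite approximation for df; the function name is hypothetical, and the inputs are the means, variances, and sample sizes reported for probe set 208680_at later in the deck.

```python
import math

def welch_t_test(mean1, var1, n1, mean2, var2, n2):
    """Return (t, df) for H0: mu1 - mu2 = 0 without assuming equal variances."""
    se1_sq = var1 / n1            # squared standard error of group 1 mean
    se2_sq = var2 / n2            # squared standard error of group 2 mean
    se_diff = math.sqrt(se1_sq + se2_sq)
    t = (mean1 - mean2 - 0) / se_diff
    # Satterthwaite approximation to the degrees of freedom
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, df

# Summary statistics for probe set 208680_at (ALL = group 1, AML = group 2)
t, df = welch_t_test(2038.5, 7051.284, 8, 1956.1, 3062.991, 8)
print(f"t = {t:.3f}, df = {df:.3f}")
```

With these inputs the function gives t of about 2.317 on about 12.116 degrees of freedom, the values worked out on the probe set slide.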

Page 29: Application of Class Discovery and Class Prediction Methods to Microarray Data

T-test

• Data must be independent random samples from their respective populations

• Sample size should either be large or, in the case of small sample sizes, the population distributions must be approximately normally distributed.

• When assumptions are not met, non-parametric alternatives are available (Wilcoxon Rank Sum/Mann-Whitney Test)
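For the non-parametric alternative, the Mann-Whitney U statistic only depends on how the two samples are ordered relative to each other. The sketch below counts, for every ALL/AML pair, which value is larger (ties count one half). It computes the statistic only; converting U to a p-value needs exact tables or a normal approximation, which is not shown. The function name is hypothetical and the data are the probe set 208680_at values from the next slide.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y): pairwise wins of x over y and of y over x."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0      # x value ranks above y value
            elif xi == yj:
                u_x += 0.5      # ties are split between the groups
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

all_expr = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_expr = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

u_all, u_aml = mann_whitney_u(all_expr, aml_expr)
print(u_all, u_aml)
```

The two statistics always sum to n1 × n2 (64 here); a U far from n1 × n2 / 2 points to a shift between the groups.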

Page 30: Application of Class Discovery and Class Prediction Methods to Microarray Data

T-test: Probe set 208680_at

Sample number   ALL        AML
1               2013.7     1974.6
2               2141.9     2027.6
3               2040.2     1914.8
4               1973.3     1955.8
5               2162.2     1963.0
6               1994.8     2025.5
7               1913.3     1865.1
8               2068.7     1922.4
ȳ (mean)        2038.5     1956.1
s^2             7051.284   3062.991
n               8          8

Page 31: Application of Class Discovery and Class Prediction Methods to Microarray Data

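T-test: Probe set 208680_at

$$t_g = \frac{(\bar{y}_1-\bar{y}_2)-0}{SE_{\bar{y}_1-\bar{y}_2}} = \frac{2038.5-1956.1}{\sqrt{\frac{7051.3}{8}+\frac{3062.99}{8}}} = 2.317$$

[Figure: t-distribution density; x-axis: t (-3 to 3), y-axis: Probability (0.0 to 0.4)]

$$df = \frac{\left(\frac{7051.3}{8}+\frac{3062.99}{8}\right)^2}{\frac{(7051.3/8)^2}{8-1}+\frac{(3062.99/8)^2}{8-1}} = 12.116$$

P=0.039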
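The p-value can be checked without statistical software by integrating the t density numerically. The sketch below assumes a two-sided test and uses Simpson's rule on the interval from 0 to |t|; the function names are hypothetical, and t = 2.317 and df = 12.116 are taken from the slide.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with (possibly fractional) df."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=10000):
    """Two-sided p-value: 1 minus the area between -|t| and |t|.

    The area on [0, |t|] is found with Simpson's rule (steps must be even)
    and doubled by symmetry.
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area

p = two_sided_p(2.317, 12.116)
print(f"two-sided p = {p:.3f}")
```

The result should come out close to the P=0.039 reported above.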