29
Classification of Microarray Data

Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

  • View
    223

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Classification of Microarray Data

Page 2: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data Analysis

Clustering PCA Classification Promoter Analysis

Meta analysis Survival analysis Regulatory Network

Normalization

Image analysis

The DNA Array Analysis Pipeline

ComparableGene Expression Data

Page 3: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Outline

• Example: Childhood leukemia

• Classification– Method selection

– Feature selection

– Cross-validation

– Training and testing

• Example continued: Results from a leukemia study

Page 4: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Childhood Leukemia

• Cancer in the cells of the immune system

• Approx. 35 new cases in Denmark every year

• 50 years ago – all patients died

• Today – approx. 78% are cured

• Risk groups:

– Standard, Intermediate, High, Very high, Extra high

• Treatment

– Chemotherapy

– Bone marrow transplantation

– Radiation

Page 5: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Prognostic Factors

Good prognosis Poor prognosis

Immuno-phenotype precursor B T

Age 1-9 ≥10

Leukocyte count Low (<50*109/L) High (>100*109/L)

Number of chromosomes Hyperdiploidy (>50)

Hypodiploidy (<46)

Translocations t(12;21) t(9;22), t(1;19)

Treatment response Good response:Low MRD

Poor response:High MRD

Page 6: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Prognostic factors:

ImmunophenotypeAgeLeukocyte countNumber of chromosomesTranslocationsTreatment response

Risk Classification Today

Risk group:

StandardIntermediateHighVery highExtra High

Patient:

Clinical data

Immuno-phenotype

Morphology

Genetic measurements

Microarraytechnology

Page 7: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Study of Childhood Leukemia

• Diagnostic bone marrow samples from leukemia patients

• Platform: Affymetrix Focus Array– 8793 human genes

• Immunophenotype– 18 patients with precursor B immunophenotype

– 17 patients with T immunophenotype

• Outcome 5 years from diagnosis– 11 patients with relapse

– 18 patients in complete remission

Page 8: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Reduction of data

• Dimension reduction– PCA

• Feature selection (gene selection)– Significant genes: t-test

– Selection of a limited number of genes

Page 9: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Microarray DataClass precursorB T ...

Patient1 Patient2 Patient3 Patient4 ...

Gene1 1789.5 963.9 2079.5 3243.9 ...

Gene2 46.4 52.0 22.3 27.1 ...

Gene3 215.6 276.4 245.1 199.1 ...

Gene4 176.9 504.6 420.5 380.4

Gene5 4023.0 3768.6 4257.8 4451.8

Gene6 12.6 12.1 37.7 38.7

... ... ... ... ... ...

Gene8793 312.5 415.9 1045.4 1308.0

Page 10: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Outcome: PCA on all Genes

Page 11: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Outcome Prediction: CC against the Number of Genes

-0.1

0.1

0.3

0.5

0.7

0.9

0 20 40 60 80 100 120 140

Number of top ranking genes

Co

rrel

atio

n c

oef

fici

ent

(CC

)

Nearest centroid

Page 12: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

• Assumptions:– Data e.g.. Gaussian distributed

– Variance and covariance the same for all classes

• Calculation of class probability

Linear discriminant analysis

Page 13: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

• Calculation of a centroid for each class

• Calculation of the distance between a test sample and each class centroid

• Class prediction by the nearest centroid method

Nearest Centroid

kCj kijik nxx /

Page 14: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

• Based on distance measure– For example Euclidian distance

• Parameter k = number of nearest neighbors– k=1– k=3– k=... – Always an odd number

• Prediction by majority vote

K-Nearest Neighbor

Page 15: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Support Vector Machines

• Machine learning

• Relatively new and highly theoretic

• Works on non-linearly separable data

• Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points

Page 16: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Neural Networks

Gene1 Gene2 Gene3 ...

B T

Page 17: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Comparison of Methods

Linear discriminant analysis

Nearest centroid

KNN

Neural networks

Support vector machines

Simple method

Based on distance calculation

Good for simple problems

Good for few training samples

Distribution of data assumed

Advanced methods

Involve machine learning

Several adjustable parameters

Many training samples required

(e.g.. 50-100)

Flexible methods

Page 18: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Cross-validation

Given data with 100 samples

Cross-5-validation:Training: 4/5 of data (80 samples)Testing: 1/5 of data (20 samples)- 5 different models and performance results

Leave-one-out cross-validation (LOOCV)Training: 99/100 of data (99 samples)Testing: 1/100 of data (1 sample)- 100 different models

Page 19: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Validation

• Definition of – True and False positives

– True and False negatives

Actual class B B T T

Predicted class B T T B

TP FN TN FP

Page 20: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Sensitivity: the ability to detect "true positives"

TP

TP + FN

Specificity: the ability to avoid "false positives"

TN

TN + FP

Positive Predictive Value (PPV)

TP

TP + FP

Sensitivity and Specificity

Agnieszka
The most sensitive search finds all true matches, but might have lots of false positivesthe probability that the test is positive given that the patient is sick.
Agnieszka
The most specific search will return only true matches, but might have lots of false negativesthe probability that the test is negative given that the patient is not sick.
Agnieszka
In practice, there often is a tradeoff, and you can't achieve both.When one chooses which algorithm to use, there is a trade off between these two characters. It quite trivial to create an algorithms which will optimize one of these properties, the problem is to create algorithm which will achieve both of them.
Page 21: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Accuracy

• Definition: TP + TN

TP + TN + FP + FN

• Range: [0 … 1]

Page 22: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

• Definition: TP·TN - FN·FP

√(TN+FN)(TN+FP)(TP+FN)(TP+FP)

• Range: [-1 … 1]

Matthews correlation coefficient

Page 23: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Overview of Classification

Expression data

Subdivision of data for cross-validationinto training sets and test sets

Feature selection (t-test)Dimension reduction (PCA)

Training of classifier: - using cross-validation

- choice of method- choice of optimal parameters

Testing of classifier Independent test set

Page 24: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Important Points

• Avoid over fitting

• Validate performance– Test on an independent test set

– Use cross-validation

• Include feature selection in cross-validation

Why?

– To avoid overestimation of performance!

– To make a general classifier

Page 25: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Multiclass classification

• Same prediction methods– Multiclass prediction often implemented

– Often bad performance

• Solution– Division in small binary (2-class) problems

Page 26: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Study of Childhood Leukemia: Results

• Classification of immunophenotype (precursorB and T)– 100% accuracy

• During the training • When testing on an independent test set

– Simple classification methods applied• K-nearest neighbor• Nearest centroid

• Classification of outcome (relapse or remission)– 78% accuracy (CC = 0.59)– Simple and advanced classification methods applied

Page 27: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Summary

• When do we use classification?• Reduction of dimension often necessary

– T-test, PCA

• Methods– K nearest neighbor– Nearest centroid– Support vector machines– Neural networks

• Validation– Cross-validation– Accuracy and correlation coefficient

Page 28: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Coffee break

Next: Exercise

Page 29: Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical

Prognostic factors:

ImmunophenotypeAgeLeukocyte countNumber of chromosomesTranslocationsTreatment response

Risk classification in the future ?

Risk group:

StandardIntermediateHighVery highEkstra high

Patient:

Clincal data

Immunopheno-typing MorphologyGenetic measurements

Microarraytechnology

Customized treatment