Clustering megavariate data Dhammika Amaratunga · microarray with 45101 genes. C2 C3 C4 C5 C6 KO: T1 T2 T3 T4 T5 T6 Note 1: Data available for early stage and late stage development

1

Clustering megavariate data

Dhammika Amaratunga

Team Leader - Statistics in Drug Discovery

Senior Research Fellow - Nonclinical Statistics

Rutgers Biostatistics Day, April 2010

Joint work with

Javier Cabrera, Yauheniya Cherkas, Vladimir Kovtun, YungSeop Lee, and others

2

Cluster analysis

Data collected for N samples.

For each sample, measurements made on G variables.

Data represented as a GxN matrix.

The objective is to cluster the N samples into a few classes in such a way that samples within a class are collectively more similar to each other than to samples in any other class.

[Figure: samples labeled C1, C2, C3, C4, C5 and C6.]

3

Cluster analysis methods

There are many standard approaches available (e.g., partitioning methods such as K-means, hierarchical methods such as average linkage, machine learning methods such as self-organizing maps).

For example, hierarchical clustering is one of the more popular clustering methods.

-- Define an inter-sample dissimilarity

(e.g., Euclidean distance, 1-Correlation)

-- Define an inter-cluster dissimilarity

(e.g., Dissimilarity between a pair of clusters is the average dissimilarity between a sample in one cluster and a sample in the other cluster)

-- Combine “close” samples/clusters sequentially

4

Hierarchical clustering: how it works

[Figure: seven samples, SAMPLE 1 to SAMPLE 7; labels shown in the order 1, 2, 3, 4, 7, 6, 5.]
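The three steps listed for hierarchical clustering can be sketched in a few lines. This is a minimal illustration only, assuming Euclidean distance as the inter-sample dissimilarity and average linkage as the inter-cluster dissimilarity; the function name `average_linkage` and the toy data are hypothetical and are not taken from the talk.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def average_linkage(samples, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    samples: list of equal-length numeric vectors, one per sample.
    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    # Step 1: inter-sample dissimilarity (Euclidean distance here).
    d = [[dist(a, b) for b in samples] for a in samples]
    clusters = [[i] for i in range(len(samples))]

    def between(c1, c2):
        # Step 2: inter-cluster dissimilarity = average over cross-cluster pairs.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    # Step 3: combine the closest pair of clusters, one merge at a time.
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy data (hypothetical): seven samples measured on two variables.
toy = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1),
       (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
result = average_linkage(toy, n_clusters=2)
print(result)
```

Replacing `dist` with a 1-Correlation dissimilarity, or `between` with another linkage rule, changes the method without changing the overall loop.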

5

The catch

In many contemporary settings, the data are megavariate, i.e., N<<G (e.g., in high throughput gene expression studies G is around 1,000-50,000 while N is around 10-500); in such cases, most predictors are noninformative and could overwhelm the dissimilarity estimates.
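A small simulation can make the "overwhelming" effect concrete. The sketch below is an illustration under assumed settings (Gaussian data, 10 informative variables, 6 samples per class, Euclidean distance); none of these numbers come from the talk. It compares the average between-class distance with the average within-class distance, with and without a large block of noninformative variables.

```python
import random
from itertools import combinations
from math import dist


def separation_ratio(n_noise, n_info=10, n_per_class=6, shift=3.0, seed=0):
    """Mean between-class distance divided by mean within-class distance."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            # Informative variables differ in mean between the two classes.
            info = [rng.gauss(shift * label, 1.0) for _ in range(n_info)]
            # Noninformative variables have the same distribution in both classes.
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            samples.append(info + noise)
            labels.append(label)
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        d = dist(samples[i], samples[j])
        (within if labels[i] == labels[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))


ratio_clean = separation_ratio(n_noise=0)
ratio_noisy = separation_ratio(n_noise=2000)
print(f"no noise variables:   ratio = {ratio_clean:.2f}")
print(f"2000 noise variables: ratio = {ratio_noisy:.2f}")
```

A ratio close to 1 means that samples from different classes look about as far apart as samples from the same class, which is the situation a clustering algorithm cannot resolve.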

Example: Use gene expression data to discover unexpected novel classes among the samples (e.g., in leukemia patients, subtypes of leukemia).

6

Case study

Experiment: Compare the gene expression profiles of 6 knockout (KO) mice vs 6 wild-type (WT) mice using a microarray with 45101 genes.

WT: C1 C2 C3 C4 C5 C6

KO: T1 T2 T3 T4 T5 T6

Note 1: Data are available for early-stage and late-stage development of these mice.

Note 2: These data are useful for illustration but are not representative of a typical cluster analysis situation, because here the classes are known.

7

Gene expression data

Gene expression levels (measured via microarrays) for G genes in N samples:

          C1      C2      C3      C4      C5      C6   …
G1        83      94      82     111     130     122
G2        16      14       7       2      11      33
G3       490     879     193     604    1031     962
G4     46458   49268   74059   44849   42235   44611
G5        32      70     185      20      25      19
G6      1067     891     546     906    1038    1098
G7       118     111      95     896     536     695
G8        10      30      25      24      31      28
G9       166     132     162      27     109     213
G10      136     139      44      62      23     135
. . .

Preprocess and analyze

8

Biplots of data from knockout experiment

[Figure: two biplot panels, Early stage and Late stage.]

9

Clustering of data from knockout experiment


[Figure: two clustering panels, annotated MR=5/12 and MR=0/12.]

10

Filtering

Problem: With megavariate data, most predictors are noninformative and will overwhelm the dissimilarity estimates.

Usual (partial) resolution: Filter the genes based on variance or coefficient of variation to reduce the error rates (but which genes are informative?).

Resolution: Ensemble approach: Filter genes repeatedly and apply an ensemble technique.
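The usual variance or coefficient-of-variation filter can be sketched as follows. The expression values are the ten rows shown on the "Gene expression data" slide, used raw purely for illustration (real data would be preprocessed first); the function name `filter_genes` is hypothetical.

```python
from statistics import mean, stdev, variance

# Rows G1..G10 and columns C1..C6 from the "Gene expression data" slide.
expr = {
    "G1": [83, 94, 82, 111, 130, 122],
    "G2": [16, 14, 7, 2, 11, 33],
    "G3": [490, 879, 193, 604, 1031, 962],
    "G4": [46458, 49268, 74059, 44849, 42235, 44611],
    "G5": [32, 70, 185, 20, 25, 19],
    "G6": [1067, 891, 546, 906, 1038, 1098],
    "G7": [118, 111, 95, 896, 536, 695],
    "G8": [10, 30, 25, 24, 31, 28],
    "G9": [166, 132, 162, 27, 109, 213],
    "G10": [136, 139, 44, 62, 23, 135],
}


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (0 if the mean is 0)."""
    m = mean(values)
    return stdev(values) / m if m else 0.0


def filter_genes(expr, keep, score=variance):
    """Return the names of the `keep` genes with the highest score."""
    ranked = sorted(expr, key=lambda g: score(expr[g]), reverse=True)
    return ranked[:keep]


top_by_variance = filter_genes(expr, keep=3, score=variance)
top_by_cv = filter_genes(expr, keep=3, score=coefficient_of_variation)
print("variance filter:", top_by_variance)
print("CV filter:      ", top_by_cv)
```

The two criteria do not select the same genes, which is one way to see why a single filter only partially resolves the problem.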

11

Gene expression matrix

           S1      S2      S3      S4      S5      S6
G8521    1003    1306     713    1628    1268    1629
G8522     890     705     566     975     883    1005
G8523     680     749     811     669     724     643
G8524     262     311     336    1677    1286    1486
G8525     254     383     258    1652    1799    1645
G8526      81     140     288     298     241     342
G8527    4077    2557    2600    3394    2926    2755
G8528    2571    1929    1406    2439    1613    5074
G8529      55      73     121      22     141      44
G8530    1640    1693    1517    1731    1861    1550
G8531     168     229     284     220     310     315
G8532     323     258     359     345     308     315
G8533   12131   11199   14859   11544   11352   11506
G8534   11544   11352   12131   11199   14859   12529
G8535    1929    1406    2439     254     383     258
G8536     191     140     288     298     241     342
G8537    4077    2557    2600    3394    2926    2755
G8538    2571    1613    5074    1652    1799    1645
G8539      55      73     121      22      91      24
G8540    1640    1693    1517    1731    1861    1750
G8541     168     229     284     220     312     335
G8542     323     258     359     345     298     325
G8543    2007    1878    1502    1758    2480    1731
G8544    2480    1731    2007    1878    1502    1758
G8545    1652    1799    1645     254     383     258
G8546     298     241     342      81     150     298
G8547    2607    3394    2926    2755    3077    2227
G8548    2571    1929    1406    2439    1613    5074
G8549     121      22      55     730     201      35
G8550    1640    1693    1517    1731    1861    1550

Select n samples and g genes

           S1      S2      S4      S5      S6
G8523     680     749     669     724     643
G8524     262     311    1677    1286    1486
G8528    2571    1929    2439    1613    5074
G8530    1640    1693    1731    1861    1550
G8537    4077    2557    3394    2926    2755
G8545    1652    1799     254     383     258
G8547    2607    3394    2755    3077    2227

Compute similarity (successive snapshots of the running counts)

Similarity  S1  S2  S3  S4  S5  S6
S1           0   1   0   0   0   0
S2           1   0   0   0   0   0
S3           0   0   0   0   0   0
S4           0   0   0   0   1   1
S5           0   0   0   1   0   1
S6           0   0   0   1   1   0

Similarity  S1  S2  S3  S4  S5  S6
S1           0   1   1   1   0   0
S2           1   0   0   0   0   0
S3           1   0   0   1   0   0
S4           1   0   1   0   1   1
S5           0   0   0   1   0   2
S6           0   0   0   1   2   0

Similarity  S1  S2  S3  S4  S5  S6
S1           0   1   1   1   0   0
S2           1   0   0   0   1   1
S3           1   0   0   2   0   0
S4           1   0   2   0   1   1
S5           0   1   0   1   0   3
S6           0   1   0   1   3   0

Similarity  S1  S2  S3  S4  S5  S6
S1           0   1   2   2   0   0
S2           1   0   0   0   1   1
S3           2   0   0   3   0   0
S4           2   0   3   0   1   1
S5           0   1   0   1   0   4
S6           0   1   0   1   4   0

Similarity  S1  S2  S3  S4  S5  S6
S1           0   2   3   3   0   0
S2           2   0   1   1   1   1
S3           3   1   0   4   0   0
S4           3   1   4   0   1   1
S5           0   1   0   1   0   5
S6           0   1   0   1   5   0

Similarity  S1  S2  S3  S4  S5  S6
S1           0   6   7   7   0   0
S2           6   0   5   5   1   1
S3           7   5   0   8   0   0
S4           7   5   8   0   2   2
S5           0   2   0   2   0  10
S6           0   2   0   2  10   0

Final Clusters

{S1,S2,S3,S4} {S5,S6}
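The similarity matrices on this slide behave like running counts of how often two samples land in the same cluster. The sketch below reproduces the first two matrices under that reading. The cluster memberships used for the two iterations are inferred from the differences between the matrices, not stated on the slide, and the helper name `add_coclustering` is hypothetical.

```python
from itertools import combinations

SAMPLES = ["S1", "S2", "S3", "S4", "S5", "S6"]


def add_coclustering(counts, clusters):
    """Add 1 to counts[i][j] for every pair of samples sharing a cluster."""
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            i, j = SAMPLES.index(a), SAMPLES.index(b)
            counts[i][j] += 1
            counts[j][i] += 1
    return counts


counts = [[0] * len(SAMPLES) for _ in SAMPLES]

# Iteration 1: S3 was not drawn (it is absent from the selected submatrix);
# assumed clusters {S1,S2} and {S4,S5,S6}.
add_coclustering(counts, [["S1", "S2"], ["S4", "S5", "S6"]])
after_first = [row[:] for row in counts]

# Iteration 2: assumed clusters {S1,S3,S4} and {S5,S6}, with S2 on its own.
add_coclustering(counts, [["S1", "S3", "S4"], ["S5", "S6"]])
after_second = [row[:] for row in counts]

for name, row in zip(SAMPLES, after_second):
    print(name, *row)
```

Pairs with large counts in the last matrix (for example S5 and S6, or S3 and S4) are the ones that end up together in the final clusters.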

12

Data

-- Draw a simple random sample of cases
-- Draw a random sample of genes (simple, or weighted based on variance)
-- Run a cluster analysis (HC (Ave, Ward's), Kmeans, …)
-- Iterate

ABC dissimilarities: ABC(i,j) = 1 - relative frequency of how often samples i and j cluster together

The ABC dissimilarities are the input to a clustering algorithm.

Ref: Amaratunga, Cabrera and Kovtun (Biostatistics, 2007)
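The resampling loop on this slide can be sketched end to end. This is not the authors' implementation: it assumes a plain k-means base clusterer with a deterministic farthest-point start, gene weights equal to the sample variance, weighted sampling without replacement via random keys, and a relative frequency computed over the iterations in which both samples were drawn. All function names and the synthetic data are hypothetical.

```python
import random
from itertools import combinations
from math import dist
from statistics import variance


def kmeans_labels(points, k, n_iter=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def abc_dissimilarity(data, k, n_cases, n_genes, n_iter=100, seed=0):
    """data[s][g] is the expression of gene g in sample s.

    Entry (i, j) of the result is 1 - (times i and j clustered together)
    / (times i and j were drawn together).
    """
    rng = random.Random(seed)
    n, g = len(data), len(data[0])
    # Gene weights based on variance across samples (the "weighted" option).
    weights = [variance([row[j] for row in data]) for j in range(g)]
    together = [[0] * n for _ in range(n)]
    drawn = [[0] * n for _ in range(n)]
    for _ in range(n_iter):
        cases = rng.sample(range(n), n_cases)
        # Weighted sampling without replacement: largest random keys win.
        keys = [rng.random() ** (1.0 / w) if w > 0 else 0.0 for w in weights]
        genes = sorted(range(g), key=lambda j: keys[j], reverse=True)[:n_genes]
        sub = [[data[s][j] for j in genes] for s in cases]
        labels = kmeans_labels(sub, k)
        for (a, la), (b, lb) in combinations(zip(cases, labels), 2):
            drawn[a][b] += 1
            drawn[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    abc = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Pairs never drawn together get the maximum dissimilarity (assumption).
        freq = together[i][j] / drawn[i][j] if drawn[i][j] else 0.0
        abc[i][j] = abc[j][i] = 1.0 - freq
    return abc


# Synthetic check (hypothetical data): 8 samples in two groups of 4,
# with 50 informative genes out of 200.
data_rng = random.Random(1)
truth = [0, 0, 0, 0, 1, 1, 1, 1]
data = [[data_rng.gauss(6.0 * t if j < 50 else 0.0, 1.0) for j in range(200)]
        for t in truth]
abc = abc_dissimilarity(data, k=2, n_cases=6, n_genes=30)
```

The resulting matrix would then be passed to a dissimilarity-based method such as hierarchical clustering, in place of Euclidean distance or 1-Correlation.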

13

ABC clustering of data from knockout experiment


[Figure: two ABC clustering panels, annotated MR=2/12 and MR=0/12.]

14


ABC-MDS plot of data from knockout experiment

15

Within-cluster and between-cluster dissimilarities

16

More proof-of-concept examples

Try on data in which the clusters are known.

Misclassification Rates (%)

Method               Golub    AMS    ALL   Colon
Ward's with ABC       18.1    1.4    0.0     9.7
Ward's with 1-Cor     23.6    9.7    2.3    48.4
Single Linkage        47.0   47.0   25.0    37.0
Complete Linkage      37.5   23.6   41.4    45.0
Average Linkage       47.2   27.8   26.5    38.7
K-means               20.8    5.5   42.2    48.4
PAM                   23.6    8.3    2.3    16.1
Random Forest         43.0   26.4   48.0    43.5
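Because cluster labels are arbitrary, a misclassification rate for a clustering has to be computed after matching clusters to the known classes. One common convention, assumed here rather than taken from the talk, is to take the matching with the fewest errors. The cluster assignment in the example is hypothetical and was chosen only to give 5 errors out of 12.

```python
from itertools import permutations


def misclassification_count(truth, clusters):
    """Fewest mislabeled samples over all one-to-one cluster-to-class matchings.

    Assumes there are no more clusters than known classes.
    """
    classes = sorted(set(truth))
    found = sorted(set(clusters))
    if len(found) > len(classes):
        raise ValueError("more clusters than classes")
    best = len(truth)
    for assigned in permutations(classes, len(found)):
        mapping = dict(zip(found, assigned))
        errors = sum(mapping[c] != t for t, c in zip(truth, clusters))
        best = min(best, errors)
    return best


# Known classes for 6 WT and 6 KO mice, and a hypothetical 2-cluster result.
truth = ["WT"] * 6 + ["KO"] * 6
clusters = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1]
errors = misclassification_count(truth, clusters)
print(f"MR = {errors}/{len(truth)} = {100 * errors / len(truth):.1f}%")
```

Trying every matching is only practical for a handful of classes, which is the case in all the examples shown here.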

17

More proof-of-concept examples (ctd)

… with feature selection

Misclassification Rates (%)

Method               Golub    AMS    ALL   Colon
Ward's with ABC       18.1    1.4    0.0     9.7
Ward's with 1-Cor      6.9   13.9    0.0    24.2
Single Linkage        45.8   58.3   26.6    35.5
Complete Linkage      29.2   13.9    0.0    27.4
Average Linkage        5.6   30.6    0.0    37.1
K-means                6.9    6.9    0.0    14.5
PAM                    8.3   13.9    0.0    12.9
Random Forest         23.6   12.5    0.0    11.3

18

Hepatotoxicity example (1)

In this experiment N=87 compounds were tested in rats for a certain type of hepatotoxicity.

19


ABC was run on this dataset.

[Figure: scatter plot of the compounds, each labeled by an abbreviated name (e.g., Eth, Ery, Rif); axes cmdobj2[,1] and cmdobj2[,2].]

20

In this case, three genes were thought to be implicated in the toxicity of interest.



[Figure: scatter plot of the compounds, each labeled by an abbreviated name (e.g., Ethi, Eryt, Rifa); axes cmdobj2[,1] and cmdobj2[,2].]

Running ABC with weights proportional to the maximum correlation to these 3 genes gave a much more interesting result.
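One way to build such weights is sketched below, as an illustration rather than the procedure actually used: each gene's weight is taken proportional to its largest absolute Pearson correlation with any of the genes of interest. Using the absolute value and normalizing the weights to sum to 1 are assumptions, and the gene names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_weights(expr, seed_genes):
    """Weight each gene by its maximum |correlation| with any seed gene."""
    raw = {g: max(abs(pearson(values, expr[s])) for s in seed_genes)
           for g, values in expr.items()}
    total = sum(raw.values())
    return {g: w / total for g, w in raw.items()}


# Hypothetical expression profiles over five samples.
expr = {
    "seedA": [1, 2, 3, 4, 5],
    "seedB": [5, 3, 4, 1, 2],
    "seedC": [2, 2, 3, 5, 1],
    "g1": [2, 4, 6, 8, 10],   # tracks seedA exactly
    "g2": [3, 1, 4, 1, 5],    # only loosely related to the seeds
}
weights = correlation_weights(expr, ["seedA", "seedB", "seedC"])
for gene, w in weights.items():
    print(f"{gene}: {w:.3f}")
```

These weights would replace the variance-based weights when drawing the random gene subsets, so that genes related to the genes of interest are selected more often.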

Extension: ensemble classifiers

Data

-- Draw a simple random sample of subjects
-- Draw a simple random sample of genes
-- Construct classifier (Tree (Random Forest*), LDA, …)
-- Collate results

Ref: Breiman (Machine Learning, 2001), Amaratunga et al (2009)

22

Predict using classifier

Prediction: majority vote
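The ensemble scheme (sample subjects, sample genes, construct a classifier, collate by majority vote) can be sketched as follows. To keep the example short, a nearest-centroid rule is swapped in as the base classifier in place of the trees or LDA named on the slide; the function names and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from math import dist


def fit_centroids(X, y, genes):
    """Nearest-centroid base learner restricted to the given genes."""
    centroids = {}
    for label in set(y):
        rows = [[x[j] for j in genes] for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def train_ensemble(X, y, n_models, n_subjects, n_genes, seed=0):
    """Each model sees a random sample of subjects and a random sample of genes."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        subjects = rng.sample(range(len(X)), n_subjects)
        if len({y[i] for i in subjects}) < len(set(y)):
            continue  # redraw if a class is missing from the sample
        genes = rng.sample(range(len(X[0])), n_genes)
        centroids = fit_centroids([X[i] for i in subjects],
                                  [y[i] for i in subjects], genes)
        models.append((genes, centroids))
    return models


def predict(models, x):
    """Each model votes for its nearest centroid; the majority label wins."""
    votes = Counter()
    for genes, centroids in models:
        point = [x[j] for j in genes]
        votes[min(centroids, key=lambda lab: dist(point, centroids[lab]))] += 1
    return votes.most_common(1)[0][0]


# Synthetic training data (hypothetical): 6 WT and 6 KO samples, 100 genes,
# of which the first 30 are shifted upward in the KO group.
rng = random.Random(2)
y = ["WT"] * 6 + ["KO"] * 6
X = [[rng.gauss(3.0 if (lab == "KO" and j < 30) else 0.0, 1.0)
      for j in range(100)] for lab in y]
models = train_ensemble(X, y, n_models=51, n_subjects=8, n_genes=15)

new_wt = [0.0] * 100                 # a sample at the WT group mean
new_ko = [3.0] * 30 + [0.0] * 70     # a sample at the KO group mean
print(predict(models, new_wt), predict(models, new_ko))
```

Because each model leaves some subjects out, those subjects can also be used to estimate an out-of-bag error rate, which is the quantity reported in the case study that follows.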

23

Case study: KO experiment

Try on data in which the classes are known.

Out-of-bag error rates

                                RF    RF(p)     ERF   E-LDA  EE-LDA
Slc17A5 Day 0                0.583    0.583   0.167   0.583   0.083
Slc17A5 Day 18               0.083    0.083   0.000   0.000   0.000
Slc17A5 Day 0 (scrambled)    0.750    0.750   0.833   0.833   0.833
Slc17A5 Day 18 (scrambled)   0.583    0.667   0.667   0.583   0.583

Ref: Amaratunga, Cabrera & Lee (Bioinformatics, 2008)

24

Wrap Up

Megavariate data are becoming more and more prevalent.

Megavariate data introduce special challenges:

-- overparametrized and undersampled

-- overfitting and redundancy

-- computationally challenging

In this setting, ensemble methods are among the best choices for classification.

25

Wrap Up

Scientific collaborators: Michael McMillian, Jennifer Sasaki

References:

D Amaratunga and J Cabrera (2004) Exploration and Analysis of DNA Microarray and Protein Array Data. John Wiley.

D Amaratunga, J Cabrera and V Kovtun (2008) Microarray learning with ABC, Biostatistics.

D Amaratunga, J Cabrera and Y S Lee (2008) Enriched random forests, Bioinformatics.

D Amaratunga, J Cabrera, Y Cherkas and Y S Lee (2009) Ensemble classifiers, in review.

Website (recent papers and software): www.amaratunga.com, www.rci.rutgers.edu/~cabrera/DNAMR

Email:[email protected]
