Classification of Drugs by SVM


CZ3253: Computer Aided Drug Design
Lecture 7: Drug Design Methods II: SVM

Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore


Classification of Drugs by SVM

• A drug is classified as either belonging (+) or not belonging (-) to a class.

Examples of drug classes: inhibitor of a protein, BBB penetrating, genotoxic.
Examples of protein classes: enzyme EC3.4 family, DNA-binding.

• By screening against all classes, the properties of a drug or the function of a protein can be identified.

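[Figure: a drug is passed to the Class-1 SVM, the Class-2 SVM, and the Class-3 SVM. Each SVM returns + or -, and the drug is assigned to the family whose SVM returns + (here, the drug belongs to Family-3).]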
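The screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three linear decision functions and their weights are hypothetical stand-ins for trained class SVMs, and the helper names `linear_svm` and `screen` are not from the lecture.

```python
def linear_svm(w, b):
    """Return a decision function f(x) = x . w - b (hypothetical, pre-trained)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - b


def screen(x, classifiers):
    """Return the names of all classes whose SVM output is positive (+)."""
    return [name for name, f in classifiers.items() if f(x) > 0]


# Hypothetical weights, chosen only to make the example concrete.
classifiers = {
    "Class-1": linear_svm([1.0, 0.0, 0.0], 0.5),
    "Class-2": linear_svm([0.0, -1.0, 0.0], -0.5),
    "Class-3": linear_svm([0.0, 1.0, 1.0], 1.5),
}

drug = [0.0, 1.0, 1.0]  # feature vector of the query drug
print(screen(drug, classifiers))
```

Here the first two classifiers return a negative value and the third a positive one, so only Class-3 is reported.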

Classification of Drugs or Proteins by SVM

What is SVM?

• Support vector machines (SVMs) are a machine learning (statistical learning) method that learns from examples and classifies an object into one of two classes.

Advantages of SVM:

• Accommodates diverse class members (no "racial discrimination").

• Uses structure-derived physico-chemical features as the basis for drug classification (the algorithm requires no structural similarity).

SVM References

• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes (http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf)
• Publications on SVM drug prediction:
  – J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
  – J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
  – Toxicol. Sci. 79, 170 (2004)

Machine Learning Method

Inductive learning: example-based learning.

[Figure: positive examples and negative examples, each represented by a descriptor.]

Feature vectors: each descriptor is encoded as a feature vector.

A = (1, 1, 1)
B = (0, 1, 1)
C = (1, 1, 1)
D = (0, 1, 1)
E = (0, 0, 0)
F = (1, 0, 1)

SVM Method

Feature vectors in input space:

[Figure: the feature vectors (A, B, E, and F are labeled) plotted as points in the input space with axes X, Y, and Z.]

SVM Method

[Figure: protein family members and nonmembers separated by a border in the input space; after projection to a higher dimensional space they are separated by a new border.]

[Figures: the new border between protein family members and nonmembers, with the support vectors marked on either side of it.]

Best Linear Separator?

[Figures: candidate linear separators between the two classes.]

Find Closest Points in Convex Hulls

[Figure: c and d are the closest points of the convex hulls of the two classes.]

Plane Bisect Closest Points

\[ x \cdot w = b, \qquad w = d - c \]

[Figure: the plane x · w = b bisects the segment joining c and d.]
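As a minimal sketch of the bisecting-plane construction, the code below takes the two closest points c and d as given (finding them is the job of the quadratic program), sets w = d - c, and chooses b so that the plane x · w = b passes through their midpoint. The coordinates and the helper names `bisecting_plane` and `side` are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bisecting_plane(c, d):
    """Return (w, b) for the plane x . w = b that bisects the segment c-d."""
    w = [dj - cj for cj, dj in zip(c, d)]
    midpoint = [(cj + dj) / 2.0 for cj, dj in zip(c, d)]
    b = dot(w, midpoint)  # the midpoint lies on the plane
    return w, b


def side(x, w, b):
    """+1 if x lies on the same side as d, otherwise -1."""
    return 1 if dot(x, w) > b else -1


# Hypothetical closest points of the two convex hulls.
c = [0.0, 0.0]
d = [2.0, 2.0]
w, b = bisecting_plane(c, d)
print(w, b)
print(side(d, w, b), side(c, w, b))
```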

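Find using quadratic program

\[
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\lVert c - d \rVert^{2}, \qquad c = \sum_{i \in \text{Class } 1} \alpha_i x_i, \quad d = \sum_{i \in \text{Class } -1} \alpha_i x_i \\
\text{s.t.}\quad & \sum_{i \in \text{Class } 1} \alpha_i = 1, \qquad \sum_{i \in \text{Class } -1} \alpha_i = 1, \\
& \alpha_i \ge 0, \qquad i = 1, \dots, \ell
\end{aligned}
\]

Many existing and new solvers.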

Best Linear Separator: Supporting Plane Method

\[ x \cdot w = b + 1, \qquad x \cdot w = b - 1 \]

Maximize the distance between the two parallel supporting planes.

\[ \text{Distance} = \text{``Margin''} = \frac{2}{\lVert w \rVert} \]
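The margin formula is easy to check numerically. The sketch below computes 2/||w|| for a given weight vector and reports on which side of the two supporting planes a point falls. The weight vector, offset, and test points are arbitrary illustrative values, not taken from the lecture.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin(w):
    """Distance between the planes x . w = b + 1 and x . w = b - 1."""
    return 2.0 / math.sqrt(dot(w, w))


def classify(x, w, b):
    """+1 or -1 outside the supporting planes, 0 strictly between them."""
    score = dot(x, w) - b
    if score >= 1:
        return 1
    if score <= -1:
        return -1
    return 0


w, b = [3.0, 4.0], 5.0        # illustrative values; ||w|| = 5
print(margin(w))              # 2 / 5
print(classify([1.0, 1.0], w, b), classify([0.0, 1.0], w, b))
```

A smaller ||w|| gives a wider margin, which is why the supporting plane method maximizes the distance by minimizing the norm of w.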

Best Linear Separator?

SVM Method

Border line is nonlinear.

Non-linear transformation: use of kernel function.

[Figures: non-linear transformation of the input space.]
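To make the role of a kernel function concrete, the following sketch shows that a degree-2 polynomial kernel on two-dimensional inputs equals an ordinary dot product after an explicit non-linear map into three dimensions. The map `phi` and the choice of kernel are illustrative assumptions. The Gaussian (RBF) kernel is included because RBF models appear later in the lecture, with `gamma` as an assumed width parameter.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit map from 2-D input space to a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))  # the two values agree
print(rbf_kernel(x, x))                        # identical points give 1.0
```

The kernel returns the feature-space dot product without ever constructing the higher dimensional vectors, which is what makes a nonlinear border affordable.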

SVM for Classification of Drugs

How to represent a drug?

• Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
  – Simple molecular properties (molecular weight, no. of rotatable bonds, etc.; 18 in total)
  – Molecular connectivity and shape (28 in total)
  – Electro-topological state polarity (84 in total)
  – Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
  – Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
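The five descriptor groups add up to 18 + 28 + 84 + 13 + 16 = 159 features per structure. A minimal sketch of assembling such a vector follows. The group keys and the `assemble` helper are hypothetical names, and the zero values are placeholders rather than real descriptors.

```python
# Group sizes as listed on the slide; the key names are hypothetical.
GROUP_SIZES = {
    "simple_molecular": 18,
    "connectivity_shape": 28,
    "electrotopological": 84,
    "quantum_chemical": 13,
    "geometrical": 16,
}


def assemble(groups):
    """Concatenate the descriptor groups into one feature vector."""
    vector = []
    for name, size in GROUP_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder values only; real descriptors come from the structure.
placeholder = {name: [0.0] * size for name, size in GROUP_SIZES.items()}
print(len(assemble(placeholder)))
```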

SVM Feature Selection

CACO2 - 718 descriptors, average of 10 models

[Figure: predicted RT (min) versus observed RT (min); both axes run from -8 to -3.]

Test Q2 = .7073

Q2 is MSE scaled by variance:

Q2 = (mean square error) / (true variance)
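With this definition a perfect model has Q2 = 0, so smaller values are better. A small sketch of the calculation follows. The observed and predicted values are made-up numbers for illustration only, and the variance is taken as the population variance of the observed values, which is an assumption.

```python
def q2(observed, predicted):
    """Mean square error divided by the variance of the observed values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    variance = sum((o - mean_obs) ** 2 for o in observed) / n
    return mse / variance


# Illustrative numbers, not data from the lecture.
observed = [-8.0, -6.0, -4.0]
predicted = [-7.0, -6.0, -5.0]
print(q2(observed, predicted))
print(q2(observed, observed))  # perfect prediction
```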

Feature Selection

Using a subset of descriptors might greatly improve results.

• Do feature selection using a linear SVM with 1-norm regularization.

[Figure: 1-norm versus 2-norm.]

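Feature Selection via Sparse SVM/LP

• Construct linear ν-SVM using 1-norm LP:

\[
\begin{aligned}
\min_{w,\,b,\,z,\,z^{*},\,\epsilon}\quad & \lVert w \rVert_{1} + C \sum_{i=1}^{\ell} \left( z_i + z_i^{*} \right) + C \nu \epsilon \\
\text{s.t.}\quad & x_i \cdot w - b - y_i \le \epsilon + z_i, \\
& y_i - x_i \cdot w + b \le \epsilon + z_i^{*}, \\
& z_i,\ z_i^{*},\ \epsilon \ge 0, \qquad i = 1, \dots, \ell
\end{aligned}
\]

• Pick best C, ν for SVM
• Keep descriptors with nonzero coefficients, \(|w_i| > 0\)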

Bagged Feature Selection

• Partition the training data into a training set and a validation set.
• Run the linear SVM algorithm for feature selection to obtain a linear regression model.
• Repeat B times.
• Bag the B models and obtain a subset of features.

Make 20 models of the form

\[ w \cdot x - b = w_1 x_1 + w_2 x_2 + \dots + w_{718} x_{718} + w_r r - b \]

with only a few \(w_i \ne 0\), where r is a random variable.

Keep attributes with \(|w_i| > |w_r|\).
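One reading of this selection rule is to keep a descriptor only when its weight, averaged over the bagged models, outweighs that of the random variable r. The sketch below takes the fitted weights as given and applies that rule. Averaging absolute weights is an assumption, the weight values are invented for illustration, and `select_features` is a hypothetical helper.

```python
def select_features(weight_sets, names, random_name="r"):
    """Keep descriptors whose mean |weight| exceeds that of the random variable.

    weight_sets: one dict per bagged model, mapping descriptor name to weight.
    """
    n = len(weight_sets)

    def mean_abs(name):
        return sum(abs(ws.get(name, 0.0)) for ws in weight_sets) / n

    threshold = mean_abs(random_name)
    return [name for name in names if mean_abs(name) > threshold]


# Invented weights for three bagged linear models (illustration only).
models = [
    {"a.don": -0.8, "KB54": 0.0, "apol": 0.05, "r": 0.1},
    {"a.don": -0.6, "KB54": 0.6, "apol": 0.0, "r": -0.2},
    {"a.don": -0.7, "KB54": 0.0, "apol": 0.1, "r": 0.0},
]
print(select_features(models, ["a.don", "KB54", "apol"]))
```

A descriptor that cannot beat pure noise in the sparse linear models is unlikely to carry signal, which is the motivation for including r.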

Bagged SVM (RBF)

CACO2 - 31 Descriptors

[Figure: predicted RT (min) versus observed RT (min); both axes run from -8 to -3.]

Test Q2 = .134

Starplot Caco2 - 31 Descriptors

[Figure: star plots of the 31 selected descriptors.]

ABSDRN6, a.don, KB54, SMR.VSA2, BNP8, DRNB10, KB11, PEOE.VSA.FPPOS, ANGLEB45, PIPB53, DRNB00, PEOE.VSA.4, SlogP.VSA6, apol, ABSFUKMIN, PIPB04, PEOE.VSA.FPOL, PIPMAX, BNPB50, BNPB21, PEOE.VSA.FHYD, PEOE.VSA.PPOS, EP2, SlogP.VSA9, ABSKMIN, PEOE.VSA.FNEG, BNPB31, FUKB14, pmiZ, SIKIA, SlogP.VSA0

Chemistry In/Out Modeling

[Figure: workflow. Data + Descriptors feed Feature Selection, which leads both to Visualize Features, Assess Chemistry, and Chemistry Interpretation, and to Construct SVM Nonlinear model; the resulting SVM Model is applied to Test Data to Predict bioactivities.]

Bagged SVM (RBF)

CACO2 - 15 Descriptors

[Figure: predicted RT (min) versus observed RT (min); both axes run from -8 to -3.]

Test Q2 = .166

CACO2 – 15 Variables

a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, PEOE.VSA.FNEG, ABSKMIN, SIKIA, pmiZ, BNPB31, FUKB14, SlogP.VSA0

Chemical Insights

• Hydrophobicity: a.don
• Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45, pmiZ. Large is bad. Flat is bad. Globular is good.
• Polarity: PEOE.VSA.FPPOS, PEOE.VSA.FNEG. Negative partial charge is good.

These correspond to conventional wisdom (rule of 5).

Hybrid TAE/SHAPE

• Shape is an important overall factor:
  – DRNB10, DRNB00: del rho dot N
  – BNP31: bare nuclear potential
  – KB54: kinetic energy descriptors; very large lipophilic molecules don't work
  – FUKB14: Fukui surface
• Interpretations are difficult.
• They point to chemistry challenges/hypotheses.

Final SVM Approach

• Construct a large set of descriptors.
• Perform feature selection:
  – Sensitivity analysis or SVM-LP
• Construct many SVM models:
  – Optimize using QP or LP
  – Evaluate by validation set or leave-one-out
  – Select best models by grid or pattern search
• Bag best k models to create final function
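The last step can be sketched as follows: rank the candidate models by validation error, keep the best k, and average their predictions. Averaging is one common way to bag regression models and is an assumption here. The toy models and error values are hypothetical, and `bag_best` is not a name from the lecture.

```python
def bag_best(models, validation_errors, k):
    """Return a function that averages the k models with the lowest error."""
    if not 1 <= k <= len(models):
        raise ValueError("k must be between 1 and the number of models")
    ranked = sorted(range(len(models)), key=lambda i: validation_errors[i])
    best = [models[i] for i in ranked[:k]]

    def final(x):
        return sum(model(x) for model in best) / len(best)

    return final


# Toy one-dimensional "models" with made-up validation errors.
models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0, lambda x: 10.0 * x]
errors = [0.1, 0.2, 0.9]

final = bag_best(models, errors, k=2)
print(final(1.0))  # average of the two best models at x = 1.0
```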

Drug Discovery Results (LOO)

Data      # Sample   # Var. Full   # Var. FS (Avg)   Q2 Full   Q2 FS
Caco2     27         713           41                0.33      0.29
Barrier   62         569           51                0.31      0.28
HIV       64         561           17                0.46      0.40
Cancer    46         362           34                0.50      0.16
LCCK      66         350           69                0.40      0.37
Aquasol   197        525           57                0.08      0.06

SVM-based drug design and property prediction software

Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.

Which class does your drug belong to?

[Figure: workflow. Your drug structure (chemical structure) is input either through the internet (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) or on a local machine (Options 1 and 2). The structure is sent to the classifier on a computer loaded with SVMProt, which holds a support vector machines classifier for every drug class. The output is the identified classes, that is, the drug designed or the property predicted.]

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).

SVM Drug Prediction Results

Protein inhibitor/activator/substrate prediction:

• 86% of 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
• 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.

Drug toxicity prediction:

• 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
• 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.

Pharmacokinetics prediction:

• 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
• 90% of 131 human intestine absorption and 80% of 65 non-absorption agents correctly predicted.

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
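The per-class rates above can be combined into an overall accuracy by weighting each rate by its class size. The sketch below does this for the estrogen receptor and BBB figures. Because the reported percentages are rounded, the result is only approximate, and `overall_accuracy` is an illustrative helper rather than a measure quoted in the lecture.

```python
def overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):
    """Class-size-weighted accuracy from per-class rates (fractions in [0, 1])."""
    correct = pos_rate * n_pos + neg_rate * n_neg
    return correct / (n_pos + n_neg)


# Rates and counts as reported on the slide (percentages are rounded).
estrogen = overall_accuracy(0.86, 129, 0.84, 101)
bbb = overall_accuracy(0.95, 276, 0.82, 139)
print(round(estrogen, 3), round(bbb, 3))
```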
