Informed by Informatics?

Informed by Informatics?

Dr John MitchellUnilever Centre for Molecular Science InformaticsDepartment of ChemistryUniversity of Cambridge, U.K.

Modelling in Chemistry

Density Functional Theoryab initio

Molecular Dynamics

Monte Carlo

Docking

PHYSICS-BASED

EMPIRICALATOMISTIC

Car-Parrinello

NON-ATOMISTIC

DPD

CoMFA

2-D QSAR/QSPR

Machine Learning

AM1, PM3 etc.Fluid Dynamics


Molecular Dynamics

Monte Carlo

Docking

Car-Parrinello

DPD

CoMFA

2-D QSAR/QSPRMachine Learning

AM1, PM3 etc.

HIGH THROUGHPUT

LOW THROUGHPUT

Fluid Dynamics


Molecular Dynamics

Monte Carlo

Docking

Car-Parrinello

DPD

CoMFA


AM1, PM3 etc.

INFORMATICS

THEORETICAL CHEMISTRY

NO FIRM BOUNDARIES!

Fluid Dynamics


Molecular Dynamics

Monte Carlo

Docking

Car-Parrinello

DPD

CoMFA


AM1, PM3 etc.Fluid Dynamics

Informatics and Empirical Models

• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.

• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.

• Do not attempt to simulate reality. • Usually High Throughput.

QSPR

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• Hansch et al developed QSPR in 1960’s, building on Hammett (1930’s).

• Property-property relationships from 1860’s

• General form (for non-linear relationships):y = f (descriptors)

QSPRY X1 X2 X3 X4 X5 X6

Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.



• Quality of the model is judged by three parameters:

n

i

predi

obsi yy

nBias

1

)(1

n

i

predi

obsi yy

nRMSE

1

2)(1

2

1

2

1

2 )(/)(1 averagen

i

obsi

predi

n

i

obsi yyyyr



• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.



• However, this does not guarantee a good predictive model….



• Problems with experimental error.• QSPR only as accurate as data it is trained upon.• Therefore, we are need accurate experimental data.



• Problems with “chemical space”.• “Sample” molecules must be representative of “Population”.• Prediction results will be most accurate for molecules similar

to training set.• Global or Local models?

1. Solubility

Dave Palmer

• Pfizer Institute for Pharmaceutical Materials Science

• http://www.msm.cam.ac.uk/pfizer

D.S. Palmer, et al., J. Chem. Inf. Model., 47, 150-158 (2007)

D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)

Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the industry

A good model for predicting the solubility of druglike molecules would be very valuable.

DatasetsPhase 1 – Literature Data• Compiled from Huuskonen dataset and AquaSol database – pharmaceutically relevant molecules• All molecules solid at room temperature• n = 1000 molecules

Phase 2 – Our Own Experimental Data● Measured by Toni Llinàs using CheqSol Machine● pharmaceutically relevant molecules● n = 135 molecules

● Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25oC)

Diversity-Conserving Partitioning

● MACCS Structural Key fingerprints

● Tanimoto coefficient

● MaxMin Algorithm

Full dataset n = 1000 molecules

Training n = 670 molecules

Testn = 330 molecules

Diversity Analysis

Structures & Descriptors

● 3D structures from Concord

● Minimised with MMFF94

● MOE descriptors 2D/ 3D

● Separate analysis of 2D and 3D descriptors

● QuaSAR Contingency Module (MOE)

● 52 descriptors selected

Machine Learning Method

Random Forest

Random Forest

● Introduced by Briemann and Cutler (2001)

● Development of Decision Trees (Recursive Partitioning):

● Dataset is partitioned into consecutively smaller subsets

● Each partition is based upon the value of one descriptor

● The descriptor used at each split is selected so as to optimise splitting

● Bootstrap sample of N objects chosen from the N available objects with replacement

Random Forest

● Random Forest is a collection of Decision or Regression Trees grown with the CART algorithm.

● Standard Parameters:

● 500 decision trees

● No pruning back: Minimum node size = 5

● “mtry” descriptors (square root of number of total number) tried at each split

Important features:

● Incorporates descriptor selection

● Incorporates “Out-of-bag” validation – using those molecules not in the bootstrap samples

Random Forest for Solubility Prediction

A Forest of Regression Trees

• Dataset is partitioned into consecutively smaller subsets (of similar solubility)

• Each partition is based upon the value of one descriptor

• The descriptor used at each split is selected so as to minimise the MSE

Leo Breiman, "Random Forests“, Machine Learning 45, 5-32 (2001).

Random Forest for Predicting Solubility

• A Forest of Regression Trees• Each tree grown until terminal

nodes contain specified number of molecules

• No need to prune back• High predictive accuracy• Includes method for descriptor

selection• No training problems – largely

immune from overfitting.

Random Forest: Solubility Results

RMSE(te)=0.69r2(te)=0.89Bias(te)=-0.04

RMSE(tr)=0.27r2(tr)=0.98Bias(tr)=0.005

RMSE(oob)=0.68r2(oob)=0.90Bias(oob)=0.01

DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)

Drug Disc.Today, 10 (4), 289 (2005)

Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?

Gsub from lattice energy & an entropy term

Ghydr from a semi-empirical solvation model

(i.e., different kinds of theoretical/computational methods)

… or this alternative cycle?

Gsub from lattice energy (DMAREL) plus entropy

Gsolv from SCRF using the B3LYP DFT functional

Gtr from ClogP

(i.e., different kinds of theoretical/computational methods)

● Nice idea, but didn’t quite work – errors larger than QSPR

● “Why not add a correction factor to account for the difference between the theoretical methods?”

…

…

● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors

This regression equation gave r2=0.77 and RMSE=0.71

Solubility by TD Cycle: Conclusions

● We have a hybrid part-theoretical, part-empirical method.

● An interesting idea, but relatively low throughput as a crystal structure is needed.

● Slightly more accurate than pure QSPR for a druglike set.

● Instructive to compare with literature of theoretical solubility studies.

2. Bioactivity

Florian Nigsch

F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)

Feature Space - Chemical Space

m = (f1,f2,…,fn)

f1

f2

f3

Feature spaces of high dimensionality

COX2

f2

f3

f1

DHFR

CDK1CDK2

Properties of Drugs

• High affinity to protein target

• Soluble• Permeable• Absorbable• High bioavailability• Specific rate of

metabolism• Renal/hepatic clearance?

• Volume of distribution?• Low toxicity• Plasma protein binding?• Blood-Brain-Barrier

penetration?• Dosage (once/twice

daily?)• Synthetic accessibility• Formulation (important in

development)

Multiobjective Optimisation

Bioactivity Synthetic accessibility

Permeability

Toxicity

Metabolism

Solubility

Huge number of candidates …

Multiobjective Optimisation

Bioactivity Synthetic accessibility

Permeability

Toxicity

Metabolism

Solubility

U S E L

E S S

Drug

Huge number of candidates most of which

are useless!

Spam

• Unsolicited (commercial) email

• Approx. 90% of all email traffic is spam

• Where are the legitimate messages?

• Filtering

Analogy to Drug Discovery

• Huge number of possible candidates

• Virtual screening to help in selection process

Winnow Algorithm

• Invented in late 1980s by Nick Littlestone to learn Boolean functions

• Name from the verb “to winnow”– High-dimensional input data

• Natural Language Processing (NLP), text classification, bioinformatics

• Different varieties (regularised, Sparse Network Of Winnow - SNOW, …)

• Error-driven, linear threshold, online algorithm

Machine Learning Methods

Winnow (“Molecular Spam Filter”)

Training Example

Features of Molecules

Based on circular fingerprints

Combinations of Features

Combinations of molecular features

to account for synergies.

Pairing Molecular Features Gives Bigrams

Orthogonal Sparse Bigrams

• Technique used in text classification/spam filtering

• Sliding window process

• Sparse - not all combinations

• Orthogonal - non-redundancy

Workflow

Protein Target Prediction

• Which proteins does a given molecule bind to?• Virtual Screening• Multiple endpoint drugs - polypharmacology• New targets for existing drugs• Prediction of adverse drug reactions (ADR)

– Computational toxicology

Predicted Protein Targets

• Selection of ~230 classes from the MDL Drug Data Report

• ~90,000 molecules• 15 independent

50%/50% splits into training/test set

Predicted Protein Targets

Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)

3. Melting Points

Florian Nigsch

F. Nigsch, et al., J. Chem. Inf. Model., 46, 2412-2422 (2006)

Interest in Modelling

Interest in modelling properties of molecules

- solubility, toxicity, drug-drug interactions, …

Virtual screening has become a necessity in terms of time and cost efficiency

ADMET integration into drug development process

- many fewer late stage failures

- continuous refinement with new experimental dataAbsorption, Distribution, Metabolism, Excretion and Toxicity

Melting Points are Important

Melting point facilitates prediction of solubility

- General Solubility Equation (Yalkowsky)

Notorious challenge (unknown crystal packing, interactions in liquid state)

Goal: Stepwise model for bioavailability, from solid compound to systemic availability

Can Current Models be Improved?

Bergström: 277 total compounds, PLS, RMSE=49.8 ºC, q2=0.54; ntest:ntrain=185:92

Karthikeyan: 4173 total compounds, ANN, RMSE=49.3 ºC, q2=0.66; ntest:ntrain=1042:2089

Jain: 1215 total compounds, MLR, AAE=33.2 ºC, no q2 reported; ntest:ntrain=119:1094

RMSE slightly below 50 ºC!

q2 just above 0.50!

Machine Learning Method

k-Nearest Neighbours

k-Nearest Neighbours

Used for classification and regression

Suitable for large and diverse datasets

• Local (tuneable via k)• Non-linearCalculation of distance matrix

instead of training step

Distance matrix depends on descriptors used

Classification

Regression

Molecules and Numbers

Datasets4119 organic molecules of

diverse complexity, MPs from 14 to 392.5 ºC (dataset 1)

277 drug-like molecules, MPs from 40 to 345 ºC (dataset 2)

Both datasets available in the recent literature

DescriptorsQuasar descriptors from MOE

(Molecular Operating Environment), 146 2D and 57 3D descriptors

Sybyl Atom Type Counts (53 different types, including information about hybridisation state)

Finding the Neighbours

Distances

Euclidean distance between point i and point j in r dimensions

dij Xmi Xm

j 2

m1

r

1/ 2

Set of k nearest neighbours

N = {Ti, di} k = 1,5,10,15

Calculation of Predictions

Predictions: Tpred = f(N)

unweighted weighted

Arithmetic (1) and geometric (2) average

Inverse distance (3)

Exponential decrease (4)

Tpreda

1

kTn

n1

k

Tpredg Tn

1/ k

n1

k

Tpredi

1

1

dnn1

k

Tn

1

dnn1

k

Tprede

1

e dnn1

k

Tne

dn

n1

k

1

3

2

4

Assessment by Cross-validation

Select m random molecules for test set

N-m remaining molecules form training set

Organics Drugs

N: 4119

m: 1000

m/N: 24.3%

N: 277

m: 80

m/N: 28.9%

Data

Random splitTraining Test

Predict test set

(from training set)

RMSEP

Repeat 25 tim

es

RMSECV 1

25RMSEP 2

i1

25

Cross-validated root-mean-square error

Assessment by Cross-validation

Organics Drugs

N: 4119

m: 1000

m/N: 24.3%

N: 277

m: 80

m/N: 28.9%

Data

Random splitTraining Test

Predict test set(from training set)

RMSEP

Re

pea

t 25 tim

es

RMSECV 1

25RMSEP 2

i1

25

Cross-validated root-mean-square error

Not “25-fold CV” where 1/25th is predicted based on the rest

Prediction of 25 random sets of 1000 molecules

Monte Carlo Cross-validation

Results

NNMethod

A G I E

RMSE r2 q2 RMSE r2 q2 RMS

E r2 q2 RMSE r2 q2

1 57.6 0.36 0.21 - - - - - - - - -

5 48.9 0.44 0.43 50.0 0.43 0.40 48.2 0.45 0.45 48.8 0.45 0.43

10 48.8 0.44 0.43 50.2 0.43 0.40 47.7 0.46 0.46 47.6 0.47 0.46

15 49.3 0.43 0.42 50.9 0.42 0.38 48.1 0.46 0.45 47.3 0.48 0.47

Using a combination of 2D and 3D descriptors:

Training set

Test set 1000 molecules (organic)

3119 molecules (organic)

Results: Organic Dataset

NNMethod

A G I E


E r2 q2 RMSE r2 q2

1 57.6 0.36 0.21 - - - - - - - - -

5 48.9 0.44 0.43 50.0 0.43 0.40 48.2 0.45 0.45 48.8 0.45 0.43

10 48.8 0.44 0.43 50.2 0.43 0.40 47.7 0.46 0.46 47.6 0.47 0.46

15 49.3 0.43 0.42 50.9 0.42 0.38 48.1 0.46 0.45 47.3 0.48 0.47

RMSE = 47.2 ºC

r2 = 0.47

q2 = 0.47

2D onlyRMSE = 50.7 ºC

r2 = 0.39

q2 = 0.39

3D onlyUsing 15 nearest neighbours

and exponential weighting

Exponential weighting

performs best!

Results: Drugs Dataset

NNMethod

A G I E


E r2 q2 RMSE r2 q2

1 57.1 0.19 - - - - - - - - - -

5 47.7 0.27 0.25 48.8 0.26 0.22 47.1 0.29 0.27 48.5 0.28 0.22

10 46.9 0.29 0.28 47.8 0.30 0.25 46.3 0.30 0.29 47.4 0.29 0.26

15 48.0 0.25 0.24 49.0 0.26 0.21 47.2 0.28 0.27 47.2 0.29 0.27

RMSE = 46.5 ºC

r2 = 0.30

q2 = 0.29

2D onlyRMSE = 48.4 ºC

r2 = 0.24

q2 = 0.23

3D onlyUsing 10 nearest neighbours

and inverse distance weighting

Inverse distance weighting performs

best!

Summary of Results

Organics: RMSE = 46.2 ºC, r2 = 0.49, ntest:ntrain = 1000:3119

Drugs: RMSE = 42.2 ºC, r2 = 0.42, ntest:ntrain = 80:197

Difficulties at extremities of temperature scale

due to: distances in descriptor space and temperature

Summary of Results

Nonetheless: Comparison with other models in the literature shows that the presented model:

- performs better in terms of RMSE under more stringent conditions (even unoptimised)

- parameterless; optimised model many fewer parameters than, for example, in an ANN

Conclusions

Information content of descriptors

Exponential weighting performs best

Genetic algorithm is able to improve performance, especially for atom type counts as descriptors

Remaining error

- interactions in liquid and/or solid state not captured (by single-molecule descriptors)

- kNN method in this setting (datasets + descriptors)

Thanks

• Pfizer & PIPMS

• Unilever

• Dave Palmer

• Florian Nigsch

All slides after here are for information only

The Two Faces of Computational Chemistry

Theory Empiricism

Informatics and Empirical Models

• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.

• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.

• Do not attempt to simulate reality. • Usually High Throughput.

Theoretical Chemistry

• Calculations and simulations based on real physics.

• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.

• Attempt to model or simulate reality.

• Usually Low Throughput.

Optimisation with Genetic Algorithm

Transformation of data to yield lower global RMSEP

- global means: for whole data set

Three operations to transform the columns of data matrix for optimal weighting

One bit per descriptor on chromosomes: (op; param) consisting of an operation and a parameter

{multiplication

power

loge

{(a,b) = (0, 30) for mult.

(a,b) = (0, 3) for power

none for loge

Iterative Genetic Optimisation

Initial data Initial chromosomes

Transformation of data

Calculation of distance matrix

Evaluation of models

Survival of the fittest

Genetic operations

Stop if criterion reached Set of optimal transformations

Genetic Algorithm Able to Improve Results

Data set Descriptors RMSE (ºC) r2 q2

1 2D 46.2 0.49 0.49

2 2D 42.2 0.42 0.40

1 SAC 49.1 0.40 0.39

2 SAC 46.7 0.29 0.27

SAC: Sybyl Atom Counts

Application of optimal transformations to initial data

Reassessment of models

Averages over 25 runs!

Errors at Low/High-Melting Ends…

ei x i y iA ei

2 sgn(ei)

SRMSE sgn(A)AN

Signed RMSE analysis Definition of signed RMSE

Details

In 30 ºC steps across whole range

All 4119 molecules of data set 1

… Result From Large Distances

Distance in descriptor space

NNs of high-melting compounds typically further away

Distance in temperature

{< 0: negative neighbour

> 0: positive neighbourTtest-TNN

Over- & Underpredictions at Extremities

Distance in descriptor space

NNs of high-melting compounds typically further away

Distance in temperature

{< 0: negative neighbour

> 0: positive neighbourTtest-TNN

Underprediction!

Overprediction!

Documents

Informed by Informatics?