Domain of Applicability A Cluster-Based Measure of Domain of Applicability of a QSAR Model Robert Stanforth 6 September 2005




© IDBS 2005

DC = DD + DM + DA - c

Overview

What is QSAR?
Motivation
Modelling the Dataset
Measure of Distance from Domain
Validation


What is QSAR?

Quantitative Structure-Activity Relationships
    BiologicalActivity = f(ChemicalStructure) + Error
Descriptor-based QSAR
    Descriptors measure chemical structure
    E.g. topological indices of the chemical graph
Use Multivariate Linear Regression
    Regress activity onto high-dimensional descriptor space
    Problem of extrapolation

[Figure: example structures labelled with topological index values 3c = 0, 0.289, 0.408, 0.667 and 1.802]
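The extrapolation problem can be illustrated with a small sketch. Everything here is an assumption made for illustration: a single descriptor instead of a high-dimensional descriptor space, a made-up quadratic `true_activity`, and an ordinary least-squares line as the regression model. The point is only that a model fitted on a limited descriptor range can look acceptable inside that range and fail badly outside it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_activity(x):
    # Hypothetical structure-activity relationship, not from the talk.
    return x * x


# Training descriptors cover only the range 0.0 .. 1.0.
train_x = [i / 10 for i in range(11)]
train_y = [true_activity(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)


def predict(x):
    return slope * x + intercept


# Prediction error inside the training range versus far outside it.
error_inside = abs(predict(0.5) - true_activity(0.5))
error_outside = abs(predict(3.0) - true_activity(3.0))
print(f"error at x=0.5: {error_inside:.2f}, error at x=3.0: {error_outside:.2f}")
```

The regression has no way of signalling that x = 3.0 lies outside the data it was fitted on, which is the gap a domain-of-applicability measure is meant to fill.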


Motivation

QSAR model only valid in domain of its training set
Measure membership of this ‘domain of applicability’
Provides assurance of:
    External test set
    k-fold cross validation
    Prediction



Existing Methods

Bounding Box
Convex Hull
Distance to Centroid
Nearest Neighbour and k-NN Methods
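Three of these methods are simple enough to sketch in a few lines. The function names, the toy training set and the use of plain Euclidean distance are assumptions for illustration, and the convex hull is left out because it needs considerably more machinery. The toy data form two separate clumps, so the query point sits in the empty gap between them.

```python
import math


def in_bounding_box(train, x):
    """True if x lies within the per-descriptor min/max of the training set."""
    return all(
        min(p[d] for p in train) <= x[d] <= max(p[d] for p in train)
        for d in range(len(x))
    )


def centroid_distance(train, x):
    """Euclidean distance from x to the mean of the training set."""
    centroid = [sum(coords) / len(train) for coords in zip(*train)]
    return math.dist(centroid, x)


def knn_distance(train, x, k=1):
    """Mean Euclidean distance from x to its k nearest training points."""
    nearest = sorted(math.dist(p, x) for p in train)[:k]
    return sum(nearest) / k


# Toy training set (made up): two clumps with an empty gap between them.
train = [(0, 0), (1, 0), (0, 1), (1, 1),
         (10, 10), (11, 10), (10, 11), (11, 11)]
query = (5.5, 5.5)  # in the gap, far from every training compound

box_says_inside = in_bounding_box(train, query)
dist_to_centroid = centroid_distance(train, query)
dist_to_neighbours = knn_distance(train, query, k=3)
print(box_says_inside, round(dist_to_centroid, 3), round(dist_to_neighbours, 3))
```

On this data the bounding box and the centroid distance both treat the gap as the heart of the domain, while the neighbour-based distance flags it as remote, at the cost of a scan over the whole training set for every query.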


K-Means for Clustering

Use ‘clusters’ to model the shape of the dataset
K-Means algorithm iteratively adjusts partitioning into clusters to increase accuracy of the model
Computationally feasible
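A minimal sketch of the standard K-Means iteration (Lloyd's algorithm) is given here for orientation. It is not taken from the talk: the function name, the fixed iteration count and the choice to start from the first k points are simplifying assumptions, and practical implementations normally use random or k-means++ starts with a convergence test.

```python
import math


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns the k cluster centroids."""
    # Assumption: start from the first k points to keep the sketch deterministic.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


# Toy descriptor data (made up): two well-separated clumps.
train = [(0, 0), (1, 0), (0, 1), (1, 1),
         (10, 10), (11, 10), (10, 11), (11, 11)]
centroids = sorted(k_means(train, 2))
print(centroids)
```

Each iteration costs on the order of (number of compounds) x (number of clusters) distance evaluations, which is what makes the approach feasible for large training sets.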


Measure of Distance

Use the K-Means model
    Based on distances to cluster centroids
    Fuzzy cluster membership
    Weighted average of distances to cluster centroids, weighted according to cluster membership
Computationally efficient
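One way to read "weighted average of distances to cluster centroids, weighted according to cluster membership" is sketched here. The slides do not give the membership function, so this sketch assumes the common fuzzy c-means form with fuzzifier 2, where membership is proportional to the inverse squared distance to each centroid. The names `fuzzy_memberships` and `cluster_distance` and the two example centroids are hypothetical.

```python
import math


def fuzzy_memberships(x, centroids):
    """Memberships of x in each cluster, plus the centroid distances.

    Assumption: fuzzy c-means style weights proportional to 1 / d**2.
    """
    dists = [math.dist(x, c) for c in centroids]
    if any(d == 0.0 for d in dists):
        # x coincides with a centroid: full membership of that cluster.
        return [1.0 if d == 0.0 else 0.0 for d in dists], dists
    weights = [1.0 / (d * d) for d in dists]
    total = sum(weights)
    return [w / total for w in weights], dists


def cluster_distance(x, centroids):
    """Membership-weighted average of the distances from x to the centroids."""
    memberships, dists = fuzzy_memberships(x, centroids)
    return sum(u * d for u, d in zip(memberships, dists))


# Hypothetical centroids, e.g. as produced by a K-Means run on training data.
centroids = [(0.5, 0.5), (10.5, 10.5)]

at_centroid = cluster_distance((0.5, 0.5), centroids)
near_cluster = cluster_distance((1.0, 1.0), centroids)
in_gap = cluster_distance((5.5, 5.5), centroids)
print(at_centroid, round(near_cluster, 3), round(in_gap, 3))
```

Under this assumption a query only needs one distance per cluster rather than one per training compound, and a point in the gap between clusters scores as distant even though it lies between them.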


Measure of Distance

Contour Plot
    First contour defines boundary of applicability domain


Internal Validation

Assess stability of distance measure
Use k-fold cross validation
    Leave out one group at a time
    Retrain distance measure
    Mean relative change in distance of compounds left out
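The procedure can be sketched as follows. Two things are assumptions: the relative change is taken as |d_retrained - d_full| / d_full, which the slides do not spell out, and a plain distance-to-centroid measure stands in for whichever distance measure is being assessed. `train_centroid_distance` and `mean_relative_change` are illustrative names.

```python
import math


def train_centroid_distance(train):
    """Stand-in distance measure: 'training' just stores the centroid."""
    centroid = [sum(c) / len(train) for c in zip(*train)]
    return lambda x: math.dist(centroid, x)


def mean_relative_change(points, n_folds, train_measure):
    """Mean relative change in distance of the compounds left out of each fold."""
    full = train_measure(points)
    changes = []
    for fold in range(n_folds):
        held_out = points[fold::n_folds]
        kept = [p for i, p in enumerate(points) if i % n_folds != fold]
        retrained = train_measure(kept)  # retrain without the held-out group
        for x in held_out:
            d_full = full(x)
            if d_full > 0:  # relative change is undefined at zero distance
                changes.append(abs(retrained(x) - d_full) / d_full)
    return sum(changes) / len(changes)


# Toy data (made up): four compounds at the corners of a square.
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
deviation = mean_relative_change(points, 2, train_centroid_distance)
print(f"mean relative change: {deviation:.1%}")
```

A small value means the measure assigns nearly the same distance to a compound whether or not that compound was in the training set, which is the sense of stability being tested.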


Internal Validation

Method          Averaged Relative Deviation
Bounding Box    53.2%
Leverage        80.5%
k-NN            83.1%
Cluster-based   43.2%


External Validation

Assess relationship between distance and prediction error
Analyse mean-square prediction error over:
    50 ‘new’ compounds
    Those inside domain
    Those outside domain
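A sketch of this comparison, assuming each new compound has a distance from the domain and a signed prediction error, and that a compound counts as inside the domain when its distance does not exceed a chosen threshold. The distances, errors and threshold here are made-up toy values, not data from the talk.

```python
def mean_square_error(errors):
    """Mean of squared prediction errors; NaN for an empty group."""
    return sum(e * e for e in errors) / len(errors) if errors else float("nan")


def split_mse(distances, errors, threshold):
    """Mean-square error over all compounds, those inside, and those outside."""
    inside = [e for d, e in zip(distances, errors) if d <= threshold]
    outside = [e for d, e in zip(distances, errors) if d > threshold]
    return (mean_square_error(errors),
            mean_square_error(inside),
            mean_square_error(outside))


# Toy values (made up): distance from domain and prediction error per compound.
distances = [0.2, 0.5, 0.9, 1.5, 2.0]
errors = [0.1, -0.2, 0.3, 1.0, -1.5]

mse_all, mse_inside, mse_outside = split_mse(distances, errors, threshold=1.0)
print(round(mse_all, 3), round(mse_inside, 3), round(mse_outside, 3))
```

A useful distance measure should give a clearly higher mean-square error for the outside group than for the inside group.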


External Validation

Mean Square Prediction Error (number of compounds in parentheses)

Method          All (50)   Inside Domain   Outside Domain
Bounding Box    2.76       3.08 (27)       2.40 (23)
Leverage        2.76       2.81 (48)       1.61 (2)
k-NN            2.76       2.73 (45)       3.11 (5)
Cluster-based   2.76       2.70 (46)       3.58 (4)
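As a consistency check on these figures, the count-weighted average of the inside and outside errors should reproduce the overall value of 2.76 for every method. The numbers are copied from the table; the dictionary layout and the name `pooled_mse` are illustrative.

```python
# (MSE inside, count inside, MSE outside, count outside), from the table.
table = {
    "Bounding Box": (3.08, 27, 2.40, 23),
    "Leverage": (2.81, 48, 1.61, 2),
    "k-NN": (2.73, 45, 3.11, 5),
    "Cluster-based": (2.70, 46, 3.58, 4),
}


def pooled_mse(mse_in, n_in, mse_out, n_out):
    """Count-weighted average of the two group mean-square errors."""
    return (mse_in * n_in + mse_out * n_out) / (n_in + n_out)


pooled = {method: pooled_mse(*row) for method, row in table.items()}
for method, value in pooled.items():
    print(f"{method:<14}{value:.2f}")
```

All four rows pool back to roughly 2.76. For Bounding Box and Leverage the compounds flagged as outside the domain were predicted better than those inside, whereas for k-NN and the cluster-based measure the outside group has the higher error.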


Conclusions

Need quantitative measure of applicability of a descriptor-based QSAR model to a structure
Existing methods are all either too crude or too slow
Our new method is computationally efficient, and copes well with non-convex domains