Domain of Applicability
A Cluster-Based Measure of Domain of Applicability of a QSAR Model
Robert Stanforth
6 September 2005
© IDBS 2005
DC = DD + DM + DA - c
Overview
What is QSAR?
Motivation
Modelling the Dataset
Measure of Distance from Domain
Validation
What is QSAR?
Quantitative Structure-Activity Relationships: BiologicalActivity = f(ChemicalStructure) + Error
Descriptor-based QSAR: descriptors measure chemical structure
E.g. topological indices of chemical graph
Use Multivariate Linear Regression: regress activity onto high-dimensional descriptor space
Problem of extrapolation
[Figure: example index values 3c = 0, 0.289, 0.408, 0.667, 1.802]
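The extrapolation problem can be illustrated with a minimal sketch. It assumes a single descriptor and a made-up nonlinear relationship (`true_activity` is hypothetical, not taken from the slides), whereas the talk concerns regression in a high-dimensional descriptor space. A least-squares fit that works well inside the training range can fail badly outside it.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one descriptor only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def true_activity(x):
    # Hypothetical structure-activity relationship, for illustration only.
    return math.sin(x)

# Training descriptors cover the range 0.0 to 1.0 only.
train_x = [i / 10 for i in range(11)]
train_y = [true_activity(x) for x in train_x]
intercept, slope = fit_line(train_x, train_y)

def predict(x):
    return intercept + slope * x

# Absolute prediction error inside and far outside the training range.
err_inside = abs(predict(0.5) - true_activity(0.5))
err_outside = abs(predict(3.0) - true_activity(3.0))
print(err_inside, err_outside)
```

The error at 3.0 is far larger than at 0.5, even though nothing in the fitted model itself signals that the second query lies outside the training domain.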
Motivation
QSAR model only valid in domain of its training set
Measure membership of this ‘domain of applicability’
Provides assurance of:
External test set
k-fold cross validation
Prediction
??
Existing Methods
Bounding Box
Convex Hull
Distance to Centroid
Nearest Neighbour and k-NN Methods
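Three of the existing methods can be sketched in a few lines. This is an illustrative reading of each method, not the implementations compared in the talk; the function names and the L-shaped toy dataset are assumptions. The convex hull is omitted because it needs considerably more machinery.

```python
import math

def in_bounding_box(train, x):
    """True if x lies within the per-descriptor min/max of the training set."""
    return all(
        min(p[j] for p in train) <= x[j] <= max(p[j] for p in train)
        for j in range(len(x))
    )

def distance_to_centroid(train, x):
    """Euclidean distance from x to the mean of the training set."""
    n = len(train)
    centroid = [sum(p[j] for p in train) / n for j in range(len(x))]
    return math.dist(centroid, x)

def knn_distance(train, x, k=1):
    """Mean Euclidean distance from x to its k nearest training points."""
    nearest = sorted(math.dist(p, x) for p in train)[:k]
    return sum(nearest) / k

# Toy non-convex (L-shaped) training set in two descriptors.
train = [(i, 0) for i in range(5)] + [(0, i) for i in range(1, 5)]
query = (4, 4)  # in the empty corner of the L

print(in_bounding_box(train, query))       # inside the box
print(distance_to_centroid(train, query))
print(knn_distance(train, query, k=1))     # yet far from every training point
```

The query falls inside the bounding box although no training compound is nearby, which is the kind of crudeness that motivates a shape-aware measure.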
K-Means for Clustering
Use ‘clusters’ to model the shape of the dataset
K-Means algorithm iteratively adjusts the partitioning into clusters to increase accuracy of the model
Computationally feasible
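The iterative adjustment can be sketched with the standard Lloyd-style K-Means loop: assign each point to its nearest centroid, then move each centroid to the mean of its members. The slides do not say which K-Means variant or initialisation was used, so the explicit starting centroids and the toy data below are assumptions.

```python
import math

def kmeans(points, initial_centroids, iterations=20):
    """Lloyd-style K-Means; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, members in enumerate(clusters):
            if members:
                updated.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                updated.append(centroids[k])  # keep an empty cluster in place
        if updated == centroids:
            break  # partitioning is stable
        centroids = updated
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```

Each pass costs time proportional to the number of points times the number of clusters, which is consistent with the claim of computational feasibility.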
Measure of Distance
Use the K-Means model: base the measure on distances to cluster centroids
Fuzzy cluster membership
Weighted average of distances to cluster centroids, weighted according to cluster membership
Computationally efficient
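A membership-weighted distance of this kind might look as follows. The slides do not give the membership function, so this sketch assumes the fuzzy c-means form, in which membership of cluster k is proportional to d_k^(-2/(m-1)) with a hypothetical fuzzifier m = 2; the author's actual weighting may differ.

```python
import math

def cluster_distance(x, centroids, fuzzifier=2.0):
    """Average of distances to centroids, weighted by fuzzy membership.

    Assumption: memberships follow the fuzzy c-means formula; this is
    not confirmed by the slides.
    """
    distances = [math.dist(x, c) for c in centroids]
    if any(d == 0.0 for d in distances):
        return 0.0  # x coincides with a centroid: full membership there
    exponent = -2.0 / (fuzzifier - 1.0)
    raw = [d ** exponent for d in distances]
    total = sum(raw)
    memberships = [r / total for r in raw]  # memberships sum to 1
    return sum(u * d for u, d in zip(memberships, distances))

centroids = [(0.0, 0.0), (10.0, 0.0)]
print(cluster_distance((1.0, 0.0), centroids))  # close to the first cluster
print(cluster_distance((5.0, 0.0), centroids))  # midway between clusters
```

Because nearby clusters dominate the weights, a point close to any one cluster receives a small distance regardless of how far away the other clusters are, which lets the measure follow a non-convex dataset. Only one distance per centroid is needed for each query.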
Measure of Distance
Contour plot: first contour defines boundary of applicability domain
Internal Validation
Assess stability of distance measure
Use k-fold cross validation
Leave out one group at a time
Retrain distance measure
Mean relative change in distance of compounds left out
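The stability check can be sketched as below. The slides do not define the relative change precisely, so this assumes |d_retrained - d_full| / d_full for each left-out compound. A simple distance-to-centroid measure stands in for whichever distance measure is being assessed, and the grid data is synthetic.

```python
import math

def fit_centroid_distance(train):
    """Stand-in distance measure: Euclidean distance to the training centroid."""
    n = len(train)
    centroid = tuple(sum(coord) / n for coord in zip(*train))
    return lambda x: math.dist(x, centroid)

def averaged_relative_deviation(data, fit, k=5):
    """Mean relative change in distance of compounds when their fold is left out."""
    full = fit(data)
    folds = [data[i::k] for i in range(k)]
    changes = []
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        retrained = fit(rest)  # retrain without the held-out group
        for p in held_out:
            d_full = full(p)
            if d_full > 0:  # skip points with zero reference distance
                changes.append(abs(retrained(p) - d_full) / d_full)
    return sum(changes) / len(changes)

# Synthetic 5 x 5 grid of two-descriptor "compounds".
data = [(float(i), float(j)) for i in range(5) for j in range(5)]
ard = averaged_relative_deviation(data, fit_centroid_distance, k=5)
print(ard)
```

Passing a different `fit` function allows the same loop to compare several distance measures on equal terms; a smaller value indicates a measure that changes less when part of the training set is withheld.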
Internal Validation
Method          Averaged Relative Deviation
Bounding Box    53.2%
Leverage        80.5%
k-NN            83.1%
Cluster-based   43.2%
External Validation
Assess relationship between distance and prediction error
Analyse mean-square prediction error over:
50 ‘new’ compounds
Those inside domain
Those outside domain
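The comparison can be sketched as follows: split the new compounds at a distance threshold and compute the mean-square prediction error over all of them, over those inside the domain, and over those outside. The distances, errors and threshold here are made-up numbers for illustration, not data from the talk.

```python
def mean_square(errors):
    """Mean of squared prediction errors."""
    return sum(e * e for e in errors) / len(errors)

def split_by_domain(distances, errors, threshold):
    """Partition prediction errors by whether the compound is inside the domain."""
    inside = [e for d, e in zip(distances, errors) if d <= threshold]
    outside = [e for d, e in zip(distances, errors) if d > threshold]
    return inside, outside

# Synthetic example: distance from domain and signed prediction error.
distances = [0.2, 0.5, 0.9, 1.4, 2.0]
errors = [0.1, -0.2, 0.3, 1.0, -2.0]

inside, outside = split_by_domain(distances, errors, threshold=1.0)
mse_all = mean_square(errors)
mse_inside = mean_square(inside)
mse_outside = mean_square(outside)
print(mse_all, mse_inside, mse_outside)
```

A useful distance measure should give a lower error inside the domain than outside it, as in this toy case.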
External Validation
Mean Square Prediction Error (number of compounds in parentheses)
Method          All (50)   Inside Domain   Outside Domain
Bounding Box    2.76       3.08 (27)       2.40 (23)
Leverage        2.76       2.81 (48)       1.61 (2)
k-NN            2.76       2.73 (45)       3.11 (5)
Cluster-based   2.76       2.70 (46)       3.58 (4)
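As a consistency check on the figures above, the inside and outside errors for each method should pool, weighted by compound count, to the overall error of 2.76 over all 50 compounds. The short script below uses only the numbers from the table; the variable names are illustrative.

```python
# (MSE, count) inside the domain and outside it, per method, from the table.
table = {
    "Bounding Box": ((3.08, 27), (2.40, 23)),
    "Leverage": ((2.81, 48), (1.61, 2)),
    "k-NN": ((2.73, 45), (3.11, 5)),
    "Cluster-based": ((2.70, 46), (3.58, 4)),
}

def pooled_mse(inside, outside):
    """Count-weighted mean of the inside and outside mean-square errors."""
    (mse_in, n_in), (mse_out, n_out) = inside, outside
    return (mse_in * n_in + mse_out * n_out) / (n_in + n_out)

pooled = {method: pooled_mse(*parts) for method, parts in table.items()}
for method, value in pooled.items():
    print(f"{method}: {value:.3f}")
```

Every pooled value agrees with 2.76 to within the rounding of the tabulated figures. Only the k-NN and cluster-based methods show a lower error inside the domain than outside it.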
Conclusions
Need quantitative measure of applicability of a descriptor-based QSAR model to a structure
Existing methods are all either too crude or too slow
Our new method is computationally efficient, and copes well with non-convex domains