A new user's first adventures in ADME QSAR modelling with KNIME
Mark Gardner
Identification of a novel oral clinical candidate superior to praziquantel to treat schistosomiasis
4-year project, MRC DPFS scheme
http://www.mmv.org/research-development/computationalchemistry
Content
• Provide access & guidance to ADMET QSAR model(s) useful for ‘Global Health’ drug discovery
• Objectives of this work
• KNIME approach
• Your thoughts on any of this are very welcome
ADMET QSAR models
• Internal comparison (unpublished) suggests pharma ADMET models (solubility, hERG, microsomal stability) generally outperform non-pharma models
  • Assumption: chemical space covered and/or consistency of assays
• Many of these use molecular fingerprints, e.g. ECFP4
• There are LOTS of molecular fingerprints
  • you need LOTS of data to avoid overtraining
• We would love to be able to provide access to pharma company ADMET models for Global Health Drug Discovery Projects
Potential ‘honest broker’ hosting
[Diagram: secure hosting and access; labels: models, compounds, predictions; friendly big pharma company X (+ Y + Z)]
Approaches to provide security
• Models don’t reveal training data
• Employees not able to access external structures
  • models hosted outside pharma by honest broker
• Models can’t be taken & used/commercialised by others
• Restricted and controlled access
• Ts & Cs accepted for each use
Avoid:
• disclosure
• contamination
• competition
Meanwhile…
• Can we do something useful with data in the public domain?
• Working assumption…
  • not enough data to do de novo ADMET prediction
  • is there enough to carry out relative ADMET endpoint prediction?
Predict property of X given measured property of analogue Y
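As a rough illustration of the idea (not the KNIME workflow from the talk), a relative prediction can be sketched as the analogue's measured value plus a correction from descriptor differences. The function name, descriptor names and coefficient values here are hypothetical.

```python
def predict_relative(measured_y, desc_x, desc_y, coeffs):
    """Predict a property of X from the measured property of analogue Y.

    The correction is a weighted sum of descriptor differences (X minus Y).
    """
    delta = sum(c * (desc_x[name] - desc_y[name]) for name, c in coeffs.items())
    return measured_y + delta


# Hypothetical coefficients and descriptor values, for illustration only.
coeffs = {"AlogP": -0.5, "amideH_count": -0.25}
desc_y = {"AlogP": 2.0, "amideH_count": 1}
desc_x = {"AlogP": 3.0, "amideH_count": 1}

pred = predict_relative(-4.0, desc_x, desc_y, coeffs)
print(pred)
```

The closer X and Y are, the smaller the correction term, so any error in the coefficients matters less.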
Pilot with KNIME [1] & AZ solubility data [2]
1. https://www.knime.org/
2. Mark Wenlock and Nicholas Tomkinson, https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL3301361
• need to calculate log solubility
[Plots: distribution of solubility; distribution of log solubility]
it may seem obvious but…
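A minimal sketch of the log transform, assuming solubility is given as a molar concentration. The values are made up and are not taken from the AZ dataset.

```python
import math

# Hypothetical solubility values in mol/L (not taken from the AZ dataset).
solubility_molar = [2.5e-6, 1.0e-5, 4.0e-4, 1.0e-3]

# log10 compresses a distribution that spans several orders of magnitude.
log_s = [math.log10(s) for s in solubility_molar]

print([round(v, 2) for v in log_s])
```

On the log scale the values are spread far more evenly, which suits a linear model better than the raw, heavily skewed values.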
Approach
Aim to calculate a QSAR relating ∆ descriptor values to ∆ solubility
but I’ve been very selective above…
Workflow
• Read in compounds
• Calculate charge state (from simple SMARTS queries - CDK)
• Cluster within each charge state (ECFP4)
  • best clustering?
• Calculate ‘relevant’ descriptors
  • aggregated VSA into 3 bins
    • lower, middle, higher
  • probably a good pKa prediction would help
  • count HBD types (using SMARTS)
    • java snippet very helpful here
    • took me a while to work out how to use it
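The slides do not say which clustering algorithm was used, so the following is only a sketch of one simple option: leader clustering on Tanimoto similarity. Fingerprints are represented as sets of on-bit indices standing in for ECFP4. The toy bit sets and the 0.5 threshold are assumptions, and this is not the KNIME node configuration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold=0.5):
    """Assign each fingerprint to the first cluster whose leader is similar enough.

    Returns a list of cluster indices, one per input fingerprint.
    """
    leaders = []  # fingerprint of the first member of each cluster
    labels = []
    for fp in fingerprints:
        for idx, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(fp)
            labels.append(len(leaders) - 1)
    return labels


# Toy on-bit sets standing in for ECFP4 fingerprints of one charge state.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3, 4, 5}]
labels = leader_cluster(fps)
print(labels)
```

Running this separately for each charge state keeps, for example, bases and neutrals in different clusters, as in the workflow above.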
Workflow 2
• count HBD types (using SMARTS)
  • java snippet very helpful here
  • took me a while to work out how to use it (and how java scripting online related to java snippet use)
Workflow 3
• normalize descriptors & logS onto 0-1 scale
• for each cluster (with at least 5 compounds; reject the rest)
  • calculate the mean of each normalized descriptor and, for each compound, its ∆ from that mean
• split clusters into training & test
  • used random number assignment & row filter
  • split 65:35
• calculate linear regression equation using a selection of properties
  • predict ∆ logS & compare with actual
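The normalization and ∆-from-cluster-mean steps can be sketched as follows. This is plain Python rather than the KNIME nodes actually used; the helper names, data and cluster labels are hypothetical.

```python
from collections import defaultdict


def min_max(values):
    """Scale a list of numbers onto the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def deltas_from_cluster_mean(values, clusters, min_size=5):
    """Return {row index: value minus its cluster mean}.

    Clusters with fewer than min_size rows are rejected.
    """
    members = defaultdict(list)
    for i, c in enumerate(clusters):
        members[c].append(i)
    out = {}
    for rows in members.values():
        if len(rows) < min_size:
            continue  # reject small clusters
        mean = sum(values[i] for i in rows) / len(rows)
        for i in rows:
            out[i] = values[i] - mean
    return out


# Hypothetical logS values and cluster labels.
log_s = [-6.0, -5.0, -4.0, -5.5, -4.5, -3.0, -2.0]
clusters = ["A", "A", "A", "A", "A", "B", "B"]

norm = min_max(log_s)
delta = deltas_from_cluster_mean(norm, clusters)
print(sorted(delta))
```

The same two helpers would be applied to every descriptor column, so that descriptors and logS are all expressed as ∆ from their cluster mean.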
Workflow 4
• use a loop to do the training/test split 50 times & graph the output
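A sketch of the repeated-split loop, under several assumptions: rows (not whole clusters) get a random number and are filtered at 0.65, which gives roughly rather than exactly 65:35; only one ∆ descriptor is fitted instead of four; and no intercept is used. The data are synthetic.

```python
import random


def fit_slope(x, y):
    """Least-squares slope for y = slope * x (no intercept: a simplifying assumption)."""
    sxx = sum(v * v for v in x)
    return sum(a * b for a, b in zip(x, y)) / sxx


def repeated_splits(x, y, n_repeats=50, train_fraction=0.65, seed=0):
    """Refit the model on n_repeats random training subsets; return the slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        # Random number per row, then filter: rows below the cut-off are training.
        train = [i for i in range(len(x)) if rng.random() < train_fraction]
        if len(train) < 2:
            continue
        slopes.append(fit_slope([x[i] for i in train], [y[i] for i in train]))
    return slopes


# Synthetic data: delta descriptor and delta logS with a true slope of -0.5.
rng = random.Random(1)
dx = [rng.uniform(-1, 1) for _ in range(200)]
dy = [-0.5 * v + rng.gauss(0, 0.05) for v in dx]

slopes = repeated_splits(dx, dy)
print(len(slopes), min(slopes), max(slopes))
```

Plotting the collected slopes shows how stable each coefficient is across different training sets.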
coefficients in linear regression equation (training set)
∆logS = a.∆AlogP + b.∆base_count + c.∆numArom + d.∆amideH_count
[Plot: value of coefficient for delta_AlogP, delta base_count, delta numAromrings and delta amideH]
P values – how significant are the descriptors in each (of 50) model?
[Plot: P values for the delta_AlogP, delta base_count, delta numAromrings and delta amideH coefficients]
• AlogP & number of amide H’s significant for all 50 models (fits with previous plot)
so how about predicting logS? (rather than ∆logS)
• test set of 258 compounds
  • r2 = 0.65
but mean logS for cluster almost as good… r2 = 0.59
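To make the comparison concrete, here is a sketch that scores a model (cluster mean plus predicted ∆logS) against the cluster-mean-only baseline. The slides do not say which definition of r2 was used; this assumes the coefficient of determination. All numbers are made up and are not the 258 test compounds.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


# Hypothetical test-set values from two clusters.
actual = [-5.0, -4.0, -3.0, -6.0, -2.0, -1.0]
cluster_mean = [-4.5, -4.5, -4.5, -4.5, -1.5, -1.5]   # baseline: cluster mean only
pred_delta = [-0.4, 0.3, 1.2, -1.1, -0.3, 0.4]        # model's predicted delta logS

model_pred = [m + d for m, d in zip(cluster_mean, pred_delta)]

r2_model = r_squared(actual, model_pred)
r2_baseline = r_squared(actual, cluster_mean)
print(r2_model, r2_baseline)
```

The gap between the two scores is what the ∆ model adds beyond simply knowing which cluster a compound belongs to.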
some more ideas…
• there are always more descriptors
  • pKa, dipole moment, …
• randomize Y values (within clusters) to check for potential over-training
• try another endpoint
• is there a loop node that helps with descriptor selection/model optimisation?
• …
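The Y-randomization idea could be sketched like this: permute the Y values only among compounds in the same cluster, then refit. The function name, seed and data are hypothetical.

```python
import random
from collections import defaultdict


def shuffle_within_clusters(y, clusters, seed=0):
    """Return a copy of y with values permuted only among rows of the same cluster."""
    rng = random.Random(seed)
    rows_by_cluster = defaultdict(list)
    for i, c in enumerate(clusters):
        rows_by_cluster[c].append(i)
    shuffled = list(y)
    for rows in rows_by_cluster.values():
        values = [y[i] for i in rows]
        rng.shuffle(values)
        for i, v in zip(rows, values):
            shuffled[i] = v
    return shuffled


# Hypothetical logS values and cluster labels.
y = [-5.0, -4.0, -3.0, -2.0, -1.5, -1.0]
clusters = ["A", "A", "A", "B", "B", "B"]
y_rand = shuffle_within_clusters(y, clusters)
print(y_rand)
```

If a model refitted on the shuffled values still scores about as well as the real one, over-training is the likely explanation.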
acknowledgements
• KNIME, for KNIME
• Thomas Sander & Actelion for DataWarrior
• Mark Wenlock, Nicholas Tomkinson, AZ & ChEMBL for the solubility data
• Alexander Alex & Caroline Low for patiently listening & constructive criticism
• Everyone involved in supporting UK-QSAR
“For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled”
Richard Feynman