
A new user's first adventures in ADME QSAR modelling with KNIME

Mark Gardner

Identification of a novel oral clinical candidate superior to praziquantel to treat schistosomiasis

4 year project, MRC DPFS scheme

http://www.mmv.org/research-development/computationalchemistry

Content

• Provide access & guidance to ADMET QSAR model(s) useful for ‘Global Health’ drug discovery

• Objectives of this work

• KNIME approach

• Your thoughts on any of this are very welcome


ADMET QSAR models

• Internal comparison (unpublished) suggests pharma ADMET models (solubility, hERG, microsomal stability) generally outperform non-pharma models
  • Assumption: chemical space covered and/or consistency of assays

• Many of these use molecular fingerprints, e.g. ECFP4

• There are LOTS of molecular fingerprints
  • you need LOTS of data to avoid overtraining*

• We would love to be able to provide access to pharma company ADMET models for Global Health Drug Discovery Projects


*

Potential ‘honest broker’ hosting

[Diagram: secure hosting and access; labels: friendly big pharma company X (+ Y + Z), models, compounds, predictions]

Approaches to provide security

• Models don’t reveal training data

• Employees not able to access external structures
  • models hosted outside pharma by honest broker

• Models can’t be taken & used/commercialised by others

• Restricted and controlled access

• Ts & Cs accepted for each use

Avoid:
• disclosure
• contamination
• competition

Meanwhile…

• Can we do something useful with data in the public domain?

• Working assumption…
  • not enough data to do de novo ADMET prediction
  • is there enough to carry out relative ADMET endpoint prediction?

Predict property of X, given measured property of analogue Y

Pilot with KNIME¹ & AZ solubility data²

1. https://www.knime.org/
2. Mark Wenlock and Nicholas Tomkinson, https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL3301361

it may seem obvious but…

• need to calculate log solubility

[Figure: two panels, "distribution of solubility" and "distribution of log solubility"]
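As a rough illustration of the log transform, the sketch below converts solubility values to log10 of the molar solubility. The nanomolar input unit and the function name `log_solubility` are assumptions made for the example; they are not taken from the slides.

```python
import math

def log_solubility(solubility_nm):
    """Return log10 of the molar solubility for a value given in nM (assumed unit)."""
    if solubility_nm <= 0:
        raise ValueError("solubility must be positive to take a logarithm")
    return math.log10(solubility_nm * 1e-9)

# Hypothetical values spanning four orders of magnitude
for value in (100.0, 10_000.0, 1_000_000.0):
    print(value, round(log_solubility(value), 2))
```

Values that span several orders of magnitude on the raw scale land on a compact, roughly linear scale after the transform, which is what makes them usable as a regression target.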

Approach

Aim to calculate a QSAR relating ∆ descriptor values to ∆ solubility

but I’ve been very selective above…

Workflow

• Read in compounds

• Calculate charge state (from simple SMARTS queries - CDK)

• Cluster within each charge state (ECFP4)
  • best clustering?

• Calculate ‘relevant’ descriptors
  • aggregated VSA into 3 bins
    • lower, middle, higher
    • probably a good pKa prediction would help
  • count HBD types (using SMARTS)
    • java snippet very helpful here
    • took me a while to work out how to use it
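The slides do not say which clustering algorithm was used on the ECFP4 fingerprints. Purely as an illustration of fingerprint-based clustering, the sketch below computes Tanimoto similarity on sets of on-bit indices and groups compounds with a simple leader-style pass. The leader algorithm, the 0.5 threshold, and the names `tanimoto` and `leader_cluster` are all assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

def leader_cluster(fingerprints, threshold=0.5):
    """Assign each compound to the first cluster whose leader it resembles.

    fingerprints: dict of compound id -> set of on-bit indices.
    Returns a dict of leader id -> list of member ids (leader included).
    """
    clusters = {}
    for cid, bits in fingerprints.items():
        for leader in clusters:
            if tanimoto(bits, fingerprints[leader]) >= threshold:
                clusters[leader].append(cid)
                break
        else:
            # No existing leader is similar enough, so start a new cluster
            clusters[cid] = [cid]
    return clusters

# Hypothetical on-bit sets standing in for real ECFP4 fingerprints
fps = {
    "cpd1": {1, 2, 3, 4},
    "cpd2": {1, 2, 3, 5},
    "cpd3": {10, 11, 12},
}
clusters = leader_cluster(fps, threshold=0.5)
print(clusters)
```

A leader pass depends on input order and on the threshold, so it is only one possible answer to the "best clustering?" question.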

Workflow 2

• count HBD types (using SMARTS)
  • java snippet very helpful here
  • took me a while to work out how to use it (and how Java scripting examples found online relate to Java Snippet use)

Workflow 3

• normalize descriptors & logS onto 0-1 scale

• for each cluster (with at least 5 compounds; reject the rest)
  • calculate the cluster mean and, for each normalized descriptor and each compound, the ∆ from that mean

• split clusters into training & test
  • used random number assignment & row filter
  • split 65:35

• calculate linear regression equation using a selection of properties
  • predict ∆ logS & compare with actual
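The normalization and ∆-from-cluster-mean steps can be sketched in plain Python as below. This is a minimal illustration rather than the KNIME workflow itself; the row layout, the column name `logS_norm`, and the helper names are assumptions.

```python
from collections import defaultdict

def min_max_scale(values):
    """Scale a list of numbers onto the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def deltas_from_cluster_mean(rows, column, min_size=5):
    """Return {compound id: value minus cluster mean} for clusters with >= min_size members.

    rows: list of dicts with keys "id", "cluster" and the named column.
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    deltas = {}
    for members in by_cluster.values():
        if len(members) < min_size:
            continue  # reject clusters with fewer than min_size compounds
        mean = sum(m[column] for m in members) / len(members)
        for m in members:
            deltas[m["id"]] = m[column] - mean
    return deltas

# Hypothetical data: cluster A has five compounds, cluster B only two
raw_logs = [-6.0, -5.5, -5.0, -4.5, -4.0, -7.0, -3.0]
scaled = min_max_scale(raw_logs)
rows = [
    {"id": f"cpd{i}", "cluster": "A" if i < 5 else "B", "logS_norm": s}
    for i, s in enumerate(scaled)
]
delta_logs = deltas_from_cluster_mean(rows, "logS_norm")
print(delta_logs)
```

The same function would be applied to each normalized descriptor column in turn, giving one ∆ value per compound per descriptor.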

Workflow 4

• use a loop to do the training/test split 50 times & graph the output
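A repeated random split of this kind might look like the sketch below. It mimics "random number assignment & row filter" by drawing a uniform number per cluster and filtering on 0.65, so each split is only approximately 65:35. Assigning whole clusters (rather than individual compounds) to training or test is an assumption, as are the function names and the fixed seed.

```python
import random

def split_by_random_number(cluster_ids, train_fraction=0.65, rng=random):
    """Draw a uniform number per cluster; clusters below train_fraction go to training."""
    train, test = [], []
    for cid in cluster_ids:
        (train if rng.random() < train_fraction else test).append(cid)
    return train, test

def repeated_splits(cluster_ids, repeats=50, seed=0):
    """Yield one (train, test) pair per repeat, reproducibly for a given seed."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield split_by_random_number(cluster_ids, rng=rng)

# Hypothetical cluster labels
cluster_ids = [f"cluster_{i}" for i in range(40)]
splits = list(repeated_splits(cluster_ids))
mean_train = sum(len(train) for train, _ in splits) / len(splits)
print(len(splits), mean_train)
```

Fitting a model on each of the 50 training sets then gives a distribution of coefficients and P values instead of a single estimate.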

coefficients in linear regression equation (training set)

∆logS = a.∆AlogP + b.∆base_count + c.∆numArom + d.∆amideH_count

[Plot: value of coefficient (y axis) for delta_AlogP coeff, delta base_count coeff, delta numAromrings coeff and delta amideH coeff (x axis)]
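To make the regression equation above concrete, the sketch below fits the four coefficients a, b, c and d by ordinary least squares with no intercept, matching the form of the equation on the slide. It solves the normal equations in plain Python. The data are made up, and the KNIME learner used in the talk may differ (for example, by including an intercept).

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(row) + [rhs] for row, rhs in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

def fit_no_intercept(X, y):
    """Least-squares coefficients for y = X . coeffs (no intercept term)."""
    n_features = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n_features)]
           for i in range(n_features)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(n_features)]
    return solve_linear_system(xtx, xty)

# Hypothetical delta descriptors: (dAlogP, dbase_count, dnumArom, damideH_count)
X = [
    (0.2, 0.0, 0.1, -0.1),
    (-0.1, 0.3, 0.0, 0.2),
    (0.0, -0.2, -0.1, 0.1),
    (0.3, 0.1, 0.2, 0.0),
    (-0.2, 0.0, -0.3, -0.2),
    (0.1, -0.1, 0.2, 0.3),
]
true_coeffs = (-0.5, 0.3, -0.2, -0.4)  # made-up values used only to build y
y = [sum(c * v for c, v in zip(true_coeffs, row)) for row in X]
a, b, c, d = fit_no_intercept(X, y)
print(a, b, c, d)
```

Because y is built without noise, the fit should recover the made-up coefficients to within rounding error; with real data the coefficients vary from split to split, which is what the coefficient plot summarises.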

P values – how significant are the descriptors in each (of 50) model?

[Plot: P values for delta_AlogP coeff, delta base_count coeff, delta numAromrings coeff and delta amideH coeff]

• AlogP & number of amide H’s significant for all 50 models (fits with previous plot)

so how about predicting logS? (rather than ∆logS)

• test set of 258 compounds

• r² = 0.65

but mean logS for cluster almost as good… r² = 0.59
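The comparison between the model and the cluster-mean baseline can be sketched as follows: predicted logS is the cluster mean plus the predicted ∆logS, and the baseline is the cluster mean alone. The numbers are invented for illustration. The slides do not say how r² was computed, so the coefficient of determination used here is an assumption (a squared Pearson correlation would be another possibility).

```python
def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical test compounds: (cluster mean logS, predicted delta logS, measured logS)
test_rows = [
    (-5.0, 0.4, -4.5),
    (-5.0, -0.3, -5.4),
    (-3.5, 0.2, -3.4),
    (-3.5, -0.2, -3.6),
    (-6.2, 0.1, -6.0),
]
actual = [measured for _, _, measured in test_rows]
model_pred = [mean + delta for mean, delta, _ in test_rows]  # cluster mean + predicted delta
baseline_pred = [mean for mean, _, _ in test_rows]           # cluster mean alone

model_r2 = r_squared(actual, model_pred)
baseline_r2 = r_squared(actual, baseline_pred)
print("model r2:", model_r2)
print("baseline r2:", baseline_r2)
```

Reporting the baseline alongside the model shows how much of the apparent performance comes simply from knowing which cluster a compound belongs to.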

some more ideas…

• there are always more descriptors
  • pKa, dipole moment, …

• randomize Y values (within clusters) to check for potential over-training

• try another endpoint

• is there a loop node that helps with descriptor selection/model optimisation?

• …
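The Y-randomization idea from the list above could be sketched like this: permute the measured values among the compounds of each cluster, leave the descriptors untouched, refit, and check that the model quality collapses. The row layout, the `logS` key and the function name are assumptions for illustration.

```python
import random
from collections import defaultdict

def shuffle_y_within_clusters(rows, y_key="logS", seed=0):
    """Return copies of rows with the y values permuted inside each cluster.

    rows: list of dicts with a "cluster" key and the y_key column.
    Descriptor columns are left untouched, so any real structure-property
    link is broken while each cluster keeps the same set of y values.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, row in enumerate(rows):
        by_cluster[row["cluster"]].append(index)
    shuffled = [dict(row) for row in rows]  # copy so the input is not modified
    for indices in by_cluster.values():
        values = [rows[i][y_key] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            shuffled[i][y_key] = value
    return shuffled

# Hypothetical rows
rows = [
    {"id": "cpd1", "cluster": "A", "logS": -5.0},
    {"id": "cpd2", "cluster": "A", "logS": -4.2},
    {"id": "cpd3", "cluster": "A", "logS": -6.1},
    {"id": "cpd4", "cluster": "B", "logS": -3.0},
    {"id": "cpd5", "cluster": "B", "logS": -3.8},
]
randomized = shuffle_y_within_clusters(rows)
print(randomized)
```

If a model trained on the shuffled values scores about as well as the real one, the original fit is probably over-trained.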


acknowledgements

• KNIME, for KNIME

• Thomas Sander & Actelion for DataWarrior

• Mark Wenlock, Nicholas Tomkinson, AZ & ChEMBL for the solubility data

• Alexander Alex & Caroline Low for patiently listening & constructive criticism

• Everyone involved in supporting UK-QSAR

“For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled”
Richard Feynman
