Download pdf - Similarity search and QSAR modelingfch.upol.cz/wp-content/uploads/2016/11/Similarity_Polishchuk.pdf · Similarity search and QSAR modeling Pavel Polishchuk Institute of Molecular

Similarity search and QSAR modeling

Pavel Polishchuk

Institute of Molecular and Translational Medicine Faculty of Medicine and Dentistry

Palacky University

[email protected] qsar4u.com

November-December 2016, Palacky University, Olomouc, Czech republic for the short course of “introduction to ligand-based drug discovery”

Computer-aided drug design (CADD)

Protein structure

Unknown Known

Liga

nd

str

uct

ure

Unknown CADD of no use De novo design

Known

Ligand-based design

QSAR, pharmacophore modeling,

similarity searching

Structure-based design

Docking, molecular dynamics,

pharmacophore modeling

Similarity principle

Similia similibus curantur

Similar chemicals are solved in similar ones

Similar structures exhibit similar properties

Are they similar?

morphine codeine

S-thalidomide R-thalidomide

tirofibane

eptifibatide

tirofibane ethyl ester

How to measure similarity?

Structure representation (encoding)

Similarity measure

Structure representation and similarity

Structural keys

Feature vectors

Pharmacophore

Shape similarity

Field similarity

Fingerprints

10001001010101110111001

11 4 4 21 … 5 4 0 0 1

Similarity

Structure encoding. Structural similarity.

Nomenclature

Line notation (e.g. SMILES, InChi)

Molecular graph (2D descriptors)

Fragments count (e.g. structural keys, fingerprints)

3D structures (geometrical descriptors)

glucose

OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O

InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1

n(OH) = 5; n(ring6) = 1, etc

shape, volume, surface, etc

topological indices, etc

Structural keys

Structural keys are predefined patterns (fixed length).

• The presence/absence of each element, or if an element is common (nitrogen, for example), several bits might represent "at least 1 N", "at least 2 N", "at least 4 N", and so forth.

• Unusual or important electronic configurations, such as "sp3 carbon" or "triple-bonded nitrogen.“

• Rings and ring systems, such as cyclohexane, pyridine, or napthalene.

• Common functional groups, such as alcohols, amines, hydrocarbons, and so forth.

Examples:

MACCS (166 or 960 bits) BCI CACTVS

MACCS keys (166 structural keys)

…

49: [!+0], # CHARGE

50: [#6]=[#6](~[#6])~[#6], # C=C(C)C

51: [#6]~[#16]~[#8], # CSO

52: [#7]~[#7], # NN

53: [!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0], # QHAAAQH

54: [!#6;!#1;!H0]~*~*~[!#6;!#1;!H0], # QHAAQH

55: [#8]~[#16]~[#8], # OSO

…

… … … … … … 0 1 1 0 … … … … … … … … …

charge

fixed length, binary or counts

Structural keys: advantages and drawbacks

+ fixed length(easy to store and update data base) - lack of generality (can have little discriminating power for certain data sets due to too dense or too sparse bit strings)

Fingerprints

No predefined patterns, patterns are generated separately for each structure.

Usually binary, but maybe counts (feature vectors)

Atom-pair fingerprints

Atom type: element + # heavy neighbors + # pi electrons

Atom-pair: atom type – topological distance – atom type

O_1_1

O_1_0

N_2_1

C_2_1

C_3_1

N_2_1-4-O_1_1 C_3_1-2-O_1_0 N_2_1-2-C_2_1

R.E. Carhart, D.H. Smith, R. Venkataraghavan JCICS 25: 64-73 (1985)

Sequence-based fingerprints (Daylight)

Generate all paths up to certain length considering atom and bond types (usually length is 2-7)

0-bond paths N C O

1-bond paths N-C C-C C-O C=O

2-bond paths N-C-C C-C-O C-C=O

3-bond paths N-C-C-O N-C-C=O

Atom-centric (augmented atoms) fingerprints

Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter). Morgan fingerprints, Extended-connectivity fingerprints (ECFP). Functional-class fingerprints (FCFP) , etc.

radius=2 (diameter=4)

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

Atom-centric (augmented atoms) fingerprints

Morgan fingerprints like Functional-class fingerprints (FCFP)

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

Feature Code SMARTS (RDKit implementation)

Donor hd [$([N;!H0;v3,v4&+1]),$([O,S;H1;+0]),n&H1&+0]

Acceptor ha [$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=[O,N,P,S])]),n&H0&+0,$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]

Aromatic a [a]

Halogen h [F,Cl,Br,I]

Basic b [#7;+,$([N;H2&+0][$([C,a]);!$([C,a](=O))]),$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]

Acidic ac [$([C,S](=[O,S,P])-[O;H1,-1])]

Fingerprints

Each molecule has variable length set of substructures – variable length fingerprints

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

1 1 1 0 0 0 0

1 1 1 1 0 0 0

0 0 0 0 1 1 1

2-bond sequences

Hashed fingerprints

Have fixed length (usually 512, 1024 or 2048 bits)

hash code

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N

13823 9740 37278 28478 874 283764

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

pseudo-random number generator

fixed-length bit string

Each substructure activates several bits (usually 4-5) to avoid collisions and produce bit string of enough density Missing bits mean that certain substructures are not presented, Active bits mean that certain substructure may be present (but due to possible collisions one cannot be sure)

Folding of fingerprints and keys bit strings

To increase density of very sparse bit vectors

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

0 0 0 1 0 0 1 1 0 1

0 0 1 0 1 0 0 0 1 1

OR

n bit fingerprints

n/2 bits

n/2 bits

0 0 1 1 1 0 1 1 1 1 n/2 bit fingerprint

Usually optimal density is 0.2-0.3

Feature vectors

May contain counts of substructures as well as other features derived from the structure (calculated logP, MW, etc)

Feature vectors usually used in QSAR modeling than in simple similarity search

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

1 1 1 0 0 0 0

2 1 1 1 0 0 0

0 0 0 0 3 2 1

2-bond sequences

Maximum common substructure (MCS)

furosemid (diuretic)

sulfadiazine (antibiotic)

aspirin (anti-coagulant, COX inhibitor)

clopidogrel (anti-coagulant, P2Y12 inhibitor)

Similarity indices

Similarity indices for binary fingerprints

Index name Formula Range

Tanimoto (Jaccard) [0; 1]

Tversky depends on α and β

Dice [0; 1]

Cosine [0; 1]

Manhattan distance [0; n]

Euclidian distance [0; n]

cba

cS BA

,

ccbca

cS BA

)()(,

ba

cS BA

2,

ab

cS BA ,

cbaD BA 2,

cbaD BA 2,

a, b – number of bits “on” (1) in A and B bit strings; c – number of common bits “on” (1) in both A and B strings

P. Willet, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996

Similarity indices for numeric fingerprints

Index name Formula Range

Tanimoto (Jaccard) [-1/3; 1]

Tversky depends on α and β

Dice [-1; 1]

Cosine [-1; 1]

Manhattan distance [0; ∞]

Euclidian distance [0; ∞]

n

i BiAi

n

i Bi

n

i Ai

n

i BiAi

BA

xxxx

xxS

1 ,,1

2

,1

2

,

1 ,,

,

n

i BiAi

n

i Bi

n

i Ai

n

i BiAi

BA

xxxx

xxS

1 ,,1

2

,1

2

,

1 ,,

,

n

i Bi

n

i Ai

n

i BiAi

BA

xx

xxS

1

2

,1

2

,

1 ,,

,

2

n

i Bi

n

i Ai

n

i BiAi

BA

xx

xxS

1

2

,1

2

,

1 ,,

,

n

i

BiAiBA xxD1

,,,

n

i

BiAiBA xxD1

,,,

xi,M – value of i-th descriptor of a molecule M; n – overall number of features in a numeric vector

P. Willet, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996

Example of Tanimoto similarity calculation

0 0 0 1 0 0 1 1 0 1

0 0 1 0 1 0 1 0 1 1 cba

cS BA

,

A

B

a = 4 b = 5 c = 2

2857.07

2

254

2,

BAS

Example of Tanimoto similarity calculation on MCS

cba

cS BA

,

a = 21 atoms b = 17 atoms

c = 11 atoms

4074.027

11

111721

11,

BAS

Example

Tanimoto Dice

MACCS 0.94 0.969

Atom pairs 0.994 0.997

Morgan (ECFP4-like) 0.759 0.878

Morgan (FCFP4-like) 0.782 0.878

*calculated with RDKit

morphine codeine

Conclusion

Results of similarity search will depend on selected fingerprints and similarity measure It is extremely fast and may be applied towards millions of compounds

QSAR modeling

Activity = F(structure)

M – mapping function E – encoding function

Modeling of compounds properties

Activity = M(E(structure))

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

QSAR modeling workflow

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

Structure Descriptors (features) Model

Encoding (represent structure with

numerical features)

Mapping (machine learning)

1) a defined endpoint

2) an unambiguous algorithm

3) a defined domain of applicability

4) appropriate measures of goodness-of–fit, robustness and predictivity

5) a mechanistic interpretation, if possible

OECD principles of QSAR modeling

Overall QSAR workflow

Input data

Bioassays

Databases

Preprocessing Feature

engineering Model

learning Model

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization

Feature extraction

Interpretation

j

j

ii

z

xxx '

Step 1. Data collection

Scientific literature and patents Databases (ChEMBL, PubChem, BindingDB, etc)

Traditionally modeled compounds should have the same mechanism of action, however using of complex non-linear machine learning method allows to model data sets with mixed or even unknown mechanism of action with reasonable accuracy.

Conditions may substantially influence the results of bioassays (change in temperature, activators, detectors, etc)

Units checking

Step 2. Data curation (normalization)

Removal of mixtures, inorganics, metalorganics, etc

Strip of salts, counterions, etc

Ionization, if necessary (at the particular pH level)

Chemotype normalization, resonance structure and tautomers

Duplicates removal

Manual checking

Step 3. Descriptors: classification Object type:

molecular descriptors (single molecules) descriptors of molecular ensemble (mixtures, materials) reaction descriptors (reactions)

Descriptor origin: calculated from the structure empirical (Hammet constants, lipophilicity chemical shifts in NMR, etc)

Locality: local (atom charge) global (molecular weight, molecular volume, lipophilicity, etc)

Dimensionality: 1D (number of methyl groups, molecular weight, etc) 2D (topological indices, fragmental descriptors) 3D (molecular volume, quantum chemical descriptors) 4D (based on a set of conformers)

Calculation method: physico-chemical (lipophilicity, etc) topological (invariants of molecular graph, Randic index, Wiener index, etc) fragmental (fingerprints, etc) pharmacophore spatial (moment of inertia, etc) quantum-chemical (energy of HOMO/LUMO, etc) etc. R. Todeschini and V. Consonni Handbook of Molecular Descriptors, 2008

Step 4. Feature processing

Feature transformations: linear and non-linear scaling

Feature combinations: Feature selection:

removal of correlated descriptors (important e.g. for linear regression models) removal of descriptors with missing values removal constant and near constant features removal of descriptors with low correlation with the target property step-wise feature selection evolutionary feature selection (genetic algorithm, etc)

sd

xxz i

i

n

i

i xxn

nsd

1

2)(1

ixie

z

1

1range (0; 1)

jiij xxz

2

ii xz add quadratic term

Step 5. Model building

Unsupervised clustering

Supervised

Regression Classification

Multiple linear regression (MLR)

Partial linear regression (PLS) Logistic regression

Gaussian Process (GP) Naïve Bayes (NB)

Decision trees (DT)

Support vector machine (SVM)

Neural nets (NN)

Random forest (RF)

k-Nearest neighbors (kNN)

Step 6. Validation

Test set (usually 20-25% of the work set)

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2 working set

random test set 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

stratified test set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

Cross-validation

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2 working set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

fold 1

fold 2

fold 3

predictions of different folds are combined to calculate the final predictive measure

Step 6. Measures of predictive ability of models

Classification

FPTN

TNySpecificit

FNTP

TPySensitivit

2

ySensitivitySpecificitaccuracy Balanced

baseline-1

baseline -accuracy Kappa

N

TNTPaccuracy

2N

FP)FN)(TP(TPFN)FP)(TN(TNbaseline

Confusion matrix

Predicted

positive class (1)

negative class (0)

observed

positive class (1)

true positive

(TP)

false negative

(FN)

negative class (0)

false positive

(FP)

true negative

(TN)

))()()(( FNTNFPTNFNTPFPTP

FNFPTNTPMCC

[-1; 1]

[0; 1]

[0; 1]

[0; 1]

[0; 1]

[0; 1]

Step 6. Measures of predictive ability of models

Regression

i

2

obspredi,

i

2

obsi,predi,2

)y(y

)y(y

1Q

1N

)y(y

RMSE i

2

obsi,predi,

N

1i

obsi,predi, yyN

1MAE

Determination coefficient

Root mean squared error

Mean absolute error

Step 7. Applicability domain (AD) Extrapolation to very distant objects is dangerous

There is a need to define the domain where our model is reliable (models are not universal!)

Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.

Step 7. Applicability domain (AD) measures

Bounding box - based on descriptor range

Desc_1

Desc_2

AD • internal regions are usually empty, especially if the number of descriptors is big

• it doesn’t take into account descriptor correlation

Distance from training set compounds in descriptor space

Desc_1

Desc_2 Distance from training set compounds in model space

Model_1

Model_2

Requires several models (e.g. consensus model, bootstrap models)

One-class SVM

Conformal predictions

Step 7. Applicability domain (AD) example

J. Chem. Inf. Model., Vol. 50, No. 12, 2010

Step 8. Interpretation of QSAR models

1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

π = logPX – logPH

σ - Hammet constant

plant growth inhibition activity of phenoxyacetic acids

electronic factors rate of penetration of membranes in the plant cell

Hansch equation

R is H or CH3;

X is Br, Cl, NO2 and

Y is NO2, NH2, NHC(=O)CH3

Inhibition activity of compounds against Staphylococcus aureus

Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

Free-Wilson models


347 agonists of 5-HT1A receptor

Ar - substituted (hetero)aryls L - polymethylene chain R - various (poly)cyclic residues

O O OMe Cl Cl CF3

N

N

CH3

F

O2N

Cl

PLS 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96 RF 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66

Ar

L -(CH2)6- -(CH2)5- -(CH2)4- -(CH2)3- -(CH2)2- -CH2-

PLS 0.8 0.71 0.81 0.08 -0.04 0.06

RF 0.14 0.19 0.14 -0.01 -0.03 0.05


2δ

δ)f(xδ)f(xC

f(x)δ)f(xC

δ)f(xf(x)C

forward difference

backward difference

central difference

Partial derivatives is used to calculate contributions of single features. It is applicable to any model built on meaningful (interpretable) descriptors. May be calculated analytically or numerically.

Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.

Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642


= –

A B C

Activitypred(A) Activitypred(B) Contribution(C)

f(A) = x f(B) = y W(C) = x – y

Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853

Universal approach


Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.

Acute oral toxicity on rats

?

?

Overall QSAR workflow

Input data

Bioassays

Databases

Preprocessing Feature

engineering Model

learning Model

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization

Feature extraction

Interpretation

j

j

ii

z

xxx '