Similarity search and QSAR modeling
Pavel Polishchuk
Institute of Molecular and Translational Medicine Faculty of Medicine and Dentistry
Palacky University
[email protected] qsar4u.com
November-December 2016, Palacky University, Olomouc, Czech Republic, for the short course “Introduction to ligand-based drug discovery”
Computer-aided drug design (CADD)
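Ligand structure unknown, protein structure unknown: CADD of no use
Ligand structure unknown, protein structure known: De novo design
Ligand structure known, protein structure unknown: Ligand-based design (QSAR, pharmacophore modeling, similarity searching)
Ligand structure known, protein structure known: Structure-based design (docking, molecular dynamics, pharmacophore modeling)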
Similarity principle
Similia similibus curantur
Like dissolves like (similar substances dissolve in similar solvents)
Similar structures exhibit similar properties
Are they similar?
morphine codeine
S-thalidomide R-thalidomide
tirofiban
eptifibatide
tirofiban ethyl ester
How to measure similarity?
Structure representation (encoding)
Similarity measure
Structure representation and similarity
Structural keys
Feature vectors
Pharmacophore
Shape similarity
Field similarity
Fingerprints
10001001010101110111001
11 4 4 21 … 5 4 0 0 1
Similarity
Structure encoding. Structural similarity.
Nomenclature: glucose
Line notation (e.g. SMILES, InChI):
  OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O
  InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1
Molecular graph (2D descriptors): topological indices, etc.
Fragment counts (e.g. structural keys, fingerprints): n(OH) = 5; n(ring6) = 1, etc.
3D structures (geometrical descriptors): shape, volume, surface, etc.
Structural keys
Structural keys are predefined patterns (fixed length).
• The presence/absence of each element, or if an element is common (nitrogen, for example), several bits might represent "at least 1 N", "at least 2 N", "at least 4 N", and so forth.
• Unusual or important electronic configurations, such as "sp3 carbon" or "triple-bonded nitrogen".
• Rings and ring systems, such as cyclohexane, pyridine, or naphthalene.
• Common functional groups, such as alcohols, amines, hydrocarbons, and so forth.
Examples:
MACCS (166 or 960 bits), BCI, CACTVS
MACCS keys (166 structural keys)
…
49: [!+0], # CHARGE
50: [#6]=[#6](~[#6])~[#6], # C=C(C)C
51: [#6]~[#16]~[#8], # CSO
52: [#7]~[#7], # NN
53: [!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0], # QHAAAQH
54: [!#6;!#1;!H0]~*~*~[!#6;!#1;!H0], # QHAAQH
55: [#8]~[#16]~[#8], # OSO
…
Resulting vector: … 0 1 1 0 … (one position per key, e.g. key 49 for charge)
Fixed length, binary or counts
Structural keys: advantages and drawbacks
+ fixed length (easy to store and update in a database)
- lack of generality (may have little discriminating power for certain data sets because the bit strings are too dense or too sparse)
Fingerprints
No predefined patterns, patterns are generated separately for each structure.
Usually binary, but may be counts (feature vectors)
Atom-pair fingerprints
Atom type: element + # heavy neighbors + # pi electrons
Atom-pair: atom type – topological distance – atom type
O_1_1
O_1_0
N_2_1
C_2_1
C_3_1
N_2_1-4-O_1_1 C_3_1-2-O_1_0 N_2_1-2-C_2_1
R.E. Carhart, D.H. Smith, R. Venkataraghavan JCICS 25: 64-73 (1985)
Sequence-based fingerprints (Daylight)
Generate all paths up to certain length considering atom and bond types (usually length is 2-7)
0-bond paths N C O
1-bond paths N-C C-C C-O C=O
2-bond paths N-C-C C-C-O C-C=O
3-bond paths N-C-C-O N-C-C=O
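The path enumeration above can be sketched in a few lines of plain Python. This is only an illustration, not the Daylight implementation: the toy molecule (the heavy atoms of an N-C-C(=O)-O fragment, which is assumed to be the one behind the listed paths), the names `atoms`, `bonds` and `enumerate_paths`, and the choice of writing each path in the lexicographically smaller of its two reading directions are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy molecule: heavy atoms of N-C-C(=O)-O (assumed fragment).
atoms = {0: "N", 1: "C", 2: "C", 3: "O", 4: "O"}
bonds = [(0, 1, "-"), (1, 2, "-"), (2, 3, "="), (2, 4, "-")]


def enumerate_paths(atoms, bonds, max_bonds=3):
    """Return {number of bonds: set of path strings} for all linear paths."""
    neighbours = defaultdict(list)
    for a, b, order in bonds:
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))

    found = defaultdict(set)

    def walk(path, symbols):
        # path: atom indices visited so far
        # symbols: alternating atom and bond symbols, e.g. ["N", "-", "C"]
        forward = "".join(symbols)
        backward = "".join(reversed(symbols))
        # A path and its reverse describe the same substructure,
        # so only one canonical spelling is stored.
        found[len(path) - 1].add(min(forward, backward))
        if len(path) - 1 == max_bonds:
            return
        for nxt, order in neighbours[path[-1]]:
            if nxt not in path:  # linear paths only, no atom is revisited
                walk(path + [nxt], symbols + [order, atoms[nxt]])

    for start in atoms:
        walk([start], [atoms[start]])
    return dict(found)


paths = enumerate_paths(atoms, bonds, max_bonds=3)
for n_bonds in sorted(paths):
    print(n_bonds, sorted(paths[n_bonds]))
```

Because of the canonical spelling, some paths are printed reversed relative to the list above (C-N instead of N-C). For this toy fragment the enumeration also reports the 2-bond path O-C=O.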
Atom-centric (augmented atoms) fingerprints
Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter). Examples: Morgan fingerprints, extended-connectivity fingerprints (ECFP), functional-class fingerprints (FCFP), etc.
radius=2 (diameter=4)
Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)
Atom-centric (augmented atoms) fingerprints
FCFP-like Morgan fingerprints: atoms are typed by functional class (features below)
Feature Code SMARTS (RDKit implementation)
Donor hd [$([N;!H0;v3,v4&+1]),$([O,S;H1;+0]),n&H1&+0]
Acceptor ha [$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=[O,N,P,S])]),n&H0&+0,$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]
Aromatic a [a]
Halogen h [F,Cl,Br,I]
Basic b [#7;+,$([N;H2&+0][$([C,a]);!$([C,a](=O))]),$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]
Acidic ac [$([C,S](=[O,S,P])-[O;H1,-1])]
Fingerprints
Each molecule yields a variable-length set of substructures, hence variable-length fingerprints.
Example (2-bond sequences, one row per molecule):
N-C-C  C-C-O  C-C=O  C-C-C  C:C:C  C:C:N  C:N:C
1      1      1      0      0      0      0
1      1      1      1      0      0      0
0      0      0      0      1      1      1
Hashed fingerprints
Have fixed length (usually 512, 1024 or 2048 bits)
Substructure → hash code → pseudo-random number generator → fixed-length bit string
Substructures: N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N
Hash codes: 13823 9740 37278 28478 874 283764
Fixed-length bit string: 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1
Each substructure activates several bits (usually 4-5) to reduce the effect of collisions and to produce a bit string of sufficient density.
Missing bits mean that certain substructures are not present.
Active bits mean that a certain substructure may be present (because of possible collisions one cannot be sure).
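The hashing scheme (substructure → hash code → pseudo-random number generator → bits) can be illustrated as follows. This is a minimal sketch under stated assumptions: `zlib.crc32` stands in for whatever hash function a real toolkit uses, Python's `random.Random` plays the role of the pseudo-random number generator, and the function name `hashed_fingerprint` and its parameters are hypothetical.

```python
import random
import zlib


def hashed_fingerprint(patterns, n_bits=1024, bits_per_pattern=4):
    """Map a variable-length set of substructure strings to a fixed-length bit list."""
    fingerprint = [0] * n_bits
    for pattern in patterns:
        # Deterministic hash code of the substructure string (stand-in hash).
        hash_code = zlib.crc32(pattern.encode("utf-8"))
        # The hash code seeds a pseudo-random number generator ...
        rng = random.Random(hash_code)
        # ... which selects the positions this substructure switches on.
        for _ in range(bits_per_pattern):
            fingerprint[rng.randrange(n_bits)] = 1
    return fingerprint


patterns = ["N-C-C", "C-C-O", "C-C=O"]
fp = hashed_fingerprint(patterns, n_bits=64, bits_per_pattern=4)
print("bits on:", sum(fp), "of", len(fp))
```

Since the same substructure always sets the same positions, every bit set by a subset of patterns is also set in the fingerprint of the full pattern set, which is what makes hashed fingerprints usable for substructure screening.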
Folding of fingerprints and keys bit strings
To increase density of very sparse bit vectors
n-bit fingerprint: 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1
first half (n/2 bits): 0 0 0 1 0 0 1 1 0 1
second half (n/2 bits): 0 0 1 0 1 0 0 0 1 1
OR of the two halves (n/2-bit fingerprint): 0 0 1 1 1 0 1 1 1 1
Usually optimal density is 0.2-0.3
Feature vectors
May contain counts of substructures as well as other features derived from the structure (calculated logP, MW, etc)
Feature vectors are more often used in QSAR modeling than in simple similarity search
N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C
1 1 1 0 0 0 0
2 1 1 1 0 0 0
0 0 0 0 3 2 1
2-bond sequences
Maximum common substructure (MCS)
furosemide (diuretic)
sulfadiazine (antibiotic)
aspirin (antiplatelet agent, COX inhibitor)
clopidogrel (antiplatelet agent, P2Y12 inhibitor)
Similarity indices
Similarity indices for binary fingerprints
Index name: formula; range
Tanimoto (Jaccard): $S_{A,B} = \dfrac{c}{a + b - c}$; [0; 1]
Tversky: $S_{A,B} = \dfrac{c}{\alpha (a - c) + \beta (b - c) + c}$; depends on α and β
Dice: $S_{A,B} = \dfrac{2c}{a + b}$; [0; 1]
Cosine: $S_{A,B} = \dfrac{c}{\sqrt{ab}}$; [0; 1]
Manhattan distance: $D_{A,B} = a + b - 2c$; [0; n]
Euclidean distance: $D_{A,B} = \sqrt{a + b - 2c}$; [0; √n]
a, b - number of bits “on” (1) in the A and B bit strings; c - number of bits “on” (1) in both A and B; n - length of the bit strings
P. Willett, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996
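The binary indices translate directly into code once a, b and c are counted. The following is a minimal sketch with hypothetical function names; bit strings are plain Python lists of 0/1, and the case of two empty fingerprints (division by zero) is deliberately not handled.

```python
from math import sqrt


def bit_counts(bits_a, bits_b):
    """Return (a, b, c): bits on in A, bits on in B, bits on in both."""
    a = sum(bits_a)
    b = sum(bits_b)
    c = sum(x & y for x, y in zip(bits_a, bits_b))
    return a, b, c


def tanimoto(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return c / (a + b - c)


def tversky(bits_a, bits_b, alpha, beta):
    a, b, c = bit_counts(bits_a, bits_b)
    return c / (alpha * (a - c) + beta * (b - c) + c)


def dice(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return 2 * c / (a + b)


def cosine(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return c / sqrt(a * b)


def manhattan(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return a + b - 2 * c


def euclidean(bits_a, bits_b):
    return sqrt(manhattan(bits_a, bits_b))


A = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
B = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]
print(bit_counts(A, B), round(tanimoto(A, B), 4), round(dice(A, B), 4))
```

With α = β = 1 the Tversky index reduces to Tanimoto, and with α = β = 0.5 it reduces to Dice, which is a convenient sanity check for an implementation.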
Similarity indices for numeric fingerprints
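Index name: formula; range
Tanimoto (Jaccard): $S_{A,B} = \dfrac{\sum_{i=1}^{n} x_{i,A} x_{i,B}}{\sum_{i=1}^{n} x_{i,A}^2 + \sum_{i=1}^{n} x_{i,B}^2 - \sum_{i=1}^{n} x_{i,A} x_{i,B}}$; [-1/3; 1]
Tversky: $S_{A,B} = \dfrac{\sum_{i=1}^{n} x_{i,A} x_{i,B}}{\alpha \sum_{i=1}^{n} x_{i,A}^2 + \beta \sum_{i=1}^{n} x_{i,B}^2 + (1 - \alpha - \beta) \sum_{i=1}^{n} x_{i,A} x_{i,B}}$; depends on α and β
Dice: $S_{A,B} = \dfrac{2 \sum_{i=1}^{n} x_{i,A} x_{i,B}}{\sum_{i=1}^{n} x_{i,A}^2 + \sum_{i=1}^{n} x_{i,B}^2}$; [-1; 1]
Cosine: $S_{A,B} = \dfrac{\sum_{i=1}^{n} x_{i,A} x_{i,B}}{\sqrt{\sum_{i=1}^{n} x_{i,A}^2 \sum_{i=1}^{n} x_{i,B}^2}}$; [-1; 1]
Manhattan distance: $D_{A,B} = \sum_{i=1}^{n} |x_{i,A} - x_{i,B}|$; [0; ∞)
Euclidean distance: $D_{A,B} = \sqrt{\sum_{i=1}^{n} (x_{i,A} - x_{i,B})^2}$; [0; ∞)
$x_{i,M}$ - value of the i-th descriptor of a molecule M; n - overall number of features in a numeric vector
P. Willett, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996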
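For numeric (count or real-valued) vectors the same indices are built from three sums: Σ x_A·x_B, Σ x_A² and Σ x_B². The sketch below uses hypothetical function names and, as test data, the first two count vectors from the feature-vector slide; vectors of all zeros are not handled.

```python
from math import sqrt


def _sums(xa, xb):
    """Return (sum of xA*xB, sum of xA^2, sum of xB^2)."""
    ab = sum(p * q for p, q in zip(xa, xb))
    aa = sum(p * p for p in xa)
    bb = sum(q * q for q in xb)
    return ab, aa, bb


def tanimoto_numeric(xa, xb):
    ab, aa, bb = _sums(xa, xb)
    return ab / (aa + bb - ab)


def dice_numeric(xa, xb):
    ab, aa, bb = _sums(xa, xb)
    return 2 * ab / (aa + bb)


def cosine_numeric(xa, xb):
    ab, aa, bb = _sums(xa, xb)
    return ab / sqrt(aa * bb)


def manhattan_numeric(xa, xb):
    return sum(abs(p - q) for p, q in zip(xa, xb))


def euclidean_numeric(xa, xb):
    return sqrt(sum((p - q) ** 2 for p, q in zip(xa, xb)))


mol_1 = [1, 1, 1, 0, 0, 0, 0]
mol_2 = [2, 1, 1, 1, 0, 0, 0]
print(round(tanimoto_numeric(mol_1, mol_2), 4), dice_numeric(mol_1, mol_2))
# Opposite vectors give the lower bound of the Tanimoto range, -1/3.
print(tanimoto_numeric([1, 2], [-1, -2]))
```

For 0/1 vectors these functions return the same values as the binary formulas, because the three sums then equal c, a and b.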
Example of Tanimoto similarity calculation
A: 0 0 0 1 0 0 1 1 0 1
B: 0 0 1 0 1 0 1 0 1 1
a = 4, b = 5, c = 2
$S_{A,B} = \dfrac{c}{a + b - c} = \dfrac{2}{4 + 5 - 2} = \dfrac{2}{7} \approx 0.2857$
Example of Tanimoto similarity calculation on MCS
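a = 21 atoms, b = 17 atoms, c = 11 atoms (c is the size of the maximum common substructure)
$S_{A,B} = \dfrac{c}{a + b - c} = \dfrac{11}{21 + 17 - 11} = \dfrac{11}{27} \approx 0.4074$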
Example: morphine vs codeine
Fingerprint           Tanimoto  Dice
MACCS                 0.94      0.969
Atom pairs            0.994     0.997
Morgan (ECFP4-like)   0.759     0.878
Morgan (FCFP4-like)   0.782     0.878
*calculated with RDKit
Conclusion
Results of a similarity search depend on the selected fingerprints and similarity measure.
Similarity search is extremely fast and can be applied to millions of compounds.
QSAR modeling
Modeling of compound properties
Activity = F(structure)
Activity = M(E(structure))
E - encoding function, M - mapping function
D1 D2 D3 D4 D5 D6 … DN
1 0 9 0 11 1 … 1
4 0 1 0 0 0 … 1
0 0 0 0 0 4 … 6
0 2 3 6 0 0 … 3
… … … … … … … …
4 0 0 0 1 2 … 1
QSAR modeling workflow
Structure → Descriptors (features) → Model
Encoding: represent the structure with numerical features
Mapping: machine learning
OECD principles of QSAR modeling
1) a defined endpoint
2) an unambiguous algorithm
3) a defined domain of applicability
4) appropriate measures of goodness-of-fit, robustness and predictivity
5) a mechanistic interpretation, if possible
Overall QSAR workflow
Input data: bioassays, databases
Preprocessing: data normalization
Feature engineering: feature extraction, feature selection, feature combination
Model learning: classification, regression, clustering
Model validation: cross-validation, bootstrap, test set, applicability domain
Interpretation
Step 1. Data collection
Scientific literature and patents
Databases (ChEMBL, PubChem, BindingDB, etc.)
Traditionally, modeled compounds should share the same mechanism of action; however, complex non-linear machine learning methods make it possible to model data sets with mixed or even unknown mechanisms of action with reasonable accuracy.
Assay conditions may substantially influence the results of bioassays (changes in temperature, activators, detectors, etc.)
Units checking
Step 2. Data curation (normalization)
Removal of mixtures, inorganics, organometallics, etc.
Stripping of salts, counterions, etc.
Ionization, if necessary (at a particular pH)
Chemotype normalization, resonance structures and tautomers
Removal of duplicates
Manual checking
Step 3. Descriptors: classification
Object type:
  molecular descriptors (single molecules)
  descriptors of molecular ensembles (mixtures, materials)
  reaction descriptors (reactions)
Descriptor origin:
  calculated from the structure
  empirical (Hammett constants, lipophilicity, chemical shifts in NMR, etc.)
Locality:
  local (atom charge)
  global (molecular weight, molecular volume, lipophilicity, etc.)
Dimensionality:
  1D (number of methyl groups, molecular weight, etc.)
  2D (topological indices, fragmental descriptors)
  3D (molecular volume, quantum chemical descriptors)
  4D (based on a set of conformers)
Calculation method:
  physico-chemical (lipophilicity, etc.)
  topological (invariants of the molecular graph, Randic index, Wiener index, etc.)
  fragmental (fingerprints, etc.)
  pharmacophore
  spatial (moment of inertia, etc.)
  quantum-chemical (energy of HOMO/LUMO, etc.)
  etc.
R. Todeschini and V. Consonni, Handbook of Molecular Descriptors, 2008
Step 4. Feature processing
Feature transformations: linear and non-linear scaling
  linear scaling (standardization): $z_i = \dfrac{x_i - \bar{x}}{sd}$, where $sd = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
  non-linear scaling (logistic): $z_i = \dfrac{1}{1 + e^{-x_i}}$, range (0; 1)
Feature combinations:
  product of two features: $z_{ij} = x_i x_j$
  quadratic term: $z_i = x_i^2$
Feature selection:
  removal of correlated descriptors (important e.g. for linear regression models)
  removal of descriptors with missing values
  removal of constant and near-constant features
  removal of descriptors with low correlation with the target property
  step-wise feature selection
  evolutionary feature selection (genetic algorithm, etc.)
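The transformations and combinations of this step can be sketched with the standard library alone. The function names are hypothetical, and the standard deviation is assumed to be the sample standard deviation (denominator n - 1).

```python
from math import exp, sqrt


def z_scores(values):
    """Linear scaling: subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # assumed n - 1
    return [(x - mean) / sd for x in values]


def logistic_scaling(values):
    """Non-linear scaling of each value into the open interval (0; 1)."""
    return [1 / (1 + exp(-x)) for x in values]


def add_combinations(row):
    """Append pairwise products x_i * x_j (i < j) and quadratic terms x_i ** 2."""
    products = [row[i] * row[j]
                for i in range(len(row))
                for j in range(i + 1, len(row))]
    squares = [x * x for x in row]
    return list(row) + products + squares


print(z_scores([1.0, 2.0, 3.0]))       # one descriptor over three compounds
print(logistic_scaling([-2.0, 0.0, 2.0]))
print(add_combinations([2, 3]))        # one compound with two descriptors
```

Scaling is applied per descriptor (a column of the descriptor matrix), whereas feature combinations are built per compound (a row), which is why the first two functions take a column and the last one takes a row.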
Step 5. Model building
Unsupervised: clustering
Supervised, regression only: Multiple linear regression (MLR), Partial least squares (PLS), Gaussian process (GP)
Supervised, classification only: Logistic regression, Naïve Bayes (NB)
Supervised, regression and classification: Decision trees (DT), Support vector machine (SVM), Neural nets (NN), Random forest (RF), k-Nearest neighbors (kNN)
Step 6. Validation
Test set (usually 20-25% of the working set)
Working set (activity values): 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
Random test set: test compounds are picked at random from the working set
Stratified test set: test compounds are picked evenly across the whole range of activity values
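The two ways of picking a test set can be contrasted on the 19 activity values of this slide. This is a sketch under assumptions: "stratified" is implemented here as taking every k-th compound after sorting by activity, which is one common way to cover the activity range evenly; the function names and the 25% fraction are chosen for the example.

```python
import random

activities = [1.2, 1.3, 1.7, 2.0, 2.2, 2.8, 3.1, 3.2, 3.2, 3.6,
              4.7, 5.7, 5.8, 6.4, 7.2, 8.1, 9.0, 9.1, 9.2]


def random_split(values, test_fraction=0.25, seed=0):
    """Pick test compounds at random; returns (train indices, test indices)."""
    rng = random.Random(seed)
    indices = list(range(len(values)))
    rng.shuffle(indices)
    n_test = round(len(values) * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def stratified_split(values, test_fraction=0.25):
    """Sort by activity and send every k-th compound to the test set."""
    order = sorted(range(len(values)), key=values.__getitem__)
    step = round(1 / test_fraction)
    test = order[step // 2::step]
    test_lookup = set(test)
    train = [i for i in order if i not in test_lookup]
    return sorted(train), sorted(test)


train, test = stratified_split(activities)
print("stratified test set:", [activities[i] for i in test])
train_r, test_r = random_split(activities)
print("random test set:", [activities[i] for i in test_r])
```

A random split can by chance leave out the lowest or highest activities, while the stratified split samples the whole range by construction.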
Cross-validation
Working set (activity values): 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
The working set is split into folds (fold 1, fold 2, fold 3 in the figure); each fold in turn is held out and predicted by a model built on the remaining folds.
Predictions of the different folds are combined to calculate the final predictive measure.
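A bare-bones version of this procedure is shown below. It is a sketch, not a replacement for a library routine: the fold assignment (compound i goes to fold i mod k), the function names, and the trivial "predict the training mean" model are assumptions made so that the example stays self-contained.

```python
def k_fold_indices(n_items, n_folds=3):
    """Yield (train indices, test indices); item i belongs to fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


def cross_validated_predictions(x, y, fit, predict, n_folds=3):
    """Predict every item exactly once by a model that has not seen it."""
    predictions = [None] * len(y)
    for train, test in k_fold_indices(len(y), n_folds):
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = predict(model, x[i])
    return predictions


# Hypothetical stand-in model: always predicts the mean of the training activities.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_item):
    return model


x = [[0], [1], [2], [3], [4], [5]]          # placeholder descriptors
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validated_predictions(x, y, fit_mean, predict_mean, n_folds=3))
```

The returned list holds one out-of-fold prediction per compound, so a single statistic can be computed over the combined predictions of all folds.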
Step 6. Measures of predictive ability of models
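Classification
Confusion matrix (rows: observed class, columns: predicted class)
                              predicted positive (1)   predicted negative (0)
observed positive class (1)   true positive (TP)       false negative (FN)
observed negative class (0)   false positive (FP)      true negative (TN)
N = TP + TN + FP + FN
Sensitivity: $\dfrac{TP}{TP + FN}$; [0; 1]
Specificity: $\dfrac{TN}{TN + FP}$; [0; 1]
Balanced accuracy: $\dfrac{Sensitivity + Specificity}{2}$; [0; 1]
Accuracy: $\dfrac{TP + TN}{N}$; [0; 1]
Baseline: $\dfrac{(TN + FP)(TN + FN) + (TP + FN)(TP + FP)}{N^2}$; [0; 1]
Kappa: $\dfrac{accuracy - baseline}{1 - baseline}$
MCC: $\dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$; [-1; 1]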
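All classification measures of this slide follow from the four cells of the confusion matrix. The sketch below uses a hypothetical function name and made-up counts; degenerate matrices (an empty row or column) would cause a division by zero and are not handled.

```python
from math import sqrt


def classification_metrics(tp, fn, fp, tn):
    """Compute the measures of predictive ability from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Accuracy expected by chance for the observed and predicted class frequencies.
    baseline = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / n ** 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": accuracy,
        "baseline": baseline,
        "kappa": (accuracy - baseline) / (1 - baseline),
        "mcc": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Made-up counts for illustration only.
for name, value in classification_metrics(tp=40, fn=10, fp=20, tn=30).items():
    print(f"{name}: {value:.3f}")
```

Kappa and MCC both correct for chance agreement, which is why they are preferred over plain accuracy for unbalanced data sets.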
Step 6. Measures of predictive ability of models
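Regression
Determination coefficient: $Q^2 = 1 - \dfrac{\sum_i (y_{i,pred} - y_{i,obs})^2}{\sum_i (y_{i,obs} - \bar{y}_{obs})^2}$
Root mean squared error: $RMSE = \sqrt{\dfrac{\sum_i (y_{i,pred} - y_{i,obs})^2}{N - 1}}$
Mean absolute error: $MAE = \dfrac{1}{N} \sum_{i=1}^{N} |y_{i,pred} - y_{i,obs}|$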
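The three regression measures can be computed as follows. The function names and the toy numbers are hypothetical. The RMSE is written with an N - 1 denominator by default, which is how the formula on this slide is read here; passing `ddof=0` gives the also common 1/N form.

```python
from math import sqrt


def q2(y_obs, y_pred):
    """Determination coefficient: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot


def rmse(y_obs, y_pred, ddof=1):
    """Root mean squared error with denominator N - ddof (assumed default: N - 1)."""
    ss_res = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    return sqrt(ss_res / (len(y_obs) - ddof))


def mae(y_obs, y_pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(y_obs, y_pred)) / len(y_obs)


y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(q2(y_obs, y_pred), rmse(y_obs, y_pred), mae(y_obs, y_pred))
```

When `y_pred` comes from cross-validation or an external test set, the first value is the predictive Q²; computed on fitted training values the same formula gives R².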
Step 7. Applicability domain (AD)
Extrapolation to very distant objects is dangerous
There is a need to define the domain where our model is reliable (models are not universal!)
Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.
Step 7. Applicability domain (AD) measures
Bounding box - based on descriptor ranges (figure: AD as a rectangle in the Desc_1 / Desc_2 plane)
• internal regions are usually empty, especially if the number of descriptors is large
• it does not take descriptor correlation into account
Distance from training set compounds in descriptor space (figure axes: Desc_1, Desc_2)
Distance from training set compounds in model space (figure axes: Model_1, Model_2)
• requires several models (e.g. consensus model, bootstrap models)
One-class SVM
Conformal predictions
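The first two AD measures can be compared on a toy training set. This is a sketch under assumptions: descriptors are plain tuples, the distance measure is Euclidean, and the function names and the distance threshold are hypothetical choices for the example (in practice the threshold is usually derived from distances within the training set).

```python
from math import dist  # Python 3.8+


def in_bounding_box(train, x):
    """True if every descriptor of x lies within the training set range."""
    lows = [min(column) for column in zip(*train)]
    highs = [max(column) for column in zip(*train)]
    return all(lo <= value <= hi for lo, value, hi in zip(lows, x, highs))


def nearest_training_distance(train, x):
    """Euclidean distance from x to the closest training compound."""
    return min(dist(row, x) for row in train)


def in_distance_domain(train, x, threshold):
    """True if x is within `threshold` of at least one training compound."""
    return nearest_training_distance(train, x) <= threshold


# Toy training set: three compounds near the origin and one far away.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
query = (5.0, 5.0)
print(in_bounding_box(train, query))                    # inside the box
print(round(nearest_training_distance(train, query), 3))
print(in_distance_domain(train, query, threshold=2.0))  # but far from any compound
```

The query compound falls inside the bounding box although no training compound is close to it, which illustrates the "empty internal regions" drawback of the bounding box.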
Step 7. Applicability domain (AD) example
J. Chem. Inf. Model., Vol. 50, No. 12, 2010
Step 8. Interpretation of QSAR models
Hansch equation
Plant growth inhibition activity of phenoxyacetic acids:
$\log(1/C) = 4.08\pi - 2.14\pi^2 + 2.78\sigma + 3.38$
$\pi = \log P_X - \log P_H$ (lipophilicity of the substituent): rate of penetration of membranes in the plant cell
$\sigma$ - Hammett constant: electronic factors
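One way to read a Hansch equation is to evaluate it for given substituent constants and to look for the optimal lipophilicity. The sketch below assumes that the left-hand side is log(1/C), as is usual for Hansch equations; the function name is hypothetical and the coefficients are taken from the equation on this slide.

```python
def hansch_activity(pi, sigma):
    """Evaluate the Hansch equation of this slide (assumed to give log(1/C))."""
    return 4.08 * pi - 2.14 * pi ** 2 + 2.78 * sigma + 3.38


# The equation is a downward parabola in pi, so the activity has a maximum where
# the derivative 4.08 - 2 * 2.14 * pi equals zero.
optimal_pi = 4.08 / (2 * 2.14)

print(round(optimal_pi, 2))
print(round(hansch_activity(optimal_pi, sigma=0.0), 2))
```

The negative π² term means that activity rises with lipophilicity only up to the optimum near π ≈ 0.95 and decreases beyond it, while a larger σ (more electron-withdrawing substituent) increases the predicted activity linearly.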
Free-Wilson models
Inhibition activity of compounds against Staphylococcus aureus
R is H or CH3; X is Br, Cl, NO2; Y is NO2, NH2, NHC(=O)CH3
$Act = 75 R_{H} - 112 R_{CH_3} + 84 X_{Cl} - 16 X_{Br} - 26 X_{NO_2} + 123 Y_{NH_2} + 18 Y_{NHC(=O)CH_3} - 218 Y_{NO_2}$
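A Free-Wilson model is a sum of substituent contributions, so a prediction is a table lookup per position. In the sketch below the contributions are copied from the equation on this slide; the dictionary layout and the function name are assumptions for the illustration.

```python
from itertools import product

# Substituent contributions from the Free-Wilson equation of this slide.
CONTRIBUTIONS = {
    "R": {"H": 75, "CH3": -112},
    "X": {"Cl": 84, "Br": -16, "NO2": -26},
    "Y": {"NH2": 123, "NHC(=O)CH3": 18, "NO2": -218},
}


def free_wilson_activity(r, x, y):
    """Predicted activity as the sum of the contributions of R, X and Y."""
    return (CONTRIBUTIONS["R"][r]
            + CONTRIBUTIONS["X"][x]
            + CONTRIBUTIONS["Y"][y])


# Enumerate all 2 * 3 * 3 = 18 combinations and rank them.
combinations = product(CONTRIBUTIONS["R"], CONTRIBUTIONS["X"], CONTRIBUTIONS["Y"])
ranked = sorted(combinations, key=lambda rxy: free_wilson_activity(*rxy), reverse=True)
best, worst = ranked[0], ranked[-1]
print(best, free_wilson_activity(*best))
print(worst, free_wilson_activity(*worst))
```

Because the model is additive, the best compound simply combines the best substituent at each position; the model cannot describe interactions between positions.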
Step 8. Interpretation of QSAR models
347 agonists of 5-HT1A receptor
Ar - substituted (hetero)aryls; L - polymethylene chain; R - various (poly)cyclic residues
Contributions of Ar fragments (the fragment structures are shown in the original figure), in the order of the figure:
PLS: 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96
RF: 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66
Contributions of L fragments:
L     -(CH2)6-  -(CH2)5-  -(CH2)4-  -(CH2)3-  -(CH2)2-  -CH2-
PLS   0.8       0.71      0.81      0.08      -0.04     0.06
RF    0.14      0.19      0.14      -0.01     -0.03     0.05
Step 8. Interpretation of QSAR models
Forward difference: $C = \dfrac{f(x + \delta) - f(x)}{\delta}$
Backward difference: $C = \dfrac{f(x) - f(x - \delta)}{\delta}$
Central difference: $C = \dfrac{f(x + \delta) - f(x - \delta)}{2\delta}$
Partial derivatives are used to calculate the contributions of single features. The approach is applicable to any model built on meaningful (interpretable) descriptors. The derivatives may be calculated analytically or numerically.
Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.
Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642
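Numerical partial derivatives need nothing more than repeated calls of the model with one descriptor shifted. The sketch below is an illustration under assumptions: `model` is a made-up function of two descriptors standing in for any trained QSAR model, δ = 1 is chosen because fragment counts change in whole units, and the function names are hypothetical.

```python
def _shifted(x, i, delta):
    """Copy of the descriptor vector x with descriptor i changed by delta."""
    shifted = list(x)
    shifted[i] += delta
    return shifted


def forward_difference(f, x, i, delta=1.0):
    return (f(_shifted(x, i, delta)) - f(x)) / delta


def backward_difference(f, x, i, delta=1.0):
    return (f(x) - f(_shifted(x, i, -delta))) / delta


def central_difference(f, x, i, delta=1.0):
    return (f(_shifted(x, i, delta)) - f(_shifted(x, i, -delta))) / (2 * delta)


# Hypothetical stand-in for a trained model: linear in x[0], quadratic in x[1].
def model(x):
    return 2 * x[0] + x[1] ** 2


x = [1.0, 3.0]
for i in range(len(x)):
    print(i,
          forward_difference(model, x, i),
          backward_difference(model, x, i),
          central_difference(model, x, i))
```

For the linear descriptor all three estimates agree; for the non-linear one they differ, and the central difference is the closest to the analytical derivative (here it is exact, because the toy model is quadratic).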
Step 8. Interpretation of QSAR models
Universal approach
A - B = C, where B is molecule A with fragment C removed
Contribution(C) = Activity_pred(A) - Activity_pred(B)
f(A) = x, f(B) = y, W(C) = x - y
Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853
Step 8. Interpretation of QSAR models
Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.
Acute oral toxicity on rats
Overall QSAR workflow
Input data → Preprocessing → Feature engineering → Model learning → Model validation → Interpretation