1 Chemical Structure Representation and Search Systems
Lecture 6. Nov 18, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2 Lecture 6: Topics to be Covered
Similarity searching
  • similarity search vs. substructure search
  • similarity and distance metrics
  • different types of descriptor for similarity search
  • choice of descriptors
Chemical Diversity and its measurement
3 Similarity searching
instead of searching for all molecules containing a given substructure, we search for molecules “similar” to a given target molecule
similar property principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”
Mark Johnson and Gerry Maggiora (Eds.) Concepts and Applications of Molecular Similarity. Wiley, New York, 1990
4 What is similarity?
“Similarity is in the eye of the beholder”
Similarity can be measured in many different ways
  • equivalence classes
    o can say that two molecules are similar, or that they are different
  • numerical measures
    o can say that two molecules have a similarity of, e.g. 0.85
    o similarity coefficients usually have values between 0.0 (totally different) and 1.0 (identical)
  • distance measures
    o “opposite” of similarity (0.0 = identical; may have no maximum, or may be normalised to fix maximum limit)
5 Equivalence classes
All molecules which are identical at some level of description are considered equivalent
  • molecular formula
  • structure graph (with no distinction between node and bond types)
  • reduced graph
  • same ring systems
  • same fingerprints
6 Equivalence classes
two different molecules with the same graph
  • if node and edge labels are ignored
[Figure: two structures with the same underlying graph; atom labels on the slide: N, OH, O, N, CH3, CH3]
7 Numerical similarity measures
normally calculate some numerical measure of similarity between molecules
query structure is a “target” molecule
database structures can be ranked in decreasing order of similarity to target
  • find all molecules with > threshold similarity to target
  • find N most similar molecules to target
no particular substructure is required in the retrieved molecules
  • but they will have structural features in common with target
8 Similarity from fingerprints
similarity measures are most commonly calculated from structure fingerprints
  • count the bits that are “on” in both molecules
  • count the bits that are “on” in each molecule separately

struct A: 00010100010101000101010011110100   13 bits on (A)
struct B: 00000000100101001001000011100000    8 bits on (B)
A AND B:  00000000000101000001000011100000    6 bits on (C)

  • similarity coefficient can be calculated from A, B and C
[Figure: Venn diagram of A and B with overlap C]
9 Tanimoto coefficient
$$\text{similarity} = \frac{C}{A + B - C} = \frac{6}{13 + 8 - 6} = 0.4$$
the number of bits set in both molecules divided by the number of bits set in either molecule
The Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
  • also called the Jaccard coefficient
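The worked example for the Tanimoto coefficient can be checked in a few lines of Python. This is a minimal sketch: the bit strings are the ones on the slide, while the helper names `bit_counts` and `tanimoto` are illustrative, and returning 0.0 for two empty fingerprints is an assumed convention that the lecture does not discuss.

```python
# Bit strings copied from the fingerprint example; helper names are illustrative.
FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"

def bit_counts(fp_a, fp_b):
    """Return (A, B, C): bits on in each fingerprint, and bits on in both."""
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    return a, b, c

def tanimoto(a, b, c):
    """C / (A + B - C); assumed to be 0.0 when neither fingerprint has a bit set."""
    union = a + b - c
    return c / union if union else 0.0

a, b, c = bit_counts(FP_A, FP_B)
print(a, b, c, tanimoto(a, b, c))  # 13 8 6 0.4
```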
10 Dice coefficient
$$\text{similarity} = \frac{2C}{A + B} = \frac{12}{13 + 8} = 0.57$$
does not give the same values as the Tanimoto coefficient, but will rank molecules in the same order of similarity to a target
  • i.e. “monotonic” with the Tanimoto coefficient
also called the Czekanowski or Sørensen coefficient
11 Cosine coefficient
$$\text{similarity} = \frac{C}{\sqrt{A \times B}} = \frac{6}{\sqrt{13 \times 8}} = 0.588$$
not monotonic with the Tanimoto and Dice coefficients, but highly correlated with them
  • also called the Ochiai coefficient
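The Dice and Cosine values can be reproduced from the same three counts. The sketch below (function names are illustrative) also checks the algebraic reason Dice ranks molecules in the same order as Tanimoto: substituting T = C / (A + B – C) gives Dice = 2T / (1 + T), which always increases with T.

```python
import math

def tanimoto(a, b, c):
    return c / (a + b - c)

def dice(a, b, c):
    return 2 * c / (a + b)

def cosine(a, b, c):
    return c / math.sqrt(a * b)

A, B, C = 13, 8, 6  # counts from the fingerprint example
t, d, cs = tanimoto(A, B, C), dice(A, B, C), cosine(A, B, C)
print(round(t, 3), round(d, 3), round(cs, 3))  # 0.4 0.571 0.588

# Dice is a strictly increasing function of Tanimoto: D = 2T / (1 + T).
print(abs(d - 2 * t / (1 + t)) < 1e-12)  # True
```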
12 What is similarity?
The three coefficients discussed so far all ignore bits that are off in both molecules
  • is common absence of features evidence of similarity between them?
  • are a camel, a horse and a nematode similar because they all lack wings?
  • are a bat, a heron and a dragonfly similar because they all have wings?
13 What is similarity?
Are these molecules similar because they all lack heteroatoms?
[Figure: hydrocarbon structures drawn using only CH and CH2 groups]
14 Simple Matching coefficient
a similarity coefficient that takes into account the fingerprint bits that are off in both molecules (D)
$$\text{similarity} = \frac{C + D}{N} = \frac{6 + 17}{32} = 0.719$$
N is the length of the fingerprint
  • N = A + B – C + D
[Figure: Venn diagram labelled A, B, C and D]
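The Simple Matching value can be checked directly from the two bit strings of the fingerprint example. This is a sketch only; the function name `simple_matching` is illustrative.

```python
def simple_matching(fp_a, fp_b):
    """(C + D) / N for two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    c = sum(x == "1" and y == "1" for x, y in zip(fp_a, fp_b))  # on in both
    d = sum(x == "0" and y == "0" for x, y in zip(fp_a, fp_b))  # off in both
    return (c + d) / len(fp_a)

FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"
print(simple_matching(FP_A, FP_B))  # 0.71875, i.e. (6 + 17) / 32
```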
15 Asymmetric Similarity
Usually we think that if A is similar to B, then B is similar to A
some coefficients have been defined in which this is not true
S(A,B) ≠ S(B,A)
  • e.g. Tversky similarity
$$\text{similarity} = \frac{C}{\alpha(A - C) + \beta(B - C) + C}$$
where α and β are user-defined parameters
16 Asymmetric (Tversky) Similarity
$$T = \frac{C}{\alpha(A - C) + \beta(B - C) + C}$$
if α = β = 1, equation reduces to Tanimoto coefficient
if α = β = ½, equation reduces to Dice coefficient
if α ≠ β, T becomes asymmetric
  • where α = 1 and β = 0, T = C / A, i.e. the fraction of A which it has in common with B
    o when T = 1.0, it indicates that A is a substructure of B (at the level of fingerprint matching)
    o when T ≈ 1.0, it indicates that A is almost a substructure of B
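The special cases of the Tversky coefficient can be confirmed numerically with the counts from the fingerprint example (A = 13, B = 8, C = 6). The function name and the zero-denominator convention in this sketch are assumptions made for illustration.

```python
def tversky(a, b, c, alpha, beta):
    """C / (alpha * (A - C) + beta * (B - C) + C)."""
    denominator = alpha * (a - c) + beta * (b - c) + c
    return c / denominator if denominator else 0.0

A, B, C = 13, 8, 6
print(tversky(A, B, C, 1, 1))                # 0.4, the Tanimoto value
print(round(tversky(A, B, C, 0.5, 0.5), 3))  # 0.571, the Dice value
print(round(tversky(A, B, C, 1, 0), 3))      # 0.462 = C / A
print(tversky(B, A, C, 1, 0))                # 0.75 = C / B, so S(A,B) != S(B,A)
```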
17 Subsimilarity search
This provides a means of performing substructure similarity (“subsimilarity”) searches
  • also possible with maximal common subgraphs
  • A and B could be number of atoms in each molecule, and C could be number of atoms in their maximal common substructure (MCS)
fingerprint-based similarity is generally faster than identifying MCS
  • but common features (fragments) will be smaller
18 Similarity and Distance
Distance is the opposite of similarity
A similarity coefficient in the range 0 to 1 can be converted to a distance by taking its “complement”
  • Distance = 1 – Similarity
Sometimes there is a different name for the complement of a similarity coefficient:
  • 1 – Tanimoto Coefficient = Soergel Distance
  • 1 – Simple Matching Coefficient = Normalised Hamming Distance
19 Distance Coefficients
analogous to distances in multi-dimensional geometric space
  • not necessarily equivalent to such distances
some distance coefficients are called distance metrics
  • to be a metric, a distance coefficient has to obey certain rules
20 Distance metrics
distances must be zero or positive (no upper limit)
distance from object to itself must be zero
distance between non-identical objects must be greater than zero
distances must be symmetric
distances must obey the triangular inequality
  • $D_{A,B} \le D_{A,C} + D_{B,C}$
[Figure: triangle with vertices A, B and C]
21 Properties of distance coefficients
not all distance coefficients are metrics
those that are have certain advantages, and some assumptions can be made about their behaviour
  • e.g. plotting in multi-dimensional space
metric distance coefficients include
  • Hamming Distance
  • Soergel Distance (= 1 – Tanimoto similarity)
non-metric distance coefficients include
  • 1 – Dice coefficient
  • 1 – Cosine coefficient
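A three-fingerprint counterexample shows why 1 – Dice is not a metric. In the sketch below each fingerprint is represented as a set of "on" bit positions (an illustrative choice, as are the function names). The Soergel distance satisfies the triangular inequality for the same three fingerprints, although one example does not by itself prove that it is a metric.

```python
def soergel(x, y):
    """1 - Tanimoto, for fingerprints given as sets of on-bit positions."""
    return 1 - len(x & y) / len(x | y)

def dice_distance(x, y):
    """1 - Dice, for fingerprints given as sets of on-bit positions."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))

X, Y, Z = {0}, {1}, {0, 1}

# 1 - Dice: the direct distance X-Y (1.0) exceeds the detour via Z (1/3 + 1/3).
print(dice_distance(X, Y) <= dice_distance(X, Z) + dice_distance(Z, Y))  # False

# Soergel: the direct distance X-Y (1.0) equals the detour via Z (0.5 + 0.5).
print(soergel(X, Y) <= soergel(X, Z) + soergel(Z, Y))  # True
```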
22 Continuous variables
Similarities and distances can be defined when the descriptors used are continuous variables
  • instead of “on/off” fingerprint bits (dichotomous variables)
  • these might be a set of property values for a molecule
    o molecular weight
    o number of rotatable bonds
    o number of potential hydrogen-bond donors/acceptors
    o solubility
    o acid dissociation constant
    o etc.
because these properties have different ranges, they may need to be “normalised” to range 0–1
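One common way to normalise descriptors to the range 0–1 is min-max scaling, applied to each descriptor column separately. The lecture does not say which normalisation is intended, so this is only one plausible reading; the descriptor values below are invented for illustration.

```python
def min_max_normalise(rows):
    """Scale every descriptor column to 0-1; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Invented values: molecular weight, rotatable bonds, H-bond donors.
molecules = [
    [180.2, 3, 1],
    [350.4, 7, 2],
    [270.3, 5, 4],
]
normalised = min_max_normalise(molecules)
for row in normalised:
    print([round(v, 2) for v in row])
```

Without this step the molecular weight column, with its much larger range, would dominate any distance calculated from the raw values.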
23 Continuous variables
most similarity coefficients are also defined for continuous variables
  • dichotomous variables are a special case when all values are either 0 or 1
but metric properties may not be the same for continuous variables
  • e.g. 1 – Tanimoto is not metric where continuous variables are used
24 Euclidean distance
“ordinary distance”
each of n variables (descriptors) is a dimension in n-dimensional space
distance between A and B is given by
$$D = \sqrt{\sum_{j=1}^{n} (x_{jA} - x_{jB})^2}$$
  • $x_{jA}$ = value of descriptor j in molecule A
  • $x_{jB}$ = value of descriptor j in molecule B
for dichotomous variables, (Euclidean distance)² = Hamming Distance
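The relationship between Euclidean and Hamming distance for dichotomous variables can be verified on the two fingerprints from the earlier example. This is a small sketch with illustrative function names.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences over all descriptors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def hamming(x, y):
    """Number of positions at which two bit vectors differ."""
    return sum(p != q for p, q in zip(x, y))

fp_a = [int(ch) for ch in "00010100010101000101010011110100"]
fp_b = [int(ch) for ch in "00000000100101001001000011100000"]

print(hamming(fp_a, fp_b))    # 9, i.e. A + B - 2C = 13 + 8 - 12
print(euclidean(fp_a, fp_b))  # 3.0, whose square is the Hamming distance
```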
25 Correlated variables
Sometimes different variables may be highly correlated with each other
statistical techniques can be used to identify such correlations, and reduce the number of variables
  • e.g. Principal Components Analysis (PCA)
    o combines correlated descriptors into “components”
    o each component “explains” a certain proportion of the variance in the dataset
just the first few principal components are used to calculate similarities
  • also easier to visualise plots in 3 dimensions!
26 Descriptor types used
Many different fragment types have been used for generating fingerprints for use in similarity searching
  • atom sequence
    o linear path of atoms and bonds through molecule
    o may generate only paths of certain lengths
  • augmented atom
    o atom and its immediate neighbours
27 Fragment types
  • ring composition
    o atom/bond sequence around a ring
    o question of which rings to choose
  • ring fusion patterns
    o sequence of ring connectivities around a ring
      • for each atom specify number of ring bonds it has
28 Fragment types
  • atom pairs
    o pair of atoms in same molecule, with number of bonds in shortest path between them
    o additional differentiation between atom types
      • number of attached hydrogens / pi-bonds
  • topological torsions
    o connected sequence of 4 atom types
    o atom types as described for atom pairs
29 Generalised fragments
sometimes specific fragments (with detailed description of atom and bond types) are too specific to be of much use in fingerprints
  • very low frequency → very sparse fingerprints
atom and bond types can be generalised
  • any ring bond
  • any halogen (F, Cl, Br, I)
  • any chalcogen (O, S, Se, Te)
this gives fragments with higher frequency
30 3D fragments
fragments can be used to describe the 3D structure of a molecule too
  • usually involve interatomic distances and/or bond angles
  • because distance values are continuous variables, they are “binned”
    o each bin represents a range of distances
    o e.g. distance of 3.000 – 3.999 Å
  • each bin corresponds to a fingerprint bit
31 3D fragments
a popular 3D descriptor is the 3-point pharmacophore
molecule is analysed to identify “pharmacophoric points”
  • points in molecule likely to be involved in binding to a receptor site
    o positive charges
    o negative charges
    o hydrogen-bond donors (e.g. –OH, –NH2)
    o hydrogen-bond acceptors (e.g. =O)
    o aromatic groups
    o hydrophobic groups
pharmacophoric points do not necessarily coincide with the positions of individual atoms
32 3D fragments
each fragment consists of 3 pharmacophoric points
  • the distances between each pair of these points are binned and used to set fingerprint bits
4-point pharmacophore fragments are also used
Different people have used slightly different definitions of pharmacophoric points
[Figure: 3-point pharmacophore with points labelled HBD, Ary and HBA, and inter-point distances 2.8 Å, 3.6 Å and 1.2 Å]
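A 3-point pharmacophore fragment can be turned into an order-independent key by binning its three distances and choosing a canonical ordering of the points. The sketch below assumes 1 Å bins, as in the 3.000–3.999 Å example. The canonicalisation (taking the smallest tuple over all six orderings) is just one workable choice, not a method described in the lecture, and the assignment of the three distances to particular pairs of points is also assumed.

```python
from itertools import permutations

def distance_bin(distance, bin_width=1.0):
    """Bin index for a distance in Å, e.g. 3.000-3.999 Å falls in bin 3."""
    return int(distance // bin_width)

def triplet_key(types, distances, bin_width=1.0):
    """Order-independent key for a 3-point pharmacophore.

    types:     labels of the three points
    distances: {(i, j): distance in Å} for point indices i < j
    """
    candidates = []
    for order in permutations(range(3)):
        labels = tuple(types[i] for i in order)
        bins = tuple(
            distance_bin(distances[tuple(sorted((order[i], order[j])))], bin_width)
            for i, j in ((0, 1), (0, 2), (1, 2))
        )
        candidates.append((labels, bins))
    return min(candidates)  # the smallest tuple serves as the canonical form

# Point labels from the slide figure; which distance joins which pair is assumed.
key = triplet_key(("HBD", "Ary", "HBA"), {(0, 1): 2.8, (0, 2): 3.6, (1, 2): 1.2})
# The same triangle with its points listed in a different order.
same = triplet_key(("HBA", "HBD", "Ary"), {(0, 1): 3.6, (0, 2): 1.2, (1, 2): 2.8})
print(key)
print(key == same)  # True
```

Each distinct key would then be assigned its own fingerprint bit (or hashed to one), so that two molecules sharing a pharmacophore triangle set the same bit regardless of atom ordering.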
33 3D fragments
an issue with 3D fragments is conformational flexibility
  • a molecule with a single configuration may adopt a number of different conformations, by rotation about single bonds
  • some conformations may be energetically more favourable than others
  • some programs for calculation of 3D coordinates from 2D (topological) representations can generate several different conformers
  • different pharmacophoric fragments may result
  • these can be “overcoded” (ORed together in the fingerprint) but this can cause problems
34 Choice of descriptors
“similarity is in the eye of the beholder”
obviously the similarity values obtained will depend heavily on the set of descriptors used
  • choose ring-based fragments and the most similar structures will be those with similar ring systems, irrespective of functional groups attached to them
  • choose small (functional group-like) fragments and the most similar structures will be those with the same functional groups, irrespective of the ring systems they are attached to
35 Choice of descriptors
A danger with fragment-based descriptors is redundancy of fragments
  • different fragments may be representing the same chemical features
    o this will set more bits in the fingerprints when those features are present in a molecule
    o and this may give the molecules a higher similarity than is warranted
  • hashed fingerprints may result in molecules with no features in common appearing to be similar because different fragments collide on the same bit position
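Bit collisions in hashed fingerprints are easy to demonstrate. In the sketch below the fragment strings are hypothetical, and `zlib.crc32` stands in for whatever hash function a real fingerprinting program uses. The fingerprint is deliberately tiny (8 bits), so 9 distinct fragments are guaranteed to produce at least one collision.

```python
import zlib

def bit_position(fragment, n_bits):
    """Map a fragment string to a bit position with a deterministic hash."""
    return zlib.crc32(fragment.encode("utf-8")) % n_bits

def find_collisions(fragments, n_bits):
    """Group distinct fragments that are hashed to the same bit."""
    by_bit = {}
    for fragment in set(fragments):
        by_bit.setdefault(bit_position(fragment, n_bits), []).append(fragment)
    return {bit: sorted(frags) for bit, frags in by_bit.items() if len(frags) > 1}

# Hypothetical fragment strings; 9 fragments into 8 bits must collide somewhere.
fragments = ["C-C", "C=C", "C-N", "C-O", "C=O", "N-H", "O-H", "C-Cl", "C-S"]
collisions = find_collisions(fragments, n_bits=8)
print(collisions)
```

Two molecules that each contain only one of a colliding pair of fragments would share that bit, and so would appear to have a feature in common.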
36 What makes a good descriptor?
how do we decide which descriptors and similarity or distance measures are “best”?
go back to “similar property” principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”
we can do some experiments using
  • various different sets of descriptors
  • dataset of compounds with known biological activity or measured physico-chemical property value
37 Evaluation of descriptor sets and similarity measures
for each compound in the dataset
  • “predict” its property value to be the same as that of the compounds to which it is most similar
    o usually take the average of 3 or 5 nearest neighbours (“k-nearest neighbours prediction”)
  • calculate the correlation coefficient between the observed and predicted property values
on this basis most similarity coefficients perform about the same
  • for no very good reason the Tanimoto measure has become the most popular
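The evaluation procedure can be sketched as a leave-one-out k-nearest-neighbours prediction followed by a Pearson correlation. The six fingerprints (sets of on-bit positions) and property values below are invented toy data, and k = 2 is used only because the dataset is so small; the function names are illustrative.

```python
import math

def tanimoto(x, y):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def knn_predict(index, fingerprints, values, k):
    """Average the property values of the k compounds most similar to compound `index`."""
    others = [j for j in range(len(fingerprints)) if j != index]
    others.sort(key=lambda j: tanimoto(fingerprints[index], fingerprints[j]), reverse=True)
    neighbours = others[:k]
    return sum(values[j] for j in neighbours) / len(neighbours)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented toy data: two structural families with different property levels.
fingerprints = [{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {10, 11, 12}, {10, 11, 13}, {10, 12, 13}]
observed = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

predicted = [knn_predict(i, fingerprints, observed, k=2) for i in range(len(observed))]
r = pearson(observed, predicted)
print([round(p, 2) for p in predicted])
print(r > 0.9)  # True for this toy dataset
```

Repeating the same run with a different descriptor set or similarity coefficient, and comparing the resulting correlation coefficients, gives the kind of comparison described above.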
38 Evaluation of descriptor sets
results on different descriptor sets are more ambiguous
  • likely to be heavily influenced by the property to be predicted
  • most pharmaceutical companies are interested in predicting which compounds will bind well to a protein receptor site
different types of descriptors each have their own advocates (usually their inventors)
main argument is between “2D” and “3D”
39 2D vs 3D descriptors
advocates of 3D descriptors argue that in binding to a protein receptor it is the 3D arrangement of the molecule that is important
experimental work done at Abbott Laboratories in the mid-1990s found that fingerprints based on 2D fragments performed consistently better
  • best of all were the fingerprints used in MDL’s ISIS software (“MACCS keys” or “ISIS keys”)
    o these are based on simple functional groups and rings
    o one version has only 166 bits, with substantial redundancy
40 2D vs 3D descriptors
problem may be that 3D descriptors we have (3-point pharmacophores with binned distances) are not good enough
  • there may be “spurious accuracy” in the detailed distances involved
  • conformational flexibility may be causing problems (as one distance gets larger, another gets smaller)
  • molecule may change conformation during binding
  • some improved success has been found by identifying “projection points” for hydrogen bond donors/acceptors (i.e. where they’re pointing to, not where they are)
41 2D vs 3D descriptors
2D descriptors provide “bounds” on possible 3D conformations
  • “2½D” descriptors (including some stereochemical information) may be useful
“Superiority” of 2D descriptors in some studies may be an artifact of the datasets used
  • datasets may have large numbers of close analogues
  • these will have high 2D similarity, as well as correlated activity
42 Field-based 3D similarity
Another approach to similarity studies
Based on overlap of continuously-varying fields (e.g. electron density)
Computationally much more intensive
  • Calculation of 3D structures and fields
  • Alignment of molecules to measure overlap
  • Use of grid points in overlap calculations
43 Chemical Diversity
important feature of compound collections and combinatorial libraries
idea is to cover as much of “chemical space” as possible
  • lots of different structural features represented
avoid having too many similar compounds
  • no point in testing different compounds likely to have the same properties
44 Chemical diversity measures
numerical measure of the diversity of a set of compounds
  • several different measures in use
  • no real agreement on the best measure
  • frequently based on similarity/distance between pairs of compounds
    o average distance between every pair of compounds
      • algorithms are available to calculate this in O(N) time
      • short distances can be compensated by long distances
    o minimum distance between any pair of compounds
      • requires all pairwise distances to be calculated (O(N²))
    o many other more complex measures
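The two pair-based diversity measures can be compared on a toy set of fingerprints. The sketch below uses the Soergel distance on sets of on-bit positions and computes both measures by brute force over all pairs, i.e. in O(N²) time; it does not implement the O(N) algorithm for the average that the slide mentions. The fingerprints are invented.

```python
from itertools import combinations

def soergel(x, y):
    """1 - Tanimoto, for fingerprints given as sets of on-bit positions."""
    union = len(x | y)
    return 1 - len(x & y) / union if union else 0.0

def mean_pairwise_distance(fingerprints):
    distances = [soergel(x, y) for x, y in combinations(fingerprints, 2)]
    return sum(distances) / len(distances)

def min_pairwise_distance(fingerprints):
    return min(soergel(x, y) for x, y in combinations(fingerprints, 2))

# Invented set: two near-duplicates plus one unrelated compound.
compounds = [{0, 1}, {0, 1, 2}, {5, 6}]
print(round(mean_pairwise_distance(compounds), 3))  # 0.778
print(round(min_pairwise_distance(compounds), 3))   # 0.333
```

The average looks respectable because the two long distances compensate for the short one, whereas the minimum exposes the near-duplicate pair.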
45 Cell-based diversity
Usually used with descriptors based on continuous variables
  • “chemical space” is divided into a “grid”
    o one dimension for each descriptor
    o one grid square (“cell”) for each range of descriptor values
  • compounds are placed in appropriate cell (hypercube) for their descriptor values
  • diversity can be measured by counting occupied/unoccupied cells, and calculating average occupancy of each cell
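Cell-based diversity can be sketched by mapping each compound's normalised descriptor vector to a grid cell and counting the occupants of each cell. The sketch assumes descriptors already scaled to 0–1 and the same number of bins on every axis; the data are invented.

```python
from collections import Counter

def cell_of(descriptor, n_bins):
    """Grid cell for a descriptor vector whose values are assumed to lie in 0-1."""
    return tuple(min(int(value * n_bins), n_bins - 1) for value in descriptor)

def cell_occupancy(descriptors, n_bins):
    """Count the compounds falling in each occupied cell."""
    return Counter(cell_of(d, n_bins) for d in descriptors)

# Invented 2-descriptor data on a 4 x 4 grid (16 cells).
compounds = [(0.10, 0.20), (0.15, 0.22), (0.80, 0.90), (0.50, 0.50)]
occupancy = cell_occupancy(compounds, n_bins=4)

total_cells = 4 ** 2
print(len(occupancy), "of", total_cells, "cells occupied")  # 3 of 16 cells occupied
print(len(compounds) / len(occupancy))  # average occupancy of the occupied cells
```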
46 Diverse subset selection
Frequently similarity measures (and other criteria) are used to identify, from a large number of compounds that could be synthesised for testing, a suitably diverse subset that will be synthesised and tested
  • this is particularly important with combinatorial libraries
    o first design a large “virtual” library
    o then identify a subset “real” library to synthesise
47 Virtual library subsetting
acid + amine → amide
virtual library has 1000 possible acids and 1000 possible amines, giving 1M amides
we only want to make and test 900 amides
  • we need 30 acids and 30 amines
we select diverse subsets (e.g. to maximise the minimum distance between any pair of compounds) of the acids and the amines
  • may be better off identifying the most diverse 900-compound subset of the amides
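One simple way to look for a subset with a large minimum pairwise distance is a greedy "MaxMin" selection: start from a seed compound and repeatedly add the candidate that is furthest from everything already chosen. The lecture does not name this algorithm; it is used here purely as an illustration, and it is a heuristic that is not guaranteed to find the optimal subset. The one-dimensional descriptor values are invented.

```python
def maxmin_select(items, distance, k):
    """Greedily pick k item indices, each maximising its minimum distance to those already picked."""
    if not 0 < k <= len(items):
        raise ValueError("k must be between 1 and the number of items")
    selected = [0]  # arbitrary seed: the first item
    while len(selected) < k:
        best = max(
            (i for i in range(len(items)) if i not in selected),
            key=lambda i: min(distance(items[i], items[s]) for s in selected),
        )
        selected.append(best)
    return selected

# Invented 1-D descriptor values for five candidate reagents.
reagents = [0.0, 0.1, 0.5, 0.9, 1.0]
chosen = maxmin_select(reagents, lambda x, y: abs(x - y), k=3)
print(chosen)  # [0, 4, 2]
```

In the acid and amine example the same routine would be run once on the 1000 acids and once on the 1000 amines with k = 30, or once on the enumerated amides with k = 900.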
48 Virtual library subsetting
Number of possible subsets of 900 from 1 million is vast
  • far too many to try them all
“Genetic algorithms” often used
  • “chromosomes” represent possible subsets
  • they are “mutated” etc. at each “generation” to try to get better subsets
  • “fitness” functions measure diversity and other characteristics (cost, “drug-likeness” etc.)
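The idea can be illustrated with a heavily simplified evolutionary search. The sketch is mutation-only (no crossover), keeps the fitter half of the population at each generation, and uses the minimum pairwise distance as the sole fitness function. The population size, generation count, seed and one-dimensional descriptor values are all arbitrary choices made for the example, not parameters from the lecture.

```python
import random
from itertools import combinations

def min_pairwise_distance(points, subset):
    """Fitness: the smallest distance between any two compounds in the subset."""
    return min(abs(points[i] - points[j]) for i, j in combinations(subset, 2))

def mutate(subset, n_items, rng):
    """Swap one member of the subset for a compound not currently in it."""
    child = list(subset)
    outside = [i for i in range(n_items) if i not in subset]
    child[rng.randrange(len(child))] = rng.choice(outside)
    return tuple(sorted(child))

def evolve(points, subset_size, pop_size=20, generations=50, seed=0):
    """Return (best subset in the initial population, best subset after evolution)."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(subset):
        return min_pairwise_distance(points, subset)

    # Each "chromosome" is a sorted tuple of compound indices.
    population = [tuple(sorted(rng.sample(range(n), subset_size))) for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # keep the fitter half
        children = [mutate(rng.choice(survivors), n, rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return initial_best, max(population, key=fitness)

# Invented 1-D descriptor values for 40 virtual compounds.
generator = random.Random(1)
points = [generator.random() for _ in range(40)]
initial_best, best = evolve(points, subset_size=5)
print(min_pairwise_distance(points, initial_best), min_pairwise_distance(points, best))
```

Because the fitter half always survives, the best fitness can never decrease from one generation to the next. A practical fitness function would also fold in the other characteristics mentioned above, such as cost and drug-likeness.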
49 Summary of Lecture 6
similarity searching is an important alternative to substructure search
there are many different ways of measuring the similarity or distance between molecules
similarity can be measured with respect to many different types of structure descriptor
  • there is no general agreement on the “best” descriptors
similarity and distance measures can be used to identify compounds to be synthesised and tested as potential new drugs
similarity is the basis for measuring chemical diversity
  • useful concept in identifying subsets of a large dataset which cover as much as possible of chemical space
Barnard, J. M.; Downs, G. M.; Willett, P. J. Chem. Inf. Comput. Sci., 1998, 38, 983-996
Leach and Gillet (2003) Chapters 3, 4, 5 and 6
50 Lecture 7: Topics to be Covered
Clustering
  • identifying classes of molecules similar to each other, but different to those in other classes
Topological indexes
  • numbers that can be calculated from connection tables
Property prediction
  • predicting physicochemical or biological properties directly from connection tables
The Drug Discovery Process