1 Chemical Structure Representation and Search Systems Lecture 6. Nov 18, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software

1 Chemical Structure Representation and Search Systems

Lecture 6. Nov 18, 2003

John Barnard

Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services

Sheffield, UK

2 Lecture 6: Topics to be Covered

Similarity searching
• similarity search vs. substructure search
• similarity and distance metrics
• different types of descriptor for similarity search
• choice of descriptors

Chemical Diversity and its measurement

3 Similarity searching

instead of searching for all molecules containing a given substructure, we search for molecules “similar” to a given target molecule

similar property principle:
“structurally similar molecules are expected to exhibit similar properties or biological activities”

Mark Johnson and Gerry Maggiora (Eds.) Concepts and Applications of Molecular Similarity. Wiley, New York, 1990

4 What is similarity?

“Similarity is in the eye of the beholder”

Similarity can be measured in many different ways
• equivalence classes
  o can say that two molecules are similar, or that they are different
• numerical measures
  o can say that two molecules have a similarity of, e.g. 0.85
  o similarity coefficients usually have values between 0.0 (totally different) and 1.0 (identical)
• distance measures
  o “opposite” of similarity (0.0 = identical; may have no maximum, or may be normalised to fix maximum limit)

5 Equivalence classes

All molecules which are identical at some level of description are considered equivalent
• molecular formula
• structure graph (with no distinction between node and bond types)
• reduced graph
• same ring systems
• same fingerprints

6 Equivalence classes

two different molecules with the same graph
• if node and edge labels are ignored

[Figure: two structure diagrams with different atom labels (N, OH, O, NCH3, CH3) but the same underlying graph]

7 Numerical similarity measures

normally calculate some numerical measure of similarity between molecules

query structure is a “target” molecule

database structures can be ranked in decreasing order of similarity to target
• find all molecules with > threshold similarity to target
• find N most similar molecules to target

no particular substructure is required in the retrieved molecules
• but they will have structural features in common with target

8 Similarity from fingerprints

similarity measures are most commonly calculated from structure fingerprints
• count the bits that are “on” in both molecules
• count the bits that are “on” in each molecule separately

struct A: 00010100010101000101010011110100   13 bits on (A)
struct B: 00000000100101001001000011100000    8 bits on (B)
A AND B:  00000000000101000001000011100000    6 bits on (C)

• similarity coefficient can be calculated from A, B and C

[Figure: Venn diagram of bit sets A and B with overlap C]
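The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `bit_counts` and the choice to hold fingerprints as bit strings are assumptions made here, not something the slides prescribe.

```python
# Fingerprints copied from the slide, written as bit strings.
FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"


def bit_counts(fp_a: str, fp_b: str) -> tuple[int, int, int]:
    """Return (A, B, C): bits on in fp_a, bits on in fp_b, bits on in both."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int(fp_a, 2)
    b = int(fp_b, 2)
    # bin(x).count("1") is a portable population count; a & b is the bitwise AND.
    return bin(a).count("1"), bin(b).count("1"), bin(a & b).count("1")


print(bit_counts(FP_A, FP_B))  # (13, 8, 6)
```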

9 Tanimoto coefficient

$\text{similarity} = \dfrac{C}{A + B - C} = \dfrac{6}{13 + 8 - 6} = 0.4$

the number of bits set in both molecules divided by the number of bits set in either molecule

The Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
• also called the Jaccard coefficient
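As a minimal sketch, the Tanimoto coefficient can be computed directly from the three bit counts. The function name and the handling of two empty fingerprints are assumptions made for this example.

```python
def tanimoto(a: int, b: int, c: int) -> float:
    """Tanimoto (Jaccard) coefficient from the bit counts A, B and C."""
    union = a + b - c  # bits set in either molecule
    # Two all-zero fingerprints leave nothing to compare; returning 1.0 here
    # is a convention chosen for this sketch, not something the slide states.
    return 1.0 if union == 0 else c / union


print(tanimoto(13, 8, 6))  # 0.4
```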

10 Dice coefficient

$\text{similarity} = \dfrac{2C}{A + B} = \dfrac{12}{13 + 8} = 0.57$

does not give the same values as the Tanimoto coefficient, but will rank molecules in the same order of similarity to a target
• i.e. “monotonic” with the Tanimoto coefficient

also called the Czekanowski or Sørensen coefficient
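The reason the two coefficients rank molecules identically is that the Dice value can be written as an increasing function of the Tanimoto value, Dice = 2T / (1 + T), which follows from the two definitions by simple algebra. A short sketch (function names are illustrative only) checks this on the slide's example:

```python
import math


def dice(a: int, b: int, c: int) -> float:
    """Dice coefficient from the bit counts A, B and C (assumes A + B > 0)."""
    return 2 * c / (a + b)


def dice_from_tanimoto(t: float) -> float:
    """Dice value implied by a Tanimoto value t; increasing in t on [0, 1]."""
    return 2 * t / (1 + t)


t = 6 / (13 + 8 - 6)                      # Tanimoto for the slide's example
print(round(dice(13, 8, 6), 3))           # 0.571
print(math.isclose(dice(13, 8, 6), dice_from_tanimoto(t)))  # True
```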

11 Cosine coefficient

$\text{similarity} = \dfrac{C}{\sqrt{A \times B}} = \dfrac{6}{\sqrt{13 \times 8}} = 0.588$

not monotonic with the Tanimoto and Dice coefficients, but highly correlated with them
• also called the Ochiai coefficient


The three coefficients discussed so far all ignore bits that are off in both molecules
• is common absence of features evidence of similarity between them?
• are a camel, a horse and a nematode similar because they all lack wings?
• are a bat, a heron and a dragonfly similar because they all have wings?


Are these molecules similar because they all lack heteroatoms?

[Figure: structure diagrams of hydrocarbons composed only of CH and CH2 groups]

14 Simple Matching coefficient

a similarity coefficient that takes into account the fingerprint bits that are off in both molecules (D)

$\text{similarity} = \dfrac{C + D}{N} = \dfrac{6 + 17}{32} = 0.719$

N is length of fingerprint
• N = A + B – C + D
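Because C + D is just the number of positions where the two fingerprints agree, the Simple Matching coefficient can be sketched as a position-by-position comparison. The function name and bit-string representation are assumptions for illustration.

```python
# Fingerprints from the earlier slide.
FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"


def simple_matching(fp_a: str, fp_b: str) -> float:
    """Fraction of bit positions where both fingerprints agree (both on or both off)."""
    if len(fp_a) != len(fp_b) or not fp_a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    agreements = sum(x == y for x, y in zip(fp_a, fp_b))  # this is C + D
    return agreements / len(fp_a)


print(simple_matching(FP_A, FP_B))  # 0.71875, i.e. (6 + 17) / 32
```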

15 Asymmetric Similarity

Usually we think that if A is similar to B, then B is similar to A

some coefficients have been defined in which this is not true: S(A,B) ≠ S(B,A)
• e.g. Tversky similarity

$\text{Similarity} = \dfrac{C}{\alpha(A - C) + \beta(B - C) + C}$

where α and β are user-defined parameters

16 Asymmetric (Tversky) Similarity

$T = \dfrac{C}{\alpha(A - C) + \beta(B - C) + C}$

if α = β = 1, equation reduces to Tanimoto coefficient

if α = β = ½, equation reduces to Dice coefficient

if α ≠ β, T becomes asymmetric
• where α = 1 and β = 0, T = C / A, i.e. the fraction of A which it has in common with B
  o when T = 1.0, it indicates that A is a substructure of B (at the level of fingerprint matching)
  o when T is close to 1.0, it indicates that A is almost a substructure of B
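The special cases of the Tversky coefficient can be checked numerically with the bit counts from the earlier fingerprint example (A = 13, B = 8, C = 6). This is a sketch only; the function signature is an assumption.

```python
import math


def tversky(a: int, b: int, c: int, alpha: float, beta: float) -> float:
    """Tversky similarity from bit counts A, B, C and user-defined weights."""
    denominator = alpha * (a - c) + beta * (b - c) + c
    if denominator == 0:
        raise ValueError("similarity is undefined for these counts and weights")
    return c / denominator


print(tversky(13, 8, 6, 1, 1))        # 0.4, the Tanimoto value
print(math.isclose(tversky(13, 8, 6, 0.5, 0.5), 12 / 21))  # True, the Dice value
print(tversky(13, 8, 6, 1, 0))        # C / A = 6 / 13
print(tversky(8, 13, 6, 1, 0))        # roles swapped: C / B = 0.75
```

The last two lines show the asymmetry: with α = 1 and β = 0 the result depends on which molecule is treated as the query.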

17 Subsimilarity search

This provides a means of carrying out substructure similarity (“subsimilarity”) searching
• also possible with maximal common subgraphs
• A and B could be number of atoms in each molecule, and C could be number of atoms in their maximal common substructure (MCS)

fingerprint-based similarity is generally faster than identifying MCS
• but common features (fragments) will be smaller

18 Similarity and Distance

Distance is the opposite of similarity

A similarity coefficient in the range 0 to 1 can be converted to a distance by taking its “complement”
• Distance = 1 – Similarity

Sometimes there is a different name for the complement of a similarity coefficient:
• 1 – Tanimoto Coefficient = Soergel Distance
• 1 – Simple Matching Coefficient = Normalised Hamming Distance

19 Distance Coefficients

analogous to distances in multi-dimensional geometric space
• not necessarily equivalent to such distances

some distance coefficients are called distance metrics
• to be a metric, a distance coefficient has to obey certain rules

20 Distance metrics

distances must be zero or positive (no upper limit)

distance from object to itself must be zero

distance between non-identical objects must be greater than zero

distances must be symmetric

distances must obey the triangular inequality
• $D_{A,B} \le D_{A,C} + D_{B,C}$

[Figure: triangle with vertices A, B and C]

21 Properties of distance coefficients

not all distance coefficients are metrics

those that are have certain advantages, and some assumptions can be made about their behaviour
• e.g. plotting in multi-dimensional space

metric distance coefficients include
• Hamming Distance
• Soergel Distance (= 1 – Tanimoto similarity)

non-metric distance coefficients include
• 1 – Dice coefficient
• 1 – Cosine coefficient
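A small counter-example makes the non-metric behaviour of 1 – Dice concrete. The sketch below represents each fingerprint as the set of its "on" bit positions; the three toy fingerprints and the function names are invented for illustration. Passing the check on one triple does not prove that a coefficient is a metric, but a single failure is enough to show that it is not.

```python
def soergel(x: set, y: set) -> float:
    """1 - Tanimoto, computed on sets of 'on' bit positions (non-empty sets)."""
    return 1 - len(x & y) / len(x | y)


def dice_distance(x: set, y: set) -> float:
    """1 - Dice, computed on sets of 'on' bit positions (non-empty sets)."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))


def obeys_triangle(dist, a: set, b: set, c: set) -> bool:
    """Check D(a,b) <= D(a,c) + D(b,c), with a small tolerance for rounding."""
    return dist(a, b) <= dist(a, c) + dist(b, c) + 1e-12


# Hypothetical fingerprints: A and B share nothing, C contains both.
A, B, C = {0}, {1}, {0, 1}

print(obeys_triangle(soergel, A, B, C))        # True:  1.0 <= 0.5 + 0.5
print(obeys_triangle(dice_distance, A, B, C))  # False: 1.0 >  1/3 + 1/3
```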

22 Continuous variables

Similarities and distances can be defined when the descriptors used are continuous variables
• instead of “on/off” fingerprint bits (dichotomous variables)
• these might be a set of property values for a molecule
  o molecular weight
  o number of rotatable bonds
  o number of potential hydrogen-bond donors/acceptors
  o solubility
  o acid dissociation constant
  o etc.

because these properties have different ranges, they may need to be “normalised” to range 0–1

23 Continuous variables

most similarity coefficients are also defined for continuous variables
• dichotomous variables are a special case when all values are either 0 or 1

but metric properties may not be the same for continuous variables
• e.g. 1 – Tanimoto is not metric where continuous variables are used

24 Euclidean distance

“ordinary distance”

each of n variables (descriptors) is a dimension in n-dimensional space

distance between A and B is given by

$D = \sqrt{\sum_{j=1}^{n} (x_{jA} - x_{jB})^2}$

$x_{jA}$ = value of descriptor j in molecule A
$x_{jB}$ = value of descriptor j in molecule B

for dichotomous variables
(Euclidean distance)² = Hamming Distance
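The relationship between Euclidean and Hamming distance for dichotomous variables can be verified on the two fingerprints used earlier in the lecture. This is a sketch; the helper names are illustrative.

```python
import math

FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"


def euclidean(xa: list[float], xb: list[float]) -> float:
    """Square root of the summed squared differences over all descriptors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(xa, xb)))


def hamming(fp_a: str, fp_b: str) -> int:
    """Number of bit positions at which the two fingerprints differ."""
    return sum(x != y for x, y in zip(fp_a, fp_b))


bits_a = [int(ch) for ch in FP_A]
bits_b = [int(ch) for ch in FP_B]

print(hamming(FP_A, FP_B))        # 9  (= A + B - 2C = 13 + 8 - 12)
print(euclidean(bits_a, bits_b))  # 3.0, so (Euclidean distance)^2 = 9
```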

25 Correlated variables

Sometimes different variables may be highly correlated with each other

statistical techniques can be used to identify such correlations, and reduce the number of variables
• e.g. Principal Components Analysis (PCA)
  o combines correlated descriptors into “components”
  o each component “explains” a certain proportion of the variance in the dataset

just the first few principal components are used to calculate similarities
• also easier to visualise plots in 3 dimensions!

26 Descriptor types used

Many different fragment types have been used for generating fingerprints for use in similarity searching
• atom sequence
  o linear path of atoms and bonds through molecule
  o may generate only paths of certain lengths
• augmented atom
  o atom and its immediate neighbours

27 Fragment types

• ring composition
  o atom/bond sequence around a ring
  o question of which rings to choose
• ring fusion patterns
  o sequence of ring connectivities around a ring
    • for each atom specify number of ring bonds it has

28 Fragment types

• atom pairs
  o pair of atoms in same molecule, with number of bonds in shortest path between them
  o additional differentiation between atom types
    • number of attached hydrogens / pi-bonds
• topological torsions
  o connected sequence of 4 atom types
  o atom types as described for atom pairs

29 Generalised fragments

sometimes specific fragments (with detailed description of atom and bond types) are too specific to be of much use in fingerprints
• very low frequency, so very sparse fingerprints

atom and bond types can be generalised
• any ring bond
• any halogen (F, Cl, Br, I)
• any chalcogen (O, S, Se, Te)

this gives fragments with higher frequency

30 3D fragments

fragments can be used to describe the 3D structure of a molecule too
• usually involve interatomic distances and/or bond angles
• because distance values are continuous variables, they are “binned”
  o each bin represents a range of distances
  o e.g. distance of 3.000 – 3.999 Å
• each bin corresponds to a fingerprint bit
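Binning a continuous distance into a fingerprint bit index can be sketched as follows. The 1 Å bin width matches the slide's example range of 3.000 – 3.999 Å, but the number of bins and the decision to place every longer distance in the last bin are assumptions made for this illustration.

```python
def distance_bin(distance: float, bin_width: float = 1.0, n_bins: int = 10) -> int:
    """Map an interatomic distance in angstroms to a bin (fingerprint bit) index."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    index = int(distance // bin_width)
    # Distances beyond the last bin are clamped into it (an assumption).
    return min(index, n_bins - 1)


print(distance_bin(3.000))  # 3
print(distance_bin(3.999))  # 3, same bin as 3.000
print(distance_bin(4.000))  # 4
```

The example also shows the loss of resolution that binning implies: 3.999 Å and 4.000 Å set different bits although they are almost the same distance.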

31 3D fragments

a popular 3D descriptor is the 3-point pharmacophore

molecule is analysed to identify “pharmacophoric points”
• points in molecule likely to be involved in binding to a receptor site
  o positive charges
  o negative charges
  o hydrogen-bond donors (e.g. –OH, –NH2)
  o hydrogen-bond acceptors (e.g. =O)
  o aromatic groups
  o hydrophobic groups

pharmacophoric points do not necessarily coincide with the positions of individual atoms

32 3D fragments

each fragment consists of 3 pharmacophoric points
• the distances between each pair of these points are binned and used to set fingerprint bits

4-point pharmacophore fragments are also used

Different people have used slightly different definitions of pharmacophoric points

[Figure: 3-point pharmacophore triangle with points labelled HBD, Ary and HBA and edge lengths 2.8 Å, 3.6 Å and 1.2 Å]

33 3D fragments

an issue with 3D fragments is conformational flexibility
• a molecule with a single configuration may adopt a number of different conformations, by rotation about single bonds
• some conformations may be energetically more favourable than others
• some programs for calculation of 3D coordinates from 2D (topological) representations can generate several different conformers
• different pharmacophoric fragments may result
• these can be “overcoded” (ORed together in the fingerprint) but this can cause problems

34 Choice of descriptors

“similarity is in the eye of the beholder”

obviously the similarity values obtained will depend heavily on the set of descriptors used
• choose ring-based fragments and the most similar structures will be those with similar ring systems, irrespective of functional groups attached to them
• choose small (functional group-like) fragments and the most similar structures will be those with the same functional groups, irrespective of the ring systems they are attached to

35 Choice of descriptors

A danger with fragment-based descriptors is redundancy of fragments
• different fragments may be representing the same chemical features
  o this will set more bits in the fingerprints when those features are present in a molecule
  o and this may give the molecules a higher similarity than is warranted
• hashed fingerprints may result in molecules with no features in common appearing to be similar because different fragments collide on the same bit position

36 What makes a good descriptor?

how do we decide which descriptors and similarity or distance measures are “best”?

go back to “similar property” principle:
“structurally similar molecules are expected to exhibit similar properties or biological activities”

we can do some experiments using
• various different sets of descriptors
• dataset of compounds with known biological activity or measured physico-chemical property value

37 Evaluation of descriptor sets and similarity measures

for each compound in the dataset
• “predict” its property value to be the same as those with which it is most similar
  o usually take the average of 3 or 5 nearest neighbours (“k-nearest neighbours prediction”)
• calculate the correlation coefficient between the observed and predicted property values

on this basis most similarity coefficients perform about the same
• for no very good reason the Tanimoto measure has become the most popular
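The evaluation procedure can be sketched as a leave-one-out loop: predict each compound's property from its k nearest neighbours by Tanimoto similarity, then correlate predictions with observations. Everything concrete below is invented for illustration: the six toy fingerprints (sets of "on" bit positions), the property values, and the choice of k = 2 and of the Pearson correlation coefficient.

```python
import math


def tanimoto(x: frozenset, y: frozenset) -> float:
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def knn_predict(index: int, fps: list, values: list, k: int) -> float:
    """Average the property values of the k compounds most similar to fps[index]."""
    scored = [(tanimoto(fps[index], fp), values[i])
              for i, fp in enumerate(fps) if i != index]  # leave the target out
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    return sum(value for _, value in nearest) / len(nearest)


def pearson(xs: list, ys: list) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical dataset: two structural families with different property levels.
fps = [frozenset(s) for s in ({1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4},
                              {7, 8, 9}, {7, 8, 10}, {7, 8, 9, 10})]
observed = [1.0, 1.2, 1.1, 5.0, 5.3, 4.9]

predicted = [knn_predict(i, fps, observed, k=2) for i in range(len(fps))]
print(predicted)
print(pearson(observed, predicted))
```

Repeating the run with a different descriptor set or similarity coefficient and comparing the resulting correlation coefficients gives the kind of comparison described above.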

38 Evaluation of descriptor sets

results on different descriptor sets are more ambiguous
• likely to be heavily influenced by the property to be predicted
• most pharmaceutical companies are interested in predicting which compounds will bind well to a protein receptor site

different types of descriptors each have their own advocates (usually their inventors)

main argument is between “2D” and “3D”

39 2D vs 3D descriptors

advocates of 3D descriptors argue that in binding to a protein receptor it is the 3D arrangement of the molecule that is important

experimental work done at Abbott Laboratories in the mid-1990s found that fingerprints based on 2D fragments performed consistently better
• best of all were the fingerprints used in MDL’s ISIS software (“MACCS keys” or “ISIS keys”)
  o these are based on simple functional groups and rings
  o one version has only 166 bits, with substantial redundancy


problem may be that 3D descriptors we have (3-point pharmacophores with binned distances) are not good enough
• there may be “spurious accuracy” in the detailed distances involved
• conformational flexibility may be causing problems (as one distance gets larger, another gets smaller)
• molecule may change conformation during binding
• some improved success has been found by identifying “projection points” for hydrogen bond donors/acceptors (i.e. where they’re pointing to, not where they are)


2D descriptors provide “bounds” on possible 3D conformations
• “2½D” descriptors (including some stereochemical information) may be useful

“Superiority” of 2D descriptors in some studies may be an artifact of the datasets used
• datasets may have large numbers of close analogues
• these will have high 2D similarity, as well as correlated activity

42 Field-based 3D similarity

Another approach to similarity studies

Based on overlap of continuously-varying fields (e.g. electron density)

Computationally much more intensive
• Calculation of 3D structures and fields
• Alignment of molecules to measure overlap
• Use of grid points in overlap calculations

43 Chemical Diversity

important feature of compound collections and combinatorial libraries

idea is to cover as much of “chemical space” as possible
• lots of different structural features represented

avoid having too many similar compounds
• no point in testing different compounds likely to have the same properties

44 Chemical diversity measures

numerical measure of the diversity of a set of compounds
• several different measures in use
• no real agreement on the best measure
• frequently based on similarity/distance between pairs of compounds
  o average distance between every pair of compounds
    • algorithms are available to calculate this in O(N) time
    • short distances can be compensated by long distances
  o minimum distance between any pair of compounds
    • requires all pairwise distances to be calculated (O(N²))
  o many other more complex measures
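Both pairwise measures can be sketched by brute force over all pairs. Note that this direct approach is O(N²) for the average as well; it is not the O(N) algorithm the slide refers to. The toy fingerprints (sets of "on" bit positions) and the use of Soergel distance are assumptions for illustration.

```python
from itertools import combinations


def soergel(x: set, y: set) -> float:
    """1 - Tanimoto on sets of 'on' bit positions (non-empty sets)."""
    return 1 - len(x & y) / len(x | y)


def pairwise_distances(fps: list) -> list:
    """All N(N-1)/2 distances between distinct compounds."""
    return [soergel(x, y) for x, y in combinations(fps, 2)]


def mean_pairwise_distance(fps: list) -> float:
    distances = pairwise_distances(fps)
    return sum(distances) / len(distances)


def min_pairwise_distance(fps: list) -> float:
    return min(pairwise_distances(fps))


# Hypothetical set: two close analogues and one unrelated compound.
compounds = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
print(mean_pairwise_distance(compounds))  # (0.5 + 1.0 + 1.0) / 3
print(min_pairwise_distance(compounds))   # 0.5
```

The example illustrates the compensation effect: the mean stays high because of the two long distances, while the minimum exposes the pair of near-duplicates.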

45 Cell-based diversity

Usually used with descriptors based on continuous variables
• “chemical space” is divided into a “grid”
  o one dimension for each descriptor
  o one grid square (“cell”) for each range of descriptor values
• compounds are placed in appropriate cell (hypercube) for their descriptor values
• diversity can be measured by counting occupied/unoccupied cells, and calculating average occupancy of each cell
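A cell-based count can be sketched with a two-descriptor grid. The descriptor ranges, the number of bins per dimension, the four toy compounds, and the choice to report average occupancy over occupied cells only are all assumptions made for this illustration.

```python
from collections import Counter


def cell_of(descriptors, lows, highs, bins_per_dim: int) -> tuple:
    """Return the grid cell (one bin index per descriptor) for a compound."""
    cell = []
    for value, low, high in zip(descriptors, lows, highs):
        fraction = (value - low) / (high - low)
        index = int(fraction * bins_per_dim)
        cell.append(min(max(index, 0), bins_per_dim - 1))  # clamp to the grid
    return tuple(cell)


def cell_statistics(compounds, lows, highs, bins_per_dim: int):
    """Return (occupied cells, total cells, mean compounds per occupied cell)."""
    counts = Counter(cell_of(c, lows, highs, bins_per_dim) for c in compounds)
    total_cells = bins_per_dim ** len(lows)
    return len(counts), total_cells, len(compounds) / len(counts)


# Hypothetical descriptors: (molecular weight, number of rotatable bonds).
compounds = [(120, 1), (130, 1), (310, 3), (450, 9)]
occupied, total, mean_occupancy = cell_statistics(
    compounds, lows=(0, 0), highs=(500, 10), bins_per_dim=5)
print(occupied, total, mean_occupancy)  # 3 occupied cells out of 25
```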

46 Diverse subset selection

Frequently similarity measures (and other criteria) are used to identify, from a large number of compounds that could be synthesised for testing, a suitably diverse subset that will be synthesised and tested
• this is particularly important with combinatorial libraries
  o first design a large “virtual” library
  o then identify a subset “real” library to synthesise

47 Virtual library subsetting

acid + amine → amide

virtual library has 1000 possible acids and 1000 possible amines, giving 1M amides

we only want to make and test 900 amides
• we need 30 acids and 30 amines

we select diverse subsets (e.g. to maximise the minimum distance between any pair of compounds) of the acids and the amines
• may be better off identifying the most diverse 900-compound subset of the amides
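One common way to approximate a subset that maximises the minimum pairwise distance is a greedy "MaxMin" procedure: start from a seed and repeatedly add the candidate whose nearest already-selected neighbour is furthest away. The sketch below is a heuristic illustration, not necessarily the method used in the lecture; the seed choice, function name, and one-dimensional toy data are assumptions. Applied to the slide's example, it would be run once to pick 30 acids and once to pick 30 amines (30 × 30 = 900 amides).

```python
def maxmin_select(items: list, distance, n_select: int) -> list:
    """Greedily pick indices of n_select items that are far apart from each other."""
    if not 0 < n_select <= len(items):
        raise ValueError("n_select must be between 1 and len(items)")
    selected = [0]  # arbitrary seed: the first item (an assumption)
    while len(selected) < n_select:
        best_index, best_distance = -1, -1.0
        for i in range(len(items)):
            if i in selected:
                continue
            # Distance from candidate i to its nearest already-selected item.
            nearest = min(distance(items[i], items[j]) for j in selected)
            if nearest > best_distance:
                best_index, best_distance = i, nearest
        selected.append(best_index)
    return selected


# Hypothetical one-descriptor "reagents"; distance is the absolute difference.
reagents = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
print(maxmin_select(reagents, lambda p, q: abs(p - q), 3))  # [0, 5, 3]
```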

48 Virtual library subsetting

Number of possible subsets of 900 from 1 million is vast
• far too many to try them all

“Genetic algorithms” often used
• “chromosomes” represent possible subsets
• they are “mutated” etc. at each “generation” to try to get better subsets
• “fitness” functions measure diversity and other characteristics (cost, “drug-likeness” etc.)
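To give a sense of how vast the search space is, the number of 900-compound subsets of a 1,000,000-compound library is the binomial coefficient C(1,000,000, 900). A quick calculation of its order of magnitude (assuming Python 3.8 or later for `math.comb`):

```python
import math

# Number of ways to choose 900 compounds out of 1,000,000.
n_subsets = math.comb(1_000_000, 900)

# math.log10 accepts arbitrarily large integers.
order_of_magnitude = math.log10(n_subsets)
print(f"about 10^{order_of_magnitude:.0f} possible subsets")
```

The result has more than three thousand digits, which is why exhaustive enumeration is out of the question and stochastic methods such as genetic algorithms are used instead.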

49 Summary of Lecture 6

similarity searching is an important alternative to substructure search

there are many different ways of measuring the similarity or distance between molecules

similarity can be measured with respect to many different types of structure descriptor
• there is no general agreement on the “best” descriptors

similarity and distance measures can be used to identify compounds to be synthesised and tested as potential new drugs

similarity is the basis for measuring chemical diversity
• useful concept in identifying subsets of a large dataset which cover as much as possible of chemical space

Barnard, J. M.; Downs, G. M.; Willett, P. J. Chem. Inf. Comput. Sci., 1998, 38, 983-996

Leach and Gillet (2003) Chapters 3, 4, 5 and 6

50 Lecture 7: Topics to be Covered

Clustering
• identifying classes of molecules similar to each other, but different to those in other classes

Topological indexes
• numbers that can be calculated from connection tables

Property prediction
• predicting physicochemical or biological properties directly from connection tables

The Drug Discovery Process
