Selecting Diverse Sets of Compounds C371 Fall 2004

Review

• Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.

Techniques

• High-Throughput Screening (HTS)

• Combinatorial Chemistry

• Early attempts led to large libraries, but little variability in the molecules created

• Need a way to identify subsets of compounds for synthesis, purchase, or testing

Chemical Diversity

• No unambiguous definition

• Need to quantify the degree of diversity of a subset of compounds

• Four main approaches:
  – Cluster analysis
  – Dissimilarity-based methods
  – Cell-based methods
  – Use of optimization techniques

CLUSTER ANALYSIS

• Aim is to divide a group of objects into clusters so that objects within a cluster are similar to one another and dissimilar to objects in other clusters

• Many algorithms for doing this
  – Hierarchical methods seem to be better than non-hierarchical

• Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds

Key Steps in Cluster Analysis

• Generate descriptors for each compound

• Calculate the similarity or distance between all compounds

• Use a clustering algorithm to group the compounds

• Select a representative subset by taking one or more compounds from each cluster
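The final step above can be sketched in a few lines. This is a minimal illustration, assuming the clusters are already available as lists of compound indices and that `dist` is any pairwise distance function. The name `pick_medoids` and the toy data are hypothetical, and taking the medoid is only one common way to choose a representative.

```python
from typing import Callable, List, Sequence


def pick_medoids(clusters: Sequence[Sequence[int]],
                 dist: Callable[[int, int], float]) -> List[int]:
    """Return one representative per cluster: the member with the
    smallest total distance to the other members (the medoid)."""
    reps = []
    for members in clusters:
        best = min(members, key=lambda i: sum(dist(i, j) for j in members))
        reps.append(best)
    return reps


# Toy example (hypothetical data): each compound is described by one property value.
values = [1.0, 1.2, 1.1, 5.0, 5.4, 9.0]
clusters = [[0, 1, 2], [3, 4], [5]]
representatives = pick_medoids(clusters, lambda i, j: abs(values[i] - values[j]))
```

Ties are broken by position in the cluster, so a two-member cluster returns its first member.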

“Distance”

• 1 - S, where S is the similarity coefficient
  – When molecules are represented by binary descriptors

• Euclidean distance
  – When molecules are represented by physicochemical properties
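Both distance measures are short enough to write out. The sketch below assumes binary descriptors are stored as sets of "on" bit positions and uses the Tanimoto coefficient as the similarity coefficient S; the function names and the example values are illustrative only.

```python
import math
from typing import Sequence, Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bits."""
    if not a and not b:
        return 1.0  # assumption: two empty fingerprints count as identical
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Distance as 1 - S for binary descriptors."""
    return 1.0 - tanimoto_similarity(a, b)


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance for vectors of physicochemical properties."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical fingerprints: 2 bits in common, 5 bits set in total.
d_binary = tanimoto_distance({1, 4, 7, 9}, {1, 4, 8})
# Hypothetical two-property vectors.
d_props = euclidean_distance([0.0, 3.0], [4.0, 0.0])
```

In practice, properties measured on different scales are usually standardized before a Euclidean distance is taken; that step is omitted here.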

Characteristics of Clustering Methods

• Non-overlapping: each object is in one cluster only (most methods use this approach)
  – Hierarchical methods
  – Non-hierarchical methods

• Overlapping: object can be in more than one cluster

• Efficiency and effectiveness issues: some approaches have very intensive computational requirements

Hierarchical Clustering

• Clusters increase in size, with each compound in its own cluster (a singleton) at one extreme and all compounds in a single cluster at the other
  – Agglomerative methods start at the bottom and merge similar clusters
    • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean)
    • Others: centroid method and the median method
  – Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
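As a rough illustration of agglomerative clustering with Ward's criterion, the sketch below repeatedly merges the pair of clusters whose merger adds the least to the within-cluster sum of squares. It is a naive implementation intended only for small examples; the function names and toy coordinates are assumptions, and production work would normally use an optimized library routine.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(members: List[int], points: Sequence[Point]) -> List[float]:
    """Mean position of the cluster members."""
    dim = len(points[0])
    return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]


def ward_cost(a: List[int], b: List[int], points: Sequence[Point]) -> float:
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a, points), centroid(b, points)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(points: Sequence[Point], n_clusters: int) -> List[List[int]]:
    """Start from singletons and merge until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(n_clusters, 1):
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]], points))
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return clusters


# Toy two-property data (hypothetical): two tight pairs and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
result = ward_cluster(points, n_clusters=3)
```

Stopping at different values of `n_clusters` corresponds to cutting the hierarchy at different levels.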

Selecting the Appropriate Number of Clusters

• Need a cutoff value at which you are going to examine the molecules
  – Jaccard statistic of two clusters, C1 and C2:

    \[ J(C_1, C_2) = \frac{a}{a + b + c} \]

    where a is the number of compounds found in both clusters, b is the number found in cluster 1 but not in cluster 2, and c is the number found in cluster 2 but not in cluster 1
  – Same as the Tanimoto coefficient
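The Jaccard statistic maps directly onto set operations. A minimal sketch, assuming each cluster is represented as a set of compound identifiers (the identifiers below are made up):

```python
from typing import Set


def jaccard(c1: Set[str], c2: Set[str]) -> float:
    """Jaccard statistic a / (a + b + c) for two clusters."""
    a = len(c1 & c2)  # compounds in both clusters
    b = len(c1 - c2)  # compounds in cluster 1 only
    c = len(c2 - c1)  # compounds in cluster 2 only
    if a + b + c == 0:
        return 0.0  # assumption: two empty clusters score zero
    return a / (a + b + c)


# Hypothetical clusters: a = 2, b = 2, c = 1.
score = jaccard({"m1", "m2", "m3", "m4"}, {"m3", "m4", "m5"})
```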

Non-Hierarchical Clustering

• Compounds are clustered without forming a hierarchical relationship

• Methods:
  – Single-pass: assigns a compound to a cluster according to a cut-off value
    • Problem: doesn’t give the same results all of the time, i.e., the results depend on the order of the molecules
  – Nearest neighbor: Jarvis-Patrick clustering
  – Relocation: K-means
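The order dependence of single-pass clustering is easy to demonstrate. The sketch below uses one simple variant (often called leader clustering), in which a compound joins the first cluster whose first member lies within the cut-off; this variant, the one-dimensional toy data, and the function name are assumptions made for illustration.

```python
from typing import List


def single_pass(values: List[float], cutoff: float) -> List[List[float]]:
    """Assign each value to the first cluster whose leader is within cutoff,
    otherwise start a new cluster with this value as its leader."""
    clusters: List[List[float]] = []
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= cutoff:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters


# The same three compounds, presented in two different orders.
order_a = single_pass([1.0, 2.0, 3.0], cutoff=1.0)
order_b = single_pass([2.0, 1.0, 3.0], cutoff=1.0)
```

With the first ordering, 3.0 is too far from the leader 1.0 and starts its own cluster; with the second, everything falls within the cut-off of the leader 2.0 and one cluster results.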

DISSIMILARITY-BASED SELECTION METHODS

• Attempt to identify a diverse set of compounds directly

• Based on calculating distances or dissimilarities between compounds

Basic Algorithm for Dissimilarity-Based Selection Methods

• Decide on a desired size, n, of the final subset

• Select a compound and place it in the subset

• Calculate the dissimilarity between each of the other compounds and those in the subset

• Choose the next compound as the one most dissimilar to those in the subset

• If fewer than n compounds are in the subset, repeat the dissimilarity calculation and selection until n is reached

• Complexity varies as the square of n
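The basic algorithm leaves open how "most dissimilar to those in the subset" is measured. The sketch below uses the MaxMin rule (pick the compound whose nearest selected neighbor is farthest away), which is one common choice and an assumption here. The starting compound (`seed_index`), the function name, and the toy data are likewise illustrative.

```python
from typing import Callable, List, Sequence


def maxmin_select(points: Sequence[float], n: int,
                  dist: Callable[[float, float], float],
                  seed_index: int = 0) -> List[int]:
    """Greedy dissimilarity-based selection using the MaxMin rule."""
    selected = [seed_index]
    # Distance from every compound to its nearest already-selected compound.
    nearest = [dist(p, points[seed_index]) for p in points]
    while len(selected) < min(n, len(points)):
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        # Only the newly added compound can shrink the nearest distances.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[nxt]))
    return selected


# Toy one-property data (hypothetical).
points = [0.0, 1.0, 2.0, 9.0, 10.0]
subset = maxmin_select(points, n=3, dist=lambda a, b: abs(a - b))
```

Starting from index 0, the selection picks the far end of the range first and then the compound farthest from both ends.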

CELL-BASED METHODS

• Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined

• Compounds are allocated to cells according to their molecular properties

• Methods are very fast, with a time complexity of O(N), but restricted to low-dimensional spaces; good for very large data sets
  – Examples: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
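A cell-based partition can be sketched as a single pass over the compounds, which is where the O(N) behavior comes from. The example assumes a two-dimensional space of MW and logP with fixed, pre-defined bounds and an equal number of bins per axis; the bounds, bin count, compound names, and property values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cell_index(props: Sequence[float],
               bounds: Sequence[Tuple[float, float]],
               bins_per_axis: int) -> Tuple[int, ...]:
    """Map a property vector to the index of the cell that contains it."""
    index = []
    for value, (low, high) in zip(props, bounds):
        k = int((value - low) / (high - low) * bins_per_axis)
        index.append(min(max(k, 0), bins_per_axis - 1))  # clamp out-of-range values
    return tuple(index)


def partition(compounds: Dict[str, Sequence[float]],
              bounds: Sequence[Tuple[float, float]],
              bins_per_axis: int) -> Dict[Tuple[int, ...], List[str]]:
    """Allocate every compound to a cell in one pass."""
    cells: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, props in compounds.items():
        cells[cell_index(props, bounds, bins_per_axis)].append(name)
    return dict(cells)


# Hypothetical (MW, logP) values and axis bounds.
bounds = [(0.0, 500.0), (-2.0, 6.0)]
compounds = {"A": (180.0, 1.2), "B": (190.0, 1.5), "C": (450.0, 4.9), "D": (320.0, -0.5)}
cells = partition(compounds, bounds, bins_per_axis=4)
picks = [names[0] for names in cells.values()]  # one compound per occupied cell
```

Because the cell boundaries do not depend on the data set, empty cells directly show which regions of the space are not covered.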

BCUT Descriptors

• Matrix representation of molecules

• Atomic properties used for the diagonal
  – Atomic charges, polarizabilities, hydrogen bonding

• Connectivity used for the off-diagonals
  – 2D graph or interatomic distances from 3D
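The matrix construction can be illustrated with a small pure-Python sketch. BCUT descriptors are usually taken as the lowest and highest eigenvalues of such a matrix, so the example extracts both with a shifted power iteration. The off-diagonal value `bond_weight`, the function names, and the partial charges are assumptions for illustration and do not reproduce the scaling used in any published BCUT scheme.

```python
from typing import List, Sequence, Tuple


def bcut_like_matrix(atom_props: Sequence[float],
                     bonds: Sequence[Tuple[int, int]],
                     bond_weight: float = 0.5) -> List[List[float]]:
    """Atomic property on the diagonal, a constant weight for bonded pairs."""
    n = len(atom_props)
    m = [[0.0] * n for _ in range(n)]
    for i, prop in enumerate(atom_props):
        m[i][i] = prop
    for i, j in bonds:
        m[i][j] = m[j][i] = bond_weight
    return m


def extreme_eigenvalues(m: List[List[float]], iterations: int = 500) -> Tuple[float, float]:
    """Lowest and highest eigenvalues of a symmetric matrix."""
    n = len(m)
    r = max(sum(abs(x) for x in row) for row in m)  # all eigenvalues lie in [-r, r]

    def dominant(mat: List[List[float]]) -> float:
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-uniform start vector
        lam = 0.0
        for _ in range(iterations):
            w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                return 0.0
            v = [x / norm for x in w]
            # Rayleigh quotient of the normalized vector.
            lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n)) for i in range(n))
        return lam

    up = [[m[i][j] + (r if i == j else 0.0) for j in range(n)] for i in range(n)]
    down = [[(r if i == j else 0.0) - m[i][j] for j in range(n)] for i in range(n)]
    return r - dominant(down), dominant(up) - r


# Hypothetical three-atom chain with made-up partial charges.
matrix = bcut_like_matrix([-0.4, 0.1, 0.3], bonds=[(0, 1), (1, 2)])
low, high = extreme_eigenvalues(matrix)
```

The two eigenvalues give a low-dimensional pair of coordinates per matrix, which is what makes descriptors of this kind usable as axes in a cell-based partition.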

Partitioning Using Pharmacophore Keys

• Each potential 3- or 4-point pharmacophore is considered to constitute a cell

• A given molecule could be in more than one cell

• Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules
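A rough sketch of 3-point pharmacophore partitioning is given below. It assumes each conformer is a list of (feature type, 3D coordinates) pairs, bins inter-feature distances with a fixed `bin_width`, and uses the sorted set of typed, binned edges as the cell key. The feature names, coordinates, bin width, and keying scheme are all illustrative assumptions rather than a published definition.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Feature = Tuple[str, Tuple[float, float, float]]


def conformer_cells(features: Sequence[Feature], bin_width: float = 2.0) -> Set[tuple]:
    """Every 3-point pharmacophore present in one conformer, as a set of cell keys."""
    cells = set()
    for trio in combinations(features, 3):
        edges = []
        for (t1, p1), (t2, p2) in combinations(trio, 2):
            dist_bin = int(math.dist(p1, p2) // bin_width)
            edges.append((tuple(sorted((t1, t2))), dist_bin))
        cells.add(tuple(sorted(edges)))  # order-independent key for the triangle
    return cells


def molecule_cells(conformers: Sequence[Sequence[Feature]]) -> Set[tuple]:
    """A molecule occupies the union of the cells hit by its conformers."""
    occupied: Set[tuple] = set()
    for conformer in conformers:
        occupied |= conformer_cells(conformer)
    return occupied


# Hypothetical molecule: the aromatic feature moves between two conformers.
conf1: List[Feature] = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)),
                        ("aromatic", (0.0, 4.0, 0.0))]
conf2: List[Feature] = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)),
                        ("aromatic", (0.0, 7.0, 0.0))]
rigid = molecule_cells([conf1])
flexible = molecule_cells([conf1, conf2])
```

The single-conformer molecule occupies one cell, while the two-conformer version occupies two, which is the sense in which flexible molecules tend toward promiscuity.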

OPTIMIZATION METHODS

• Techniques for sampling large sets of molecules

• May want to spread the compounds evenly in space

• Techniques: Monte Carlo, simulated annealing

• Selective replacement
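One way to combine these ideas is a simulated-annealing search that swaps one compound at a time. The sketch below is a minimal example under stated assumptions: diversity is scored as the sum of pairwise distances (one of several possible scores), the temperature follows a geometric schedule, and the function names, parameters, and toy data are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def diversity(subset: Sequence[float], dist: Callable[[float, float], float]) -> float:
    """Sum of pairwise distances within the subset."""
    return sum(dist(a, b) for i, a in enumerate(subset) for b in subset[i + 1:])


def anneal_select(pool: List[float], n: int, dist: Callable[[float, float], float],
                  steps: int = 2000, t_start: float = 1.0, t_end: float = 0.01,
                  seed: int = 0) -> Tuple[List[float], float]:
    """Search for a diverse subset of size n by replacing one member at a time."""
    rng = random.Random(seed)
    current = rng.sample(pool, n)
    score = diversity(current, dist)
    best, best_score = list(current), score
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)  # cooling schedule
        outside = [p for p in pool if p not in current]
        if not outside:
            break
        candidate = list(current)
        candidate[rng.randrange(n)] = rng.choice(outside)  # swap one compound
        new_score = diversity(candidate, dist)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_score >= score or rng.random() < math.exp((new_score - score) / t):
            current, score = candidate, new_score
            if score > best_score:
                best, best_score = list(current), score
    return best, best_score


# Toy one-property pool of distinct values (hypothetical).
pool = [0.0, 0.5, 1.0, 5.0, 9.0, 9.5, 10.0]
best, best_score = anneal_select(pool, n=3, dist=lambda a, b: abs(a - b))
```

Setting the temperature near zero from the start reduces this to a plain Monte Carlo search that only accepts improving replacements.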

CONCLUSIONS

• Some research suggests that compounds with a Tanimoto similarity of at least 0.85 have between a 30% and an 80% chance of sharing the same biological activity

• No clear consensus on which screening approach is best

• Faster computer techniques (e.g., parallel computing) may help

• Descriptors used must be related to biological activity
