Selecting Diverse Sets of Compounds
C371
Fall 2004
Review
• Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.
Techniques
• High-Throughput Screening (HTS)
• Combinatorial Chemistry
• Early attempts led to large libraries, but little variability in the molecules created
• Need a way to identify subsets of compounds for synthesis, purchase, or testing
Chemical Diversity
• No unambiguous definition
• Need to quantify the degree of diversity of a subset of compounds
• Four main approaches:
  – Cluster analysis
  – Dissimilarity-based methods
  – Cell-based methods
  – Use of optimization techniques
CLUSTER ANALYSIS
• Aim is to divide a group of compounds into clusters so that objects within a cluster are similar to one another and dissimilar to objects in other clusters
• Many algorithms for doing this
  – Hierarchical methods seem to be better than non-hierarchical
• Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds
Key Steps in Cluster Analysis
• Generate descriptors for each compound
• Calculate the similarity or distance between all compounds
• Use a clustering algorithm to group the compounds
• Select a representative subset by taking one or more compounds from each cluster
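The last step can be sketched in a few lines of Python. This is only an illustration: the slides do not say which member of a cluster to pick, so choosing the medoid (the member with the smallest total distance to the others) is an assumption, and `pick_representatives` is a hypothetical helper name. One-dimensional property values stand in for compounds.

```python
def pick_representatives(clusters, distance):
    """Return one compound per cluster: the medoid, i.e. the member with
    the smallest summed distance to all members of its own cluster.
    (Medoid selection is an assumption, not prescribed by the slides.)"""
    representatives = []
    for members in clusters:
        medoid = min(
            members,
            key=lambda m: sum(distance(m, other) for other in members),
        )
        representatives.append(medoid)
    return representatives


# Toy data: two clusters of 1-D property values.
clusters = [[1.0, 1.2, 2.0], [10.0, 10.5, 11.0]]
reps = pick_representatives(clusters, lambda a, b: abs(a - b))
print(reps)  # [1.2, 10.5]
```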
“Distance”
• 1 - S, where S is the similarity coefficient
  – When molecules are represented by binary descriptors
• Euclidean distance
  – When molecules are represented by physicochemical properties
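A minimal sketch of the two distance measures. It assumes the similarity coefficient S is the Tanimoto coefficient and that a binary descriptor is stored as the set of its "on" bit positions; both choices are illustrative.

```python
import math


def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints, each given as a
    set of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def tanimoto_distance(fp_a, fp_b):
    """Distance for binary descriptors: 1 - S."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


def euclidean_distance(props_a, props_b):
    """Distance for vectors of physicochemical properties."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(props_a, props_b)))


# 2 shared bits out of 5 distinct bits: S = 0.4, distance about 0.6
print(tanimoto_distance({1, 4, 7, 9}, {1, 4, 8}))
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

In practice the properties are usually scaled before taking a Euclidean distance, so that one property with a large numeric range (e.g., MW) does not dominate.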
Characteristics of Clustering Methods
• Non-overlapping: each object in one cluster only (most use this approach)
  – Hierarchical methods
  – Non-hierarchical methods
• Overlapping: object can be in more than one cluster
• Efficiency and effectiveness issues: some approaches have very intensive computational requirements
Hierarchical Clustering
• Clusters increase in size, with each compound in its own cluster (a singleton) at one extreme and all compounds in a single cluster at the other
  – Agglomerative methods start at the bottom and merge similar clusters
    • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean)
    • Others: centroid method and the median method
  – Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
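A rough sketch of agglomerative clustering with Ward's criterion, written in plain Python for readability rather than speed. At each step it merges the pair of clusters whose merge gives the smallest increase in the within-cluster sum of squared deviations. The function names and the toy 2-D points are illustrative assumptions.

```python
from itertools import combinations


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged:
    |a||b| / (|a| + |b|) * squared distance between the centroids."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(points, n_clusters):
    """Start from singletons and merge until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
result = ward_clustering(points, 2)
print(sorted(sorted(c) for c in result))
```

Stopping at `n_clusters` corresponds to cutting the hierarchy at one level; recording every merge instead would give the full hierarchy.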
Selecting the Appropriate Number of Clusters
• Need a cutoff value at which you are going to examine the molecules
  – Jaccard statistic of two clusters, C1 and C2:
    $J(C_1, C_2) = \frac{a}{a + b + c}$
    where a is the number of compounds found in both clusters, b is the number that cluster in 1 but not 2, and c is the number in 2 but not 1
  – Same as the Tanimoto coefficient
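The Jaccard statistic translates directly into set operations. In this sketch clusters are represented as collections of compound identifiers; the identifiers are made up for illustration.

```python
def jaccard(cluster_1, cluster_2):
    """Jaccard statistic a / (a + b + c) for two clusters of compound IDs."""
    c1, c2 = set(cluster_1), set(cluster_2)
    a = len(c1 & c2)   # compounds in both clusters
    b = len(c1 - c2)   # in cluster 1 but not cluster 2
    c = len(c2 - c1)   # in cluster 2 but not cluster 1
    total = a + b + c
    return a / total if total else 1.0


# a = 2, b = 2, c = 1, so the statistic is 2 / 5
print(jaccard(["m1", "m2", "m3", "m4"], ["m3", "m4", "m5"]))  # 0.4
```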
Non-Hierarchical Clustering
• Compounds are clustered without forming a hierarchical relationship
• Methods:
  – Single-pass: assigns a compound to a cluster according to a cut-off value
    • Problem: doesn’t give the same results every time, i.e., the result depends on the order of the molecules
  – Nearest neighbor: Jarvis-Patrick clustering
  – Relocation: K-means
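The order dependence of single-pass clustering is easy to reproduce. The sketch below assumes a "leader" variant, in which each compound joins the first cluster whose first member lies within the cut-off; other variants compare against a cluster centroid instead. One-dimensional values stand in for compounds.

```python
def single_pass(values, cutoff):
    """Leader-style single-pass clustering of 1-D values.
    Each value joins the first cluster whose leader (first member) is
    within `cutoff`; otherwise it starts a new cluster."""
    clusters = []
    for v in values:
        for members in clusters:
            if abs(v - members[0]) <= cutoff:
                members.append(v)
                break
        else:  # no existing cluster was close enough
            clusters.append([v])
    return clusters


# Same three compounds, two input orders, two different clusterings.
print(single_pass([1.0, 2.0, 3.0], cutoff=1.0))  # [[1.0, 2.0], [3.0]]
print(single_pass([2.0, 1.0, 3.0], cutoff=1.0))  # [[2.0, 1.0, 3.0]]
```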
DISSIMILARITY-BASED SELECTION METHODS
• Attempt to identify a diverse set of compounds directly
• Based on calculating distances or dissimilarities between compounds
Basic Algorithm for Dissimilarity-Based Selection Methods
• Decide on a desired size, n, of a final subset
• Select a compound and place it in the subset
• Calculate the dissimilarity between each of the other compounds and those in the subset
• Choose the next compound as the one most dissimilar to those in the subset
• If fewer than n in the subset, repeat the calculation of the dissimilarity until n is achieved
• Complexity varies as the square of n
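The steps above can be sketched as follows. "Most dissimilar to those in the subset" can be defined in more than one way; this sketch assumes the MaxMin rule, where a candidate is scored by its distance to the nearest compound already selected. Starting from the first compound is also an arbitrary choice.

```python
def maxmin_select(compounds, n, distance, seed_index=0):
    """Dissimilarity-based selection using the MaxMin rule (an assumption):
    repeatedly add the compound whose nearest selected neighbour is
    farthest away."""
    selected = [seed_index]
    # nearest[i]: distance from compound i to its closest selected compound
    nearest = [distance(c, compounds[seed_index]) for c in compounds]
    while len(selected) < min(n, len(compounds)):
        candidates = [i for i in range(len(compounds)) if i not in selected]
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        # Only distances to the newly added compound need to be computed.
        for i, c in enumerate(compounds):
            nearest[i] = min(nearest[i], distance(c, compounds[nxt]))
    return [compounds[i] for i in selected]


values = [0.0, 0.1, 0.2, 5.0, 9.8, 10.0]
subset = maxmin_select(values, 3, lambda a, b: abs(a - b))
print(subset)  # [0.0, 10.0, 5.0]
```

Caching the nearest-selected distance avoids recomputing every compound-to-subset dissimilarity on each pass.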
CELL-BASED METHODS
• Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined
• Compounds are allocated to cells according to their molecular properties
• Methods are very fast with a time complexity of O(N), but restricted to low-dimensional space; good for very large data sets
  – Examples: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
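A minimal sketch of cell-based partitioning over two properties, MW and logP. The axis ranges, bin counts, compound names, and property values are all invented for illustration. Each compound is placed in a cell in a single pass, which is where the O(N) cost comes from.

```python
from collections import defaultdict


def cell_index(props, axes):
    """Map a property vector to a tuple of bin indices.
    `axes` holds one (low, high, n_bins) triple per property."""
    index = []
    for value, (low, high, n_bins) in zip(props, axes):
        k = int((value - low) / (high - low) * n_bins)
        index.append(min(max(k, 0), n_bins - 1))  # clamp out-of-range values
    return tuple(index)


def partition(compounds, axes):
    """Allocate every compound to its cell in one pass over the data."""
    cells = defaultdict(list)
    for name, props in compounds.items():
        cells[cell_index(props, axes)].append(name)
    return dict(cells)


# Hypothetical chemistry space: MW in 100-unit bins, logP in 2-unit bins.
axes = [(0.0, 500.0, 5), (-2.0, 6.0, 4)]
compounds = {
    "cpd_a": (180.2, 1.2),
    "cpd_b": (195.0, 1.9),
    "cpd_c": (420.5, 4.7),
}
cells = partition(compounds, axes)
diverse = [names[0] for names in cells.values()]  # one compound per occupied cell
print(cells)    # {(1, 1): ['cpd_a', 'cpd_b'], (4, 3): ['cpd_c']}
print(diverse)  # ['cpd_a', 'cpd_c']
```

Because the axes are fixed in advance, the cell of a compound does not change when other compounds are added, and empty cells show which regions of the space a collection does not cover.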
BCUT Descriptors
• Matrix representation of molecules
• Atomic properties used for diagonal
  – Atomic charges, polarizabilities, hydrogen bonding
• Connectivity used for the off-diagonals
  – 2D graph or interatomic distances from 3D
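The matrix layout can be illustrated with a small sketch. The off-diagonal value of 1.0 for bonded atom pairs and 0.0 otherwise is an assumption made here for simplicity, and the partial charges are invented. BCUT descriptor values are then taken from the lowest and highest eigenvalues of such a matrix, a step that is left out to keep the example dependency-free.

```python
def bcut_style_matrix(atom_values, bonds, bonded_value=1.0):
    """Build a symmetric matrix with an atomic property on the diagonal
    and 2D connectivity on the off-diagonals.
    `bonded_value` is an illustrative choice, not a prescribed weight."""
    n = len(atom_values)
    matrix = [[0.0] * n for _ in range(n)]
    for i, value in enumerate(atom_values):
        matrix[i][i] = value
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = bonded_value
    return matrix


# Hypothetical partial charges for a three-atom chain 0-1-2.
m = bcut_style_matrix([-0.4, 0.1, 0.3], [(0, 1), (1, 2)])
for row in m:
    print(row)
```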
Partitioning Using Pharmacophore Keys
• Each potential 3- or 4-point pharmacophore is considered to constitute a cell
• A given molecule could be in more than one cell
• Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules
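A rough sketch of how one molecule maps to several 3-point pharmacophore cells. It assumes a single conformation, features given as a type plus 3D coordinates, and a 2 Å distance bin width; the feature list is invented. A flexible molecule would contribute the union of the cells over all its conformations, which is why it can occupy many cells.

```python
import math
from itertools import combinations, permutations


def pharmacophore_cells(features, bin_width=2.0):
    """Return the set of 3-point pharmacophore cells for one conformation.
    `features` is a list of (feature_type, (x, y, z)) pairs. A cell is
    identified by three feature types plus three binned distances."""
    cells = set()
    for triple in combinations(features, 3):
        keys = []
        for (ta, pa), (tb, pb), (tc, pc) in permutations(triple):
            dists = (math.dist(pa, pb), math.dist(pb, pc), math.dist(pa, pc))
            bins = tuple(int(d // bin_width) for d in dists)
            keys.append(((ta, tb, tc), bins))
        cells.add(min(keys))  # canonical key, independent of feature order
    return cells


# Hypothetical molecule with four pharmacophore features.
features = [
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
    ("aromatic", (0.0, 4.0, 0.0)),
    ("donor", (3.0, 4.0, 0.0)),
]
cells = pharmacophore_cells(features)
print(len(cells))  # 4: this molecule occupies four cells
```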
OPTIMIZATION METHODS
• Techniques for sampling large sets of molecules
• May want to spread the compounds evenly in space
• Techniques: Monte Carlo, simulated annealing
• Selective replacement
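A minimal sketch of Monte Carlo selection by replacement. It assumes the diversity score is the smallest pairwise distance in the subset (larger means the compounds are spread more evenly) and accepts a random swap only if the score does not get worse. Simulated annealing would additionally accept some worsening swaps, with a probability that shrinks as a temperature parameter is lowered. Function names and the toy pool are illustrative.

```python
import random


def min_pairwise_distance(subset, distance):
    """Diversity score (an assumed choice): smallest pairwise distance.
    Requires at least two compounds."""
    return min(
        distance(a, b)
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    )


def monte_carlo_select(pool, n, distance, steps=2000, seed=0):
    """Start from a random subset of size n >= 2 and try random swaps,
    keeping a swap when the diversity score does not decrease."""
    rng = random.Random(seed)
    current = rng.sample(pool, n)
    score = min_pairwise_distance(current, distance)
    for _ in range(steps):
        outside = [c for c in pool if c not in current]
        if not outside:
            break
        trial = list(current)
        trial[rng.randrange(n)] = rng.choice(outside)  # replace one member
        trial_score = min_pairwise_distance(trial, distance)
        if trial_score >= score:
            current, score = trial, trial_score
    return current, score


pool = [float(x) for x in range(11)]  # 1-D stand-ins for compounds
best, best_score = monte_carlo_select(pool, 3, lambda a, b: abs(a - b))
print(sorted(best), best_score)
```

Accepting equal-score swaps lets the search drift across plateaus instead of stalling at the first subset that no single swap can strictly improve.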
CONCLUSIONS
• Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have a 30% to 80% chance of sharing the same biological activity
• No clear consensus on which screening approach is best
• Faster computer techniques (e.g., parallel computing) may help
• Descriptors used must be related to biological activity