Selecting Diverse Sets of Compounds C371 Fall 2004


Selecting Diverse Sets of Compounds

C371

Fall 2004


Review

• Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.


Techniques

• High-Throughput Screening (HTS)

• Combinatorial Chemistry

• Early attempts led to large libraries, but little variability in the molecules created

• Need a way to identify subsets of compounds for synthesis, purchase, or testing


Chemical Diversity

• No unambiguous definition

• Need to quantify the degree of diversity of a subset of compounds

• Four main approaches:
  – Cluster analysis
  – Dissimilarity-based methods
  – Cell-based methods
  – Use of optimization techniques


CLUSTER ANALYSIS

• The aim is to divide a set of objects into clusters such that objects within a cluster are similar to one another and objects in different clusters are dissimilar

• Many algorithms for doing this
  – Hierarchical methods seem to be better than non-hierarchical

• Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds


Key Steps in Cluster Analysis

• Generate descriptors for each compound

• Calculate the similarity or distance between all compounds

• Use a clustering algorithm to group the compounds

• Select a representative subset by taking one or more compounds from each cluster
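The final step leaves open how a representative is chosen. One common option, assumed here for illustration, is the medoid: the cluster member with the smallest total distance to the other members. The compound names, descriptor values, and the `pick_representatives` helper are hypothetical.

```python
# Hypothetical 1-D descriptor values for five compounds.
values = {"a": 1.0, "b": 2.0, "c": 6.0, "d": 10.0, "e": 11.0}


def dist(x, y):
    """Distance between two compounds (absolute descriptor difference)."""
    return abs(values[x] - values[y])


def medoid(cluster, dist):
    """Return the member with the smallest summed distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))


def pick_representatives(clusters, dist):
    """Take one representative (the medoid) from each cluster."""
    return [medoid(cluster, dist) for cluster in clusters]


# Clusters as they might come out of any clustering algorithm.
clusters = [["a", "b", "c"], ["d", "e"]]
representatives = pick_representatives(clusters, dist)
print(representatives)
```

Ties are broken by position in the cluster, so a different input order can give a different, equally central, representative.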


“Distance”

• 1 - S, where S is the similarity coefficient
  – When molecules are represented by binary descriptors

• Euclidean distance
  – When molecules are represented by physicochemical properties
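Both distance measures can be sketched in a few lines. The example assumes that a binary descriptor is stored as the set of its "on" bit positions and that S is the Tanimoto coefficient; the fingerprints and property vectors are made up for illustration.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / union


def tanimoto_distance(a, b):
    """Distance for binary descriptors: 1 - S."""
    return 1.0 - tanimoto(a, b)


def euclidean_distance(x, y):
    """Distance for vectors of physicochemical properties."""
    if len(x) != len(y):
        raise ValueError("property vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


fp1 = {1, 4, 7, 9}   # hypothetical fingerprint
fp2 = {1, 4, 8}      # shares 2 bits with fp1; 5 bits in the union
print(tanimoto_distance(fp1, fp2))
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
```

In practice, properties on different scales (for example MW and logP) are usually standardized before a Euclidean distance is taken, otherwise the largest-valued property dominates.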


Characteristics of Clustering Methods

• Non-overlapping: each object in one cluster only (most methods use this approach)
  – Hierarchical methods
  – Non-hierarchical methods

• Overlapping: object can be in more than one cluster

• Efficiency and effectiveness issues: some approaches have very intensive computational requirements


Hierarchical Clustering

• Clusters increase in size, with each compound in a single cluster (a singleton) at one extreme
  – Agglomerative methods start at the bottom and merge similar clusters
    • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean)
    • Others: centroid method and the median method
  – Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
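A minimal sketch of agglomerative clustering with Ward's criterion follows. It is a naive illustration, not an efficient implementation: every compound starts as a singleton, and at each step the two clusters whose merger increases the within-cluster sum of squares the least are joined. The 2-D points stand in for hypothetical descriptor vectors, and the function names are illustrative.

```python
def centroid(points):
    """Mean position of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def ward_cost(a, b):
    """Increase in the within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap


def agglomerate(points, n_clusters):
    """Merge singletons bottom-up until n_clusters clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # start: every compound is a singleton
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(agglomerate(points, 3))
```

Stopping at `n_clusters` corresponds to cutting the hierarchy at one level; recording the merge order instead would give the full dendrogram.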


Selecting the Appropriate Number of Clusters

• Need a cutoff value at which you are going to examine the molecules
  – Jaccard statistic of two clusters, C1 and C2:

    $$J(C_1, C_2) = \frac{a}{a + b + c}$$

    where a is the number of compounds found in both clusters, b is the number that cluster in 1 but not 2, and c is the number in 2 but not 1
  – Same as the Tanimoto coefficient
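The statistic is straightforward to compute when each cluster is held as a set of compound identifiers. The sketch below follows the definitions of a, b, and c given above; the identifiers are hypothetical.

```python
def jaccard(c1, c2):
    """Jaccard statistic a / (a + b + c) for two clusters of compound IDs."""
    c1, c2 = set(c1), set(c2)
    a = len(c1 & c2)   # compounds found in both clusters
    b = len(c1 - c2)   # in cluster 1 but not cluster 2
    c = len(c2 - c1)   # in cluster 2 but not cluster 1
    if a + b + c == 0:
        raise ValueError("at least one cluster must be non-empty")
    return a / (a + b + c)


cluster_1 = {"m1", "m2", "m3", "m4"}   # hypothetical compound IDs
cluster_2 = {"m3", "m4", "m5"}
print(jaccard(cluster_1, cluster_2))   # a = 2, b = 2, c = 1
```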


Non-Hierarchical Clustering

• Compounds are clustered without forming a hierarchical relationship

• Methods:
  – Single-pass: assigns a compound to a cluster according to a cut-off value
    • Problem: doesn’t give the same results all of the time, i.e., dependent on the order of the molecules
  – Nearest neighbor: Jarvis-Patrick clustering
  – Relocation: K-means
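The order dependence of single-pass clustering is easy to demonstrate. The sketch below assumes a "leader" variant, in which each compound joins the first cluster whose first member lies within the cut-off and otherwise starts a new cluster; the 1-D values are made-up descriptors.

```python
def single_pass(values, cutoff):
    """Leader-style single-pass clustering of 1-D descriptor values."""
    clusters = []
    for v in values:
        for cluster in clusters:
            # Compare against the cluster's first member (its "leader").
            if abs(v - cluster[0]) <= cutoff:
                cluster.append(v)
                break
        else:
            clusters.append([v])  # no cluster close enough: start a new one
    return clusters


# Same three compounds, two different input orders.
print(single_pass([0.0, 1.0, 2.0], cutoff=1.5))
print(single_pass([1.0, 0.0, 2.0], cutoff=1.5))
```

With the first order, 2.0 is too far from the leader 0.0 and forms its own cluster; with the second order, the leader is 1.0 and all three compounds fall into one cluster.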


DISSIMILARITY-BASED SELECTION METHODS

• Attempt to identify a diverse set of compounds directly

• Based on calculating distances or dissimilarities between compounds


Basic Algorithm for Dissimilarity-Based Selection Methods

• Decide on a desired size, n, of a final subset

• Select a compound and place it in the subset

• Calculate the dissimilarity between each of the other compounds and those in the subset

• Choose the next compound as the one most dissimilar to those in the subset

• If fewer than n in the subset, repeat the calculation of the dissimilarity until n is achieved

• Complexity varies as the square of n
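The steps above can be sketched as follows. "Most dissimilar to those in the subset" has to be made precise; this sketch assumes the MaxMin rule, in which a candidate is scored by its distance to the nearest compound already selected. Other rules, such as the sum of distances, are equally possible. The 1-D values and the `seed_index` parameter are illustrative.

```python
def maxmin_select(compounds, dist, n, seed_index=0):
    """Pick n compounds, each time taking the one farthest from the subset."""
    if not 0 < n <= len(compounds):
        raise ValueError("n must be between 1 and the number of compounds")
    selected = [seed_index]
    # Cache each compound's distance to its nearest selected compound, so a
    # round only needs distances to the newest pick.
    min_dist = [dist(c, compounds[seed_index]) for c in compounds]
    while len(selected) < n:
        candidates = (i for i in range(len(compounds)) if i not in selected)
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, c in enumerate(compounds):
            min_dist[i] = min(min_dist[i], dist(c, compounds[nxt]))
    return [compounds[i] for i in selected]


def dist(x, y):
    """Hypothetical dissimilarity: absolute difference of 1-D descriptors."""
    return abs(x - y)


library = [0.0, 1.0, 2.0, 9.0, 10.0]
print(maxmin_select(library, dist, 3))
```

The result depends on the starting compound, which is why the seed is exposed as a parameter rather than fixed.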


CELL-BASED METHODS

• Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined

• Compounds are allocated to cells according to their molecular properties

• Methods are very fast with a time complexity of O(N), but restricted to low-dimensional space; good for very large data sets
  – Examples of properties: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
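A minimal sketch of cell-based partitioning, assuming a two-dimensional space of MW and logP. The bin edges and compound data are hypothetical; the key point is that the cells are fixed in advance, so each compound is assigned in a single pass without comparing it to any other compound.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges; they define the cells independently of the data.
MW_EDGES = [200.0, 350.0, 500.0]
LOGP_EDGES = [0.0, 2.5, 5.0]


def cell_of(mw, logp):
    """Return the (MW bin, logP bin) cell that a compound falls into."""
    return (bisect_right(MW_EDGES, mw), bisect_right(LOGP_EDGES, logp))


def partition(compounds):
    """Allocate (name, MW, logP) records to cells in one pass."""
    cells = defaultdict(list)
    for name, mw, logp in compounds:
        cells[cell_of(mw, logp)].append(name)
    return dict(cells)


compounds = [
    ("c1", 180.2, 1.2),
    ("c2", 185.0, 1.9),
    ("c3", 320.4, 3.1),
    ("c4", 510.7, 5.6),
]
cells = partition(compounds)
# A diverse subset: one compound from each occupied cell.
subset = [members[0] for members in cells.values()]
print(cells)
print(subset)
```

Empty cells are as informative as occupied ones: they show which regions of the chemistry space the collection does not cover.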


BCUT Descriptors

• Matrix representation of molecules

• Atomic properties used for diagonal
  – Atomic charges, polarizabilities, hydrogen bonding

• Connectivity used for the off-diagonals
  – 2D graph or interatomic distances from 3D


Partitioning Using Pharmacophore Keys

• Each potential 3- or 4-point pharmacophore is considered to constitute a cell

• A given molecule could be in more than one cell

• Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules
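Because one molecule can occupy many cells, the partition is naturally stored as a mapping from pharmacophore key to the set of molecules that exhibit it. The sketch below assumes each molecule's pharmacophore keys have already been enumerated; the key labels, molecule names, and threshold are all hypothetical.

```python
from collections import defaultdict


def build_cells(keys_by_molecule):
    """Invert molecule -> keys into cell (key) -> molecules."""
    cells = defaultdict(set)
    for molecule, keys in keys_by_molecule.items():
        for key in keys:
            cells[key].add(molecule)  # a molecule may land in many cells
    return dict(cells)


def promiscuous(keys_by_molecule, threshold):
    """Molecules that exhibit more than `threshold` pharmacophores."""
    return sorted(m for m, keys in keys_by_molecule.items() if len(keys) > threshold)


# Hypothetical pharmacophore keys per molecule (e.g., from conformer analysis).
keys_by_molecule = {
    "mol_a": {"P1", "P2"},
    "mol_b": {"P2"},
    "mol_c": {"P1", "P2", "P3", "P4"},
}
cells = build_cells(keys_by_molecule)
print(cells)
print(promiscuous(keys_by_molecule, threshold=3))
```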


OPTIMIZATION METHODS

• Techniques for sampling large sets of molecules

• May want to spread the compounds evenly in space

• Techniques: Monte Carlo, simulated annealing

• Selective replacement
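A minimal Monte Carlo sketch of selective replacement, under stated assumptions: diversity is scored as the sum of pairwise distances within the subset, a random member is swapped for a random outside compound at each step, and only improving swaps are kept. Simulated annealing would additionally accept some worsening swaps with a temperature-dependent probability. All names, data, and parameters are illustrative.

```python
import random
from itertools import combinations


def diversity(subset, dist):
    """Assumed diversity score: sum of all pairwise distances in the subset."""
    return sum(dist(a, b) for a, b in combinations(subset, 2))


def monte_carlo_select(compounds, dist, n, steps=1000, seed=0):
    """Improve a random n-subset by repeated single-compound replacement."""
    rng = random.Random(seed)
    indices = list(range(len(compounds)))
    chosen = rng.sample(indices, n)
    best = diversity([compounds[i] for i in chosen], dist)
    for _ in range(steps):
        outside = [i for i in indices if i not in chosen]
        if not outside:
            break  # the subset already contains every compound
        trial = list(chosen)
        trial[rng.randrange(n)] = rng.choice(outside)  # replace one member
        score = diversity([compounds[i] for i in trial], dist)
        if score > best:  # keep only swaps that improve diversity
            chosen, best = trial, score
    return [compounds[i] for i in chosen], best


def dist(x, y):
    """Hypothetical dissimilarity: absolute difference of 1-D descriptors."""
    return abs(x - y)


library = [0.0, 1.0, 2.0, 3.0, 10.0]
subset, score = monte_carlo_select(library, dist, n=2)
print(sorted(subset), score)
```

The choice of scoring function matters: a sum of pairwise distances tends to push selections toward the edges of the space, so a criterion based on minimum distances may be preferable when an even spread is wanted.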


CONCLUSIONS

• Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have between a 30% and 80% chance of sharing the same biological activity

• No clear consensus on which screening approach is best

• Faster computer techniques (e.g., parallel computing) may help

• Descriptors used must be related to biological activity