
Parallel tiered clustering for large data sets using a modified Taylor’s algorithm

J. MacCuish¹, N. MacCuish¹, M. Chapman¹

¹ Mesa Analytics & Computing, Inc., Santa Fe, New Mexico, USA


Abstract

Clustering large sets has many applications in drug discovery, among them compound acquisition decisions and combinatorial library diversification. Molecular fingerprints (2D) and molecular shape conformers (3D) from PubChem are the basic descriptors comprising the large sets used in this study. A parallel tiered clustering algorithm, implementing a modified Taylor’s algorithm, will be described as an efficient method for analyzing data sets at this scale. Results will be presented in SAESAR (Shape And Electrostatics Structure Activity Relationships).


Motivation

Though leader and related exclusion region clustering algorithms, such as Taylor/Butina¹,² clustering, are fast and can group millions of compound fingerprints in parallel, they suffer from the difficulty of finding an appropriate region threshold for the problem at hand. K-means clustering, also used for large scale clustering, suffers an analogous problem in the choice of K. Finding an appropriate threshold or choice of K for the data can be very computationally expensive, above and beyond the expense of clustering millions of compounds.


Methods

Algorithm: We use Taylor’s algorithm, modified to assign false singletons to their nearest respective cluster, and to break exclusion region ties (where clusters with the greatest membership have the same cardinality) by finding the most compact cluster. Additionally, the input to the algorithm is a sparse matrix composed of only those values that are reasonable dissimilarities. This helps on two counts: first, the generation of the matrix can take into account other efficiencies in eliminating unnecessary comparisons; second, the sparse matrix greatly reduces internal memory and disk storage, often by a large constant factor (e.g., 100 times). The algorithm returns clusters, their respective representative elements (centroids or centrotypes), true singletons, false singleton cluster assignments, and ambiguity statistics.
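A minimal single-process sketch of the modified exclusion-region step described above, assuming the sparse matrix is a symmetric dict of neighbor dicts (sparse[i][j] = dissimilarity, stored only for retained pairs); the function name and data layout are illustrative rather than the authors’ implementation:

    def taylor_cluster(sparse, n_items, threshold):
        """Exclusion-region clustering at `threshold` over a sparse dissimilarity matrix.

        sparse[i] maps neighbor j -> dissimilarity(i, j), stored symmetrically and
        only for pairs kept in the sparse matrix.
        Returns (clusters, centrotypes, true_singletons, false_singleton_map).
        """
        # Neighborhoods restricted to the current threshold.
        neigh = {i: {j: d for j, d in sparse.get(i, {}).items() if d <= threshold}
                 for i in range(n_items)}
        unassigned = set(range(n_items))
        clusters, centrotypes = [], []

        while unassigned:
            def key(i):
                members = [j for j in neigh[i] if j in unassigned]
                size = len(members)
                # Tie-break equal-sized exclusion regions by compactness
                # (smaller mean distance to the candidate center wins).
                mean_d = sum(neigh[i][j] for j in members) / size if size else 0.0
                return (-size, mean_d)

            center = min(unassigned, key=key)
            members = {j for j in neigh[center] if j in unassigned}
            if not members:          # nothing left with unclaimed neighbors
                break
            clusters.append({center} | members)
            centrotypes.append(center)
            unassigned -= {center} | members

        # Remaining items: true singletons have no neighbor within the threshold;
        # false singletons lost all their neighbors to earlier clusters and are
        # assigned to the cluster of their nearest neighbor.
        true_singletons, false_assign = [], {}
        for i in unassigned:
            if not neigh[i]:
                true_singletons.append(i)
                continue
            nearest = min(neigh[i], key=neigh[i].get)
            for c, cluster in enumerate(clusters):
                if nearest in cluster:
                    cluster.add(i)
                    false_assign[i] = c
                    break
        return clusters, centrotypes, true_singletons, false_assign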


Tiered Taylor’s

Taylor’s algorithm can then be used as a base algorithm to iteratively span a set of regular thresholds, successively reducing the size of the sparse matrix used at each step.

Namely, create a base sparse matrix, M, at some broad threshold (e.g., Tanimoto 0.7, or 0.3 dissimilarity); choose a minimum threshold, T (e.g., Tanimoto 0.95), a step size, S (e.g., 0.01), and a stopping threshold, N. In principle, these matrices can hold dissimilarity values from any data. Here we focus on fingerprint and shape Tanimoto values transformed to the Soergel dissimilarity.
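For binary fingerprints the Soergel dissimilarity is simply 1 minus the Tanimoto similarity, so the example thresholds translate directly between the two scales; a minimal sketch, assuming fingerprints are stored as sets of "on" bit positions:

    def tanimoto(a: set, b: set) -> float:
        """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
        inter = len(a & b)
        union = len(a) + len(b) - inter
        return inter / union if union else 1.0

    def soergel(a: set, b: set) -> float:
        """Soergel dissimilarity; for binary fingerprints this equals 1 - Tanimoto."""
        return 1.0 - tanimoto(a, b)

    # Example tier schedule matching the values above: from 0.05 dissimilarity
    # (Tanimoto 0.95) up to the 0.3 base-matrix threshold in steps of 0.01.
    thresholds = [round(0.05 + 0.01 * k, 2) for k in range(26)]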


Tiered Algorithm

Preprocess steps:
• Create sparse matrix M (in parallel) at a threshold N
• Remove singletons
• M’ = M

Input: M, T, N
• Cluster in parallel with threshold = T.
• Pool cluster representatives and singletons into set V.
• Collect matrix information for V from M and create new M’.
• Calculate the mean of all internal cluster distances for Kelley’s Level Selection, and output the number of singletons and the number of clusters.
• Set T = T + S.
• Repeat until T = N.

Compute Kelley Level Selection values over the span of iterations, normalized for the size of the data at each iteration.

Output: Each iteration represents a clustering, and the full results represent a forest of trees, the leaves containing the first cluster representatives and each level the results of successive iterations.
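A compact sequential sketch of the loop above, under the same assumed dict-of-dicts sparse matrix as the earlier sketch; cluster_at(sparse, items, threshold) stands in for the parallel modified Taylor’s step and is assumed to return (clusters, centrotypes, singletons):

    def tiered_clustering(sparse, items, cluster_at, t, s, n):
        """Run the tier loop from threshold t to n in steps of s, shrinking the matrix."""
        tiers = []
        current = set(items)
        while t <= n + 1e-9:
            clusters, centrotypes, singletons = cluster_at(sparse, current, t)

            # Mean internal cluster distance (here: member-to-centrotype), kept
            # per tier for Kelley's level selection.
            spreads = [sum(sparse.get(rep, {}).get(j, t) for j in c if j != rep) / (len(c) - 1)
                       for c, rep in zip(clusters, centrotypes) if len(c) > 1]
            tiers.append({"threshold": round(t, 3),
                          "n_clusters": len(clusters),
                          "n_singletons": len(singletons),
                          "mean_spread": sum(spreads) / len(spreads) if spreads else 0.0})

            # Pool representatives and singletons into V and restrict the matrix to V.
            v = set(centrotypes) | set(singletons)
            sparse = {i: {j: d for j, d in sparse.get(i, {}).items() if j in v} for i in v}
            current = v
            t += s
        return tiers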


Data and Equipment

Data set: PubChem Kinase Data, FAK³

96,881 compounds, with 811 actives
90,784 compounds with salts and charged compounds removed
89,507 unique fingerprints (1.4% duplicates with Mesa 768 key-bit fingerprints)

Equipment: Single Alienware workstation with 4 gigabytes of RAM and 4 Intel 3.2 gigahertz Xeon cores, running SuSE 9.3 64-bit OS.


Timings

Parallel matrix generation for 89,507 fingerprints:
24 minutes -- sparse matrix at 0.3 dissimilarity for 89,507 fingerprints
versus 43 minutes sequentially

Parallel tiered clustering, including Kelley’s level selection:
2.5 minutes -- from 0.1 to 0.3 dissimilarity with 0.01 step size (21 iterations)
versus 8 minutes sequentially

IO times are included.

The largest single clustering to date with proprietary data is ~6.7 million compounds: 3 weeks on a 32-node cluster running 2.0 gigahertz chips for the matrix generation, and 5 days on a single machine with 12 cores of 3.2 gigahertz chips and 128 gigabytes of RAM for the clustering, using MDL 320 MACCS key fingerprints.


Kelley’s level selection
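A sketch of level selection over the per-tier statistics, using the commonly cited Kelley-Gardner-Sutcliffe penalty: the mean within-cluster spread is min-max scaled onto [1, L-1] and added to the number of clusters, and the tier with the smallest penalty is chosen. The per-tier normalization for data size mentioned earlier is an additional step not reproduced here:

    def kelley_select(tiers):
        """Pick the tier with the minimum Kelley penalty.

        tiers: list of dicts with 'mean_spread' and 'n_clusters', one per tier
        (e.g., the output of the tier-loop sketch above).
        """
        spreads = [t["mean_spread"] for t in tiers]
        lo, hi = min(spreads), max(spreads)
        span = (hi - lo) or 1.0
        L = len(tiers)
        penalties = [((L - 2) * (s - lo) / span + 1.0) + t["n_clusters"]
                     for t, s in zip(tiers, spreads)]
        best = min(range(L), key=penalties.__getitem__)
        return best, penalties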


Using the Tiers

Tiered output is a forest of general rooted trees

(Figure: the tiered forest. Nodes at each level are a set of centrotypes.)

Leaves are clusters or singletons containing all compounds. The height of the trees is the Tanimoto range: for example, 0.7 at the top to 0.9 at the bottom, with cluster centrotypes in levels of 0.01.

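One minimal way to hold this forest in memory, assuming each tier recorded which lower-tier representatives each new centrotype absorbed (names and layout are illustrative):

    from collections import defaultdict

    def build_forest(tier_membership):
        """tier_membership[k] maps a centrotype chosen at tier k to the tier-(k-1)
        representatives (or compounds, when k == 0) grouped under it.
        Returns a child map keyed by (tier, centrotype); the top tier's keys are roots."""
        children = defaultdict(list)
        for k, groups in enumerate(tier_membership):
            for rep, members in groups.items():
                for m in members:
                    children[(k, rep)].append((k - 1, m) if k > 0 else ("leaf", m))
        return children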


Hierarchies


Kinase data: centroids found at 0.76 Tanimoto with tiered clustering, clustered again with group average hierarchical clustering.
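A brief sketch of this follow-up step, assuming the selected centrotype fingerprints are available as a 0/1 NumPy array with one row per centrotype; for binary data the Jaccard distance computed by SciPy equals the Soergel (1 - Tanimoto) dissimilarity:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_centroids(centroid_fps: np.ndarray, cut: float = 0.3):
        """Group average (UPGMA) hierarchy over centrotype fingerprints, cut at `cut`."""
        d = pdist(centroid_fps.astype(bool), metric="jaccard")  # condensed 1 - Tanimoto
        z = linkage(d, method="average")                        # group average linkage
        return z, fcluster(z, t=cut, criterion="distance")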


Shape Cluster of Active Cluster


Acknowledgments: Software

OpenEye Scientific Software, Inc.
OpenBabel
Dalke Scientific, LLC

References

1. Taylor, R. Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals. J. Chem. Inf. Comput. Sci. 1995, 35, 59-67.

2. Butina, D. Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Comput. Sci. 1999, 39(4), 747-750.

3. PubChem: Primary biochemical high-throughput screening assay for inhibitors of Focal Adhesion Kinase (FAK). http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay&term=Kinase