Journal of Cheminformatics...Journal of Cheminformatics Research article Open Access Optimal assignment methods for ligand-based virtual screening Andreas Jahn*, Georg Hinselmann,

Journal of Cheminformatics

Open AcceResearch articleOptimal assignment methods for ligand-based virtual screeningAndreas Jahn Georg Hinselmann Nikolas Fechner and Andreas Zell

Address University of Tuumlbingen Center for Bioinformatics Tuumlbingen (ZBIT) Sand 1 72076 Tuumlbingen Germany

Email Andreas Jahn - andreasjahnuni-tuebingende Georg Hinselmann - georghinselmannuni-tuebingende Nikolas Fechner - nikolasfechneruni-tuebingende Andreas Zell - andreaszelluni-tuebingende

Corresponding author

AbstractBackground Ligand-based virtual screening experiments are an important task in the early drugdiscovery stage An ambitious aim in each experiment is to disclose active structures based on newscaffolds To perform these scaffold-hoppings for individual problems and targets a plethora ofdifferent similarity methods based on diverse techniques were published in the last years Theoptimal assignment approach on molecular graphs a successful method in the field of quantitativestructure-activity relationships has not been tested as a ligand-based virtual screening method sofar

Results We evaluated two already published and two new optimal assignment methods onvarious data sets To emphasize the scaffold-hopping ability we used the information ofchemotype clustering analyses in our evaluation metrics Comparisons with literature results showan improved early recognition performance and comparable results over the complete data set Anew method based on two different assignment steps shows an increased scaffold-hoppingbehavior together with a good early recognition performance

Conclusion The presented methods show a good combination of chemotype discovery andenrichment of active structures Additionally the optimal assignment on molecular graphs has theadvantage to investigate and interpret the mappings allowing precise modifications of internalparameters of the similarity measure for specific targets All methods have low computation timeswhich make them applicable to screen large data sets

BackgroundVirtual screening (VS) approaches are a fundamental partof the drug discovery pipeline [12] The top-ranked struc-tures of VS experiments are further analysed in biologicalassays to elucidate their activities The methods for thispurpose can be divided into structure-based and ligand-based VS techniques Structure-based approaches likeDOCK [34] or PLANTS [5] try to dock the ligand into thebinding pocket of a target protein which requires X-raydiffraction or nuclear magnetic resonance spectroscopy

experiments to obtain three-dimensional coordinates ofthe protein structure Additionally there are doubts aboutthe ability of docking approaches to predict the affinity oreven the rank of the structures [6] In spite of these nega-tive examples there exists a remarkable list of successfulstructure-based VS stories [7]

Ligand-based methods operate only on one or severalknown active ligands Therefore ligand-based approachesare the method of choice if no protein structure is availa-

Published 25 August 2009

Journal of Cheminformatics 2009 114 doi1011861758-2946-1-14

Received 15 May 2009Accepted 25 August 2009

This article is available from httpwwwjcheminfcomcontent1114

copy 2009 Jahn et al licensee BioMed Central Ltd This is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby20) which permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

of 23(page number not for citation purposes)

Journal of Cheminformatics 2009 114 httpwwwjcheminfcomcontent1114

ble The underlying assumption of ligand-based approachis that similar structures have similar biological activityMaggiora and Johnson [8] introduced the similarity-prop-erty principle that implies that the chemical similarity canbe related to the biological activity of structures AlthoughMartin et al [9] reported that the correlation betweenstructural similarity and biological activity is not sostrong the correlation between activity and similarity is ofcourse the result of the used similarity measure and variesbetween different methods Thus a wide variety of differ-ent similarity measures have been proposed in recentyears [1011] All methods can be divided into differentclasses concerning the type of information used to calcu-late the similarity between two structures A simple andfast approach is to encode the information in fingerprintsbased on features that are included in the structure Thosetopological or structural fingerprints like the MACCS keys[12] or Daylight fingerprints [13] are often used but lackthe ability to perform scaffold-hoppings [14] A moreelaborate approach is based on molecular graphs in com-bination with maximum common substructure or maxi-mum common edge subgraph isomorphism as used bythe RASCAL algorithm or different reduced graphapproaches [15-17] Feature Trees are another reducedgraph method which uses a reduced tree representation ofimportant molecular fragments like hydrophobic parts orfunctional groups [18] The topological basedMOLPRINT2D approach uses count vectors of atom typesin different layers as molecular atom environment finger-prints and has shown to be useful descriptors for naiumlveBayesian classifiers in VS experiments [19]

Similarity measures based on three-dimensional coordi-nates include geometrical information of arbitraryobjects defined on structures The type of an object variesbetween different methods and can be classified into threetypes pharmacophores molecular shapes or volumesand molecular (interaction) fields Pharmacophore basedmethods generate patterns of distances between prede-fined molecular properties like aromatic systems orhydrogen bond acceptorsdonors [2021] and calculate asimilarity value by a comparison of the correspondingpatterns Molecular shape or volume approaches try tomaximize the overlap of shapes or volumes and deter-mine a similarity value based on the overlap Ballester etal introduced a non-superposition comparison algorithmfor molecular shapes called Ultrafast Shape Recognitionand applied it in VS experiments [22] ROCS uses three-dimensional Gaussian functions to describe the volumeof query structures and to screen databases in search ofstructures with similar shapes [23] Molecular interactionfield methods like GRID [24] or CoMFA [25] were origi-nally introduced in the field of quantitative structure-activity relationship modelling (QSAR) but are also suita-ble for VS experiments [26] FieldScreen a recently pub-

lished method by Cheeseright et al [27] uses fourdifferent types of molecular fields and reduces these fieldsto the local maximal values These values referred to asfield points [28] serve as the descriptors This methodhas shown an improved scaffold-hopping behavior onvarious data sets and improved enrichment rates in com-parison to DOCK

The concept of an optimal assignment of arbitrary objectsof molecular graphs was introduced in the field of chem-informatics by Froumlhlich et al [2930] Just like CoMFAstudies [2531] the optimal assignment on moleculargraphs was used to derive QSAR and quantitative struc-ture-property relationship models Despite several proofsof the usability of this concept on several data sets and dif-ferent problems [293032-34] no experiments have beenconducted to evaluate this idea in the field of ligand-basedVS

There are two objectives of this study First we want toevaluate the existing optimal assignment similarity meas-ure and its flexibility extension on a set of different ligand-based VS experiments Second we introduce two newmethods based on the optimal assignments The experi-ments were designed with emphasis on the evaluation ofthe scaffold-hopping behavior of the methods This wasachieved by using recently suggested VS metrics based onchemotype clustering and extended receiver operatingcharacteristics [35-41] To benchmark the optimal assign-ment methods we compared the results with FieldScreenthe 166 bit MACCS keys and DOCK

The results show an improved early recognition perform-ance for the optimal assignment methods in comparisonto selected literature methods The performance over thecomplete data set suffers from late retrievals of dissimilarchemotypes reducing the arithmetic weighted version ofthe area under the receiver operating characteristic curvevalues One of the optimal assignment methods which isbased on two different assignment steps shows anincreased performance on both types of evaluation Thepresented methods have low computation times conse-quently they can be applied to screen large databases

MethodsIn this section we first summarize the optimal assignmentproblem and give a general definition of the problemAfterwards a detailed description of the different meth-ods which are evaluated in the scope of ligand-based VSfollows

The optimal assignment problem is one of the basic dis-crete optimization problems The fundamental task is tofind the set of assignments that minimizes (or maximizes)the overall cost of the assignments A mathematical

description of the optimization problem is to find a min-imum (or maximum) weight matching in a completeweighted bipartite graph Given two disjoint sets of arbi-trary objects X = (x1 x2 xn) and Y = (y1 y2 ym) with nge m without loss of generality the optimization problemcan be defined as

where π is a subset of the indices 1 n of size m describ-ing the assigned objects of the set X and w(xπ(i) yi) is theweight (cost) of mapping the object xπ(i) onto yi TheKuhn-Munkres algorithm (also called HungarianMethod) [4243] is one popular algorithm to solve thisoptimization problem

Froumlhlich et al introduced the concept of the optimalassignment in the field of cheminformatics as a kernelfunction for attributed molecular graphs [2930]Although this function is not a valid kernel function [44]it can be used as a similarity function based on moleculargraphs

Optimal Assignment KernelThe idea of the Optimal Assignment Kernel (OAK)[2930] is to find a mapping of the atoms of the smallermolecule onto the atoms of the other molecule maximiz-ing the sum of the pairwise atom similarities The first stepof this approach is the calculation of all pairwise atomsimilarities For this purpose a radial basis function (RBF)calculates a similarity value based on physico-chemicaldescriptors of the atoms (Equation 2 where ai and bi arethe ith descriptor value of the atom a and b σ2 representsthe variance of a descriptor) To calculate the similaritybetween two atoms the OAK uses 24 atom and 8 bonddescriptors of the expert system of JOELib2 [4546] For acomplete list of the descriptors we refer to Additional file1 of this study

To improve the chemical interpretation of the similaritycalculation the information of the topological neighborsand the bonds up to a predefined depth is integrated intothe atom-wise similarity calculation The integration isrealized by a recursive atom-wise similarity calculation ofthe neighbors In addition a decay parameter reduces theinfluence of atoms with increased topological distanceGiven two molecules A and B with atoms a1 an andb1 bm the primal form of the optimal assignment prob-lem is modified to Equation 3

The result of Equation 3 is the sum of all atom-wise simi-larity values Hence the function yields higher results ifthe molecules have more atoms To reduce the impact ofthe number of atoms on the final similarity value theresult of the optimal assignment is normalized by Equa-tion 4 to a final value in the range of [0 1]

An example of an optimal assignment on moleculargraphs can be seen in Figure 1

Optimal Assignment Kernel with Flexibility ExtensionThe original OAK computes the similarity of two mole-cules using the sets of atoms augmented with their localneighborhood The idea of the OAK can also be applied toapproximate the similarity of the conformational space oftwo molecules [34] This is achieved by breaking downthe overall molecular flexibility into a set of local flexibil-ities similar to the local atom environments used by theOAK The local flexibility is defined as the spatial posi-tions on which the neighbors of a center (core) atomcould be found depending on the length and angles of thebonds by which they are connected to the core (eg Figure2) Only the flexibility of the second and third orderneighbors are considered and the information is stored inthe core atom

The spatial positioning possibilities are expressed by usinginternal coordinate parameters like bond lengths andangles These parameters are regarded as only dependingon the hybridization of the atoms that connect the corewith the neighbor Non-bonding interactions like electro-static and steric interactions which would also influencethe conformational space are not considered Ring bondsand non-single bonds are generally regarded as rigidTherefore the approach only approximates the real spatialpositioning of the neighbor The details of the parametri-zation of the local flexibility can be found in [34]

The overall similarity of the approximated conforma-tional spaces is computed by augmenting the local atomsimilarity in the original OAK formulation with the simi-larity of the local flexibility (ie the parametrization of thespatial positioning of the neighbors) of the center atoms

cost =isin

=summin ( )

( )( )

Π Xi i

w x y1

Simatom( ) expa b

ai bii

minusminus

⎜⎜⎜⎜

⎟⎟⎟⎟

2 2σ(2)

Sim if

max ( )

a b n m

gtsumππ

ππ )))

⎪⎪⎪

otherwise

SimSim

Sim SimNorm( )( )

( ) ( )A B

A BA A B B

=times

(OAKFLEX) The flexibility similarity between two atoms isthe weighted sum of the flexibility similarities of the sec-ond and third order neighbors The weighted sum isadjusted by an internal parameter Additionally the con-tribution of the flexibility similarity to the original OAKsimilarity is set by a second parameter We used the firstdefault parametrization as suggested by Fechner et al [34](OAK weight = 095 flexibility extension weight = 005second order neighbor weight = 03 and third orderneighbor weight = 07)

Two-Step Hierarchical AssignmentAn accurate investigation of the assignments of the OAKdiscloses failures and shows that the OAK is not able togenerate a substructure preserving assignment of theatoms in some cases These topological errors result in anassignment that scatters the atoms of a substructure overthe complete other molecule The assignments maximizethe overall similarity but from a chemical point of viewsome mappings are problematic An example of an OAK

assignment with topological errors can be seen in Figure3

Our analysis uncovered that the occurrence of these errorsis favored by the existence of multiple aromatic or con-densed ring systems with a small number of substitutionsand the presence of conjugated environments To reducethe number of topological errors we developed a two-stephierarchical assignment approach (2SHA) The basic ideais to perform two different assignments on different typesof objects The first assignment step operates on a sub-structure level and maps similar substructures of the mol-ecules onto each other For this purpose a preprocessingstep is necessary to identify condensed aromatic andconjugated systems (eg Figure 4) The ring detection isdone with a modified biconnected component algorithmthat operates on the molecular graph The identificationof the aromaticity and conjugated environments is doneby the expert system of JOELib2 [4546] To compute thesimilarity between two substructures we use a modified

Optimal atom assignmentFigure 1Optimal atom assignment Optimal atom assignment of two angiotensine-converting enzyme molecules The assignments are based on local atom similarity calculations of the OAK The color of the mapping edges indicates the atom similarity green represents a high similarity whereas red edges indicate a low similarity

Local flexibilityFigure 2Local flexibility Visualization of the local flexibility for one core atom The colored shapes represent possible positions of the equal colored atoms The black core atom is the source of the flexibility objects

Optimal atom assignment with topological errorsFigure 3Optimal atom assignment with topological errors Atom mapping of the OAK on two benzodiazepine derivatives dis-closing topological errors Each of the four intersecting edges maps one atom of the aromatic system on the condensed sys-tem

Fragmentation and assignment of fragmentsFigure 4Fragmentation and assignment of fragments Result of the fragmentation algorithm and the first assignment step The aromatic and condensed systems were mapped onto each other The terminal nitro group forms a conjugated fragment but has no assignment partner and remains unassigned

version of the original OAK that computes a similarityscore without performing the normalization as given inEquation 4 The advantage of omitting this step is areduced computation time because the self-similaritiesSim(A A) and Sim(B B) are not necessary and are notcomputed

The algorithm computes the first optimal assignment cal-culation based on the substructure similarity values andmaps one fragment of the smaller molecule on exactly onefragment of the other molecule An example of such amapping process is shown in Figure 4 on the two benzo-diazepine derivatives used to explain the topologicalerrors The information of the fragment-based mapping isstored in each atom of a mapped fragment This knowl-edge is used to establish constraints for the second assign-ment on the atomic level that reduces the frequency oftopological errors

An additional aim of this approach is the inclusion of geo-metrical information into the similarity calculation Toachieve this we integrate an extra calculation stepbetween the two assignment calculations The initial pre-processing of the structures identifies aromatic systemscondensed rings and conjugated environments whichare all rigid scaffolds of molecules Our idea is to superim-pose the rigid substructures which were mapped in thefirst assignment step To realize the superposition thesubstructures are treated as individual molecules and analgorithm based on quaternions which represents anequivalent method to the well-known Kabsch algorithmwas used [47-49] Note that the necessary information ofmatching atoms of the Kabsch algorithm is the result ofthe similarity calculation between the substructures Thereason for this is the optimal assignment of the atoms ofthe substructures to calculate the similarity Although thealgorithm calculates a superposition minimizing the root-mean-square deviation our approach uses the three-dimensional atom coordinates and integrates the infor-mation into the second assignment step

The second assignment is the final optimal assignment onthe atomic level which includes all information of theprevious steps Each atom has information about its inte-gration into substructures and the mapping of these sub-structures Considering this information all atoms of themolecules can be divided into three different atom classesFirst atoms that are not part of a substructure like linkersor side-chains Second atoms that are part of fragmentswhich were not mapped in the first assignment Thirdatoms that are part of fragments which were mappedUsing these three classes we obtain six different unordered2-element subsets Each subset represents a different caseof the atom-wise similarity calculation using differentinformation and parameters The case in which both

atoms are part of a fragment and both fragments were notmapped cannot occur because of the nature of the optimalassignment problem This reduces the number of differentcases to five which have to be regarded in the similaritycalculation on the atomic level However the case inwhich both atoms are part of a fragment has to be dividedinto two cases depending on whether the fragments weremapped onto each other or not So we need to consider sixcases overall To include this discrimination we inte-grated the following case differentiation into the originalatom-wise similarity calculation

1 Both atoms are not part of a fragment Calculate thesimilarity using the original RBF of the OAK

2 Both atoms are part of a fragment and the fragmentswere not mapped onto each other This case is themain cause of topological errors Therefore themethod has to penalize the similarity score to avoid amapping of those atoms The penalization is done bya decreased σ value in the RBF This modificationsharpens the RBF and reduces the overall score of theatom-wise similarity calculation

3 Both atoms are part of a fragment and the fragmentswere mapped onto each other The method makes useof the geometrical information of the superposition ofthe fragments Given two atoms from the two super-imposed fragments the method calculates a vectorbased on the two atom coordinates and maps the vec-tor into a two-dimensional plane using an isometricmapping The initial and terminal points of the two-dimensional vector represent the coordinates of theatoms in the two-dimensional plane An RBF is cen-tered at each point using the van der Waals radius of

the corresponding atom as σ value (σvdW) Equation 5

calculates the final geometrical similarity value for twomapped atoms a and b using two RBF and the half

atomic distance as input The result of this geo-

metrical atom-wise similarity calculation is integratedinto the original RBF of the OAK in the form of anadditional numerical descriptor

4 One atom is part of a fragment but this fragment isnot mapped The other atom is not part of a fragment

( ) exp expa b

⎛⎝⎜

⎞⎠⎟

⎜⎜⎜⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟⎟⎟⎟

minusσ

⎛⎛⎝⎜

⎞⎠⎟

⎜⎜⎜⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟⎟⎟⎟

2 2σ vdWb

In this case the method tries to map an atom of a rigidfragment onto a side-chain or linker atom The frag-ment was not mapped in the first assignment step butthis mapping is also doubtful and has to be penalizedThe technique for penalizing those mappings is thesame as in the second case

5 One atom is part of a fragment which is mappedThe other atom is not part of a fragment This case isrelated to the previous one but the fragment wasmapped in the first assignment step If the atom-wisesimilarity calculation results in a high similarity valuethe possibility that the atom of the fragment isassigned to an atom of the corresponding fragment ofthe other molecule is reduced Depending on thestructure of the two molecules this increases the pos-sibility of topological errors Therefore the penaltyhas to be higher than in the fourth case

6 Both atoms are part of a fragment but only one frag-ment was mapped This case is also related to thefourth and fifth one but the mapping of two atoms offragments is more reasonable from a chemical point ofview However the risk of topological errors is alsoincreased and therefore we use the same penalizationas in the fourth case

The separate σ values of each case can be set individuallyand have a great impact on the final similarity value Theoptimal parametrization depends on the structures of adata set Based on a chemical interpretation of the differ-ent cases we expected the following relations between thedifferent σ values σ2 ltσ5 ltσ6 le σ4 where the indices rep-resent the number of the case We performed a grid searchin the range [025 05 100] to define the σ values forthe penalization and to validate our hypothesis concern-ing the relation between the values The result of the gridsearch yields the following values σ2 = 25 σ5 = 50 σ6 =40 and σ4 = 40 The empirical results of the grid searchdiffer only in the relation between the σ values of the fifthand sixth case Thus the different cases of the 2SHAmethod show a correlation with the chemical interpreta-tion

The final result of the optimal assignment of the atomsbased on the atom-wise similarity calculations using thedifferentiation with the six cases is shown in Figure 5 onthe same molecules causing topological errors using theoriginal OAK The assignment preserves the mapping ofthe substructures and reduces the occurrence of topologi-cal errors The assignment of the atom from the con-densed ring system onto the nitrogen of the nitro group ispenalized

Optimal Local Atom Pair Environment Assignment

This variant of the optimal assignment approach useslocal atom pair environments (OAAP) There are two fun-damental types of atom pairs topological and geometricalatom pairs We employed the binned geometrical distance

matrix between the three-dimensional coordi-

nates of atoms i j of a molecule in this study The geomet-rical bin size b was set to 1 Aring Therefore this method canbe considered as a geometrical similarity measure Thecomputation time of D has a quadratic complexity withthe number of atoms The matrix is symmetric so it is suf-ficient to compute the upper half of each matrix The dis-tance in the diagonal equals zero

D is used as lookup table to store the information of allgeometrical atom pairs from atom i in a trie A trie is a pre-fix search tree that can be applied to patterns with a read-ing direction In our case we have a star-shaped localatom environment At the root of the trie of atom i thehash code of the atomic symbol i is placed Next the pat-terns of the form

are inserted as ordered triplets Note that symbol can beany atom type or fragment representation and hash(sym-bol) any suitable hash function hash symbol rarrN By usinga trie the collection of patterns is non-redundant becausethe trie is updated whenever a known part of a pattern isinserted If the whole pattern is already contained in thetrie the count is incremented by one For many similaritymetrics it is necessary to know basic properties of a triefor example the total number of patterns or the number ofunique patterns (number of leaves) Therefore our trieimplementation has a method to compute such proper-ties A geometrical local atom pair environment and thecorresponding trie of the marked atom is shown in Figure6

The representation of the local atom environments as triesallows us to apply efficient recursive similarity computa-tions A well-known similarity metric for nominal featuresis the Tanimoto coefficient The computation of the Tani-moto coefficient of two local atom environments isreduced to a comparison of two tries

Let LA LB be the sets of local atom pair environments of

two molecular graphs A B and the tries

i j of the nominal features (atom pair environments ofatoms i j) Now the Tanimoto coefficient can be definedas

= ⎢⎣⎢

⎥⎦⎥

hash symbol i d hash symbol jij( ( )) ( ( ))

l L l LA A B Bi jisin isin

The overall molecular similarity is computed by the scoreof the optimal assignment of the local atom pair environ-

ments The cost of computing the local similarity matrix isO(nml) for n m atoms of the molecules A B and l leavesof the larger trie The lookup for a pattern has a constantcomputation time because the depth of the tries is fixed

We computed the OAAP similarity between two structureson single low-energy conformations In spite that it ispossible to extend this method to conformational ensem-

Sim( )l llAi lB j

lAi lB j

A Bi j=

cup(6)

Optimal atom assignment using the two-step hierarchical assignmentFigure 5Optimal atom assignment using the two-step hierarchical assignment Result of an optimal assignment using the two-step hierarchical assignment approach with the case differentiation of the pairwise atom similarity calculation The hierar-chical assignment reduces the number of topological errors and shows a substructure preserving mapping The mapping of the carbon atom of the condensed ring system onto the nitrogen of the nitro group is an example of the sixth case and results in a penalized mapping

bles This could be accomplished by simply averaging thedistances as a case in point The approach was imple-mented using the Chemistry Development Kit (CDK)library [5051]

Parameters and Computation TimeAll presented methods have internal parameters whichallow modifications of the similarity measures Thesemodifications can improve the results on a specific prob-lem and data set The optimization of the parameters is acomplex process making extensive methods like gridsearches necessary However one aim of this work was toevaluate the usability of optimal assignment methods on

molecular graphs for ligand-based VS experiments There-fore we used the described standard parameters for eachmethod which were constant for all data sets to obtaincomparable results of the overall performance

The computation time of the methods depends on thenumber of atoms which have to be mapped in the assign-ment step In addition each method has its own type ofpreprocessing and mining the chemical information Thisyields differences in the performance depending on thedata set Instead of giving a complete overview of the com-putation time for each method and data set we report anaveraged computation time as it can be expected for drug-

Binned geometrical distances spheres and trieFigure 6Binned geometrical distances spheres and trie The upper left figure shows the spheres of the binned geometrical dis-tances of 10 20 and 30 Aring for the centered carbon atom The sphere of the binned geometrical distance of 00 Aring (distances in the range [0010)) is not visualized as individual sphere because it contains no atoms The upper right figure illustrates the resulting local atom pair environment of binned geometrical distances For simplicity only the distances to non-carbon atoms are displayed The lower figure visualizes the corresponding trie of geometric atomic distances of the annotated atom in the upper figures The root and leaves are labeled with the corresponding atom type The leaves contain additionally the total number of occurrences in the local atom pair environment

like structures for each method The original OAK has anaveraged performance of 2734 plusmn 340 similarity calcula-tions per second The flexibility extension OAKFLEX yields4103 plusmn 732 calculations per second This speed-up of theOAKFLEX is the result of a reduction of the used featuresThe OAAP method achieves a performance of 5149 plusmn1807 similarity calculations per second (computationtime for the atom typing is not included) The 2SHAmethod has the highest computation time because of thetwo assignment steps and the superposition of the frag-ments However the approach is capable to perform1404 plusmn 178 calculations per second All values weremeasured on a Intel Core2Duo CPU with 2 GHz usingone core and 1 GB memory Java 160_07 Ubuntu8043 Linux kernel 2624

ExperimentalThe setup of the VS experiments is based on the work ofCheeseright et al in which the FieldScreen approach isintroduced and compared against a docking algorithmWe decided to use this workflow to create a commonsetup of the data sets with the objective to achieve compa-rable results

Data Sets and PreparationAll ligand-based VS experiments were performed using amodified version of the Directory of Useful Decoys(DUD) Release 2 [5253] The DUD contains knownactives and mimetic [39] decoys for 40 target proteins Thiscollection of data sets was compiled to serve as an unbi-ased community benchmark database for the evaluationof docking algorithms Therefore the original version ofthe DUD is not suited for ligand-based VS experiments[41] A modification of the active structures of the DUDdatabase suggested by Good and Oprea [35] solves thisproblem by applying a lead-like filter [54] and a clusteralgorithm [55] The clustering identifies analogue struc-tures and can be used to reduce the bias of analogue ortrivial enrichment Additionally each cluster can be seenas an individual chemotype and allows the analysis of thescaffold-hopping behavior of the methods

The active structures for the VS experiments were preparedas follows We obtained the modified active structures ofGood and Oprea from the DUD site [38] These data setscontain only topological information about the struc-tures Thus we used CORINA3D [56] to generate initialgeometrical seed structures Those were refined with Mac-roModel 96 [57] using the OPLS 2005 force field and thelimited Broyden-Fletcher-Goldfarb-Shanno optimizationmethod with 5000 iterations and a gradient criterion of00001 RMSD for the atomic movement between two iter-ations The final data sets were used as active structures

The original decoys of the DUD release 2 were obtainedfrom the DUD site [52] Using these decoys together withthe filtered actives from Good and Oprea would intro-duce a bias of an artificial enrichment based on the dis-tinction of physical property values To remove thisartificial enrichment bias we applied the same lead-likefilter used by Good and Oprea to filter the active struc-tures on the decoys Thus the AlogP value and the molec-ular weight was calculated for each structure with dragonX14 [58] All structures with an AlogP value gt = 45 (55 fornuclear hormone receptor data sets) or a molecular

weight were removed The initial three-

dimensional coordinates were further optimized usingthe same configuration of MacroModel as for the activestructures

The described preparation protocol was applied on all 40data sets of the DUD The SD files of the prepared activesand decoys are directly obtainable from the DUD site[52]

To obtain results that are not biased by a low number ofchemotypes it is advisable to use data sets with a suffi-cient number of chemotypes Therefore we used the samesubset of DUD targets as used by Cheeseright et al Thissubset has a minimum of 17 and a maximum of 44 chem-otypes A comprehensive overview of the data sets can beseen in Table 1

Ligand-based VS methods need a biologically active struc-ture that serves as a query in the experiment In the workof Huang et al [53] complexed crystal structures wereused to identify the binding sites for the docking algo-rithm Cheeseright et al [27] used the ligands of the samecomplexed crystal structures as query structures to evalu-ate the FieldScreen approach To allow a comparison withFieldScreen we extracted the same bounded ligands fromthe protein data bank [5960] and corrected the bondslengths with MacroModel 96 These structures serve asqueries for the VS experiments in this study

Evaluation of VS PerformanceIn the following section we describe the metrics for theevaluation of VS experiments used in our work Addition-ally we explain why we used these metrics and show theircorrelation to well-known established metrics

The evaluation and comparison of VS performance is animportant but also an error-prone process [394061] Inrecent years a plethora of evaluation metrics have beenpublished each metric with its strengths and weaknessesFor an overview we refer to the work of Kirchmair et al

gt= 450 gmol

[62] There is no standard protocol for the analysis andpublication of VS results Thus the comparison of differ-ent works is a challenging task It is necessary to analyzethe results regarding three different aspects to character-ize the performance of a VS method The first aspect is theearly recognition problem and originates from real-worldscreening applications where only top ranked moleculeswere selected for testing in biological assays because ofhigh costs and expenditure of time Truchon and Baylydeveloped for this purpose the BEDROC score [63] whichuses a decreasing exponential weighting function Thedrawback of this metric is its lack of interpretability andthe dependency of an extrinsic variable [39] To resolvethese problems Jain and Nicholls suggested to report theenrichment values at predefined false positive fractions[40] These values can be obtained by dividing the sensi-tivity value by the fraction of false positives Equation 7represents the interrelation between the ROC value andthe ROC enrichment for a predefined false positive frac-

tion and is the number of

true positives (actives retrieved) and false positives(decoys retrieved) respectively found in the range con-

taining a false positive rate of X TP TN FP and FN arethe entries of the confusion matrix ranking X false posi-tives

We decided to report the ROC enrichments at false posi-tive rates of 05 10 2 and 5 as suggested by Jainand Nicholls [40]

The second aspect of characterizing the performance of VSmethods is the evaluation of the performance consideringthe complete data set This overall performance can be vis-ualized plotting the receiver operating characteristic(ROC) which should always be part of VS results [3964]However a comparison of many ROC plots is notstraightforward thus we report the area under the ROCcurve to provide a better overview of the results Equation8 shows the calculation of the area under the ROC curve(AUC) where Nactives and Ndecoys are the numbers of active

Nactives selectedX Ndecoys selected

ROC enrichment X

actives selectedX

total actives

decoys

N selectedX

total decoys

TPTP FN

FPTN FP

sensitivityspec

=minus1 iificity

y value ROC pointx value ROC point

Table 1 Data sets

target number actives number decoys number clustersa PDB codeb

acec 46 1796 19p 1o86ached 100 3859 18 1evecdk2e 47 2070 32 1ckpcox2f 212 12606 44 1cx2egfrg 365 15560 40 1m17fxah 64 2092 19 1f0rhivrti 34 1494 17 1rt1inhaj 57 2707 23 1p44p38k 137 6779 20 1kv2pde5l 26 1698 22 1xp0pdgfrbm 124 5603 22 1t46srcn 98 5679 21 2srcvegfr2o 48 2712 31 1fgi

Overview of the used data sets containing the number of actives decoys different chemotype clusters and the PDB code of the complexed crystal structure which contains the search queryaNumber of clusters using the reduced graph algorithm from Barker et al [55]b PDB code of the complexed crystal structures from which the search queries were takenc Angiotensine-converting enzymed Acetylcholinesterasee Cyclin-dependent kinasef Cyclooxygenase-2g Epidermal growth factor receptorh Factor Xai HIV reverse transcriptasej Enoyl ACP reductasek P38 mitogen activated proteinl Phosphodiesterase 5m Platelet derived growth factor receptor kinasen Tyrosine kinaseo Vascular endothelial growth factor receptorp The molecule with the ZINC ID 03814157 does not contain a ring and therefore it is assigned to a dummy cluster forming one additional cluster

and decoy structures respectively is the

number of decoys that are higher ranked than the ithactive structure

The last point of characterizing the performance of VSmethods is the evaluation of the retrieval of new scaffoldsAs already mentioned ligand-based VS suffers from thebias caused by enrichments of analog structures Toreduce the influence of this overestimated enrichments onthe result metrics Clark and Webster-Clark recommendedan adaption of the ROC and AUC calculation [36] Theyproposed an arithmetic and a harmonic weighting schemeto reduce the influence of structurally similar structuresUsing the arithmetic weighting each structure has aweight that is inversely proportional to the size of the clus-ter it belongs to Therefore the weight of all structurestaken from one cluster is equal This leads to the equation

where wij is the weight of the ith structure taken

from the jth cluster with Nj structures As a result of this

the true positive value of the sensitivity is no more thenumber of true active structures seen in a predefined frac-tion of the data set Instead it is the sum of the weights

resulting in the equation The

value of is 1 if the ith structure of the jth cluster is

contained in the fraction of the data set otherwise it is 0Nclusters is the number of clusters in the data set and Nj is

the number of structures in the jth cluster

The harmonic weighting scheme uses weights represent-ing the ranks of the active structures within each clusterTop-ranked structures of a cluster have a higher weightmotivated by the idea that those structures have the high-est information content The weight of the structures is

defined as where i is the ith structure of the jth

cluster in decreasing rank order A recently publishedanalysis of the weighting schemes conducted by Mackeyand Melville discloses that the arithmetic weightingscheme has better properties and is more robust than theharmonic scheme [65] Therefore we decided to use thearithmetic weighting scheme Integrating this scheme intothe basic ROC enrichment leads to an arithmetic weightedversion (awROC enrichment) given by Equation 9

This weighting scheme can also be embedded into theAUC calculation The modified version (awAUC) can beseen in Equation 10

To account for intrinsic variances [39] of the active anddecoy structures we bootstrapped the data sets andapproximated the error of the awROC enrichments andawAUC We performed 25000 iterations removing ran-domly 20 of the data set The standard deviation of the25000 metric results were used as an error of the resultusing the complete data set

Comparison against Other MethodsThe docking study by Huang et al [53] provides theenergy scores of the DOCK algorithm for each target of theDUD We used these rankings to calculate the awROCenrichments and awAUC on the same data sets as ourapproaches Hence extensive properties such as active todecoy ratio and data set size are identical and permit acomparison of the results A direct comparison of dockingalgorithms with ligand-based VS tools is critical becauseof the different techniques and information included inthe methods We calculated the results of the dockingalgorithm with the objective to provide an impression ofthe performance of the DOCK algorithm using theawROC and awAUC evaluation metrics

We considered the results of the FieldScreen approach toinclude a sophisticated ligand-based method To reducethe information content solely to the informationobtained from the query structure the results without theexcluded volume of the protein were used Although thedata set sizes and active to decoy ratios of the FieldScreenresults deviate from our setup the use of awROC enrich-ments ensures the comparability because the ROC enrich-ment is independent from extensive quantities [39]

Introducing a new method to a well-established field likeligand-based VS it is necessary to justify it by comparingthe results to a common approach To meet these require-ments we compared our approaches to the 166 bit

N idecoys seen

AUCactives

decoys seen

decoys

actives

= minus sum11

wij N j= 1

TP X= sumsum wij iji

N jclusters α

α ijX

wij i= 1

awROC Enrichment X

XN jclusters

clusters

sumsum wij ijijN

ss selectedX

total decoys

awAUCclusters

decoys seen

total decoys

= minus sum11

wijNij

N jtters

sum(10)

MACCS keys [12] using the Tanimoto coefficient to calcu-late a similarity value Although the results of the MACCSkeys are inferior in comparison to other fingerprints [66]we provide the results of the MACCS keys to give a base-line for the performance on the data sets The MACCS keyswere calculated using the CDK 115 [5051]

Results and DiscussionThe results of the four optimal assignment based methodsand the three comparison approaches are organized asfollows The awROC enrichments for each target at a pre-defined false positive fraction of 05 10 20 and50 can be seen in the Tables 2 3 4 and 5 The awAUCvalues are compiled in Table 6 Each value in the Tables 23 4 5 and 6 has a standard deviation which is the resultof the error approximation using the bootstrappingapproach The bold value in each row indicates the bestresult concerning the data set and evaluation metric Thelast row in each Table contains the rank of each methodaveraged over all data sets These values describe the over-all performance of one method considering the results ofall other methods

In addition to the already mentioned VS metrics Addi-tional file 2 contains the complete ranking of all structuresand several VS metrics for each method and data set

Virtual Screening ResultsThe results of Table 2 show the awROC enrichments at afalse positive rate of 05 which implies that the earlyrecognition problem is focused in this evaluation Thebest overall performance is achieved by the (OAAP) Theresults of the original OAK and the flexibility extensionare similar and only marginally worse than the OAAP The2SHA approach obtains an average rank that lies inbetween the other optimal assignment methods and

FieldScreen The MACCS keys show an increased perform-ance on the fxa and vegfr2 data set but the overall per-formance is lower than all other methods except theDOCK algorithm The ranking of the methods at higherfalse positive fractions changes but there is a tendencyobservable The 2SHA yields the best average rank at 12 and 5 Additionally the average rank decreases withan increasing false positive fraction FieldScreen shows asimilar behavior resulting in comparable results at 5 incomparison to OAK and OAAP The average ranks of theMACCS keys OAK OAKFLEX and OAAP increased withhigher false positive fractions by a value of asymp 07 The aver-age rank of the DOCK algorithm alternates between 654and 554 showing no dependency on the false positiverate

The increasing performance of the FieldScreen and 2SHAapproach can be explained by the ability to perform scaf-fold-hoppings The other optimal assignment methodshave a good early enrichment as seen in Table 2 but theseenrichment values are the result of retrievals of chemicalsimilar scaffolds with respect to the search query Thosemethods are not able to hold up the high retrieval rates ofthe chemotypes and the discovery of new chemotypesstagnates The influence of this bias is reduced withincreasing false positive fraction and the more robustmethods became apparent Regarding the technique ofFieldScreen the improved scaffold-hopping behavior isprobably the result of the conformational samplingtogether with the similarity calculation which is based onmolecular fields The 2SHA method yields a similar effectby the identification of rigid scaffolds their assignmentand superposition This procedure can be seen as aninherent heuristic of a flexible alignment of the searchquery and the screening structures The method assumesthat the linker between two fragments has the flexibility to

Table 2 awROC Enrichments at 05

target DOCK FieldScreen MACCS OAK OAKFLEX 2SHA OAAP

ace 170 plusmn 62 147 557 plusmn 85 945 plusmn 87 945 plusmn 96 735 plusmn 86 305 plusmn 84ache 00 plusmn 00 167 191 plusmn 46 245 plusmn 48 249 plusmn 48 248 plusmn 49 231 plusmn 47cdk2 40 plusmn 61 75 94 plusmn 18 94 plusmn 18 94 plusmn 18 94 plusmn 37 241 plusmn 47cox2 19 plusmn 06 488 170 plusmn 30 257 plusmn 36 425 plusmn 45 383 plusmn 64 684 plusmn 55egfr 76 plusmn 22 524 403 plusmn 32 566 plusmn 39 474 plusmn 36 1036 plusmn 69 531 plusmn 43fxa 151 plusmn 68 00 300 plusmn 74 200 plusmn 63 100 plusmn 46 100 plusmn 46 200 plusmn 63hivrt 44 plusmn 18 400 220 plusmn 55 201 plusmn 56 201 plusmn 56 201 plusmn 56 348 plusmn 80inha 00 plusmn 00 567 499 plusmn 77 318 plusmn 87 439 plusmn 84 579 plusmn 135 607 plusmn 74p38 00 plusmn 00 37 10 plusmn 04 222 plusmn 59 188 plusmn 50 282 plusmn 63 107 plusmn 43pde5 44 plusmn 43 68 43 plusmn 33 86 plusmn 40 86 plusmn 40 43 plusmn 23 00 plusmn 00pdgfrb 00 plusmn 00 273 420 plusmn 74 470 plusmn 67 444 plusmn 67 439 plusmn 67 577 plusmn 68src 00 plusmn 00 137 00 plusmn 00 57 plusmn 12 92 plusmn 12 43 plusmn 16 47 plusmn 10vegfr2 62 plusmn 30 129 187 plusmn 45 63 plusmn 16 125 plusmn 34 125 plusmn 34 125 plusmn 41

avg rank 627 408 435 323 327 354 312

awROC Enrichments at 05 false positive fraction of DOCK FieldScreen MACCS keys and the optimal assignment methods

perform the translation and rotation which is needed forthe superposition of the assigned fragments Additionallythis assumes a comparable length of the linkers in bothmolecules Considering these facts the inherent flexiblealignment of the method is a rough approximation but itincreases the ability to do scaffold-hoppings withoutthe use of a time-consuming conformational sampling

A comparison of the OAK with its flexibility extensionshows a comparable overall performance at 05 The gapbetween the methods increases at higher false positivefractions concerning the average rank In contrast to theaverage rank the difference on each data set is marginalwith an advantage for the OAK

As a result the OAK obtains lower ranks and the differ-ence of the average rank increases Major differences

regarding the performance on each data set can only beseen at 05 false positive rate Surprising is the improvedperformance of the OAKFLEX on the cox2 data set Theinhibitors of the cox2 are usually rigid scaffolds consistingof two or three ring systems Hence there should not be aperformance gain using an approach that integrates thelocal flexibility of the atom neighborhood The activestructures of the cox2 data set have terminal substitutionswith a depth of two or three bonds These substitutionsform the flexible part of the molecules and have an opti-mal depth for the local flexibility approach of the OAK-

FLEX The flexibility information of the terminalsubstitutions is stored in the atom which is part of the ringsystem Therefore the method has an improved ability todistinguish different substitution stages like ortho metaand para substitutions Additionally the existence of rigidand flexible parts with an ideal length yields a better dis-

avg rank 554 400 473 365 404 273 323

avg rank 654 362 493 358 412 281 373

tinction between rigid and flexible parts of a moleculeThe last point was already observed in QSAR experimentsreported by Fechner et al [34] The active structures of theegfr data set contain also a rigid basic scaffold with varioussubstitutions among the molecules Based on the previousobservation the OAKFLEX should attain a higher awROCenrichment as the OAK Obviously it is not the case as itcan be seen in Table 2 This contradictory result can beexplained by the chemical nature of the search queryErlotinib has two flexible substitutions each with fiveheavy atoms forming a chain These substitutions increasethe number of rotatable bonds to 11 The average numberof rotatable bonds in the active data set of egfr is 407 Forthat reason the OAKFLEX has difficulties to map the 10flexible atoms of the search query onto the atoms of theactive structures These mappings reduce the overall simi-larity and performance on the egfr data set An evidence

for this hypothesis is the result on the vegfr2 data set Thesearch query of the vegfr2 data set has five rotatablebonds whereas the actives have a mean value of 42Therefore the penalty of the OAKFLEX mapping is reducedand results in a higher enrichment rate because of the dif-ferent flexibility properties The OAAP method achievesgood results on the cox2 cdk2 and pdgfrb data set Theseresults are a consequence of the binning approach on thegeometrical distances The binning represents a fuzzy geo-metrical distance measure that allows minor changes inthe basic scaffold of a structure without reducing the sim-ilarity value This increases the ability to perform scaf-fold-hoppings to related chemotypes with a similar basicstructure

The awROC enrichment of the 2SHA algorithm at 05on the egfr data set is remarkable The enrichment on this

avg rank 554 369 504 369 419 223 362

Table 6 awAUC values

avg rank 427 342 450 404 462 308 392

awAUC values of DOCK FieldScreen MACCS keys and the optimal assignment methods

data set outperforms the results of the other methods Toelucidate this result we analysed the similarity calcula-tions of the 2SHA on the egfr structures in detail Theactive structures have a common property that favors thecomputation technique of the 2SHA Each active clusterhas a so-called parent molecule This molecule is thesmallest molecule of the cluster regarding the number ofheavy atoms [38] We compared these parent moleculesand identified a common basic scaffold which is includedin 32 of 40 clusters and in the search query Figure 7shows four examples of this basic scaffold which consistsof a ring system and a condensed heterocyclic system withtwo or three rings These systems are aromatic in the querystructure and in 25 out of 32 clusters The 2SHA methodidentifies the aromatic systems and maps the correspond-ing systems onto each other in the first assignment stepThe superposition of the fragments yields minimal dis-tances of the mapped atoms because of the planarity ofaromatic systems This results in a high similarity scorebased on this common basic scaffold and is an explana-tion of the high awROC enrichment value at 05 Ourhypothesis confirms the results of the 2SHA on egfr athigher false positive rates The method still has the highestenrichment rates but in comparison to the other methodsthe difference is significantly reduced

The awAUC values of Table 6 represent the performanceof each method using the complete data set The 2SHAapproach yields the best result with an average rank of308 The FieldScreen method performs better than theremaining optimal assignment techniques OAAP OAKand OAKFLEX The DOCK algorithm improves the resultsand obtained an average rank of 427 and achieves a betterresult than the MACCS keys Table 6 confirms with thetendency over the awROC enrichments and assigns thebest scaffold-hopping performance to the 2SHA andFieldScreen The reduced performance of the other opti-mal assignment methods can be explained by the resultsof the chemotype enrichment Figure 8 shows the chemo-type enrichment of the four optimal assignment methodsthe MACCS keys and the random performance on thep38 data set All optimal assignment methods have a com-parable chemotype retrieval until asymp 50 of the chemo-types are retrieved which is consistent with the results ofthe Tables 2 3 4 5 From that point on the retrieval ofthe OAK OAKFLEX and OAAP stagnates and their chemo-type enrichment is reduced to the level of the random per-formance The reduced enrichment performance of thosethree optimal assignment methods at higher false positiverates explains the improved results of the FieldScreenapproach regarding the awAUC values Only the 2SHA

Example of egfr clustersFigure 7Example of egfr clusters The figure visualizes four parent structures of different egfr clusters Although these structures belong to different clusters they share a common basic scaffold

approach has the ability to achieve a higher generalizationof the chemotypes which results in an increased retrievalrate over the complete data set

Correlation of Optimal Assignment MethodsThe presented optimal assignment methods are based onthe same functional principle On the one hand they usedifferent representations and algorithms but on the otherhand the final similarity computation between two mol-ecules is the result of the assignment algorithm Hence itis possible that the two new optimal assignment methods(2SHA and OAAP) are only marginally different com-pared to the two existing methods (OAK and OAKFLEX)The results in the previous section are an evidence thatthere are differences in the rankings of active and inactivestructures Another interesting question is the order of thechemotype discovery of the individual methods Differentsequences of the chemotype retrieval indicate that themethods explore the chemspace in various directionsTherefore we analyzed the sequences of all optimalassignment methods on all data sets We calculated theKendalls τ rank correlation coefficient for each data set toobtain a measure for the correlation of the chemotyperetrieval between two methods This results in 13 correla-tion coefficients for each pair of methods The results areillustrated in Figure 9 in form of box plots

The correlation analysis shows that the 2SHA and OAAPapproach have the lowest median of the correlation coef-

ficients This is the result of a different calculation processto determine the similarity between two atoms Theincreased correlation between 2SHA and OAKFLEX as wellas between 2SHA and OAK can be explained by a com-mon subset of descriptors and the fact that the similaritybetween two atoms in all three methods is calculated by aRBF 2SHA has a higher correlation with the OAKFLEX thanwith the OAK This indicates the already mentioned inher-ent flexibility consideration of the method In contrast tothe 2SHA the OAAP method shows a superior correlationto the OAK method without flexibility information Thestrong correlation between OAK and OAKFLEX is not sur-prising because the OAKFLEX is an extension of the OAKand the standard parametrization of the OAKFLEX uses aweight of 095 for the OAK similarity [34]

To further analyse the optimal assignment methods andtheir characteristic behaviour in chemspace we con-ducted the same experiment and evaluated the correlationbetween the optimal assignment methods and DOCK orthe MACCS keys The results can be seen in Figure 10

The results show that the chemotype retrieval sequence ofthe optimal assignment methods has no correlation withthe sequence retrieved by DOCK The correlation to themethods considering flexibility information (2SHA andOAKFLEX) is marginal increased in comparison to theother methods This is probably the result of the flexible-ligand method that is used by DOCK allowing the incor-

Chemotype enrichment and scaffold-hoppings on p38Figure 8Chemotype enrichment and scaffold-hoppings on p38 The left figure visualizes the chemotype enrichment of the four optimal assignment methods the MACCS keys and the random performance on the p38 data set A chemotype is consid-ered as retrieved if one structure of the chemotype is ranked The right figure shows five different chemotypes that were only retrieved by the 2SHA method ranking 25 of the data set

poration of the flexibility [53] The overall low correlationis not surprising given the fundamental differences of lig-and-based and docking approaches The correlation coef-ficients between the optimal assignment methods and theMACCS keys are comparable to the coefficients in the pre-vious experiment between two optimal assignment meth-ods From these results it follows that the similarity of thechemotype retrieval sequence between the MACCS keysand an optimal assignment method is comparable to thesimilarity of two optimal assignment methods Thereforethe optimal assignment methods and the MACCS keys

explore the chemspace in a comparable direction but theVS results show that each method has its strengths on dif-ferent data sets

The enrichment results of the DOCK algorithm in theTables 2 3 4 5 6 show an inferior performance in com-parison to the other methods The findings of the chemo-type retrieval sequence disclose an interesting property ofDOCK The uncorrelated sequences demonstrate thatDOCK explores the chemspace in an orthogonal mannerwith respect to the other methods Accordingly DOCK

Rank correlation coefficient of the chemotype discovery between optimal assignment methodsFigure 9Rank correlation coefficient of the chemotype discovery between optimal assignment methods The diagram illustrates all pairwise rank correlation coefficient of the order of the chemotype discovery between two optimal assignment methods A high correlation indicates that the order of the chemotype discovery between two methods is similar Each box plot was created using the correlation of the order of the chemotype discovery on each data set used in this study

retrieves different chemotypes that will only be found bythe optimal assignment methods and the MACCS keys ifa large part of the data set is ranked

ConclusionWe have introduced the concept of the optimal assign-ment approach in the field of ligand-based VS We pre-sented two new optimal assignment methods The OAAPmethod is based on geometrical distance atom pairs The2SHA computes two assignment steps Each method usesthe optimal assignment approach on different types ofobjects showing the wide field of application of the

approach Another advantage is the interpretability of themappings using visualizations as shown in Figure 1

We evaluated the methods on 13 modified DUD data setscovering a wide range of different molecules and chemo-types In order to grade the results of the optimal assign-ment approaches we compared the results with a state-of-the-art ligand-based VS method that uses a conforma-tional sampling and molecular fields Additionally weprovided the results of the 166 bit MACCS keys in combi-nation with the Tanimoto coefficient and the perform-ance of the DOCK algorithm on the same data sets The

Rank correlation coefficient of the chemotype discovery between optimal assignment methods and DOCKMACCS keysFigure 10Rank correlation coefficient of the chemotype discovery between optimal assignment methods and DOCKMACCS keys The boxplots show the correlation coefficients of the order of the chemotype discovery between the optimal assignment methods and DOCK as well as the MACCS keys The experimental setup is equal to the previous correlation anal-ysis between two optimal assignment methods

results show an improved early enrichment performanceof all optimal assignment methods Analyses show thatthese early enrichments are the results of the retrievals ofchemical similar chemotypes with respect to the searchquery With increasing false positive rates this bias isreduced and the gap between the optimal assignmentmethods and the literature results is successively closedOnly the 2SHA approach has the ability to perform thenecessary scaffold-hoppings to maintain a robustenrichment and obtained the overall best results The cal-culations of the 2SHA method approximate an implicitflexible alignment of the substructures and enables theretrieval of chemotypes that have larger distances to thequery in chemspace Further research will be spent on the2SHA method in combination with multiple queryscreenings The fragmentation and the two assignmentsteps enable a substructure based data fusion on the frag-ment level

The first assignment step can disclose fragment classeswith similar fragments of two or more search queriesThese classes can be used to enumerate all combinationsof the fragments resulting in an increased chemotype cov-erage This procedure can further improve the scaffold-hopping performance and contribute to the robustnessof the method

The presented methods are fast enough to screen morethan one million structures within one day on a singlecore CPU This high throughput performance qualifies thepresented methods to perform screenings on large data-bases with the aim to select relevant subsets for a givenproblem These subsets can be used in combination withbiological screenings or more time-consuming dockingalgorithms

Competing interestsThe authors declare that they have no competing interests

Authors contributionsAJ filtered and prepared the DUD data sets developed thetwo-step hierarchical assignment approach and the flexi-bility extension of the Optimal Assignment Kerneldesigned the experimental setup performed the virtualscreening experiments performed the correlation experi-ments of the methods drafted the manuscript and partic-ipated in the discussion of the results GH developed theoptimal assignment based on atom pairs helped to draftthe manuscript and participated in the discussion of theresults NF helped to develop the two-step hierarchicalassignment approach and the flexibility extension of theOptimal Assignment Kernel participated in drafting themanuscript and participated in the discussion of theresults AZ participated in the development of the originalOptimal Assignment Kernel participated in the design of

the experimental setup and participated in the discussionof the results All authors read and approved the finalmanuscript

Additional material

AcknowledgementsThe authors want to thank John J Irwin and Brian Shoichet from the Uni-versity of California San Francisco for adding the filtered and optimized DUD data sets to the official DUD site [52] This work has been partially funded by Nycomed GmbH Konstanz Germany

References1 Bajorath J Integration of virtual and high-throughput screen-

ing Nat Rev Drug Discov 2002 1(11)882-8942 Varnek A Tropsha A Eds Chemoinformatics Approaches to Virtual

Screening The Royal Society of Chemistry 2008 3 Kuntz ID Blaney JM Oatley SJ Langridge R Ferrin TE A geometric

approach to macromolecule-ligand interactions J Mol Biol1982 161(2)269-288

4 Shoichet BK Bodian DL Kuntz ID Molecular docking usingshape descriptors J Comput Chem 1992 13(3)380-397

5 Korb O Stuumltzle T Exner TE Empirical Scoring Functions forAdvanced Protein-Ligand Docking with PLANTS J Chem InfModel 2009 4984-96

6 Warren GL Andrews CW Capelli AM Clarke B LaLonde J LambertMH Lindvall M Nevins N Semus SF Senger S Tedesco G Wall IDWoolven JM Peishoff CE Head MS A Critical Assessment ofDocking Programs and Scoring Functions J Med Chem 200649(20)5912-5931

7 Cavasotto CN Orry AJW Ligand docking and structure-basedvirtual screening in drug discovery Curr Top Med Chem 20077(10)1006-1014

8 Maggiora GA Johnson MA Eds Concepts and Applications of MolecularSimilarity Wiley-Interscience 1990

9 Martin YC Kofron JL Traphagen LM Do Structurally SimilarMolecules Have Similar Biological Activity J Med Chem 200245(19)4350-4358

10 Bender A Jenkins JL Scheiber J Sukuru SC Glick M Davies JW HowSimilar Are Similarity Searching Methods A Principal Com-

Additional file 1Complete list atom and bond descriptors of the OAK The file OAK_descriptorlistpdf lists all 32 descriptors of the OAK used in this studyClick here for file[httpwwwbiomedcentralcomcontentsupplementary1758-2946-1-14-S1pdf]

Additional file 2Archive of the result files The file Resultstargz is a Gzip compressed Tar archive containing the results file of each method for each data set The result files are tab-separated plain text files including the following infor-mation method name active data set with size cluster information and the distribution of the molecules over the clusters decoy data set with size ratio activedecoy AUC awAUC BEDROC scores for predefined alpha values as suggested by Truchon and Bayly [63] relative enrichment factor [67] ROC enrichments awROC enrichments at predefined false positive fractions chemotype enrichment ROC data points and the ranking of each structureClick here for file[httpwwwbiomedcentralcomcontentsupplementary1758-2946-1-14-S2zip]

ponent Analysis of Molecular Descriptor Space J Chem InfModel 2009 49108-119

11 Sheridan R Why do we need so many chemical similaritysearch methods Drug Discov Today 2002 7(17)903-911

12 Symyx Software MACCS structural keys San Ramon CA 13 Daylight Chemical Information Systems Inc [httpwwwday

lightcom]14 Good AC Hermsmeier MA Hindle S Measuring CAMD Tech-

nique Performance A Virtual Screening Case Study in theDesign of Validation Experiments J Comput-Aided Mol Des 200418(7)529-536

15 Raymond JW Willett P Effectiveness of graph-based and fin-gerprint-based similarity measures for virtual screening of2D chemical structure databases J Comput-Aided Mol Des 20021659-71

16 Gillet VJ Willett P Bradshaw J Similarity Searching UsingReduced Graphs J Chem Inf Comput Sci 2003 43(2)338-345

17 Harper G Bravi GS Pickett SD Hussain J Green DVS TheReduced Graph Descriptor in Virtual Screening and Data-Driven Clustering of High-Throughput Screening Data JChem Inf Comput Sci 2004 44(6)2145-2156

18 Rarey M Dixon JS Feature trees a new molecular similaritymeasure based on tree matching J Comput-Aided Mol Des 199812(5)471-490

19 Bender A Mussa HY Glen RC Reiling S Similarity Searching ofChemical Databases Using Atom Environment Descriptors(MOLPRINT 2D) Evaluation of Performance J Chem Inf Com-put Sci 2004 44(5)1708-1718

20 Saeh JC Lyne PD Takasaki BK Cosgrove DA Lead HoppingUsing SVM and 3D Pharmacophore Fingerprints J Chem InfModel 2005 45(4)1122-1133

21 von Korff M Freyss J Sander T Flexophore a New Versatile 3DPharmacophore Descriptor That Considers Molecular Flex-ibility J Chem Inf Model 2008 48(4)797-810

22 Ballester PJ Finn PW Richards WG Ultrafast shape recognitionEvaluating a new ligand-based virtual screening technologyJ Mol Graphics Modell 2009 27(7)836-845

23 Rush TS Grant JA Mosyak L Nicholls A A Shape-Based 3-D Scaf-fold Hopping Method and Its Application to a Bacterial Pro-tein-Protein Interaction J Med Chem 2005 48(5)1489-1495

24 Goodford PJ A computational procedure for determiningenergetically favorable binding sites on biologically impor-tant macromolecules J Med Chem 1985 28(7)849-857

25 Cramer RD Patterson DE Bunce JD Comparative MolecularField Analysis (CoMFA) 1 Effect of Shape on Binding ofSteroids to Carrier Proteins J Am Chem Soc 1988110(18)5959-5967

26 Ahlstrom MM Ridderstrom M Luthman K Zamora I VirtualScreening and Scaffold Hopping Based on GRID MolecularInteraction Fields J Chem Inf Model 2005 45(5)1313-1323

27 Cheeseright TJ Mackey MD Melville JL Vinter JG FieldScreenVirtual Screening Using Molecular Fields Application to theDUD Data Set J Chem Inf Model 2008 48(11)2108-2117

28 Cheeseright T Mackey M Rose S Vinter A Molecular FieldExtrema as Descriptors of Biological Activity Definition andValidation J Chem Inf Model 2006 46(2)665-676

29 Froumlhlich H Wegner JK Sieker F Zell A Optimal assignment ker-nels for attributed molecular graphs In ICML 05 Proceedings ofthe 22nd international conference on Machine learning New York NYUSA ACM 2005225-232

30 Froumlhlich H Wegner JK Sieker F Zell A Kernel Functions forAttributed Molecular Graphs ndash A New Similarity-BasedApproach to ADME Prediction in Classification and Regres-sion QSAR Comb Sci 2006 25(4)317-326

31 Chavatte P Yous S Marot C Baurin N Lesieur D Three-dimen-sional quantitative structure-activity relationships of cyclo-oxygenase-2 (COX-2) inhibitors A comparative molecularfield analysis J Med Chem 2001 44(20)3223-3230

32 Rupp M Proschak E Schneider G Kernel Approach to MolecularSimilarity Based on Iterative Graph Similarity J Chem InfModel 2007 472280-2286

33 Hinselmann G Jahn A Fechner N Zell A Chronic Rat ToxicityPrediction of Chemical Compounds Using Kernel MachinesIn Lecture Notes in Computer Science (EvoBIO 2009) Volume 5483Edited by Pizutti C Ritchie M Giacobini M Springer-Verlag BerlinHeidelberg 200925-36

34 Fechner N Jahn A Hinselmann G Zell A Atomic Local Neighbor-hood Flexibility Incorporation into a Structured SimilarityMeasure for QSAR J Chem Inf Model 2009 49(3)549-560

35 Good AC Oprea TI Optimization of CAMD Techniques 3 Vir-tual Screening Enrichment Studies a Help or Hindrance inTool Selection J Comput-Aided Mol Des 2008 22(3ndash4)169-178

36 Clark RD Webster-Clark DJ Managing Bias in ROC Curves JComput-Aided Mol Des 2008 22(3ndash4)141-146

37 Cleves AE Jain AN Effects of Inductive Bias on ComputationalEvaluations of Ligand-based Modeling and on Drug Discov-ery J Comput-Aided Mol Des 2008 22(3ndash4)147-159

38 Good AC Andrew Goods DUD Clustering [httpduddockingorgclusters]

39 Nicholls A What do we know and when do we know it J Com-put-Aided Mol Des 2008 22239-255

40 Jain AN Nicholls A Recommendations for Evaluation of Com-putational Methods J Comput-Aided Mol Des 2008 22(3ndash4)133-139

41 Irwin JJ Community Benchmarks for Virtual Screening J Com-put-Aided Mol Des 2008 22(3ndash4)193-199

42 Kuhn HW The hungarian method for the assignment prob-lem Naval Res Logist 1955 283-97

43 Munkres J Algorithms for the Assignment and Transporta-tion Problems J Soc Indust and Appl Math 1957 532-38

44 Vert JP The optimal assignment kernel is not positive defi-nite 2008 [httpwwwcitebaseorgabstractid=oaiarXivorg08014061]

45 Guha R Howard MT Hutchison GR Murray-Rust P Rzepa H Stein-beck C Wegner J Willighagen EL The Blue Obelisk Interopera-bility in Chemical Informatics J Chem Inf Model 200646(3)991-998

46 Wegner JK JOELibJOELib2 [httpsourceforgenetprojectsjoelib]

47 Kabsch W A solution for the best rotation to relate two setsof vectors Acta Crystallogr Sect A Found Crystallogr 197632(5)922-923

48 Coutsias EA Seok C Dill KA Using quaternions to calculateRMSD J Comput Chem 2004 25(15)1849-1857

49 Coutsias EA Seok C Dill KA Rotational superposition and leastsquares the SVD and quaternions approaches yield identicalresults Reply to the preceding comment by G Kneller JComput Chem 2005 26(15)1663-1665

50 Steinbeck C Han Y Kuhn S Horlacher O Luttmann E Willighagen EThe Chemistry Development Kit (CDK) an open-sourceJava library for Chemo- and Bioinformatics J Chem Inf ComputSci 2003 43(2)493-500

51 Steinbeck C Hoppe C Kuhn S Floris M Guha R Willighagen ERecent developments of the chemistry development kit(CDK) ndash an open-source java library for chemo- and bioinfor-matics Curr Pharm Des 2006 12(17)2111-2120

52 DUD ndash A Directory of Useful Decoys [httpduddockingorg]53 Huang N Shoichet BK Irwin JJ Benchmarking Sets for Molecu-

lar Docking J Med Chem 2006 49(23)6789-680154 Oprea TI Davis AM Teague SJ Leeson PD Is There a Difference

between Leads and Drugs A Historical Perspective J ChemInf Comput Sci 2001 41(5)1308-1315

55 Barker EJ Gardiner EJ Gillet VJ Kitts P Morris J Further Develop-ment of Reduced Graphs for Identifying Bioactive Com-pounds J Chem Inf Comput Sci 2003 43(2)346-356

56 Gasteiger J Rudolph C Sadowski J Automatic generation of 3d-atomic coordinates for organic molecules Tetrahedron ComputMethod 1992 3537-547

57 Schroumldinger LLC MacroModel version 96 New York NY 200858 srl Talete Milano Italy dragonX 14 for Linux (Molecular

Descriptor Calculation Software) 59 Berman HM Westbrook J Feng Z Gilliland G Bhat TN Weissig H

Shindyalov IN Bourne PE The Protein Data Bank Nucl Acids Res2000 28235-242

60 RCSB Protein Data Bank [httpwwwpdborg]61 Hawkins PCD Warren GL Skillman AG Nicholls A How to do an

Evaluation Pitfalls and Traps J Comput-Aided Mol Des 200822(3ndash4)179-190

62 Kirchmair J Markt P Distinto S Wolber G Langer T Evaluation ofthe Performance of 3D Virtual Screening Protocols RMSDComparisons Enrichment Assessments and Decoy Selec-

Open access provides opportunities to our colleagues in other parts of the globe by allowing

anyone to view the content free of charge

Publish with ChemistryCentral and everyscientist can read your work free of charge

W Jeffery Hurst The Hershey Company

available free of charge to the entire scientific communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Centralyours you keep the copyright

Submit your manuscript herehttpwwwchemistrycentralcommanuscript

tion -What can We Learn from Earlier Mistakes J Comput-Aided Mol Des 2008 22(3ndash4)213-228

63 Truchon JF Bayly CI Evaluating Virtual Screening MethodsGood and Bad Metrics for the Early Recognition ProblemJ Chem Inf Model 2007 47(2)488-508

64 Triballeau N Acher F Brabet I Pin JP Bertrand HO VirtualScreening Workflow Development Guided by the ReceiverOperating Characteristic Curve Approach Application toHigh-Throughput Docking on Metabotropic GlutamateReceptor Subtype 4 J Med Chem 2005 48(7)2534-2547

65 Mackey MD Melville JL Better than Random The ChemotypeEnrichment Problem J Chem Inf Model 2009 49(5)1154-1162

66 Muchmore SW Debe DA Metz JT Brown SP Martin YC Hajduk PJApplication of belief theory to similarity data fusion for usein analog searching and lead hopping J Chem Inf Model 200848(5)941-948

67 von Korff M Freyss J Sander T Comparison of Ligand- andStructure-Based Virtual Screening on the DUD Data Set JChem Inf Model 2009 49209-231

Abstract

Background

Results

Conclusion

Background

Methods

Optimal Assignment Kernel

Optimal Assignment Kernel with Flexibility Extension

Two-Step Hierarchical Assignment


Parameters and Computation Time

Experimental

Data Sets and Preparation

Evaluation of VS Performance

Comparison against Other Methods

Results and Discussion

Virtual Screening Results

Correlation of Optimal Assignment Methods

Conclusion

Competing interests

Authors contributions

Additional material

Acknowledgements

References