ISSN 0012-5008, Doklady Chemistry, 2011, Vol. 440, Part 2, pp. 263–265. © Pleiades Publishing, Ltd., 2011. Original Russian Text © P.V. Karpov, I.I. Baskin, N.I. Zhokhova, N.S. Zefirov, 2011, published in Doklady Akademii Nauk, 2011, Vol. 440, No. 4, pp. 480–483.

CHEMISTRY

Method of Continuous Molecular Fields in the One-Class Classification Task

P. V. Karpov, I. I. Baskin, N. I. Zhokhova, and Academician N. S. Zefirov

Received April 26, 2011

DOI: 10.1134/S0012500811100016

Moscow State University, Moscow, 119991 Russia

Modern estimates of the number of conceivable organic structures exceed 10^60. Among these structures, there are both biologically active and practically important substances, as well as compounds that currently have no applications. It is evident that synthesis and testing of all the structures from this set are impossible. In this context, virtual screening methods are being actively developed. On the basis of predictive models, this technique filters initial databases down to a manageable number of structures that can be synthesized and tested in the laboratory.

When the 3D structure of a biological target is unknown, models for virtual screening are constructed on the basis of data on the ligands that bind to it. One approach is to construct a classification model that predicts whether a compound of interest is among the active or inactive structures. Such a model requires examples of both types; however, in most cases, only active structures are reported, whereas no information on inactive compounds is, as a rule, available. Thus, all conceivable structures for which no information on activity is available can be classified as inactive (on the order of 10^60 compounds). It is impossible to process such a number of compounds in reasonable time; therefore, only a small number of inactive structures are arbitrarily chosen as counterexamples, so that the model becomes dependent on this choice.

The one-class classification [1] successfully copes with this difficulty since only active structures are used for constructing models. The method has other important features: weak dependence on the input space metrics because of the presence of the training stage, lower sensitivity to activity cliffs as compared with similarity search methods based on Tanimoto indices, and a clear mathematical justification. This underlies the efficiency of such methodology for determining the areas of applicability of QSAR (quantitative structure–activity relationship) models [2] and performing virtual screening [3]. Just like any other method of statistical data processing, the one-class classification operates with a definite numerical representation of a chemical structure. The description of molecules by means of fixed-size descriptor vectors has some drawbacks, the most important one being the arbitrary choice of initial descriptor sets and, as a result, the subjectivity of the constructed QSAR models. Therefore, current interest focuses on the development of methods that make it possible to compare structures and construct predictive models on the basis of this comparison without using fixed sets of molecular descriptors.

Common 3D QSAR methods are based on the description of the spatial distribution of molecular fields. The descriptors in these methods are the interaction energies of ligands with probe atoms (molecular field potentials) placed at the points of a hypothetical spatial grid constructed around the aligned 3D ligand structures. For example, in the CoMFA method (comparative molecular field analysis) [4], the steric, hydrophobic, and electrostatic molecular fields are used for these purposes. The descriptors thus obtained serve as a basis for constructing a partial least squares (PLS) regression model, and the quality of the latter depends on the selected grid size, its angular orientation, and the grid pitch, which can be regarded as a drawback of CoMFA.

The idea of the method of continuous molecular fields (MCMF), which we suggested in [5] for the specific case of constructing regression models, is to describe molecular fields as continuous functions of spatial coordinates rather than by their values at the points of a hypothetical grid. Such a description is possible in the framework of the support vector machine (SVM) method based on the use of statistical kernels describing the similarity of the molecular fields of atoms in the ligand molecules.
In this work, we propose a promising method for virtual screening of organic compounds based on a combination of the MCMF methodology with the one-class SVM method (1-SVM).

The first step of constructing the model was spatial alignment of the structures of organic ligands. In this work, the alignment was performed with the SEAL algorithm [6] implemented by us in the framework of the software for MCMF modeling. Then, kernel values were calculated, and the model was constructed with the LIBSVM program [7], in which the SVM method is best implemented.

In the one-class classification method, only active structures are used. Sequentially excluding one structure at a time and constructing the model based on the remaining structures, one can predict the activity of all active compounds. However, for assessing the statistical characteristics of classification models, it is necessary to determine not only the number of active compounds predicted to be active (true positive, TP) and the number of active compounds predicted to be inactive (false negative, FN) but also the number of inactive compounds predicted to be inactive (true negative, TN) and the number of inactive compounds predicted to be active (false positive, FP). To determine the last two characteristics, we used structures resembling the structures of active ligands in their physicochemical properties but being inactive (so-called decoys).
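The leave-one-out scheme for the actives can be illustrated with a short Python sketch. It is not the MCMF/1-SVM implementation: the Gaussian kernel on toy descriptor vectors and the mean-similarity score are stand-ins chosen only to keep the example self-contained, and all function names and data are hypothetical. Scoring decoys against the full set of actives is likewise an assumption, since the text states only that decoys are not used for training.

```python
import math

def gaussian_kernel(x, y, alpha=0.5):
    """Toy similarity of two descriptor vectors (stand-in for an MCMF kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-alpha * sq_dist)

def score(candidate, training_set, alpha=0.5):
    """Mean similarity to the training actives (stand-in for a 1-SVM decision value)."""
    total = sum(gaussian_kernel(candidate, t, alpha) for t in training_set)
    return total / len(training_set)

def leave_one_out_scores(actives, alpha=0.5):
    """Score every active with a model built from all the other actives."""
    scores = []
    for i, held_out in enumerate(actives):
        training_set = actives[:i] + actives[i + 1:]
        scores.append(score(held_out, training_set, alpha))
    return scores

def score_decoys(decoys, actives, alpha=0.5):
    """Decoys never enter training; here they are scored against all actives."""
    return [score(d, actives, alpha) for d in decoys]

# Hypothetical toy data: four clustered "actives" and two distant "decoys".
actives = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2)]
decoys = [(2.0, 2.1), (1.8, 2.4)]

loo = leave_one_out_scores(actives)
dec = score_decoys(decoys, actives)
```

The scores of the held-out actives yield TP and FN once a threshold is chosen, and the decoy scores yield TN and FP.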

In this work, one-class models were constructed with the use of the DUD database [8], which contains the structures of active ligands for different biological targets, as well as the structures of corresponding decoys. It is worth noting that the latter were used only for assessing the statistical characteristics of classification models and were not involved in their construction. In particular, decoys were used for determining the TN and FP of models constructed with the use of active compounds.

The suggested one-class classifier calculates a continuous quantity (a classifier function), for which the threshold value is determined. If the classifier function calculated for a certain ligand exceeds the threshold value, the compound is considered active; otherwise, the structure is discarded from further consideration.
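As a minimal illustration of the thresholding rule, the following Python sketch counts TP, FN, FP, and TN from classifier-function values. The function name, the scores, and the threshold are hypothetical and serve only as an example.

```python
def confusion_counts(active_scores, decoy_scores, threshold):
    """Count TP, FN, FP, TN; a compound is called active if its score exceeds the threshold."""
    tp = sum(1 for s in active_scores if s > threshold)
    fn = len(active_scores) - tp
    fp = sum(1 for s in decoy_scores if s > threshold)
    tn = len(decoy_scores) - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Hypothetical classifier-function values for known actives and for decoys.
active_scores = [0.9, 0.7, 0.4, 0.8]
decoy_scores = [0.1, 0.5, 0.3]

counts = confusion_counts(active_scores, decoy_scores, threshold=0.45)
```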

The dependence of FN, FP, TN, and TP on the threshold value is clearly reflected by a receiver operating characteristic (ROC) curve [9] in the TPR–FPR coordinates (true positive rate versus false positive rate), where TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The larger the area under the curve (AUC), the higher the classifier efficiency.
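The construction of the ROC curve and its AUC can be sketched in a few lines of Python. The sketch sweeps the threshold over the observed scores and integrates by the trapezoidal rule; the function names and the example scores are hypothetical.

```python
def roc_points(active_scores, decoy_scores):
    """Return (FPR, TPR) points obtained by lowering the threshold score by score."""
    thresholds = sorted(set(active_scores) | set(decoy_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # ">= t" corresponds to a threshold placed just below the score t.
        tp = sum(1 for s in active_scores if s >= t)
        fp = sum(1 for s in decoy_scores if s >= t)
        tpr = tp / len(active_scores)   # TP / (TP + FN)
        fpr = fp / len(decoy_scores)    # FP / (FP + TN)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical classifier-function values.
active_scores = [0.9, 0.7, 0.4, 0.8]
decoy_scores = [0.1, 0.5, 0.3]

curve = roc_points(active_scores, decoy_scores)
area = auc(curve)
```

For these scores, the area equals the fraction of active–decoy pairs in which the active is ranked higher, which is the usual interpretation of the AUC.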

Constructing the classification model necessitates maximizing the AUC value by optimizing the 1-SVM parameter ν and the parameter α of the kernel used. In this work, we studied both the individual electrostatic, steric, and hydrophobic kernels and their linear combinations. For individual kernels, two parameters were optimized: ν and the α parameter corresponding to a given type of molecular field. For linear combinations, the number of parameters to be optimized increased to seven owing to the mixing coefficients for the corresponding kernels: ν, α_el, α_st, α_hyd, h_el, h_st, and h_hyd. An exhaustive search over all parameter values is rather laborious; therefore, we used dedicated function optimization algorithms. To do this, the NLopt library [10] was built into the software for MCMF modeling. The solutions obtained by such optimization are local rather than global and strongly depend on the initial approximation. Therefore, at the first step, the optimization algorithm was launched ten times, each time starting from a set of random parameter values in the ranges ν ∈ [0.01; 0.80], h_k ∈ [0.0001; 0.3000], and α_k ∈ [0.001; 1.000]. At the second step, we used the Nelder–Mead algorithm for refining the optimal parameters; the set of best-fit parameters obtained at the first step of optimization was used as the initial approximation.

Table 1. Parameters and AUC of the models for HIV reverse transcriptase inhibitors obtained by the one-class support vector machine method with the use of kernels in the framework of the MCMF

Molecular field   ν        α_el     α_st     α_hyd    AUC
Electrostatic     0.0819   0.0311   –        –        0.60
Steric            0.0017   –        0.0099   –        0.75
Hydrophobic       0.4658   –        –        0.0086   0.65

Figure. ROC curves (TPR versus FPR, both axes from 0 to 1.0) of the models constructed on the basis of (1) a linear combination and (2) individual molecular fields for trypsin inhibitors. See text for the meaning of TPR and FPR.
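The two-step parameter search described in the text (ten random starts followed by Nelder–Mead refinement) could look roughly like the Python sketch below. Several assumptions are made: the text does not name the algorithm used at the first step, so a short run of the same simplex routine stands in for it; the simplex routine is a minimal hand-written variant, not the NLopt implementation; and the objective is a hypothetical smooth surrogate for the negative AUC as a function of ν and a single α, not a real 1-SVM evaluation.

```python
import random

def nelder_mead(f, x0, step=0.05, max_iter=500, tol=1e-8):
    """Minimal Nelder-Mead simplex minimiser (reflection, expansion, contraction, shrink)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        size = max(abs(v[i] - best[i]) for v in simplex for i in range(n))
        if size < tol:
            break
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)] for v in simplex[1:]
                ]
    return min(simplex, key=f)

def toy_negative_auc(params):
    """Hypothetical surrogate objective: minus a smooth 'AUC' peaking at nu=0.3, alpha=0.05."""
    nu, alpha = params
    return -(1.0 - (nu - 0.3) ** 2 - (alpha - 0.05) ** 2)

def two_step_search(objective, n_starts=10, seed=0):
    """Step 1: short runs from random starts. Step 2: refine the best of them."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_starts):
        # Random start within the ranges quoted in the text for nu and alpha.
        start = [rng.uniform(0.01, 0.80), rng.uniform(0.001, 1.000)]
        candidates.append(nelder_mead(objective, start, max_iter=20))
    best_first_step = min(candidates, key=objective)
    return nelder_mead(objective, best_first_step, max_iter=500)

best = two_step_search(toy_negative_auc)
```

In a real setting, the objective would rebuild the one-class model for each parameter set and return the negative AUC, which makes every evaluation expensive; this is presumably why a small number of restarts is combined with a local refinement.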

Tables 1 and 2 summarize the results of constructing models by means of the one-class support vector machine on the basis of continuous molecular fields for HIV reverse transcriptase (HIVRT) inhibitors and trypsin inhibitors. As follows from Table 1, for HIVRT, the best performance is shown by the model constructed with the use of the steric kernel, which gives an AUC value of 0.75. For this target, the use of a linear combination of several kernels does not increase the AUC value. At the same time, for trypsin inhibitors, rather high AUC values (0.86–0.91) were obtained with individual models constructed with the use of each of the three kernels, which is likely due to their intercorrelation. However, for this target, the use of a linear combination of all kernels increases the AUC value to 0.94. Examples of ROC curves for the trypsin inhibitor models are shown in the figure.

Our calculation results show the efficiency of the suggested method. Being an alternative to the traditional search for active structures with the use of the Tanimoto index and to the two-class classification methods, this approach has unique properties. In contrast to the two-class classification methods, the described method is not sensitive to the choice of counterexamples. As distinct from the Tanimoto similarity search, the suggested method can adapt to complex structure–activity landscapes and, thus, makes it possible to avoid activity cliffs [11]. In addition, as compared to similarity search methods based on the use of fragmental descriptors, the suggested approach allows the same model to be applied to sets of compounds belonging to different structural classes.

ACKNOWLEDGMENTS

This work was supported by the Russian Foundation for Basic Research (project no. 10-07-00201).

REFERENCES

1. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., et al., Neural Comput., 2001, vol. 13, pp. 1443–1471.

2. Baskin, I.I., Kireeva, N., and Varnek, A., Mol. Inf., 2010, vol. 29, pp. 581–587.

3. Fechner, N., Jahn, A., Hinselmann, G., and Zell, A., J. Chemoinf., 2010, vol. 2, p. 2.

4. Cramer, R.I., Patterson, D., and Bunce, J., J. Am. Chem. Soc., 1988, vol. 110, pp. 5959–5967.

5. Zhokhova, N.I., Baskin, I.I., Bakhronov, D.K., Palyulin, V.A., and Zefirov, N.S., Dokl. Chem., 2009, vol. 429, part 1, pp. 273–276 [Dokl. Akad. Nauk, Ser. Khim., 2009, vol. 429, pp. 201–205].

6. Kearsley, S.K. and Smith, G., Tetrahedron Comput. Met., 1990, vol. 3, no. 6(3), pp. 615–633.

7. Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a Library for Support Vector Machines, 2001.

8. Huang, N., Shoichet, B.K., and Irwin, J.J., J. Med. Chem., 2006, vol. 49, pp. 6789–6801.

9. Fawcett, T., Pattern Recogn. Lett., 2006, vol. 27, pp. 861–874.

10. Johnson, S., The NLopt Nonlinear-Optimization Package, http://ab-initio.mit.edu/nlopt

11. Maggiora, G.M., J. Chem. Inf. Model., 2006, vol. 46, p. 1535.

Table 2. Parameters and AUC of the models for trypsin inhibitors obtained by the one-class support vector machine method with the use of kernels in the framework of the MCMF

Molecular field                    ν        h_el     α_el     h_st     α_st     h_hyd    α_hyd    AUC
Electrostatic                      0.5302   –        0.3000   –        –        –        –        0.91
Steric                             0.4695   –        –        –        0.0013   –        –        0.87
Hydrophobic                        0.6604   –        –        –        –        –        0.3000   0.86
Linear combination of the fields   0.4526   0.3158   0.2997   0.5772   0.0427   0.1070   0.1517   0.94