CHEMISTRY

Fragmental Descriptors with Labeled Atoms and Their Application in QSAR/QSPR Studies

N. I. Zhokhova, I. I. Baskin, V. A. Palyulin, A. N. Zefirov, and Academician N. S. Zefirov

Moscow State University, Vorob’evy gory, Moscow, 119992 Russia

Received May 30, 2007

DOI: 10.1134/S0012500807120026

ISSN 0012-5008, Doklady Chemistry, 2007, Vol. 417, Part 2, pp. 282–284. © Pleiades Publishing, Ltd., 2007. Original Russian Text © N.I. Zhokhova, I.I. Baskin, V.A. Palyulin, A.N. Zefirov, N.S. Zefirov, 2007, published in Doklady Akademii Nauk, 2007, Vol. 417, No. 5, pp. 639–641.

Quantitative structure–activity and structure–property relationship (QSAR/QSPR) methods are widely used for predicting the physical and chemical properties and biological activity of chemical compounds [1–3]. Within the fragmental approach, a universal strategy for constructing quantitative structure–activity and structure–property relationships [4], it is sometimes of interest to predict the properties of compounds using descriptors calculated from molecular fragments that contain particular atoms playing a specific role in the description of a given property. To identify such atoms, we marked them with a special label [5]. In the present work, we propose using fragmental descriptors with labeled atoms in QSAR/QSPR studies of a wide spectrum of properties: (i) for calculating local properties of molecules, for example, NMR chemical shifts; (ii) for predicting the biological activity of sets of congeneric compounds that contain a common fragment with anchor atoms bonded to substituents; and (iii) for predicting the kinetic parameters of reactions of the same type. In each case, the suggested strategy ensures that the most significant fragmental descriptors are used for constructing models. The use of such descriptors is exemplified by modeling (i) ³¹P NMR chemical shifts of monophosphine derivatives, (ii) the ability of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine analogues to inhibit HIV-1 reverse transcriptase, and (iii) the rate constants of hydrolysis of carboxylic esters.

The fast stepwise multiple linear regression (FSMLR) and three-layer feedforward artificial neural network (ANN) methods implemented in the NASAWIN software program [6, 7] were used for calculating fragmental descriptors with labeled atoms and constructing QSAR/QSPR models.

The number of neurons in the input layer of the ANN corresponded to the number of selected descriptors, the number of neurons in the hidden layer varied from two to five, and the output layer consisted of one neuron. The RPROP algorithm was used as the learning algorithm [8].

The predictive power of the models was estimated by means of an original procedure of $N(N-1)$-fold double cross-validation [9]. In this approach, the initial database is systematically divided into training, internal test, and external test sets in an $(N-2) : 1 : 1$ ratio. The information from the internal test set is used for selecting models with the highest predictive power. The information from the external test set is in no way used for constructing or selecting models; therefore, the error of prediction for this set (both the root-mean-square and the mean absolute error) can be used for estimating the actual predictive power of models. Over all such partitions, each compound occurs $N^2 - 3N + 2$ times in the training set, $N-1$ times in the internal test set, and $N-1$ times in the external test set. The predicted value of the property for each compound is calculated as the average of the predicted values over all $N-1$ partitions in which this compound occurs in the external test set. In this work, $N = 5$.

When linear regression models are constructed by the FSMLR method [9], the internal test set is used for determining the optimal number of descriptors in the model. In this method, the current error vector is initialized with the experimental property values of the compounds in the training set. At each iteration, the descriptor that correlates best with the current error vector for the training set is added to the current set of selected descriptors, and the corresponding regression model based on this descriptor is used for recalculating the current error vector, which is used at the next iteration for selecting the next descriptor, and so on. An interesting and nontrivial feature of this strategy is that each descriptor can be included in the model several times at different iterations. When the next descriptor is added, the constant term of the regression equation based on this descriptor is added to the current constant term of the multivariate (i.e., involving many descriptors) model. The regression coefficient at the descriptor is either transferred to the multivariate model
if the descriptor is included in the model for the first time or summed with the currently available value if it occurs in the model more than once. The process of iterative descriptor selection and model building is stopped when the minimal prediction error for the internal test set is reached, whereas the prediction error for the external test set, which is not used in any way in the statistical analysis, serves to estimate the predictive power of the resulting multivariate linear regression model. When models are developed by the ANN method, the internal test set is used to determine the point at which to stop learning in order to avoid overtraining.
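The stepwise procedure described above can be illustrated with a minimal Python sketch. It is not the NASAWIN implementation: the function names (`fsmlr`, `fit_line`, `predict`, `rmse`), the `max_steps` cap, and the choice to return the model with the lowest internal-test RMSE seen during the run are assumptions made for illustration only.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def fit_line(x, r):
    """Least-squares fit r ~ a + b*x; returns (a, b, |correlation|)."""
    mx, mr = mean(x), mean(r)
    sxx = sum((xi - mx) ** 2 for xi in x)
    srr = sum((ri - mr) ** 2 for ri in r)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    if sxx == 0.0 or srr == 0.0:
        return mr, 0.0, 0.0
    b = sxr / sxx
    return mr - b * mx, b, abs(sxr) / math.sqrt(sxx * srr)


def predict(rows, intercept, coefs):
    return [intercept + sum(c * v for c, v in zip(coefs, row)) for row in rows]


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def fsmlr(X_train, y_train, X_val, y_val, max_steps=100):
    """Stepwise regression sketch; returns (internal-test RMSE, intercept, coefs)."""
    n_desc = len(X_train[0])
    columns = [[row[j] for row in X_train] for j in range(n_desc)]
    intercept, coefs = 0.0, [0.0] * n_desc
    residual = list(y_train)  # error vector starts as the experimental values
    best = (float("inf"), intercept, list(coefs))
    for _ in range(max_steps):
        # Pick the descriptor best correlated with the current error vector.
        fits = [fit_line(col, residual) for col in columns]
        j = max(range(n_desc), key=lambda k: fits[k][2])
        a, b, corr = fits[j]
        if corr == 0.0:
            break
        intercept += a   # constant terms are summed
        coefs[j] += b    # a descriptor may be selected more than once
        residual = [r - (a + b * x) for r, x in zip(residual, columns[j])]
        val_error = rmse(y_val, predict(X_val, intercept, coefs))
        if val_error < best[0]:  # internal test set selects the model size
            best = (val_error, intercept, list(coefs))
    return best


# Synthetic, noise-free data (hypothetical): y = 2 + 3*x0 - 1.5*x2
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(80)]
y = [2.0 + 3.0 * row[0] - 1.5 * row[2] for row in X]
val_rmse, b0, w = fsmlr(X[:60], y[:60], X[60:], y[60:], max_steps=200)
```

In this sketch the external test set is deliberately absent: as in the text, it would be used only afterwards, to evaluate the model that the internal test set selected.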

Averaging the $N(N-1)$ partial FSMLR and ANN models derived for different partitions of the initial database gives the corresponding combined multivariate models. The calculated statistical characteristics are (i) $Q^2_{\mathrm{DCV}}$, the $Q^2$ parameter ($Q^2 = (\mathrm{SS} - \mathrm{PSS})/\mathrm{SS}$, where PSS is the sum of the squared prediction errors for the property and SS is the sum of the squared deviations of the property from its mean value); (ii) $\mathrm{RMSE}_{\mathrm{DCV}}$, the root-mean-square error of prediction; and (iii) $\mathrm{MAE}_{\mathrm{DCV}}$, the mean absolute error of prediction. The double cross-validation method provides the most adequate estimate of the actual predictive power of models for which the selection procedure implies the use of a test set or the cross-validation procedure.
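The partition scheme and the three statistics can be checked with a short Python sketch. The helper names (`double_cv_partitions`, `q2`, `rmse`, `mae`) are hypothetical, and splitting the database into N equal folds indexed 0 to N − 1 is an assumption about how the $(N-2) : 1 : 1$ division is realized.

```python
import math
from itertools import permutations


def double_cv_partitions(n_folds):
    """Yield (training folds, internal test fold, external test fold)."""
    for external, internal in permutations(range(n_folds), 2):
        training = [f for f in range(n_folds) if f not in (external, internal)]
        yield training, internal, external


def q2(observed, predicted):
    """Q^2 = (SS - PSS) / SS, as defined in the text."""
    mean_obs = sum(observed) / len(observed)
    ss = sum((o - mean_obs) ** 2 for o in observed)
    pss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return (ss - pss) / ss


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Count how often fold 0 plays each role for N = 5.
N = 5
partitions = list(double_cv_partitions(N))
in_training = sum(1 for train, _, _ in partitions if 0 in train)
in_internal = sum(1 for _, internal, _ in partitions if internal == 0)
in_external = sum(1 for _, _, external in partitions if external == 0)
```

For N = 5 this gives 20 partitions, with each fold used 12 times for training and 4 times each as the internal and external test set, in agreement with the $N^2 - 3N + 2$ and $N - 1$ counts stated earlier.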

Below are some examples of the use of descriptors with labeled atoms in QSAR/QSPR studies.

Example 1. To construct a QSPR model for ³¹P NMR chemical shifts of substituted monophosphines, we used a database comprising 291 phosphines $\mathrm{PH}_{3-n}\mathrm{R}_n$, including 29 primary, 38 secondary, and 224 tertiary phosphines with different substituents [10]. The experimental ³¹P NMR chemical shifts range from −183 to +61 ppm. As is known, the chemical shift depends on the degree of shielding of atomic nuclei by the electron cloud, whose density depends on the nature of the substituents at these atoms. Therefore, it is expedient to use descriptors that take into account the electronic and steric effects of these substituents. As such descriptors, we chose the numbers of occurrences of fragments containing four to ten nonhydrogen atoms, including the P atom (labeled with a). Among the resulting combined FSMLR and ANN models, the best FSMLR model has the following predictive power characteristics: $Q^2_{\mathrm{DCV}} = 0.9560$, $\mathrm{RMSE}_{\mathrm{DCV}} = 9.1$ ppm, and $\mathrm{MAE}_{\mathrm{DCV}} = 6.1$ ppm. The most significant fragments for the description of the property under consideration are the following fragments with the labeled atom Pᵃ:

[Scheme: five fragments containing the labeled atom Pᵃ.]

The first three fragments reflect the σ-inductive effect of the alkyl substituents at the phosphorus atom, the fourth fragment reflects the effect of conjugation with an aromatic nucleus, and the fifth fragment reflects the effect of the fluorine atom in the ortho position.
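As a rough illustration of what an occurrence count of fragments with a labeled atom means, the Python sketch below counts linear chain fragments that pass through one labeled atom of a hydrogen-suppressed molecular graph. It is a simplified stand-in, not the descriptor generator used in this work: the function name `labeled_path_counts`, the `[a]` suffix for the label, the restriction to chain fragments, and the small fragment sizes (two and three atoms instead of four to ten) are all assumptions chosen to keep the example short.

```python
from collections import Counter


def labeled_path_counts(symbols, bonds, labeled, sizes=(2, 3)):
    """Count chain fragments of the given sizes that contain the labeled atom."""
    neighbors = {i: set() for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # The labeled atom gets a distinct name, so fragments through it
    # are counted separately from fragments through unlabeled atoms.
    names = [s + "[a]" if i == labeled else s for i, s in enumerate(symbols)]

    found = set()

    def extend(path):
        if len(path) in sizes and labeled in path:
            # Each chain is reached from both ends; store one orientation only.
            found.add(min(tuple(path), tuple(reversed(path))))
        if len(path) < max(sizes):
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(symbols)):
        extend([start])

    counts = Counter()
    for path in found:
        seq = [names[i] for i in path]
        counts["-".join(min(seq, seq[::-1]))] += 1
    return counts


# Ethyldimethylphosphine, hydrogens suppressed: atom 0 is P, atoms 1-4 are C.
symbols = ["P", "C", "C", "C", "C"]
bonds = [(0, 1), (0, 2), (0, 3), (3, 4)]
counts = labeled_path_counts(symbols, bonds, labeled=0)
```

For this molecule the sketch finds three `C-P[a]` fragments, three `C-P[a]-C` fragments, and one `C-C-P[a]` fragment; each count would become one descriptor value for the compound.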

Example 2. The inhibitory activity with respect to HIV-1 reverse transcriptase, represented by the effective concentration of compounds necessary for the 50% protection of MT-4 cells from the cytotoxic effect of the virus and expressed as $\log(1/\mathrm{EC}_{50})$, was studied for a set of congeneric derivatives of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine [11]. Shown below are the common structural element of the compounds in the set and the fragments of substituents R¹, R², and R³, which are bonded to anchor atoms b, c, and d of the common fragment and make the largest contribution to the best combined model.

[Scheme: common structural element with anchor atoms b, c, and d, and the most significant fragments of substituents R¹, R², and R³.]

The model was obtained by the ANN method. It has the following predictive power parameters: $Q^2_{\mathrm{DCV}} = 0.8561$, $\mathrm{RMSE}_{\mathrm{DCV}} = 0.520$, and $\mathrm{MAE}_{\mathrm{DCV}} = 0.41$.

Example 3. To predict reaction rate constants log k, we used a database containing information on hydrolysis rate constants measured in the temperature range 0–154 °C in binary water–solvent systems (the concentration of the nonaqueous component was 0–98%) for 2092 carboxylic esters [12, 13]. Depending on the nature of the substituents at the C and O atoms of the acidic residue of the esters, the experimental log k values varied from −7.53 to −0.17. QSPR models were developed by the ANN method using, as descriptors, the temperature, the concentration of the organic solvent, parameters characterizing the solvent properties [13], and fragments containing labeled atoms involved in the reaction centers at any stage of the reaction as suggested by its mechanism [14]. Each such fragment describes the effect of the groups adjacent to the reaction centers on the reaction rate. The best combined model for this set was obtained by the ANN method and has $Q^2_{\mathrm{DCV}} = 0.9162$, $\mathrm{RMSE}_{\mathrm{DCV}} = 0.31$, and $\mathrm{MAE}_{\mathrm{DCV}} = 0.19$. Shown schematically below are the three fragments exerting the strongest effect on the hydrolysis rate constants.

[Scheme: three fragments with labeled atoms.]

The first fragment describes the steric effect of the substituents at the α-carbon atom of a carboxylic acid, the second fragment describes the electronic effect of the oxygen atom with lone electron pairs located in the leaving group, and the third fragment describes the effect of the phenyl group at the carboxyl.

Thus, the use of fragmental descriptors with labeled atoms makes it possible to extend the applicability of the fragmental approach in QSAR/QSPR studies.

REFERENCES

1. Hansch, C. and Leo, A., Exploring QSAR. Fundamentals and Applications in Chemistry and Biology, Washington (D.C.): ACS, 1995, p. 542.

2. Katritzky, A.R., Maran, U., Lobanov, V.S., and Karelson, M., J. Chem. Inf. Comput. Sci., 2000, vol. 40, p. 1.

3. Kubinyi, H., QSAR: Hansch Analysis and Related Approaches, Weinheim: VCH, 1993, p. 240.

4. Zefirov, N.S. and Palyulin, V.A., J. Chem. Inf. Comput. Sci., 2002, vol. 42, pp. 1112–1122.

5. Ivanova, A.A., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Dokl. Chem., 2007, vol. 413, part 2, pp. 90–94 [Dokl. Akad. Nauk, 2007, vol. 413, no. 6, pp. 766–770].

6. Baskin, I.I., Halberstam, N.M., Artemenko, N.V., et al., in EuroQSAR-2002. Designing Drugs and Crop Protectants: Processes, Problems, and Solutions, Melbourne: Blackwell, 2003, pp. 260–263.

7. Artemenko, N.V., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Dokl. Chem., 2001, vol. 381, nos. 1–3, pp. 317–320 [Dokl. Akad. Nauk, 2001, vol. 381, no. 2, pp. 203–206].

8. Patnaik, L.M. and Rajan, R., Neurocomputing, 2000, vol. 85, pp. 123–135.

9. Zhokhova, N.I., Baskin, I.I., Palyulin, V.A., et al., in XVI European Symposium on Quantitative Structure–Activity Relationships and Molecular Modeling, Mediterranean Sea, Italy, September 10–17, 2006, p. 206.

10. Bosque, R. and Sales, J., J. Chem. Inf. Comput. Sci., 2001, vol. 41, pp. 225–232.

11. Hannongbua, S., Nivesanond, K., Lawtrakul, L., et al., J. Chem. Inf. Comput. Sci., 2001, vol. 41, pp. 848–855.

12. Tablitsy konstant skorosti i ravnovesiya geterotsiklicheskikh organicheskikh reaktsii (Tables of Rate and Equilibrium Constants of Heterocyclic Organic Reactions), Palm, V.A., Ed., Moscow: VINITI, 1975.

13. Halberstam, N.M., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Mendeleev Commun., 2002, no. 5, pp. 185–186.

14. Ingold, C.K., Structure and Mechanism in Organic Chemistry, Ithaca: Cornell University Press, 1969. Translated under the title Teoreticheskie osnovy organicheskoi khimii, Moscow: Mir, 1973.
