CHAPTER - 2

QSAR METHODOLOGY

2.1 INTRODUCTION

Quantitative structure-activity relationships (QSAR) represent an attempt to correlate

structural or property descriptors of compounds with activities. These physicochemical

descriptors, which include parameters to account for hydrophobicity, topology, electronic

properties, and steric effects, are determined empirically or, more recently, by computational

methods. Activities used in QSAR include chemical measurements and biological assays.

QSAR are currently being applied in many disciplines, with many applications pertaining to drug design

and environmental risk assessment.

QSAR date back to the 19th century. In 1863, A.F.A. Cros at the University of Strasbourg

observed that toxicity of alcohols to mammals increased as the water solubility of the

alcohols decreased [1]. In the 1890s, Hans Horst Meyer of the University of Marburg and

Charles Ernest Overton of the University of Zurich, working independently, noted that the

toxicity of organic compounds depended on their lipophilicity [1, 2].

2.2 Linear Free Energy Relationships

Little additional development of QSAR occurred until the work of Louis Hammett (1894-

1987), who correlated electronic properties of organic acids and bases with their equilibrium

constants and reactivity. Consider the dissociation of benzoic acid:

Hammett observed that adding substituents to the aromatic ring of benzoic acid had an

orderly and quantitative effect on the dissociation constant. For example,

a nitro group in the meta position increases the dissociation constant, because the nitro group

is electron-withdrawing, thereby stabilizing the negative charge that develops. Consider now

the effect of a nitro group in the para position:

The equilibrium constant is even larger than for the nitro group in the meta position,

indicating even greater electron-withdrawal.

Now consider the case in which an ethyl group is in the para position:

In this case, the dissociation constant is lower than for the unsubstituted compound,

indicating that the ethyl group is electron-donating, thereby destabilizing the negative charge

that arises upon dissociation.

Hammett also observed that substituents have a similar effect on the dissociation of other

organic acids and bases. Consider the dissociation of phenylacetic acids:

Electron-withdrawal by the nitro group increases dissociation, with the effect being less for

the meta than for the para substituent, just as was observed with benzoic acid. The electron-

donating ethyl group decreases the equilibrium constant, as would be expected.

Data for these equilibria typically are graphed as illustrated below:

Figure 1: Example of a graph for a linear free energy relationship. K0 or K0' represent

equilibrium constants for unsubstituted compounds and K or K', for substituted compounds.

Values for the abscissa are calculated from the dissociation constants of unsubstituted and

substituted benzoic acid. Values for the ordinate are obtained from another organic acid or

base with identical patterns of substitution, in this case phenylacetic acid.

Because this relationship is linear, the following equation can be written:

\[ \log\frac{K'}{K_0'} = \rho \, \log\frac{K}{K_0} \]

where ρ is the slope of the line. The values for the abscissa in Figure 1 are always those for

benzoic acid and are given the symbol σ. Therefore, we can write:

\[ \log\frac{K'}{K_0'} = \rho \, \sigma \]

ρ, the slope of the line, is a proportionality constant pertaining to a given equilibrium. It

relates the effect of substituents on that equilibrium to the effect of those substituents on the

benzoic acid equilibrium. That is, if the effect of substituents is proportionally greater than on

the benzoic acid equilibrium, then ρ > 1; if the effect is less than on the benzoic acid

equilibrium, ρ < 1. By definition, ρ for benzoic acid is equal to 1.
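As a small numerical illustration of the relationship described above, the sketch below derives a substituent constant (σ) from the pKa values of benzoic acid and a substituted benzoic acid, then uses a slope (ρ) to estimate the shift in another equilibrium. The function names are hypothetical, the pKa values are approximate literature figures used only for illustration, and the ρ value is an arbitrary example of a reaction series that is less sensitive than benzoic acid.

```python
# Minimal sketch of the Hammett relationship: log(K'/K0') = rho * sigma.
# All names and numbers here are illustrative assumptions, not taken from the chapter.

def sigma_from_pka(pka_unsubstituted, pka_substituted):
    """sigma = log10(K / K0) = pKa(unsubstituted) - pKa(substituted)."""
    return pka_unsubstituted - pka_substituted


def predicted_log_ratio(rho, sigma):
    """log10(K' / K0') for another equilibrium with slope rho."""
    return rho * sigma


PKA_BENZOIC = 4.20          # approximate literature value
PKA_P_NITROBENZOIC = 3.44   # approximate literature value
RHO_EXAMPLE = 0.5           # hypothetical slope for a less sensitive equilibrium

sigma_p_nitro = sigma_from_pka(PKA_BENZOIC, PKA_P_NITROBENZOIC)
log_ratio = predicted_log_ratio(RHO_EXAMPLE, sigma_p_nitro)
k_ratio = 10 ** log_ratio   # K'/K0' implied by the linear relationship

print(f"sigma (p-NO2) ~ {sigma_p_nitro:.2f}")
print(f"predicted log(K'/K0') = {log_ratio:.2f}")
```

A positive σ combined with a positive ρ gives a ratio K'/K0' greater than 1, i.e. the electron-withdrawing substituent increases dissociation in the second equilibrium as well, only to a smaller extent when ρ < 1.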

σ is a descriptor of the substituents. The magnitude of σ gives the relative strength of the

electron-withdrawing or -donating properties of the substituents. σ is positive if the

substituent is electron-withdrawing and negative if it is electron-donating.

These relationships as developed by Hammett are termed linear free energy relationships.

Recall the equation relating free energy to an equilibrium constant:

\[ \Delta G^\circ = -RT \ln K = -2.303\,RT \log K \]

That is, the free energy is proportional to the logarithm of the equilibrium constant. These

linear free energy relationships are termed "extrathermodynamic". Although they can be

stated in terms of thermodynamic parameters, no thermodynamic principle states that the

relationships should be true.
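The term "linear free energy relationship" can be made explicit with a short derivation. The sketch below assumes both equilibria are measured at the same temperature; the subscripts X (substituted) and H (unsubstituted) and the prime for the second reaction series are notation introduced here for illustration.

```latex
% Free energy change caused by a substituent in the benzoic acid series
\Delta G^\circ_X - \Delta G^\circ_H
  = -2.303\,RT\,\log\frac{K}{K_0}
  = -2.303\,RT\,\sigma

% Same substituent in a second reaction series with slope \rho
\Delta G'^\circ_X - \Delta G'^\circ_H
  = -2.303\,RT\,\log\frac{K'}{K_0'}
  = -2.303\,RT\,\rho\,\sigma
  = \rho\,\bigl(\Delta G^\circ_X - \Delta G^\circ_H\bigr)
```

The substituent-induced change in free energy in one series is therefore a linear function of the change in the reference series, with ρ as the proportionality constant.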

To develop a better understanding of these relationships, it is instructive to consider some

values of ρ and σ. Values of ρ are provided below:

In the aniline and phenol equilibria, the hydrogen ion that is dissociating is one atom removed

from the phenyl ring, whereas in the benzoic acid equilibrium it is two atoms removed. Thus,

substituents are able to exert a greater effect on the dissociation in aniline and phenol than in

benzoic acid and the value of ρ > 1. In phenylacetic and phenylpropionic acids, the hydrogen

ion dissociating is three and four atoms removed, respectively, from the phenyl ring.

Substituents are able to exert a lesser effect on the equilibrium than on the benzoic acid

equilibrium and ρ < 1.

Some illustrative values of σ for substituents in the meta and para positions are given below:

By definition, σ for hydrogen is 0. The positive values of σ for the nitro group indicate that it

is electron-withdrawing. In understanding the magnitudes of the σ values for the nitro group

in meta vs. para positions, consider the mechanisms of electron withdrawal or donation. For a

nitro group in the meta position, electron-withdrawal is due to an inductive effect produced

by the electronegativity of the constituent atoms. If only induction were operative, one would

expect the electron-withdrawing effect of a nitro group in the para position to be less than in

the meta position. The larger σ value for a para-substituted nitro group results from the

combination of both inductive and resonance effects. For chlorine, the electronegativity of the

atom produces an inductive electron-withdrawing effect, with the magnitude of the effect in

the para position being less than in the meta position. For chlorine, only the inductive effect is

possible. The methoxy group can be electron-donating or -withdrawing, depending on the

position of substitution. In the meta position, the electronegativity of the oxygen produces an

inductive electron-withdrawing effect. In the para position, only a small inductive effect

would be expected. Moreover, an electron-donating resonance effect occurs for the methoxy

group in the para position, giving an overall electron-donating effect. Tables of σ values for

numerous substituents have been published [3,4]. In some cases, the sigma values are

generally applicable to many different equilibria. In other cases, sigma values have been

derived for specific equilibria, which is particularly true when one considers sigma values for

ortho substituents.

2.3 Computer-Assisted Design

Computer-assisted drug design (CADD), also called computer-assisted molecular design

(CAMD), represents more recent applications of computers as tools in the drug design

process. In considering this topic, it is important to emphasize that computers cannot

substitute for a clear understanding of the system being studied. That is, a computer is only an

additional tool to gain better insight into the chemistry and biology of the problem at hand.

In most current applications of CADD, attempts are made to find a ligand (the putative drug)

that will interact favorably with a receptor that represents the target site. Binding of ligand to

the receptor may include hydrophobic, electrostatic, and hydrogen-bonding interactions. In

addition, solvation energies of the ligand and receptor site also are important because partial

to complete desolvation must occur prior to binding.

This approach to CADD optimizes the fit of a ligand in a receptor site. However, optimum fit

in a target site does not guarantee that the desired activity of the drug will be enhanced or that

undesired side effects will be diminished. Moreover, this approach does not consider the

pharmacokinetics of the drug.

The approach used in CADD is dependent upon the amount of information that is available

about the ligand and receptor. Ideally, one would have 3-dimensional structural information

for the receptor and the ligand-receptor complex from X-ray diffraction or NMR. The ideal is

seldom realized. In the opposite extreme, one may have no experimental data to assist in

building models of the ligand and receptor, in which case computational methods must be

applied without the constraints that the experimental data would provide.

Based on the information that is available, one can apply either ligand-based or receptor-

based molecular design methods. The ligand-based approach is applicable when the structure

of the receptor site is unknown, but when a series of compounds have been identified that

exert the activity of interest. For the approach to be most effective, one should have structurally similar

compounds with high activity, with no activity, and with a range of intermediate activities. In

recognition site mapping, an attempt is made to identify a pharmacophore, which is a

template derived from the structures of these compounds. It is represented as a collection of

functional groups in three-dimensional space that is complementary to the geometry of the

receptor site.

In applying this approach, conformational analysis will be required, the extent of which will

be dependent on the flexibility of the compounds under investigation. One strategy is to find

the lowest energy conformers of the most rigid compounds and superimpose them.

Conformational searching on the more flexible compounds is then done while applying

distance constraints derived from the structures of the more rigid compounds. Ultimately, all

of the structures are superimposed to generate the pharmacophore. This template may then be

used to develop new compounds with functional groups in the desired positions. In applying

this strategy, one must recognize that one is assuming that it is the minimum energy

conformers that will bind most favorably in the receptor site. In fact, there is no a priori

reason to exclude higher energy conformers as the source of activity.

The receptor-based approach to CADD applies when a reliable model of the receptor site is

available, as from X-ray diffraction, NMR, or homology modeling. With the availability of

the receptor site, the problem is to design ligands that will interact favorably at the site, which

is a docking problem.

2.4 Hansch Analysis

QSAR based on Hammett's relationship utilize electronic properties as the descriptors of

structures. Difficulties were encountered when investigators attempted to apply Hammett-

type relationships to biological systems, indicating that other structural descriptors were

necessary.

Robert Muir, a botanist at Pomona College, was studying the biological activity of

compounds that resembled indoleacetic acid and phenoxyacetic acid, which function as plant

growth regulators. In attempting to correlate the structures of the compounds with their

activities, he consulted his colleague in chemistry, Corwin Hansch. Using Hammett sigma

parameters to account for the electronic effect of substituents did not lead to meaningful

QSAR. However, Hansch recognized the importance of the lipophilicity, expressed as the

octanol-water partition coefficient, on biological activity [5]. We now recognize this

parameter to provide a measure of the bioavailability of compounds, which will determine, in

part, the amount of the compound that gets to the target site.

Relationships were developed to correlate a structural parameter (i.e., lipophilicity) with

activity. In some cases, a univariate relationship correlating structure and activity was

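adequate. The form of the equation is:

\[ \log\frac{1}{C} = a \log P + b \]

where P is the octanol-water partition coefficient, a and b are regression coefficients, and

C is the molar concentration of compound that produces a standard response (e.g.,

LD50, ED50). With other data, it was observed that correlations were improved by

combining Hammett's electronic parameters and Hansch's measure of lipophilicity using an

equation such as

\[ \log\frac{1}{C} = a\,\pi + b\,\sigma + c \]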

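where σ is the Hammett substituent parameter and π is defined analogously to σ. That is,

\[ \pi = \log P_X - \log P_H \]

where P_X and P_H are the partition coefficients of the substituted and unsubstituted compounds.

In yet other cases, parabolic relationships between biological response and hydrophobicity

were observed that could be fit by including a (log P)² term in the QSAR. One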

interpretation to account for this term is that many membranes must be traversed for

compounds to get to the target site, and those with greatest hydrophobicity will become

localized in the membranes they encounter initially. Thus, an optimum hydrophobicity may

be found in some test systems.
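The idea of an optimum hydrophobicity can be sketched numerically. The example below fits a parabola in log P to hypothetical activity data with `numpy.polyfit` and locates the maximum of the fitted curve. The data are synthetic (generated from a known parabola purely for illustration), and the variable names are assumptions, not quantities from the chapter.

```python
import numpy as np

# Hypothetical data: log(1/C) generated from a known parabola in log P,
# log(1/C) = -0.2 (log P)^2 + 1.2 log P + 2.0, for illustration only.
log_p = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
log_inv_c = -0.2 * log_p**2 + 1.2 * log_p + 2.0

# Fit log(1/C) = a (log P)^2 + b log P + c by least squares.
a, b, c = np.polyfit(log_p, log_inv_c, deg=2)

# A negative quadratic coefficient means the curve has a maximum,
# located where the derivative 2 a log P + b vanishes.
optimum_log_p = -b / (2.0 * a)

print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
print(f"optimum log P = {optimum_log_p:.2f}")
```

With real assay data the fit would not be exact, and the optimum is meaningful only if it lies inside the range of log P values actually tested.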

QSAR are now developed using a variety of parameters as descriptors of the structural

properties of molecules. Hammett sigma values are often used for electronic parameters, but

quantum mechanically derived electronic parameters also may be used. Other descriptors to

account for the shape, size, lipophilicity, polarizability, and other structural properties also

have been devised.

Quantitative Structure-Activity/Toxicity Relationship Studies (QSAR) are mathematical

models relating the biological activity measurements of a set of chemical compounds to the

variation in their chemical structures [6-9].

2.5 DEVELOPMENT OF QSAR

The steps involved in the development of a QSAR for a given type of activity and a given

type of chemical compounds are as follows:

(a) Selecting a “training set” of compounds from an almost infinite universe of

possible candidates;

(b) Synthesizing or otherwise getting hold of pure samples of these compounds;

(c) Obtaining appropriate biological activity measurements, which form the (n x M) matrix

Y, with n being the number of training set compounds and M being the number

of biological activity measurements;

(d) Translating the variation in structure among the training set compounds to the

variation of “structure descriptor variables”, i.e., achieving a relevant

quantitative description of the structural variation among the compounds. The

resulting data are denoted Xik for compound i (i = 1, 2, …, n) and descriptor

variable k (k = 1, 2, …, K). Together they form the (n x K) matrix X.

(e) Deriving a mathematical model connecting X and Y. This gives, among other

results, a measure of how well Y is modeled (e.g., explained variance) and

which variables in X are important in the model (e.g., regression

coefficients);

(f) Validating the model by having it predict the biological activity / toxicity for

several new compounds, followed by the actual biological testing of these

compounds and the comparison of predictions with outcomes;

(g) Interpreting the model by relating it to biological and chemical knowledge.

2.6 MODEL DEVELOPMENT AND STATISTICAL DESIGNS

It is an accepted fact that no chemical or biological measurement is perfectly accurate.

The value of any property or activity varies a little (or much) when measured repeatedly,

even under conditions that are as similar as possible. Therefore, scientific models based on measured

data, including QSARs, are necessarily statistical in nature. These models separate the

variability in the data into two parts, the systematic part and the random part. For the biological

measurements (BA) we can write the model as:

BA = F(X) + E (2.1)

Here, the systematic part, F(X) is the part of the biological data that is “explained” by

X. The random part, the residuals E, contains the errors of measurement of the activities plus the model

errors, i.e. the imperfection of the systematic model F(X). The model errors, in turn, have two

sources: the imperfect form or “shape” of the model F, and the incompleteness and possible

errors in the descriptor variables X [10-15].
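The split of the data into a systematic part F(X) and residuals E can be sketched with an ordinary least-squares fit. The example below assumes a linear form for F, which is only one possible choice, and uses synthetic descriptors and activities; every name and number in it is hypothetical.

```python
import numpy as np

# Synthetic example: 20 compounds, 2 descriptors (all values hypothetical).
rng = np.random.default_rng(0)
n_compounds, n_descriptors = 20, 2
X = rng.normal(size=(n_compounds, n_descriptors))
true_coefficients = np.array([1.5, -0.8])
noise = rng.normal(scale=0.1, size=n_compounds)      # stands in for measurement error
BA = 0.5 + X @ true_coefficients + noise              # "measured" activities

# Assume a linear model F(X) = b0 + b1*x1 + b2*x2 and estimate it by least squares.
X_design = np.column_stack([np.ones(n_compounds), X])
coefficients, *_ = np.linalg.lstsq(X_design, BA, rcond=None)

F_of_X = X_design @ coefficients                      # systematic part
E = BA - F_of_X                                       # residuals (random part)

# Fraction of the variance in BA that is explained by F(X).
explained_variance = 1.0 - np.sum(E**2) / np.sum((BA - BA.mean())**2)

print("estimated coefficients:", np.round(coefficients, 3))
print(f"explained variance = {explained_variance:.3f}")
```

By construction BA = F(X) + E holds exactly; what the fit cannot tell us is how much of E is measurement error and how much is model error.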

Since these structurally related properties of a chemical can presumably be

determined by experimental or computational means much more efficiently than its biological

activity using in vivo or in vitro approaches, a statistically validated QSAR model is capable

of predicting the biological activity of a new chemical within the same series in lieu of the

time-consuming and labour-intensive processes of chemical synthesis and biological

evaluation. Applied judiciously, QSAR can save a substantial amount of time, money and

manpower.

The Quantitative Structure-Activity Relationships (QSARs) are mathematical models

relating the activity measurements of chemical compounds to the variation in their chemical

structure. When properties or activities are related to the structures of chemical compounds,

the methodology is called quantitative structure-property relationships (QSPRs) or

quantitative structure-activity relationships (QSARs), respectively. Such methodologies are

being used successfully in antihypertensive drugs to understand the adverse effects of

chemical compounds. These predictions are made because of the large number of untested

chemicals and because of the high costs of biological testing [16-24].

QSAR models are now regarded as a scientifically credible tool for predicting and

classifying the biological activities of untested chemicals. QSAR has become inexorably

embedded as an essential tool in the pharmaceutical industry, from lead discovery and

optimization to lead development [25, 26]. A growing trend is to use QSAR early in the drug

discovery process as a screening and enrichment tool to eliminate from further development

those chemicals lacking drug-like properties [26] or those chemicals predicted to elicit a toxic

response. The fundamental assumption of QSAR is that variations in the biological activity of

a series of chemicals that target a common mechanism of action are correlated with variations

in their structural, physical and chemical properties [27]. A good number of papers have

been reported in this field [28-34].

The molecular structure and parameters derived from molecular spectral data of

organic compounds acting as drugs can be combined to form powerful models of biological

activity. Such data-activity relationships are nowadays called Quantitative Structure-Data-Activity

Relationships (QSDARs) instead of QSARs.

The pioneer in this field, regarded as the father of QSAR, is Corwin Hansch,

who first introduced this methodology in 1962 [35]. For the past 47 years, the use

of QSAR has become increasingly helpful in understanding chemical and biological

interactions in the drug design process, pesticide research, environmental pollution, the

area of toxicology, etc. QSAR methodology is very useful in elucidating the mechanism of

chemical-biological interactions, in particular enzyme action.

In environmental toxicology QSAR are used to understand the adverse effects of

chemicals, and to predict the biological effects of yet untested compounds. These predictions

are done because of the large number of untested chemicals, and because of the high costs of

biological testing. It is worth mentioning that, although sometimes taken as a criterion,

prediction is not the primary goal of QSAR. If it results from interpolation, it is often trivial;

if extrapolation goes too far outside the included parameter space, it usually fails. QSAR

helps to understand structure-activity relationships in a quantitative manner and to find the

borders of certain properties, e.g. the optimum lipophilicity within a series of analogs or the

maximum size of a certain group in a stepwise procedure.

The strategy and philosophy of QSAR enable chemists, pharmacologists,

environmentalists and medicinal chemists to look at their structures in terms of

physicochemical properties instead of only considering certain pharmacophore groups in them.

Nevertheless, QSAR methods are still used to prove and to quantify the underlying

hypothesis regarding the dependence of biological activities on physicochemical interactions.

The predictive power of QSAR critically depends on the statistical quality of the data used to

develop the model and estimate its parameters.

Parameters which encode certain structural features and properties are needed to

correlate biological activities with chemical structures in a quantitative manner.

Physicochemical properties, which are directly related to the extra molecular force involved

in the drug-receptor interaction as

The development of a QSAR for a given type of biological activity and a given type

of chemical compounds involves the following steps:

(i) Selecting a “training set” of compounds from an almost infinite

universe of possible candidates;

(ii) Synthesizing or otherwise getting hold of pure samples of these compounds;

(iii) Obtaining appropriate biological activity-measurements from the training set

compounds. Together these data form the (n x M) matrix Y, n being the

number of training set-compounds, and M being number of biological activity

measurements;

(iv) Translating the variation in structure among the training set compounds to the

variation of “structure descriptor variables”. That is, achieving a relevant

quantitative description of the structural variation among the compounds. The

resulting data are denoted Xik for compound i (i = 1, 2, 3, …, n) and descriptor

variable k (k = 1, 2, …, K). Together they form the (n x K) matrix X.

(v) Deriving a mathematical model connecting X and Y. This gives, among other

results, a measure of how well Y is modeled and which

variables in X are important in the model (e.g. regression coefficients);

(vi) Validating the model by having it predict the biological activity for several

new compounds, followed by the actual biological testing of these compounds

and the comparison of predictions with outcomes.

(vii) Interpreting the model by relating it to biological and chemical knowledge.

It is interesting to mention that the history of QSAR dates back to the

19th century [32, 43, 44]. At that time QSAR was in the foreground and prediction

played only a minor role. Only a few parameters were used at that time. The

parameters used were log P, π, σ, MR and some steric parameters.

Now, quantitative prediction is made using quantum chemical and geometrical descriptors,

connectivity values, electrotopological and WHIM descriptors, and many other distance-based and

connectivity-type topological indices. Based on the origin of the molecular descriptors

used in the calculations, QSAR methods can be divided into three groups, as shown below:

QSAR Methods

1st Group: Based on a relatively small number of physico-chemical properties and parameters describing hydrophobic, steric, electrostatic, etc. effects. Usually, these descriptors are used as independent variables in multiple regression approaches [11]. These methods are referred to as Hansch analysis [11].

2nd Group: Based on quantitative characteristics of molecular graphs, i.e. molecular topological descriptors. These include molecular connectivity indices [36-39], molecular shape indices [40], topological and electrotopological state indices [41,42], atom-pair descriptors [41], etc. Sometimes topological descriptors are also combined with physico-chemical properties and/or spectral parameters of the molecule. These methods are referred to as 2D QSAR [27].

3rd Group: Based on descriptors derived from a spatial (three-dimensional) representation of the molecular structures. These methods are called 3D QSAR. They require 3D alignment of all molecules according to a pharmacophore model or based on docking to a ligand binding site of a receptor. The descriptors used could be electrostatic, steric, hydrophobic, etc. field values at grid points corresponding to the molecule.

2.7 MODEL VALIDATION AND PREDICTIVE POWER OF QSAR MODELS

In MLR (Multiple Linear Regression) QSAR analysis, more independent variables

result in a higher probability of a chance correlation between predicted and observed

activities [33-35]. This conclusion is true not only for MLR QSAR, but also for any QSAR

approach when the number of variables (descriptors) is comparable to or higher than the

number of compounds in the data set. Consequently, model validation is one of the most

important aspects of QSAR analysis.

A widely used way to validate a model is cross-validation, which is based on

leave-one-out (LOO) or leave-some-out (LSO) procedures [33-35]. The

outcome of this procedure is a set of cross-validated parameters: PRESS (Predicted Residual

Sum of Squares), SSY (Sum of the Squares of the response values), Spress (Uncertainty of

Prediction), R²cv (or Q²) (Overall Predictive Ability) and PSE (Predictive Square Error).

Frequently, R²cv (or Q²) is used as a criterion of both robustness and predictive ability of the

model. Many authors consider a high Q² (for instance, Q² > 0.5) as an indicator or even as the

ultimate proof of the high predictive power of the QSAR model [45-49]. They do not test the

models for their ability to predict the activity of compounds of an external test set (i.e.

compounds which have not been used in the QSAR model development) [50, 51]. Other

authors validate their models using only one or a few compounds that were not used in QSAR

model development [50, 51] and still claim that their models are highly predictive. In contrast

with such expectations, it has been shown that if a test set with known values of biological

activities is available for prediction, there exists no correlation between the LOO cross-validated

Q² and the correlation coefficient R² between the predicted and observed activities for the test set

[52, 53].
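The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration for a multiple linear regression model, assuming the common definition Q² = 1 − PRESS/SSY with SSY taken about the mean of all responses; other conventions exist. The function name `loo_q2` and the synthetic data are hypothetical.

```python
import numpy as np


def loo_q2(X, y):
    """Leave-one-out cross-validation for a linear model with intercept.

    Returns (Q2, PRESS, SSY), assuming Q2 = 1 - PRESS / SSY.
    """
    n = len(y)
    X_design = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # leave compound i out
        coef, *_ = np.linalg.lstsq(X_design[keep], y[keep], rcond=None)
        prediction = X_design[i] @ coef               # predict the left-out compound
        press += (y[i] - prediction) ** 2
    ssy = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ssy, press, ssy


# Synthetic data set: 15 compounds, 2 descriptors (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=15)

q2, press, ssy = loo_q2(X, y)
print(f"PRESS = {press:.3f}, SSY = {ssy:.3f}, Q2 = {q2:.3f}")
```

As the text cautions, a high Q² from such a calculation shows internal consistency of the training set and is not by itself evidence of external predictive power.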

Another widely used approach to establish the model robustness is called y-

randomization (randomization of response i.e. activities) [50,51]. It consists of repeating the

calculation procedure with randomized activities and subsequent probability assessment of

the resultant statistics. Frequently, it is used along with cross-validation. It is expected that

models obtained for the data set with randomized activity should have low values of Q².

However, sometimes models based on the randomized data have high Q² values due to

chance correlation or structural redundancy [50-53].
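A y-randomization test of the kind described above can be sketched by shuffling the activities, refitting, and comparing the resulting Q² values with that of the real model. The sketch assumes a linear model and a leave-one-out Q² defined as 1 − PRESS/SSY; the function names, the number of trials, and the synthetic data are all hypothetical choices.

```python
import numpy as np


def loo_q2(X, y):
    """Leave-one-out Q2 = 1 - PRESS / SSY for a linear model with intercept."""
    n = len(y)
    X_design = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(X_design[keep], y[keep], rcond=None)
        press += (y[i] - X_design[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_trials=50, seed=0):
    """Q2 values obtained after randomly permuting the activities."""
    rng = np.random.default_rng(seed)
    return np.array([loo_q2(X, rng.permutation(y)) for _ in range(n_trials)])


# Synthetic data set (illustrative only): activity truly depends on X.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=30)

q2_real = loo_q2(X, y)
q2_random = y_randomization(X, y)

# Fraction of randomized models that do at least as well as the real one.
fraction_as_good = np.mean(q2_random >= q2_real)

print(f"Q2 (real activities)      = {q2_real:.3f}")
print(f"Q2 (randomized, maximum)  = {q2_random.max():.3f}")
print(f"fraction as good as real  = {fraction_as_good:.2f}")
```

If a noticeable fraction of the randomized models approach the Q² of the real model, chance correlation or redundancy in the descriptors should be suspected.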

Some authors have suggested that the only way to estimate the true predictive power

of a QSAR model is to compare the predicted and observed activities of sufficiently large

external test set of compounds that were not used in the model development [50-53].

To estimate the predictive power of a QSAR model one needs the following statistical

characteristics of the test set:

(1) Correlation coefficient R between the predicted and observed activities;

(2) Coefficients of determination for regressions through the origin, i.e. predicted versus

observed activities (R²obs, written R_0^2 below) and observed versus predicted activities

(R²cal, written R_0'^2 below);

(3) Slopes k and k' of the regression lines through the origin.

A QSAR model is predictive if the following conditions are satisfied [50]:

\[ Q^2 > 0.5 \tag{2.2} \]

\[ R^2 > 0.6 \tag{2.3} \]

\[ \frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1 \tag{2.4} \]

\[ 0.85 < k < 1.15 \quad \text{or} \quad 0.85 < k' < 1.15 \tag{2.5} \]
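The acceptance conditions listed above can be collected into a small checking routine. The sketch below is one possible formulation, not a definitive implementation: it assumes that the through-origin slope and coefficient of determination are computed by regressing one set of activities on the other without an intercept, and the assignment of k to "observed on predicted" and k' to "predicted on observed" is an assumption. The function names and the example activities are hypothetical.

```python
import numpy as np


def through_origin(a, b):
    """Regress a on b through the origin (a ~ k * b); return slope k and R0^2."""
    k = np.sum(a * b) / np.sum(b * b)
    r0_sq = 1.0 - np.sum((a - k * b) ** 2) / np.sum((a - a.mean()) ** 2)
    return k, r0_sq


def is_predictive(q2, y_obs, y_pred):
    """Check the four conditions for an external test set; return (verdict, details)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r_sq = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    k, r0_sq = through_origin(y_obs, y_pred)            # assumed direction for k
    k_prime, r0_prime_sq = through_origin(y_pred, y_obs)  # assumed direction for k'
    checks = {
        "q2 > 0.5": q2 > 0.5,
        "R2 > 0.6": r_sq > 0.6,
        "(R2 - R0^2)/R2 < 0.1": ((r_sq - r0_sq) / r_sq < 0.1)
        or ((r_sq - r0_prime_sq) / r_sq < 0.1),
        "0.85 < k < 1.15": (0.85 < k < 1.15) or (0.85 < k_prime < 1.15),
    }
    return all(checks.values()), checks


# Hypothetical external test set (observed vs. predicted activities).
y_observed = [5.1, 6.0, 6.8, 7.9, 9.2]
y_predicted = [5.0, 6.2, 6.7, 8.0, 9.0]

ok, details = is_predictive(q2=0.7, y_obs=y_observed, y_pred=y_predicted)
print("predictive:", ok)
for name, passed in details.items():
    print(f"  {name}: {passed}")
```

Returning the individual checks alongside the overall verdict makes it easy to see which condition a model fails.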

2.8. QSAR MODELING

The QSAR modeling generally involves three steps:

(1) Collect or, if possible, design a training set of chemicals,

(2) Choose descriptors that can properly relate chemical structure to biological

activities, and

(3) Apply statistical methods that correlate changes in structure with changes in

biological activity.

Obtaining a good quality QSAR model with the ability to predict activity of a

chemical outside the training set depends upon many factors in the approach and execution of

each of the three steps mentioned above.

2.9. QUALITY OF DATA

Data should come from the same assay protocol, and care should be taken to avoid

inter-laboratory variability. Any bad data points will tend to corrupt the proper correlation of

structure and activity. Rules of thumb for a good QSAR data set are that the dose-response

relationship should be smooth, the potency (or affinity) should be reproducible, the activity

range should span two or more orders of magnitude from the least active to most active

chemicals in the series, the number of chemicals used to build the QSAR model should be

sufficiently large to ensure statistical stability, the activities of the chemicals should be evenly

distributed across the range of activity, and the chemicals selected for the training set should

possess enough structural diversity to span the range of chemical space associated with the

biological activity under study.

2.10. PHYSICO-CHEMICAL PARAMETERS IN QSAR

QSAR development involves the quantification of the structural variation in the

current structural class by a small number of design variables [6-9]. These design variables

may correspond to characteristic properties of the compounds, e.g. size, solubility,

lipophilicity, oxidation potential etc. In case such “principal properties” are not available,

they may be derived by a multivariate analysis [16,17] of a battery of measured or calculated

properties of parent compounds. When the structural variation in the class corresponds to

the variation of a number of substituents or other structural fragments, the principal properties

of these fragments, i.e. substituent scales (π, σ, etc.), are suitable design variables.

Sometimes, physicochemical properties related to molecular structure are also used as

suitable design variables. Some such properties are: density (d), molecular weight (MW),

molar volume (MV), molar refraction (MR), parachor (Pr), surface tension (γ), polarizability

(α), etc. Fortunately, an excellent program for the calculation of such properties is available

and can be downloaded from ACD/Labs [18].

With the introduction of a mathematical concept viz., graph theory and topology in

chemistry, the recent trend is to use graph theoretical descriptions, which are commonly

called topological indices [19-23].

2.11. CHEMICAL GRAPH THEORY AND CHEMICAL TOPOLOGY

The concept of graph theory and topology as applied to chemistry is now well

established [19-23,54-57]. Several books and excellent reviews are now easily available in

the literature, and one can use the internet to collect such information [24-30,58-61].

It is therefore not necessary to give a detailed résumé of chemical graph theory and topology.

However, the basic concepts involved need to be given here.

While applying graph theory to chemistry it is necessary to construct a chemical

graph from the structure of the chemical compound. That is, one has to transform chemical

structure into chemical graph. Such a derived chemical graph is more commonly referred to

as molecular graph. Such a transformation is usually carried out by deleting all the C-H bonds

from the molecular structure. Say for example, the molecular structure of propyl alcohol:

        H   H   H
        |   |   |
    H - C - C - C - O - H
        |   |   |
        H   H   H

Figure 2.1: Molecular Structure of Propyl alcohol

This transforms to its molecular graph as:

    C - C - C - O - H

and is called the hydrogen-deleted molecular graph of propyl alcohol. To mimic this with a

mathematical graph, the atoms are called vertices and denoted as ‘.’ or ‘o’. Similarly, the

bonds are called edges and denoted by a small line ‘__’. No difference is made among single,

double and triple bonds. Following this we can write the molecular graph of propyl alcohol:

Figure 2.2: C-H Suppressed Molecular Graph

Note that, as with bonds (edges), no distinction is made among vertices (atoms) [33].

With the development of chemical graph theory it is now argued that the other and

more common way of obtaining the molecular graph is to suppress all bonds to hydrogen in

the molecular structure, i.e., to delete all the carbon-hydrogen as well as heteroatom-hydrogen

bonds from the molecular structure. Under such a situation, the molecular graph of

propyl alcohol reduces to:

Figure 2.3: All Hydrogen suppressed molecular graph of propyl alcohol

Which type of molecular graph should be used has not yet been settled. Consequently, one has to state clearly which molecular graph is being used in a given study.
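The two conventions described above can be made concrete with a short sketch. It is only an illustration: the atom numbering, the dictionary/list representation and the function name `suppress_hydrogens` are assumptions introduced here, not part of any cited method.

```python
# Hypothetical representation: atoms as {index: element}, bonds as index pairs.
# Propyl alcohol, CH3-CH2-CH2-OH, with every hydrogen written out explicitly.
atoms = {0: "C", 1: "C", 2: "C", 3: "O",
         4: "H", 5: "H", 6: "H",   # hydrogens on C0
         7: "H", 8: "H",           # hydrogens on C1
         9: "H", 10: "H",          # hydrogens on C2
         11: "H"}                  # hydrogen on O3
bonds = [(0, 1), (1, 2), (2, 3),
         (0, 4), (0, 5), (0, 6),
         (1, 7), (1, 8),
         (2, 9), (2, 10),
         (3, 11)]


def suppress_hydrogens(atoms, bonds, keep_heteroatom_h=False):
    """Return (vertices, edges) of the molecular graph.

    keep_heteroatom_h=True deletes only the C-H bonds (the O-H hydrogen stays);
    keep_heteroatom_h=False deletes every bond to hydrogen.
    """
    def drop(partner):
        # A hydrogen is dropped if it sits on carbon, or if all H are suppressed.
        return atoms[partner] == "C" or not keep_heteroatom_h

    removed = set()
    for a, b in bonds:
        if atoms[a] == "H" and drop(b):
            removed.add(a)
        elif atoms[b] == "H" and drop(a):
            removed.add(b)
    vertices = sorted(i for i in atoms if i not in removed)
    edges = [(a, b) for a, b in bonds if a not in removed and b not in removed]
    return vertices, edges


ch_deleted = suppress_hydrogens(atoms, bonds, keep_heteroatom_h=True)
all_h_deleted = suppress_hydrogens(atoms, bonds)
print("C-H deleted graph:   ", ch_deleted)
print("all-H deleted graph: ", all_h_deleted)
```

The first graph has five vertices and four edges (C-C-C-O-H), the second four vertices and three edges (C-C-C-O), which is why a study should state which convention it follows.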

2.12. GRAPH INVARIANT (TOPOLOGICAL INDEX)

The aforementioned discussion establishes that, like a mathematical graph, a molecular graph can be defined as

$$G = (V, E) \tag{2.6}$$

That is, a molecular graph (G) consists of a vertex set V and an edge set E.

Now, by imposing certain conditions on vertices (atoms), edges (bonds) or both, one can obtain a number which is called a graph invariant or, more commonly, a topological index. Thus, we generally have the following three types of topological indices:

(1) Atom-weighted topological indices;

(2) Edge-weighted topological indices; and

(3) Atom-Edge weighted topological indices.

A plethora of topological indices is available in the literature, and new types are introduced continually. A detailed study of such indices was made by Basak [34], who classified the topological indices under the following categories:

(1) Topostructural (TS),

(2) Topochemical (TC),

(3) Geometrical (3D)/shape,

(4) Quantum chemical (QC).

Another way of classifying topological indices is based on the matrix used for their calculation. The majority of well-known topological indices are obtained from the adjacency matrix, the distance matrix, or a combination of the two. Thus, we have the following three types of topological indices:

(1) Topological indices derived from adjacency matrix;

(2) Topological indices derived from distance matrix;

(3) Topological indices derived from the combination of adjacency and distance

matrices.
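As a minimal sketch of the matrix-based classification, the code below builds the adjacency and distance matrices for the all-hydrogen-suppressed graph of propyl alcohol (C-C-C-O) and derives one quantity from each. The vertex degrees and the Wiener index (half the sum of all distance-matrix entries) are chosen here only as familiar examples; the text above does not single them out, and the vertex numbering is an assumption.

```python
from collections import deque

edges = [(0, 1), (1, 2), (2, 3)]   # C-C-C-O, all hydrogens suppressed
n = 4

# Adjacency matrix: A[i][j] = 1 when vertices i and j share an edge.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1


def distance_matrix(A):
    """Topological distances (edges on the shortest path) by breadth-first search."""
    size = len(A)
    D = [[0] * size for _ in range(size)]
    for start in range(size):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            v, d = queue.popleft()
            D[start][v] = d
            for w in range(size):
                if A[v][w] and w not in seen:
                    seen.add(w)
                    queue.append((w, d + 1))
    return D


D = distance_matrix(A)
degrees = [sum(row) for row in A]            # derived from the adjacency matrix
wiener = sum(sum(row) for row in D) // 2     # derived from the distance matrix
print("degrees:", degrees)
print("Wiener index:", wiener)
```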

Yet another classification of topological indices is made as below :

(1) Fragment constants descriptors;

(2) Conformational descriptors;

(3) Electronic descriptors;

(4) Reactor descriptors;

(5) Quantum chemical descriptors;

(6) Graph theoretical descriptors;

(7) Topological descriptors;

(8) Information content descriptors;

(9) Molecular shape descriptors;

(10) Spatial descriptors;

(11) Structural descriptors;

(12) Thermodynamic descriptors;

(13) pKa descriptors;

(14) Molecular field descriptors;

(15) Receptor surface analysis descriptors, etc.

Subsequently, the descriptors are classified as :

(1) 2D descriptors, and (2) 3D descriptors.

The present study relates to 2D descriptors only.

2.13. COMPOUND SELECTION IN QSAR

The backbone of a successful QSAR/QSTR study is the selection of compounds. The importance of compound selection in QSAR was first pointed out by Hansch and Unger (1973) [35] and by Wootton et al. (1975) [62]; Austel (1982) [37] was the first to use fractional factorial designs in QSAR, which was followed by the general treatise of Hellberg et al. (1987) [38,42]. So far, however, statistical designs have been used very little in QSAR development, with the result that the general quality and predictive power of the models are low.

Carlson et al. (1987)[39] have used statistical designs extensively in the closely

related area of structure-reactivity-modeling in their investigations of synthetic methods in

organic chemistry.

The problem of compound selection in QSAR has drawn attention only recently, and

is still largely neglected, especially in the area of environmental toxicology applications

where no examples are found in the literature [39-44].

2.14. MULTIVARIATE ANALYSIS: A STATISTICAL TEST FOR DRUG DESIGNING

Multiple regression does two things that are very desirable. For prediction studies, multiple regression makes it possible to combine many molecular descriptors (topological indices) to produce optimal predictions of the biological activity. For causal analysis, multiple regression separates the effects of molecular descriptors (topological indices) on the biological activity so that one can examine the unique contribution of each molecular descriptor (topological index). In the following sections we will look closely at how these two goals are accomplished.

In the last three decades, statisticians have introduced a number of more sophisticated

methods that achieve similar goals. These methods go by such names as logistic regression,

Poisson regression, structural equation models and survival analysis. Despite the arrival of

these alternatives multiple regression has retained its popularity, in part because it is

comparatively easier to use and easier to understand.

Regression analysis is a simple method for investigating functional relationships among variables. Such a relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor (independent) variables. We denote the response variable by $Y$ and the set of predictor variables by $X_1, X_2, \ldots, X_p$, where $p$ denotes the number of predictor variables.

A regression equation containing only one predictor variable is called a simple regression equation and is represented by the following model:

$$Y = \beta_0 + \beta_1 X \tag{2.7}$$

Here, the coefficient $\beta_1$ is called the slope and $\beta_0$ is called the constant coefficient or intercept.

An equation containing more than one predictor variable is called a multiple regression equation and is represented as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{2.8}$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are constants referred to as the partial regression coefficients of the model. The magnitudes of $\beta_1, \beta_2, \ldots, \beta_p$ play a dominant role in deciding whether or not the proposed regression equation (model) is statistically significant. If a coefficient is smaller in magnitude than its standard error (or % variance), the model is statistically useless and has to be discarded. In addition, the magnitudes of the regression coefficients give an idea of their participation in the proposed model. It is worth noting that even with an excellent value of the correlation coefficient (R), the model needs to be discarded if the situation just described exists.

As regards the number of predictors that may be used in a statistically significant model, a rule of thumb described in the literature states that the number of compounds should be three to four times the number of predictors. Thus, if we have a set of 100 compounds whose activity is to be modeled, then, taking the factor of four, we can use at most 25 predictors. The lower the number of predictors and the larger the sample, the better the proposed model.

It should also be kept in mind that, in multiple linear regression analysis, the proposed model should not contain highly linearly related descriptors; otherwise the model suffers from the defect of collinearity, i.e., the descriptors are inter-correlated. Inter-correlated descriptors are retained in the model only when their introduction brings a consistent improvement in the adjusted $R^2$ (i.e., $R^2_A$) and their coefficients are larger than the respective standard deviations.

2.15. STATISTICAL INDICATORS FOR THE QUALITY OF MODELS

The quality of the fit to the training data is expressed by the correlation coefficient $R$ or by the standard error of estimation $S'$, computed as:

$$S' = \sqrt{\frac{\sum_{i=1}^{M} (A_i - A_{est,i})^2}{M}} \tag{2.9}$$

where $A_i$, $A_{est,i}$ and $M$ denote the experimental activities, the estimated activities and the total number of cases (molecules) considered, respectively. However, a more important measure of the predictive quality is given by the cross-validated statistical parameters: the correlation coefficient $R_{CV}$ and the standard error $S'_{CV}$, which are calculated using the experimental activities $A_{CV,i}$ and the estimated activities $A_{estCV,i}$ based on the leave-one-out cross-validation procedure:

$$S'_{CV} = \sqrt{\frac{\sum_{i=1}^{M} (A_{CV,i} - A_{estCV,i})^2}{M}} \tag{2.10}$$

The Student t-test can be used to assess the statistical significance of the calculated regression parameter $b_1$. The expression used for this purpose is:

$$t = \frac{b_1 \left[\sum (X_i - \bar{X})^2\right]^{1/2}}{S} \tag{2.11}$$

If the value of $t$ is greater than the tabulated value at the required confidence level and $n-2$ degrees of freedom, then the slope of the line is significantly different from zero. The significance of the intercept (i.e., whether $b_0 = 0$) can be assessed in a parallel manner. For simple regression, the $t$ calculated by the above equation is the square root of the $F$ calculated by the following equation:

$$F = \frac{R^2 (n-2)}{1 - R^2} \tag{2.12}$$

To make the aforementioned treatment of simple regression analysis clearer, we calculate some of the simple relationships using specific numbers. Consider an activity Y modeled by a molecular descriptor X, using the data presented in Table 2.1.

Table 2.1. Data for modeling activity Y by molecular descriptor X

S.No.    X        Y        X^2      Y^2      XY
1.       0.836    1.43     0.699    2.045    1.195
2.       1.171    1.74     1.371    3.028    2.038
3.       1.489    2.45     2.217    6.002    3.648
4.       1.644    2.26     2.703    5.108    3.715
5.       1.754    2.70     3.076    7.290    4.736

n = 5    ΣX = 6.894    ΣY = 10.58    ΣX^2 = 10.066    ΣY^2 = 23.473    ΣXY = 15.332

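The first step in obtaining the regression expression is to estimate the regression parameters $b_0$ and $b_1$ using the following expressions:

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} \tag{2.13}$$

$$b_0 = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n\sum X^2 - \left(\sum X\right)^2} \tag{2.14}$$

Substituting the sums from Table 2.1 into expressions (2.13) and (2.14) gives:

$$b_1 = \frac{5(15.332) - (6.894)(10.580)}{5(10.066) - (6.894)^2} = 1.3278 \tag{2.15}$$

and

$$b_0 = \frac{(10.580)(10.066) - (6.894)(15.332)}{5(10.066) - (6.894)^2} = 0.2854 \tag{2.16}$$

These parameters yield the following regression equation:

$$Y = 1.3278\,X + 0.2854 \tag{2.17}$$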
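The normal-equation estimates of Eqs. (2.13) and (2.14) can be checked with a few lines of code. This is a sketch using the raw X and Y values of Table 2.1; the function name `simple_regression` is an assumption made here. Because the text works from sums rounded to three decimals, the results agree with Eqs. (2.15) and (2.16) only to about three significant figures.

```python
# Data of Table 2.1 (molecular descriptor X, activity Y).
X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]


def simple_regression(x, y):
    """Least-squares intercept b0 and slope b1 from the sums in Eqs. (2.13)-(2.14)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = n * sxx - sx ** 2
    b1 = (n * sxy - sx * sy) / denom
    b0 = (sy * sxx - sx * sxy) / denom
    return b0, b1


b0, b1 = simple_regression(X, Y)
print(f"slope b1 = {b1:.4f}, intercept b0 = {b0:.4f}")
```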

The quality of this regression expression is first determined by calculating: (1) standard

deviation in X, Y, viz., Sx, Sy; (2) sample variance, i.e., Sx2, and (3) sample standard deviation,

Sx. These quantities are determined as below:

2.16. CALCULATION OF STANDARD DEVIATION IN X AND Y: SX AND SY

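The standard deviations in X and Y are calculated using the following expressions:

$$S_x = \sqrt{\frac{n\sum X^2 - \left(\sum X\right)^2}{n(n-1)}} = \sqrt{\frac{5(10.066) - (6.894)^2}{5(5-1)}} = 0.3744 \tag{2.18}$$

$$S_y = \sqrt{\frac{n\sum Y^2 - \left(\sum Y\right)^2}{n(n-1)}} = \sqrt{\frac{5(23.473) - (10.58)^2}{5(5-1)}} = 0.5209 \tag{2.19}$$

Equivalently, the sample variance can be written in terms of deviations from the mean:

$$S_x^2 = \frac{\sum (X - \bar{X})^2}{n-1} \tag{2.20}$$

The data needed for the calculation of the sample variance $S_x^2$ are shown in Table 2.2, which yields:

$$S_x^2 = \frac{0.5609}{5-1} = 0.1402 \tag{2.21}$$

This gives the sample standard deviation $S_x$ as 0.3744.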
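The sum-based form and the deviation-based form of the sample standard deviation are algebraically identical, which the following sketch confirms numerically for the data of Table 2.1. The function names are assumptions introduced for this illustration; the values come out at roughly 0.374 for X and 0.521 for Y.

```python
import math

X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]


def sd_from_sums(v):
    """Sample standard deviation from the sums: sqrt((n*S(v^2) - S(v)^2) / (n(n-1)))."""
    n = len(v)
    return math.sqrt((n * sum(a * a for a in v) - sum(v) ** 2) / (n * (n - 1)))


def sd_from_deviations(v):
    """Sample standard deviation from deviations: sqrt(S((v - mean)^2) / (n-1))."""
    n = len(v)
    mean = sum(v) / n
    return math.sqrt(sum((a - mean) ** 2 for a in v) / (n - 1))


s_x = sd_from_sums(X)
s_y = sd_from_sums(Y)
print(f"S_x = {s_x:.4f}, S_y = {s_y:.4f}")
print(f"deviation form for X: {sd_from_deviations(X):.4f}")
```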

Note that the sample variance is a measure of data dispersion, i.e., the average of the squared differences between the individual measurements and the mean (with n - 1 rather than n in the denominator). The sample standard deviation has the same interpretation as the population standard deviation.

Table 2.2. Data needed for the calculation of sample variance (Sx2) and sample standard deviation (Sx)

S.No.    X        X̄         X - X̄      (X - X̄)^2
1.       0.836    1.3788    -0.5428    0.2946
2.       1.171    1.3788    -0.2078    0.0432
3.       1.489    1.3788     0.1102    0.0121
4.       1.644    1.3788     0.2652    0.0703
5.       1.754    1.3788     0.3752    0.1407

n = 5    ΣX = 6.894    Σ(X - X̄)^2 = 0.5609

The next step in determining the quality of the regression equation is to calculate the correlation coefficient, r (R in the case of linear multiple regression), using:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \tag{2.22}$$

$$r = \frac{5(15.332) - (6.894)(10.580)}{\sqrt{[5(10.066) - 47.527][5(23.473) - 111.936]}} = 0.9541 \tag{2.23}$$

This gives $r^2 = 0.9103$, meaning that the molecular descriptor X accounts for 91.03% of the variance in the activity Y. This means that X is an appropriate parameter for modeling Y. This is further established by estimating Y from the proposed regression equation (model), comparing the estimates with the observed (experimental) Y values, and calculating the residual, i.e., the difference between the observed (experimental) and estimated (calculated from the proposed regression equation) values of Y. The data needed for this purpose are given in Table 2.3.

Table 2.3. The data needed for calculating the residual

S.No.    Yobs     Yest     Yobs - Yest = Residual
1.       1.430    1.396     0.034
2.       1.740    1.840    -0.100
3.       2.450    2.262     0.188
4.       2.260    2.468    -0.208
5.       2.700    2.614     0.086

Thus, for each observation the observed minus the calculated value is termed as the

residual. Analysis of trends in the residuals is used to explore which new variable will result

in an equation with a higher r2 (R2) and lower S.
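The residuals of Table 2.3 follow directly from the fitted line. The sketch below plugs the coefficients quoted in Eq. (2.17) into the data of Table 2.1; the variable names are assumptions, and the last digit may differ slightly from the table because the table subtracts already-rounded estimates.

```python
X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y_obs = [1.43, 1.74, 2.45, 2.26, 2.70]

b1, b0 = 1.3278, 0.2854            # slope and intercept quoted in Eq. (2.17)

Y_est = [b1 * x + b0 for x in X]
residuals = [yo - ye for yo, ye in zip(Y_obs, Y_est)]

for i, (yo, ye, res) in enumerate(zip(Y_obs, Y_est, residuals), start=1):
    print(f"{i}. Yobs = {yo:.3f}  Yest = {ye:.3f}  residual = {res:+.3f}")
```

A residual pattern that trends with X, or with a candidate descriptor not yet in the model, is the kind of signal the paragraph above refers to when it speaks of exploring new variables.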

Finally, the F statistic is calculated using:

$$F_{k,\,n-k-1} = \frac{r^2 (n-k-1)}{k\,(1 - r^2)} \tag{2.24}$$

where $k$ is the number of molecular descriptors ($X_n$) used for modeling the activity Y. In our case, $k = 1$, as only one molecular descriptor (X) is used for modeling activity Y. Hence,

$$F_{1,3} = \frac{0.9103 \times 3}{1 \times 0.0897} = 30.444 \tag{2.25}$$
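The correlation coefficient of Eq. (2.22) and the F value of Eq. (2.24) can be reproduced from the raw data of Table 2.1 as follows. This is a sketch; `pearson_r` is a name chosen here. Working from unrounded sums gives r of about 0.954 and F of about 30.5, which differs from the 30.444 quoted in the text only through rounding of the intermediate sums.

```python
import math

X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]


def pearson_r(x, y):
    """Correlation coefficient from the sums, as in Eq. (2.22)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


n, k = len(X), 1                     # five compounds, one descriptor
r = pearson_r(X, Y)
r2 = r * r
F = r2 * (n - k - 1) / (k * (1 - r2))   # Eq. (2.24)
print(f"r = {r:.4f}, r^2 = {r2:.4f}, F(1,3) = {F:.2f}")
```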

A regression analysis is performed to establish and quantitate the dependence of one variable on one or several others. For example, regression analysis in structure-activity

(QSAR) studies might be used to examine the dependence of relative potency on activity. A

very closely related concept is correlation analysis. This type of calculation is used to

establish and quantitate the interdependence or relatedness between two variables. The

calculation of the correlation coefficient and statistical tests of its significance are identical to

those of the square root of R2 (r2) discussed above with the exception that the direction of the

relationship is maintained as the sign of r(R).

The interpretation of the meaning of the coefficient differs from regression to

correlation. Variables which are correlated are said to covary. Correlation coefficients are

used in structure-activity studies to assess the degree of relatedness of the prediction variables

used for regression analysis. Highly correlated variables do not provide independent

information for the regression analysis.

The calculation of the coefficients in the linear multiple regression equation is identical to that for the simple least squares line, with the exception that $b_1$, $X_i$ and $X$ of equations 2.13 and 2.14 refer to vectors (lists) of predictor properties (molecular descriptors) rather than single values. The vector $X_i$, for example, contains the value of each physical property (molecular descriptor) of compound i.

Very often, linear multiple regression suffers from defects due to collinearity and the presence of compound outliers. A diagnostic for multicollinearity between descriptors in the model is the variance inflation factor (VIF), defined by:

$$VIF = \frac{1}{1 - R^2} \tag{2.26}$$

where $R^2$ is the squared multiple correlation coefficient obtained when one descriptor is regressed on all the other descriptors in the model. A VIF value less than 10 indicates that the model contains no serious multicollinearity.
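A VIF calculation can be sketched as below, assuming NumPy is available and taking the R2 of Eq. (2.26) to be the one obtained by regressing each descriptor on all the others (the usual reading). The function name `vif` and the descriptor values are made up for illustration only.

```python
import numpy as np


def vif(descriptors):
    """One VIF per column of an (n_compounds, n_descriptors) array, via Eq. (2.26)."""
    X = np.asarray(descriptors, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + other descriptors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Made-up descriptor values, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])   # nearly 2 * x1, hence collinear
x3 = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 3.0])

vif_all = vif(np.column_stack([x1, x2, x3]))
print("VIF per descriptor:", np.round(vif_all, 1))
```

With these numbers the first two descriptors exceed the threshold of 10 by a wide margin, which flags the pair as collinear.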

Very recently, Randic [63] has stated that many physical properties (molecular descriptors), in spite of their high correlation, can be used together in the regression expression, as many of them have different information content. In such cases the proposed regression equation (model) is considered statistically significant only when the coefficients of the molecular descriptors involved in the regression equation (model) are significantly larger than their standard errors.

The reliability of the proposed regression equation (model) is determined by estimating the probable error of the correlation coefficient (PE) using:

$$PE = \frac{2}{3} \cdot \frac{1 - r^2}{\sqrt{n}} \tag{2.27}$$

If r < PE, the correlation is not significant; if r is several times (at least three times) greater than PE, correlation is indicated; and if r > 6PE, the correlation is definitely good.

In the case considered, we have:

$$PE = \frac{2}{3} \cdot \frac{1 - 0.9103}{\sqrt{5}} = 0.0266 \tag{2.28}$$

$$6PE = 0.1599 \tag{2.29}$$

That is, r (0.9541) > 6PE, indicating that the proposed correlation (regression equation) is definitely good for modeling the activity Y.

All the aforementioned statistics can be computed manually, using either paper and pencil or a scientific calculator. A programmable scientific calculator such as the Hewlett-Packard HP-11C is well suited to this purpose.

2.17. IMPORTANT STATISTICAL PARAMETERS

2.17.1. Standard deviation:

Standard deviation (SD) is the square root of the variance, or the root mean square (RMS) of the deviations:

$$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \tag{2.30}$$

where $n - 1$ is the number of degrees of freedom, i.e., the number of parameters to be determined is subtracted from the total number of observations. This is the standard deviation usually employed in statistics, which measures the dispersion of a data set in relation to the arithmetic mean; that is, it is a measure of the magnitude of the residuals, accounting for accuracy.

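The standard deviation of the dependent variable, before trying to fit any model, represents the amount of variation in the dependent variable, and the error represents the variation left over after fitting the model. The corresponding expression for the independent variable is:

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \tag{2.31}$$

Another form of the standard deviation of x and y, respectively, is:

$$S_X = \left[\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}\right]^{1/2} \tag{2.32}$$

$$S_Y = \left[\frac{n\sum y^2 - \left(\sum y\right)^2}{n(n-1)}\right]^{1/2} \tag{2.33}$$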

2.17.2. Covariance

Covariance is the average of the products of deviations from the means for the {$x_i$, $y_i$} pairs, namely:

$$c_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \tag{2.34}$$

This measure, which depends on the scaling of the features, describes the relationship

between the x and y features. The covariance is the mean value of all the pairs of

differences from the mean for independent variables multiplied by the differences from the

mean for dependent variables. If x and y are not closely related to each other, they do not

co-vary, so the covariance is small, so the correlation is small. If x and y are closely related,

CXY turns out to be almost the same as the product of standard deviations of x and y, so the

correlation is almost 1.

2.17.3. Correlation Coefficient.

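The correlation coefficient is a normalized covariance, independent of scaling:

$$r = \frac{c_{XY}}{s_x s_y}$$

It measures the quality of adjustment, that is, the degree of correlation between x and y, and detects whether the variables contain redundant information and are thus highly correlated.

Substituting the corresponding identities in the previous equation:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2} \left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]^{1/2}} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]^{1/2} \left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]^{1/2}} \tag{2.35}$$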

The correlation coefficient is a measure of the degree of linearity of the relationship,

i.e. it indicates the extent to which the pairs of numbers for these two variables lie on a

straight line.

Correlation coefficients for the variables in a dataset are compiled in a correlation

matrix, which shows the correlation of one descriptor with another, and thus the

relationships among descriptors. This matrix is a symmetric matrix in which the diagonal

elements are one and the off-diagonal elements are the correlation coefficients for the

appropriate variable pairs. The correlation coefficients for independent variables that are not

correlated, i.e. orthogonal variables are zero.

The addition of new parameters to the model always increases the r value, unless the new parameter is a constant or a linear combination of other parameters, which would not produce any effect. The increase in r when adding new parameters can result in overfitting, that is, a spurious correlation.

2.17.4. Multiple Determination Coefficient

The multiple determination coefficient is the squared correlation coefficient, used to describe the goodness-of-fit of the data. It indicates how well the model reproduces the experimental data. However, when a large number of free parameters intervene in the model, $r^2$ can come arbitrarily close to one.

$$r^2 = \frac{c_{XY}^2}{s_x^2\, s_y^2} \tag{2.36}$$

An alternative definition of the squared correlation coefficient can be deduced:

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2.37}$$

where $y'_i$ denotes the value of $y_i$ fitted by the model. The multiple determination coefficient is a quantitative measure of the precision of adjustment of the fitted values to the observed ones; it measures the fraction of the variance explained by the model. The coefficient mainly indicates whether the variation of y explained by the regression equation permits one to assume that there is a linear relationship between y and x.

The squared coefficient multiplied by 100 is the percentage of the total variance explained by the model. This percentage expresses the strength of the relationship between x and y.

$r^2$ is defined on the [0, 1] interval, that is, it ranges from 0 to 1. The closer it is to unity, the more similar the adjusted values are to the experimental ones. The limiting case, $r^2 = 1$, is obtained when all the residuals are null, that is, when the residual sum of squares is zero and the model fits the data exactly.

It must be noted, however, that a coefficient close to unity does not mean that the model is good; the simple addition of parameters to the regression increases $r^2$, even if the newly added descriptor does not contribute to the model. To determine the predictive capacity of the model, other measures are required.

2.17.5. Fisher Statistic

The Fisher statistic (F) is one of several variance-related parameters that can be used to compare two models differing by one or more variables. This statistic is used to determine whether a more complex model is significantly better than a less complex one.

$$F(k,\, n-k-1) = \frac{RSS}{k\, s^2} \tag{2.38}$$

where RSS is taken here to be the sum of squares due to regression, $s^2$ is an unbiased estimate of the residual or error variance, and $(k,\, n-k-1)$ are the degrees of freedom.

Another form of the Fisher statistic is:

$$F_{test} = \frac{R^2 (N-k-1)}{k\,(1-R^2)} \tag{2.39}$$

where N = number of samples, k = number of descriptors and R = correlation coefficient. The F test reflects the ratio of the variation explained by the model to the variation not explained by the model, so a higher value of F indicates a more statistically significant model.

The F statistic is computed and compared with standard tabulated values. If F is larger than the tabulated value, the more complex model can be accepted as significant.

2.17.6. Probable Error of Estimation.

The probable error of estimation is defined by the expression:

$$PE = \frac{2}{3} \cdot \frac{1 - R^2}{\sqrt{N}} \tag{2.40}$$

It is argued that:

1. If R < PE, R is not significant.

2. If R > 6 PE, the model is definitely good.

2.17.7. Degree of Lack of Relationship

$$K = \sqrt{1 - R^2} \tag{2.41}$$

K indicates the degree of lack of relationship. The lower the value of K, the better the predictivity of the model.

2.17.8. Index of Forecasting Efficiency

The index of forecasting efficiency (E) is defined as the percentage reduction in errors of prediction by reason of the correlation between two variables. The higher the value of E, the better the model.

$$E = 100 \left(1 - \sqrt{1 - R^2}\right) \tag{2.42}$$
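The three quantities of Sections 2.17.6 to 2.17.8 are simple functions of R2 and the sample size. The sketch below evaluates them for the worked example (R2 = 0.9103, N = 5), reading Eqs. (2.40) to (2.42) as PE = (2/3)(1 - R2)/sqrt(N), K = sqrt(1 - R2) and E = 100(1 - sqrt(1 - R2)); that reading, and the function names, are assumptions made for this illustration.

```python
import math


def probable_error(r2, n):
    """PE = (2/3) * (1 - R^2) / sqrt(N)."""
    return (2.0 / 3.0) * (1.0 - r2) / math.sqrt(n)


def lack_of_relationship(r2):
    """K = sqrt(1 - R^2); lower is better."""
    return math.sqrt(1.0 - r2)


def forecasting_efficiency(r2):
    """E = 100 * (1 - sqrt(1 - R^2)); higher is better."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r2))


r, r2, n = 0.9541, 0.9103, 5          # values from the worked example
pe = probable_error(r2, n)
K = lack_of_relationship(r2)
E = forecasting_efficiency(r2)
print(f"PE = {pe:.4f}, 6PE = {6 * pe:.4f}, r > 6PE: {r > 6 * pe}")
print(f"K = {K:.4f}, E = {E:.2f}%")
```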

2.17.9. Friedman's Lack-of-Fit (LOF) Measure

$$LOF = \frac{LSE}{\left(1 - \dfrac{k + d\,p}{N}\right)^2} \tag{2.43}$$

where LSE = least-squares error,

k = number of descriptors + 1,

p = number of independent parameters,

N = number of samples,

d = 1.

The advantage of using LOF rather than LSE directly is that LOF does not necessarily decrease as the number of descriptors increases. A lower value of LOF indicates a better model.

2.17.10. T: Significance Level

$$T_{test} = R \left[\frac{N-2}{1-R^2}\right]^{1/2} \tag{2.44}$$

where N = number of samples.

The T test is a measure of the significance of the regression coefficient. A higher value of T indicates a more significant regression model.

2.17.11. Sum of squares of deviations from the mean:

The sum of squares of deviations from the mean is a very important statistical parameter. It is calculated by subtracting the mean from each observation, squaring each deviation, and summing these squares.

$$SS_{mean} = \sum (X - \bar{X})^2 \tag{2.45}$$

2.17.12. Estimation of variance:

An estimate of the variance of a population is the sum of squares of deviations from the sample mean, divided by the degrees of freedom:

$$S^2_{mean} = \frac{\sum (X - \bar{X})^2}{n-1} \tag{2.46}$$

This quantity forms the basis for very useful statistical calculations called analysis of

variance. Analysis of variance is discussed briefly below.

2.17.13. Outliers:

These are the points in the fitted model which have a large residual.

$$C_i = \frac{r_i^2}{p+1} \cdot \frac{P_{ii}}{1 - P_{ii}}, \qquad i = 1, 2, \ldots, n \tag{2.47}$$

2.17.14. Dealing with outliers:

Outliers and influential observations should not routinely be deleted or automatically

down-weighted because they are not necessarily bad observations. On the contrary, if they

are correct, they may be the most informative points in the data. This may indicate that the

data did not come from a normal population or that the model is not linear.

2.17.15. Multicollinearity:

Multicollinearity is defined as the condition of severe non-orthogonality among the predictor variables. It can be investigated thoroughly by examining the values of $R^2$ that result from regressing each of the predictor variables against all the others.

2.17.16. Condition Number (K) of the correlation matrix :

The condition number of the correlation matrix is a measure of the overall multicollinearity of the variables. It is defined as:

$$K = \sqrt{\frac{\lambda_1}{\lambda_p}} = \sqrt{\frac{\text{maximum eigenvalue of the correlation matrix}}{\text{minimum eigenvalue of the correlation matrix}}} \tag{2.48}$$

The condition number is never less than one. A large value of K is evidence of strong collinearity. The harmful effects of collinearity in the data become strong when the value of K exceeds 15 (which means that $\lambda_1$ is more than 225 times $\lambda_p$).

Another empirical criterion for the presence of multicollinearity is given by the sum of the reciprocals of the eigenvalues, that is,

$$\sum_{j=1}^{p} \frac{1}{\lambda_j} \tag{2.49}$$
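Both eigenvalue-based diagnostics can be sketched with NumPy, reading Eq. (2.48) as the square root of the ratio of the largest to the smallest eigenvalue of the descriptor correlation matrix. The function name `collinearity_diagnostics` and the descriptor values are made up for illustration.

```python
import numpy as np


def collinearity_diagnostics(descriptors):
    """Return (condition number K, sum of reciprocal eigenvalues) of the correlation matrix."""
    X = np.asarray(descriptors, dtype=float)
    corr = np.corrcoef(X, rowvar=False)      # descriptor-by-descriptor correlations
    eig = np.linalg.eigvalsh(corr)           # eigenvalues in ascending order
    K = np.sqrt(eig[-1] / eig[0])
    return K, np.sum(1.0 / eig)


# Made-up descriptor values, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])   # nearly 2 * x1, hence collinear
x3 = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 3.0])

K_all, recip_all = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
print(f"K = {K_all:.1f}, sum of 1/lambda = {recip_all:.1f}")
```

For a pair of descriptors with correlation r, the eigenvalues of the correlation matrix are 1 + |r| and 1 - |r|, so K grows without bound as |r| approaches one.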

2.17.17. Mallows Cp (Cp-Statistics):

Mallows used the $C_p$ statistic for the estimation of $J_p$:

$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n - 2p) \tag{2.50}$$

where $\hat{\sigma}^2$ is an estimate of $\sigma^2$ and is usually obtained from the linear model with the full set of q variables. The expected value of $C_p$ is p when there is no bias in the fitted equation containing p terms. Consequently, the deviation of $C_p$ from p can be used as a measure of bias. The $C_p$ statistic, therefore, measures the performance of the variables in terms of the standardized total mean squared error of prediction for the observed data points, irrespective of the unknown true model. It takes into account both the bias and the variance.

2.17.18. Cross validation:

Cross validation is a technique for evaluating the prediction ability of a model for a

given data set. Nowadays, cross validation constitutes the basis for the modern statistical

philosophy of “replacing standard assumptions about the data with massive calculations” for

assessing the generality of a relationship found from a sample data set.

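If the PRESS value (the predictive residual sum of squares obtained from the cross-validated predictions) is transformed into a dimensionless form by relating it to the initial sum of squares (SSY), one obtains $Q^2$, i.e., the complement to the fraction of unexplained variance over the total variance (see $R^2$, above):

$$Q^2 = 1 - \frac{PRESS}{SSY} \tag{2.51}$$

or

$$R^2_{pred} = \frac{SSY - PRESS}{SSY} \tag{2.52}$$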

2.17.19. Predictive Squared Error (PSE):

Predictive squared error (PSE) is the square root of the ratio of PRESS to N:

$$PSE = \sqrt{\frac{PRESS}{N}} \tag{2.53}$$

It is more directly related to the uncertainty of the predictions, since it has the same

units as the actual y values.

2.17.20. Standard deviation of error of predictions (SDEP) and standard deviation of error of calculations (SDEC):

SDEP and SDEC are the terms used to distinguish between predictions for new data points and calculations for the data points already used in modeling. They are defined as:

$$SDEP = \left[\frac{\sum (y - y_{PRED})^2}{N}\right]^{1/2} \tag{2.54}$$

$$SDEC = \left[\frac{\sum (y - y_{CALC})^2}{N}\right]^{1/2} \tag{2.55}$$

Note that SDEC can also be called RSD (standard deviation of residuals).

2.17.21. Uncertainty of the Prediction (SPRESS):

The uncertainty of the prediction ($S_{PRESS}$) is defined as

$$S_{PRESS} = \left[\frac{PRESS}{n-k-1}\right]^{1/2} \tag{2.56}$$

where k is the number of variables in the model and n is the number of compounds used in the study.
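The cross-validation statistics of Sections 2.17.18 to 2.17.21 can be illustrated with a leave-one-out loop. The sketch below assumes the one-descriptor model and the data of Table 2.1 (so k = 1), takes SSY as the sum of squared deviations of Y from its mean, and uses function names introduced only for this example.

```python
import math

X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]


def fit(x, y):
    """Least-squares intercept and slope for one descriptor."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return (sy - b1 * sx) / n, b1


def loo_press(x, y):
    """PRESS: each compound is predicted by a model fitted without it."""
    press = 0.0
    for i in range(len(x)):
        b0, b1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    return press


n, k = len(X), 1
mean_y = sum(Y) / n
ssy = sum((v - mean_y) ** 2 for v in Y)

b0, b1 = fit(X, Y)
rss = sum((yo - (b0 + b1 * xo)) ** 2 for xo, yo in zip(X, Y))   # fitted residuals
press = loo_press(X, Y)

r2 = 1.0 - rss / ssy
q2 = 1.0 - press / ssy                      # Eq. (2.51)
pse = math.sqrt(press / n)                  # Eq. (2.53)
s_press = math.sqrt(press / (n - k - 1))    # Eq. (2.56)
print(f"R^2 = {r2:.3f}, Q^2 = {q2:.3f}, PSE = {pse:.3f}, S_PRESS = {s_press:.3f}")
```

Q2 comes out lower than R2, as expected: each left-out compound is predicted by a model that never saw it, so PRESS is larger than the fitted residual sum of squares.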

2.18. EVALUATION OF THE PREDICTIVE CAPACITY OF THE MODEL: VALIDITY TECHNIQUES

Once the regression equation is obtained, in addition to the goodness of fit and the

stability of the model, it is also important to evaluate the robustness and the predictive

capacity or validity of the model before using the model on the interpretation and prediction

of the biological activity.

To validate a method is to establish the reliability and relevance of the method for a

particular purpose. The reliability refers to the reproducibility of results, the relevance is

related to the scientific use and practical usefulness, and the purpose refers to the intended

application. The validation of a QSAR model is the process by which the predictive ability of

a QSAR and the mechanistic basis are assessed for practical purposes. Validation assesses

if the model accurately represents the reality, from the perspective of the intended model

application.

Special attention must be paid to outliers, i.e., structures that do not fit the model and have a residual greater than two times the standard deviation of the residuals. Once they are identified, diagnostic data that help in making decisions about them should be examined. Outliers should be removed iteratively from the observations used to calculate the QSAR equation, and the equation recalculated until satisfactory results are obtained.

It may happen, for example, that the structure of one or more elements of the training set differs significantly from the rest, so that these elements determine the quality and shape of the model. Several procedures can be used to check the reliability and significance of the model, i.e., that the size of the model is appropriate for the quantity of data available, as well as to provide some estimate of how well the model can predict the activity of new, not yet synthesized molecules. There are two techniques to determine the confidence and robustness of the model, namely internal and external validation techniques.

2.19. SIGNIFICANCE OF ADJUSTED R2 (R2A)

The adjusted $R^2$ ($R^2_A$) [64] is an important statistical parameter used to judge whether or not an added variable contributes its fair share to the model. It is defined as:

$$R^2_A = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} \tag{2.57}$$

where n is the number of compounds and k is the number of independent variables. That is, $R^2_A$ is $R^2$ adjusted for the number of variables in the model. If a variable is added that does not contribute its fair share, $R^2_A$ will actually decline.

$R^2_A$ is particularly important when the number of independent variables is large relative to the sample size, since it takes into account the relationship between sample size and number of variables. $R^2$ may appear artificially high if the number of independent variables is high compared to the sample size. $R^2_A$ is a measure of the percentage of explained variation in the dependent variable that takes into account the relationship between the number of cases and the number of independent variables in the regression model. Whereas $R^2$ will always increase when an independent variable is added, $R^2_A$ will decrease if the added variable does not reduce the unexplained variation enough to offset the loss of degrees of freedom.
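The penalty built into Eq. (2.57) can be seen in a small numerical sketch. The first call uses the worked example of this chapter (R2 = 0.9103, n = 5, k = 1); the second uses a hypothetical R2 of 0.9150 for a two-descriptor model, a value invented here purely to show the effect.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n compounds and k independent variables, Eq. (2.57)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


one_descriptor = adjusted_r2(0.9103, n=5, k=1)
two_descriptors = adjusted_r2(0.9150, n=5, k=2)   # hypothetical R^2 after adding a weak descriptor
print(f"R2A with one descriptor:  {one_descriptor:.4f}")
print(f"R2A with two descriptors: {two_descriptors:.4f}")
```

Although R2 rises slightly when the second descriptor is added, the adjusted value falls, which is the signal that the added variable does not contribute its fair share.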

2.20. MULTICOLLINEARITY PROBLEM

The problem of multicollinearity in the model has been addressed by Randic [63]. According to him, the selection of descriptors to be used in QSPR and QSAR (Quantitative Structure-Activity Relationship) statistics should not be delegated solely to computers, although statistical criteria will continue to be useful for the preliminary screening of descriptors taken from a large pool. Often, in an automated selection of descriptors, a descriptor will be discarded because it is highly correlated with another descriptor already selected. What is important, however, is not whether two descriptors parallel one another, that is, duplicate much of the same structural information, but whether they differ in those parts that are important for QSPR and QSAR correlations. If they differ in the domain which is important for the property, activity, or toxicity considered, both descriptors should be retained. If they differ in parts that are not relevant for the correlation of the considered property, activity or toxicity, then one of them can be discarded. That is, despite high collinearity among descriptors, they may have different information content and thus be retained in the model. Such a model is then considered statistically significant.

2.21. APPROPRIATE SOFTWARE

Appropriate software, to be used for regression analysis, should provide the following

information:

(1) Dependent variable.

(2) Independent variable.

(3) Coefficient of independent variable.

(4) Standard error (95% confidence) of independent variable.

(5) Student t-test.

(6) Probability.

(7) Partial r².

(8) Standard error of estimate.

(9) Adjusted R² (R²A).

(10) R².

(11) Multiple R.

(12) Analysis of variance.

(13) F-Ratio.

(14) Significance level of F.

(15) Residuals.

(16) Durbin-Watson Test.

Such information is provided by many software packages, including MSTAT [65]. The

statistics reported by MSTAT are listed in Table 2.4.

Table 2.4. Information contained in MSTAT

Dependent Variable : Y

Var      Regression coefficient   Std. Error   T(DF)   Prob.   Partial r²
X1       C1
X2       C2
X3       C3
...      ...
Xn       Cn
Const.

Std. Error of Est.
Adjusted R squared
R squared
Multiple R

Analysis of Variance

Source       Sum of squares   DF   Mean sq.   F-Ratio   Prob.
Regression
Residual
Total

Observed   Calculated   Residual   Standardized Residual

Durbin-Watson Test =
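The last entry of the table can be computed directly from the residual column. The sketch below assumes the standard definition of the Durbin-Watson statistic, DW = Σ(e_t - e_(t-1))² / Σ e_t², with residuals taken in observation order. The observed and calculated values are hypothetical, and the function is not part of MSTAT.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals.

    Values near 2 suggest no first-order autocorrelation; values toward 0
    or 4 suggest positive or negative autocorrelation, respectively.
    """
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero; the statistic is undefined")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    return numerator / denominator


# Hypothetical observed and calculated activities for six compounds.
observed = [5.1, 4.8, 5.6, 6.1, 5.9, 6.4]
calculated = [5.0, 5.0, 5.5, 6.0, 6.0, 6.5]
residuals = [o - c for o, c in zip(observed, calculated)]
dw = durbin_watson(residuals)
print(dw)
```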

2.22. CHANCE CORRELATION

Many times correlation is noticed between two variables without any real relationship

between them. It may happen due to chance. This generally happens when a very small

sample is chosen from a large universe. For example, a small sample may give us a very high

correlation between dependent and independent variables. It is, therefore, essential that while

analyzing correlation between two variables, we do not jump to hasty conclusions.

QSAR investigations are plagued with the problem of chance correlation, in which

statistical significance is inflated by the multitude of possible models. The problem of chance

correlation was explored by Topliss and coworkers [66-67] in studies on correlations

with random numbers. It can be estimated to some extent by repeated random reassignment

of activity values to the matrix of independent variables, and repetition of the selection

process. If these “nonsense” regressions frequently lead to correlations as good as the original

one, then the original one is non-significant.
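The random reassignment test described above can be sketched as follows. To stay short, the sketch uses the simplest possible selection process: pick the single descriptor with the highest r² from the pool. A real study would repeat its own multi-descriptor selection procedure instead. The function names, the data, the number of trials, and the random seed are all assumptions for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between one descriptor and the activity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_r2(descriptors, activity):
    """Selection step: best single-descriptor r² over the whole pool."""
    return max(r_squared(column, activity) for column in descriptors)


def chance_correlation_rate(descriptors, activity, n_trials=200, seed=0):
    """Fraction of scrambled-activity runs that match or beat the original r²."""
    rng = random.Random(seed)
    original = best_r2(descriptors, activity)
    scrambled = list(activity)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(scrambled)  # random reassignment of activity values
        if best_r2(descriptors, scrambled) >= original:
            hits += 1
    return original, hits / n_trials


# Hypothetical pool of three descriptors for ten compounds.
pool = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    [2, 7, 1, 8, 2, 8, 1, 8, 2, 8],
]
activity = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.9]
original_r2, rate = chance_correlation_rate(pool, activity)
print(original_r2, rate)
```

A rate close to zero indicates that the "nonsense" regressions rarely reach the quality of the original model. A high rate indicates that the original correlation is likely to be due to chance.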

2.23. RULE OF THUMB

Multiple regression analysis generally requires significantly more compounds than

parameters; a useful rule of thumb [68] is three to six times the number of parameters

under consideration. Thus, the traditional regression method requires that the number of

parameters be considerably smaller than the number of compounds in the data set or the

number of degrees of freedom in the data.
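As a small worked illustration of this rule, the helper functions below turn the three-to-six ratio into a range of compounds for a given number of parameters and, conversely, into a ceiling on the number of parameters for a given data set. The function names and default ratios are assumptions for illustration.

```python
def compounds_required(n_parameters, low=3, high=6):
    """Range of compounds suggested by the three-to-six rule of thumb."""
    return low * n_parameters, high * n_parameters


def max_parameters(n_compounds, ratio=3):
    """Largest number of parameters allowed at a given compounds-per-parameter ratio."""
    return n_compounds // ratio


print(compounds_required(4))        # model with four parameters
print(max_parameters(20))           # lenient end of the rule
print(max_parameters(20, ratio=6))  # strict end of the rule
```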

REFERENCES:

1. S. Borman; New QSAR Techniques Eyed for Environmental Assessments. Chem. Eng. News, 1990, 68, 20-23.

2. R.L. Lipnick; Charles Ernest Overton: Narcosis Studies and a Contribution to General Pharmacology. Trends Pharmacol. Sci., 1986, 7, 161-164.

3. C. Hansch, A. Leo, R.W. Taft; A Survey of Hammett Substituent Constants and Resonance and Field Parameters. Chem. Rev., 1991, 91, 165-195.

4. C. Hansch, A. Leo, D. Hoekman; Exploring QSAR - Hydrophobic, Electronic, and Steric Constants. American Chemical Society, Washington, D.C., 1995.

5. C. Hansch; A Quantitative Approach to Biochemical Structure-Activity Relationships. Acc. Chem. Res., 1969, 2, 232-239.

6. M. Karelson; Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New York, 2000.

7. J. Devillers; Ed., Comparative QSAR, Taylor and Francis, Washington (DC), 1998.

8. F. Sanz, J. Giraldo, F. Manaut; Eds., QSAR and Molecular Modeling: Concepts, Computational Tools and Biological Applications, Prous Science, Barcelona (SP), 1995.

9. H. Kubinyi; QSAR: Hansch Analysis and Related Approaches, VCH, Weinheim (GER), 1995.

10. M.V. Diudea, M.S. Florescu, P.V. Khadikar; Molecular Topology and its Applications, Eficon, Bucharest, 2006.

11. D. Bonchev, D.H. Rouvray; Chemical Topology: Applications and Techniques, Gordon and Breach, New York, 2000.

12. M.V. Diudea; Ed., QSPR/QSAR Studies by Molecular Descriptors, Nova Science, 2000.

13. M.V. Diudea, O. Ivanciuc; Molecular Topology, Comprex, Cluj, 1995.

14. A. Verloop; The Sterimol Approach to Drug Design, Marcel Dekker, New York, 1987.

15. Z. Simon, A. Chiriac, S. Holban, D. Ciubotariu, G.I. Mihalas; Minimum Steric Difference. The MTD Method for QSAR Studies, Research Studies Press, Letchworth (UK), 1984.

16. I. Gutman; Mathematical Concepts in Organic Chemistry, Springer-Verlag, Berlin, 1986.

17. J.E. Muth; Basic Statistics and Pharmaceutical Statistical Applications, Marcel Dekker, New York, 1999.

18. ACD Labs; Richmond St. West, Suite 605, M5H 2L3, Canada.

19. I. Gutman, O.E. Polansky; Mathematical Concepts in Organic Chemistry, Springer-Verlag, Berlin, 1986.

20. F. Buckley, F. Harary; Distances in Graphs, Addison-Wesley, Reading, 1990.

21. P.G. Mezey; Ed., Mathematical Modeling in Chemistry, VCH, Weinheim (GER), 1991.

22. J. Devillers, A.T. Balaban; Eds., Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach, New York, 1999.

23. R. Todeschini, V. Consonni; Handbook of Molecular Descriptors, Wiley-VCH, Weinheim (GER), 2000.

24. D. Bonchev; J. Mol. Graph. Model., 2000, 20, 65.

25. E. Estrada, E. Molina; J. Mol. Graph. Model., 2001, 20, 54.

26. M. Randic; J. Mol. Graph. Model., 2001, 20, 19.

27. E. Estrada; Chem. Phys. Lett., 2001, 336, 248.

28. V.K. Agrawal, P.V. Khadikar, C.T. Supuran, J. Singh, S. Singh, S. Thakur, M. Lakhwani; Arkivoc, 2006, 103.

29. V.K. Agrawal, J. Singh, A. Pandey, P.V. Khadikar; Oxid. Commun., 2006, 29, 4.

30. V.K. Agrawal, V.K. Dubey, B. Shaik, J. Singh, K. Singh, P.V. Khadikar; J. Indian Chem. Soc., 2009, 86, 1.

31. L. Xu, W.J. Zhang; Anal. Chim. Acta, 2001, 446, 455.

32. P.V. Khadikar, S. Karmarkar, I. Lukovits, M.V. Diudea, V.K. Agrawal; Szeged Index – 10 Successful Years in QSAR, (Under publication).

33. P.V. Khadikar, A. Shrivastava, S. Karmarkar, S. Singh; Bioorg. Med. Chem., 2002, 10, 3163.

34. S.C. Basak, D. Mills, B.D. Gute, G.D. Grunwald, A.T. Balaban; Application of Topological Indices in Predicting Property / Bioactivity / Toxicity of Chemicals. (Unpublished).

35. C. Hansch, S. Unger; J. Med. Chem., 1973, 16, 1217.

36. H. Wiener; J. Med. Chem., 1975, 18, 607.

37. V. Austel; Eur. J. Med. Chem., 1982, 17, 9.

38. S. Hellberg, M. Sjöström; Acta Pharm. Jugosl., 1987, 37, 53.

39. R. Carlsson; Chem. Scr., 1987, 27, 545.

40. The Use of QSAR for Chemical Screening - Limitations and Possibilities, KEMI Report, Science and Technology Department, Sweden, 1988.

41. S. Wold, W.J. Dunn III; J. Chem. Inf. Comput. Sci., 1983, 23, 6.

42. S. Wold, W.J. Dunn, S. Hellberg; Envir. Hlth. Persp., 1985, 61, 257.

43. S. Wold, L. Eriksson, J. Jonsson, J. Hellbers, S. Hellberg, M. Sjöström; Efficient Selection of Training Set Compounds for QSAR, KEMI Report, Science and Technology Department, Sweden, 1988.

44. R. Franke; Theoretical Drug Design Methods, Elsevier, Amsterdam, 1984.

45. E. Overton; Z. Physik. Chem., 1897, 22, 189.

46. H. Meyer; Arch. Exp. Pathol. Pharmakol., 1899, 42, 109.

47. E. Overton; Studien über die Narkose, Gustav Fischer, Jena, 1901.

48. R.L. Lipnick, E. Overton; Trends Pharmacol. Sci., 1986, 7, 161.

49. J. Ferguson; Proc. Roy. Soc. Lond. B. Biol. Sci., 1939, 127, 387.

50. S.M. Free Jr., J.W. Wilson; J. Med. Chem., 1964, 7, 395.

51. C. Hansch, T. Fujita; J. Am. Chem. Soc., 1964, 86, 1616.

52. L.P. Hammett; J. Am. Chem. Soc., 1937, 59, 96.

53. L.P. Hammett; Physical Organic Chemistry, McGraw-Hill, New York, 1940.

54. V.K. Agrawal, J. Singh, B. Shaik, S. Singh, S. Sikhima, P.V. Khadikar, C.T. Supuran; Bioorg. Med. Chem., 2007, 15, 6501.

55. V.K. Agrawal, J. Singh, B. Shaik, S. Singh, P.V. Khadikar, O. Deeb, C.T. Supuran; Chem. Biol. Drug Des., 2008, 71, 244.

56. V.K. Agrawal, J. Singh, S. Singh, B. Shaik, N. Sohani, P.V. Khadikar, O. Deeb; Chem. Biol. Drug Des., 2008, 71, 230.

57. V.K. Agrawal, J. Shrivastava, J. Singh, B. Shaik; Oxid. Commun., 2008, 31, 776.

58. V.K. Agrawal, J. Singh, B. Shaik, P.V. Khadikar; J. Indian Chem. Soc., 2008, 85, 517.

59. V.K. Agrawal, J. Singh, V.K. Dubey, P.V. Khadikar; Oxid. Commun., 2008, 1, 27.

60. V.K. Agrawal, J. Singh, M. Gupta, P.V. Khadikar, C.T. Supuran; Eur. J. Med. Chem., 2006, 41(3), 360.

61. V.K. Agrawal, K.C. Mishra, J. Singh, P.V. Khadikar; Letters Drug Design & Discovery, 2006, 3, 129.

62. R. Wootton, F.E. Norrington, R.M. Hyde, S.G. Williams; J. Med. Chem., 1975, 18, 604.

63. M. Randic; Croatica Chem. Acta, 1993, 66, 289.

64. S. Chatterjee, A.S. Hadi, B. Price; “Regression Analysis by Example”,

65. MSTAT, Software, 1998.

66. J. Topliss; Ed., “Quantitative Structure-Activity Relationships of Drugs”, Academic Press, New York, 1983.

67. J.K. Seydel; Ed., “QSAR and Strategies in the Design of Bioactive Compounds”, VCH, Weinheim, 1985.

68. M.S. Tute; In Advances in Drug Research, N.J. Harper, A.B. Simmonds; Eds., Academic Press, London, 1971, 6, 1.
