CHAPTER - 2
QSAR METHODOLOGY
2.1 INTRODUCTION
Quantitative structure-activity relationships (QSAR) represent an attempt to correlate
structural or property descriptors of compounds with activities. These physicochemical
descriptors, which include parameters to account for hydrophobicity, topology, electronic
properties, and steric effects, are determined empirically or, more recently, by computational
methods. Activities used in QSAR include chemical measurements and biological assays.
QSAR currently are being applied in many disciplines, with many pertaining to drug design
and environmental risk assessment.
QSAR date back to the 19th century. In 1863, A.F.A. Cros at the University of Strasbourg
observed that toxicity of alcohols to mammals increased as the water solubility of the
alcohols decreased [1]. In the 1890's, Hans Horst Meyer of the University of Marburg and
Charles Ernest Overton of the University of Zurich, working independently, noted that the
toxicity of organic compounds depended on their lipophilicity [1, 2].
2.2 Linear Free Energy Relationships
Little additional development of QSAR occurred until the work of Louis Hammett (1894-
1987), who correlated electronic properties of organic acids and bases with their equilibrium
constants and reactivity. Consider the dissociation of benzoic acid:
Hammett observed that adding substituents to the aromatic ring of benzoic acid had an
orderly and quantitative effect on the dissociation constant. For example,
a nitro group in the meta position increases the dissociation constant, because the nitro group
is electron-withdrawing, thereby stabilizing the negative charge that develops. Consider now
the effect of a nitro group in the para position:
The equilibrium constant is even larger than for the nitro group in the meta position,
indicating even greater electron-withdrawal.
Now consider the case in which an ethyl group is in the para position:
In this case, the dissociation constant is lower than for the unsubstituted compound,
indicating that the ethyl group is electron-donating, thereby destabilizing the negative charge
that arises upon dissociation.
Hammett also observed that substituents have a similar effect on the dissociation of other
organic acids and bases. Consider the dissociation of phenylacetic acids:
Electron-withdrawal by the nitro group increases dissociation, with the effect being less for
the meta than for the para substituent, just as was observed with benzoic acid. The electron-
donating ethyl group decreases the equilibrium constant, as would be expected.
Data for these equilibria typically are graphed as illustrated below:
Figure 1: Example of a graph for a linear free energy relationship. K0 or K0' represent
equilibrium constants for unsubstituted compounds and K or K', for substituted compounds.
Values for the abscissa are calculated from the dissociation constants of unsubstituted and
substituted benzoic acid. Values for the ordinate are obtained from another organic acid or
base with identical patterns of substitution, in this case phenylacetic acid.
Because this relationship is linear, the following equation can be written:
\[ \log\frac{K'}{K'_0} = \rho \log\frac{K}{K_0} \]
where ρ is the slope of the line. The values for the abscissa in Figure 1 are always those for
benzoic acid and are given the symbol σ. Therefore, we can write:
\[ \log\frac{K'}{K'_0} = \rho\sigma \]
ρ, the slope of the line, is a proportionality constant pertaining to a given equilibrium. It
relates the effect of substituents on that equilibrium to the effect of those substituents on the
benzoic acid equilibrium. That is, if the effect of substituents is proportionally greater than on
the benzoic acid equilibrium, then ρ > 1; if the effect is less than on the benzoic acid
equilibrium, ρ < 1. By definition, ρ for benzoic acid is equal to 1.
σ is a descriptor of the substituents. The magnitude of σ gives the relative strength of the
electron-withdrawing or -donating properties of the substituents. σ is positive if the
substituent is electron-withdrawing and negative if it is electron-donating.
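As a minimal sketch of how σ and ρ are obtained in practice, the following Python fragment computes σ from benzoic acid dissociation constants and then estimates ρ for a second acid series as the least-squares slope through the origin. The dissociation constants, the substituent labels A, B and C, and the function names are hypothetical and chosen only for illustration; they are not measured data.

```python
import math

def sigma_from_constants(k_substituted, k_unsubstituted):
    """Hammett sigma: log10 of the ratio of benzoic acid dissociation constants."""
    return math.log10(k_substituted / k_unsubstituted)

def rho_from_series(sigmas, log_ratios):
    """Least-squares slope through the origin for log(K'/K0') = rho * sigma."""
    numerator = sum(s * y for s, y in zip(sigmas, log_ratios))
    denominator = sum(s * s for s in sigmas)
    return numerator / denominator

# Hypothetical dissociation constants (illustration only, not measured values).
K0_BENZOIC = 1.0e-4                                   # unsubstituted benzoic acid
benzoic = {"A": 4.0e-4, "B": 6.0e-4, "C": 0.5e-4}     # substituted benzoic acids
K0_OTHER = 2.0e-5                                     # unsubstituted second acid
other = {"A": 4.0e-5, "B": 4.9e-5, "C": 1.4e-5}       # same substituents, second acid

sigmas = [sigma_from_constants(benzoic[name], K0_BENZOIC) for name in benzoic]
log_ratios = [math.log10(other[name] / K0_OTHER) for name in benzoic]
rho = rho_from_series(sigmas, log_ratios)

# rho < 1 here: the second acid responds less strongly to substituents
# than benzoic acid does.
print(f"rho = {rho:.2f}")
```

The slope is forced through the origin because the unsubstituted compound has σ = 0 and log(K'/K'0) = 0 by construction.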
These relationships as developed by Hammett are termed linear free energy relationships.
Recall the equation relating free energy to an equilibrium constant:
\[ \Delta G^\circ = -RT \ln K \]
That is, the free energy is proportional to the logarithm of the equilibrium constant. These
linear free energy relationships are termed "extrathermodynamic". Although they can be
stated in terms of thermodynamic parameters, no thermodynamic principle states that the
relationships should be true.
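The connection between the Hammett equation and free energies can be written out in a few lines. The sketch below assumes that both equilibria are measured at the same temperature T and uses ΔG° and ΔG°₀ for the standard free energies of dissociation of the substituted and unsubstituted benzoic acids, with primed quantities for the second acid series; these symbols are introduced here for illustration.

```latex
\begin{align}
\Delta G^\circ &= -RT \ln K = -2.303\,RT \log K \\
\sigma &= \log\frac{K}{K_0} = -\frac{\Delta G^\circ - \Delta G^\circ_0}{2.303\,RT} \\
\log\frac{K'}{K'_0} &= -\frac{\Delta G'^\circ - \Delta G'^\circ_0}{2.303\,RT} = \rho\sigma \\
\Delta G'^\circ - \Delta G'^\circ_0 &= \rho\,\bigl(\Delta G^\circ - \Delta G^\circ_0\bigr)
\end{align}
```

The last line shows why the relationship is called a linear free energy relationship: the substituent-induced change in free energy for one equilibrium is a constant multiple of the change for the reference equilibrium.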
To develop a better understanding of these relationships, it is instructive to consider some
values of ρ and σ. Values of ρ are provided below:
In the aniline and phenol equilibria, the hydrogen ion that is dissociating is one atom removed
from the phenyl ring, whereas in the benzoic acid equilibrium it is two atoms removed. Thus,
substituents are able to exert a greater effect on the dissociation in aniline and phenol than in
benzoic acid and the value of ρ > 1. In phenylacetic and phenylpropionic acids, the hydrogen
ion dissociating is three and four atoms removed, respectively, from the phenyl ring.
Substituents are able to exert a lesser effect on the equilibrium than on the benzoic acid
equilibrium and ρ < 1.
Some illustrative values of σ for substituents in the meta and para positions are given below:
By definition, σ for hydrogen is 0. The positive values of σ for the nitro group indicate that it
is electron-withdrawing. In understanding the magnitudes of the σ values for the nitro group
in meta vs. para positions, consider the mechanisms of electron withdrawal or donation. For a
nitro group in the meta position, electron-withdrawal is due to an inductive effect produced
by the electronegativity of the constituent atoms. If only induction were operative, one would
expect the electron-withdrawing effect of a nitro group in the para position to be less than in
the meta position. The larger σ value for a para-substituted nitro group results from the
combination of both inductive and resonance effects. For chlorine, the electronegativity of the
atom produces an inductive electron-withdrawing effect, with the magnitude of the effect in
the para position being less than in the meta position. For chlorine, only the inductive effect is
possible. The methoxy group can be electron-donating or -withdrawing, depending on the
position of substitution. In the meta position, the electronegativity of the oxygen produces an
inductive electron-withdrawing effect. In the para position, only a small inductive effect
would be expected. Moreover, an electron-donating resonance effect occurs for the methoxy
group in the para position, giving an overall electron-donating effect. Tables of values for
numerous substituents have been published [3,4]. In some cases, the sigma values are
generally applicable to many different equilibria. In other cases, sigma values have been
derived for specific equilibria, which is particularly true when one considers sigma values for
ortho substituents.
2.3 Computer-Assisted Design
Computer-assisted drug design (CADD), also called computer-assisted molecular design
(CAMD), represents more recent applications of computers as tools in the drug design
process. In considering this topic, it is important to emphasize that computers cannot
substitute for a clear understanding of the system being studied. That is, a computer is only an
additional tool to gain better insight into the chemistry and biology of the problem at hand.
In most current applications of CADD, attempts are made to find a ligand (the putative drug)
that will interact favorably with a receptor that represents the target site. Binding of ligand to
the receptor may include hydrophobic, electrostatic, and hydrogen-bonding interactions. In
addition, solvation energies of the ligand and receptor site also are important because partial
to complete desolvation must occur prior to binding.
This approach to CADD optimizes the fit of a ligand in a receptor site. However, optimum fit
in a target site does not guarantee that the desired activity of the drug will be enhanced or that
undesired side effects will be diminished. Moreover, this approach does not consider the
pharmacokinetics of the drug.
The approach used in CADD is dependent upon the amount of information that is available
about the ligand and receptor. Ideally, one would have 3-dimensional structural information
for the receptor and the ligand-receptor complex from X-ray diffraction or NMR. The ideal is
seldom realized. In the opposite extreme, one may have no experimental data to assist in
building models of the ligand and receptor, in which case computational methods must be
applied without the constraints that the experimental data would provide.
Based on the information that is available, one can apply either ligand-based or receptor-
based molecular design methods. The ligand-based approach is applicable when the structure
of the receptor site is unknown, but when a series of compounds have been identified that
exert the activity of interest. To be used most effectively, one should have structurally similar
compounds with high activity, with no activity, and with a range of intermediate activities. In
recognition site mapping, an attempt is made to identify a pharmacophore, which is a
template derived from the structures of these compounds. It is represented as a collection of
functional groups in three-dimensional space that is complementary to the geometry of the
receptor site.
In applying this approach, conformational analysis will be required, the extent of which will
be dependent on the flexibility of the compounds under investigation. One strategy is to find
the lowest energy conformers of the most rigid compounds and superimpose them.
Conformational searching on the more flexible compounds is then done while applying
distance constraints derived from the structures of the more rigid compounds. Ultimately, all
of the structures are superimposed to generate the pharmacophore. This template may then be
used to develop new compounds with functional groups in the desired positions. In applying
this strategy, one must recognize that one is assuming that it is the minimum energy
conformers that will bind most favorably in the receptor site. In fact, there is no a priori
reason to exclude higher energy conformers as the source of activity.
The receptor-based approach to CADD applies when a reliable model of the receptor site is
available, as from X-ray diffraction, NMR, or homology modeling. With the availability of
the receptor site, the problem is to design ligands that will interact favorably at the site, which
is a docking problem.
2.4 Hansch Analysis
QSAR based on Hammett's relationship utilize electronic properties as the descriptors of
structures. Difficulties were encountered when investigators attempted to apply Hammett-
type relationships to biological systems, indicating that other structural descriptors were
necessary.
Robert Muir, a botanist at Pomona College, was studying the biological activity of
compounds that resembled indoleacetic acid and phenoxyacetic acid, which function as plant
growth regulators. In attempting to correlate the structures of the compounds with their
activities, he consulted his colleague in chemistry, Corwin Hansch. Using Hammett sigma
parameters to account for the electronic effect of substituents did not lead to meaningful
QSAR. However, Hansch recognized the importance of the lipophilicity, expressed as the
octanol-water partition coefficient, on biological activity [5]. We now recognize this
parameter to provide a measure of the bioavailability of compounds, which will determine, in
part, the amount of the compound that gets to the target site.
Relationships were developed to correlate a structural parameter (i.e., lipophilicity) with
activity. In some cases, a univariate relationship correlating structure and activity was
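adequate. The form of the equation is:
\[ \log\frac{1}{C} = a \log P + b \]
where C is the molar concentration of compound that produces a standard response (e.g.,
LD50, ED50) and P is the octanol-water partition coefficient. With other data, it was
observed that correlations were improved by combining Hammett's electronic parameters and
Hansch's measure of lipophilicity using an equation such as
\[ \log\frac{1}{C} = a\pi + b\sigma + c \]
where σ is the Hammett substituent parameter and π is defined analogously to σ. That is,
\[ \pi = \log\frac{P_X}{P_H} \]
where P_X and P_H are the partition coefficients of the substituted and unsubstituted compounds.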
In yet other cases, parabolic relationships between biological response and hydrophobicity
were observed that could be fit by including a (log P)² term in the QSAR. One
interpretation to account for this term is that many membranes must be traversed for
compounds to get to the target site, and those with greatest hydrophobicity will become
localized in the membranes they encounter initially. Thus, an optimum hydrophobicity may
be found in some test systems.
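The optimum hydrophobicity implied by a parabolic relationship can be located once the quadratic has been fitted. The Python sketch below fits log(1/C) = a(log P)² + b log P + c by ordinary least squares, using the normal equations and a small Gaussian elimination routine, and then takes the vertex of the parabola, log P = -b/(2a), as the optimum. The data are synthetic and noise-free, generated from an assumed parabola purely for illustration, and all function names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution

def fit_parabolic_hansch(log_p, log_inv_c):
    """Least-squares fit of log(1/C) = a*(log P)^2 + b*log P + c via the normal equations."""
    rows = [[x * x, x, 1.0] for x in log_p]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, log_inv_c)) for i in range(3)]
    return solve_linear(ata, aty)

# Synthetic data generated from an assumed parabola (illustration only).
log_p = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
log_inv_c = [-0.5 * (x - 2.0) ** 2 + 4.0 for x in log_p]

a, b, c = fit_parabolic_hansch(log_p, log_inv_c)
optimum_log_p = -b / (2.0 * a)   # vertex of the fitted parabola
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}, optimum log P = {optimum_log_p:.2f}")
```

A negative coefficient a indicates that activity passes through a maximum; with real assay data the fit would not be exact and the residuals would need to be examined.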
QSAR are now developed using a variety of parameters as descriptors of the structural
properties of molecules. Hammett sigma values are often used for electronic parameters, but
quantum mechanically derived electronic parameters also may be used. Other descriptors to
account for the shape, size, lipophilicity, polarizability, and other structural properties also
have been devised.
Quantitative Structure-Activity/Toxicity Relationship Studies (QSAR) are mathematical
models relating the biological activity measurements of a set of chemical compounds to the
variation in their chemical structures [6-9].
2.5 DEVELOPMENT OF QSAR
The steps involved in development of QSAR for a given type of activity and a given
type of chemical compounds are as follows:
(a) Selecting a “training set” of compounds from an almost infinite universe of
possible candidates;
(b) Synthesizing or otherwise getting hold of pure samples of these compounds;
(c) Obtaining appropriate biological activity measurements for the (n x M) matrix
Y, with n being the number of training set compounds and M being the number
of biological activity measurements;
(d) Translating the variation in structure among the training set compounds to the
variation of “structure descriptor variables”, i.e., achieving a relevant
quantitative description of the structural variation among the compounds. The
resulting data are denoted Xik for compound i (i = 1, 2, ..., n) and descriptor
variable k (k = 1, 2, ..., K). Together they form the (n x K) matrix X;
(e) Deriving a mathematical model connecting X and Y. This gives, among other
results, a measure of how well Y is modeled (e.g., explained variance) and
which variables in X are important in the model (e.g., regression
coefficients);
(f) Validating the model by having it predict the biological activity / toxicity for
several new compounds, followed by the actual biological testing of these
compounds and the comparison of predictions with outcomes;
(g) Interpreting the model by relating it to biological and chemical knowledge.
2.6 MODEL DEVELOPMENT AND STATISTICAL DESIGNS
It is an accepted fact that no chemical or biological measurement is perfectly accurate.
The value of any property or activity varies a little (or much) when measured repeatedly,
even under conditions that are as similar as possible. Therefore, scientific models based on
measured data, including QSARs, are necessarily statistical in nature. These models separate the
variability in the data into two parts, the systematic and the random part. For the biological
measurements (BA) we can write the model as:
BA = F(X) + E (2.1)
Here, the systematic part, F(X), is the part of the biological data that is “explained” by
X. The random part, the residuals E, contains the errors of measurement of the activities plus
the model errors, i.e. the imperfection of the systematic model F(X). The model errors, in turn,
have two sources: the imperfect form or “shape” of the model F, and the incompleteness and
possible errors in the descriptor variables X [10-15].
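The decomposition in equation (2.1) can be illustrated with a short Python sketch. It assumes the simplest possible form for F, a straight line in a single descriptor, fits it by least squares, and then separates each measured activity into the systematic part F(X) and the residual E. The descriptor and activity values are hypothetical, and the variable names are chosen here for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical descriptor values X and measured activities BA (illustration only).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(descriptor, activity)
systematic = [slope * x + intercept for x in descriptor]     # F(X)
residuals = [y - f for y, f in zip(activity, systematic)]    # E = BA - F(X)

# Explained variance: the share of the variability captured by F(X).
mean_activity = sum(activity) / len(activity)
ss_total = sum((y - mean_activity) ** 2 for y in activity)
ss_residual = sum(e * e for e in residuals)
explained_variance = 1.0 - ss_residual / ss_total
print(f"slope = {slope:.2f}, explained variance = {explained_variance:.4f}")
```

The residuals here lump together measurement error and model error; the fit alone cannot tell the two apart, which is why replicate measurements and external validation are needed.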
Since these structurally related properties of a chemical can presumably be
determined by experimental or computational means much more efficiently than its biological
activity can be measured using in vivo or in vitro approaches, a statistically validated QSAR
model is capable of predicting the biological activity of a new chemical within the same series
in lieu of the time-consuming and labour-intensive processes of chemical synthesis and
biological evaluation. Applied judiciously, QSAR can save a substantial amount of time, money
and manpower.
Quantitative structure-activity relationships (QSARs) are mathematical models
relating the activity measurements of chemical compounds to the variation in their chemical
structure. When properties or activities are related to the structures of chemical compounds,
the methodology is called quantitative structure-property relationships (QSPRs) or
quantitative structure-activity relationships (QSARs), respectively. Such methodologies are
being used successfully with antihypertensive drugs to understand the adverse effects of
chemical compounds. These predictions are made because of the large number of untested
chemicals and because of the high costs of biological testing [16-24].
QSAR models are now regarded as a scientifically credible tool for predicting and
classifying the biological activities of untested chemicals. QSAR has become inexorably
embedded as an essential tool in the pharmaceutical industry, from lead discovery and
optimization to lead development [25, 26]. A growing trend is to use QSAR early in the drug
discovery process as a screening and enrichment tool to eliminate from further development
those chemicals lacking drug-like properties [26] or those predicted to elicit a toxic
response. The fundamental assumption of QSAR is that variations in the biological activity of
a series of chemicals that target a common mechanism of action are correlated with variations
in their structural, physical and chemical properties [27]. A good number of papers have
been published in this field [28-34].
The molecular structure and parameters derived from molecular spectral data of
organic compounds acting as drugs can be combined to form powerful models of biological
activity. Such data-activity relationships are nowadays called Quantitative
Structure-Data-Activity Relationships (QSDARs) instead of QSAR.
The pioneer in this field, regarded as the father of QSAR, is Corwin Hansch,
who first introduced this methodology in 1962 [35]. For the past 47 years, the use
of QSAR has become increasingly helpful in understanding chemical and biological
interactions in the drug design process, pesticide research, environmental pollution,
toxicology, etc. QSAR methodology is very useful in elucidating the mechanism of
chemical-biological interactions, in particular enzyme action.
In environmental toxicology QSAR are used to understand the adverse effects of
chemicals, and to predict the biological effects of as yet untested compounds. These predictions
are made because of the large number of untested chemicals, and because of the high costs of
biological testing. It is worth mentioning that, although sometimes taken as a criterion,
prediction is not the primary goal of QSAR. If it results from interpolation, it is often trivial;
if extrapolation goes too far outside the included parameter space, it usually fails. QSAR
helps to understand structure-activity relationships in a quantitative manner and to find the
borders of certain properties, e.g. the optimum lipophilicity within a series of analogs or the
maximum size of a certain group in a stepwise procedure.
The strategy and philosophy of QSAR enable chemists, pharmacologists,
environmentalists and medicinal chemists to look at their structures in terms of
physicochemical properties instead of considering only certain pharmacophore groups in them.
Nevertheless, QSAR methods are still used to prove and to quantify the underlying
hypothesis regarding the dependence of biological activities on physicochemical interactions.
The predictive power of QSAR critically depends on the statistical quality of the data used to
develop the model and estimate its parameters.
Parameters which encode certain structural features and properties are needed to
correlate biological activities with chemical structures in a quantitative manner.
Physicochemical properties are suited to this purpose because they are directly related to
the intermolecular forces involved in the drug-receptor interaction.
The development of a QSAR for a given type of biological activity and a given type
of chemical compounds involves the following steps:
(i) Selecting a “training set” of compounds from an almost infinite
universe of possible candidates;
(ii) Synthesizing or otherwise getting hold of pure samples of these compounds;
(iii) Obtaining appropriate biological activity measurements from the training set
compounds. Together these data form the (n x M) matrix Y, n being the
number of training set compounds and M being the number of biological activity
measurements;
(iv) Translating the variation in structure among the training set compounds to the
variation of “structure descriptor variables”. That is, achieving a relevant
quantitative description of the structural variation among the compounds. The
resulting data are denoted Xik for compound i (i = 1, 2, 3, ..., n) and descriptor
variable k (k = 1, 2, ..., K). Together they form the (n x K) matrix X;
(v) Deriving a mathematical model connecting X and Y. This gives, among other
results, a measure of how well Y is modeled and which
variables in X are important in the model (e.g. regression coefficients);
(vi) Validating the model by having it predict the biological activity for several
new compounds, followed by the actual biological testing of these compounds
and the comparison of predictions with outcomes;
(vii) Interpreting the model by relating it to biological and chemical knowledge.
The history of QSAR dates back to the 19th century [32, 43, 44]. At that time
QSAR itself was in the foreground and prediction played only a minor role. Only a few
parameters were used: log P, π, σ, MR and some steric parameters.
Now, quantitative prediction is made using quantum chemical and geometrical descriptors,
connectivity values, electrotopological indices, WHIM descriptors and many other
distance-based and connectivity-type topological indices. Based on the origin of the
molecular descriptors used in the calculations, QSAR methods can be divided into three
groups, as shown below:
QSAR Methods

1st Group: Based on a relatively small number of physico-chemical properties and
parameters describing hydrophobic, steric, electrostatic, etc. effects. Usually, these
descriptors are used as independent variables in multiple regression approaches [11].
These methods are referred to as Hansch analysis [11].

2nd Group: Based on quantitative characteristics of molecular graphs, i.e. molecular
topological descriptors. These include molecular connectivity indices [36-39], molecular
shape indices [40], topological and electro-topological state indices [41,42], atom-pair
descriptors [41] etc. Sometimes topological descriptors are also combined with
physico-chemical properties and/or spectral parameters of the molecule. These methods are
referred to as 2D QSAR [27].

3rd Group: Based on descriptors derived from a spatial (three-dimensional)
representation of the molecular structures. These methods are called 3D QSAR. They
require 3D alignment of all molecules according to a pharmacophore model or based on
docking to a ligand binding site of a receptor. The descriptors used could be
electrostatic, steric, hydrophobic etc. field values at grid points surrounding the
molecule.
2.7 MODEL VALIDATION AND PREDICTIVE POWER OF QSAR MODELS
In MLR (Multiple Linear Regression) QSAR analysis, more independent variables
result in a higher probability of a chance correlation between predicted and observed
activities [33-35]. This conclusion is true not only for MLR-QSAR, but also for any QSAR
approach when the number of variables (descriptors) is comparable to or higher than the
number of compounds in the data set. Consequently, model validation is one of the most
important aspects of QSAR analysis.
A widely used way to validate a model is cross-validation, based on
leave-one-out (LOO) or leave-some-out (LSO) procedures [33-35]. The
outcome of this procedure is a set of cross-validated parameters: PRESS (predicted residual
sum of squares), SSY (sum of squares of the response values), S_PRESS (uncertainty of
prediction), R²cv (or Q²) (overall predictive ability) and PSE (predictive square error).
Frequently, R²cv (or Q²) is used as a criterion of both robustness and predictive ability of the
model. Many authors consider a high Q² (for instance, Q² > 0.5) as an indicator or even as the
ultimate proof of the high predictive power of the QSAR model [45-49]. They do not test the
models for their ability to predict the activity of compounds of an external test set (i.e.
compounds which have not been used in the QSAR model development) [50, 51]. Other
authors validate their models using only one or a few compounds that were not used in QSAR
model development [50, 51] and still claim that their models are highly predictive. In contrast
with such expectations, it has been shown that if a test set with known values of biological
activities is available for prediction, there exists no correlation between the LOO
cross-validated Q² and the correlation coefficient R² between the predicted and observed
activities for the test set [52, 53].
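A minimal Python sketch of leave-one-out cross-validation follows. It assumes a one-descriptor linear model and the usual definition Q² = 1 - PRESS/SSY, with SSY taken as the sum of squared deviations of the responses from their mean; the text above does not spell out these formulas, so both are assumptions of the sketch. The descriptor and activity values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out(x, y):
    """Return (PRESS, SSY, Q2) for a one-descriptor linear model."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    ssy = sum((yi - mean_y) ** 2 for yi in y)
    return press, ssy, 1.0 - press / ssy

# Hypothetical training set (illustration only).
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
activity = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]

press, ssy, q2 = leave_one_out(descriptor, activity)
print(f"PRESS = {press:.3f}, SSY = {ssy:.3f}, Q2 = {q2:.3f}")
```

Each compound is predicted by a model that never saw it, so PRESS is larger than the ordinary residual sum of squares. As the paragraph above stresses, a high Q² from such a calculation still says nothing certain about performance on an external test set.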
Another widely used approach to establish the model robustness is called y-
randomization (randomization of response i.e. activities) [50,51]. It consists of repeating the
calculation procedure with randomized activities and subsequent probability assessment of
the resultant statistics. Frequently, it is used along with cross-validation. It is expected that
models obtained for the data set with randomized activity should have low values of Q2.
However, sometimes models based on the randomized data have high Q2 values due to
chance correlation or structural redundancy [50-53].
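The idea of y-randomization can be sketched in a few lines of Python. Because the hypothetical data set below has only six compounds, the sketch enumerates every permutation of the activities instead of drawing random shuffles, and it uses the squared correlation coefficient of a one-descriptor model as the statistic in place of Q² to keep the example short. Both choices are simplifications made here; with realistic data set sizes one would draw a few hundred random permutations and recompute Q².

```python
from itertools import permutations

def r_squared(x, y):
    """Squared Pearson correlation between a descriptor and the activities."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)

# Hypothetical training set (illustration only).
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
activity = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]

true_r2 = r_squared(descriptor, activity)

# Reassign the activities to the compounds in every possible order.
scrambled = [r_squared(descriptor, list(p)) for p in permutations(activity)]

# Fraction of scrambled data sets that fit at least as well as the real one
# (small tolerance guards against floating-point round-off).
as_good = sum(1 for value in scrambled if value >= true_r2 - 1e-12)
p_value = as_good / len(scrambled)
mean_scrambled_r2 = sum(scrambled) / len(scrambled)
print(f"true r2 = {true_r2:.3f}, mean scrambled r2 = {mean_scrambled_r2:.3f}, p = {p_value:.4f}")
```

A model is considered robust in this sense when very few scrambled data sets reach the statistic obtained with the real activities.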
Some authors have suggested that the only way to estimate the true predictive power
of a QSAR model is to compare the predicted and observed activities of a sufficiently large
external test set of compounds that were not used in the model development [50-53].
To estimate the predictive power of a QSAR model one needs the following statistical
characteristics for the test set:
(1) Correlation coefficient R between the predicted and observed activities;
(2) Coefficients of determination for the regressions through the origin, i.e.
predicted versus observed activities (R_0^2, denoted R²obs) and observed versus
predicted activities (R_0'^2, denoted R²cal);
(3) Slopes k and k' of the regression lines through the origin.
A QSAR model is predictive if the following conditions are satisfied [50]:
\[ Q^2 > 0.5 \tag{2.2} \]
\[ R^2 > 0.6 \tag{2.3} \]
\[ \frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1 \tag{2.4} \]
\[ 0.85 < k < 1.15 \quad \text{or} \quad 0.85 < k' < 1.15 \tag{2.5} \]
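The conditions (2.2) to (2.5) can be checked mechanically once the test set predictions are available. The Python sketch below is one possible reading of them and rests on several assumptions that are not stated in the text: condition (2.3) is taken as a threshold on R², k is the slope of the observed-versus-predicted regression through the origin and k' that of the predicted-versus-observed regression, and each through-origin coefficient of determination is computed as one minus the residual sum of squares over the total sum of squares of the dependent variable. The observed and predicted activities are hypothetical.

```python
def external_validation(observed, predicted, q2):
    """Evaluate conditions (2.2)-(2.5) for an external test set (assumed definitions)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    s_oo = sum((o - mean_obs) ** 2 for o in observed)
    s_pp = sum((p - mean_pred) ** 2 for p in predicted)
    s_op = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    r2 = s_op ** 2 / (s_oo * s_pp)

    cross = sum(o * p for o, p in zip(observed, predicted))
    k = cross / sum(p * p for p in predicted)        # observed = k * predicted
    k_prime = cross / sum(o * o for o in observed)   # predicted = k' * observed

    # Coefficients of determination for the two regressions through the origin.
    r0_2 = 1.0 - sum((o - k * p) ** 2 for o, p in zip(observed, predicted)) / s_oo
    r0_2_prime = 1.0 - sum((p - k_prime * o) ** 2 for o, p in zip(observed, predicted)) / s_pp

    checks = {
        "2.2": q2 > 0.5,
        "2.3": r2 > 0.6,
        "2.4": (r2 - r0_2) / r2 < 0.1 or (r2 - r0_2_prime) / r2 < 0.1,
        "2.5": 0.85 < k < 1.15 or 0.85 < k_prime < 1.15,
    }
    return checks, all(checks.values())

# Hypothetical external test set (illustration only).
observed = [5.1, 6.0, 6.9, 7.7, 8.4]
predicted = [5.0, 6.2, 6.8, 7.9, 8.3]

checks, is_predictive = external_validation(observed, predicted, q2=0.7)
print(checks, is_predictive)
```

The value of Q² is passed in because it comes from cross-validation on the training set, not from the test set. Published implementations differ in how the through-origin statistics are defined, so the definitions used should always be reported with the results.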
2.8. QSAR MODELING
The QSAR modeling generally involves three steps:
(1) Collect or, if possible, design a training set of chemicals,
(2) Choose descriptors that can properly relate chemical structure to biological
activities, and
(3) Apply statistical methods that correlate changes in structure with changes in
biological activity.
Obtaining a good quality QSAR model with the ability to predict activity of a
chemical outside the training set depends upon many factors in the approach and execution of
each of the three steps mentioned above.
2.9. QUALITY OF DATA
Data should come from the same assay protocol, and care should be taken to avoid
inter-laboratory variability. Any bad data points will tend to corrupt the proper correlation of
structure and activity. Rules of thumb for a good QSAR data set are that the dose-response
relationship should be smooth, the potency (or affinity) should be reproducible, the activity
range should span two or more orders of magnitude from the least active to most active
chemicals in the series, the number of chemicals used to build the QSAR model should be
sufficiently large to ensure statistical stability, the activities of the chemicals should be evenly
distributed across the range of activity, and the chemicals selected for the training set should
possess enough structural diversity to span the range of chemical space associated with the
biological activity under study.
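Two of these rules of thumb, the span of the activity range and the evenness of the distribution, are easy to check numerically. The Python sketch below assumes that activities are expressed on a negative logarithmic scale (for example, the negative logarithm of a molar concentration), so that a span of 2 corresponds to two orders of magnitude. The values, the choice of three bins and the function names are hypothetical.

```python
def activity_span(p_activities):
    """Range of the activities in log units (orders of magnitude)."""
    return max(p_activities) - min(p_activities)

def bin_counts(p_activities, n_bins):
    """Count compounds in equal-width bins across the activity range.

    Assumes the activities are not all identical (non-zero range).
    """
    low = min(p_activities)
    width = (max(p_activities) - low) / n_bins
    counts = [0] * n_bins
    for value in p_activities:
        # The most active compound falls on the upper edge; keep it in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical activities on a negative log10 molar scale (illustration only).
p_activities = [5.0, 5.6, 6.1, 6.7, 7.2, 7.9]

span = activity_span(p_activities)
counts = bin_counts(p_activities, n_bins=3)
spans_two_orders = span >= 2.0
print(f"span = {span:.1f} log units, counts per bin = {counts}")
```

Strongly unequal bin counts would indicate that the activities cluster in one part of the range, which weakens the regression in the sparsely populated parts.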
2.10. PHYSICO-CHEMICAL PARAMETERS IN QSAR
QSAR development involves the quantification of the structural variation in the
current structural class by a small number of design variables [6-9]. These design variables
may correspond to characteristic properties of the compounds, e.g. size, solubility,
lipophilicity, oxidation potential etc. If such “principal properties” are not available,
they may be derived by a multivariate analysis [16,17] of a battery of measured or calculated
properties of parent compounds. If the structural variation in the class corresponds to
the variation of a number of substituents or other structural fragments, the principal properties
of these fragments, i.e. substituent scales (π, σ, etc.), are suitable design variables.
Sometimes, physicochemical properties related to molecular structure are also used as
suitable design variables. Some such properties are: density (d), molecular weight (MW),
molar volume (MV), molar refraction (MR), parachor (Pr), surface tension (γ), polarizability
(α) etc. A program for the calculation of such properties is available
for download from ACD Labs [18].
With the introduction of a mathematical concept viz., graph theory and topology in
chemistry, the recent trend is to use graph theoretical descriptions, which are commonly
called topological indices [19-23].
2.11. CHEMICAL GRAPH THEORY AND CHEMICAL TOPOLOGY
The concept of graph theory and topology as applied to chemistry is now
well established [19-23,54-57]. Several books and excellent reviews are readily available in
the literature, and further information can be collected from the internet [24-30,58-61].
It is therefore not necessary to give a detailed résumé of chemical graph theory and topology.
However, the basic concepts involved need to be given here.
While applying graph theory to chemistry it is necessary to construct a chemical
graph from the structure of the chemical compound. That is, one has to transform chemical
structure into chemical graph. Such a derived chemical graph is more commonly referred to
as molecular graph. Such a transformation is usually carried out by deleting all the C-H bonds
from the molecular structure. Consider, for example, the molecular structure of propyl alcohol:
    H   H   H
    |   |   |
H - C - C - C - O - H
    |   |   |
    H   H   H
Figure 2.1: Molecular Structure of Propyl alcohol
This transforms to its molecular graph as:
C - C - C - O - H
and is called the hydrogen-deleted molecular graph of propyl alcohol. To mimic a
mathematical graph, the atoms are called vertices and denoted by '.' or 'o'. Similarly, the
bonds are called edges and denoted by a small line '__'. No difference is made among single,
double and triple bonds. Following this we can write the molecular graph of propyl alcohol as:
o__o__o__o__o
Figure 2.2: C-H Suppressed Molecular Graph
Note that, as with bonds (edges), no distinction is made among vertices (atoms) [33].
With the development of chemical graph theory it is now argued that the other and
more common way of obtaining a molecular graph is to suppress all the bonds to hydrogen in
the molecular structure, i.e., to delete all the carbon-hydrogen as well as heteroatom-hydrogen
bonds from the molecular structure. In that case, the molecular graph of
propyl alcohol reduces to:
o__o__o__o
Figure 2.3: All Hydrogen suppressed molecular graph of propyl alcohol
Which type of molecular graph should be used has yet to be settled. Consequently,
one has to state clearly which molecular graph is used in a given study.
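The two conventions can be made concrete with a short Python sketch that starts from the full structure of propyl alcohol and derives both molecular graphs. The atom numbering, the bond list and the function name are assumptions introduced here for illustration.

```python
# Full structure of propyl alcohol: atoms 0-2 are carbons, 3 is oxygen,
# 4 is the hydroxyl hydrogen, and 5-11 are hydrogens attached to carbon.
atoms = ["C", "C", "C", "O", "H", "H", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4),
         (0, 5), (0, 6), (0, 7), (1, 8), (1, 9), (2, 10), (2, 11)]

def suppress_hydrogens(atoms, bonds, keep_heteroatom_h):
    """Return (vertices, edges) of the molecular graph.

    keep_heteroatom_h=True deletes only C-H bonds (C-H suppressed graph);
    keep_heteroatom_h=False deletes every bond to hydrogen.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def keep(i):
        if atoms[i] != "H":
            return True
        if not keep_heteroatom_h:
            return False
        # Keep a hydrogen only if it is not bonded to carbon.
        return all(atoms[j] != "C" for j in neighbours[i])

    vertices = [i for i in range(len(atoms)) if keep(i)]
    kept = set(vertices)
    edges = [(a, b) for a, b in bonds if a in kept and b in kept]
    return vertices, edges

ch_vertices, ch_edges = suppress_hydrogens(atoms, bonds, keep_heteroatom_h=True)
all_vertices, all_edges = suppress_hydrogens(atoms, bonds, keep_heteroatom_h=False)
print(len(ch_vertices), len(ch_edges))     # C-H suppressed graph
print(len(all_vertices), len(all_edges))   # all-hydrogen suppressed graph
```

The C-H suppressed graph has five vertices and four edges, and the all-hydrogen suppressed graph has four vertices and three edges, so any index computed from the graph depends on which convention is chosen.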
2.12. GRAPH INVARIANT (TOPOLOGICAL INDEX)
The aforementioned discussion clearly establishes that like mathematical graph,
molecular graph can likewise be defined as –
G V.E (2.6)
That is, molecular graph (G) consists of vertex set V and an edge set E.
Now, by imposing certain conditions on vertices (atoms), edges (bonds) or both
(atoms and bonds) one can obtain a number which is called a graph invariant or, more
commonly, a topological index. Thus, generally we have the following three types of topological
indices:
(1) Atom-weighted topological indices;
(2) Edge-weighted topological indices; and
(3) Atom-Edge weighted topological indices.
A plethora of topological indices is available in the literature, and new
types of topological indices are introduced regularly. A detailed study of such
indices was made by Basak [34], who classified the topological indices under the following
categories:
(1) Topostructural (TS),
(2) Topochemical (TC),
(3) Geometrical (3D)/shape,
(4) Quantum chemical (QC).
Another way of classifying topological indices is based on the matrix used for their
calculation. The majority of well-known topological indices are obtained from either the adjacency
matrix or the distance matrix, or from their combination. Thus, we have the following three types of
topological indices:
(1) Topological indices derived from adjacency matrix;
(2) Topological indices derived from distance matrix;
(3) Topological indices derived from the combination of adjacency and distance
matrices.
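As an illustration of an index derived from the distance matrix, the sketch below computes the Wiener index (the sum of topological distances over all pairs of vertices) for the all-hydrogen suppressed graph of propyl alcohol. The Wiener index is chosen here only as a familiar example; the function names are hypothetical, and the distances are obtained by breadth-first search on an adjacency list, assuming a connected graph.

```python
from collections import deque

def distance_matrix(adjacency_list):
    """Topological distances (number of edges on the shortest path) by BFS."""
    n = len(adjacency_list)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency_list[v]:
                if w not in seen:
                    seen[w] = seen[v] + 1
                    queue.append(w)
        for v, d in seen.items():
            dist[start][v] = d
    return dist

def wiener_index(adjacency_list):
    """Sum of distances over all unordered vertex pairs."""
    dist = distance_matrix(adjacency_list)
    return sum(sum(row) for row in dist) // 2

# All-hydrogen suppressed graph of propyl alcohol, C-C-C-O, as an adjacency list.
propanol = [[1], [0, 2], [1, 3], [2]]
w = wiener_index(propanol)  # pair distances 1+1+1+2+2+3 = 10
```

The adjacency list carries the same information as the adjacency matrix, so this index is computed from the distance matrix alone; an index of the third type would combine both matrices.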
Yet another classification of topological indices is made as below :
(1) Fragment constants descriptors;
(2) Conformational descriptors;
(3) Electronic descriptors;
(4) Reactor descriptors;
(5) Quantum chemical descriptors;
(6) Graph theoretical descriptors;
(7) Topological descriptors;
(8) Information content descriptors;
(9) Molecular shape descriptors;
(10) Spatial descriptors;
(11) Structural descriptors;
(12) Thermodynamic descriptors;
(13) pKa descriptors;
(14) Molecular field descriptors;
(15) Receptor surface analysis descriptors, etc.
Subsequently, the descriptors are classified as :
(1) 2D descriptors, and (2) 3D descriptors.
The present study relates to 2D descriptors only.
2.13. COMPOUND SELECTION IN QSAR
The backbone of success of QSAR/QSTR is the selection of compounds. The
importance of compound selection in QSAR was first pointed out by Hansch and Unger
(1973) [35] and by Wootton et al. (1975) [62]; Austel (1982) [37] was the first to use fractional
factorial designs in QSAR, which was followed by the general treatise by Hellberg et al.
(1987) [38,42]. So far, however, statistical designs have been used very little in QSAR
development, with resulting low general quality and low predictive power of the models.
Carlson et al. (1987) [39] have used statistical designs extensively in the closely
related area of structure-reactivity modeling in their investigations of synthetic methods in
organic chemistry.
The problem of compound selection in QSAR has drawn attention only recently, and
is still largely neglected, especially in the area of environmental toxicology applications,
where no examples are found in the literature [39-44].
2.14. MULTIVARIATE ANALYSIS: A STATISTICAL TEST FOR DRUG
DESIGNING
Multiple regression does two things that are very desirable. For prediction studies,
multiple regression makes it possible to combine many molecular descriptors (topological
indices) to produce optimal predictions of the biological activity. For causal analysis,
multiple regression separates the effects of molecular descriptors (topological indices) on the
biological activity so that one can examine the unique contribution of each molecular
descriptor (topological index). In the following sections we will look closely at how these
two goals are accomplished.
In the last three decades, statisticians have introduced a number of more sophisticated
methods that achieve similar goals. These methods go by such names as logistic regression,
Poisson regression, structural equation models and survival analysis. Despite the arrival of
these alternatives multiple regression has retained its popularity, in part because it is
comparatively easier to use and easier to understand.
Regression analysis is a simple method for investigating
functional relationships among variables. Such a relationship is expressed in the form of an
equation or a model connecting the response or dependent variable and one or more
explanatory or predictor (independent) variables. We denote the response variable by Y and
the set of predictor variables by X1, X2, ..., Xp, where p denotes the number of predictor
variables.
A regression equation containing only one predictor variable is called a simple
regression equation and is represented by the following model:
$$Y = \beta_0 + \beta_1 X \qquad (2.7)$$
Here, the coefficient $\beta_1$ is called the slope and $\beta_0$ is called the constant coefficient or
intercept.
An equation containing more than one predictor variable is called a multiple regression
equation and is represented as follows:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \qquad (2.8)$$
where $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are constants referred to as the partial regression coefficients.
The magnitudes of $\beta_1, \beta_2, \dots, \beta_p$ play a dominant role in deciding whether or not the proposed
regression equation (model) is statistically significant. If the
magnitude of a coefficient is lower than its standard error, then the model is statistically
useless and has to be discarded. In addition, the magnitudes of the regression coefficients
also give an idea of their participation in the proposed model. It is worth recording that even
with an excellent value of the correlation coefficient (R), if the situation just described exists,
the model needs to be discarded.
With regard to the number of predictors that can be used in a statistically
significant model, a rule of thumb described in the literature states that the
number of compounds used should be at least three to four times the number of predictors.
Thus, if we have a set of 100 compounds whose activity is to be modeled, then, taking the
stricter factor of four, at most 25 predictors can be used. The lower the number of predictors
and the larger the sample, the better is the proposed model.
It should also be kept in mind that in multiple linear regression analysis
the proposed model should not contain highly linearly related descriptors;
otherwise the model suffers from the defect of collinearity, i.e., the descriptors are
inter-correlated. However, inter-correlated descriptors may be retained in the model when
there is a consistent improvement in the adjusted R2 (R2A) upon their introduction and
their coefficients are larger than the respective standard deviations.
2.15. STATISTICAL INDICATORS FOR THE QUALITY OF MODELS
The quality of the fit of the training data is expressed by the correlation coefficient R
or the standard error of estimation S', computed as:
$$S' = \sqrt{\frac{\sum_{i=1}^{M} (A_i - A_{est,i})^2}{M}} \qquad (2.9)$$
where $A_i$, $A_{est,i}$ and M denote the experimental activities, the estimated activities and the total
number of cases (molecules) considered, respectively. However, a more important measure of
predictive quality is given by the cross-validated statistical parameters: the correlation coefficient
$R_{CV}$ and the standard error $S'_{CV}$, which are calculated using the experimental ($A_{CV,i}$) and estimated
($A_{estCV,i}$) activities based on the leave-one-out cross-validation procedure.
$$S'_{CV} = \sqrt{\frac{\sum_{i=1}^{M} (A_{CV,i} - A_{estCV,i})^2}{M}} \qquad (2.10)$$
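A minimal sketch of the leave-one-out procedure behind equation (2.10), assuming a simple one-descriptor linear model. The data are those of Table 2.1 of this chapter; the function names are hypothetical. Each compound is left out in turn, the line is refitted on the remaining compounds, and the activity of the left-out compound is predicted from that refitted line.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

def loo_standard_error(xs, ys):
    """S'_CV of equation (2.10): root mean squared leave-one-out error."""
    m = len(xs)
    squared_errors = []
    for i in range(m):
        train_x = xs[:i] + xs[i + 1:]   # all compounds except compound i
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(train_x, train_y)
        predicted = b0 + b1 * xs[i]
        squared_errors.append((ys[i] - predicted) ** 2)
    return math.sqrt(sum(squared_errors) / m)

X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]
s_cv = loo_standard_error(X, Y)
print(f"S'_CV = {s_cv:.3f}")
```

Because every prediction is made for a compound the model has not seen, S'_CV is never smaller than the fitted S' of equation (2.9) for an ordinary least-squares line.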
The Student t-test can be used to assess the statistical significance of the calculated
regression parameter b1. The expression used for this purpose is:
$$t = \frac{b_1 \left[\sum (X_i - \bar{X})^2\right]^{1/2}}{S} \qquad (2.11)$$
where S is the standard error of the estimate, computed with n-2 degrees of freedom.
If the value of t is greater than the tabulated value at the required confidence level and n-2
degrees of freedom, then the slope of the line is significantly different from zero. The
significance of the intercept (i.e., whether b0 = 0) can be assessed in a parallel manner. For simple
regression the t calculated by the above equation is the square root of the F calculated by the
following equation:
$$F = \frac{R^2 (n - 2)}{1 - R^2} \qquad (2.12)$$
To make the foregoing treatment of simple regression analysis
clearer, we work through it with specific numbers.
Consider an activity Y modeled by a molecular descriptor X, as shown by the data
presented in Table 2.1.
Table 2.1. Data for modeling activity Y by molecular descriptor X
S. No.   X       Y       X^2      Y^2      XY
1.       0.836   1.43    0.699    2.045    1.195
2.       1.171   1.74    1.371    3.028    2.038
3.       1.489   2.45    2.217    6.002    3.648
4.       1.644   2.26    2.703    5.108    3.715
5.       1.754   2.70    3.076    7.290    4.736
n = 5    ΣX = 6.894   ΣY = 10.58   ΣX^2 = 10.066   ΣY^2 = 23.473   ΣXY = 15.332
The first step in obtaining the regression expression is to estimate the regression
parameters b0 and b1 using the following expressions:
$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} \qquad (2.13)$$
$$b_0 = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n\sum X^2 - (\sum X)^2} \qquad (2.14)$$
Substituting the sums from Table 2.1 into expressions (2.13) and (2.14) gives:
$$b_1 = \frac{5(15.332) - (6.894)(10.580)}{5(10.066) - (6.894)^2} = 1.3278 \qquad (2.15)$$
and
$$b_0 = \frac{(10.580)(10.066) - (6.894)(15.332)}{5(10.066) - (6.894)^2} = 0.2854 \qquad (2.16)$$
These parameters yield the following regression equation:
$$Y = 1.3278\,X + 0.2854 \qquad (2.17)$$
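The arithmetic of equations (2.13) to (2.17) can be checked with a few lines of Python. This is a sketch using only the X and Y columns of Table 2.1; the variable names are chosen here for illustration.

```python
# Data of Table 2.1
X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

denominator = n * sum_xx - sum_x ** 2
b1 = (n * sum_xy - sum_x * sum_y) / denominator        # equation (2.13)
b0 = (sum_y * sum_xx - sum_x * sum_xy) / denominator   # equation (2.14)

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```

The sums here are accumulated without intermediate rounding, so the slope and intercept agree with the hand calculation only to within about 0.001; the hand calculation uses the sums rounded to three decimals.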
The quality of this regression expression is first determined by calculating: (1) standard
deviation in X, Y, viz., Sx, Sy; (2) sample variance, i.e., Sx2, and (3) sample standard deviation,
Sx. These quantities are determined as below:
2.16. CALCULATION OF STANDARD DEVIATION IN X AND Y :
SX AND SY
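The standard deviations in X and Y are calculated using the following expressions:
$$S_x = \sqrt{\frac{n\sum X^2 - (\sum X)^2}{n(n-1)}} = \sqrt{\frac{5(10.066) - (6.894)^2}{5(5-1)}} = 0.3744 \qquad (2.18)$$
$$S_y = \sqrt{\frac{n\sum Y^2 - (\sum Y)^2}{n(n-1)}} = \sqrt{\frac{5(23.473) - (10.58)^2}{5(5-1)}} = 0.5209 \qquad (2.19)$$
Equivalently, the sample variance can be written in terms of deviations from the mean:
$$S_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \qquad (2.20)$$
The data needed for the calculation of the sample variance $S_x^2$ are shown in Table 2.2,
which yields $S_x^2$ as:
$$S_x^2 = \frac{0.5609}{5 - 1} = 0.1402 \qquad (2.21)$$
This gives the sample standard deviation $S_x$ as 0.3744.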
Note that the sample variance is a measure of data dispersion, i.e., the average of
the squared differences between the individual measurements and the mean, taken over n-1
degrees of freedom. The sample standard deviation has the same interpretation as the
population standard deviation.
Table 2.2. Data needed for the calculation of sample variance (Sx2)
and sample standard deviation (Sx)
S. No.   X       X̄        X - X̄     (X - X̄)^2
1.       0.836   1.3788   -0.5428   0.2946
2.       1.171   1.3788   -0.2078   0.0432
3.       1.489   1.3788    0.1102   0.0121
4.       1.644   1.3788    0.2652   0.0703
5.       1.754   1.3788    0.3752   0.1407
n = 5    ΣX = 6.894                 Σ(X - X̄)^2 = 0.5609
The next step in determining the quality of the regression equation is to calculate the
correlation coefficient, r (R in the case of linear multiple regression), using:
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} \qquad (2.22)$$
$$r = \frac{5(15.332) - (6.894)(10.580)}{\sqrt{[5(10.066) - 47.527][5(23.473) - 111.936]}} = 0.9541 \qquad (2.23)$$
This gives $r^2$ = 0.9103, meaning that the molecular descriptor X accounts for
91.03% of the variance in the activity Y. This means that X is an appropriate parameter for modeling
Y. This is further established by estimating Y from the proposed regression equation (model),
comparing the estimates with the observed (experimental) Y values, and calculating the
residual, i.e., the difference between the observed (experimental) and estimated (calculated from
the proposed regression equation) values of Y. The data needed for this purpose are given in
Table 2.3.
Table 2.3. The data needed for calculating residuals
S. No.   Yobs    Yest    Residual (Yobs - Yest)
1.       1.430   1.396    0.034
2.       1.740   1.840   -0.100
3.       2.450   2.262    0.188
4.       2.260   2.468   -0.208
5.       2.700   2.614    0.086
Thus, for each observation the observed minus the calculated value is termed as the
residual. Analysis of trends in the residuals is used to explore which new variable will result
in an equation with a higher r2 (R2) and lower S.
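The residuals of Table 2.3 and the correlation coefficient of equation (2.22) can be reproduced as follows. The sketch uses the fitted equation (2.17) and the Table 2.1 data; the helper name `pearson_r` is an assumption made for illustration.

```python
import math

X = [0.836, 1.171, 1.489, 1.644, 1.754]
Y = [1.43, 1.74, 2.45, 2.26, 2.70]

def pearson_r(xs, ys):
    """Correlation coefficient from raw sums, as in equation (2.22)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

r = pearson_r(X, Y)

# Residual = observed minus estimated, with Y_est taken from equation (2.17).
y_est = [1.3278 * x + 0.2854 for x in X]
residuals = [y_obs - y_hat for y_obs, y_hat in zip(Y, y_est)]

for i, (y_obs, y_hat, res) in enumerate(zip(Y, y_est, residuals), start=1):
    print(f"{i}. {y_obs:.3f}  {y_hat:.3f}  {res:+.3f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

The printed values agree with Table 2.3 and equation (2.23) to within rounding in the last digit. Residuals that alternate in sign without a trend, as here, give no obvious hint of a missing descriptor.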
Finally, the F statistic is calculated using:
$$F_{k,\,n-k-1} = \frac{R^2 (n - k - 1)}{k (1 - R^2)} \qquad (2.24)$$
where k is the number of molecular descriptors used for modeling the activity Y. In
our case, k = 1, as only one molecular descriptor (X) is used for modeling activity Y.
Hence,
$$F_{1,3} = \frac{0.9103 \times 3}{0.0897} = 30.444 \qquad (2.25)$$
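A short sketch of equation (2.24), together with a numerical check of the statement made earlier that, for simple regression, t is the square root of F. The function names are hypothetical, and the rounded value r2 = 0.9103 from the text is used as input.

```python
def f_statistic(r_squared, n, k):
    """F with (k, n-k-1) degrees of freedom, equation (2.24)."""
    return r_squared * (n - k - 1) / (k * (1.0 - r_squared))

def t_from_r(r, n):
    """Student t for a simple regression with n-2 degrees of freedom."""
    return r * ((n - 2) / (1.0 - r * r)) ** 0.5

f_value = f_statistic(0.9103, n=5, k=1)
t_value = t_from_r(0.9103 ** 0.5, n=5)

print(f"F(1,3) = {f_value:.3f}, t = {t_value:.3f}, t^2 = {t_value ** 2:.3f}")
```

The computed F still has to be compared with the tabulated F value for (1, 3) degrees of freedom at the chosen confidence level; that lookup is not part of this sketch.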
A regression analysis is performed to establish and quantitate the dependence of one
variable on one or several others. For example, regression analysis in structure-activity
(QSAR) studies might be used to examine the dependence of relative potency on activity. A
very closely related concept is correlation analysis. This type of calculation is used to
establish and quantitate the interdependence or relatedness between two variables. The
calculation of the correlation coefficient and the statistical tests of its significance are identical to
those for the square root of R2 (r2) discussed above, with the exception that the direction of the
relationship is retained as the sign of r (R).
The interpretation of the coefficient differs between regression and
correlation. Variables which are correlated are said to covary. Correlation coefficients are
used in structure-activity studies to assess the degree of relatedness of the predictor variables
used for regression analysis. Highly correlated variables do not provide independent
information for the regression analysis.
The calculation of the coefficients of the parameters in the linear multiple regression
equation is identical to that for the simple least-squares line, with the exception that b1, Xi and
X of equations (2.13) and (2.14) refer to vectors (lists) of predictor properties (molecular
descriptors) rather than to single values. The vector Xi, for example, contains the value of
each physical property (molecular descriptor) of compound i.
Very often linear multiple regression suffers from defects due to collinearity and
the presence of compound outliers. A diagnostic for
multicollinearity between descriptors in the model is the variance inflation factor
(VIF), defined by:
$$VIF = \frac{1}{1 - R^2} \qquad (2.26)$$
where R2 is obtained by regressing one descriptor against all the other descriptors in the model.
A VIF value of less than 10 is taken to indicate that the model is free of serious multicollinearity.
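A minimal sketch of equation (2.26) for the simplest case of a model with two descriptors, where regressing one descriptor on the other gives an R2 equal to the squared pairwise correlation. The descriptor values below are invented purely for illustration, and the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_descriptors(x1, x2):
    """VIF = 1 / (1 - R^2); with two descriptors R^2 is the squared pairwise r."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical descriptor columns (not taken from the text)
d1 = [1.0, 2.0, 3.0, 4.0, 5.0]
d2 = [1.1, 1.9, 3.2, 3.8, 5.1]    # nearly a copy of d1
d3 = [1.0, -2.0, 2.0, -2.0, 1.0]  # uncorrelated with d1

vif_collinear = vif_two_descriptors(d1, d2)
vif_independent = vif_two_descriptors(d1, d3)
print(f"VIF(d1, d2) = {vif_collinear:.1f}, VIF(d1, d3) = {vif_independent:.1f}")
```

With more than two descriptors, the R2 in the formula would come from a multiple regression of each descriptor on all the others, which this sketch does not implement.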
Very recently Randic [63] has stated that many physical properties (molecular
descriptors), in spite of their high correlation, can be used in the regression expression, as many
of them have different information content. In such cases the proposed regression equation
(model) is considered statistically significant only when the coefficients of the molecular
descriptors involved in the regression equation (model) are significantly larger than their
standard errors.
The genuineness of the proposed regression equation (model) is determined by
estimating the probable error of the correlation coefficient (PE) using:
$$PE = \frac{2}{3} \cdot \frac{1 - r^2}{\sqrt{n}} \qquad (2.27)$$
If r < PE, r is not significant; if r is several times (at least 3 times) greater than PE, a
correlation is indicated; and if r > 6PE, the correlation is definitely good.
In the case considered, we have:
$$PE = \frac{2}{3} \cdot \frac{1 - 0.9103}{\sqrt{5}} = 0.0266 \qquad (2.28)$$
$$6PE = 0.1599 \qquad (2.29)$$
That is, r (0.9541) is > 6PE, indicating that the proposed correlation (regression
equation) is definitely good for modeling the activity Y.
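The probable-error criterion can be written as a small helper. This is a sketch: the function names and the wording of the returned labels are assumptions, and the label for the range between PE and 3PE, on which the text is silent, is a choice made here.

```python
import math

def probable_error(r, n):
    """Probable error of the correlation coefficient, equation (2.27)."""
    return (2.0 / 3.0) * (1.0 - r * r) / math.sqrt(n)

def interpret(r, n):
    """Apply the r versus PE criteria quoted in the text."""
    pe = probable_error(r, n)
    magnitude = abs(r)  # assumption: the sign of r is ignored
    if magnitude < pe:
        return "not significant"
    if magnitude > 6 * pe:
        return "definitely good"
    if magnitude >= 3 * pe:
        return "correlation indicated"
    return "inconclusive"  # between PE and 3PE; label chosen for this sketch

pe = probable_error(0.9541, 5)
verdict = interpret(0.9541, 5)
print(f"PE = {pe:.4f}, 6PE = {6 * pe:.4f}, verdict: {verdict}")
```

Computed without intermediate rounding, PE and 6PE differ from equations (2.28) and (2.29) only in the last quoted digit, and the verdict is unchanged.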
All the aforementioned statistics can be computed manually, either with paper and
pencil or with a scientific calculator. The Hewlett-Packard HP-11C programmable
scientific calculator is well suited to this purpose.
2.17. IMPORTANT STATISTICAL PARAMETERS
2.17.1. Standard deviation:
Standard Deviation (SD) is the square root of the variance, or the Root Mean Square
(RMS) of the deviations.
$$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \qquad (2.30)$$
where n-1 is the number of degrees of freedom, i.e. the number of parameters to be
determined is subtracted from the total number of observations. This is the standard deviation
usually employed in statistics, which measures the dispersion of a data set in relation to the
arithmetic mean; that is, it is a measure of the magnitude of the residuals, accounting for
accuracy.
Conversely, the standard deviation of the dependent variable before fitting
any model represents the amount of variation in the dependent variable, and the error
represents the variation left over after fitting the model.
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (2.31)$$
Another form of the standard deviation of x and y, respectively:
$$S_X = \left[\frac{n\sum x^2 - (\sum x)^2}{n(n-1)}\right]^{1/2} \qquad (2.32)$$
$$S_Y = \left[\frac{n\sum y^2 - (\sum y)^2}{n(n-1)}\right]^{1/2} \qquad (2.33)$$
2.17.2. Covariance
Covariance is the average of the products of deviations from the means for the {xi, yi} pairs, namely:
$$c_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (2.34)$$
This measure, which depends on the scaling of the features, describes the relationship
between the x and y features. The covariance is the mean value of all the pairs of
differences from the mean for independent variables multiplied by the differences from the
mean for dependent variables. If x and y are not closely related to each other, they do not
co-vary, so the covariance is small, so the correlation is small. If x and y are closely related,
CXY turns out to be almost the same as the product of standard deviations of x and y, so the
correlation is almost 1.
2.17.3. Correlation Coefficient.
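The correlation coefficient is a normalized covariance, independent of scaling:
$$r = \frac{c_{XY}}{s_x s_y}$$
It measures the quality of adjustment, that is, the degree of correlation between x and
y, and detects whether the variables contain redundant information, i.e., whether they are highly
correlated.
Substituting the corresponding identities in the previous equation:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2\right]^{1/2}} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]^{1/2} \left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]^{1/2}} \qquad (2.35)$$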
The correlation coefficient is a measure of the degree of linearity of the relationship,
i.e. it indicates the extent to which the pairs of numbers for these two variables lie on a
straight line.
Correlation coefficients for the variables in a dataset are compiled in a correlation
matrix, which shows the correlation of one descriptor with another, and thus the
relationships among descriptors. This matrix is a symmetric matrix in which the diagonal
elements are one and the off-diagonal elements are the correlation coefficients for the
appropriate variable pairs. The correlation coefficients for independent variables that are not
correlated, i.e. orthogonal variables are zero.
The addition of new parameters to the model always increases the r value, unless the
new parameter is a constant or a linear combination of other parameters, which would not
produce any effect. The increase in r when adding new parameters can result in overfitting,
that is, a spurious correlation.
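The correlation matrix described above can be assembled directly from pairwise correlation coefficients. The sketch below is illustrative: the three descriptor columns are invented, and the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Square matrix of pairwise correlations, in the order of the dict keys."""
    names = list(columns)
    return [[pearson_r(columns[a], columns[b]) for b in names] for a in names]

# Hypothetical descriptor columns (not taken from the text)
descriptors = {
    "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D2": [1.1, 1.9, 3.2, 3.8, 5.1],
    "D3": [1.0, -2.0, 2.0, -2.0, 1.0],
}
matrix = correlation_matrix(descriptors)
for name, row in zip(descriptors, matrix):
    print(name, "  ".join(f"{value:+.3f}" for value in row))
```

In this made-up example D1 and D2 are nearly redundant, while D1 and D3 are orthogonal, so their off-diagonal element is zero.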
2.17.4. Multiple Determination Coefficient
The multiple determination coefficient is the squared correlation coefficient, used to
describe the goodness-of-fit of the data. It indicates how well the model reproduces the
experimental data. However, when a large number of free parameters intervene in the model,
r2 can become arbitrarily close to one.
$$r^2 = \frac{c_{XY}^2}{S_x^2 S_y^2} \qquad (2.36)$$
An alternative definition of the squared correlation coefficient can be
deduced:
$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (2.37)$$
where $y'_i$ is the value of $y_i$ calculated from the model.
The multiple determination coefficient is a quantitative measure of the precision of
adjustment of the fitted values to the observed ones; it measures the fraction of the
variance explained by the model. The coefficient mainly indicates whether the variation of y
explained by the regression equation permits one to assume that there is a linear relationship
between y and x.
The squared coefficient multiplied by 100 is the percentage of total variance explained by
the model. This percentage expresses the strength of the relationship between x and y.
r2 is defined in the [0, 1] interval, that is, it ranges from 0 to 1. The closer it is to
unity, the more similar the adjusted values are to the experimental ones. The limiting case,
r2 = 1, is obtained when all the residuals are null, that is, the residual sum of squares
is zero, and thus the model fits the data exactly.
It must be noted, however, that a coefficient close to unity does not mean that the
model is good; the simple addition of parameters to the regression increases
r2, even if the newly added descriptor does not contribute to the model. To determine the
predictive capacity of the model, other measures are required.
2.17.5 Fisher Statistic.
The Fisher statistic (F) is one of several variance-related parameters that can
be used to compare two models differing by one or more variables. This statistic is used to
determine whether a more complex model is significantly better than a less complex one.
$$F(k, n-k-1) = \frac{RSS}{k s^2} \qquad (2.38)$$
where RSS is the sum of squares due to regression, s2 is an unbiased estimate of the residual
or error variance, and (k, n-k-1) are the degrees of freedom.
Another form of the Fisher statistic is
$$F_{test} = \frac{R^2 (N - k - 1)}{k (1 - R^2)} \qquad (2.39)$$
where N = number of samples, k = number of descriptors, and R = correlation coefficient. The F test
reflects the ratio of the variation explained by the model to the variation not explained by
the model, so a higher value of F indicates that the model is statistically more significant.
The F statistic is computed and compared with standard tabulated values. If F is
larger than the tabulated value, the more complex model can be accepted as significant.
2.17.6. Probable Error of Estimation.
The Probable Error of Estimation is defined by the expression:
$$PE = \frac{2}{3} \cdot \frac{1 - R^2}{\sqrt{N}} \qquad (2.40)$$
It is argued that:
1. If R < PE, R is not significant.
2. If R > 6 PE, the model is definitely good.
2.17.7. Degree of Lack of Relationship
$$K = \sqrt{1 - R^2} \qquad (2.41)$$
K indicates the degree of lack of relationship. The lower the value of K, the
better the predictivity of the model.
2.17.8 Index of Forecasting Efficiency
The index of forecasting efficiency (E) is defined as the percentage reduction in errors of prediction
by reason of the correlation between two variables, so the higher the value of E, the better the
model.
$$E = 100\left[1 - \sqrt{1 - R^2}\right] \qquad (2.42)$$
2.17.9. LOF: Friedman's Lack of Fit Measure.
$$LOF = \frac{LSE}{\left[1 - \dfrac{k + d\,p}{N}\right]^2} \qquad (2.43)$$
where LSE = least-squares error,
k = number of descriptors + 1,
p = number of independent parameters,
N = number of samples,
d = 1.
The advantage of using LOF rather than LSE directly is that, unlike LSE, LOF does not always
decrease as the number of descriptors increases. A lower value of LOF indicates a better model.
2.17.10. T: Significance Level
$$T_{test} = R \left[\frac{N - 2}{1 - R^2}\right]^{1/2} \qquad (2.44)$$
where N = number of samples.
The T test is a measure of the significance of the regression coefficient. A higher value of T
indicates a more significant regression model.
2.17.11. Sum of squares of deviations from the mean:
The sum of squares of deviations from the mean is a very important statistical
parameter. It is calculated by subtracting the mean from each observation, squaring each
deviation, and summing these squares.
$$SS_{mean} = \sum (X - \bar{X})^2 \qquad (2.45)$$
2.17.12. Estimation of variance:
An estimate of the variance of a population is the sum of squares of deviations from
the sample mean, divided by the degrees of freedom:
$$S^2_{mean} = \frac{\sum (X - \bar{X})^2}{n - 1} \qquad (2.46)$$
This quantity forms the basis for very useful statistical calculations called analysis of
variance. Analysis of variance is discussed briefly below.
2.17.13. Outliers:
These are the points in the fitted model which have a large residual.
$$C_i = \frac{r_i^2}{p} \cdot \frac{P_{ii}}{1 - P_{ii}}, \qquad i = 1, 2, \dots, n \qquad (2.47)$$
2.17.14. Dealing with outliers:
Outliers and influential observations should not routinely be deleted or automatically
down-weighted, because they are not necessarily bad observations. On the contrary, if they
are correct, they may be the most informative points in the data. Their presence may indicate that the
data did not come from a normal population or that the model is not linear.
2.17.15. Multicollinearity:
Multicollinearity is defined as the condition of severe non-orthogonality.
Multicollinearity can be investigated thoroughly by examining the values of R2 that result from
regressing each of the predictor variables against all the others.
2.17.16. Condition Number (K) of the correlation matrix:
The condition number of the correlation matrix is a measure of the overall
multicollinearity of the variables. It is defined as:
$$K = \sqrt{\frac{\lambda_1}{\lambda_p}} = \sqrt{\frac{\text{maximum eigenvalue of the correlation matrix}}{\text{minimum eigenvalue of the correlation matrix}}} \qquad (2.48)$$
The condition number is never less than one. A large K indicates
strong collinearity. The harmful effects of collinearity in the data become strong
when the value of K exceeds 15 (which means that $\lambda_1$ is more than 225 times $\lambda_p$).
Another empirical criterion for the presence of multicollinearity is given by the sum
of the reciprocals of the eigenvalues, that is
$$\sum_{j=1}^{p} \frac{1}{\lambda_j} \qquad (2.49)$$
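As a worked illustration of the two criteria, consider the simplest case of a model with two descriptors whose pairwise correlation coefficient is r. This derivation is added as a sketch; the matrix symbol and the use of r for the pairwise correlation are notational assumptions made here.

```latex
\mathbf{R} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad \lambda_1 = 1 + |r|, \quad \lambda_2 = 1 - |r|,
\qquad K = \sqrt{\frac{1 + |r|}{1 - |r|}}

K > 15 \iff \frac{1 + |r|}{1 - |r|} > 225 \iff |r| > \frac{224}{226} \approx 0.991

\sum_{j=1}^{2} \frac{1}{\lambda_j} = \frac{1}{1 + |r|} + \frac{1}{1 - |r|} = \frac{2}{1 - r^2}
```

For two descriptors, therefore, the threshold K = 15 is reached only when the descriptors are almost perfectly correlated, and the sum of reciprocal eigenvalues is twice the variance inflation factor 1/(1 - r2).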
2.17.17. Mallows Cp (Cp-Statistics):
Mallows used the Cp statistic for the estimation of Jp.
$$C_p = \frac{SSE_p}{\hat{\sigma}^2} + (2p - n) \qquad (2.50)$$
where $\hat{\sigma}^2$ is an estimate of $\sigma^2$ and is usually obtained from the linear model with the full set of q variables. The
expected value of Cp is p when there is no bias in the fitted equation containing p terms.
Consequently, the deviation of Cp from p can be used as a measure of bias. The Cp statistic,
therefore, measures the performance of the variables in terms of the standardized total mean
squared error of prediction for the observed data points, irrespective of the unknown true
model. It takes into account both the bias and the variance.
2.17.18. Cross validation:
Cross validation is a technique for evaluating the predictive ability of a model for a
given data set. Nowadays, cross validation constitutes the basis for the modern statistical
philosophy of “replacing standard assumptions about the data with massive calculations” for
assessing the generality of a relationship found from a sample data set.
If the PRESS (predictive residual sum of squares) value is transformed into a dimensionless
form by relating it to the initial sum of squares (SSY), one obtains Q2, i.e., the complement of
the fraction of unexplained variance over the total variance (see R2, above).
$$Q^2 = 1 - \frac{PRESS}{SSY} \qquad (2.51)$$
or
$$R^2_{pred} = \frac{SSY - PRESS}{SSY} \qquad (2.52)$$
2.17.19. Predictive Squared Error (PSE):
Predictive squared error (PSE) is the square root of the ratio of PRESS to N:
$$PSE = \sqrt{\frac{PRESS}{N}} \qquad (2.53)$$
It is more directly related to the uncertainty of the predictions, since it has the same
units as the actual y values.
2.17.20. Standard deviation of error of predictions (SDEP) and standard deviation of error of
calculations (SDEC):
SDEP and SDEC are the terms used to distinguish between predictions for new
data points and for data points already used in modeling, and are defined as:
$$SDEP = \left[\frac{\sum (y - y_{pred})^2}{N}\right]^{1/2} \qquad (2.54)$$
$$SDEC = \left[\frac{\sum (y - y_{calc})^2}{N}\right]^{1/2} \qquad (2.55)$$
Note that SDEC can also be called RSD (standard deviation of residuals).
2.17.21. Uncertainty of the Prediction (SPRESS):
The uncertainty of the prediction (SPRESS) is defined as
$$S_{PRESS} = \left[\frac{PRESS}{n - k - 1}\right]^{1/2} \qquad (2.56)$$
where k is the number of variables in the model and n is the number of compounds used in
the study.
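The cross-validation statistics of equations (2.51), (2.53) and (2.56) all follow from the same two lists: the observed activities and the cross-validated predictions. The sketch below assumes that SSY is the sum of squared deviations of the observed values from their mean; the function name and the leave-one-out predictions are hypothetical, while the observed values are the Y column of Table 2.1.

```python
import math

def validation_statistics(y_obs, y_cv, k):
    """PRESS, Q2, PSE and S_PRESS from observed and cross-validated values."""
    n = len(y_obs)
    mean_y = sum(y_obs) / n
    press = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_cv))
    ssy = sum((yo - mean_y) ** 2 for yo in y_obs)  # assumed definition of SSY
    return {
        "PRESS": press,
        "Q2": 1.0 - press / ssy,                    # equation (2.51)
        "PSE": math.sqrt(press / n),                # equation (2.53)
        "S_PRESS": math.sqrt(press / (n - k - 1)),  # equation (2.56)
    }

y_obs = [1.43, 1.74, 2.45, 2.26, 2.70]
y_cv = [1.30, 1.88, 2.21, 2.57, 2.54]  # hypothetical leave-one-out predictions
stats = validation_statistics(y_obs, y_cv, k=1)
for name, value in stats.items():
    print(f"{name} = {value:.4f}")
```

When the predictions in equation (2.54) are the cross-validated ones, SDEP coincides with PSE, since both are the square root of PRESS/N. S_PRESS is always somewhat larger because it divides by n - k - 1 rather than by N.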
2.18. EVALUATION OF THE PREDICTIVE CAPACITY OF THE MODEL:
VALIDITY TECHNIQUES
Once the regression equation is obtained, in addition to the goodness of fit and the
stability of the model, it is also important to evaluate the robustness and the predictive
capacity or validity of the model before using it for the interpretation and prediction
of biological activity.
To validate a method is to establish the reliability and relevance of the method for a
particular purpose. Reliability refers to the reproducibility of results, relevance is
related to the scientific use and practical usefulness, and purpose refers to the intended
application. The validation of a QSAR model is the process by which the predictive ability
and the mechanistic basis of the model are assessed for practical purposes. Validation assesses
whether the model accurately represents reality from the perspective of the intended
application.
Special attention must be paid to outliers, i.e., structures that do not fit the model and have
a residual greater than two times the standard deviation of the residuals. Once outliers are identified,
diagnostic data that help in making decisions about them should be examined. Outliers should
be iteratively removed from the observations used to calculate the QSAR equation, and
the equation recalculated until satisfactory results are obtained.
It may happen, for example, that if the structure of one or more elements of the training set
differs significantly from the rest, these elements determine the quality and shape of the
model. Several procedures can be used to check the reliability and significance of the model,
i.e. that the size of the model is appropriate for the quantity of data available, as well as to
provide some estimate of how well the model can predict activity for new (not yet synthesized)
molecules. There are two kinds of technique to determine the confidence and
robustness of the model, namely internal and external validation techniques.
2.19. SIGNIFICANCE OF ADJUSTED R2 (R2A)
The adjusted R2 (R2A) [64] is an important statistical parameter used to judge
whether or not an added variable contributes its fair share to the model. It is defined as:
$$R_A^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \qquad (2.57)$$
where n is the number of compounds and k is the number of independent variables. That is,
R2A adjusts R2 for the number of variables in the model. If a variable is added that does not
contribute its fair share, R2A will actually decline.
R2A is particularly important when the number of independent variables is large
relative to the sample size. It takes into account the relationship between sample size and
number of variables. R2 may appear artificially high if the number of independent variables is
high compared to the sample size. R2A is a measure of the % explained variation in the
dependent variable that takes into account the relationship between the number of cases and
the number of independent variables in the regression model, whereas R2 will always increase
when an independent variable is added. R2A will decrease if the added variable doesn't reduce
the unexplained variation enough to offset the loss of degrees of freedom.
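The behaviour described above is easy to see numerically. In the sketch below the one-descriptor value R2 = 0.9103 is taken from the worked example of this chapter, while the two-descriptor value R2 = 0.9150 is a hypothetical number chosen to represent a descriptor that adds almost nothing; the function name is also an assumption.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2, equation (2.57): n compounds, k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

one_descriptor = adjusted_r2(0.9103, n=5, k=1)
two_descriptors = adjusted_r2(0.9150, n=5, k=2)  # hypothetical R2 after adding a descriptor

print(f"R2A with one descriptor:  {one_descriptor:.4f}")
print(f"R2A with two descriptors: {two_descriptors:.4f}")
```

Although R2 rises from 0.9103 to 0.9150, the adjusted value falls, which signals that the second descriptor does not pay for the degree of freedom it consumes.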
2.20. MULTICOLLINEARITY PROBLEM
The problem due to multicollinearity in the model has been addressed by Randic [63].
According to him, the selection of descriptors to be used in QSPR and QSAR (Quantitative
Structure-Activity Relationship) statistics should not be delegated solely to computers,
although statistical criteria will continue to be useful for the preliminary screening of
descriptors taken from a large pool. Often in an automated selection of descriptors, a
descriptor will be discarded because it is highly correlated with another descriptor already
selected. What is important, however, is not whether two descriptors parallel one another, that is,
duplicate much of the same structural information, but whether they differ in those parts that are
important for QSPR and QSAR correlations. If they differ in the domain which is important
for the property, activity, or toxicity considered, both descriptors should be retained. If they
differ in parts that are not relevant for the correlation of the considered property, activity or
toxicity, then one of them can be discarded. That is, despite high collinearity among
descriptors, they may have different information content and thus be retained in the model.
Such a model is then considered statistically significant.
2.21. APPROPRIATE SOFTWARE
Appropriate software, to be used for regression analysis, should provide the following
information:
(1) Dependent variable.
(2) Independent variable.
(3) Coefficient of independent variable.
(4) Standard error (95% confidence) of independent variable.
(5) Student t-test.
(6) Probability.
(7) Partial r2
(8) Standard error of estimate.
(9) Adjusted R2 (R2A).
(10) R2
(11) Multiple R.
(12) Analysis of variance.
(13) F-Ratio.
(14) Significance level of F.
(15) Residuals.
(16) Durbin-Watson Test.
Such information is provided by many software packages, including MSTAT [65]. The
statistics reported by MSTAT are laid out in Table 2.4.
Table 2.4. Information contained in MSTAT
Dependent Variable: Y
Var      Regression coefficient   Std. Error   T(DF)   Prob.   Partial r2
X1       C1
X2       C2
X3       C3
...      ...
Xn       Cn
Const.
Std. Error of Est.
Adjusted R squared
R squared
Multiple R
Analysis of Variance
Source       Sum of squares   DF   Mean sq.   F-Ratio   Prob.
Regression
Residual
Total
Observed   Calculated   Residual   Standardized Residual
Durbin-Watson Test =
2.22. CHANCE CORRELATION
A correlation is often observed between two variables that have no real relationship with
each other; it may arise purely by chance. This generally happens when a very small
sample is drawn from a large population. For example, a small sample may give a very high
correlation between the dependent and independent variables. It is therefore essential, when
analyzing the correlation between two variables, not to jump to hasty conclusions.
QSAR investigations are plagued by the problem of chance correlation, in which
statistical significance is inflated by the multitude of possible models. The problem of chance
correlation was explored by Topliss and coworkers [66-67] in studies on correlations
with random numbers. It can be estimated to some extent by repeated random reassignment
of activity values to the matrix of independent variables, and repetition of the selection
process. If these “nonsense” regressions frequently lead to correlations as good as the original
one, then the original one is non-significant.
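The random-reassignment procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure of Topliss and coworkers in detail. The "selection process" is reduced here to picking the single best descriptor out of ten candidates, and all names, the number of trials, and the data are assumptions made for the example.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between one descriptor and the activity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def best_r2(descriptors, activity):
    """Selection step: keep the single descriptor with the highest r^2."""
    return max(r_squared(column, activity) for column in descriptors)


def y_randomization(descriptors, activity, n_trials=200, seed=0):
    """Reassign activities at random, repeat the selection each time, and
    return (original r^2, fraction of nonsense models at least as good)."""
    rng = random.Random(seed)
    observed = best_r2(descriptors, activity)
    shuffled = list(activity)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)                      # random reassignment of activities
        if best_r2(descriptors, shuffled) >= observed:
            hits += 1
    return observed, hits / n_trials


# Synthetic, purely illustrative data: 20 compounds, 10 candidate descriptors.
rng = random.Random(1)
n_compounds, n_descriptors = 20, 10
descriptors = [[rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
               for _ in range(n_descriptors)]
# Activity truly depends on the first descriptor only.
real_activity = [3.0 * v + rng.gauss(0.0, 0.5) for v in descriptors[0]]
# Activity unrelated to every descriptor.
noise_activity = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]

real_r2, real_fraction = y_randomization(descriptors, real_activity)
noise_r2, noise_fraction = y_randomization(descriptors, noise_activity)
print("real activity : r2 =", round(real_r2, 3), " fraction =", real_fraction)
print("noise activity: r2 =", round(noise_r2, 3), " fraction =", noise_fraction)
```

A fraction close to zero indicates that the nonsense regressions rarely match the original correlation. A large fraction indicates that the original correlation is no better than what the selection process finds by chance.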
2.23. RULE OF THUMB
Multiple regression analysis generally requires significantly more compounds than
parameters; a useful rule of thumb [68] is three to six times as many compounds as parameters
under consideration. Thus, the traditional regression method requires that the number of
parameters be considerably smaller than the number of compounds in the data set, or the
number of degrees of freedom in the data.
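The arithmetic of this rule can be written out as a small sketch. The function names are hypothetical, and the factors 3 and 6 are simply the bounds quoted above.

```python
def required_compounds(n_parameters, low=3, high=6):
    """Range of data-set sizes suggested by the three-to-six-times rule."""
    return low * n_parameters, high * n_parameters


def max_parameters(n_compounds, ratio=3):
    """Largest number of parameters a data set supports at a given ratio."""
    return n_compounds // ratio


def meets_rule_of_thumb(n_compounds, n_parameters, ratio=3):
    """True if there are at least `ratio` compounds per parameter."""
    return n_compounds >= ratio * n_parameters


print(required_compounds(4))          # (12, 24)
print(max_parameters(30))             # 10
print(max_parameters(30, ratio=6))    # 5
print(meets_rule_of_thumb(10, 5))     # False
```

For example, a four-parameter model would call for roughly 12 to 24 compounds, and a set of 30 compounds would support at most 5 to 10 parameters depending on how strictly the rule is applied.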
REFERENCES:
1. S. Borman; New QSAR Techniques Eyed for Environmental Assessments. Chem.
Eng. News, 1990, 68, 20-23.
2. R.L. Lipnick; Charles Ernest Overton: Narcosis Studies and a Contribution to General
Pharmacology. Trends Pharmacol. Sci., 1986, 7, 161-164.
3. C. Hansch, A. Leo, R.W. Taft; A Survey of Hammett Substituent Constants and
Resonance and Field Parameters. Chem. Rev., 1991, 91, 165-195.
4. C. Hansch, A. Leo, D. Hoekman; Exploring QSAR - Hydrophobic,
Electronic, and Steric Constants. American Chemical Society, Washington, D.C.,
1995.
5. C. Hansch; A Quantitative Approach to Biochemical Structure-Activity
Relationships. Acc. Chem. Res., 1969, 2, 232-239.
6. M. Karelson; Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New York,
2000.
7. J. Devillers; Ed., Comparative QSAR, Taylor and Francis, Washington (DC), 1998.
8. F. Sanz, J. Giraldo, F. Manaut; Eds., QSAR and Molecular Modeling: Concepts,
Computational Tools and Biological Applications, Prous Science, Barcelona (SP),
1995.
9. H. Kubinyi; QSAR: Hansch Analysis and Related Approaches, VCH, Weinheim
(GER), 1995.
10. M.V. Diudea, M.S. Florescu, P.V. Khadikar; Molecular Topology and its
Applications, Eficon, Bucharest, 2006.
11. D. Bonchev, D.H. Rouvray; Chemical Topology: Applications and Techniques,
Gordon and Breach, New York, 2000.
12. M.V. Diudea; Ed., QSPR/QSAR Studies by Molecular Descriptors, Nova Science,
2000.
13. M.V. Diudea, O. Ivanciuc; Molecular Topology, Comprex, Cluj, 1995.
14. A. Verloop; The Sterimol Approach to Drug Design, Marcel Dekker, New York, 1987.
15. Z. Simon, A. Chiriac, S. Holban, D. Ciubotariu, G.I. Mihalas; Minimum Steric
Difference. The MTD Method for QSAR Studies, Research Studies Press, Letchworth
(UK), 1984.
16. I. Gutman; Mathematical Concepts in Organic Chemistry, Springer-Verlag, Berlin,
1986.
17. J.E. Muth; Basic Statistics and Pharmaceutical Statistical Applications, Marcel
Dekker, New York, 1999.
18. ACD Labs; Richmond St. West, Suite 605, M5H 2L3, Canada.
19. I. Gutman, O.E. Polansky; Mathematical Concepts in Organic Chemistry,
Springer-Verlag, Berlin, 1986.
20. F. Buckley, F. Harary; Distances in Graphs, Addison-Wesley, Reading, 1990.
21. P.G. Mezey; Ed., Mathematical Modeling in Chemistry, VCH, Weinheim (GER), 1991.
22. J. Devillers, A.T. Balaban; Eds., Topological Indices and Related Descriptors in
QSAR and QSPR, Gordon and Breach, New York, 1999.
23. R. Todeschini, V. Consonni; Handbook of Molecular Descriptors, Wiley-VCH,
Weinheim (GER), 2000.
24. D. Bonchev; J. Mol. Graph. Model., 2000, 20, 65.
25. E. Estrada, E. Molina; J. Mol. Graph. Model., 2001, 20, 54.
26. M. Randic; J. Mol. Graph. Model., 2001, 20, 19.
27. E. Estrada; Chem. Phys. Lett., 2001, 336, 248.
28. V.K. Agrawal, P.V. Khadikar, C.T. Supuran, J. Singh, S. Singh, S. Thakur,
M. Lakhwani; Arkivoc, 2006, 103.
29. V.K. Agrawal, J. Singh, A. Pandey, P.V. Khadikar; Oxid. Commun., 2006, 29, 4.
30. V.K. Agrawal, V.K. Dubey, B. Shaik, J. Singh, K. Singh, P.V. Khadikar;
J. Indian Chem. Soc., 2009, 86, 1.
31. Xu, Lu, W.J. Zhang; Anal. Chim. Acta, 2001, 446, 455.
32. P.V. Khadikar, S. Karmarkar, I. Lukovits, M.V. Diudea, V.K. Agrawal; Szeged Index -
10 Successful Years in QSAR, (Under publication).
33. P.V. Khadikar, A. Shrivastava, S. Karmarkar, S. Singh; Bioorg. Med. Chem.,
2002, 10, 3163.
34. S.C. Basak, D. Mills, B.D. Gute, G.D. Grunwald, A.T. Balaban; Application of
Topological Indices in predictivity property / Bioactivity / Toxicity of Chemicals.
(Unpublished).
35. C. Hansch, S. Unger; J. Med. Chem., 1973, 16, 1217.
36. H. Wiener; J. Med. Chem., 1975, 18, 607.
37. V. Austel; Eur. J. Med. Chem., 1982, 17, 9.
38. S. Hellberg, M. Sjöström; Acta Pharm. Jugosl., 1987, 37, 53.
39. R. Carlsson; Chem. Scr., 1987, 27, 545.
40. The Use of QSAR for Chemical Screening - Limitations and Possibilities, KEMI
Report, Science and Technology Department, Sweden, 1988.
41. S. Wold, W.J. Dunn III; J. Chem. Inf. Comput. Sci., 1983, 23, 6.
42. S. Wold, W.J. Dunn, S. Hellberg; Environ. Health Perspect., 1985, 61, 257.
43. S. Wold, L. Eriksson, J. Jonsson, J. Hellbers, S. Hellberg, M. Sjöström; Efficient
Selection of Training Set Compounds for QSAR, KEMI Report, Science and
Technology Department, Sweden, 1988.
44. R. Franke; Theoretical Drug Design Methods, Elsevier, Amsterdam, 1984.
45. E. Overton; Z. Physik. Chem., 1897, 22, 189.
46. H. Meyer; Arch. Exp. Pathol. Pharmakol., 1899, 42, 109.
47. E. Overton; Studien über die Narkose, Gustav Fischer, Jena, 1901.
48. R.L. Lipnick, E. Overton; Trends Pharmacol. Sci., 1986, 7, 161.
49. J. Ferguson; Proc. Roy. Soc. Lond. B. Biol. Sci., 1939, 127, 387.
50. S.M. Free Jr., J.W. Wilson; J. Med. Chem., 1964, 7, 395.
51. C. Hansch, T. Fujita; J. Am. Chem. Soc., 1964, 86, 1616.
52. L.P. Hammett; J. Am. Chem. Soc., 1937, 59, 96.
53. L.P. Hammett; Physical Organic Chemistry, McGraw-Hill, New York, 1940.
54. V.K. Agrawal, J. Singh, B. Shaik, S. Singh, S. Sikhima, P.V. Khadikar, C.T. Supuran;
Bioorg. Med. Chem., 2007, 15, 6501.
55. V.K. Agrawal, J. Singh, B. Shaik, S. Singh, P.V. Khadikar, O. Deeb, C.T. Supuran;
Chem. Biol. Drug Des., 2008, 71, 244.
56. V.K. Agrawal, J. Singh, S. Singh, B. Shaik, N. Sohani, P.V. Khadikar, O. Deeb;
Chem. Biol. Drug Des., 2008, 71, 230.
57. V.K. Agrawal, J. Shrivastava, J. Singh, B. Shaik; Oxid. Commun., 2008, 31, 776.
58. V.K. Agrawal, J. Singh, B. Shaik, P.V. Khadikar; J. Indian Chem. Soc., 2008, 85,
517.
59. V.K. Agrawal, J. Singh, V.K. Dubey, P.V. Khadikar; Oxid. Commun., 2008, 1, 27.
60. V.K. Agrawal, J. Singh, M. Gupta, P.V. Khadikar, C.T. Supuran; Eur. J. Med.
Chem., 2006, 41(3), 360.
61. V.K. Agrawal, K.C. Mishra, J. Singh, P.V. Khadikar; Letters Drug Design &
Discovery, 2006, 3, 129.
62. R. Wootton, F.E. Norrington, R.M. Hyde, S.G. Williams; J. Med. Chem., 1975, 18,
604.
63. M. Randic; Croatica Chem. Acta, 1993, 66, 289.
64. S. Chatterjee, A.S. Hadi, B. Price; “Regression Analysis Examples”,
65. MSTAT, Software, 1998.
66. J. Topliss; Ed., “Quantitative Structure-Activity Relationships of Drugs”, Academic
Press, New York, 1983.
67. J.K. Seydel; Ed., “QSAR and Strategies in the Design of Bioactive Compounds”,
VCH, Weinheim, 1985.
68. M.S. Tute; In Advances in Drug Research, N.J. Harper, A.B. Simmonds; Eds., Academic
Press, London, 1971, 6, 1.