Chapter 2

STATISTICAL INTRODUCTION

Introduction
In general the modeling problem can be defined as estimating the relation between a set of predictor variables X and one or more response variables Y:

Y = f(X) + ε (2.1)

in which ε contains all kinds of errors, such as sampling and measurement errors. The function f(X) is the conditional expectation

f(X) = E(Y|X) (2.2)

and can be estimated by an approximation on a certain dataset (xᵢ, yᵢ), i = 1, …, n. The estimation of the above-mentioned relation can be achieved
by deriving an equation that describes the physical or chemical mechanisms defining the process under study. In this case the overall form of the relationship is known and only the values of some parameters have to be estimated. When the functional relations are not (yet) known, these functions have to be approximated by an empirical model. The first objective in the modeling procedure then is to search for an appropriate class of functions of which the true function is expected to be a member. Next a model is selected from this class of functions. The success of this procedure depends on the existence of a convenient class and a correct choice of the class of functions. When a parametric regression model is used, one assumes that the true underlying function is a member of a very strict family of (parametric) models, whether these are linear or nonlinear.

Classes of models that are relatively more flexible are used in the so-called nonparametric regression methods. These flexible models adapt themselves to the
data, which means that these methods determine an appropriate model solely from the data. Methods of this type can generate very complex models, so the risk of overfitting is considerable.
This chapter will focus on different theoretical aspects of modeling. The order in which subjects are covered follows the application of these tools in the subsequent chapters of this thesis. Generally, in response surface modeling the first objective is the choice of an appropriate design, because this defines the limits on the information that can be obtained and therefore has a large influence on the modeling. The second part of the response surface modeling process is the choice of a certain class of models, from which the exact model is estimated. These two steps cannot be considered separately from each other. The type of model defines the choice of an optimal design, while the design choice puts restrictions on the type of model that can be used. Criteria will be described that contribute to the search for an appropriate model and to the evaluation of the predictive performance of the estimated model.
Parametric modeling
In regression analysis most often a linear model is used to describe the relation between variables. A linear model is defined by a linear summation of the predictor variables:

y = β₀ + Σᵢ βᵢzᵢ + ε (2.3)

in which zᵢ may be all kinds of transformations and cross-products of the original predictor variables xᵢ, and β₀ and βᵢ are model parameters. Such a model is based on Taylor-series expansions. The set of linear parametric functions is a very flexible family and is often used when no theoretical model is available.
An example of a nonlinear parametric model is the Michaelis-Menten model for enzyme kinetics:

y = β₁x / (β₂ + x) + ε (2.4)

where β₁ and β₂ are model parameters. With the appropriate transformation this model becomes linear:

1/y = 1/β₁ + (β₂/β₁)(1/x) (2.5)

y′ = β₁′ + β₂′x′

A model like the Michaelis-Menten model is called intrinsically linear or transformably linear, because after the appropriate transformation of variables or parameters a linear model will result.
An example of an intrinsically nonlinear model is

y = β₀ / (1 + exp[−(β₁ + β₂x)/β₃])^β₃ + ε (2.6)

The parameters are incorporated nonlinearly in the model, and no transformation of parameters or variables will convert the model into a linear model.

A nonlinear model can be used when theory suggests such a model, when linear models show considerable lack-of-fit, or when nonlinear behavior can be detected in the data upon visual inspection [47].
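The linearization of equation (2.5) can be checked numerically. The sketch below uses illustrative parameter values (not from the text) and recovers the Michaelis-Menten parameters from noise-free data with an ordinary straight-line fit in 1/x:

```python
import numpy as np

# Illustrative sketch: recover Michaelis-Menten parameters via the
# linearizing transform of eq. (2.5). The "true" values are assumptions.
beta1, beta2 = 2.0, 0.5
x = np.linspace(0.1, 5.0, 20)
y = beta1 * x / (beta2 + x)        # noise-free Michaelis-Menten response

# 1/y = 1/beta1 + (beta2/beta1) * (1/x): a straight line in 1/x
slope, intercept = np.polyfit(1.0 / x, 1.0 / y, 1)
beta1_hat = 1.0 / intercept        # beta1 from the intercept
beta2_hat = slope * beta1_hat      # beta2 from the slope
```

With noisy data the reciprocal transform also distorts the error structure, which is one reason nonlinear regression on the original scale is often preferred.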
Nonlinear parametric regression
The estimation of parameter values in nonlinear regression is much more complex than in linear regression. This section explains why.

In regression the search is for the least squares estimator of the parameters, which minimizes

S(β) = Σᵢ₌₁ⁿ (yᵢ − f(Xᵢ, β))² (2.7)

This minimization is equivalent to setting the derivative of S with respect to each βⱼ to zero:

∂S(β)/∂βⱼ = −2 Σᵢ₌₁ⁿ (yᵢ − f(Xᵢ, β)) [∂f(Xᵢ, β)/∂βⱼ]|β=β̂ = 0 (2.8)
In these equations Xᵢ contains the values of the independent variables at design point i, while yᵢ is the value of the dependent variable at design point i (only one dependent variable is considered). βⱼ is the j-th element of the vector β, which contains the model parameters. In this thesis the independent variables will also be called predictor variables and the dependent variable will also be called the response variable. In linear regression ∂f/∂βⱼ equals the j-th element of Xᵢ and is thus independent of β. For a linear model, equation (2.8) becomes

∂S(β)/∂β = −2Xᵀ(y − Xβ) = 0 (2.9)

and can easily be solved analytically. In the case of nonlinear regression an iterative method is needed to solve the equations.
One iterative method, the steepest descent method, is based on a gradient. Writing equation (2.8) as a product of matrices gives:

∂S(β)/∂β = −2(y − f(X, β))ᵀ[∂f(X, β)/∂β] (2.10)

in which ∂S(β)/∂β is a vector of size 1×p (p is the number of model parameters). Define the following:

F. = ∂f(X, β)/∂β (2.11)

and

e = y − f(X, β) (2.12)

F. is an n×p matrix (n is the number of design points) and is often called the Jacobian matrix, e is an n×1 vector, and −F.ᵀe is the gradient along which S(β) increases; thus δ = F.ᵀe is the direction of steepest descent. To calculate this gradient, initial estimates of the parameters β are required. The new estimate for β is then β₁ = β₀ + kδ. This step is repeated until the decrease in S(β) for successive iterations falls below a certain limit. For the m-th iteration, k should be chosen so that S(βₘ + kδ) < S(βₘ) is fulfilled, where S is the sum of squared errors. The slow convergence of this method often leads to very long procedures.
The Gauss-Newton algorithm gives another iterative procedure. It uses a linearization based on a Taylor expansion in the parameter space to estimate parameter values. For the Taylor expansion of f(Xᵢ, β) around the point β⁰, if β is close to β⁰, the following approximation holds:

f(Xᵢ, β) ≈ f(Xᵢ, β⁰) + Σⱼ₌₁ᵖ [∂f(Xᵢ, β)/∂βⱼ]|β=β⁰ (βⱼ − βⱼ⁰) (2.13)

If we set

δⱼ⁰ = βⱼ − βⱼ⁰ (2.14)

Fⱼᵢ⁰ = [∂f(Xᵢ, β)/∂βⱼ]|β=β⁰ (2.15)

the following equation results:

yᵢ − f(Xᵢ, β⁰) = Σⱼ₌₁ᵖ δⱼ⁰Fⱼᵢ⁰ + εᵢ (2.16)

with εᵢ = yᵢ − f(Xᵢ, β). Equation (2.16) is linear in the parameters and can easily be solved for δⱼ⁰ with normal linear least squares theory. The resulting vector δ⁰ gives the search direction. The length of this vector can be adjusted to avoid divergence or too slow convergence. We can put this Gauss-Newton algorithm in
matrix form, using the same notation, to compare it with the steepest descent method:

e = F.δ (2.17)

Solving this equation for δ leads to:

δ = (F.ᵀF.)⁻¹F.ᵀe (2.18)
The Marquardt method gives a compromise between the Gauss-Newton and steepest descent directions:

δ = (F.ᵀF. + λ diag(F.ᵀF.))⁻¹F.ᵀe (2.19)

If λ → 0 the Marquardt direction approaches the Gauss-Newton direction, and if λ → ∞ it approaches the direction of steepest descent. The parameter λ is normally adjusted based on the outcome of the sum of squared errors. If S(βₘ + δ) < S(βₘ) then λ = λ/ν (ν > 1) in the next iteration; λ decreases and the search direction moves closer to the Gauss-Newton direction. If S(βₘ + δ) > S(βₘ) then λ is increased, λ = λν with ν > 1, and the search direction moves closer to the steepest descent direction. More detailed information on these and other iterative techniques can be found in refs. [47] and [48].
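The Marquardt update of equation (2.19) can be sketched as follows. The exponential model, starting values and the factor ν = 10 are illustrative choices, not prescribed by the text:

```python
import numpy as np

# Illustrative sketch of the Marquardt compromise step (eq. 2.19) for the
# assumed model y = b1 * exp(b2 * x), fitted to noise-free data.
def f(x, b):
    return b[0] * np.exp(b[1] * x)

def jac(x, b):                       # F.: n x p matrix of df/db_j
    return np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])

x = np.linspace(0.0, 1.0, 15)
y = f(x, [2.0, 1.5])                 # data generated with known parameters
b, lam, nu = np.array([1.0, 1.0]), 1e-3, 10.0
for _ in range(100):
    e = y - f(x, b)
    F = jac(x, b)
    A = F.T @ F
    delta = np.linalg.solve(A + lam * np.diag(np.diag(A)), F.T @ e)
    if np.sum((y - f(x, b + delta))**2) < np.sum(e**2):
        b, lam = b + delta, lam / nu   # success: move toward Gauss-Newton
    else:
        lam *= nu                      # failure: move toward steepest descent
```

Because the data are noise-free and the starting values are reasonable, the iteration recovers the generating parameters (2, 1.5) essentially exactly.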
All iterative procedures require initial estimates of the parameter values. These initial estimates may be based on experience or knowledge (e.g. from previous estimation of parameter values of equivalent models). Intelligent guesses by inspection of the shape of the response surface, as suggested by Ratkowsky [49], or the use of genetic algorithms for a first rough search of the entire feasible parameter space [37] are other options.
Bias-variance trade-off
Measurements are subject to error, and these random variations are the reason that a model can never exactly describe the fluctuations in the response variable. The prediction error may be divided into three different terms:

E(y − ŷ)² = E(y − μ + μ − E(ŷ) + E(ŷ) − ŷ)²
          = E(y − μ)² + E(μ − E(ŷ))² + E(E(ŷ) − ŷ)² (2.20)

In this equation ŷ is the predicted value of y, while μ is the expected value of y. The term E(y − μ)² is the measurement error, while E(E(ŷ) − ŷ)² is the variance in ŷ. These two terms are usually taken together and called the variance component of the prediction error. The second term E(μ − E(ŷ))² is called the bias; this
Figure 2.1: The bias-variance trade-off (prediction error as a function of model complexity)
is the difference between the real value of y, μ, and the expectation of ŷ for the studied model [50]. More complex models will reduce the bias, while at the same time increasing the variance of the predictions (see figure 2.1). The lower boundary of the variance is the measurement error. Model selection involves determining the optimal bias-variance trade-off and thus estimating the model complexity. The performance of a model can be evaluated on certain criteria, depending on the objective of the modeling procedure.
Model selection
In modeling a certain set of data, more than one model may seem appropriate. In that case these models have to be compared. In creating sets of possible models, two situations may be distinguished. The first situation arises when nested models compose the total set of possible models. In this case all possible models are part of a larger model and can be derived from it by setting certain parameters to zero. The second situation arises when non-nested models are compared. In this thesis only the first situation is considered. For nested models many selection strategies are available. References [51, 52] describe different methods for model selection. One example is the all-subsets regression strategy, where all possible combinations of the individual parameters of the overall model are evaluated and the best subset model is selected. A different approach is the
forward selection procedure, where the starting point is a model with no terms and terms are subsequently added. The backward elimination procedure follows the reverse process: the starting point is the most complex model, and in each step one term is removed from it. Combinations of the last two procedures are also possible, for instance stepwise regression. Duineveld [53] warns of a pitfall of model selection, data mining, which is more of a problem with all-subsets selection than with forward selection or backward elimination.
Criteria for model selection
Model selection can be based on the selection of significant parameters. The significance of parameters is determined by their confidence interval and thus by their standard deviation. In this subsection two variance estimation methods for the model parameters are described [54, 55, 56]. The first variance estimator is the asymptotic variance, which is based on a linearization of the model. This linearized model is then evaluated at the estimated parameter values, under the assumption that a large set of data is available (this assumption is why it is called the asymptotic variance). With the asymptotic variance, confidence intervals of the parameters β can be estimated:

β̂ ~ Nₚ(β, σ²C⁻¹),  C = FᵀF (2.21)

F is a matrix containing the first-order derivatives of the model equation with respect to the model parameters. C is evaluated at the estimated parameter values. The uncertainty in the asymptotic variance can be very large when there is much intrinsic curvature in the model [47]. Error heteroscedasticity may also influence the resulting values of the asymptotic variances.
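For a linear model, F in equation (2.21) is simply the design matrix X, and the asymptotic covariance reduces to the familiar least squares result s²(XᵀX)⁻¹. A minimal sketch on simulated data (the model and parameter values are illustrative):

```python
import numpy as np

# Illustrative linear example of eq. (2.21): for f = X @ beta the derivative
# matrix F equals X, so cov(beta_hat) = sigma^2 * (X^T X)^(-1).
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 40)
X = np.column_stack([np.ones_like(x), x])          # F for the linear model
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, x.size)   # assumed true beta = (1, 2)

b = np.linalg.lstsq(X, y, rcond=None)[0]
n, p = X.shape
s2 = np.sum((y - X @ b) ** 2) / (n - p)            # estimate of sigma^2
cov = s2 * np.linalg.inv(X.T @ X)                  # asymptotic covariance matrix
se = np.sqrt(np.diag(cov))                         # standard errors of the betas
```

For a nonlinear model the same expressions apply with F evaluated at the estimated parameters, which is exactly why the estimate is only asymptotically valid.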
The second variance estimator is the jackknife variance estimator σ̂ᴶ², a variance estimator which is based on resampling [55]:

σ̂ᴶ = √[ ((n − 1)/n) Σᵢ₌₁ⁿ (β₍ᵢ₎ − β₍·₎)² ] (2.22)

In this equation β₍ᵢ₎ represents the parameter value estimated without observation i, and β₍·₎ is the average of all β₍ᵢ₎. Both the asymptotic and the jackknife standard deviation follow a t-distribution. The jackknife variance has been proposed [57] to be independent of the error distribution and intrinsic curvature.
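A sketch of the jackknife estimator of equation (2.22), here applied to the sample mean as the estimated "parameter"; for the mean the jackknife standard error coincides exactly with the classical s/√n:

```python
import numpy as np

def jackknife_se(data, estimator):
    """Jackknife standard error of an estimator (eq. 2.22)."""
    n = len(data)
    # Leave-one-out estimates theta_(i) and their average theta_(.)
    theta_i = np.array([estimator(np.delete(data, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean())**2))

# Invented replicate data for illustration
x = np.array([2.1, 2.5, 1.9, 2.4, 2.2, 2.8, 2.0])
se = jackknife_se(x, np.mean)
```

In a regression setting the estimator argument would refit the model without observation i and return the parameter of interest.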
          β₀      β₁      β₂      δ₀       δ₁        δ₂       γ₀           γ₁       γ₂
βᵢᵗʳᵘᵉ    1.531   -4.805  2.113   19.336   -11.371   11.635   1.203·10⁻⁴   -7.523   3.168
β̄ᵢ        1.524   -4.830  2.207   19.264   -11.309   11.486   1.186·10⁻⁴   -7.352   2.695
β̄ᵢᴶ       1.524   -4.831  2.207   19.268   -11.311   11.490   1.193·10⁻⁴   -7.384   2.763

Table 2.1: Mean parameter values β̄ᵢ estimated on the total dataset and mean parameter values β̄ᵢᴶ estimated in the jackknife procedure, compared to the true parameter values βᵢᵗʳᵘᵉ
To illustrate the performance of the two variance estimators, a 9-parameter model f(x, β) was used that will also be used in Chapter 4. The simulated datasets were constructed according to the following model:

y = f(x, β) + ε

where

f(x, β) = (β₀e^(β₁φ+β₂φ²) + δ₀e^(δ₁φ+δ₂φ²)[H⁺]) / (γ₀e^(γ₁φ+γ₂φ²) + [H⁺]) (2.23)

and

ε ~ N(0, σ²f(x, β)²)

In this model φ and [H⁺] are the predictor variables. Parameter values used to generate the simulated data are based on results of fitting the same model to real data and can be found in the first row of Table 2.1 (βᵢᵗʳᵘᵉ). Errors were added to the generated data according to the above-mentioned model. A random number generator was used to create the normally distributed ε, while the scaling constant σ was given a value of 0.05. Estimates of the asymptotic variance s²ₐ and the jackknife variance s²ᴶ were calculated on 100 simulated datasets, each containing 30 observations. For each simulated dataset the pH (= −log[H⁺]) was varied over 6 different values (2-7), while the variable φ was varied over 5 different values (0-0.5). The average values (s̄ₐ and s̄ᴶ) and the standard deviations s(sₐ) and s(sᴶ) of the standard deviation estimators over the 100 different simulated datasets were calculated. The question is whether or not the use of the different variance estimators results in the same model, i.e. will both estimators cause the same model parameters to be significant?
In Table 2.2 the results of the simulation studies of the two model selection criteria, the asymptotic standard deviation sₐ and the jackknife standard deviation sᴶ, are collected. The behavior of the asymptotic standard deviation was of particular interest during the study, since it is easier to obtain than an estimate of the
         β₀      β₁      β₂      δ₀      δ₁      δ₂      γ₀           γ₁      γ₂
s̄ₐ       0.127   1.866   6.275   0.174   0.302   1.091   6.77·10⁻⁶    2.315   9.296
s̄ᴶ       0.154   1.369   3.425   0.959   0.723   1.817   3.97·10⁻⁵    4.449   11.474
s(sₐ)    0.057   0.844   2.834   0.079   0.136   0.495   3.35·10⁻⁶    1.029   4.226
s(sᴶ)    0.067   0.495   1.115   0.722   0.377   0.853   3.26·10⁻⁵    1.996   4.458

Table 2.2: Mean values and standard deviations of the asymptotic standard deviation sₐ and the jackknife standard deviation sᴶ from simulated data
jackknife standard deviation. The asymptotic standard deviation is standard output of the SAS nonlinear least squares procedure NLIN [58]. The table shows the average value of the asymptotic variance and the jackknife variance, together with the corresponding variation over the 100 simulations. It can clearly be seen (Table 2.2) that the jackknife and asymptotic methods behave differently for different parameters, but an interpretation of the results is difficult. No real difference can be seen between the two variance estimators (compare s̄ₐ and s̄ᴶ), although for δ₀ and γ₀ the jackknife standard deviation is much larger than the asymptotic standard deviation. For β₁ and γ₁ a problem arises, because there is no agreement between the two variance estimators concerning the significance of these parameters. Remarkable are the high values of the standard deviation for parameters β₂ and γ₂. The variation caused by the curvature that results from the inclusion of a quadratic term with the given parameter values of the generating model seems to be too small to be modeled. In Table 2.1 the means of the estimated parameter values are given. Except for γ₂ these values are very similar to the true parameter values βᵢᵗʳᵘᵉ.
Another approach to model selection is the application of an F-test, which, when nested models are evaluated, assesses the gain in fit achieved by adding extra parameters when going from a partial model to the full model. In Table 2.3 the essential elements of the extra sum of squares analysis for model selection are summarized. In this table S is the residual sum of squares, ν the degrees of freedom and s² the mean sum of squares, while N is the number of observations in the dataset and P the number of parameters. The subscripts e, f and p refer to extra, full and partial. The partial model is accepted if the F-ratio is lower than the value of F(νe, νf; α). For nonlinear models the ratio of the sums of squares is only approximately F-distributed, as explained by Bates and Watts [48], but use of the F-test is appropriate in general.
Source             Sum of Squares   Degrees of Freedom   Mean Square    F Ratio
Extra parameters   Se = Sp − Sf     νe = Pf − Pp         s²e = Se/νe    s²e/s²f
Full model         Sf               νf = N − Pf          s²f = Sf/νf
Partial model      Sp               N − Pp

Table 2.3: Extra sum of squares analysis
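The extra sum of squares analysis of Table 2.3 can be sketched for two nested polynomial models; the data and the choice of models are illustrative:

```python
import numpy as np

# Illustrative extra sum of squares F-test: quadratic (full) vs linear
# (partial) model on data that are truly linear.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, x.size)

def rss(y, yhat):
    return np.sum((y - yhat)**2)

S_f = rss(y, np.polyval(np.polyfit(x, y, 2), x))   # full model, P_f = 3
S_p = rss(y, np.polyval(np.polyfit(x, y, 1), x))   # partial model, P_p = 2
nu_e, nu_f = 3 - 2, x.size - 3
F = ((S_p - S_f) / nu_e) / (S_f / nu_f)
# Compare F with F(nu_e, nu_f; alpha): accept the partial model if F is smaller
```

Because the generating relation is linear, the F-ratio will usually fall below the critical value and the partial model is retained.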
Yet another approach to model selection can be the use of cross-validation
criteria [59], which will be described in the next section.
Criteria for model validation and method evaluation
For the selection of an appropriate model or model complexity, or for the evaluation of a modeling method, all kinds of criteria are available. An example is the correlation coefficient R² = SSR/SST, in which SSR is the regression sum of squares and SST is the total sum of squares, corrected for the mean. R² is a measure of the percentage of the variation in the data that is explained by the regression model. Other examples are the mean squared error, Mallows' Cp [60] and the adjusted correlation coefficient R²_adj = 1 − [SSE/(n−q−1)]/[SST/(n−1)], which can only be used for linear models and where SSE is the sum of squared errors. All these criteria give an impression of how well a parametric model describes the dataset.
For prediction purposes different criteria must be used. If an independent test set is available, the mean squared error of prediction over the test set can be used:

mSEP = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (2.24)

in which n is the number of observations in the test set and yᵢ and ŷᵢ are, respectively, the experimental and predicted response values. When no test set is available, a cross-validation criterion can be used. The dataset is split into a training set, on which a model is estimated, and a test set on which the model is evaluated. The splitting procedure is repeated until all observations have been in the test set once and only once. The leave-one-out cross-validation criterion mPRESS (mean of the predictive error sum of squares) is most often used [61, 62]:

mPRESS = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷ₍ᵢ₎)² (2.25)
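A sketch of the mPRESS criterion of equation (2.25). For linear least squares, the explicit leave-one-out loop agrees exactly with the well-known shortcut eᵢ/(1 − hᵢᵢ) based on the hat matrix, which the sketch uses as a cross-check:

```python
import numpy as np

# Illustrative data and a straight-line model (both invented for the demo)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.2, x.size)
X = np.column_stack([np.ones_like(x), x])

# Explicit leave-one-out loop (eq. 2.25)
press = 0.0
for i in range(len(y)):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    press += (y[i] - X[i] @ beta)**2
mPRESS = press / len(y)

# For linear least squares the same value follows from e_i / (1 - h_ii)
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
mPRESS_short = np.mean((e / (1 - np.diag(H)))**2)
```

For nonlinear or flexible models no such shortcut exists in general, and the explicit refitting loop is the way mPRESS is computed.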
In this case the predicted response value ŷ₍ᵢ₎ is predicted with a model which was estimated on the dataset minus the i-th observation, while the test set contains only one observation. These two criteria can be used for parametric as well as for nonparametric flexible modeling methods. A variant of mPRESS is the generalized cross-validation criterion [63], which contains a leverage correction term:

GCV = Σᵢ₌₁ⁿ (yᵢ − ŷ₍ᵢ₎)² (1 − hᵢᵢ) / (1 − Σⱼ₌₁ⁿ hⱼⱼ/n) (2.26)

in which hᵢᵢ is the leverage of observation i. Not including a very influential observation (one with a high leverage) in the modeling will result in a bad leave-one-out prediction of that observation, and the cross-validation residual will also be very large. Therefore, the cross-validated residual is weighted with a factor proportional to its relative importance. The mPRESS and GCV criteria can also be used for model selection (see for instance the paragraphs on Multivariate Adaptive Regression Splines and artificial neural networks).
Measurement error
In analytical chemistry the measurement errors are sometimes assumed to vary with the measured analytical response. For capacity factor values this is also true, as will be discussed in Chapter 5. For non-constant measurement errors, the coefficient of variation may be a more useful estimate of the measurement error than the standard deviation. This coefficient of variation, also called the relative standard deviation (RSD), is the standard deviation s expressed as a percentage of the mean:

RSD = (s/ȳ) · 100% (2.27)

Suppose that two sets of replicated experiments are collected for which the following assumptions are true:

Y₁ ~ N(μ₁, σ₁²)
Y₂ ~ N(μ₂, σ₂²) (2.28)

and

σ₁/μ₁ = σ₂/μ₂

Under these assumptions the RSD can be pooled over different sets of replicated experiments. The pooled RSD (pooled over n sets of replicated experiments) is defined by:

RSD²pooled = Σᵢ dfᵢ·RSDᵢ² / Σᵢ dfᵢ (2.29)

where dfᵢ is the number of degrees of freedom for the i-th set of replicated measurements. For capacity factors the assumptions of equation (2.28) are approximately
true for capacity factor values between 1 and 10. Small capacity factor values have a measurement error that is approximately constant, which results in increasing values of the relative standard deviation. In a recent paper by Rocke and Lorenzato [64] a model for two types of measurement error (constant and multiplicative) is proposed. It describes a method to estimate the measurement error due to these two types of error over the entire range of possible response values.
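The pooled RSD of equation (2.29) can be sketched as follows; the replicate sets are invented for illustration:

```python
import numpy as np

def pooled_rsd(groups):
    """Pool the relative standard deviation over replicate sets (eq. 2.29)."""
    df, rsd2 = [], []
    for g in groups:
        g = np.asarray(g, dtype=float)
        df.append(len(g) - 1)                              # degrees of freedom df_i
        rsd2.append((100.0 * g.std(ddof=1) / g.mean())**2) # RSD_i squared, in %
    df, rsd2 = np.array(df), np.array(rsd2)
    return np.sqrt(np.sum(df * rsd2) / np.sum(df))

# Two invented sets of replicate measurements with a common relative error
sets = [[1.02, 0.98, 1.00], [5.1, 4.9, 5.2, 4.8]]
rsd_p = pooled_rsd(sets)
```

For these invented sets the individual RSDs are 2.0% and about 3.65%, and the degrees-of-freedom weighting gives a pooled value of about 3.1%.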
Heteroscedasticity
Heteroscedasticity means nonconstant variance [65] and is an important phenomenon that should be taken into account in modeling, because of its influence on model validation and the construction of (optimal) designs. In figure 2.2 an example of a response with a heteroscedastic variance structure is given. The underlying relation between the response y and the predictor variable for this example is:

y = exp(x/4) + ε (2.30)

The standard deviation σ of ε, which is normally distributed, depends on the value of the systematic part of y. For the example in figure 2.2 the value of σ was taken to be proportional to the response, σy = σ* exp(x/4), and the value of σ* was set to 0.1. When replicated response values are generated for different settings of the x-variable, a funnel-shaped structure is the result. As will be discussed in the next few subsections, a logarithmic transformation of the response values can transform this type of heterogeneous error (relative error) into a homogeneous error (see figure 2.3). Not all observations have the same influence on the estimation of model parameters. Observations with a large influence on the model are expected to give small residuals. Given a homogeneous error structure (var(ε) = σ²), it can be proven [47, 66] that the resulting residuals may be nonhomogeneous (var(e) = (I − H)σ²), in which I is the identity matrix and H is the hat matrix containing the leverage values. To detect variance heterogeneity, Cook and Weisberg [67] suggested the use of leverage-corrected residuals, the so-called studentized residuals.

When leverage-corrected residuals (residuals eᵢ divided by √(1 − hᵢᵢ)) are used and they reveal a funnel-shaped structure, this can no longer be due to the experimental design. This implies that the funnel shape is caused by a variance which is not constant over the entire response range.

If the variance is non-constant, the quality of information about the response in regions where the variance is large is inferior to the quality of information in regions where the variance is small [65, 68]. Thus, the heterogeneity of the variance is an influential factor in the modeling process.
There are two categories of methods for dealing with variance heterogeneity. The first category contains methods based on weighted regression and weighted least squares, while the second category contains methods based on data transformations.

Figure 2.2: Heteroscedastic variance

Figure 2.3: Illustration of the effect of a logarithmic transformation on the error distribution
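The effect shown in figures 2.2 and 2.3 can be reproduced with simulated data from equation (2.30) with a proportional error: on the original scale the group standard deviations grow with the mean, while after the log transform they are roughly constant (the replication scheme below is an illustrative choice):

```python
import numpy as np

# Simulate eq. (2.30) with a standard deviation proportional to the response
rng = np.random.default_rng(2)
x = np.repeat(np.arange(1.0, 11.0), 200)                   # replicates per x setting
y = np.exp(x / 4) * (1 + 0.1 * rng.normal(size=x.size))    # sigma* = 0.1

def group_sd(values):
    """Standard deviation of the replicates at each setting of x."""
    return np.array([values[x == xi].std(ddof=1) for xi in np.unique(x)])

sd_raw = group_sd(y)          # grows with the mean: the funnel of figure 2.2
sd_log = group_sd(np.log(y))  # roughly constant after the log transform
```

The raw standard deviations span roughly a factor e^(9/4) from the smallest to the largest x, while on the log scale all groups have a spread of about 0.1.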
Weighted least squares
The most obvious method that can be used to correct for a heterogeneous variance
structure is Weighted Least Squares (WLS) [68]. One way to perform WLS is
using replicate experiments to estimate the variance structure. The inverses of the, possibly smoothed, error variances are the weights to be used in weighted least squares. To get a reasonably accurate estimate of these variances the number of replicates should be large. In some cases this would cost too much experimental effort. Fortunately, there are ways to overcome this problem. For instance, a good alternative to WLS is a generalized version of WLS, Generalized Least Squares. Another alternative is the application of transformations to the response variables.
Generalized least squares
For responses with a heterogeneous error structure a general model can be constructed [65]:

y = f(x, β) + σg(μ, z, θ)ε (2.31)

in which ε is random, with a normal distribution with mean 0 and standard deviation 1, and σ is a scaling constant. f(x, β) describes the functional relation between the regressors x and the dependent variable y, β being the set of parameters defining this relation. In the case of modeling capacity factors, y is equal to the capacity factor k, and x consists of the variables pH and the modifier content of the mobile phase. g²(μ, z, θ) is the variance function, where z is a set of independent variables which may contain (part of) x, μ is the mean of the functional response value f(x, β) for certain values of x, and θ is a set of parameters for the variance function. The variance function may be known completely, when parameter values are known, or partially unknown, but at least the overall shape of the function needs to be defined. When the variance function is partially unknown, the values of θ have to be estimated. In the case where y is the capacity factor, it is reasonable to assume that the variance function looks like

g²(μ, z, θ) = f(x, β)^(2θ) (2.32)

This function is now defined as related to the theoretical model for the capacity factor, represented by f(x, β). The corresponding errors are σf(x, β)^θ ε. If θ = 0 the variance of y is homogeneous, while if θ = 1 the standard deviation is proportional to the values of y. In the latter case a logarithmic transformation of the response y, before modeling, would be appropriate.
To model this variance function both σ and θ need to be estimated, while for modeling the functional relation also β needs to be estimated. The estimation of the parameter values of the functional relationship and the estimation of the variance function can be implemented as an iterative process using the following algorithm [65, 68]:
- First, β of f(x, β) is estimated using unweighted nonlinear regression.

- Subsequently, θ of g(μ, z, θ) is estimated using nonlinear regression on the residuals that result from the first step. In the case of modeling capacity factors, θ is estimated.

- Then β is estimated again using weighted nonlinear regression, for which the weights are given by w = 1/g²(μ, z, θ) = 1/f^(2θ)(x, β). The θ used to calculate these weights results from the previous iteration step.
These three steps are usually repeated until convergence, or for a predefined number of times, assuming that convergence has then been reached. A discussion on the choice of the number of cycles in the iteration can be found in Carroll and Ruppert [68].
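The three-step cycle can be sketched for the variance function of equation (2.32). The linear choice of f(x, β) and the way θ is estimated here (regressing log|residual| on log f) are illustrative simplifications, not the procedure prescribed by refs. [65, 68]:

```python
import numpy as np

# Illustrative generalized least squares cycle with g^2 = f^(2*theta).
rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 200)
f = lambda x, b: b[0] + b[1] * x                # a simple linear f(x, beta)
y = f(x, [1.0, 2.0]) * (1 + 0.05 * rng.normal(size=x.size))  # theta = 1 data

X = np.column_stack([np.ones_like(x), x])
w = np.ones_like(x)                             # start unweighted (step 1)
for _ in range(5):
    # (weighted) least squares estimate of beta
    Xw, yw = X * np.sqrt(w)[:, None], y * np.sqrt(w)
    b = np.linalg.lstsq(Xw, yw, rcond=None)[0]
    # crude estimate of theta from the residuals (step 2)
    r = y - f(x, b)
    theta = np.polyfit(np.log(f(x, b)), np.log(np.abs(r) + 1e-12), 1)[0]
    # weights for the next weighted fit (step 3)
    w = 1.0 / f(x, b)**(2 * theta)
```

Since the data were generated with a standard deviation proportional to the mean, the cycle should settle near θ ≈ 1 with β close to the generating values.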
Transformations
To obtain a homogeneous variance, a transformation can be applied to the response variable as well as to the predictor variables. Another reason for applying a transformation can be to obtain a less complex model (for instance linearization). When the structure of the variance heterogeneity is exactly known, a transformation of the response and predictor variables can be chosen that makes the variance constant. Examples of transformation functions are the quadratic, logarithmic or inverse transform. Models can be constructed on evidence or knowledge of the functional form of an underlying chemical or physical relation. To maintain such a model together with a transformation of the response, the model describing the relation is also transformed by the same function. This procedure is called Transform Both Sides (TBS).

When no information is available on the structure of the variance, the appropriate transformation has to be estimated. The Box-Cox power transformation family covers the whole scale of TBS procedures, as long as the transformation is not complicated by the need for transformations on the predictor variables:

h(v, λ) = (v^λ − 1)/λ   if λ ≠ 0
h(v, λ) = ln(v)         if λ = 0 (2.33)
In figure 2.3 the results of the logarithmic transformation of the response variable y of equation (2.30) are presented. The variation in the response values compared to their mean is now the same for all settings of the predictor variable, and the variance is called homoscedastic. As a consequence of the logarithmic transformation, the relation between the response and the predictor variable has become linear, which is an extra advantage.
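A sketch of the Box-Cox family of equation (2.33); note that the power transform tends to the log transform as λ → 0, so the family interpolates smoothly between the common transformations:

```python
import numpy as np

def boxcox(v, lam):
    """Box-Cox power transform h(v, lambda) of eq. (2.33); v must be positive."""
    v = np.asarray(v, dtype=float)
    if lam == 0:
        return np.log(v)
    return (v**lam - 1.0) / lam

# Illustrative values: lambda = 1 shifts the data, lambda = 0 is the log
v = np.array([0.5, 1.0, 2.0, 4.0])
```

In a TBS setting λ is estimated from the data, and both the response and the model are transformed with the selected h(·, λ).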
Figure 2.4: Estimation of a knot position by MARS
Nonparametric methods
As was stated in the introduction section of this chapter, the modeling problem can be defined as estimating the relation between a set of predictor and response variables:

f(X) = E(Y|X) (2.34)

When flexible models are used to estimate this relation, most often one of two types of techniques is used: smoothers or regression splines [69]. These methods use local models after splitting the design space. Regression splines use piecewise polynomials on fixed intervals of the design region. The splitting points are called knot locations. Smoothers, on the other hand, use different intervals for every datapoint. This interval is oriented as a window around the specific datapoint, and a smoothed estimate of the datapoint is obtained. Examples of smoothers are the local averaging method, spline smoothing and kernel smoothing. These methods require a relatively dense distribution of data in the predictor variable space. In the next few subsections three different flexible modeling methods are described.
Multivariate Adaptive Regression Splines
The main philosophy behind Multivariate Adaptive Regression Splines (MARS) is that predictor variables may contribute more to the variation in the response in one region than in others. MARS splits the design space, and in each subregion a linear or cubic spline is fitted to the data. The subregions are defined by splitting points, also called knot locations.

Each step in the MARS algorithm is made up of three parts. First the variable has to be selected on which the split is made by placing a knot. Then the exact
Figure 2.5: Illustration of the backward elimination step in MARS: two basisfunctions (dashed lines) are replaced by a single basisfunction (solid line).
place of the knot somewhere in the variable region has to be chosen and �nallya spline basisfunction has to be �t in the regions on both sides of the knot (see�gure 2.4). Each region may be split again, while the parent regions remainincorporated in the overall model and available for composing new split regions.
Successive splits can be performed on all earlier estimated basisfunctions, whileold basisfunctions are retained after splitting. The basisfunctions Bm are tensorproducts of univariate spline functions (each univariate spline functions involvesonly one predictor and one knot location):
B_m^q(x) = \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]_+^q    (2.35)
in which q is the order of the spline (q = 3 gives a cubic spline), t is the knot location, and the subscript + indicates that only the part of the function that gives positive values is evaluated. The parameter s can take the values +1 or -1, and thus defines which part of the design region becomes positive; together with the + subscript it defines which region of the splitting procedure is evaluated. K_m denotes the splitting degree, i.e. the number of splits that compose the basis function, and x_v is the predictor variable to which the split is applied. For the example in figure 2.4, if s = -1 the part of the x-region smaller than t (the knot position) is evaluated and a linear basis function (q = 1) is fitted to the data. For s = +1 the part of the x-region larger than t is evaluated and another basis function is created. The splitting degree K_m is 1.
The overall model function is a weighted summation of all basis functions:
f(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]_+^q    (2.36)
The regression coefficients a_m are estimated once the basis functions are calculated.
Decisions about model complexity are made on a lack-of-fit criterion (a type of generalized cross-validation criterion):
GCV(M) = \frac{(1/N)\sum_{i=1}^{N} [y_i - f_M(x_i)]^2}{[1 - C(M)/N]^2}    (2.37)
which contains a penalty C(M) for the number of model terms M (basis functions). After the generation of a large number of basis functions, superfluous model terms are eliminated by a one-at-a-time backward stepwise procedure (see figure 2.5). See the original research paper by Friedman [70] for more details, and [69] and [71] for short introductions.
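The truncated power basis of equation (2.35) and the GCV criterion of equation (2.37) can be sketched as follows. This is a minimal illustration in Python with NumPy, not Friedman's actual forward/backward algorithm; the knot at t = 0.4 and the piecewise-linear target are arbitrary choices for the example:

```python
import numpy as np

def hinge(x, t, s, q=1):
    """One factor of the basis function in eq. (2.35): [s*(x - t)]_+^q.

    s = +1 evaluates the region x > t, s = -1 the region x < t;
    q = 1 gives a linear and q = 3 a cubic spline piece.
    """
    return np.maximum(s * (x - t), 0.0) ** q

def gcv(y, y_hat, c_m):
    """Lack-of-fit criterion of eq. (2.37); c_m plays the role of C(M)."""
    n = len(y)
    rss = np.mean((y - y_hat) ** 2)
    return rss / (1.0 - c_m / n) ** 2

# A two-basis-function model f(x) = a0 + a1*[+(x - t)]_+ + a2*[-(x - t)]_+
# fitted to a target that is exactly piecewise linear with a kink at t = 0.4.
x = np.linspace(0.0, 1.0, 50)
y = np.where(x < 0.4, 1.0 - x, 0.6 + 2.0 * (x - 0.4))

B = np.column_stack([np.ones_like(x), hinge(x, 0.4, +1), hinge(x, 0.4, -1)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # regression coefficients a_m
score = gcv(y, B @ coef, c_m=3)                # near zero: the knot is correct
```

Because the target has its kink exactly at the chosen knot, the two hinge functions reproduce it and the GCV score is essentially zero; a misplaced knot would leave residual lack of fit.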
Projection Pursuit Regression
Projection Pursuit in general is based on the projection of high-dimensional data onto a lower-dimensional space. This projection technique can be integrated as part of all kinds of data-analytical methods [72].
Projection Pursuit Regression (PPR) was developed by Friedman and Stuetzle [73]. The method is based on the iterative estimation of linear combinations of the original predictor variables (the lower-dimensional projections) and the corresponding smooth functions that describe the relation between each projection and the response. The model can be written as:
\phi(X) = \sum_{m=1}^{M} S_{\alpha_m}(\alpha_m \cdot X)    (2.38)
Here \alpha_m is a row vector of projection weights, one weight per predictor variable, and \alpha_m \cdot X gives a linear combination of the predictor variables. To this projection a smooth S_{\alpha_m} is applied. M is the total number of projections and smooths needed to describe the variation in the response variable y.
An iterative algorithm defines the exact estimation of the model terms. For a given linear combination Z = \alpha_m \cdot X of the predictor variables X a smooth S_{\alpha_m}(Z) has to be constructed. The fraction of unexplained variance that is explained by adding the term S_{\alpha_m}(Z) can be used as a fit criterion:
I(\alpha_m) = 1 - \sum_{i=1}^{n} (y_i - S_{\alpha_m}(\alpha_m \cdot x_i))^2 / \sum_{i=1}^{n} y_i^2    (2.39)
Figure 2.6: Topology of a feedforward neural network, with an input layer, a hidden layer and an output layer.
The vector \alpha_m and the corresponding smooth S_{\alpha_m} that maximize I(\alpha_m) are selected, and the process is terminated when I(\alpha_m) is smaller than a user-specified threshold.
The M projections and smooths are produced in a stepwise procedure. As long as I(\alpha_m) is not smaller than the threshold value, residuals are calculated:
r_i = y_i - S_{\alpha_m}(\alpha_m \cdot x_i)    (2.40)
These residuals are then used as the response values in the next step of the stepwiseprocedure. The smooth can be estimated by any kind of smoothing procedure.
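One step of this stepwise procedure can be sketched as follows. This is a minimal illustration that uses a crude moving-average smoother rather than the smoother of the original paper; the test function and the two candidate projection directions are arbitrary choices for the example:

```python
import numpy as np

def moving_average_smooth(z, y, window=7):
    """A crude local-averaging smoother S(z): sort on z, average y in a window."""
    order = np.argsort(z)
    smoothed = np.convolve(y[order], np.ones(window) / window, mode="same")
    out = np.empty_like(smoothed)
    out[order] = smoothed                # map back to the original data order
    return out

def ppr_step(X, r, alpha):
    """One PPR term: project r on alpha, smooth, and score it (eqs. 2.38-2.39)."""
    z = X @ alpha                        # linear combination alpha . X
    s = moving_average_smooth(z, r)      # smooth applied to the projection
    fit = 1.0 - np.sum((r - s) ** 2) / np.sum(r ** 2)   # criterion I(alpha)
    return s, fit

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(2.0 * (X[:, 0] + X[:, 1]))    # depends on a single projection only

alpha_good = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
alpha_bad = np.array([0.0, 0.0, 1.0])    # orthogonal to the informative direction
_, i_good = ppr_step(X, y, alpha_good)
_, i_bad = ppr_step(X, y, alpha_bad)
```

The informative direction scores an I(\alpha) close to one, while the uninformative direction explains almost nothing; a PPR implementation would search over \alpha for the maximizer and then smooth the residuals in the next step.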
Neural Networks
Artificial neural networks are said to mimic the signal processing in the brain. Modeling is one of the application areas of neural networks, and at this moment neural networks are the most extensively used flexible modeling methods in chemometrics. All kinds of neural networks are available, each suited for specific kinds of problems (see for instance references [74, 75] and [76]). A feedforward backpropagation network was used in this thesis and will be described here. A feedforward network passes the signals in one direction only (forward). The building blocks of a feedforward backpropagation network are units (also called neurons or nodes), usually organized in three layers: an input layer, a hidden layer and an output layer. The units of the input layer contain the input signals, the units of the output layer contain the response values; the hidden layer provides the possibility to model nonlinear relationships. The pattern of network connections is called the topology of the network and defines the actual signal processing. One unit has multiple inputs, but only one output. A transformation function of the weighted sum of all the input signals produces the output of a unit. This transformation function can be a linear function, a threshold function or a nonlinear function, e.g. a sigmoid function [77]. The output of a unit is given by:
Z_j = \sum_i w_{ji} X_i    (2.41)
A sigmoidal transfer function is applied to this output:
f(Z_j) = \frac{1}{1 + e^{-(Z_j + \theta_j)}}    (2.42)
in which Z_j is the weighted sum of the input signals X_i and f(Z_j) is the transfer function for unit j. The bias parameter \theta_j defines the inflection point of the sigmoidal transfer curve and is in practice often estimated as an extra weight on an extra input with a fixed value of 1.
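Equations (2.41) and (2.42) define the signal processing of a single unit; a forward pass through a small network can be sketched as follows (a minimal illustration in Python with NumPy; the layer sizes, random weights and input vector are arbitrary choices):

```python
import numpy as np

def unit_output(x, w, theta):
    """Output of one unit: eq. (2.41) followed by the sigmoid of eq. (2.42)."""
    z = np.dot(w, x)                    # Z_j = sum_i w_ji X_i
    return 1.0 / (1.0 + np.exp(-(z + theta)))

def forward(x, W_hid, theta_hid, W_out, theta_out):
    """Forward pass through a network with one hidden layer."""
    h = np.array([unit_output(x, w, t) for w, t in zip(W_hid, theta_hid)])
    return np.array([unit_output(h, w, t) for w, t in zip(W_out, theta_out)])

rng = np.random.default_rng(1)
W_hid = rng.normal(size=(4, 2))         # 2 input units -> 4 hidden units
W_out = rng.normal(size=(1, 4))         # 4 hidden units -> 1 output unit
y_hat = forward(np.array([0.5, -0.2]), W_hid, np.zeros(4), W_out, np.zeros(1))
```

Every unit output lies in (0, 1) because of the sigmoidal transfer function; in practice responses are therefore scaled to this range before training.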
At the start the weights contain random values, and during the modeling procedure (training) the weights are adjusted; the weights can be seen as the modeling parameters. The estimation of the weights when estimating a relation between input and output is called supervised learning. In an iterative procedure the weights are adjusted by comparing the produced output with the target output. To adjust the weights a so-called learning rule is applied. In this study the delta rule was used, by which the errors are used to adjust the weights (backpropagation):
\Delta w_{ij} = \eta \delta_j y_i    (2.43)
\Delta w_{ij} is the change in the weight between unit i of one layer and unit j of the next layer, \eta is the learning rate and y_i is the output of unit i. This delta rule is based on a minimization of the sum of squared errors:
S = 0.5 \sum_j (\hat{y}_j - y_j)^2    (2.44)
where \hat{y}_j is the predicted and y_j the target output of unit j of the output layer. The error correction term \delta_j for the output layer that results from this minimization is:
\delta_j = (y_j - \hat{y}_j)\hat{y}_j(1 - \hat{y}_j)    (2.45)
For a unit j of the hidden layer the error correction term is calculated differently, because no target output is available; here \hat{y}_j denotes the output of hidden unit j:
\delta_j = \hat{y}_j(1 - \hat{y}_j)\sum_k \delta_k w_{jk}    (2.46)
in which k runs over the units of the output layer.
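The update rules (2.43)-(2.46) can be sketched as follows. This is a minimal illustration that trains a small network on a single input-target pair; bias terms are omitted, and the positive initialization of the output weights is an arbitrary choice to avoid a saturated start:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, eta=0.5):
    """One delta-rule update, eqs. (2.43)-(2.46); W1 and W2 are updated in place."""
    h = sigmoid(W1 @ x)                             # hidden-unit outputs
    y = sigmoid(W2 @ h)                             # network outputs
    delta_out = (target - y) * y * (1.0 - y)        # eq. (2.45)
    delta_hid = h * (1.0 - h) * (W2.T @ delta_out)  # eq. (2.46)
    W2 += eta * np.outer(delta_out, h)              # eq. (2.43), output layer
    W1 += eta * np.outer(delta_hid, x)              # eq. (2.43), hidden layer
    return 0.5 * np.sum((target - y) ** 2)          # eq. (2.44)

rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 2))                        # 2 inputs -> 3 hidden units
W2 = np.abs(rng.normal(size=(1, 3)))                # positive start avoids saturation
x, t = np.array([1.0, 0.0]), np.array([1.0])
errors = [backprop_step(x, t, W1, W2) for _ in range(300)]
```

Repeating the update drives the error criterion S of equation (2.44) down, which is exactly the gradient-descent behavior the delta rule is derived from.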
Yasui [78] developed an algorithm, called parametric lateral inhibition, to estimate the optimal number of hidden neurons. In effect this algorithm eliminates redundant network connections. Weights are adjusted by an extra correction term:
\Delta w'_{ij} = \Delta w_{ij} + \Delta V_{ij}    (2.47)
where \Delta V_{ij} is the extra term that provides for the elimination of redundant connections:

\Delta V_{ij} = -\varepsilon\,\mathrm{sgn}(w_{ij})\Big(\sum_{i=1}^{n}\sum_{j=1}^{n} |w_{ij}| - |w_{ij}|\Big)    (2.48)
where \varepsilon is a constant. Equation (2.48) shows that the connections to a unit decrease the weights of the other connections: the growth of (the weights of) certain connections thus eliminates other connections. For applications of feedforward backpropagation neural networks in analytical chemistry, references [79, 80] are good examples.
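The pruning term of equation (2.48) can be sketched as follows (a minimal illustration; the weight matrix and the value of \varepsilon are arbitrary choices for the example):

```python
import numpy as np

def lateral_inhibition_term(W, eps=1e-3):
    """Extra correction Delta V_ij of eq. (2.48) for a layer's weight matrix W."""
    total = np.sum(np.abs(W))
    # every other connection pushes w_ij toward zero, scaled by eps
    return -eps * np.sign(W) * (total - np.abs(W))

W = np.array([[2.0, 0.05],
              [-1.5, 0.02]])
dV = lateral_inhibition_term(W)
W_new = W + dV
```

Each correction opposes the sign of its own weight, and its size grows with the total weight mass elsewhere, so small (redundant) connections are driven to zero fastest while large connections are barely affected.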
Optimum experimental design
In the field of optimum experimental design one is engaged in the construction of designs that give the best performance on a certain statistical criterion. One such criterion is that the expected maximum variance of a prediction should be minimized. The design choice influences this variance, which can be explained as follows. For linear models the variance of a prediction for a certain vector of predictor variable settings x_i is:
\sigma^2 x_i'(X'X)^{-1}x_i    (2.49)
where X is the augmented matrix of predictor variables (with quadratic terms, cross terms, etc.), also called the design matrix. Obviously the choice of the design will influence the variance of the predicted response variables. Another statistical criterion is the minimal variance-covariance matrix of the estimated parameters:

var(\hat{\beta}) = \sigma^2(X'X)^{-1}    (2.50)
The type of design that minimizes the maximum prediction variance (over all possible settings of the predictor variables) is called a G-optimal design. Many more criteria exist, each optimizing a quantity that depends on the design choice. The D-criterion minimizes the volume of the ellipsoid that gives a combined confidence region for all parameters; this D-criterion is used most often. A D-optimal design is achieved when |(X'X)^{-1}| is minimized. A G-optimal design is achieved when the maximum value of x_i'(X'X)^{-1}x_i over the design space is minimized. It will be clear that to obtain an optimal design the model must be known. For introductions to experimental design see references [81, 82, 83].
The General Equivalence Theorem states that the following three conditions on an optimal design \xi^* are equivalent:

1. \xi^* minimizes |(X'X)^{-1}|;
2. \xi^* minimizes the maximum of x_i'(X'X)^{-1}x_i for all x_i over the design space X;
3. the maximum of x_i'(X'X)^{-1}x_i over the design space is equal to p (the number of model parameters).

Here the design \xi^* consists of a set of support points together with a diagonal matrix containing the weights associated with each of the design points. The General Equivalence Theorem thus states that a D-optimal design is equivalent to a G-optimal design, and that the maximum of the G-criterion is equal to p. For a G-optimal design the maximum values are attained at the design support points.
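The D- and G-criteria can be compared numerically for a simple case (a minimal sketch in Python with NumPy; the one-factor quadratic model and the two candidate three-point designs are illustrative choices):

```python
import numpy as np

def design_matrix(x):
    """Augmented design matrix for a quadratic model: columns 1, x, x^2."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

def d_criterion(X):
    """|(X'X)^{-1}| -- smaller is better (D-optimality)."""
    return np.linalg.det(np.linalg.inv(X.T @ X))

def g_criterion(X, grid):
    """Max over the design space of x'(X'X)^{-1}x (eq. 2.49 without sigma^2)."""
    M_inv = np.linalg.inv(X.T @ X)
    F = design_matrix(grid)
    return np.max(np.einsum("ij,jk,ik->i", F, M_inv, F))

grid = np.linspace(-1.0, 1.0, 201)
good = design_matrix(np.array([-1.0, 0.0, 1.0]))  # well-spread support points
bad = design_matrix(np.array([-0.2, 0.0, 0.2]))   # clustered, poorly spread

# For the well-spread minimal design the maximum of x'(X'X)^{-1}x equals 1,
# i.e. p/n; normalizing the information matrix per observation (multiplying
# by n = 3) gives p = 3, in line with condition 3, and the maximum is
# attained at the support points -1, 0 and 1.
```

The well-spread design wins on both criteria simultaneously, which is the equivalence the theorem formalizes.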
Optimum experimental designs for nonlinear models
The main difference between optimum experimental designs for linear and nonlinear models is their dependence on the model parameter values. For linear models the optimum experimental design depends only on the type of model, whereas for nonlinear models the optimum design depends on the true parameter values. The circularity lies in the fact that the design is usually constructed precisely to be able to estimate these parameters. Therefore, to obtain optimum designs for nonlinear models a different strategy has to be followed.
The variance of a prediction for nonlinear models for a certain setting of the predictor variables becomes:

F_i'(F'F)^{-1}F_i    (2.51)
where F is the Jacobian matrix, which must be evaluated at the estimated parameters \hat{\beta}, and F_i is its ith row. The variance-covariance matrix for the estimated nonlinear model parameters is:

var(\hat{\beta}) = \sigma^2(F'F)^{-1}    (2.52)
As was described in a previous section of this chapter (see equation (2.13) and further), linearization of a nonlinear model by a Taylor expansion in the parameter space results in the following equation:

Y - f(X, \beta_0) = F\delta_0 + \varepsilon    (2.53)

in which \delta_0 is the deviation of the parameters from \beta_0. This linearization gives a linear regression model. A design that gives a minimum variance-covariance of \delta_0 will also give a minimum variance-covariance of \beta_0. Thus, the determinant of the matrix F'F has to be maximized. Atkinson [83] proves that for a minimal D-optimum design (minimal indicates that the number of design points equals the number of model parameters, so that F is a square matrix) maximizing |F'F| equals maximizing |F|.
To obtain a D-optimal design, initial values of the parameters have to be available. The D-optimal design is used to perform experiments, and from these experiments parameter values are estimated. If these values differ significantly from the initial parameter values, or if their accuracy is too low, the new values can be used in the search for a new optimal design.
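This locally optimal strategy can be sketched for a hypothetical one-parameter exponential decay model f(t, \theta) = e^{-\theta t}. The model, the initial guess \theta_0 and the candidate grid are assumptions for illustration; since there is one parameter, maximizing |F'F| reduces to maximizing |F|:

```python
import numpy as np

def jacobian(t, theta):
    """Jacobian of f(t, theta) = exp(-theta t) with respect to theta."""
    return (-t * np.exp(-theta * t)).reshape(-1, 1)

def d_value(t, theta):
    """det(F'F), to be maximized for a locally D-optimal design."""
    F = jacobian(np.atleast_1d(t), theta)
    return np.linalg.det(F.T @ F)

theta0 = 2.0                                  # initial parameter guess
candidates = np.linspace(0.01, 3.0, 300)      # candidate measurement times
best_t = candidates[np.argmax([d_value(t, theta0) for t in candidates])]
# Analytically the optimum is t = 1/theta0: the measurement is placed where
# the response is most sensitive to theta.
```

Note how the optimal measurement time depends on \theta_0 itself, which is exactly the circularity described above: after the experiment, a revised estimate of \theta would shift the optimal design point.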