
Chapter 2

STATISTICAL INTRODUCTION

Introduction

In general the modeling problem can be defined as estimating the relation between a set of predictor variables X and one or more response variables Y:

Y = f(X) + \varepsilon    (2.1)

in which ε contains all kinds of errors like sampling and measurement errors. The function f(X) is the conditional expectation

f(X) = E(Y \mid X)    (2.2)

and can be estimated by an approximation on a certain dataset (x_i, y_i), i = 1, \ldots, n. The estimation of the above-mentioned relation can be achieved

by deriving an equation that describes the physical or chemical mechanisms defining the process under study. In this case the overall form of the relationship is known and only the values of some parameters have to be estimated. When the functional relations are not (yet) known, these functions have to be approximated by an empirical model. The first objective in the modeling procedure is then to search for an appropriate class of functions of which the true function is expected to be a member. Next, a model is selected from this class of functions. The success of this procedure depends on the existence of a convenient class of functions and on a correct choice of that class. When a parametric regression model is used, one assumes that the true underlying function is a member of a very strict family of (parametric) models, whether linear or nonlinear.

Classes of models that are relatively more flexible are used in the so-called nonparametric regression methods. These flexible models adapt themselves to the data, which means that these methods determine an appropriate model solely from the data. Methods of this type can generate very complex models, so the risk of overfitting is considerably high.

This chapter focuses on different theoretical aspects of modeling. The order in which subjects are covered follows the application of these tools in the subsequent chapters of this thesis. Generally, the first objective in response surface modeling is the choice of an appropriate design, because this defines the limits on the information that can be obtained and therefore has a large influence on the modeling. The second part of the response surface modeling process is the choice of a certain class of models, from which the exact model is estimated. These two steps cannot be considered separately from each other: the type of model defines the choice of an optimal design, while the design choice puts restrictions on the type of model that can be used. Criteria will be described that contribute to the search for an appropriate model and to the evaluation of the predictive performance of the estimated model.

Parametric modeling

In regression analysis a linear model is most often used to describe the relation between variables. A linear model is defined by a linear summation of the predictor variables:

y = \beta_0 + \sum_i \beta_i z_i + \varepsilon    (2.3)

in which z_i may be all kinds of transformations and cross-products of the original predictor variables x_i, and β_0 and β_i are model parameters. Such a model is based on Taylor-series expansions. The set of linear parametric functions is a very flexible family and is often used when no theoretical model is available.

An example of a nonlinear parametric model is the Michaelis-Menten model

for enzyme kinetics:

y = \frac{\theta_1 x}{\theta_2 + x} + \varepsilon    (2.4)

where θ_1 and θ_2 are model parameters. With the appropriate transformation this model becomes linear:

\frac{1}{y} = \frac{1}{\theta_1} + \frac{\theta_2}{\theta_1}\,\frac{1}{x}    (2.5)

y' = \theta'_1 + \theta'_2 x'

A model like the Michaelis-Menten model is called intrinsically linear or transformably linear, because after the appropriate transformation of variables or parameters a linear model will result.
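To make this transformable linearity concrete, the following sketch fits the Michaelis-Menten parameters through the linearized form (2.5) by ordinary least squares on 1/y against 1/x. The substrate concentrations, true parameter values and noise level are invented for the illustration (Python/NumPy assumed).

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Michaelis-Menten data: y = theta1*x/(theta2 + x) + noise
theta1_true, theta2_true = 2.0, 0.5
x = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0])
y = theta1_true * x / (theta2_true + x) + rng.normal(0, 0.02, x.size)

# Linearized form (2.5): 1/y = 1/theta1 + (theta2/theta1) * (1/x)
X = np.column_stack([np.ones_like(x), 1.0 / x])   # design matrix [1, 1/x]
b, *_ = np.linalg.lstsq(X, 1.0 / y, rcond=None)   # b = [1/theta1, theta2/theta1]
theta1_lin = 1.0 / b[0]
theta2_lin = b[1] * theta1_lin
print("linearized estimates:", theta1_lin, theta2_lin)

Note that the reciprocal transformation also transforms the error structure, which is one reason why the nonlinear least squares treatment of the next section is often preferred.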


An example of an intrinsically nonlinear model is

y = \frac{\theta_0}{(1 + \exp[-(\theta_1 + \theta_2 x)/\theta_3])^{\theta_3}} + \varepsilon    (2.6)

The parameters are nonlinearly incorporated in the model, and no transformation

of parameters or variables will convert the model into a linear model.

A nonlinear model can be used when theory suggests such a model, when linear models show considerable lack-of-fit, or when nonlinear behavior can be detected in the data upon visual inspection [47].

Nonlinear parametric regression

The estimation of parameter values in nonlinear regression is much more complex than in linear regression. This section explains why.

In regression the search is for the least squares estimator of the parameters, which minimizes

S(\theta) = \sum_{i=1}^{n} (y_i - f(X_i;\theta))^2    (2.7)

This minimization is equivalent to setting the derivative of S with respect to each θ_j equal to zero:

\frac{\partial S(\theta)}{\partial \theta_j} = -2 \sum_{i=1}^{n} (y_i - f(X_i;\theta)) \left[\frac{\partial f(X_i;\theta)}{\partial \theta_j}\right]_{\theta=\hat{\theta}} = 0    (2.8)

In these equations X_i contains the values of the independent variables at design point i, while y_i contains the value of the dependent variable at design point i (only one dependent variable is considered). θ_j is the j-th element of the vector θ, which contains the model parameters. In this thesis the independent variables will also be called predictor variables and the dependent variable will also be called the response variable. In linear regression ∂f/∂θ_j = X_i and is thus independent of θ. For a linear model equation (2.8) becomes

\frac{\partial S(\theta)}{\partial \theta} = -2\, X^{T}(y - X\theta) = 0    (2.9)

and can easily be solved analytically. In the case of nonlinear regression an iterative method is needed to solve the equations.


One iterative method, the steepest descent method, is based on a gradient.

Writing equation (2.8) as a product of matrices gives:

\frac{\partial S(\theta)}{\partial \theta} = -2\,(y - f(X;\theta))\left[\frac{\partial f(X;\theta)}{\partial \theta}\right]    (2.10)

in which ∂S(θ)/∂θ is a vector of size 1×p (p is the number of model parameters). Define the following:

F. = \frac{\partial f(X;\theta)}{\partial \theta}    (2.11)

and

e = (y - f(X;\theta))    (2.12)

F. is an n×p matrix (n is the number of design points) and is often called the Jacobian matrix, e is a 1×n vector, and −F.'e is the gradient along which S(θ) increases; thus δ = F.'e is the direction of steepest descent. To calculate this gradient, initial estimates of the parameters θ are required. The new estimate for θ should be θ^1 = θ^0 + kδ. This step is repeated several times until the decrease in S(θ) for successive iterations falls below a certain limit. For the m-th iteration k should be chosen so that S(θ^m + kδ) < S(θ^m) (S = sum of squared errors) is fulfilled. The slow convergence of this method often leads to very long procedures.

The Gauss-Newton algorithm gives another iterative procedure. It uses a linearization based on a Taylor expansion in the parameter space to estimate parameter values. For the Taylor expansion of f(X_i;θ) around the point θ^0, if θ is close to θ^0, the following approximation holds:

f(X_i;\theta) \approx f(X_i;\theta^0) + \sum_{j=1}^{p} \left[\frac{\partial f(X_i;\theta)}{\partial \theta_j}\right]_{\theta=\theta^0} (\theta_j - \theta^0_j)    (2.13)

If we set

\delta^0_j = \theta_j - \theta^0_j    (2.14)

F^0_{ji} = \left[\frac{\partial f(X_i;\theta)}{\partial \theta_j}\right]_{\theta=\theta^0}    (2.15)

the following equation results:

y_i - f(X_i;\theta^0) = \sum_{j=1}^{p} \delta^0_j F^0_{ji} + \varepsilon_i    (2.16)

with "i = yi � f(Xi;�) Equation (2.16) is linear in the parameters and can easilybe solved for �0j with normal linear least squares theory. The resulting vector

�0j gives the search direction. The size of this vector can be adjusted to avoiddivergence or too slow convergence. We can put this Gauss-Newton algorithm in

Page 5: Ch apt er 2 · Ch apt er 2 ST A TISTICAL INTR ODUCTION In tro d u ct ion In gen eral t h emod elin g prob lem can b e d e n ed as est im a t in gt e rela ion b et w een a s et of

Statistical introduction 23

matrix form,while using the same notation, to compare it with the steepest descent

method:

e = F:� (2.17)

Solving this equation for � leads to:

� = (F:0F:)�1F:0e (2.18)

The Marquardt method gives a compromise between the Gauss-Newton and steepest descent directions:

\delta = (F.'F. + \lambda\,\mathrm{diag}(F.'F.))^{-1}F.'e    (2.19)

If λ → 0 the Marquardt direction is the Gauss-Newton direction, and if λ → ∞ the Marquardt direction approaches the direction of steepest descent. The parameter λ is normally adjusted based on the outcome of the sum of squared errors. If S(θ^m + δ) < S(θ^m), then λ = λ/ν (ν > 1) in the next iteration; in this case λ decreases and the search direction moves closer to the Gauss-Newton direction. If S(θ^m + δ) > S(θ^m), λ is increased (λ = λν, ν > 1) and the search direction approaches the direction of steepest descent. More detailed information on these and other iterative techniques can be found in refs. [47] and [48].
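A minimal sketch of the Gauss-Newton step (2.18) with Marquardt damping (2.19), applied to the Michaelis-Menten model of equation (2.4); the data, the starting values and the adjustment factor ν are arbitrary choices for the illustration, and the Jacobian F. is written out analytically.

import numpy as np

def f(x, th):                       # Michaelis-Menten model, eq (2.4)
    return th[0] * x / (th[1] + x)

def jacobian(x, th):                # F.: n x p matrix of first derivatives
    d1 = x / (th[1] + x)
    d2 = -th[0] * x / (th[1] + x) ** 2
    return np.column_stack([d1, d2])

def marquardt(x, y, th, lam=1e-3, nu=10.0, n_iter=50):
    S = np.sum((y - f(x, th)) ** 2)
    for _ in range(n_iter):
        F = jacobian(x, th)
        e = y - f(x, th)
        A = F.T @ F
        # eq (2.19): Marquardt compromise; lam -> 0 gives Gauss-Newton, eq (2.18)
        delta = np.linalg.solve(A + lam * np.diag(np.diag(A)), F.T @ e)
        S_new = np.sum((y - f(x, th + delta)) ** 2)
        if S_new < S:               # successful step: accept it, move toward Gauss-Newton
            th, S, lam = th + delta, S_new, lam / nu
        else:                       # failed step: increase lam, toward steepest descent
            lam *= nu
    return th

rng = np.random.default_rng(1)
x = np.linspace(0.1, 10, 30)
y = f(x, [2.0, 0.5]) + rng.normal(0, 0.02, x.size)
print(marquardt(x, y, np.array([1.0, 1.0])))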

All iterative procedures require initial estimates of the parameter values. These initial estimates may be based on certain experience or knowledge (e.g. from previous estimation of parameter values of equivalent models). Intelligent guesses by inspection of the shape of the response surface, as suggested by Ratkowsky [49], or the use of Genetic Algorithms for a first rough search of the entire feasible parameter space [37] are other options.

Bias-variance trade-off

Measurements are subject to error, and these random variations are the cause that a model can never exactly describe the fluctuations in the response variable. The prediction error may be divided into three different terms:

E(y - \hat{y})^2 = E(y - \mu + \mu - E(\hat{y}) + E(\hat{y}) - \hat{y})^2 = E(y - \mu)^2 + E(\mu - E(\hat{y}))^2 + E(E(\hat{y}) - \hat{y})^2    (2.20)

In this equation ŷ is the predicted value of y, while μ is the expected value of y. The term E(y − μ)² is the measurement error, while E(E(ŷ) − ŷ)² is the variance in ŷ. These two terms are usually taken together and called the variance component of the prediction error. The second term E(μ − E(ŷ))² is called the bias; this is the difference between the real value of y, μ, and the expectation of ŷ for the studied model [50]. More complex models will reduce the bias, while at the same time these models will increase the variance of the predictions (see figure 2.1). The lower boundary of the variance is the measurement error. Model selection has to do with the determination of the optimal bias-variance trade-off and thus with the estimation of model complexity. The performance of a model can be evaluated on certain criteria, depending on the objective of the modeling procedure.

Figure 2.1: The bias-variance trade-off
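The decomposition of equation (2.20) can be illustrated with a small simulation. The sketch below uses a hypothetical sine-shaped true function and polynomial models of increasing degree as a stand-in for "model complexity", and estimates the squared bias and the variance of the prediction at a single point from repeated simulated datasets.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
x0, sigma = 0.5, 0.3                      # prediction point and measurement error
mu = np.sin(2 * np.pi * x0)               # true expected value at x0

for degree in (1, 3, 9):                  # increasing model complexity
    preds = []
    for _ in range(500):                  # repeated simulated datasets
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, x.size)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias2 = (mu - preds.mean()) ** 2      # E(mu - E(yhat))^2
    var = preds.var()                     # E(E(yhat) - yhat)^2
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")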

Model selection

In modeling a certain set of data more than one model may seem appropriate. In that case these models have to be compared. In creating sets of possible models, two situations may be distinguished. The first situation arises when nested models compose the total set of possible models; in this case all possible models are part of a larger model and can be derived from it by setting certain parameters to zero. The second situation arises when non-nested models are compared. In this thesis only the first situation is considered. For nested models many selection strategies are available. In references [51, 52] a description is given of different methods for model selection. One example is the all subset regression strategy, where all possible combinations of the individual parameters of the overall model are evaluated and the best subset model is selected. A different approach is the forward selection procedure, where the starting point is a model with no terms and terms are subsequently added. The backward elimination procedure follows the reversed process, where the starting point is the most complex model and in each step one term is removed from this model. Combinations of the last two procedures are also possible, for instance stepwise regression. Duineveld [53] warns of a pitfall of model selection, data mining, which is more of a problem with all subset selection than with forward selection or backward elimination.

Criteria for model selection

Model selection can be based on the selection of significant parameters. The significance of parameters is determined by their confidence interval and thus by their standard deviation. In this subsection two variance estimation methods for the model parameters are described [54, 55, 56]. The first variance estimator is the asymptotic variance, which is based on a linearization of the model. This linearized model is then evaluated at the estimated parameter values, under the assumption that a large set of data is available (this assumption is why it is called the asymptotic variance). With the asymptotic variance, confidence intervals of the parameters θ can be estimated:

\hat{\theta} \sim N_p(\theta, \sigma^2 C^{-1}), \qquad C = F'F    (2.21)

F is a matrix containing the first order derivatives of the model equation with respect to the model parameters. C is evaluated at the estimated parameter values. The uncertainty in the asymptotic variance can be very large when there is much intrinsic curvature in the model [47]. Error heteroscedasticity may also influence the resulting values of the asymptotic variances.

The second variance estimator is the jackknife variance estimator σ²_J, a variance estimator which is based on resampling [55]:

\sigma_J = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\theta_{(i)} - \theta_{(\cdot)})^2}    (2.22)

In this equation θ_(i) represents the parameter values estimated without observation i, and θ_(·) is the average of all θ_(i). Both the asymptotic and the jackknife standard deviation follow a t-distribution. The jackknife variance is posed [57] to be independent of the error distribution and of intrinsic curvature.
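A sketch of the jackknife estimator of equation (2.22) for the Michaelis-Menten example of equation (2.4), using scipy's curve_fit for the repeated nonlinear fits; the data are again invented.

import numpy as np
from scipy.optimize import curve_fit

def f(x, th1, th2):                           # Michaelis-Menten model
    return th1 * x / (th2 + x)

rng = np.random.default_rng(3)
x = np.linspace(0.1, 10, 30)
y = f(x, 2.0, 0.5) + rng.normal(0, 0.02, x.size)

n = x.size
theta_i = np.empty((n, 2))
for i in range(n):                            # leave out observation i and refit
    keep = np.arange(n) != i
    theta_i[i], _ = curve_fit(f, x[keep], y[keep], p0=[1.0, 1.0])

theta_dot = theta_i.mean(axis=0)              # average of the leave-one-out estimates
s_jack = np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2, axis=0))  # eq (2.22)
print("jackknife standard deviations:", s_jack)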


            α0       α1       α2       β0        β1        β2        γ0           γ1       γ2
θ_i^true   1.531   -4.805    2.113   19.336   -11.371    11.635   1.203·10⁻⁴   -7.523    3.168
θ̄_i        1.524   -4.830    2.207   19.264   -11.309    11.486   1.186·10⁻⁴   -7.352    2.695
θ̄_i^J      1.524   -4.831    2.207   19.268   -11.311    11.490   1.193·10⁻⁴   -7.384    2.763

Table 2.1: Mean parameter values θ̄_i estimated on the total dataset and mean parameter values θ̄_i^J estimated in the jackknife procedure, compared to the true parameter values θ_i^true

To illustrate the performance of the two variance estimators, a nine-parameter model f(x;θ) was used that will also be used in Chapter 4. The simulated datasets were constructed according to the following model:

y = f(x;\theta) + \varepsilon

where

f(x;\theta) = \frac{\alpha_0 e^{\alpha_1\varphi + \alpha_2\varphi^2}\,\gamma_0 e^{\gamma_1\varphi + \gamma_2\varphi^2} + \beta_0 e^{\beta_1\varphi + \beta_2\varphi^2}\,[\mathrm{H}^+]}{\gamma_0 e^{\gamma_1\varphi + \gamma_2\varphi^2} + [\mathrm{H}^+]}    (2.23)

and

\varepsilon \sim N(0, \sigma^2 f(x;\theta)^2)

In this model φ and [H+] are the predictor variables. The parameter values used to generate the simulated data are based on results of fitting the same model to real data and can be found in the first row of Table 2.1 (θ_i^true). Errors were added to the generated data according to the above-mentioned model. A random number generator was used to create the normally distributed ε, while the scaling constant σ was given a value of 0.05. An estimate of the asymptotic variance s²_a and of the jackknife variance s²_J were calculated on 100 simulated datasets, each containing 30 observations. For each simulated dataset the pH (= −log[H+]) was varied over 6 different values (2-7), while the variable φ was varied over 5 different values (0-0.5). The average values (s̄_a and s̄_J) and the standard deviations s(s_a) and s(s_J) of the standard deviation estimators over the 100 different simulated datasets were calculated. The question is whether or not the use of the different variance estimators results in the same model; i.e., will both estimators cause the same model parameters to be significant?

In Table 2.2 the results of the simulation studies of the two model selection criteria, the asymptotic standard deviation s_a and the jackknife standard deviation s_J, are collected. The behavior of the asymptotic standard deviation was of special interest during the study, since it is easier to obtain than an estimate of the jackknife standard deviation: the asymptotic standard deviation is standard output of the SAS nonlinear least squares procedure NLIN [58].

          α0      α1      α2      β0       β1      β2      γ0          γ1       γ2
s̄_a      0.127   1.866   6.275   0.174   0.302   1.091   6.77·10⁻⁶   2.315    9.296
s̄_J      0.154   1.369   3.425   0.959   0.723   1.817   3.97·10⁻⁵   4.449   11.474
s(s_a)   0.057   0.844   2.834   0.079   0.136   0.495   3.35·10⁻⁶   1.029    4.226
s(s_J)   0.067   0.495   1.115   0.722   0.377   0.853   3.26·10⁻⁵   1.996    4.458

Table 2.2: Mean values and standard deviations of the asymptotic standard deviation s_a and the jackknife standard deviation s_J from simulated data

The table shows the average values of the asymptotic and jackknife standard deviations, together with the corresponding variation over the 100 simulations. It can clearly be seen (Table 2.2) that the jackknife and asymptotic methods behave differently for different parameters, but an interpretation of the results is difficult. No real difference can be seen between the two variance estimators (compare s̄_a and s̄_J), although for β0 and γ0 the jackknife standard deviation is much larger than the asymptotic standard deviation. For α1 and γ1 a problem arises, because there is no agreement between the two variance estimators concerning the significance of these parameters. Remarkable are the high values of the standard deviation for parameters α2 and γ2. The variation caused by the curvature that results from the inclusion of a quadratic term, with the given parameter values of the generating model, seems to be too small to be modeled. In Table 2.1 the means of the estimated parameter values are given. Except for γ2 these values are very similar to the true parameter values θ_i^true.

Another approach to model selection might be the application of an F-test, which assesses the gain in fit that is achieved by adding extra parameters when going from a partial model to the full model, when nested models are evaluated. In Table 2.3 the essential elements of the sum of squares analysis for model selection are summarized. In this table S is the residual sum of squares, ν the degrees of freedom and s² the mean sum of squares, while N is the number of observations in the dataset and P the number of parameters. The subscripts e, f and p refer to extra, full and partial. The partial model is accepted if the F-ratio is lower than the value of F(ν_e, ν_f; α). For nonlinear models the ratio of the sums of squares is only approximately F-distributed, as explained by Bates and Watts [48], but use of the F-test is appropriate in general.

Source             Sum of Squares     Degrees of Freedom   Mean Square        F Ratio
Extra parameters   S_e = S_p − S_f    ν_e = P_f − P_p      s²_e = S_e/ν_e     s²_e/s²_f
Full model         S_f                ν_f = N − P_f        s²_f = S_f/ν_f
Partial model      S_p                N − P_p

Table 2.3: Extra sum of squares analysis
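A sketch of the extra sum of squares test of Table 2.3 for two nested linear models (a straight line as the partial model, a quadratic as the full model) fitted to hypothetical data; scipy.stats supplies the critical value of the F-distribution.

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 25)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, x.size)

def rss(y, X):                                   # residual sum of squares of a linear fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

X_partial = np.column_stack([np.ones_like(x), x])           # P_p = 2 parameters
X_full = np.column_stack([np.ones_like(x), x, x ** 2])      # P_f = 3 parameters

N, Pp, Pf = x.size, X_partial.shape[1], X_full.shape[1]
Sp, Sf = rss(y, X_partial), rss(y, X_full)
nu_e, nu_f = Pf - Pp, N - Pf
F_ratio = ((Sp - Sf) / nu_e) / (Sf / nu_f)                  # s_e^2 / s_f^2
print("F =", F_ratio, " critical F =", f_dist.ppf(0.95, nu_e, nu_f))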

Yet another approach to model selection can be the use of cross-validation

criteria [59], which will be described in the next section.

Criteria for model validation and method evaluation

For the selection of an appropriate model, of the model complexity, or for the evaluation of a modeling method, all kinds of criteria are available. An example is the correlation coefficient R² = SSR/SST, in which SSR is the regression sum of squares and SST is the total sum of squares, corrected for the mean. R² is a measure of the percentage of the variation in the data that is explained by the regression model. Other examples are the mean squared error, Mallows' Cp [60] and the adjusted correlation coefficient R²_adjusted = 1 − [SSE/(n−q−1)]/[SST/(n−1)], which can only be used for linear models and where SSE is the sum of squared errors. All these criteria give an impression of how well a parametric model describes the data set.

For prediction purposes different criteria must be used. If an independent test set is available, the mean squared error of prediction over the test set can be used:

mSEP = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.24)

in which n is the number of observations in the test set and y_i and ŷ_i are, respectively, the experimental and predicted response values. When no test set is available a cross-validation criterion can be used. The dataset is split into a training set, on which a model is estimated, and a test set on which the model is evaluated. The splitting procedure is repeated until all observations have been in the test set once and only once. The leave-one-out cross-validation criterion mPRESS (mean of the predictive error sum of squares) is most often used [61, 62]:

mPRESS = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2    (2.25)


In this case the predicted response value ŷ_(i) is predicted with a model that was estimated on the dataset minus the i-th observation, while the test set contains only one observation. These two criteria can be used for parametric as well as for nonparametric flexible modeling methods. A variant of the mPRESS is the generalized cross-validation criterion [63], which contains a leverage correction term:

GCV = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2 \, \frac{1 - h_{ii}}{1 - \sum_{j=1}^{n} h_{jj}/n}    (2.26)

in which h_ii is the leverage of observation i. Not including a very influential observation (which has a high leverage) in the modeling will result in a bad leave-one-out prediction of that observation, and the cross-validation residual will also be very large. Therefore, the cross-validated residual is weighted with a factor proportional to its relative importance. The mPRESS and GCV criteria can also be used for model selection (see for instance the paragraphs on Multivariate Adaptive Regression Splines and artificial neural networks).
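As an illustration, the mPRESS criterion of equation (2.25) can be computed for a linear model by refitting n times; the data below are hypothetical.

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, x.size)
X = np.column_stack([np.ones_like(x), x])

n = x.size
press = 0.0
for i in range(n):                                   # leave-one-out cross-validation
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    y_pred_i = X[i] @ beta                           # prediction of the left-out point
    press += (y[i] - y_pred_i) ** 2
print("mPRESS =", press / n)                         # eq (2.25)

For linear models the same quantity can be obtained without refitting, since y_i − ŷ_(i) = e_i/(1 − h_ii), with h_ii the leverage that also appears in equation (2.26).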

Measurement error

In analytical chemistry the measurement errors are sometimes assumed to vary with the measured analytical response. For capacity factor values this is also true, as will be discussed in Chapter 5. For non-constant measurement errors, the coefficient of variation may be a more useful estimate of the measurement error than the standard deviation. This coefficient of variation, also called the relative standard deviation (RSD), is the standard deviation s expressed as a percentage of the mean:

RSD = \frac{s}{\bar{y}} \times 100\%    (2.27)

Suppose that 2 sets of replicated experiments are collected for which the following assumptions are true:

Y_1 \sim N(\mu_1, \sigma_1^2)
Y_2 \sim N(\mu_2, \sigma_2^2)    (2.28)

and

\frac{\sigma_1}{\mu_1} = \frac{\sigma_2}{\mu_2}

Under these assumptions the RSD can be pooled over different sets of replicated experiments. The pooled RSD (pooled over n sets of replicated experiments) is defined by:

RSD^2_{pooled} = \frac{\sum df_i\, RSD_i^2}{\sum df_i}    (2.29)

where df_i is the number of degrees of freedom for the i-th set of replicated measurements. For capacity factors the assumptions of equation (2.29) are approximately


true for capacity factor values between 1 and 10. Small capacity factor values have

a measurement error that is approximately constant, which results in increasing

values for the relative standard deviation. In a recent paper by Rocke and Lorenzato [64] a model for two types of measurement error (constant and multiplicative)

is proposed. It describes a method to estimate the measurement error due to these

two types of error over the entire range of possible response values.
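A small sketch of equations (2.27) and (2.29) for a few hypothetical sets of replicated measurements:

import numpy as np

# Hypothetical replicate sets (e.g. repeated capacity factor measurements)
replicate_sets = [np.array([2.01, 2.08, 1.95, 2.05]),
                  np.array([5.1, 5.4, 5.0]),
                  np.array([8.9, 9.3, 9.1, 8.8, 9.0])]

rsd = [r.std(ddof=1) / r.mean() * 100 for r in replicate_sets]    # eq (2.27), in %
df = [r.size - 1 for r in replicate_sets]                         # degrees of freedom
rsd_pooled = np.sqrt(sum(d * s ** 2 for d, s in zip(df, rsd)) / sum(df))   # eq (2.29)
print("individual RSDs:", rsd, " pooled RSD:", rsd_pooled)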

Heteroscedasticity

Heteroscedasticity means nonconstant variance [65] and is an important phenomenon that should be taken into account in modeling, because of its influence on model validation and on the construction of (optimal) designs. In figure 2.2 an example of a response with a heteroscedastic variance structure is given. The underlying relation between the response y and the predictor variable for this example is:

y = \exp(x/4) + \varepsilon    (2.30)

The standard deviation σ of ε, which is normally distributed, depends on the value of the systematic part of y. For the example in figure 2.2 the value of σ was taken to be proportional to the response, σ_y = σ_0 exp(x/4), and the value of σ_0 was set to 0.1. When replicated response values are generated for different settings of the x-variable, a funnel-shaped structure is the result. As will be discussed in the next few subsections, a logarithmic transformation of the response values can transform this type of heterogeneous error (relative error) into a homogeneous error (see figure 2.3). Not all observations have the same influence on the estimation of the model parameters. Observations with a large influence on the model are expected to give small residuals. Given a homogeneous error structure (var(ε) = σ²) it can be proven [47, 66] that the resulting residuals may be nonhomogeneous (var(e) = (I − H)σ²), in which I is the identity matrix and H is the hat matrix containing the leverage values. To detect variance heterogeneity, Cook and Weisberg [67] suggested the use of leverage corrected residuals, the so-called studentized residuals. When leverage corrected residuals (residuals e_i divided by √(1 − h_ii)) are used and they reveal a funnel-shaped structure, this can no longer be due to the experimental design. This implies that the funnel shape is caused by a variance which is not constant over the entire response range.

Figure 2.2: Heteroscedastic variance

Figure 2.3: Illustration of the effect of a logarithmic transformation on the error distribution

If the variance is non-constant, the quality of information about the response in regions where the variance is large is inferior to the quality of information in regions where the variance is small [65, 68]. Thus, the heterogeneity of the variance is an influential factor in the modeling process.

There are two categories of methods for dealing with variance heterogeneity. The first category contains methods based on weighted regression and weighted least squares, while the second category contains methods based on data transformations.

Weighted least squares

The most obvious method that can be used to correct for a heterogeneous variance

structure is Weighted Least Squares (WLS) [68]. One way to perform WLS is


using replicate experiments to estimate the variance structure. The inverse of the,

possibly smoothed, error variances are the weights to be used in Weighted Least

Squares. To get a reasonably accurate estimation of these variances the number of

replicates should be large. In some cases this would cost too much experimental

effort. Fortunately there are ways to overcome this problem. For instance, a good

alternative to WLS is a generalized version of WLS, Generalized Least Squares.

Another alternative to WLS is the application of transformations on the response

variables.

Generalized least squares

For responses with a heterogeneous error structure a general model can be constructed [65]:

y = f(x;\theta) + \sigma\, g(\mu, z, \lambda)\,\varepsilon    (2.31)

in which " is random, with a normal distribution with mean 0 and a standarddeviation 1, and � is a scaling constant. f(x;�) describes the functional relationbetween the regressors x and the dependent variable y; � being the set of para-meters de�ning this relation. In the case of modeling capacity factors, y is equal tothe capacity factor k, and x consists of the variables pH and the modi�er content

of the mobile phase. g2(�; z;�) is the variance function, where z is a set of in-dependent variables, which may contain (part of) x, � the mean of the functionalresponse value f(x;�) for certain values of x, and � is a set of parameters for thevariance function. The variance function may be known completely, when para-meter values are known, or partially unknown, but at least the overall shape of the

function needs to be de�ned. When the variance function is partially unknown,the values of � have to be estimated. In the case where y is the capacity factor,it is reasonable to assume that the variance function looks like

g^2(\mu, z, \lambda) = f(x;\theta)^{2\lambda}    (2.32)

This function is now defined as related to the theoretical model for the capacity factor, represented by f(x;θ). The corresponding errors are σ f(x;θ)^λ ε. If λ = 0, the variance of y is homogeneous, while if λ = 1 the standard deviation is proportional to the values of y. In the latter case, a logarithmic transformation of the response y before modeling would be appropriate.

To model this variance function both σ and λ need to be estimated, while for modeling the functional relation also θ needs to be estimated. The estimation of the parameter values of the functional relationship and the estimation of the variance function can be implemented as an iterative process using the following algorithm [65, 68]:

- First, θ of f(x;θ) is estimated using unweighted nonlinear regression.


- Subsequently, λ of g(μ, z, λ) is estimated using nonlinear regression on the residuals that result from the first step. In the case of modeling capacity factors, λ is estimated.

- Then θ is estimated again using weighted nonlinear regression, for which the weights are given by w = 1/g²(μ, z, λ) = 1/f^{2λ}(x;θ). The λ used to calculate these weights results from the previous iteration step.

These three steps are usually repeated until convergence, or for a predefined

number of times assuming that after this number of times convergence has been

reached. A discussion on the choice of the number of cycles in the iteration can

be found in Carroll and Ruppert [68].
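A sketch of the three-step scheme above for the variance function of equation (2.32). The Michaelis-Menten model is used as a stand-in for f(x;θ), scipy's curve_fit performs the (weighted) nonlinear fits, and a simple regression of the log absolute residuals on log f is used as a stand-in for the nonlinear estimation of λ; the data and the fixed number of cycles are invented.

import numpy as np
from scipy.optimize import curve_fit

def f(x, th1, th2):                              # stand-in model for f(x; theta)
    return th1 * x / (th2 + x)

rng = np.random.default_rng(6)
x = np.linspace(0.1, 10, 60)
mu = f(x, 2.0, 0.5)
y = mu + 0.05 * mu * rng.normal(size=x.size)     # relative error, i.e. lambda = 1

theta = np.array([1.0, 1.0])
lam = 0.0
for _ in range(3):                               # a fixed, small number of cycles
    # weighted fit with weights w = 1 / f^(2*lam); the first pass (lam = 0) is unweighted
    sigma_i = np.maximum(f(x, *theta), 1e-12) ** lam
    theta, _ = curve_fit(f, x, y, p0=theta, sigma=sigma_i)
    # estimate lambda from the residuals: log|e_i| ~ const + lam * log f_i
    e = y - f(x, *theta)
    A = np.column_stack([np.ones_like(x), np.log(np.maximum(f(x, *theta), 1e-12))])
    coef, *_ = np.linalg.lstsq(A, np.log(np.abs(e) + 1e-12), rcond=None)
    lam = coef[1]
print("theta =", theta, " lambda =", lam)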

Transformations

To obtain a homogeneous variance a transformation can be applied to the response variable as well as to the predictor variables. Another reason for applying a transformation can be to obtain a less complex model (for instance linearization). When the structure of the variance heterogeneity is exactly known, a transformation of the response and predictor variables can be chosen to make the variance constant. Examples of transformation functions are the quadratic, the logarithmic or the inverse transform. Models can be constructed on evidence or knowledge of the functional form of an underlying chemical or physical relation. To maintain these models, together with a transformation of the response the model describing the relation is also transformed by the same function. This procedure is called Transform Both Sides.

When no information is available on the structure of the variance, the appropriate transformation has to be estimated. The Box-Cox power transformation family covers the whole scale of TBS procedures, as long as the transformation is not complicated by the need for transformations of the predictor variables:

h(v; \lambda) = \begin{cases} (v^{\lambda} - 1)/\lambda & \text{if } \lambda \neq 0 \\ \ln(v) & \text{if } \lambda = 0 \end{cases}    (2.33)

In figure 2.3 the results of the logarithmic transformation of the response variable y of equation (2.30) are presented. The variation in the response values compared to their mean is now the same for all settings of the predictor variables, and the variance is called homoscedastic. As a consequence of the logarithmic transformation the relation between the response and the predictor variable has become linear, which is an extra advantage.
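A sketch of the Box-Cox family of equation (2.33), applied as a transform-both-sides fit of the example model (2.30) with λ = 0; the heteroscedastic data are generated as in the text, with invented settings.

import numpy as np

def boxcox(v, lam):                        # eq (2.33)
    return np.log(v) if lam == 0 else (v ** lam - 1) / lam

rng = np.random.default_rng(7)
x = np.linspace(0, 8, 40)
mu = np.exp(x / 4)                         # systematic part of eq (2.30)
y = mu * (1 + 0.1 * rng.normal(size=x.size))   # relative (funnel-shaped) error

# Transform Both Sides with lambda = 0: log(y) = x/4 + error becomes linear
# and homoscedastic, so an ordinary straight-line fit is appropriate.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, boxcox(y, 0), rcond=None)
print("intercept and slope of the log-transformed model:", beta)   # slope close to 0.25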


Figure 2.4: Estimation of a knot position by MARS

Nonparametric methods

As was stated in the Introduction section of this chapter, the modeling problem can be defined as estimating the relation between a set of predictor and response variables:

f(X) = E(Y \mid X)    (2.34)

When flexible models are used to estimate this relation, most often one of two types of techniques is used: smoothers or regression splines [69]. These methods use local models after splitting the design space. Regression splines use piecewise polynomials on fixed intervals of the design region. The splitting points are called knot locations. Smoothers, on the other hand, use different intervals for every datapoint. This interval is oriented as a window around the specific datapoint, and a smoothed estimate of the datapoint is obtained. Examples of smoothers are the local averaging method, spline smoothing and kernel smoothing. These methods require a relatively dense distribution of data in the predictor variable space. In the next few subsections three different flexible modeling methods are described.

Multivariate Adaptive Regression Splines

The main philosophy behind Multivariate Adaptive Regression Splines (MARS) is that predictor variables may contribute more to the variation in the response in one region than in others. MARS splits the design space and in each subregion a linear or cubic spline is fitted to the data. The subregions are defined by splitting points, also called knot locations.

Figure 2.5: Illustration of the backward elimination step in MARS: two basisfunctions (dashed lines) are replaced by a single basisfunction (solid line).

Each step in the MARS algorithm is made up of three parts. First the variable has to be selected on which the split is made by placing a knot. Then the exact place of the knot somewhere in the variable region has to be chosen, and finally a spline basisfunction has to be fitted in the regions on both sides of the knot (see figure 2.4). Each region may be split again, while the parent regions remain incorporated in the overall model and available for composing new split regions.

Successive splits can be performed on all earlier estimated basisfunctions, while old basisfunctions are retained after splitting. The basisfunctions B_m are tensor products of univariate spline functions (each univariate spline function involves only one predictor and one knot location):

B^q_m(x) = \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]^q_+    (2.35)

in which q is the order of the spline (q = 3 gives a cubic spline), t is the knot location, and the subscript + indicates that only the part of the function that gives positive values is evaluated. The parameter s can take the values + or −, and thus defines which part of the design region becomes positive. Together with the + subscript of the function, this parameter s defines which region of the splitting procedure is evaluated. K_m denotes the splitting degree, i.e. the number of splits that compose the basisfunction, and x_v is the predictor variable to which the split is applied.

For the example in figure 2.4, if s = −1 the part of the x-region that is smaller than t (the knot position) is evaluated and a linear basisfunction (q = 1) is fitted to the data. For s = +1 the part of the x-region larger than t is evaluated and another basisfunction is created. The splitting degree K_m is 1.


The overall model function is a weighted summation of all basisfunctions:

f(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]^q_+    (2.36)

The regression coefficients a_m are estimated once the basisfunctions are calculated.

Decisions about model complexity are made on a lack-of-fit criterion (which is a type of generalized cross-validation criterion):

GCV(M) = \frac{\frac{1}{N}\sum_{i=1}^{N} [y_i - f_M(x_i)]^2}{[1 - C(M)/N]^2}    (2.37)

which contains a penalty for the number of model terms M (basisfunctions). After the generation of a large number of basisfunctions, superfluous model terms are eliminated by a one-at-a-time backward stepwise procedure (see figure 2.5). See the original research paper by Friedman [70] for more details, and [69] and [71] for short introductions.
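A sketch of a single MARS basisfunction of equation (2.35) and of the selection of one knot location by exhaustive search with the GCV-like criterion of equation (2.37), for a hypothetical one-dimensional dataset with linear splines (q = 1); the cost term C(M) used here is an assumption of the example, not the exact penalty of the MARS algorithm.

import numpy as np

def basis(x, s, t, q=1):                     # one univariate term of eq (2.35)
    return np.clip(s * (x - t), 0, None) ** q

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 1, 40))
y = np.where(x < 0.4, 0.2, 2.0 * (x - 0.4) + 0.2) + rng.normal(0, 0.05, x.size)

def gcv(y, yhat, n_terms, d=3.0):            # eq (2.37); C(M) = n_terms*(d+1) is an
    N = y.size                               # assumed, commonly used penalty form
    C = n_terms * (d + 1)
    return np.mean((y - yhat) ** 2) / (1 - C / N) ** 2

best = None
for t in x[1:-1]:                            # candidate knot at every interior point
    B = np.column_stack([np.ones_like(x), basis(x, +1, t), basis(x, -1, t)])
    a, *_ = np.linalg.lstsq(B, y, rcond=None)
    score = gcv(y, B @ a, n_terms=2)
    if best is None or score < best[0]:
        best = (score, t)
print("selected knot location:", best[1])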

Projection Pursuit Regression

Projection Pursuit in general is based on the projection of high dimensional data onto a lower dimensional space. This projection technique can be integrated as part of all kinds of data-analytical methods [72].

Projection Pursuit Regression (PPR) was developed by Friedman and Stuetzle [73]. The method is based on the iterative estimation of linear combinations of the original predictor variables (the lower dimensional projections) and the corresponding smooth functions that describe the relation between the projection and the response. The model can be written as:

f(X) = \sum_{m=1}^{M} S_{\alpha_m}(\alpha_m \cdot X)    (2.38)

α_m is a 1×n vector and α_m · X gives a linear combination of the predictor variables. To this projection a smooth S_{α_m} is applied. M is the total number of projections and smooths needed to be able to describe the variation in a response variable y.

An iterative algorithm defines the exact estimation of the model terms. For a given linear combination Z = α_m · X of the predictor variables X a smooth S_{α_m}(Z) has to be constructed. The fraction of unexplained variance that is explained by adding the term S_{α_m}(Z) can be used as a fit criterion:

I(\alpha_m) = 1 - \sum_{i=1}^{n} (y_i - S_{\alpha_m}(\alpha_m \cdot X_i))^2 \Big/ \sum_{i=1}^{n} y_i^2    (2.39)


The vector α_m and the corresponding smooth S_{α_m} that maximize I(α_m) are selected, and the process is terminated when I(α_m) is smaller than a user-specified threshold.

The M projections and smooths are produced in a stepwise procedure. As long as I(α_m) is not smaller than a threshold value, residuals are calculated:

r_i = y_i - S_{\alpha_m}(\alpha_m \cdot X_i)    (2.40)

These residuals are then used as the response values in the next step of the stepwise procedure. The smooth can be estimated by any kind of smoothing procedure.
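A sketch of one step of this procedure for two predictor variables: the projection direction α is parameterized by an angle and found by a coarse grid search, and a crude running-mean smoother stands in for S_α; all data and settings are hypothetical.

import numpy as np

def running_mean_smooth(z, y, width=7):          # crude smoother as a stand-in for S
    order = np.argsort(z)
    ys = np.convolve(y[order], np.ones(width) / width, mode="same")
    out = np.empty_like(ys)
    out[order] = ys
    return out

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, (200, 2))
y = np.sin(2 * (0.8 * X[:, 0] + 0.6 * X[:, 1])) + rng.normal(0, 0.1, 200)

best = None
for phi in np.linspace(0, np.pi, 90):            # grid over projection directions
    alpha = np.array([np.cos(phi), np.sin(phi)])
    z = X @ alpha                                # linear combination alpha . X
    s = running_mean_smooth(z, y)
    I = 1 - np.sum((y - s) ** 2) / np.sum(y ** 2)    # fit criterion, eq (2.39)
    if best is None or I > best[0]:
        best = (I, alpha)
print("I =", best[0], " direction =", best[1])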

Neural Networks

Artificial neural networks are said to mimic the signal processing in the brain. Modeling is one of the application areas of neural networks. Neural networks are the most extensively used flexible modeling methods in chemometrics at this moment. All kinds of neural networks are available, each suited for specific kinds of problems (see for instance references [74, 75] and [76]). A feedforward backpropagation network was used in this thesis and will be described here. A feedforward network passes the signals in one direction only (forward). The building blocks of a feedforward backpropagation network are units (also called neurons or nodes), usually organized in three layers: an input layer, a hidden layer and an output layer. The units of the input layer contain the input signals, the units of the output layer contain the response values. The hidden layer provides the possibility to model nonlinear relationships.

Figure 2.6: Topology of a feedforward neural network

The pattern of network connections is called the topology of the network and defines the actual signal processing. One unit has multiple inputs, but only one output. A transformation function of the weighted sum of all the input signals produces the output of a unit. This transformation function can be a linear function, a threshold function or a nonlinear function, e.g. a sigmoid function [77]. The output of a unit is given by:

Z_j = \sum_i w_{ji} X_i    (2.41)

A sigmoidal transfer function is applied to this output:

f(Z_j) = \frac{1}{1 + e^{-(Z_j + \theta_j)}}    (2.42)

in which Z_j is the weighted sum of the input signals X_i and f(Z_j) is the transfer function for unit j. θ_j is a bias parameter which defines the inflection point of the sigmoidal curve described by the transfer function; in practice it is often estimated as an extra weight to an extra input with a fixed value of 1.

At the start the weights contain random values, and during the modeling procedure (training) the weights are adjusted. The weights can be seen as the modeling parameters. The estimation of the weights when estimating a relation between input and output is called supervised learning. In an iterative procedure the weights are adjusted by comparing the produced output with the target output. To adjust the weights a so-called learning rule is applied. In this study the delta rule was used, by which the errors are used to adjust the weights (backpropagation):

\Delta w_{ij} = \eta\, \delta_j\, y_i    (2.43)

Δw_ij is the change in the weight between unit i of one layer and unit j of the next layer, η is the learning rate and y_i is the output of unit i. This delta rule is based

on a minimization of the sum of squared errors:

S = 0.5 \sum_j (\hat{y}_j - y_j)^2    (2.44)

where ŷ_j is the predicted and y_j the target output of output unit j. The error correction term δ_j for output unit j that results from this minimization is:

\delta_j = (y_j - \hat{y}_j)\, \hat{y}_j (1 - \hat{y}_j)    (2.45)

For a unit j of the hidden layer the error correction term is calculated differently, because no estimate of the target output is available:

\delta_j = y_j (1 - y_j) \sum_k \delta_k w_{jk}    (2.46)


in which k represents the output layer.
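A minimal sketch of equations (2.41) to (2.46) for a network with one hidden layer and a single output unit, trained with the delta rule on a hypothetical one-dimensional function; the learning rate, the number of hidden units and the number of iterations are arbitrary choices.

import numpy as np

rng = np.random.default_rng(10)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = (np.sin(2 * np.pi * X) + 1) / 4 + 0.25          # target scaled into (0, 1)

n_hidden, eta = 5, 0.5
W1 = rng.normal(0, 0.5, (1 + 1, n_hidden))          # weights, incl. bias input of 1
W2 = rng.normal(0, 0.5, (n_hidden + 1, 1))

def sigmoid(z):                                     # transfer function, eq (2.42)
    return 1.0 / (1.0 + np.exp(-z))

Xb = np.column_stack([X, np.ones(len(X))])          # bias handled as extra input
for _ in range(20000):
    hidden = sigmoid(Xb @ W1)                       # eq (2.41) plus (2.42)
    hb = np.column_stack([hidden, np.ones(len(X))])
    out = sigmoid(hb @ W2)
    delta_out = (y - out) * out * (1 - out)         # eq (2.45)
    delta_hid = hidden * (1 - hidden) * (delta_out @ W2[:-1].T)   # eq (2.46)
    W2 += eta * hb.T @ delta_out / len(X)           # delta rule, eq (2.43)
    W1 += eta * Xb.T @ delta_hid / len(X)
print("sum of squared errors (last iteration):", 0.5 * np.sum((y - out) ** 2))  # eq (2.44)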

Yasui [78] developed an algorithm, called parametric lateral inhibition, to estimate the optimal number of hidden neurons. In fact this algorithm eliminates redundant network connections. The weights are adjusted by an extra correction term:

\Delta w'_{ij} = \Delta w_{ij} + \Delta V_{ij}    (2.47)

where ΔV_ij is the extra term that provides for the elimination of redundant connections:

\Delta V_{ij} = -\epsilon\, \mathrm{sgn}(w_{ij}) \left[ \left( \sum_{i=1}^{n} \sum_{j=1}^{n} |w_{ij}| \right) - |w_{ij}| \right]    (2.48)

where ε is a constant. Equation (2.48) shows that connections to a unit decrease the weights of other connections. The growth of (the weights of) certain connections thus eliminates other connections.

For applications of feedforward backpropagation neural networks in analytical chemistry, references [79, 80] are good examples.


Optimum experimental design

In the field of optimum experimental design one is engaged in the construction of designs that give the best performance on a certain statistical criterion. One such criterion can be that the expected maximum variance of a prediction should be minimized. The design choice has an influence on this variance, which can be explained as follows. For linear models the variance of a prediction for a certain vector of predictor variable settings x_i is:

\sigma^2\, x_i'(X'X)^{-1}x_i    (2.49)

where X is the augmented matrix of predictor variables (with quadratic terms, cross terms, etc.), also called the design matrix. It will be obvious that the choice of the design influences the variance of the predicted response variables. Another statistical criterion is the minimal variance-covariance matrix of the estimated parameters:

var(\hat{\theta}) = \sigma^2 (X'X)^{-1}    (2.50)

The type of design which minimizes the maximum (over all possible settings of the predictor variables) prediction variance is called a G-optimum design. Many more criteria exist, each optimizing a criterion that depends on the design choice. The D-criterion minimizes the volume of the ellipsoid that gives a combined confidence region for all parameters. This D-criterion is used most often. A D-optimal design is achieved when |(X'X)^{-1}| is minimized. A G-optimal design is achieved when the maximum value of x_i'(X'X)^{-1}x_i over the design space is minimized. It will be clear that to obtain an optimal design the model must be known. For introductions to experimental design see references [81, 82, 83].

The General Equivalence Theorem states that the following three conditions on an optimal design ξ* are equivalent:

1. ξ* minimizes |(X'X)^{-1}|;
2. ξ* minimizes the maximum of x_i'(X'X)^{-1}x_i for all x_i over the design space X;
3. the maximum of x_i'(X'X)^{-1}x_i over the design space is equal to p (the number of model parameters).

Here the design ξ* is represented by the diagonal matrix containing the weights associated with each of the design points. This General Equivalence Theorem states that a D-optimal design is equivalent to a G-optimal design and that the maximum of the G-criterion is equal to p. For a G-optimal design the maximum values are attained at the design support points.
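A sketch that compares two invented candidate designs for a quadratic model in one variable on the D- and G-criteria; the G-criterion is evaluated on a grid over the design space and scaled by the number of design points so that, per the General Equivalence Theorem, its maximum equals p for a D-optimal design.

import numpy as np

def design_matrix(x):                      # augmented matrix for y = b0 + b1*x + b2*x^2
    return np.column_stack([np.ones_like(x), x, x ** 2])

def d_criterion(x):                        # |(X'X)^{-1}|, to be minimized
    X = design_matrix(x)
    return np.linalg.det(np.linalg.inv(X.T @ X))

def g_criterion(x, grid=np.linspace(-1, 1, 201)):
    # N * max over the grid of x'(X'X)^{-1}x, which equals p for a D-/G-optimal design
    X = design_matrix(x)
    M_inv = np.linalg.inv(X.T @ X / len(x))
    G = design_matrix(grid)
    return np.max(np.einsum("ij,jk,ik->i", G, M_inv, G))

design_a = np.array([-1.0, 0.0, 1.0] * 2)           # replicated 3-point design
design_b = np.linspace(-1, 1, 6)                    # equally spaced design
for name, d in (("a", design_a), ("b", design_b)):
    print(name, " D-criterion:", d_criterion(d), " max G:", g_criterion(d))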


Optimum experimental designs for nonlinear models

The main difference between optimum experimental designs for linear and nonlinear models is their dependence on the model parameter values. For linear models the estimation of an optimum experimental design depends only on the type of model. For nonlinear models the optimum design depends on the true parameter values. The contradiction lies in the fact that the design usually is constructed to be able to estimate these parameters. Therefore, to obtain optimum designs for nonlinear models a different strategy has to be followed.

The variance of a prediction for nonlinear models, for a certain setting of the predictor variables, becomes:

F_i'(F'F)^{-1}F_i    (2.51)

where F is the Jacobian matrix, which must be evaluated at the estimated parameters θ̂. The variance-covariance matrix for the estimated nonlinear model parameters is:

var(\hat{\theta}) = \sigma^2 (F'F)^{-1}    (2.52)

As was described in a previous section of this chapter (see equation (2.13) and further), the linearization of a nonlinear model by a Taylor expansion in the parameter space results in the following equation:

y - f(X;\theta^0) = F^0\delta^0 + \varepsilon    (2.53)

This linearization gives a linear regression model. A design that gives a minimum variance-covariance of δ^0 will also give a minimum variance-covariance of θ. Thus, the determinant of the matrix F'F has to be maximized. Atkinson [83] proves that for a minimal D-optimum design (minimal indicates that the number of design points equals the number of model parameters, so that F is a square matrix) maximizing |F'F| equals maximizing |F|.

To obtain a D-optimal design, initial values of the parameters have to be available. The D-optimal design is used to perform experiments, and from these experiments parameter values are estimated. If these values are significantly different from the initial parameter values, or if their accuracy is too low, the new values can be used in the search for a new optimal design.
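As a closing sketch, a locally D-optimal two-point design for the Michaelis-Menten model of equation (2.4) can be found by maximizing |F'F| over pairs of candidate concentrations; the initial parameter guesses and the concentration range are assumptions of the example.

import numpy as np
from itertools import combinations

def jacobian(x, th):                            # F for the Michaelis-Menten model
    return np.column_stack([x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2])

theta0 = np.array([2.0, 0.5])                   # initial parameter guesses
candidates = np.linspace(0.05, 10.0, 200)       # assumed feasible concentration range

best = None
for pair in combinations(candidates, 2):        # minimal design: 2 points, 2 parameters
    F = jacobian(np.array(pair), theta0)
    crit = abs(np.linalg.det(F.T @ F))          # maximize |F'F|
    if best is None or crit > best[0]:
        best = (crit, pair)
print("locally D-optimal design points:", best[1])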