
Introduction to Nonlinear Regression

Andreas Ruckstuhl∗

IDP Institute of Data Analysis and Process Design
ZHAW Zurich University of Applied Sciences in Winterthur

2020†

Contents

1. Estimation and Standard Inference
   1.1. The Nonlinear Regression Model
   1.2. Model Fitting Using an Iterative Algorithm
   1.3. Inference Based on Linear Approximations

2. Improved Inference and Its Visualisation
   2.1. Likelihood Based Inference
   2.2. Profile t-Plot and Profile Traces
   2.3. Parameter Transformations
   2.4. Bootstrap

3. Prediction and Calibration
   3.1. Prediction
   3.2. Calibration

4. Closing Comments

A. Appendix
   A.1. The Gauss-Newton Method

∗ E-Mail Address: [email protected]; Internet: http://www.idp.zhaw.ch
† The author thanks Werner Stahel for his valuable comments and Amanda Strong and Lukas Meier for their help in translating the original German version into English.


Goals

The nonlinear regression model block in the Weiterbildungslehrgang (WBL) in angewandter Statistik at the ETH Zurich should

1. introduce problems that are relevant to the fitting of nonlinear regression functions,

2. present graphical representations for assessing the quality of approximate confidence intervals, and

3. introduce some parts of the statistics software R that can help with solving concrete problems.


1 Estimation and Standard Inference

1.1 The Nonlinear Regression Model

a The Regression Model. Regression studies the relationship between a variable of interest Y and one or more explanatory or predictor variables x^(j). The general model is
$$Y_i = h\langle x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)};\ \theta_1, \theta_2, \dots, \theta_p \rangle + E_i .$$
Here, h is an appropriate function that depends on the predictor variables and the parameters, which we combine into the vectors $x_i = [x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)}]^T$ and $\theta = [\theta_1, \theta_2, \dots, \theta_p]^T$. We assume that the errors are all normally distributed and independent, i.e.
$$E_i \sim \mathcal{N}\langle 0, \sigma^2 \rangle , \quad \text{independent.}$$

b The Linear Regression Model. In (multiple) linear regression, we considered functions h that are linear in the parameters θj,
$$h\langle x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)};\ \theta_1, \theta_2, \dots, \theta_p \rangle = \theta_1 \widetilde{x}_i^{(1)} + \theta_2 \widetilde{x}_i^{(2)} + \dots + \theta_p \widetilde{x}_i^{(p)} ,$$
where the $\widetilde{x}^{(j)}$ can be arbitrary functions of the original explanatory variables x^(j). There the parameters were usually denoted by βj instead of θj.

c The Nonlinear Regression Model. In nonlinear regression, we use functions h that are not linear in the parameters. Often, such a function is derived from theory. In principle, there are unlimited possibilities for describing the deterministic part of the model. As we will see, this flexibility often means a greater effort to make statistical statements.

Example d Puromycin. The speed of an enzymatic reaction depends on the concentration of a substrate. As outlined in Bates and Watts (1988), an experiment was performed to examine how a treatment of the enzyme with an additional substance called Puromycin influences the reaction speed. The initial speed of the reaction is chosen as the response variable, which is measured via radioactivity (the unit of the response variable is count/min²; the number of registrations on a Geiger counter per time period measures the quantity of the substance, and the reaction speed is proportional to the change per time unit).
The relationship of the variable of interest with the substrate concentration x (in ppm) is described by the Michaelis-Menten function
$$h\langle x; \theta \rangle = \frac{\theta_1 x}{\theta_2 + x} .$$


Figure 1.1.d.: Puromycin. (a) Data (• treated enzyme; △ untreated enzyme), plotted as velocity against concentration, and (b) typical shape of the regression function.

An infinitely large substrate concentration (x → ∞) leads to the “asymptotic” speed θ1. It was hypothesized that this parameter is influenced by the addition of Puromycin. The experiment is therefore carried out once with the enzyme treated with Puromycin and once with the untreated enzyme. Figure 1.1.d shows the result. In this section the data of the treated enzyme is used.

Example e Biochemical Oxygen Demand. To determine the biochemical oxygen demand, stream water samples were enriched with soluble organic matter, with inorganic nutrients and with dissolved oxygen, and subdivided into bottles (Marske, 1967; see Bates and Watts, 1988). Each bottle was inoculated with a mixed culture of microorganisms, sealed and put in a climate chamber with constant temperature. The bottles were periodically opened and their dissolved oxygen concentration was analyzed, from which the biochemical oxygen demand [mg/l] was calculated. The model used to connect the cumulative biochemical oxygen demand Y with the incubation time x is based on exponential decay:
$$h\langle x; \theta \rangle = \theta_1 \bigl( 1 - e^{-\theta_2 x} \bigr) .$$
Figure 1.1.e shows the data and the regression function.

Example f Membrane Separation Technology (Rapold-Nydegger, 1994). The ratio of protonated to deprotonated carboxyl groups in the pores of cellulose membranes depends on the pH value x of the outer solution. The protonation of the carboxyl carbon atoms can be detected by ¹³C-NMR. We assume that the relationship can be described by the extended “Henderson-Hasselbach Equation” for polyelectrolytes
$$\log_{10}\left\langle \frac{\theta_1 - y}{y - \theta_2} \right\rangle = \theta_3 + \theta_4\, x ,$$
where the unknown parameters are θ1, θ2 and θ3 > 0 and θ4 < 0. Solving for y leads to the model
$$Y_i = h\langle x_i; \theta \rangle + E_i = \frac{\theta_1 + \theta_2\, 10^{\theta_3 + \theta_4 x_i}}{1 + 10^{\theta_3 + \theta_4 x_i}} + E_i .$$
The regression function h⟨xi; θ⟩ for a reasonably chosen θ is shown in Figure 1.1.f next to the data.


Figure 1.1.e.: Biochemical Oxygen Demand. (a) Data (oxygen demand against days) and (b) typical shape of the regression function.

Figure 1.1.f.: Membrane Separation Technology. (a) Data (y = chemical shift against x = pH) and (b) a typical shape of the regression function.

g Some Further Examples of Nonlinear Regression Functions.
• Hill model (enzyme kinetics): $h\langle x_i; \theta \rangle = \theta_1 x_i^{\theta_3} / (\theta_2 + x_i^{\theta_3})$.
  For θ3 = 1 this is also known as the Michaelis-Menten model (1.1.d).
• Mitscherlich function (growth analysis): $h\langle x_i; \theta \rangle = \theta_1 + \theta_2 \exp\langle \theta_3 x_i \rangle$.
• From kinetics (chemistry) we get the function
  $$h\langle x_i^{(1)}, x_i^{(2)}; \theta \rangle = \exp\langle -\theta_1 x_i^{(1)} \exp\langle -\theta_2 / x_i^{(2)} \rangle \rangle .$$
• Cobb-Douglas production function
  $$h\langle x_i^{(1)}, x_i^{(2)}; \theta \rangle = \theta_1 \bigl( x_i^{(1)} \bigr)^{\theta_2} \bigl( x_i^{(2)} \bigr)^{\theta_3} .$$
Since useful regression functions are often derived from the theoretical background of the application of interest, a general overview of nonlinear regression functions is of very limited benefit. A compilation of functions from publications can be found in Appendix 7 of Bates and Watts (1988).

h Transformably Linear Regression Functions. Some nonlinear regression functions have a very favourable structure. For example, a power function
$$h\langle x; \theta \rangle = \theta_1 x^{\theta_2}$$
can be transformed to a linear model by expressing the logarithm of h⟨x; θ⟩ as a linear (in the parameters) function of the logarithm of the explanatory variable x:
$$\ln\langle h\langle x; \theta \rangle \rangle = \ln\langle \theta_1 \rangle + \theta_2 \ln\langle x \rangle = \beta_0 + \beta_1 \widetilde{x} ,$$
where β0 = ln⟨θ1⟩, β1 = θ2 and x̃ = ln⟨x⟩. We call such a regression function h transformably linear.

i The Statistically Complete Model. The “regression fitting” of the “linearized” regression function given in the previous paragraph is based on the model
$$\ln\langle Y_i \rangle = \beta_0 + \beta_1 \widetilde{x}_i + E_i ,$$
where the random errors Ei all have the same normal distribution. We transform this model back and get
$$Y_i = \theta_1 \cdot x_i^{\theta_2} \cdot \widetilde{E}_i ,$$
with Ẽi = exp⟨Ei⟩. The errors Ẽi, i = 1, ..., n, now have a multiplicative effect and are log-normally distributed! The assumption about the random errors is thus clearly different than for a regression model based directly on h,
$$Y_i = \theta_1 \cdot x_i^{\theta_2} + E_i^* ,$$
where the random error E*ᵢ contributes additively and is normally distributed.
A linearization of the regression function is therefore advisable only if the assumptions about the random errors can be better satisfied – in our example, if the errors actually act multiplicatively rather than additively and are log-normally rather than normally distributed. These assumptions must be checked with a residual analysis (see, e.g., Figure 1.1.i).

j * Note: In linear regression it has been shown that the variance can be stabilized with certain transformations (e.g. log⟨·⟩, √·). If this is not possible, in certain circumstances one can also perform a weighted linear regression. The process is analogous in nonlinear regression.


Figure 1.1.i.: Tukey-Anscombe plots (residuals against fitted values) for four different situations. In the left column, the data are simulated from an additive error model, whereas in the right column the data are simulated from a multiplicative error model. The top row shows the Tukey-Anscombe plots of fits of a multiplicative error model to both data sets and the bottom row shows the Tukey-Anscombe plots of fits of an additive error model to both data sets. (The two fitted models are nls(y ~ a * x^b), i.e. y = a·x^b + E, and lm(log(y) ~ log(x)), i.e. ln(y) = ln(a) + b·ln(x) + E.)

k Here are some more linearizable functions (see also Daniel and Wood, 1980):
$$h\langle x; \theta \rangle = 1/(\theta_1 + \theta_2 \exp\langle -x \rangle) \ \longleftrightarrow\ 1/h\langle x; \theta \rangle = \theta_1 + \theta_2 \exp\langle -x \rangle$$
$$h\langle x; \theta \rangle = \theta_1 x/(\theta_2 + x) \ \longleftrightarrow\ 1/h\langle x; \theta \rangle = 1/\theta_1 + (\theta_2/\theta_1)\,\tfrac{1}{x}$$
$$h\langle x; \theta \rangle = \theta_1 x^{\theta_2} \ \longleftrightarrow\ \ln\langle h\langle x; \theta \rangle \rangle = \ln\langle \theta_1 \rangle + \theta_2 \ln\langle x \rangle$$
$$h\langle x; \theta \rangle = \theta_1 \exp\langle \theta_2\, g\langle x \rangle \rangle \ \longleftrightarrow\ \ln\langle h\langle x; \theta \rangle \rangle = \ln\langle \theta_1 \rangle + \theta_2\, g\langle x \rangle$$
$$h\langle x; \theta \rangle = \exp\langle -\theta_1 x^{(1)} \exp\langle -\theta_2/x^{(2)} \rangle \rangle \ \longleftrightarrow\ \ln\langle -\ln\langle h\langle x; \theta \rangle \rangle \rangle = \ln\langle \theta_1 \rangle + \ln\langle x^{(1)} \rangle - \theta_2/x^{(2)}$$
$$h\langle x; \theta \rangle = \theta_1 \bigl( x^{(1)} \bigr)^{\theta_2} \bigl( x^{(2)} \bigr)^{\theta_3} \ \longleftrightarrow\ \ln\langle h\langle x; \theta \rangle \rangle = \ln\langle \theta_1 \rangle + \theta_2 \ln\langle x^{(1)} \rangle + \theta_3 \ln\langle x^{(2)} \rangle$$
The last one is the Cobb-Douglas model from 1.1.g.

l We have almost exclusively seen regression functions that depend on only one predictor variable x. This was primarily because it was possible to graphically illustrate the model. The following theory also works well for regression functions h⟨x; θ⟩ that depend on several predictor variables x = [x^(1), x^(2), ..., x^(m)].

1.2 Model Fitting Using an Iterative Algorithm

a The Principle of Least Squares. To get estimates for the parameters θ = [θ1, θ2, ..., θp]^T, one applies – as in linear regression – the principle of least squares. The sum of the squared deviations
$$S(\theta) := \sum_{i=1}^{n} \bigl( y_i - \eta_i\langle \theta \rangle \bigr)^2 \quad\text{where}\quad \eta_i\langle \theta \rangle := h\langle x_i; \theta \rangle$$
should be minimized. The notation that replaces h⟨xi; θ⟩ by ηi⟨θ⟩ is reasonable because [xi, yi] is given by the data and only the parameters θ remain to be determined. Unfortunately, the minimum of S(θ), and thus the estimator, has no explicit solution as was the case for linear regression. Iterative numerical procedures are therefore needed. We will sketch the basic ideas of the most common algorithm. It is also the basis for the easiest way to derive tests and confidence intervals.
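In R, this minimization is carried out by nls(). The following minimal sketch (not part of the original text) fits the Michaelis-Menten function of Example 1.1.d to the treated cases of R's built-in Puromycin data frame (columns conc, rate, state); the starting values are rough guesses of the kind discussed in 1.2.i–l.

## Minimal sketch: least squares fit of the Michaelis-Menten function
## with nls(), using R's built-in Puromycin data (treated enzyme only).
d <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data  = d,
           start = list(theta1 = 200, theta2 = 0.1),  # rough guesses, cf. 1.2.i-l
           trace = TRUE)                              # prints S(theta) per iteration
summary(fit)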

b Geometrical Illustration. The observed values Y = [Y1, Y2, ..., Yn]^T define a point in n-dimensional space. The same holds true for the “model values” η⟨θ⟩ = [η1⟨θ⟩, η2⟨θ⟩, ..., ηn⟨θ⟩]^T for a given θ.
Please take note: In multivariate statistics, where an observation consists of m variables x^(j), j = 1, 2, ..., m, it is common to illustrate the observations in m-dimensional space. Here, we consider the Y- and η-values of all n observations as points in n-dimensional space.
Unfortunately, the geometrical interpretation stops with three dimensions (and thus with three observations). Nevertheless, let us have a look at such a situation, first for simple linear regression.

c As stated above, the observed values Y = [Y1, Y2, Y3]^T determine a point in three-dimensional space. For given parameters β0 = 5 and β1 = 1 we can calculate the model values ηi⟨β⟩ = β0 + β1 xi and represent the corresponding vector η⟨β⟩ = β0 1 + β1 x as a point. We now ask: Where are all the points that can be achieved by varying the parameters? These are the possible linear combinations of the two vectors 1 and x and thus form the plane “spanned by 1 and x”. By estimating the parameters according to the principle of least squares, the squared distance between Y and η⟨β⟩ is minimized. So, we want the point on the plane that has the least distance to Y. This is also called the projection of Y onto the plane. The parameter values that correspond to this point η̂ are therefore the estimated parameter values β̂ = [β̂0, β̂1]^T. An illustration can be found in Figure 1.2.c.

d Now we want to fit a nonlinear function, e.g. h⟨x; θ⟩ = θ1 exp⟨1 − θ2 x⟩, to the same three observations. We can again ask ourselves where all the points η⟨θ⟩ lie that can be achieved by varying the parameters θ1 and θ2. They lie on a two-dimensional curved surface (called the model surface in the following) in three-dimensional space. The estimation problem again consists of finding the point η̂ on the model surface that is closest to Y. The parameter values that correspond to this point η̂ are then the estimated parameter values θ̂ = [θ̂1, θ̂2]^T. Figure 1.2.d illustrates the nonlinear case.

Example e Biochemical Oxygen Demand (cont'd) The situation for our Biochemical Oxygen Demand example can be found in Figure 1.2.e. Basically, we can read the estimated parameters directly off the graph here: θ̂1 is a bit less than 21 and θ̂2 is a bit larger than 0.6. In fact the (exact) solution is θ̂ = [20.82, 0.6103] (note that these are the parameter estimates for the reduced data set consisting of only three observations).


Figure 1.2.c.: Illustration of simple linear regression. Values of η⟨β⟩ = β0 1 + β1 x for varying parameters [β0, β1] lead to a plane in three-dimensional space. The right plot also shows the point on the surface that is closest to Y = [Y1, Y2, Y3]. It is the fitted value ŷ and determines the estimated parameters β̂.

Figure 1.2.d.: Geometrical illustration of nonlinear regression. The values of η⟨θ⟩ = h⟨x; θ1, θ2⟩ for varying parameters [θ1, θ2] lead to a two-dimensional “model surface” in three-dimensional space. The lines on the model surface correspond to constant η1 and η3, respectively.


Figure 1.2.e.: Biochemical Oxygen Demand: Geometrical illustration of nonlinear regression. In addition, we can see here the lines of constant θ1 and θ2, respectively. The vector of the estimated model values ŷ = h⟨x; θ̂⟩ is the point on the model surface that is closest to Y.

f Approach for the Minimization Problem. The main idea of the usual algorithm for minimizing the sum of squares (see 1.2.a) is as follows: If a preliminary best value θ^(ℓ) exists, we approximate the model surface with the plane that touches the surface at the point η⟨θ^(ℓ)⟩ = h⟨x; θ^(ℓ)⟩ (tangent plane). Now we are looking for the point on that plane that lies closest to Y. This is the same as estimation in a linear regression problem. This new point lies on the plane, but not on the surface that corresponds to the nonlinear problem. However, it determines a parameter vector θ^(ℓ+1) that we use as the starting value for the next iteration.

g Linear Approximation. To determine the tangent plane we need the partial derivatives
$$A_i^{(j)}\langle \theta \rangle := \frac{\partial \eta_i\langle \theta \rangle}{\partial \theta_j} ,$$
which can be summarized in an n × p matrix A. The approximation of the model surface η⟨θ⟩ by the tangent plane at a parameter value θ* is
$$\eta_i\langle \theta \rangle \approx \eta_i\langle \theta^* \rangle + A_i^{(1)}\langle \theta^* \rangle\, (\theta_1 - \theta_1^*) + \dots + A_i^{(p)}\langle \theta^* \rangle\, (\theta_p - \theta_p^*)$$
or, in matrix notation,
$$\eta\langle \theta \rangle \approx \eta\langle \theta^* \rangle + \boldsymbol{A}\langle \theta^* \rangle\, (\theta - \theta^*) .$$
If we now add a random error, we get a linear regression model
$$\widetilde{Y} = \boldsymbol{A}\langle \theta^* \rangle\, \beta + E$$
with the “preliminary residuals” Ỹi = Yi − ηi⟨θ*⟩ as response variable, the columns of A as predictors and the coefficients βj = θj − θ*j (a model without intercept β0).

h Gauss-Newton Algorithm. The Gauss-Newton algorithm starts with an initial value θ^(0) for θ, solving the just introduced linear regression problem for θ* = θ^(0) to find a correction β̂ and hence an improved value θ^(1) = θ^(0) + β̂. Again, the approximated model is calculated, and thus the “preliminary residuals” Y − η⟨θ^(1)⟩ and the partial derivatives A⟨θ^(1)⟩ are determined, leading to θ^(2). This iteration step is continued until the correction β̂ is small enough. (Further details can be found in Appendix A.1.)
It cannot be guaranteed that this procedure actually finds the minimum of the sum of squares. The better the p-dimensional model surface at the minimum θ̂ = (θ̂1, ..., θ̂p)^T can be locally approximated by a p-dimensional plane and the closer the initial value θ^(0) is to the solution, the higher are the chances of finding the optimal value.
* Algorithms usually determine the derivative matrix A numerically. In more complex problems the numerical approximation can be insufficient and cause convergence problems. For such situations it is an advantage if explicit expressions for the partial derivatives can be used to determine the derivative matrix more reliably (see also Section 2.3).
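To make the iteration concrete, here is a small sketch (not part of the original text) of one plain Gauss-Newton step for the Michaelis-Menten model, applied to the treated cases of R's built-in Puromycin data; nls() does essentially this internally, with additional safeguards such as step halving.

## Sketch of the plain Gauss-Newton iteration of 1.2.f-h for the
## Michaelis-Menten model h<x; theta> = theta1 * x / (theta2 + x).
gauss_newton_step <- function(theta, x, y) {
  eta  <- theta[1] * x / (theta[2] + x)               # model values eta_i<theta>
  A    <- cbind(x / (theta[2] + x),                   # d eta_i / d theta1
                -theta[1] * x / (theta[2] + x)^2)     # d eta_i / d theta2
  beta <- lm.fit(A, y - eta)$coefficients             # LS fit to preliminary residuals
  theta + beta                                        # improved value theta^(l+1)
}
d <- subset(Puromycin, state == "treated")
theta <- c(200, 0.1)                                  # starting value theta^(0)
for (l in 1:8) theta <- gauss_newton_step(theta, d$conc, d$rate)
theta                                                 # close to the nls() estimates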

i Starting Values. An iterative procedure always requires a starting value. Good starting values help to find a solution more quickly and more reliably. Several simple but useful principles for determining starting values can be used:
• use prior knowledge;
• interpret the behavior of the expectation function h⟨·⟩ in terms of the parameters, analytically or graphically;
• transform the expectation function h⟨·⟩ analytically to obtain linear behavior;
• reduce dimensionality by substituting values for some parameters or by evaluating the function at specific design values; and
• use conditional linearity.

j Starting Value from Prior Knowledge. As already noted in the introduction, nonlinear models are often based on theoretical considerations from the corresponding application area. Already existing prior knowledge from similar experiments can be used to get a starting value. To ensure the quality of the chosen starting value, it is advisable to graphically represent the regression function h⟨x; θ⟩ for various possible starting values θ = θ0 together with the data (e.g., as in Figure 1.2.k, right).


Figure 1.2.k.: Puromycin. Left: Regression function in the linearized problem (1/velocity against 1/concentration). Right: Regression function h⟨x; θ⟩ for the starting value θ = θ^(0) and for the least squares estimate θ = θ̂.

k Starting Values from Transforming the Expectation Function. Often – because of the distribution of the error term – one is forced to use a nonlinear regression function even though the expectation function h⟨·⟩ could be transformed to a linear function. However, the linearized expectation function can be used to get starting values.
In the Puromycin example the regression function is linearizable: the reciprocal values of the two variables fulfill
$$\widetilde{y} = \frac{1}{y} \approx \frac{1}{h\langle x; \theta \rangle} = \frac{1}{\theta_1} + \frac{\theta_2}{\theta_1}\, \frac{1}{x} = \beta_0 + \beta_1 \widetilde{x} .$$
The least squares solution for this modified problem is β̂ = [β̂0, β̂1]^T = (0.00511, 0.000247)^T (Figure 1.2.k, left). This leads to the starting values
$$\theta_1^{(0)} = 1/\widehat{\beta}_0 = 196 , \qquad \theta_2^{(0)} = \widehat{\beta}_1/\widehat{\beta}_0 = 0.048 .$$
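A sketch of this calculation in R (not part of the original text), again using the built-in Puromycin data:

## Sketch: starting values for the Michaelis-Menten fit from the
## linearized (reciprocal) relation 1/y = beta0 + beta1 * (1/x).
d     <- subset(Puromycin, state == "treated")
beta  <- unname(coef(lm(I(1/rate) ~ I(1/conc), data = d)))
start <- list(theta1 = 1/beta[1], theta2 = beta[2]/beta[1])
unlist(start)                                   # roughly 196 and 0.048, as above
nls(rate ~ theta1 * conc / (theta2 + conc), data = d, start = start)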

l Starting Values from Interpreting the Behavior of h⟨·⟩. It is often helpful to consider the geometrical features of the regression function.
In the Puromycin example we can derive a starting value in another way: θ1 is the response value for x = ∞. Since the regression function is monotonically increasing, we can use the maximal yi value or a visually determined “asymptotic value” θ1^(0) = 207 as starting value for θ1. The parameter θ2 is the x-value at which y reaches half of the asymptotic value θ1. This leads to θ2^(0) = 0.06.
The starting values thus result from a geometrical interpretation of the parameters, and a rough estimate can be determined by “fitting by eye”.

Example m Membrane Separation Technology (cont'd) In the Membrane Separation Technology example we let x → ∞, so h⟨x; θ⟩ → θ1 (since θ4 < 0); for x → −∞, h⟨x; θ⟩ → θ2. From Figure 1.1.f (a) we see that θ1 ≈ 163.7 and θ2 ≈ 159.5. Once we know θ1 and θ2, we can linearize the regression function by
$$\widetilde{y} := \log_{10}\left\langle \frac{\theta_1^{(0)} - y}{y - \theta_2^{(0)}} \right\rangle = \theta_3 + \theta_4\, x .$$


Figure 1.2.m.: Membrane Separation Technology. (a) Regression line that is used for determining the starting values for θ3 and θ4. (b) Regression function h⟨x; θ⟩ for the starting value θ = θ^(0) and for the least squares estimator θ = θ̂.

This is called conditional linearity of the expectation function. The linear regression model leads to the starting values θ3^(0) = 1.83 and θ4^(0) = −0.36, respectively.
With these starting values the algorithm converges to the solution θ̂1 = 163.7, θ̂2 = 159.8, θ̂3 = 2.675 and θ̂4 = −0.512. The functions h⟨·; θ^(0)⟩ and h⟨·; θ̂⟩ are shown in Figure 1.2.m (b).
* The property of conditional linearity of a function can also be useful to develop an algorithm specifically suited for this situation (see e.g. Bates and Watts, 1988).
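A sketch of the corresponding computation in R (not part of the original text); it assumes the data frame D.membran (columns delta and pH) that is used in R Output 1.3.b below.

## Sketch: starting values for theta3 and theta4 via the conditionally
## linear form, assuming the data frame D.membran (columns delta, pH).
t1.0 <- 163.7; t2.0 <- 159.5                  # rough asymptotes from Figure 1.1.f
ytilde <- log10((t1.0 - D.membran$delta) / (D.membran$delta - t2.0))
coef(lm(ytilde ~ pH, data = D.membran))       # roughly 1.83 and -0.36
## Observations outside (t2.0, t1.0) would give NaN here; lm() drops them.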

n Self-Starter Function. For repeated use of the same nonlinear regression model, some automated way of providing starting values is demanded nowadays. Basically, we should be able to collect all the manual steps that are necessary to obtain the initial values for a nonlinear regression model into a function, and use it to generate the starting values. Such a function is called a self-starter function and should allow the estimation procedure to run easily and smoothly.
Self-starter functions are specific to a given mean function and calculate starting values for a given dataset. The challenge is to design a self-starter function robustly; that is, its resulting values must allow the estimation algorithm to converge to the parameter estimates.
One self-starter knowledge base comes with the standard installation of R. You find an overview in Table 1.2.n and some more details, including information on how to write a self-starter yourself, in Ritz and Streibig (2008).
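For example (a sketch, not from the original text), the Puromycin fit can be run without hand-picked starting values by using the self-starter SSmicmen() listed in Table 1.2.n below:

## Sketch: Michaelis-Menten fit with a self-starter function, so that
## nls() computes its own starting values.
d   <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ SSmicmen(conc, Vm, K), data = d)
coef(fit)     # Vm and K play the roles of theta1 and theta2 in 1.1.d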

Model                              Mean Function                                                        Self-Starter Function
Biexponential                      A1·e^(−x·e^lrc1) + A2·e^(−x·e^lrc2)                                  SSbiexp(x, A1, lrc1, A2, lrc2)
Asymptotic regression              Asym + (R0 − Asym)·e^(−x·e^lrc)                                      SSasymp(x, Asym, R0, lrc)
Asymptotic regression with offset  Asym·(1 − e^(−(x−c0)·e^lrc))                                         SSasympOff(x, Asym, lrc, c0)
Asymptotic regression (c0 = 0)     Asym·(1 − e^(−x·e^lrc))                                              SSasympOrig(x, Asym, lrc)
First-order compartment            x1·e^(lKe+lKa−lCl)·(e^(−x2·e^lKe) − e^(−x2·e^lKa))/(e^lKa − e^lKe)   SSfol(x1, x2, lKe, lKa, lCl)
Gompertz                           Asym·e^(−b2·b3^x)                                                    SSgompertz(x, Asym, b2, b3)
Logistic                           A + (B − A)/(1 + e^((xmid−x)/scal))                                  SSfpl(x, A, B, xmid, scal)
Logistic (A = 0)                   Asym/(1 + e^((xmid−x)/scal))                                         SSlogis(x, Asym, xmid, scal)
Michaelis-Menten                   Vm·x/(K + x)                                                         SSmicmen(x, Vm, K)
Weibull                            Asym − Drop·e^(−e^lrc·x^pwr)                                         SSweibull(x, Asym, Drop, lrc, pwr)

Table 1.2.n.: Available self-starter functions for nls() which come with the standard installation of R.

1.3 Inference Based on Linear Approximations

a The estimator θ̂ is the value of θ that optimally fits the data. We now ask which parameter values θ are compatible with the observations. The confidence region is the set of all these values. For an individual parameter θj the confidence region is a confidence interval.
These concepts have been discussed in great detail in the module “Linear Regression Analysis”. A look at the summary output of an nls object shows that it is very similar to the summary output of a fitted linear regression model.

Example b Membrane Separation Technology (cont'd) The R summary output for the Membrane Separation Technology example can be found in R Output 1.3.b. The parameter estimates are in the column Estimate, followed by the estimated approximate standard errors (Std. Error) and the test statistics (t value), which are approximately t_{n−p} distributed. The corresponding p-values can be found in the column Pr(>|t|). The estimated standard deviation σ̂ of the random error Ei is labelled “Residual standard error” here.

> Mem.fit <- nls(delta ~ (T1 + T2*10^(T3+T4*pH)) / (10^(T3+T4*pH) + 1),
                 data = D.membran,
                 start = list(T1 = 163.7, T2 = 159.5, T3 = 1.83, T4 = -0.36))
> summary(Mem.fit)
Formula: delta ~ (T1 + T2 * 10^(T3 + T4 * pH)) / (10^(T3 + T4 * pH) + 1)

Parameters:
    Estimate Std. Error  t value Pr(>|t|)
T1  163.7056     0.1262 1297.256  < 2e-16
T2  159.7846     0.1594 1002.194  < 2e-16
T3    2.6751     0.3813    7.015 3.65e-08
T4   -0.5119     0.0703   -7.281 1.66e-08

Residual standard error: 0.2931 on 35 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 5.517e-06

R-Output 1.3.b: Summary of the fit of the Membrane Separation Technology example.

c Going into detail, it becomes clear that inference is a more complex matter here. It is not possible to write down exact results. However, one can derive inference results that hold if the number n of observations goes to infinity; they then look like those in linear regression analysis. Such results can be used for finite n, but they hold only approximately. In this section, we will explore these so-called asymptotic results in some more detail.

d The asymptotic properties of the estimator can be derived from the linear approximation. The problem of nonlinear regression is indeed approximately equal to the linear regression problem mentioned in 1.2.g,
$$\widetilde{Y} = \boldsymbol{A}\langle \theta^* \rangle\, \beta + E ,$$
if the parameter vector θ* that is used for the linearization is close to the solution. If the estimation procedure has converged (i.e. θ* = θ̂), then β̂ = 0 (otherwise this would not be the solution). The standard errors of the coefficients β̂ – or more generally the covariance matrix of β̂ – then approximate the corresponding values for θ̂.
* More precisely: The standard errors characterize the uncertainties that are generated by the random fluctuations of the data. The available data have led to the estimator θ̂. If the data had been slightly different, θ̂ would still be approximately correct, so we accept that it is good enough for the linearization. The estimate of β for the new data set would then lie as far from the estimate for the available data as is quantified by the distribution of the parameters in the linear problem.

e Asymptotic Distribution of the Least Squares Estimator. It follows that the least squares estimator θ̂ is asymptotically normally distributed,
$$\widehat{\theta} \ \overset{as.}{\sim}\ \mathcal{N}\langle \theta ,\ V\langle \theta \rangle \rangle ,$$
with asymptotic covariance matrix $V\langle \theta \rangle = \sigma^2 \bigl( \boldsymbol{A}\langle \theta \rangle^T \boldsymbol{A}\langle \theta \rangle \bigr)^{-1}$, where A⟨θ⟩ is the n × p matrix of partial derivatives (see 1.2.g).
To explicitly determine the covariance matrix V⟨θ⟩, A⟨θ⟩ is calculated using θ̂ instead of the unknown θ and is denoted by Â. For the error variance σ² we plug in the usual estimator. Hence, we can write
$$\widehat{V} = \widehat{\sigma}^2 \bigl( \widehat{\boldsymbol{A}}^T \widehat{\boldsymbol{A}} \bigr)^{-1} \quad\text{where}\quad \widehat{\sigma}^2 = \frac{S\langle \widehat{\theta} \rangle}{n-p} = \frac{1}{n-p} \sum_{i=1}^{n} \bigl( y_i - \eta_i\langle \widehat{\theta} \rangle \bigr)^2 \quad\text{and}\quad \widehat{\boldsymbol{A}} = \boldsymbol{A}\langle \widehat{\theta} \rangle .$$
Hence, the distribution of the estimated parameters is approximately determined and we can (as in linear regression) derive standard errors and confidence intervals, or confidence ellipses (or ellipsoids) if multiple parameters are considered jointly.
The denominator n − p in the estimator σ̂² was already introduced in linear regression to ensure that the estimator is unbiased. Tests and confidence intervals were then based not on the normal and chi-square distributions but on the t- and F-distributions. They take into account that the estimation of σ² causes additional random fluctuation. Even if the distributions are no longer exact, the approximations become more exact if we do this in nonlinear regression too. Asymptotically the difference goes to zero.
Based on these considerations, we can construct approximate (1−α)·100% confidence intervals
$$\widehat{\theta}_k \pm q^{t_{n-p}}_{1-\alpha/2} \cdot \sqrt{\widehat{V}_{kk}} ,$$
where V̂_kk is the kth diagonal element of V̂.
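In R these quantities are available directly from a fitted nls object. The following sketch (not part of the original text) reproduces the Wald-type intervals for the fit Mem.fit of R Output 1.3.b.

## Sketch: approximate (Wald) 95% confidence intervals from the linear
## approximation, computed by hand for Mem.fit (see R Output 1.3.b).
V  <- vcov(Mem.fit)                       # estimate of sigma^2 * (A^T A)^{-1}
th <- coef(Mem.fit)
se <- sqrt(diag(V))
q  <- qt(0.975, df.residual(Mem.fit))
cbind(lower = th - q * se, upper = th + q * se)
## The same intervals are returned by confint.default(Mem.fit).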

Example f Membrane Separation Technology (cont'd) Based on the summary output given in 1.3.b, the approximate 95% confidence interval for the parameter θ1 is
$$163.706 \pm q^{t_{35}}_{0.975} \cdot 0.1262 = 163.706 \pm 0.256 .$$
Analogous calculations yield the following 95% confidence intervals for the parameters θ1, θ2, θ3, and θ4:
θ1: [163.45, 163.96]
θ2: [159.46, 160.11]
θ3: [1.90, 3.45]
θ4: [−0.65, −0.37]

Example g Puromycin (cont'd) In order to check the influence of treating an enzyme with Puromycin, a general model for the data (with and without treatment) can be formulated as follows:
$$Y_i = \frac{(\theta_1 + \theta_3 z_i)\, x_i}{\theta_2 + \theta_4 z_i + x_i} + E_i ,$$
where z is the indicator variable for the treatment (zi = 1 if treated, zi = 0 otherwise). R Output 1.3.g shows that the parameter θ4 is not significantly different from 0 at the 5% level, since the p-value of 0.167 is larger than the level (5%). However, the treatment has a clear influence, which is expressed through θ3; the 95% confidence interval covers the region 52.398 ± 9.5513 · 2.09 = [32.4, 72.4] (the value 2.09 corresponds to the 97.5% quantile of the t19 distribution).

Formula: velocity ~ (T1 + T3 * (treated == T)) * conc / (T2 + T4 * (treated == T) + conc)

Parameters:
    Estimate Std. Error t value Pr(>|t|)
T1  160.280      6.896  23.242 2.04e-15
T2    0.048      0.008   5.761 1.50e-05
T3   52.404      9.551   5.487 2.71e-05
T4    0.016      0.011   1.436    0.167

Residual standard error: 10.4 on 19 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 4.267e-06

R-Output 1.3.g: Computer output of the fit for the Puromycin example.
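The call behind this output is not shown in the text. A plausible reconstruction (a sketch assuming the built-in Puromycin data, with state == "treated" playing the role of the indicator z) is:

## Sketch: fit of the combined model of 1.3.g with the treatment
## indicator z = (state == "treated"), using the built-in Puromycin data.
fit.both <- nls(rate ~ (T1 + T3 * (state == "treated")) * conc /
                       (T2 + T4 * (state == "treated") + conc),
                data  = Puromycin,
                start = list(T1 = 160, T2 = 0.05, T3 = 50, T4 = 0.01))
summary(fit.both)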

h Confidence Intervals for Function Values. Besides the parameters, the function value h⟨x0; θ⟩ for a given x0 is often of interest. In linear regression the function value h⟨x0; β⟩ = x0^T β =: η0 is estimated by η̂0 = x0^T β̂ and the corresponding (1−α)·100% confidence interval is
$$\widehat{\eta}_0 \pm q^{t_{n-p}}_{1-\alpha/2} \cdot se\langle \widehat{\eta}_0 \rangle \quad\text{where}\quad se\langle \widehat{\eta}_0 \rangle = \widehat{\sigma} \sqrt{x_0^T \bigl( X^T X \bigr)^{-1} x_0} .$$
Using asymptotic approximations, we can specify confidence intervals for the function values h⟨x0; θ⟩ for nonlinear h. If η̂0 := h⟨x0; θ̂⟩ is linearly approximated at θ, we get
$$\widehat{\eta}_0 \approx h\langle x_0; \theta \rangle + a_0^T\, (\widehat{\theta} - \theta) \quad\text{where}\quad a_0 = \frac{\partial h\langle x_0; \theta \rangle}{\partial \theta} .$$
Based on the asymptotic results of 1.3.e, we obtain the following asymptotic distribution of η̂0:
$$\widehat{\eta}_0 \ \overset{as.}{\sim}\ \mathcal{N}\bigl\langle h\langle x_0; \theta \rangle ,\ a_0^T\, V\langle \theta \rangle\, a_0 \bigr\rangle .$$
To use this expression to calculate an approximate (1−α) confidence interval for the function value η0⟨θ⟩ := h⟨x0; θ⟩, we must again replace the unknown parameter θ by its estimate θ̂. This yields
$$\widehat{\eta}_0\langle \widehat{\theta} \rangle \pm q^{t_{n-p}}_{1-\alpha/2} \cdot se\bigl\langle \widehat{\eta}_0\langle \widehat{\theta} \rangle \bigr\rangle , \quad\text{where}\quad se\bigl\langle \widehat{\eta}_0\langle \widehat{\theta} \rangle \bigr\rangle = \widehat{\sigma} \sqrt{\widehat{a}_0^{\,T} \bigl( \widehat{\boldsymbol{A}}^T \widehat{\boldsymbol{A}} \bigr)^{-1} \widehat{a}_0} = \sqrt{\widehat{a}_0^{\,T}\, \widehat{V}\, \widehat{a}_0} \quad\text{with}\quad \widehat{a}_0 = \left. \frac{\partial h\langle x_0; \theta \rangle}{\partial \theta} \right|_{\theta = \widehat{\theta}}$$
(for the definition of V̂ see 1.3.e). If x0 is equal to an observed xi, â0 equals the corresponding row of the matrix Â.
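As a sketch (not from the original text), the interval for the oxygen demand at x0 = 6 days can be computed along these lines from R's built-in BOD data (columns Time and demand), which closely mirror the example:

## Sketch: approximate 95% CI for h<x0; theta> at x0 = 6 days,
## using the built-in BOD data for the oxygen demand example.
fit  <- nls(demand ~ T1 * (1 - exp(-T2 * Time)), data = BOD,
            start = list(T1 = 20, T2 = 0.5))
th   <- coef(fit); x0 <- 6
eta0 <- th[["T1"]] * (1 - exp(-th[["T2"]] * x0))
a0   <- c(1 - exp(-th[["T2"]] * x0),                 # d h / d theta1
          th[["T1"]] * x0 * exp(-th[["T2"]] * x0))   # d h / d theta2
se0  <- drop(sqrt(t(a0) %*% vcov(fit) %*% a0))
eta0 + c(-1, 1) * qt(0.975, df.residual(fit)) * se0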

i Confidence Band. The expression for the (1−α) confidence interval for η0⟨θ⟩ := h⟨x0; θ⟩ also holds for arbitrary x0. As in linear regression, it is illustrative to represent the limits of these intervals as a “confidence band” as a function of x0. See Figure 1.3.i for the confidence bands for the examples “Puromycin” and “Biochemical Oxygen Demand”.
Confidence bands for linear and nonlinear regression functions behave differently: For linear functions the confidence band has minimal width at the center of gravity of the predictor variables and gets wider the further away one moves from the center (see Figure 1.3.i, left). In the nonlinear case, the bands can have an arbitrary shape. Because the functions in the “Puromycin” and “Biochemical Oxygen Demand” examples must go through zero, the interval shrinks to a point there. Both models have a horizontal asymptote and therefore the band reaches a constant width for large x (see Figure 1.3.i, right).

j Prediction Interval. The confidence band gives us an idea of the function values h⟨x⟩ (the expected values of Y for a given x). However, it does not answer the question of where future observations Y0 for a given x0 will lie. This is often more interesting than the question of the function value itself; for example, we would like to know where the measured value of the oxygen demand will lie for an incubation time of 6 days.
Such a statement is a prediction about a random variable and should be distinguished from a confidence interval, which says something about a parameter, which is a fixed (but unknown) number. Hence, we call the region a prediction interval. More about this in Chapter 3.1.


Figure 1.3.i.: Left: Confidence band for an estimated line in a linear problem (log(PCB concentration) against Years^(1/3)). Right: Confidence band for the estimated curve h⟨x; θ⟩ in the oxygen demand example.

k Variable Selection. In nonlinear regression, unlike in linear regression, variable selection is usually not an important topic, because
• there is no one-to-one relationship between parameters and predictor variables; usually, the number of parameters is different from the number of predictors;
• there are seldom problems where we need to clarify whether an explanatory variable is necessary or not – the model is derived from the underlying theory (e.g., “enzyme kinetics”).
However, there is sometimes the reasonable question of whether a subset of the parameters in the nonlinear regression model can appropriately describe the data (see example “Puromycin”).

l Model Selection. Sometimes we are in the situation where we need to find the most appropriate model for a dataset, for which a collection of candidate models is available. Most of the time, these models are not nested submodels of each other. Then one can use Akaike's information criterion (AIC) to select the best model and run a residual analysis to confirm the selection.
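As a small sketch (not from the original text), two non-nested candidate mean functions for the treated Puromycin data can be compared like this:

## Sketch: AIC comparison of two candidate mean functions (smaller is better).
d       <- subset(Puromycin, state == "treated")
fit.mm  <- nls(rate ~ SSmicmen(conc, Vm, K), data = d)         # Michaelis-Menten
fit.asy <- nls(rate ~ SSasympOrig(conc, Asym, lrc), data = d)  # asymptotic regression
AIC(fit.mm, fit.asy)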


2 Improved Inference and Its Visualisation

2.1 Likelihood Based Inference

a The quality of the approximate confidence region strongly depends on the quality of the linear approximation. Also, the convergence properties of the optimization algorithm are influenced by the quality of the linear approximation. With a somewhat larger computational effort, linearity can be checked graphically and – at the same time – we get more precise confidence intervals.

b F-Test for Model Comparison. To test a null hypothesis θ = θ* for the whole parameter vector, or θj = θ*j for an individual component, we can use an F-test for model comparison as in linear regression. Here, we compare the sum of squares S⟨θ*⟩ that arises under the null hypothesis with the sum of squares S⟨θ̂⟩. (For n → ∞ the F-test is the same as the so-called likelihood-ratio test, and the sum of squares is, up to a constant, equal to the negative log-likelihood.)
Let us first consider the null hypothesis θ = θ* for the whole parameter vector. The test statistic is
$$T = \frac{n-p}{p}\, \frac{S\langle \theta^* \rangle - S\langle \widehat{\theta} \rangle}{S\langle \widehat{\theta} \rangle} \ \overset{(as.)}{\sim}\ F_{p,\,n-p} .$$
Searching for all null hypotheses that are not rejected leads us to the confidence region
$$\Bigl\{ \theta \ \Big|\ S\langle \theta \rangle \le S\langle \widehat{\theta} \rangle \bigl( 1 + \tfrac{p}{n-p}\, q \bigr) \Bigr\} ,$$
where q = q^{F_{p,n−p}}_{1−α} is the (1−α) quantile of the F-distribution with p and n−p degrees of freedom.
In linear regression we get the same (exact) confidence region if we use the (multivariate) normal distribution of the estimator β̂. In the nonlinear case the results are different. The region that is based on the F-test is not based on the linear approximation in 1.2.g and hence is (much) more exact.

c Confidence Regions for p = 2. For p = 2, we can find the confidence regions by calculating S⟨θ⟩ on a grid of θ values and determining the borders of the region through interpolation, as is common for contour plots. Figure 2.1.c illustrates both the confidence region based on the linear approximation and the one based on the F-test for the example “Puromycin” (left) and for “Biochemical Oxygen Demand” (right).
For p > 2 contour plots do not exist. In the next chapter we will introduce graphical tools that also work in higher dimensions. They depend on the following concepts.


Figure 2.1.c.: Nominal 80% and 95% likelihood contours (solid) and the confidence ellipses from the asymptotic approximation (dashed). + denotes the least squares solution. In the Puromycin example (left) the agreement is good; in the oxygen demand example (right) it is bad.

d F-Test for Individual Parameters. Now the question of interest is whether an individual parameter θk is equal to a certain value θ*k. Such a null hypothesis makes no statement about the remaining parameters. The model that corresponds to the null hypothesis and best fits the data is, for fixed θk = θ*k, given by the least squares solution for the remaining parameters. So, S⟨θ1, ..., θ*k, ..., θp⟩ is minimized with respect to the θj, j ≠ k. We denote the minimum by S̃k and the minimizing values by θ̃j; both depend on θ*k. We therefore write S̃k⟨θ*k⟩ and θ̃j⟨θ*k⟩.
The test statistic for the F-test (with null hypothesis H0: θk = θ*k) is
$$\widetilde{T}_k = (n-p)\, \frac{\widetilde{S}_k\langle \theta_k^* \rangle - S\langle \widehat{\theta} \rangle}{S\langle \widehat{\theta} \rangle} .$$
It follows (approximately) an F_{1,n−p} distribution.
We can now construct a confidence interval by (numerically) solving the equation T̃k = q^{F_{1,n−p}}_{0.95} for θ*k. It has one solution that is less than θ̂k and one that is larger.
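In R such a test can be carried out by refitting the model with θk fixed and comparing the two nested fits with anova(). For the null hypothesis θ4 = 0 in the Puromycin model of 1.3.g this reads (a sketch assuming the fit.both object from the sketch after R Output 1.3.g):

## Sketch: F-test of H0: theta4 = 0 by comparing nested nls fits.
fit.red <- nls(rate ~ (T1 + T3 * (state == "treated")) * conc / (T2 + conc),
               data  = Puromycin,
               start = list(T1 = 160, T2 = 0.05, T3 = 50))
anova(fit.red, fit.both)   # approximate F-test; cf. the p-value 0.167 in 1.3.g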

e t-Test via F-Test. In linear regression and in the previous chapter we calculated tests and confidence intervals from a test value that follows a t-distribution (t-test for the coefficients). Is this another test?
It turns out that the test statistic of the t-test in linear regression turns into the test statistic of the F-test if we square it. Hence, both tests are equivalent. In nonlinear regression, the F-test is not equivalent to the t-test discussed in the last chapter (1.3.e). However, we can transform the F-test into a t-test that is more accurate than the one of the last chapter: From the test statistic of the F-test we take the square root and add the sign of θ̂k − θ*k,
$$T_k\langle \theta_k^* \rangle := \operatorname{sign}\langle \widehat{\theta}_k - \theta_k^* \rangle\, \frac{\sqrt{\widetilde{S}_k\langle \theta_k^* \rangle - S\langle \widehat{\theta} \rangle}}{\widehat{\sigma}} ,$$
where sign⟨a⟩ denotes the sign of a and σ̂² = S⟨θ̂⟩/(n−p). This test statistic is (approximately) t_{n−p} distributed.
In the linear regression model, Tk is – as already pointed out – equal to the test statistic of the usual t-test,
$$T_k\langle \theta_k^* \rangle = \frac{\widehat{\theta}_k - \theta_k^*}{se\langle \widehat{\theta}_k \rangle} .$$

f Confidence Intervals for Function Values via F-Test. With this technique we can also determine confidence intervals for a function value at a point x0. For this we re-parameterize the original problem so that a parameter, say φ1, represents the function value h⟨x0⟩, and proceed as in 2.1.d.

2.2 Profile t-Plot and Profile Traces

a Profile t-Function and Profile t-Plot. The graphical tools for checking the linear approximation are based on the just discussed t-test, which actually does not use this approximation. We consider the test statistic Tk (2.1.e) as a function of its argument θk and call it the profile t-function (in the last chapter the argument was denoted by θ*k; now for simplicity we leave out the *). For linear regression we get, as can be seen from 2.1.e, a straight line, while for nonlinear regression the result is a monotone increasing function. The graphical comparison of Tk⟨θk⟩ with a straight line is the so-called profile t-plot. Instead of θk, it is common to use a standardized version
$$\delta_k\langle \theta_k \rangle := \frac{\theta_k - \widehat{\theta}_k}{se\langle \widehat{\theta}_k \rangle}$$
on the horizontal axis because it is used in the linear approximation. The comparison line is then the “diagonal”, i.e. the line with slope 1 and intercept 0.
The more the profile t-function is curved, the stronger is the nonlinearity in a neighborhood of θk. Therefore, this representation shows how good the linear approximation is in a neighborhood of θ̂k (the neighborhood that is statistically important is approximately determined by |δk⟨θk⟩| ≤ 2.5). In Figure 2.2.a it is apparent that in the Puromycin example the nonlinearity is minimal, while in the Biochemical Oxygen Demand example it is large.
In Figure 2.2.a we can also read off the confidence intervals according to 2.1.e. For convenience, the probabilities P⟨Tk ≤ t⟩ of the corresponding t-distributions are marked on the right vertical axis. For the Biochemical Oxygen Demand example this results in a confidence interval without an upper bound!
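In R, the profile t-functions of a fitted nls object are computed by profile(), and plotting the result gives (essentially) the profile t-plots of Figure 2.2.a. A sketch for the membrane fit Mem.fit (not part of the original text):

## Sketch: profile t-functions for the fit Mem.fit of R Output 1.3.b.
Mem.prof <- profile(Mem.fit)
plot(Mem.prof)      # one panel per parameter, cf. Figure 2.2.a
confint(Mem.fit)    # profile-based confidence intervals, cf. R Output 2.2.b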

Example b Membrane Separation Technology (cont'd) As described in 2.2.a, the corresponding confidence intervals based on the profile t-function can be read off the profile t-plot graphically. The R function confint(...) calculates the desired confidence interval numerically on the basis of the profile t-function. R Output 2.2.b shows the corresponding R output for the membrane separation example. In this case, no large differences from the classical calculation method are apparent (cf. 1.3.f).


> confint(Mem.fit, level = 0.95)
Waiting for profiling to be done...
          2.5%        97.5%
T1  163.4661095  163.9623685
T2  159.3562568  160.0953953
T3    1.9262495    3.6406832
T4   -0.6881818   -0.3797545

R-Output 2.2.b: Membrane separation technology example: R output for the confidence intervals that are based on the profile t-function.

c Likelihood Profile Traces. The likelihood profile traces are another useful graphical tool. Here the estimated parameters θ̂j, j ≠ k, for fixed θk (see 2.1.d) are considered as functions θ̃j^(k)⟨θk⟩ of θk.
The graphical representation of these functions would fill a whole matrix of diagrams, but without the diagonal. It is worthwhile to combine the “opposite” diagrams of this matrix: over the representation of θ̃j^(k)⟨θk⟩ we superimpose θ̃k^(j)⟨θj⟩ in mirrored form, so that the axes have the same meaning for both functions.
Figure 2.2.c shows one of these diagrams for both of our examples. Additionally, contours of confidence regions for [θ1, θ2] are plotted. It can be seen that the profile traces intersect the contours at points where they have horizontal or vertical tangents.
The representation does not only show the nonlinearities, but is also useful for understanding how the parameters influence each other. To understand this, we go back to the case of a linear regression function. The profile traces in the individual diagrams then consist of two straight lines that intersect at the point [θ̂1, θ̂2]. If we standardize the parameters by using δk⟨θk⟩ from 2.2.a, one can show that the slope of the trace θ̃j^(k)⟨θk⟩ is equal to the correlation coefficient c_kj of the estimated coefficients θ̂j and θ̂k. The “reverse line” θ̃k^(j)⟨θj⟩ then has, compared with the horizontal axis, a slope of 1/c_kj.

Figure 2.2.a.: Profile t-plot for the first parameter for both the Puromycin (left) and the Biochemical Oxygen Demand example (right). The dashed lines show the applied linear approximation and the dotted line the construction of the 99% confidence interval with the help of T1⟨θ1⟩.


Figure 2.2.c.: Likelihood profile traces for the Puromycin and Oxygen Demand examples, with 80% and 95% confidence regions (gray curves).

The angle between the lines is thus a monotone function of the correlation. It therefore measures the collinearity between the two predictor variables. If the correlation between the parameter estimates is zero, then the traces are orthogonal to each other.
For a nonlinear regression function, both traces are curved. The angle between them still shows how strongly the two parameters θj and θk interplay, and hence how their estimators are correlated.

Example d Membrane Separation Technology (cont'd) All profile t-plots and profile traces can be arranged in a triangular matrix of plots, as shown in Figure 2.2.d. Most profile traces are strongly curved, meaning that the regression function tends to strong nonlinearity around the estimated parameter values. Even though the profile traces for θ3 and θ4 are straight lines, a further problem is apparent: the profile traces lie on top of each other! This means that the parameters θ3 and θ4 are strongly collinear. Parameter θ2 is also collinear with θ3 and θ4, although more weakly.

e * Good Approximation of Two-Dimensional Likelihood Contours. The profile traces can be used to construct very accurate approximations of two-dimensional projections of the likelihood contours (see Bates and Watts, 1988). Their calculation is computationally less demanding than that of the corresponding exact likelihood contours.

2.3 Parameter Transformations

a In this section we study the effects of transforming the parameters. This topic is based on the fact that the mean regression function can usually be written down in mathematically equivalent expressions. For example, the two expressions for the Michaelis-Menten function are equivalent:
$$\frac{\theta_1 x}{\theta_2 + x} = \frac{x}{\vartheta_1 + \vartheta_2 x} .$$


Figure 2.2.d.: Profile t-plots and profile traces for the example “Membrane Separation Technology”. The + in the profile t-plots denotes the least squares solution.

Hence, we must have the following relations between the two sets of parameters:
$$\vartheta_1 = \frac{\theta_2}{\theta_1} \quad\text{and}\quad \vartheta_2 = \frac{1}{\theta_1} .$$
In another example we have the two equivalent expressions
$$\theta_1 e^{\theta_2 x} = \vartheta_1 \vartheta_2^{\,x}$$
and the following relations between the two sets of parameters:
$$\vartheta_1 = \theta_1 \quad\text{and}\quad \vartheta_2 = e^{\theta_2} .$$
Why should such equivalent expressions of the mean regression function be of any interest to us?

b Parameter Transformations. Parameter transformations are primarily used to improve the linear approximation and thereby improve the convergence behavior and the quality of the confidence intervals.
We point out that parameter transformations, unlike transformations of the response variable (see 1.1.k), do not change the error part of the model. Hence, they are not helpful if the assumptions about the distribution of the random error are violated. It is the quality of the linear approximation and the statistical statements based on it that are being changed.
Sometimes the transformed parameters are very difficult to interpret. The important questions often concern individual parameters of the original parameter set. Nevertheless, we can work with transformations: We derive more accurate confidence regions for the transformed parameters and can transform them back to get results for the original parameters.

c Restricted Parameter Regions. Often the admissible region of a parameter is re-stricted, e.g. because the regression function is only defined for positive values of aparameter. Usually, such a constraint is ignored to begin with and we wait to seewhether and where the algorithm converges. According to experience, parameter esti-mation will end up in a reasonable range if the model describes the data well and thedata contain enough information for determining the parameters.Sometimes, though, problems occur in the course of the computation, especially if theparameter value that best fits the data lies near the border of the admissible region.The simplest way to deal with such problems is via transformation of the parameter.Examples• The parameter θ should be positive. Through a transformation θ −→ φ = ln〈θ〉 ,

θ = exp〈φ〉 is always positive for all possible values of φ ∈ R :

h〈x, θ〉 −→ h〈x, exp〈φ〉〉.

• The parameter should lie in the interval (a, b). With the log transformationθ = a + (b − a)/(1 + exp〈φ〉), θ can (for arbitrary φ ∈ R) only take values in(a, b).

• In the modelh〈x, θ〉 = θ1 exp〈−θ2x〉+ θ3 exp〈−θ4x〉

with θ2, θ4 > 0 the parameter pairs (θ1, θ2) and (θ3, θ4) are interchangeable, i.e.h〈x, θ〉 does not change. This can create uncomfortable optimization problems,because the solution is not unique. The constraint 0 < θ2 < θ4 that ensures theuniqueness is achieved via the transformation θ2 = exp〈φ2〉 und θ4 = exp〈φ2〉(1+exp〈φ4〉). The function is now

h〈x, (θ1, φ2, θ3, φ4)〉 = θ1 exp 〈− exp〈φ2〉x〉 + θ3 exp 〈− exp〈φ2〉(1 + exp〈φ4〉)x〉 .
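The first example above can be sketched in R as follows. A Michaelis–Menten-type function serves only as an illustration here; the data frame d with columns y and x and the starting values are hypothetical, and the point is merely to fit on the unconstrained log scale and to transform back afterwards.

## Positivity constraint theta2 > 0 handled by estimating phi2 = log(theta2):
fit <- nls(y ~ theta1 * x / (exp(phi2) + x), data = d,
           start = list(theta1 = 20, phi2 = log(0.5)))
exp(coef(fit)["phi2"])   ## back-transformed estimate of theta2 (always positive)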


d Parameter Transformation for Collinearity. A simultaneous variable and parameter transformation can be helpful to weaken collinearity in the partial derivative vectors. For example, the model h〈x, θ〉 = θ1 exp〈−θ2 x〉 has the derivatives

∂h/∂θ1 = exp〈−θ2 x〉 ,    ∂h/∂θ2 = −θ1 x exp〈−θ2 x〉 .

If all x values are positive, both vectors

a1 := (exp〈−θ2 x1〉, . . . , exp〈−θ2 xn〉)^T
a2 := (−θ1 x1 exp〈−θ2 x1〉, . . . , −θ1 xn exp〈−θ2 xn〉)^T

tend to show a disturbing collinearity. This collinearity can be avoided by centering: The model can be written as h〈x; θ〉 = θ1 exp〈−θ2(x − x∗ + x∗)〉. With the re-parameterization φ1 := θ1 exp〈−θ2 x∗〉 and φ2 := θ2 we get

h〈x; φ〉 = φ1 exp〈−φ2 (x − x∗)〉 .

The derivative vectors become approximately orthogonal if we choose the mean value of the xi for x∗.
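A minimal sketch of fitting this centered parameterization with nls follows; the data frame d with columns y and x and the starting values are hypothetical. Comparing summary(..., correlation = TRUE) for the centered and the uncentered fit shows the effect on the correlation of the parameter estimates.

## Centered parameterization phi1 * exp(-phi2 * (x - x.star)):
x.star <- mean(d$x)
fit.c  <- nls(y ~ phi1 * exp(-phi2 * (x - x.star)), data = d,
              start = list(phi1 = 10, phi2 = 0.5))
summary(fit.c, correlation = TRUE)$correlation   ## correlation of the estimates
## back-transformation to the original parameterization:
phi <- coef(fit.c)
c(theta1 = unname(phi["phi1"]) * exp(unname(phi["phi2"]) * x.star),
  theta2 = unname(phi["phi2"]))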

Example e Membrane Separation Technology (cont’d). In this example it is apparent from the approximate correlation matrix (Table 2.3.e, left half) that the parameters θ3 and θ4 are strongly correlated (we have already observed this in 2.2.d using the profile traces). If the model is re-parameterized to

Yi = θ1 + θ2 · 10^(θ̃3 + θ4 (xi − med〈xj〉)) / (1 + 10^(θ̃3 + θ4 (xi − med〈xj〉))) + Ei ,   i = 1, . . . , n ,

where x∗ = med〈xj〉 and θ̃3 = θ3 + θ4 med〈xj〉, an improvement is achieved (right half of Table 2.3.e).

          θ1      θ2      θ3                    θ1      θ2      θ̃3
θ2    −0.256                          θ2    −0.256
θ3    −0.434   0.771                  θ̃3     0.323   0.679
θ4     0.515  −0.708  −0.989          θ4     0.515  −0.708  −0.312

Table 2.3.e.: Correlation matrices for the Membrane Separation Technology example for the original parameters (left) and the transformed parameter θ̃3 (right).


f Reparameterization. In Chapter 2.2 we have presented means for the graphical evaluation of the linear approximation. If the approximation is considered inadequate, we would like to improve it. An appropriate reparameterization can contribute to this. So, for example, for the model

h〈x, θ〉 = θ1 θ3 (x^(2) − x^(3)) / (1 + θ2 x^(1) + θ3 x^(2) + θ4 x^(3))

the reparameterization

h〈x, φ〉 = (x^(2) − x^(3)) / (φ1 + φ2 x^(1) + φ3 x^(2) + φ4 x^(3))

is recommended by Ratkowsky (1985) (see also the exercises).

Example g Membrane Separation Technology (cont’d). The parameter transformation in 2.3.e leads to a satisfactory result as far as the correlation is concerned. If we look at the likelihood contours or the profile t-plot and the profile traces, however, the parameterization is still not satisfactory. An intensive search for further improvements leads to the following transformations, which turn out to have satisfactory profile traces (see Figure 2.3.g):

θ̃1 := (θ1 + θ2 · 10^θ̃3) / (10^θ̃3 + 1) ,    θ̃2 := log10〈 (θ1 − θ2) · 10^θ̃3 / (10^θ̃3 + 1) 〉 ,
θ̃3 := θ3 + θ4 med〈xj〉 ,    θ̃4 := 10^θ4 .

The model is now

Yi = θ̃1 + 10^θ̃2 · (1 − θ̃4^(xi − med〈xj〉)) / (1 + 10^θ̃3 · θ̃4^(xi − med〈xj〉)) + Ei

and we get the result shown in R-Output 2.3.g.

h More Successful Reparametrization. It turned out that a successful reparametrization is very data-set specific. A reason is that nonlinearities and correlations between estimated parameters depend on the (estimated) parameter vector itself. Therefore, no generally valid recipe can be given. This often makes the search for appropriate reparametrizations very difficult.


Figure 2.3.g.: Profile t-plot and profile traces for the Membrane Separation Technology example according to the given transformations.

i Failure of Gaussian Error Propagation. Even if a parameter transformation helps us deal with difficulties in the convergence behavior of the algorithm or with the quality of the confidence intervals, the original parameters often have a physical interpretation. We take the simple transformation example θ −→ φ = ln〈θ〉 from 2.3.c. Fitting the model yields an estimate φ̂ with estimated standard error σ̂_φ̂. An obvious estimator for θ is then θ̂ = exp〈φ̂〉. The standard error of θ̂ can be determined with the help of the Gaussian error propagation law (see Stahel, 2000, Section 6.10):

σ̂²_θ̂ ≈ ( ∂exp〈φ〉/∂φ |_{φ=φ̂} )² · σ̂²_φ̂ = ( exp〈φ̂〉 )² · σ̂²_φ̂    or    σ̂_θ̂ ≈ exp〈φ̂〉 · σ̂_φ̂ .


Formula: delta ∼ TT1 + 10ˆTT2 * (1 - TT4ˆpHR)/(1 + 10ˆTT3 * TT4ˆpHR)

Parameters:
     Estimate Std. Error  t value Pr(>|t|)
TT1 161.60008    0.07389 2187.122  < 2e-16
TT2   0.32336    0.03133   10.322 3.67e-12
TT3   0.06437    0.05951    1.082    0.287
TT4   0.30767    0.04981    6.177 4.51e-07

Residual standard error: 0.2931 on 35 degrees of freedom

Correlation of Parameter Estimates:
      TT1   TT2   TT3
TT2 -0.56
TT3 -0.77  0.64
TT4  0.15  0.35 -0.31

Number of iterations to convergence: 5

Achieved convergence tolerance: 9.838e-06

R-Output 2.3.g: Membrane Separation Technology: Summary of the fit after parameter transformation.

From this we get the approximate 95% confidence interval for θ :

exp〈φ̂〉 ± σ̂_θ̂ · q^{t_{n−p}}_{0.975} = exp〈φ̂〉 · ( 1 ± σ̂_φ̂ · q^{t_{n−p}}_{0.975} ) .

The Gaussian error propagation law is based on the linearization of the transformation function g〈·〉, here concretely of exp〈·〉. We had carried out the parameter transformation because the quality of the confidence intervals left a lot to be desired; unfortunately, this linearization negates what has been achieved, and we are back where we started before the transformation. The correct approach in such cases is to determine the confidence interval as presented in Chapter 2.1. If this is impossible for whatever reason, we can fall back on the following approximation.

j Confidence Intervals on the Original Scale (Alternative Approach). Even though parameter transformations help us in situations where we have problems with the convergence of the algorithm or the quality of confidence intervals, the original parameters often remain the quantity of interest (e.g., because they have a nice physical interpretation). Consider the transformation θ −→ φ = ln〈θ〉. Fitting the model results in an estimator φ̂ and an estimated standard error σ̂_φ̂. Now we can construct a confidence interval for θ: We have to search for all θ for which ln〈θ〉 lies in the interval

φ̂ ± σ̂_φ̂ · q^{t_df}_{0.975} .

Generally formulated: Let g be the transformation of φ to θ = g〈φ〉. Then

{ θ : g^{−1}〈θ〉 ∈ [ φ̂ − σ̂_φ̂ · q^{t_df}_{0.975} , φ̂ + σ̂_φ̂ · q^{t_df}_{0.975} ] }

is an approximate 95% confidence interval for θ. If g^{−1}〈·〉 is strictly monotone increasing, this confidence interval is identical to

[ g〈φ̂ − σ̂_φ̂ · q^{t_df}_{0.975}〉 , g〈φ̂ + σ̂_φ̂ · q^{t_df}_{0.975}〉 ] .


This procedure also ensures that, unlike with the Gaussian error propagation law, the confidence interval lies entirely in the region that is predetermined for the parameter. It is thus impossible that, for example, the confidence interval for θ = exp〈φ〉, unlike the interval from 2.3.i, contains negative values.
However, this approach should only be used if the calculation based on the F-test from Chapter 2.1 is not possible. On the other hand, the author does prefer this approach to the Gaussian error propagation law.
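A small sketch of this back-transformation in R: the fitted object fit and its parameter name phi are hypothetical and stand for a model that was fitted with the transformed parameter φ = ln〈θ〉.

## Back-transforming a confidence interval computed on the scale phi = log(theta):
est <- summary(fit)$coefficients["phi", "Estimate"]
se  <- summary(fit)$coefficients["phi", "Std. Error"]
ci.phi   <- est + c(-1, 1) * qt(0.975, df.residual(fit)) * se
ci.theta <- exp(ci.phi)   ## g<.> = exp<.> is strictly monotone increasing
ci.theta                  ## cannot contain negative values, unlike 2.3.i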

2.4 Bootstrap

a An alternative to profile confidence intervals is to apply the bootstrap method. It is a resampling method and likewise does not rely on the linear approximation. The bootstrap allows an estimation of the sampling distribution of almost any statistic using only very simple techniques. The basic idea of bootstrapping is that inference about a parameter from sample data can be modelled by resampling the sample data and performing inference based on these resampled datasets. That is, bootstrapping treats inference for a parameter θ, given the sample data, as being analogous to inference for the estimated parameter θ̂, given the resampled data. The simplest bootstrap method involves taking the original dataset of n observations and sampling from it to form a new sample (called a ‘resample’ or bootstrap sample) that is also of size n. The bootstrap sample is drawn from the original dataset using sampling with replacement, so it is not identical to the original “real” sample. This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples we compute its parameter estimate (each of these is called a bootstrap estimate). We then have a histogram of bootstrap estimates. This provides an estimate of the shape of the distribution of the parameter estimates, from which we can answer questions about how much the parameter estimates vary.

b In regression analysis, one may resample from the complete observations, that is, from the pairs of response and explanatory variables (yi, xi), or just from the residuals ri. In the latter case the bootstrap observations (y*_i, xi) are constructed by

y*_i = h〈xi, θ̂〉 + r*_i ,

where r*_i is a resampled residual. In either case, we call the bootstrap method nonparametric. There are also parametric bootstrap methods: A parametric version is to assume that the residuals are Gaussian distributed and hence to resample from a Gaussian distribution with expectation 0 and variance σ̂², the variance estimate from the nonlinear regression fit.
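The nonparametric residual bootstrap just described can be sketched in a few lines of R. The object names (D.bod.nls for a fitted nls model and D.bod for its data frame with response O2) are illustrative only; the function nlsBoot() introduced below does essentially this, including the mean-centring of the residuals.

## Residual bootstrap for a fitted nls model -- a minimal sketch.
B    <- 999
fit0 <- fitted(D.bod.nls)
r    <- resid(D.bod.nls)
r    <- r - mean(r)                               ## mean-centre the residuals
boot.est <- matrix(NA, B, length(coef(D.bod.nls)),
                   dimnames = list(NULL, names(coef(D.bod.nls))))
for (b in 1:B) {
  D.star <- D.bod
  D.star$O2 <- fit0 + sample(r, replace = TRUE)   ## y*_i = h<x_i, theta^> + r*_i
  fit.star <- try(update(D.bod.nls, data = D.star), silent = TRUE)
  if (!inherits(fit.star, "try-error")) boot.est[b, ] <- coef(fit.star)
}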

c We will use the R function nlsBoot() from the package nlstools. This means that we use a nonparametric bootstrap approach where the mean-centred residuals are bootstrapped. The residuals are mean-centred because, in a nonlinear regression model, the residuals may have a non-zero mean (Venables and Ripley, 2002, Sec. 8.4). By default, nlsBoot() generates B = 999 datasets, and for each dataset the considered nonlinear regression model is fitted and the resulting parameter estimates are stored.


The bootstrapped values are useful for assessing the marginal distributions of the parameter estimates. So we can start out by comparing them with a normal distribution, which they should ideally follow if the linear approximation holds.

d A basic method to construct confidence limits is based on quantiles of the empirical distribution of the bootstrap parameter estimates. If B = 999 bootstrap simulations are used, then the empirical distribution is based on 1’000 values: the original estimate and the 999 bootstrap estimates. Thus, to construct a 95% bootstrap confidence interval, we take the 25th value and the 975th value among the 1’000 ordered estimates (ordered from smallest to largest) as left and right endpoints, respectively. This type of bootstrap confidence interval is called a bootstrap percentile confidence interval. There are better ways of constructing bootstrap confidence intervals, based on a more natural derivation for statisticians.
The bootstrap confidence interval has the advantage of lying entirely within the range of plausible parameter values. This is in contrast to Wald confidence intervals, which for small sample sizes may occasionally give unrealistic lower and upper limits.
The bootstrap approach is somewhat computer-intensive, as the considered nonlinear regression model has to be refitted numerous times, but for many applications it will still be a feasible approach. Moreover, the linear approximation will improve as the sample size increases, and therefore the bootstrap approach may be most useful for small datasets.
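With a matrix of bootstrap estimates, e.g. boot.est from the sketch in 2.4.b above, the percentile intervals are simply empirical quantiles:

## 95% bootstrap percentile confidence intervals, one row per parameter:
t(apply(boot.est, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))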

Example e Biochemical Oxygen Demand (cont’d). In R-Output 2.4.e the three introduced procedures to determine confidence intervals for the parameters in the Biochemical Oxygen Demand model are applied. The results of the three approaches differ considerably. As we know from 2.1.c, the linear approximation is poor in this example. Hence, we are not surprised to see the difference between the result of the Wald approach and the results of the profile likelihood approach.
The difference between the results of the profile likelihood approach and the result of the bootstrap approach is less obvious. Taking a look at the marginal bootstrap distributions of both parameter estimates and their joint distribution in Figure 2.4.e, we can once again convince ourselves that the linear approximation is unsuitable. Compared to the likelihood contours in 2.1.c, which rely on the Gaussian error assumption, the joint bootstrap distribution does not show a vertical “funnel” on the left-hand side of the joint distribution. This difference yields different confidence intervals.


## Wald:
> h <- summary(D.bod.nls)$coefficients[,2]
> coef(D.bod.nls) + qt(0.975, 19)*cbind("2.5%"=-h, "97.5%"=h)
          2.5%      97.5%
Th1 13.9185602 24.3665858
Th2  0.1060357  0.9561475

## Profile Likelihood:
> confint(D.bod.nls)
Waiting for profiling to be done...
          2.5%     97.5%
Th1 14.0845447 38.482510
Th2  0.1356242  1.810125

## Bootstrap:
> library(nlstools)
> D.bod.nls.Boot <- nlsBoot(D.bod.nls)
> summary(D.bod.nls.Boot)
------
Bootstrap estimates
       Th1       Th2
19.0667361 0.5365059
------
Bootstrap confidence intervals
          2.5%     97.5%
Th1 16.0755863 25.699375
Th2  0.2815905  1.088552

R-Output 2.4.e: Biochemical Oxygen Demand. 95% confidence intervals determined by three different approaches: Wald, profile likelihood, and bootstrap, respectively.


Figure 2.4.e.: Biochemical Oxygen Demand. Both the normal plots of the marginal bootstrap distributions of the parameter estimates (top row) and the scatterplot of the joint bootstrap distribution (bottom left) clearly show the deviation from a Gaussian distribution.


3 Prediction and Calibration

3.1 Prediction

a Besides the question of the set of plausible parameters (with respect to the given data, which we also call the training data set), the question of the range of future observations is often of central interest. The difference between these two questions was already discussed in 1.3.j. In this chapter we want to answer the second question. We assume that the parameter θ is estimated using the least squares method. What can we now say about a future observation Y0 at a given point x0?

Example b Cress. The concentration of an agrochemical material in soil samples can be studied through the growth behavior of a certain type of cress (nasturtium). Six measurements of the response variable Y were made on each of 7 soil samples with predetermined (or measured with the largest possible precision) concentrations x. Hence, we assume that the x-values have no measurement error. The variable of interest is the weight of the cress per unit area after 3 weeks. A “logit-log” model is used to describe the relationship between concentration and weight:

h〈x; θ〉 =  θ1                                    if x = 0 ,
           θ1 / (1 + exp〈θ2 + θ3 ln〈x〉〉)        if x > 0 .

The data and the function h〈·〉 are illustrated in Figure 3.1.b. We can now ask ourselves which weight values we will see at a concentration of, e.g., x0 = 3.

Figure 3.1.b.: Cress example. Left: Representation of the data. Right: A typical shape of the applied regression function.


c Approximate Prediction Intervals. We can estimate the expected value E〈Y0〉 = h〈x0, θ〉 of the variable of interest Y at the point x0 by η̂0 := h〈x0, θ̂〉. We also want to get an interval in which a future observation will lie with high probability. So, we not only have to take into account the randomness of the estimate η̂0, but also the random error E0. Analogous to linear regression, an at least approximate (1 − α) prediction interval is given by

η̂0 ± q^{t_{n−p}}_{1−α/2} · √( σ̂² + ( se〈η̂0〉 )² ) .

The calculation of se〈η̂0〉 can be found in 1.3.h.
* Derivation. The random variable Y0 is the value of interest for an observation with predictor variable value x0. Since we do not know the true curve (actually only the parameters), we have no choice but to study the deviations of the observations from the estimated curve,

R0 = Y0 − h〈x0, θ̂〉 = ( Y0 − h〈x0, θ〉 ) − ( h〈x0, θ̂〉 − h〈x0, θ〉 ) .

Even if θ is unknown, we know the distribution of the expressions in the parentheses: Both are normally distributed random variables and they are independent, because the first only depends on the “future” observation Y0, the second only on the observations Y1, . . . , Yn that led to the estimated curve. Both have expected value 0; the variances add up to

var〈R0〉 ≈ σ² + σ² a0^T (A^T A)^{−1} a0 .

The described prediction interval follows by replacing the unknown values by their corresponding estimates.
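A sketch of this computation in R for the Biochemical Oxygen Demand example, whose regression function is h〈x, θ〉 = θ1(1 − exp〈−θ2 x〉): the fitted object D.bod.nls and the chosen point x0 are illustrative, and the gradient a0 is obtained here by simple finite differences rather than analytically.

## Approximate 95% prediction interval at x0 -- a minimal sketch.
h <- function(x, theta) theta[1] * (1 - exp(-theta[2] * x))
theta.hat <- coef(D.bod.nls)
sigma.hat <- summary(D.bod.nls)$sigma
A  <- D.bod.nls$m$gradient()          ## n x p derivative matrix at theta^
x0 <- 3; eps <- 1e-6
a0 <- sapply(seq_along(theta.hat), function(k) {
  th <- theta.hat; th[k] <- th[k] + eps
  (h(x0, th) - h(x0, theta.hat)) / eps
})
eta0    <- h(x0, theta.hat)
se.eta0 <- sigma.hat * sqrt(drop(t(a0) %*% solve(crossprod(A), a0)))
eta0 + c(-1, 1) * qt(0.975, df.residual(D.bod.nls)) * sqrt(sigma.hat^2 + se.eta0^2)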

d Prediction Versus Confidence Intervals. If the sample size n of the training data set is very large, the estimated variance is dominated by the error variance σ̂². This means that the uncertainty in the prediction is then primarily caused by the random error. The second term in the expression for the variance reflects the uncertainty that is caused by the estimation of θ.
It is therefore clear that the prediction interval is wider than the confidence interval for the expected value, since the random error of the observation must be taken into account. The endpoints of such intervals are shown in Figure 3.2.c (left).

e * Quality of the Approximation. The derivation of the prediction interval in 3.1.c is based on the same approximation as in Chapter 1.3. The quality of the approximation can again be checked graphically.

f Interpretation of the “Prediction Band”. The interpretation of the “prediction band” (as shown in Figure 3.2.c) is not straightforward. From our derivation it holds that

P〈 V*_0〈x0〉 ≤ Y0 ≤ V*_1〈x0〉 〉 = 0.95 ,

where V*_0〈x0〉 is the lower and V*_1〈x0〉 the upper bound of the prediction interval for h〈x0〉. However, if we want to make a prediction about more than one future observation, then the number of observations falling in the prediction band is not binomially distributed with π = 0.95: The events that the individual future observations fall in the band are not independent; they depend on each other through the random borders V*_0 and V*_1. If, for example, the estimate of σ randomly turns out to be too small, the band is too narrow for all future observations, and too many observations would lie outside the band.


3.2 Calibration

a The actual goal of the experiment in the Cress example is to estimate the concentration of the agrochemical material from the weight of the cress. This means that we would like to use the regression relationship in the “wrong” direction. This will cause problems with statistical inference. Such a procedure is often desired to calibrate a measurement method or to predict the result of a more expensive measurement method from a cheaper one. The regression curve in this relationship is often called a calibration curve. Another keyword for finding this topic is inverse regression. Here, we would like to present a simple method that gives a usable result if simplifying assumptions hold.

b Procedure under Simplifying Assumptions. We assume that the predictor values x have no measurement error. In our example this is achieved if the concentrations of the agrochemical material are determined very carefully. For several soil samples with many different possible concentrations we carry out several independent measurements of the response value Y. This results in a training data set that is used to estimate the unknown parameters and the corresponding standard errors.
Now, for a given value y0 it is obvious to determine the corresponding x0 value by simply inverting the regression function:

x̂0 = h^{−1}〈y0, θ̂〉 .

Here, h^{−1} denotes the inverse function of h. However, this procedure is only correct if h〈·〉 is monotone increasing or decreasing. Usually, this condition is fulfilled in calibration problems.
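For the logit-log model of the Cress example this inversion can even be written in closed form. A small sketch (the fitted object cress.nls and its parameter names th1, th2, th3 are hypothetical):

## Inverting y0 = th1 / (1 + exp(th2 + th3*log(x0))) for x0 (valid for 0 < y0 < th1):
x0.hat <- function(y0, theta)
  exp((log(theta["th1"] / y0 - 1) - theta["th2"]) / theta["th3"])
x0.hat(650, coef(cress.nls))   ## estimated concentration for an observed weight of 650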

c Accuracy of the Obtained Values. Of course we now face the question of the accuracy of x̂0. The problem seems to be similar to the prediction problem. However, here we observe y0, and the corresponding value x0 has to be estimated.
The answer can be formulated as follows: We treat x0 as a parameter for which we want a confidence interval. Such an interval can be constructed (as always) from a test. We take as null hypothesis x = x0. As we have seen in 3.1.c, Y0 lies with probability 0.95 in the prediction interval

η̂0 ± q^{t_{n−p}}_{1−α/2} · √( σ̂² + ( se〈η̂0〉 )² ) ,

where η̂0 is a compact notation for h〈x0, θ̂〉. Therefore, this interval is an acceptance interval for the value Y0 (which here plays the role of a test statistic) under the null hypothesis x = x0. Figure 3.2.c illustrates the prediction intervals for all possible values of x0 in the Cress example.

d Illustration. Figure 3.2.c (right) illustrates the approach for the Cress example: Measured values y0 are compatible with parameter values x0 in the sense of the test if the point [x0, y0] lies in the (prediction interval) band. Hence, we can determine the set of x0 values that are compatible with a given observation y0. They form the dashed interval, which can also be described as the set

{ x : | y0 − h〈x, θ̂〉 | ≤ q^{t_{n−p}}_{1−α/2} · √( σ̂² + ( se〈h〈x, θ̂〉〉 )² ) } .


Figure 3.2.c.: Cress example. Left: Confidence band for the estimated regression curve (dashed) and prediction band (solid). Right: Schematic representation of how a calibration interval is determined, at the points y0 = 650 and y0 = 350. The resulting intervals are [0.4, 1.22] and [1.73, 4.34], respectively.

This interval has been constructed by inverting the prediction interval of 3.2.c. It is the desired confidence interval (or calibration interval) for x0.

If we have m measurements for determining y0, we apply the above method to the mean ȳ0 = (1/m) Σ_{j=1}^m y0j:

{ x : | ȳ0 − h〈x, θ̂〉 | ≤ q^{t_{n−p}}_{1−α/2} · √( σ̂²/m + ( se〈h〈x, θ̂〉〉 )² ) } .

e * In the exercises, you will apply the R function f.calib(...) to calculate calibration intervals. The basic idea is to calculate the intersection of the nonlinear curve with one of the horizontal lines by the following R code:

# cal.x    ## x-values of the calibration data: x
# cal.fit  ## fitted/predicted values for cal.x: h<x, theta^>
# limits   ## q^{t_{n-p}}_{1-alpha/2} * sqrt(sigma^2 + se(h<x, theta^>)^2)
# obs.y    ## observed y value for which we want to know x: y0

I1 <- uniroot(function(x){obs.y - approx(cal.x, cal.fit - limits, xout=x)$y},
              range(cal.x))$root
Ic <- uniroot(function(x){obs.y - approx(cal.x, cal.fit, xout=x)$y},
              range(cal.x))$root
I2 <- uniroot(function(x){obs.y - approx(cal.x, cal.fit + limits, xout=x)$y},
              range(cal.x))$root

The first call of the function uniroot(...) calculates the x position of the intersection of the lower prediction bound with the horizontal line; that is, it solves the equation

y0 − ( h〈x, θ̂〉 − q^{t_{n−p}}_{1−α/2} · √( σ̂² + ( se〈h〈x, θ̂〉〉 )² ) ) = 0

with respect to x. uniroot(...) searches for the root in the interval range(cal.x). The function approx(...) is needed in order to interpolate the lower prediction bound of the estimated calibration curve, defined by cal.x and cal.fit - limits. The other two calls of the function uniroot(...) calculate the other intersections.

f In this chapter, only one of many possibilities for determining a calibration interval was presented.


4 Closing Comments

a Reason for the Difficulty in the Biochemical Oxygen Demand Example. Why did we have so many problems with the Biochemical Oxygen Demand example? Let us have a look at Figure 1.1.e and remind ourselves that the parameter θ1 represents the expected oxygen demand for infinite incubation time. It is then clear that θ1 is difficult to estimate, because the horizontal asymptote is badly determined by the given data. If we had more observations with longer incubation times, we could avoid the difficulties with the quality of the confidence intervals of θ.
Also in nonlinear models, a good (statistical) experimental design is essential. The information content of the data is determined through the choice of the experimental conditions, and no (statistical) procedure can deliver information that is not contained in the data.

b Robust Fitting. The original data set of the example ‘Cellulose Membrane’ is presented in the scatter plot of Figure 4.0.b. It shows clearly that there are outliers with respect to a sigmoid curve. Since many such experiments had to be analysed, removing the outliers case by case by hand is cumbersome. Robust methods, as introduced in Ruckstuhl (2020), are much more time-saving and even more objective.
The function nlrob(...) of the R package robustbase implements several methods (see the help page of nlrob(...) for details). Two of the methods are introduced in Ruckstuhl (2020) for the multiple linear regression model: the M-estimator (method="M") and the MM-estimator (method="MM"). The MM-estimator has an S-estimator as initial estimator. In Figure 4.0.b (right) the fit of a robust M-estimator with bisquare ψ function is compared with the result of a least-squares estimator. The function call is given in R-Output 4.0.b.

library(MASS)
library(robustbase)
Mem.rFit0A <- nlrob(delta ~ (T1 + T2*10^(T3+T4*pH))/(10^(T3+T4*pH) + 1),
                    data = I.nmr, start = list(T1=163.7, T2=160, T3=3.3, T4=-0.6),
                    psi = function(u, c, deriv) psi.bisquare(u, c = 4, deriv = 0))

R-Output 4.0.b: R code to fit a robust M-estimator with bisquare ψ function to the Membrane Separation Technology dataset.

c Correlated Errors. Here we have always assumed that the errors Ei are independent. As in linear regression analysis, nonlinear regression models can also be extended to handle correlated errors and random effects.


Figure 4.0.b.: Original data of the Membrane Separation Technology dataset (x = pH, y = chemical shift) shown in a scatter plot (left). On the right-hand side, fits using different estimators are shown: least squares (LS), least squares without the outliers as in the lecture, and a robust M-estimator with bisquare ψ function.

d Statistics Programs. Today most statistics packages contain a procedure that can calculate asymptotic confidence intervals for the parameters. In principle it is then possible to calculate “t-profiles” and profile traces, because they are also based on the fitting of nonlinear models (on a reduced set of parameters).
In R, the function nls is available, which is based on the work of Bates and Watts (1988). The “library” nlme contains R functions that can fit nonlinear regression models with correlated errors (gnls) and random effects (nlme). These implementations are based on the book “Mixed-Effects Models in S and S-PLUS” by Pinheiro and Bates (2000).

e Literature Notes. These notes are mainly based on the book of Bates and Watts (1988). A mathematical discussion of the statistical and numerical methods in nonlinear regression can be found in Seber and Wild (1989). The book of Ratkowsky (1989) contains many nonlinear functions h〈·〉 that are primarily used in biological applications.
A short introduction to this topic using the statistics program R can be found in Venables and Ripley (2002) and in Ritz and Streibig (2008).
More details on the application of the bootstrap method in nonlinear regression modelling can be found, e.g., in Ritz and Streibig (2008) and Huet, Bouvier, Poursat and Jolivet (2010). In the latter there is also a discussion of non-constant variance (i.e., heteroscedastic) models. It is also worth taking a look at the book of Carroll and Ruppert (1988).


A Appendix

A.1 The Gauss-Newton Method

a The Optimization Problem. To determine the least squares estimator we must minimize the sum of squares

S〈θ〉 := Σ_{i=1}^n ( yi − ηi〈θ〉 )² .

Unfortunately this cannot be carried out explicitly as in linear regression. With local linear approximations of the regression function, however, the difficulties can be overcome.

b Solution Procedure. The procedure is set out in four steps; a minimal R sketch of the whole iteration follows after step 4. We consider the (j+1)-th iteration of the procedure. To simplify, we assume that the regression function h〈·〉 has only one unknown parameter. The solution from the previous iteration is denoted by θ^(j); θ^(0) is then the notation for the starting value.
1. Approximation: The regression function h〈xi, θ〉 = ηi〈θ〉 with the one-dimensional parameter θ is approximated at the point θ^(j) by a line:

ηi〈θ〉 ≈ ηi〈θ^(j)〉 + ( ∂ηi〈θ〉/∂θ |_{θ^(j)} ) (θ − θ^(j)) = ηi〈θ^(j)〉 + ai〈θ^(j)〉 (θ − θ^(j)) .

* For a multidimensional θ the approximation uses a hyperplane:

h〈xi, θ〉 = ηi〈θ〉 ≈ ηi〈θ^(j)〉 + a_i^(1)〈θ^(j)〉 (θ1 − θ1^(j)) + . . . + a_i^(p)〈θ^(j)〉 (θp − θp^(j)) ,

where

a_i^(k)〈θ〉 := ∂h〈xi, θ〉 / ∂θk ,   k = 1, . . . , p .

With vectors and matrices the above equation can be written as

η〈θ〉 ≈ η〈θ^(j)〉 + A^(j) (θ − θ^(j)) .

Here the (n × p) derivative matrix A^(j) consists of the elements {a_ik = a_i^(k)〈θ〉} evaluated at the point θ = θ^(j) of the j-th iteration.

2. Local Linear Model: We now assume that the approximation in the 1st step holds exactly for the true model. With this we get for the residuals

r_i^(j+1) = yi − { ηi〈θ^(j)〉 + ai〈θ^(j)〉 (θ − θ^(j)) } = y_i^(j+1) − ai〈θ^(j)〉 β^(j+1)

with y_i^(j+1) := yi − ηi〈θ^(j)〉 and β^(j+1) := θ − θ^(j).

* For a multidimensional θ it holds that

r^(j+1) = y − { η〈θ^(j)〉 + A^(j) (θ − θ^(j)) } = y^(j+1) − A^(j) β^(j+1)

with y^(j+1) := y − η〈θ^(j)〉 and β^(j+1) := θ − θ^(j).


3. Least Squares Estimation in the Locally Linear Model: To find the best-fitting β^(j+1) for the data, we minimize the sum of squared residuals Σ_{i=1}^n ( r_i^(j+1) )². This is the usual linear least squares problem with y_i^(j+1) ≡ yi, ai〈θ^(j)〉 ≡ xi and β^(j+1) ≡ β (the line goes through the origin). The solution of this problem, β̂^(j+1), gives the best solution on the approximating line.

* To find the best-fitting β^(j+1) in the multidimensional case, we minimize the sum of squared residuals ‖r^(j+1)‖². This is the usual linear least squares problem with y^(j+1) ≡ y, A^(j) ≡ X and β^(j+1) ≡ β. The solution of this problem gives a β̂^(j+1), so that the point

η〈θ^(j+1)〉 with θ^(j+1) = θ^(j) + β̂^(j+1)

lies nearer to y than η〈θ^(j)〉.

4. Iteration: Now with θ^(j+1) = θ^(j) + β̂^(j+1) we return to step 1 and repeat steps 1, 2, and 3 until this procedure converges. The converged solution minimizes S〈θ〉 and thus corresponds to the desired estimate θ̂.
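The following R sketch implements these four steps for the multidimensional case. The model function h(x, theta), its derivative matrix grad.h(x, theta) and the data are generic placeholders, and no step-size control or sophisticated convergence checks (as used in serious implementations, cf. Bates and Watts, 1988) are included.

## Plain Gauss-Newton iteration -- a minimal sketch without step-size control.
## h(x, theta) returns the vector of function values, grad.h(x, theta) the
## n x p derivative matrix A; both are placeholders supplied by the user.
gauss.newton <- function(y, x, theta0, h, grad.h, maxit = 50, tol = 1e-8) {
  theta <- theta0
  for (j in 1:maxit) {
    eta  <- h(x, theta)              ## step 1: evaluate at theta^(j)
    A    <- grad.h(x, theta)         ##         derivative matrix A^(j)
    ytil <- y - eta                  ## step 2: working response y^(j+1)
    beta <- qr.coef(qr(A), ytil)     ## step 3: linear least squares
    theta <- theta + beta            ## step 4: update theta^(j+1)
    if (sum(beta^2) < tol^2 * sum(theta^2)) break    ## crude convergence check
  }
  list(coefficients = theta, iterations = j, residuals = y - h(x, theta))
}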

c Further Details. This minimization procedure, which is known as the Gauss-Newton method, can be further refined. However, there are also other minimization methods available. It must be specified in more detail what "converged" should mean and how convergence can be achieved. The details about these technical questions can be found in, e.g., Bates and Watts (1988). For us it is of primary importance to see that iterative procedures must be applied to solve the optimization problem in 1.2.a.


Bibliography

Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis & Its Applications, John Wiley & Sons.

Carroll, R. and Ruppert, D. (1988). Transformation and Weighting in Regression, Wiley, New York.

Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data, John Wiley & Sons, New York.

Huet, S., Bouvier, A., Poursat, M.-A. and Jolivet, E. (2010). Statistical Tools for Nonlinear Regression: A Practical Guide with S-Plus and R Examples, 2nd edn, Springer-Verlag, New York.

Pinheiro, J. C. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS, Statistics and Computing, Springer.

Rapold-Nydegger, I. (1994). Untersuchungen zum Diffusionsverhalten von Anionen in carboxylierten Cellulosemembranen, PhD thesis, ETH Zurich.

Ratkowsky, D. A. (1985). A statistically suitable general formulation for modelling catalytic chemical reactions, Chemical Engineering Science 40(9): 1623–1628.

Ratkowsky, D. A. (1989). Handbook of Nonlinear Regression Models, Marcel Dekker, New York.

Ritz, C. and Streibig, J. C. (2008). Nonlinear Regression with R, Use R!, Springer-Verlag.

Ruckstuhl, A. (2018). Robust Fitting of Parametric Models Based on M-Estimation. Lecture notes in WBL Angewandte Statistik, ETHZ.

Ruckstuhl, A. (2020). Robust Fitting of Parametric Models Based on M-Estimation. Lecture notes in WBL Angewandte Statistik, ETHZ.

Seber, G. and Wild, C. (1989). Nonlinear Regression, Wiley, New York.

Stahel, W. (2000). Statistische Datenanalyse: Eine Einführung für Naturwissenschaftler, 3. Auflage, Vieweg, Braunschweig/Wiesbaden.

Venables, W. N. and Ripley, B. (2002). Modern Applied Statistics with S, Statistics and Computing, fourth edn, Springer-Verlag, New York.
