
Linköping Studies in Science and Technology
Thesis No. 601

Just-in-Time Models with Applications to Dynamical Systems

Anders Stenman

REGLERTEKNIK

AUTOMATIC CONTROL

LINKÖPING

Division of Automatic Control
Department of Electrical Engineering

Linköping University, S-581 83 Linköping, Sweden

March 1997


Just-in-Time Models with Applications to Dynamical Systems

© 1997 Anders Stenman,

[email protected]

Department of Electrical Engineering, Linköping University, S-581 83 Linköping,

Sweden

LIU-TEK-LIC-1997:02

ISBN 91-7871-898-8 ISSN 0280-7971


To Maria


Abstract

System identification deals with the problem of estimating models of dynamical systems given observations from the systems. In this thesis we focus on the nonlinear modeling problem, and, in particular, on the situation that occurs when a very large amount of data is available.

Traditional treatments of the estimation problem in statistics and system identification have mainly focused on global modeling approaches, i.e., the model has been optimized using the entire data set. However, when the number of samples becomes large, this approach becomes less attractive, mainly because of the computational complexity.

We instead assume that all observations are stored in a database, and that models are built dynamically as the actual need arises. When a model is really needed in a neighborhood around an operating point, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, the name Just-in-Time models has been adopted.

It is proposed that the Just-in-Time estimator is formed as a weighted average of the data in the neighborhood, where the weights are optimized such that the pointwise mean square error (MSE) measure is minimized. The number of data retrieved from the database is determined using a local bias/variance error trade-off. This is closely related to the nonparametric kernel estimation concept which is commonly used in statistics. A review of kernel methods is therefore presented in one of the introductory chapters.

The asymptotic properties of the method are investigated. It is shown that the Just-in-Time estimator produces consistent estimates, and that the convergence rate as a function of the sample size is of the same order as for the kernel methods.

Two important applications for the concept are presented. The first one considers nonlinear time domain identification, which is the problem of predicting the outputs of nonlinear dynamical systems given data sets of past inputs and outputs of the systems. The second one occurs within frequency domain identification, when one is faced with the problem of estimating the frequency response function of a linear system.

Compared to global methods, the advantage of Just-in-Time models is that they are optimized locally, which might increase the performance. A possible drawback is the computational complexity, both because we have to search for neighborhoods in a multidimensional regressor space, and because the derived estimator is quite demanding in terms of computational effort.


Acknowledgments

I am very grateful to all the people who have supported me during my work on this thesis.

First of all, I would like to thank my supervisors, Prof. Lennart Ljung and Dr. Fredrik Gustafsson, for their excellent guidance through the work. Especially Fredrik deserves my deepest gratitude for putting up with all my stupid questions. I am also indebted to our former visitors, Daniel Rivera and Alexander Nazin, for insightful discussions about my work and for giving interesting new ideas and proposals for future research.

Dr. Peter Lindskog and Dr. Jonas Sjöberg have read the thesis thoroughly and have given valuable comments and suggestions for improvements. For this I am very grateful. Peter has also provided the pictures that are used in the examples at the end of Chapter 6.

I also want to thank Mattias Olofsson for keeping the computers running, and all the hackers and volunteers all around the world that provide such excellent and free software tools as LaTeX and XEmacs¹.

Finally, I would like to thank Maria for all ♥ and support during the writing of this thesis. I guess that it is my turn to do all the cleaning and washing the following months. ;-)

This work was supported by the Swedish Research Council for Engineering Sciences (TFR), which is gratefully acknowledged.

Linköping, March 1997

Anders Stenman

¹ C-u 50 M-x all-hail-xemacs


Contents

Notation

1 Introduction
  1.1 The Regression Problem
  1.2 Parametric Methods
  1.3 Nonparametric Methods
  1.4 The System Identification Problem
  1.5 Just-in-Time Models
  1.6 Applications
  1.7 Thesis Outline
  1.8 Contributions

2 Parametric Methods
  2.1 Parametric Regression Models
  2.2 Parametric Models in System Identification
      2.2.1 Linear Black-box Models
      2.2.2 Nonlinear Black-box Models
  2.3 Parameter Estimation
      2.3.1 Linear Least Squares
      2.3.2 Nonlinear Least Squares
  2.4 Asymptotic Properties of the Model

3 Nonparametric Methods
  3.1 The Basic Smoothing Problem
  3.2 Local Polynomial Kernel Estimators
  3.3 K-Nearest Neighbor Estimators
  3.4 Statistical Properties of Kernel Estimators
      3.4.1 The MSE and MISE Criteria
      3.4.2 Asymptotic MSE Approximation
  3.5 Bandwidth Selection
  3.6 Extensions to the Multivariable Case
  3.A Appendix: Proof of Theorem 3.1

4 Just-in-Time Models
  4.1 The Just-in-Time Idea
  4.2 The Just-in-Time Estimator
  4.3 Optimal Weights
      4.3.1 The MSE Formula
      4.3.2 Optimizing the Weights
      4.3.3 Properties of the Weight Sequence
      4.3.4 Comparison with Kernel Weights
  4.4 Estimation of Hessian and Noise Variance
      4.4.1 Using Linear Least Squares
      4.4.2 Using a Weighted Mean
  4.5 Data selection
      4.5.1 Neighborhood size
      4.5.2 Neighborhood shape
  4.6 The Just-in-Time Algorithm
  4.7 Properties of the Just-in-Time Estimator
      4.7.1 Computational Aspects
      4.7.2 Convergence

5 Asymptotic Properties of the Just-in-Time Estimator
  5.1 Asymptotic Properties of the Scalar Estimator
  5.A Appendix: Some Power Series Formulas
  5.B Appendix: Moments of a Uniformly Distributed Random Variable

6 Applications to Dynamical Systems
  6.1 Nonparametric System Identification
  6.2 A Linear System
  6.3 A Nonlinear System
  6.4 Tank Level Modeling
  6.5 Water Heating Process

7 Applications to Frequency Response Estimation
  7.1 Traditional Methods
      7.1.1 Properties of the ETFE
      7.1.2 Smoothing the ETFE
      7.1.3 Asymptotic Properties of the Estimate
      7.1.4 An Example
  7.2 Using the Just-in-Time Approach
  7.3 Aircraft Flight Flutter Data
  7.4 Summary

8 Summary & Conclusions

Bibliography

Subject Index


Notation

Abbreviations

AIC     Akaike's Information Theoretic Criterion
AMISE   Asymptotic Mean Integrated Squared Error
AMSE    Asymptotic Mean Squared Error
ETFE    Empirical Transfer Function Estimate
FPE     Akaike's Final Prediction Error
JIT     Just-in-Time
MISE    Mean Integrated Squared Error
MSE     Mean Squared Error
RMSE    Root Mean Squared Error

Symbols

C      The set of complex numbers
R      The set of real numbers
R^d    Euclidean d-dimensional space
I_d    The d × d identity matrix
∼      a_N ∼ b_N, if and only if lim_{N→∞}(a_N/b_N) = 1
o      a_N = o(b_N), if and only if lim sup_{N→∞} |a_N/b_N| = 0
O      a_N = O(b_N), if and only if lim sup_{N→∞} |a_N/b_N| < ∞
Ω_M    A neighborhood around the current operating point that contains M data
λ      Noise variance


Operators and Functions

arg min_x f(x)   The minimizing argument of the function f(·) with respect to x
E X              Mathematical expectation of the random variable X
Var X            Variance of the random vector X
vec A            The vector of the matrix A, obtained by stacking the columns of A underneath each other in order from left to right
vech A           The vector-half of the matrix A, obtained from vec A by eliminating the above-diagonal entries of A
D_f(x)           The d × 1 derivative vector whose ith entry is equal to (∂/∂x_i)f(x)
H_f(x)           The d × d Hessian matrix whose (i, j)th entry is equal to (∂²/∂x_i ∂x_j)f(x)
µ_k(f)           ∫ x^k f(x) dx
R(f)             ∫ f²(x) dx
A^T              The transpose of the matrix A
tr A             The trace of the matrix A


1 Introduction

The problem considered in this thesis is how to derive relationships between inputs and outputs of a dynamical system when very little a priori knowledge is available. In traditional system identification literature, this is usually known as black-box modeling. Very rich and well established theory for black-box modeling of linear systems exists, see for example [30] and [43]. In recent years the interest in nonlinear system identification has been growing, and the attention has been focused on a number of nonlinear black-box structures, such as neural networks and wavelets [42]. However, nonlinear identification has been studied for a long time within the statistical community, where it is known under the name nonparametric regression.

1.1 The Regression Problem

Within many areas of science one often wishes to study the relationship between a number of variables. The purpose could for example be to predict the outcome of one of the variables on the basis of information provided by the others. In statistical theory this is usually referred to as the regression problem. The objective is to determine a functional relationship between a predictor variable (or regression variable) X ∈ R^n and a response variable Y ∈ R, given a set of observations (X1, Y1), . . . , (XN, YN). In a mathematical sense, the problem is to find a function of X, f(X), such that the difference

Y − f(X) (1.1)


becomes small in some sense. It is a well-known fact that the function f that minimizes

E(Y − f(X))2 (1.2)

is the conditional expectation of Y given X,

f(X) = E(Y |X). (1.3)

This function, the best mean square predictor of Y given X, is often called the regression function or the regression of Y on X. If N data points have been collected, the regression relation can be modeled as

Yi = f(Xi) + ei, i = 1, . . . , N, (1.4)

where ei are identically distributed random variables with zero means, which are independent of the predictor data Xi.

The task of estimating the regression function f from observations can be done in essentially two different ways. The quite commonly used parametric approach is to assume that the function f has a pre-specified form, for instance a hyperplane with unknown slope and offset. As an alternative one could try to estimate f nonparametrically without reference to a specific form.

1.2 Parametric Methods

Parametric estimation methods rely on the assumption that the true regression function f has a pre-specified functional form, which can be fully described by a finite dimensional parameter vector θ:

f(x, θ). (1.5)

The structure of the model is chosen from families that are known to be flexible and which have been successful in previous applications. This means that the parameters are not necessarily required to have any physical meaning; they are just tuned to fit the observed data as well as possible.

It is natural to consider the parameterization (1.5) as an expansion of basis functions gi(·), i.e.

f(x, θ) = Σ_{i=1}^{r} αi gi(x, βi, γi).

This formulation allows the dependence of f on some of the components of θ to be linear while it is nonlinear in others. This is for instance the situation in some commonly used special cases of basis functions, as shown in Table 1.1, which will be more thoroughly described in Chapter 2.


Modeling Approach             gi(x, βi, γi)
Fourier series                sin(x βi + γi)
Feedforward neural networks   σ(x^T βi + γi),  σ(x) = 1/(1 + e^{−x})
Radial basis functions        κ(‖x − γi‖ βi),  κ(x) = e^{−x²}
Linear regression             xi

Table 1.1 Basis functions used in some common modeling approaches.

Once a particular model structure is chosen, the parameters can be obtained from the observations by an optimization procedure that minimizes the prediction errors in a global least squares fashion

θ̂ = arg min_θ Σ_{i=1}^{N} (Yi − f(Xi, θ))².   (1.6)

This optimization problem usually has to be solved using a numerical search routine, except in the linear case where an explicit solution exists. In the nonlinear case the criterion will typically have numerous local minima, which in general makes the search for the desired global minimum hard [9].

The greatest advantage of parametric models is that they give a very compact description of the data set once the parameter vector estimate θ̂ is computed. A drawback, however, is the required assumption of the imposed parameterization. Sometimes the assumed function family (or model structure) might be too restrictive or too low-dimensional (i.e., too few parameters) to fit unexpected features in the data.

1.3 Nonparametric Methods

The problems with parametric regression methods can be overcome by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model. There exist several methods for obtaining nonparametric estimates of the functional relationship, ranging from the simple nearest neighbor method to more advanced smoothing techniques. A fundamental assumption is that observations located close to each other are related, so that an estimate at a certain operating point x can be constructed from observations in a small neighborhood around x.

The simplest nonparametric method is perhaps the nearest neighbor approach. The estimate f̂(x) is taken as the response variable Yk that corresponds to the


regression vector Xk that is the nearest neighbor of x, i.e.

f̂(x) = Yk,   k = arg min_k |Xk − x|.

Hence the estimation problem is essentially reduced to a data set searching problem, rather than a modeling problem.
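As a concrete illustration, the following sketch (not from the thesis; data and names are made up) implements the nearest neighbor estimate for scalar regressors.

```python
import numpy as np

# Minimal sketch (illustrative): the nearest neighbor estimate
# f_hat(x) = Y_k with k = argmin_k |X_k - x|.

def nearest_neighbor_estimate(x, X, Y):
    """Return the response of the stored observation closest to x."""
    k = np.argmin(np.abs(X - x))   # search the data set for the nearest regressor
    return Y[k]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(0, 1, 200))
    Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=X.size)  # hypothetical data
    print(nearest_neighbor_estimate(0.25, X, Y))               # roughly sin(pi/2) = 1
```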

Despite its simplicity, the nearest neighbor method suffers from a major drawback. The observations are almost always corrupted by measurement noise. Hence the nearest neighbor estimate is in general a very poor and noisy estimate of the true function value. Significant improvements can therefore be achieved using an interpolation or smoothing operation,

f̂(x) = Σ_i wi Yi,   (1.7)

where wi denotes a sequence of weights which may depend on x and the predictor variable data Xi. This weight sequence can of course be selected in many ways. An often used approach in statistics is to select the weights according to a kernel function [21],

wi = Kh(Xi − x),

which explicitly specifies the shape of the weight sequence. A similar approach is considered in signal processing applications, where the so-called Hamming window is frequently used for smoothing [2].

The weights in (1.7) are typically tuned by a smoothing parameter which controls the degree of local averaging, i.e., the size of the neighborhood around x. A too large neighborhood will include observations located far away from x, whose expected values may differ significantly from f(x), and as a result the estimator will produce an “over-smoothed” or biased estimate. When using a too small neighborhood, on the other hand, only a small number of observations will contribute to the estimate at x, hence making it “under-smoothed” or noisy. The basic problem in nonparametric methods is thus to find the optimal choice of the smoothing parameter that will balance the bias error against the variance error.

The advantage of nonparametric models is their flexibility, since they allow predictions to be computed without reference to a fixed parametric model. The price that has to be paid for this is the computational complexity. In general, nonparametric methods require more computations than parametric ones. The convergence rate with respect to the sample size N is also slower than for parametric methods.

1.4 The System Identification Problem

System identification is a special case of the regression problem presented in Section 1.1. It deals with the problem of determining mathematical models of dynamical


systems on the basis of observed data from the systems. Having collected a data set of paired inputs and outputs

S = {(u(t), y(t))}_{t=1}^{N}

from a system, the goal in time domain system identification is typically to try to model future outputs of the system as a function of past inputs and outputs,

y(t) = f(ϕ(t)) + e(t), (1.8)

where ϕ(t) is a so-called regression vector which consists of past data,

ϕ(t) = (y(t− 1), y(t− 2), . . . , u(t− 1), u(t− 2), . . . )T ,

and e(t) is an error term which accounts for the fact that in general it is not possible to model y(t) as an exact function of past observations. Nevertheless, a requirement must be that the error term is small or white, so that we can treat f(ϕ(t)) as a good prediction of y(t),

ŷ(t|t−1) = f(ϕ(t)).

The system identification problem is thus to find a “good” function f(·) such that the discrepancy between the true and the predicted outputs,

y(t) − ŷ(t|t−1),

is minimized.

The problem of estimating ŷ(t|t−1) = f(ϕ(t)) from experimental data with poor or no a priori knowledge of the system is usually referred to as black-box modeling [30]. It has traditionally been solved using parametric linear models of different sophistication, but problems usually occur when encountering highly nonlinear systems, which do not lend themselves well to approximation by linear models. As a consequence of this, the interest in nonlinear modeling alternatives like neural networks and radial basis functions has been growing in recent years [42, 6].

As an alternative one could apply nonparametric methods of the type described in Section 1.3. Then the predictor will be of the form

ŷ(t|t−1) = Σ_{k=−∞}^{t−1} wk y(k),   (1.9)

where the weights are constructed such that they give measurements located close to ϕ(t) more influence than those located far away from it.

Example 1.1 (Lindskog [29])
Consider the laboratory-scale tank system shown in Figure 1.1 (a). Suppose the modeling aim is to describe how the water level h(t) changes with the voltage u(t) that controls the pump, given a data set that consists of 1000 observations of u(t)


and h(t). The data set is plotted in Figure 1.1 (b).

A reasonable assumption is that the water level at the current time instant t can be expressed in terms of the water level and the pump voltage at the previous time instant t − 1, i.e.,

ĥ(t|t−1) = f(h(t−1), u(t−1)).

Assuming that the function f(·) can be described by a linear regression

ĥ(t|t−1) = θ1 h(t−1) + θ2 u(t−1) + θ0,

the parameters θi can easily be estimated using linear least squares, resulting in

θ1 = 0.9063, θ2 = 1.2064, and θ0 = −5.1611.

The result from a simulation is shown in Figure 1.1 (c). The solid line represents the measured water level, and the dashed line corresponds to a simulation using the estimated parameters. As shown, the simulated water level follows the true level quite well, except at levels close to zero, where the linear model produces negative levels. This indicates that the true system is nonlinear, and that better results could be achieved using a nonlinear or a nonparametric model. Figure 1.1 (d) shows a simulation using a nonparametric model of the type (1.9). The performance of the model is clearly much better at low water levels in this case.
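The linear model fit in this example amounts to one call to a linear least squares solver. The sketch below reproduces the procedure on synthetic data, since the original tank measurements are not available here; all numerical values are illustrative.

```python
import numpy as np

# Sketch of the model fit in Example 1.1 (assumption: the measured sequences h
# and u are available as arrays; the numbers below are synthetic, not the thesis
# data). The one-step model is h_hat(t) = th1*h(t-1) + th2*u(t-1) + th0.

def fit_tank_model(h, u):
    """Estimate (th1, th2, th0) by linear least squares."""
    Phi = np.column_stack([h[:-1], u[:-1], np.ones(len(h) - 1)])
    theta, *_ = np.linalg.lstsq(Phi, h[1:], rcond=None)
    return theta

def simulate_tank_model(theta, h0, u):
    """Simulate the estimated model driven by the input only."""
    th1, th2, th0 = theta
    h_sim = [h0]
    for uk in u[:-1]:
        h_sim.append(th1 * h_sim[-1] + th2 * uk + th0)
    return np.array(h_sim)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    u = 5 + rng.normal(size=1000)                       # hypothetical pump voltage
    h = np.zeros(1000)
    for t in range(1, 1000):                            # hypothetical "true" tank
        h[t] = 0.9 * h[t - 1] + 1.2 * u[t - 1] - 5 + 0.2 * rng.normal()
    theta = fit_tank_model(h, u)
    print(theta)                                        # roughly (0.9, 1.2, -5)
    h_sim = simulate_tank_model(theta, h[0], u)
```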

A traditional application of nonparametric methods in system identification is in the frequency domain when estimating the transfer function of a system. If the system considered is linear, i.e., if it can be modeled by the input-output relation

y(t) = G0(q)u(t) + e(t), t = 1, . . . , N, (1.10)

an estimate of the transfer function G0(q) can be formed as the ratio between the Fourier transforms of the output and input signals,

Ĝ_N(e^{iω}) = Y_N(ω) / U_N(ω).   (1.11)

This estimate is often called the empirical transfer function estimate (ETFE), since it is formed with no other assumptions than linearity of the system [30].

It is well-known that the ETFE is a very crude estimate of the true transfer function. This is due to the fact that the observations (y(t), u(t)) are corrupted by measurement noise e(t), which propagates to the ETFE through the Fourier transform. In particular, for sufficiently large N, the ETFE can be written

Ĝ_N(e^{iω}) = G_0(e^{iω}) + ρ_N(ω),   (1.12)

where ρ_N(ω) is a complex disturbance with zero mean and variance proportional to the noise-to-input signal ratio. Hence the transfer function can be estimated in a nonparametric fashion

Ĝ(e^{iω_0}) = Σ_k wk Ĝ_N(e^{iω_k}),


where the weights again are selected so that a good trade-off between the bias and the variance is achieved.

Figure 1.1 (a) A simple tank system with water level h(t) (tank height 35 cm) and pump voltage u(t). (b) Experimental data: the water level h(t) [cm] and the pump voltage u(t) [V] plotted against time [min]. (c) The result of a simulation with a linear model. Solid: True water level. Dashed: Simulated water level. (d) The result of a simulation with a nonparametric model. Solid: True water level. Dashed: Simulated water level.
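To make the frequency domain procedure concrete, the following sketch (illustrative, not the estimator developed later in the thesis) forms the ETFE (1.11) from simulated input-output data and smooths it with a fixed Hamming-shaped frequency window, in the spirit of the weighted sum above.

```python
import numpy as np

# Minimal sketch (illustrative): form the ETFE as the ratio of the discrete
# Fourier transforms of output and input, then smooth it with a fixed
# frequency window, cf. (1.11)-(1.12).

def etfe(u, y):
    """Empirical transfer function estimate at the DFT grid frequencies."""
    U = np.fft.rfft(u)
    Y = np.fft.rfft(y)
    return Y / U

def smooth_etfe(G, half_width=5):
    """Weighted local average of the ETFE with a Hamming-shaped window."""
    w = np.hamming(2 * half_width + 1)
    w = w / w.sum()
    # Convolve real and imaginary parts separately; 'same' keeps the grid size.
    return np.convolve(G.real, w, mode="same") + 1j * np.convolve(G.imag, w, mode="same")

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    N = 4096
    u = rng.normal(size=N)                         # white input
    y = np.zeros(N)
    for t in range(1, N):                          # hypothetical first-order system
        y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1]
    y += 0.1 * rng.normal(size=N)                  # measurement noise
    G_raw = etfe(u, y)
    G_smooth = smooth_etfe(G_raw)
```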


1.5 Just-in-Time Models

The main contribution in this thesis is Just-in-Time estimators, which is another approach to obtaining nonparametric estimates of nonlinear regression functions on the basis of observed data. Traditionally, in the system identification literature and in statistics, regression problems have been solved by global modeling methods, like kernel methods, neural networks or other nonlinear parametric models [42], but when dealing with very large data sets this approach becomes less attractive. For real industrial applications, for example in the chemical process industry, the volume of data may occupy several gigabytes.

The global modeling process is in general associated with an optimization step as in (1.6). This optimization problem is typically non-convex and will have a number of local minima, which makes the solution difficult. Although the global model has the appealing feature of giving a high degree of data compression, it seems both inefficient and unnecessary to spend a large amount of computation on optimizing a model which is valid over the whole regressor space, while in most cases it is more likely that we will only visit a very restricted subset of it.

Inspired by ideas and concepts from the database research area, we will take a conceptually different point of view. We assume that all observations are stored in a database, and that the models are built dynamically as the actual need arises. When a model is really needed in a neighborhood of an operating point x, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, we have adopted the name Just-in-Time models, suggested by [9].

As in (1.7) it is assumed that the Just-in-Time predictor is formed as a weighted average of the response variables in a neighborhood around x,

f̂_JIT(x) = Σ_i wi Yi,

where the weights wi are optimized in such a way that the pointwise mean square error (MSE) measure is minimized.
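A conceptual sketch of the prediction step is given below. It retrieves the M nearest observations and forms a weighted average; the simple inverse-distance weights are only a placeholder for the MSE-optimal weights derived in Chapter 4, and all names and data are illustrative.

```python
import numpy as np

# Conceptual sketch of the Just-in-Time prediction step (simplified): retrieve
# the M observations closest to the operating point x from the stored data and
# form a weighted average of their responses. The thesis optimizes the weights
# from a local MSE criterion; a distance-based weighting is used here instead.

def jit_predict(x, X, Y, M=20):
    """Return a locally weighted average of the M nearest responses."""
    dist = np.linalg.norm(X - x, axis=1)          # distances in regressor space
    idx = np.argsort(dist)[:M]                    # indices of the neighborhood
    w = 1.0 / (dist[idx] + 1e-8)                  # placeholder weights, not the optimal ones
    w = w / w.sum()
    return np.dot(w, Y[idx])

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.uniform(-1, 1, size=(5000, 2))                          # stored "database"
    Y = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.05 * rng.normal(size=5000)
    print(jit_predict(np.array([0.5, 0.5]), X, Y))                  # roughly sin(pi/2)*0.5
```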

Compared to global methods, the advantage of Just-in-Time models is that the modeling is optimized locally, which might increase the performance. A possible drawback is the computational complexity, as we both have to search for a neighborhood of x in a multidimensional space, and as the derived estimator is quite computationally intensive. In this thesis, however, we will only investigate the properties of the modeling part of the problem. The searching problem will be left as a topic for future research.

1.6 Applications

The reasons and needs for estimating a model of a system can of course be many. When dealing with dynamical systems, some of the reasons can be as follows.


• One obvious reason, which has already been mentioned briefly in Section 1.4, is prediction or forecasting. Based on the observations we have collected so far, we will be able to predict the future behavior of the system. Conceptually speaking, this can be described as in Figure 1.2. The predictor/estimator takes a data set and an operating point x as inputs, and uses some suitable modeling approach, parametric or nonparametric, to produce an estimate f̂(x).

• Modern control theory usually requires a model of the process to be controlled. One example is predictive control, where the control signal from the regulator is optimized on the basis of predictions of future outputs of the system.

• System analysis and fault detection in general require investigation or monitoring of certain parameters which may not be directly available through measurements. Therefore we will have to derive their values using a model of the system.

Figure 1.2 A conceptual view of modeling. The estimator takes the data set (Yk, Xk), k = 1, . . . , N, and an operating point x as inputs, and produces the estimate f̂(x).

1.7 Thesis Outline

The thesis is divided into six chapters, excluding the introductory and the concluding chapters. The first two chapters give an overview of existing parametric and nonparametric methods that relate to the Just-in-Time modeling concept, and the last four chapters derive, analyze and exemplify the proposed method.

The purpose of Chapter 2 is to give the reader a brief background on parametric estimation methods, especially in system identification applications. Examples of some commonly used linear and nonlinear black-box models are given, along with the two basic parameter estimation methods.

Chapter 3 serves as an introduction to nonparametric smoothing methods. The chapter is mainly focused on a special class of so-called kernel estimation methods which is widely used in statistics. The fundamental ideas and terminology are presented, as well as the statistical and asymptotic properties that are associated with these methods.

Chapter 4 is the core chapter of the thesis. It presents the basic ideas behind the Just-in-Time concept and proposes a possible implementation of a Just-in-Time


estimator. The chapter is concluded with a discussion regarding different aspects and properties of the method.

Chapter 5 presents an analysis of the asymptotic properties of the Just-in-Time estimator. The aim is to investigate the consistency, i.e., whether the estimator tends to the true regression function when the sample size tends to infinity, and the convergence rate, i.e., how fast it tends to this function.

In Chapter 6 the Just-in-Time method is applied to the time domain system identification problem. First, two simulated examples are considered, and then two real-data applications, a tank and a water heating system, are successfully modeled by the method.

Chapter 7 gives an example of an important application of the Just-in-Time method in the field of frequency response estimation. The chapter starts by giving a review of the traditional treatments of the topic, after which the Just-in-Time modeling concept is modified to fit into this framework.

Finally, Chapter 8 gives a summary and directions for future work.

1.8 Contributions

The contributions of this thesis are mainly the material contained in Chapter 4 to Chapter 7. They can be summarized as follows:

• The concept of Just-in-Time models is advocated as a method for obtaining predictions of a system given large data volumes.

• A particular implementation of a Just-in-Time estimator is proposed that forms the estimate in a nonparametric fashion as a weighted average of the response variables. The weights are optimized so that the local mean square error is minimized.

• An analysis of the asymptotic properties of the Just-in-Time smoother is presented. It is shown that the method produces consistent estimates and that the convergence rate is of the same order as for nonparametric methods.

• A comparison to kernel estimators is made, and it is shown that the Just-in-Time estimator is easier to generalize to higher regressor dimensions.

• Examples with dynamical systems show that the Just-in-Time method for some systems gives smaller prediction errors than other proposed methods.

• It is shown that the method is quite efficient, in terms of performance, for smoothing frequency response estimates.

The thesis is based on two papers that have been, or will be, presented at different conferences. The papers are:

[44] A. Stenman, F. Gustafsson, and L. Ljung. Just in time models for dynamical systems. In Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996.


[45] A. Stenman, A.V. Nazin, and F. Gustafsson. Asymptotic properties of Just-in-Time models. 1997. To be presented at SYSID ’97 in Fukuoka, Japan.


2 Parametric Methods

This chapter gives a brief review of parametric estimation methods, which quite often are considered when solving the regression problem described in Chapter 1. The basic concept of parametric regression methods is given in Section 2.1. Section 2.2 gives some examples of common black-box models, both linear and nonlinear, that are frequently used in system identification. Section 2.3 describes the two basic parameter estimation methods used when a certain model class is chosen. Section 2.4, finally, briefly states the basic asymptotic properties associated with parametric models.

2.1 Parametric Regression Models

A very commonly used way of estimating the regression function f in a regression relationship

Yi = f(Xi) + ei, (2.1)

on the basis of observed data {(Xi, Yi)}_{i=1}^{N}, is the parametric approach. The basic assumption is that f belongs to a family of functions with a pre-specified functional form, and that this family can be parameterized by a finite-dimensional parameter vector θ,

f(Xi, θ). (2.2)

The simplest example, which is very often used, is the linear regression,

f(Xi, θ, φ) = Xi^T θ + φ,   (2.3)


where it is assumed that the relation between the variables can be described by a hyperplane, whose slope and offset are controlled by the parameters θ and φ. In the general case, though, a wide range of different nonlinear model structures is possible. The choice of parameterization depends very much on the situation. Sometimes there are physical reasons for modeling Y as a particular function of X, while at other times the choice is based on previous experience with similar data sets.

Once a particular model structure is chosen, the parameter vector θ can naturally be assessed by means of the fit between the model and the data set

‖Yi − f(Xi, θ)‖. (2.4)

As will be described in Section 2.3, this fit can be performed in two major ways, depending on which norm is used and how the parameter vector appears in the parameterization. When the parameters enter linearly as in (2.3), they can easily be computed using simple and powerful methods. In general, though, this optimization problem is non-convex and may have a number of local minima, which makes its solution difficult.

An advantage of parametric models is that they give a very compact description of the data set once the parameter vector θ is estimated. In some applications, the data set may occupy several megabytes while the model is represented by only a handful of parameters. A major drawback, however, is the particular parameterization that must be imposed. Sometimes the assumed function family might be too restrictive or too low-dimensional to fit unexpected features in the data.

2.2 Parametric Models in System Identification

System identification is a special case of the regression relationship (2.1), where the response variable Yt represents the output of a dynamical system at time t,

Yt = y(t),

and the predictor variable Xt (usually denoted by ϕ(t) rather than Xt) consists of inputs and outputs to the system at previous time instants,

Xt = (y(t− 1), y(t− 2), . . . , u(t− 1), u(t− 2), . . . )T .

Over the years, different names and concepts have been associated with different parameterizations. We will in the following two subsections briefly describe some of the most commonly used ones.

2.2.1 Linear Black-box Models

Linear black-box models have been thoroughly discussed and analyzed in the system identification literature during the last decades, see for example [30] and [43].


The simplest linear model is the finite impulse response (FIR) model

y(t) = b1 u(t−1) + . . . + bn u(t−n) + e(t) = B(q)u(t) + e(t),

where B(q) is a polynomial in the time shift operator q. Allowing the model order n to tend to infinity and using noise models of varying sophistication, all linear models can, as in [30], be described by the general model structure family,

A(q)y(t) = q^{−nk} (B(q)/F(q)) u(t) + (C(q)/D(q)) e(t),   (2.5)

where nk is the delay from u(t) to y(t) and

A(q) = 1 + a1 q^{−1} + . . . + a_{na} q^{−na}
B(q) = b1 + b2 q^{−1} + . . . + b_{nb} q^{−nb+1}
C(q) = 1 + c1 q^{−1} + . . . + c_{nc} q^{−nc}
D(q) = 1 + d1 q^{−1} + . . . + d_{nd} q^{−nd}
F(q) = 1 + f1 q^{−1} + . . . + f_{nf} q^{−nf}.

An often used special case of (2.5) is the ARX (Auto Regressive with eXogenous input) model,

y(t) + a1 y(t−1) + . . . + a_{na} y(t−na) = b1 u(t−nk) + . . . + b_{nb} u(t−nb−nk+1) + e(t)   (2.6)

which corresponds to F(q) = C(q) = D(q) = 1. It has the nice property of being expressible in terms of a linear regression

y(t) = ϕ^T(t) θ + e(t),

and hence the parameter vector θ can be determined using simple and powerful estimation methods, see Section 2.3.1. Note that the parametric model used in Example 1.1 in Chapter 1 is of ARX type, with na = nb = nk = 1.
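As a sketch of how an ARX model is estimated in practice, the code below builds the regression vectors from past data and solves the linear least squares problem; the sign convention for the a-parameters and the simulated system are assumptions made for the illustration.

```python
import numpy as np

# Sketch: estimating an ARX(na, nb, nk) model
# y(t) + a1*y(t-1) + ... = b1*u(t-nk) + ... + e(t)
# by writing it as the linear regression y(t) = phi(t)^T theta + e(t).

def arx_fit(y, u, na, nb, nk):
    """Return theta = (a1..a_na, b1..b_nb) estimated by linear least squares."""
    start = max(na, nb + nk - 1)
    rows = []
    for t in range(start, len(y)):
        # Convention: negated past outputs so that theta stacks (a_i, b_i) directly.
        past_y = [-y[t - i] for i in range(1, na + 1)]
        past_u = [u[t - nk - i] for i in range(nb)]
        rows.append(past_y + past_u)
    Phi = np.array(rows)
    theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    N = 2000
    u = rng.normal(size=N)
    y = np.zeros(N)
    for t in range(1, N):                       # hypothetical ARX(1,1,1) system
        y[t] = 0.7 * y[t - 1] + 0.5 * u[t - 1] + 0.1 * rng.normal()
    print(arx_fit(y, u, na=1, nb=1, nk=1))      # roughly (-0.7, 0.5)
```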

2.2.2 Nonlinear Black-box Models

When turning to nonlinear modeling, things in general become much more complicated. The reason for that is that almost nothing is excluded, and a very rich spectrum of possible model structures is available.

It is natural to think of the parameterization (2.2) as a function expansion [42],

f(ϕ(t), θ) = Σ_{k=1}^{r} αk gk(ϕ(t), βk, γk).   (2.7)

The functions gk(·) are usually referred to as basis functions, because the role they play in (2.7) is very similar to that of a functional space basis. Typically, the basis


functions are constructed from a simple scalar “mother” basis function, κ(·), which is scaled and translated according to the parameters βk and γk.

Using scalar basis functions, there are three basic methods of expanding them into higher regressor dimensions:

Ridge construction. A ridge basis function has the form

gk(ϕ(t), βk, γk) = κ(βk^T ϕ(t) + γk),   (2.8)

where κ(·) is a scalar basis function, βk ∈ R^n and γk ∈ R. The ridge function is constant for all ϕ(t) in the direction where βk^T ϕ(t) is constant. Hence the basis functions will have unbounded support in this subspace, although the mother basis function κ(·) has local support. See Figure 2.1 (a).

Radial construction. In contrast to the ridge construction, the radial basis functions have true local support, as is illustrated in Figure 2.1 (b). The radial support can be obtained using basis functions of the form

gk(ϕ(t), βk, γk) = κ(‖ϕ(t) − γk‖_{βk}),   (2.9)

where γk ∈ R^n is a center point and ‖ · ‖_{βk} denotes an arbitrary norm on the regressor space. The norm is often taken as a scaled identity matrix.

Composition. A composition is obtained when the ridge and radial constructions are combined when forming the basis functions. A typical example is illustrated in Figure 2.1 (c). In general the composition can be written as a tensor product

gk(ϕ(t), βk, γk) = g_{k,1}(ϕ1(t), β_{k,1}, γ_{k,1}) · · · g_{k,r}(ϕr(t), β_{k,r}, γ_{k,r}),   (2.10)

where each gk,i(·) is either a ridge or a radial function.

Using the function expansion (2.7) and the different basis function constructions (2.8)-(2.10), a number of well-known nonlinear model structures can be formed – for example neural networks, radial basis function networks and wavelets.

Neural Networks

The combination of (2.7), the ridge construction (2.8), and the so-called sigmoid mother basis function,

κ(x) = σ(x) = 1/(1 + e^{−x}),   (2.11)

results in the celebrated one hidden layer feedforward neural net. See Figure 2.2. Many different generalizations of this basic structure are possible. If the outputs of the κ(·) blocks are weighted, summed and fed through a new layer of κ(·) blocks, one


Figure 2.1 Three different methods of expanding into higher regressor dimensions: (a) ridge, (b) radial, (c) composition.

usually talks about multi-layer feedforward neural nets. So-called recurrent neural networks are obtained if instead some of the internal signals in the network are fed back to the input layer. See [23] for further structural issues. Neural network models are highly nonlinear in the parameters, and thus have to be estimated through numerical optimization schemes, as will be described in Section 2.3.2.
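A minimal sketch of the forward pass of the one hidden layer network in Figure 2.2 is given below; the parameter values are random placeholders, whereas in identification they would be estimated from data as described in Section 2.3.2.

```python
import numpy as np

# Minimal sketch of the one hidden layer feedforward net in Figure 2.2:
# f(phi, theta) = sum_k alpha_k * sigma(beta_k^T phi + gamma_k), sigma(x) = 1/(1+exp(-x)).
# Parameter values below are illustrative placeholders.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_hidden_layer_net(phi, alpha, beta, gamma):
    """Forward pass: phi is the regressor, beta has one column per hidden unit."""
    hidden = sigmoid(phi @ beta + gamma)   # outputs of the r kappa(.) blocks
    return hidden @ alpha                  # weighted sum in the output layer

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    phi = rng.normal(size=4)        # regressor phi(t) with n = 4 components
    r = 10                          # number of hidden units (basis functions)
    alpha = rng.normal(size=r)
    beta = rng.normal(size=(4, r))
    gamma = rng.normal(size=r)
    print(one_hidden_layer_net(phi, alpha, beta, gamma))
```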

Radial Basis Networks

A closely related concept is the radial basis function (RBF) network [6]. It is constructed using the expansion (2.7) and the radial construction (2.9). The radial mother basis function κ(·) is often taken as a Gaussian function

κ(x) = e^{−x²}.

Compared to neural networks, the RBF network has the advantage of being linear in the parameters (provided that the location parameters are fixed). This makes the estimation process easier.

Wavelets

Wavelet decomposition of a function is another example of the parameterization (2.7) [10]. A mother basis function (usually referred to as the mother wavelet and denoted by ψ(·) rather than κ(·)) is scaled and translated to form a wavelet basis. The mother wavelet is usually a small wave (a pulse) with bounded support.

It is common to let the expansion (2.7) be double indexed according to scale and location. For the scalar case and the specific choices βj = 2^j and γk = k, the basis functions can therefore be written as

g_{j,k} = 2^{j/2} κ(2^j ϕ(t) − k).   (2.12)

Multivariable wavelet functions can be constructed from scalar ones using the composition method (2.10).


Figure 2.2 A one hidden layer feedforward neural net. The regressor components ϕ1(t), . . . , ϕn(t) in the input layer are weighted by the parameters βk, offset by −γk and passed through the κ(·) blocks of the hidden layer, whose outputs are weighted by α1, . . . , αr and summed in the output layer to form ŷ.

Wavelets have multiresolution capabilities. Several different scale parameters are used simultaneously and overlappingly. With a suitably chosen mother wavelet along with scaling and translation parameters, the wavelet basis can be made orthonormal, which makes it easy to compute the coordinates α_{j,k} in (2.7). See for example [42] for details.
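As an illustration of (2.12), the sketch below evaluates a dilated and translated wavelet basis; the Haar mother wavelet is an assumption made for the example, since the text does not prescribe a particular ψ(·).

```python
import numpy as np

# Sketch (assuming the Haar wavelet as mother function): the dilated and
# translated basis g_{j,k}(x) = 2^{j/2} * psi(2^j x - k) of (2.12), here for a
# scalar argument x in [0, 1).

def haar(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((0 <= x) & (x < 0.5), 1.0, np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def wavelet_basis(x, j, k):
    """Basis function g_{j,k}(x) = 2^{j/2} * psi(2^j * x - k)."""
    return 2 ** (j / 2) * haar(2 ** j * x - k)

if __name__ == "__main__":
    x = np.linspace(0, 1, 9)
    # Two scales (j = 0, 1) and the admissible translations k = 0..2^j - 1 on [0, 1).
    for j in (0, 1):
        for k in range(2 ** j):
            print(j, k, wavelet_basis(x, j, k))
```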

2.3 Parameter Estimation

When a particular linear or nonlinear model structure is chosen, the next step is to estimate the parameters on the basis of the observations S_N = {(Xi, Yi)}_{i=1}^{N}. This is usually done by minimizing the mean square error loss function

V_N(θ, S_N) = (1/N) Σ_{i=1}^{N} (Yi − f(Xi, θ))².   (2.13)

The parameter estimate is then given by

θ̂_N = arg min_θ V_N(θ, S_N).   (2.14)


Depending on how the parameters appear in the parameterization, this minimization can be performed either using a linear least squares approach or a nonlinear least squares approach.

2.3.1 Linear Least Squares

When the parameters enter linearly in the predictor, an explicit solution that minimizes (2.13) exists. The optimal parameter estimate is then simply given by

θ̂_N = ( Σ_{i=1}^{N} Xi Xi^T )^{−1} Σ_{i=1}^{N} Xi Yi   (2.15)

provided that the inverse in (2.15) exists. For numerical reasons this inverse is rarely formed. Instead the estimate is computed using QR- or singular value decomposition [29].
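The following sketch compares the closed-form expression (2.15) with a direct least squares solve on synthetic data; the latter corresponds to the QR/SVD-based computation recommended above. All numbers are illustrative.

```python
import numpy as np

# Sketch of (2.15) on synthetic data: the normal-equations solution and the
# numerically preferable least squares solve (QR/SVD based).

rng = np.random.default_rng(7)
N, d = 500, 3
X = rng.normal(size=(N, d))                 # regressors X_i as rows
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=N)

# Normal equations: theta = (sum X_i X_i^T)^{-1} sum X_i Y_i.
theta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Preferred in practice: solve the least squares problem directly.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_normal, theta_lstsq)            # both close to theta_true
```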

2.3.2 Nonlinear Least Squares

When the predictor is nonlinear in the parameters, the minimum of the loss function (2.13) cannot be computed analytically. Instead one has to search for the minimum numerically. An often used numerical optimization method is Newton’s algorithm [12],

θ̂_N^{k+1} = θ̂_N^{k} − [V″_N(θ̂_N^{k}, S_N)]^{−1} V′_N(θ̂_N^{k}, S_N),   (2.16)

where V′_N(·) and V″_N(·) denote the gradient and the Hessian of the loss function, respectively. The parameter vector estimate θ̂_N^{k} is in each iteration updated in the negative gradient direction with a step size according to the inverse Hessian. For model structures like neural networks, which are highly nonlinear in the parameters, this introduces a problem since several local minima exist. There are no guarantees that the parameter estimate converges to the global minimum of the loss function (2.13).
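The sketch below illustrates an undamped Gauss-Newton variant of (2.16), in which the Hessian is approximated by the outer product of residual Jacobians; the model, data and starting point are illustrative, and the scheme is a common substitute for the full Newton step rather than the thesis's algorithm.

```python
import numpy as np

# Sketch of a Gauss-Newton iteration for the loss (2.13), a common variant of
# (2.16) where the Hessian is approximated by J^T J (J = Jacobian of the
# residuals). Model f(x, theta) = theta_1 * (1 - exp(-theta_2 * x)) is illustrative.

def residuals(theta, X, Y):
    return Y - theta[0] * (1.0 - np.exp(-theta[1] * X))

def jacobian(theta, X):
    # Derivatives of the residual with respect to theta_1 and theta_2.
    d1 = -(1.0 - np.exp(-theta[1] * X))
    d2 = -theta[0] * X * np.exp(-theta[1] * X)
    return np.column_stack([d1, d2])

def gauss_newton(theta0, X, Y, iters=20):
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        r = residuals(theta, X, Y)
        J = jacobian(theta, X)
        step = np.linalg.lstsq(J, -r, rcond=None)[0]   # solves (J^T J) step = -J^T r
        theta = theta + step
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = np.linspace(0, 5, 200)
    Y = 2.0 * (1.0 - np.exp(-1.5 * X)) + 0.05 * rng.normal(size=200)
    print(gauss_newton([1.0, 1.0], X, Y))   # roughly (2.0, 1.5)
```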

2.4 Asymptotic Properties of the Model

An interesting question is what properties the estimate resulting from (2.13) will have. These will naturally depend on the properties of the data set S_N. In general it is a difficult problem to characterize the quality of θ̂_N exactly. Instead one normally investigates the asymptotic properties of θ̂_N as the number of data, N, tends to infinity.

It is an important aspect of the general parameter estimation method (2.13) that the asymptotic properties of the resulting estimate can be expressed in general terms for arbitrary model structures.


The first basic result is the following one:

θ̂_N → θ∗ as N → ∞,   (2.17)

where

θ∗ = arg min_θ E(Yi − f(Xi, θ))².   (2.18)

That is, as more and more data become available, the estimate converges to the value θ∗ that would minimize the expected value of the squared prediction errors. This is in a sense the best possible approximation of the true regression function that is available within the model structure. The expectation E in (2.18) is taken with respect to all random disturbances that affect the data, and it also includes averaging over the predictor variables.

The second basic result is the following one: If the prediction error εi(θ∗) = Yi − f(Xi, θ∗) is approximately white noise, then the covariance matrix of θ̂_N is approximately given by

E(θ̂_N − θ∗)(θ̂_N − θ∗)^T ∼ (λ/N) [E ψi ψi^T]^{−1},   (2.19)

where

λ = E εi²(θ∗)   (2.20)

and

ψi = (d/dθ) f(Xi, θ) |_{θ=θ∗}.   (2.21)

The results (2.17) through (2.21) are general and hold for all model structures, both linear and nonlinear ones, subject only to some regularity and smoothness conditions. See [30] for more details.


3 Nonparametric Methods

In Chapter 2, a brief review of parametric estimation methods was given. It was concluded that parametric methods are good modeling alternatives, since they give a high degree of data compression, i.e., the major features of the data are condensed into a small number of parameters. A problem with the approach is the requirement of a certain parameterization. This must be selected such that it matches the properties of the underlying regression function. Otherwise quite poor results are often obtained.

The problem with parametric regression models mentioned above can be solved by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model.

Local nonparametric regression models have been discussed and analyzed in the statistical literature for a long time. In the context of so-called kernel regression methods, traditional approaches have involved the Nadaraya-Watson estimator [35, 51] and some alternative kernel estimators, for example the Priestly-Chao estimator [37] and the Gasser-Müller estimator [18]. In this chapter we give a brief introduction to a special class of such models, local polynomial kernel estimators [46, 7, 34, 15]. These estimate the regression function at a certain point by locally fitting a polynomial of degree p to the data using weighted least squares. The Nadaraya-Watson estimator can in this framework be seen as a special case, since it corresponds to fitting a zero degree polynomial, i.e., a local constant, to data.

The presentation here is neither formal nor complete; the purpose is just to


introduce concepts and notation used in the area. More comprehensive treatments of the topic are given in the books [50] and [21], upon which this survey is based.

The outline is as follows: Section 3.1 describes the basic nonparametric smoothing problem. Section 3.2 gives an introduction to local polynomial kernel estimators, which are one possible solution to the smoothing problem.

3.1 The Basic Smoothing Problem

Smoothing of a noisy data set {(Xi, Yi)}_{i=1}^{N} concerns the problem of estimating the function f in the regression relationship

Yi = f(Xi) + ei, i = 1, . . . , N, (3.1)

without the imposition that f belongs to a parametric family of functions. Depending on how the data have been collected, several alternatives exist. If there are multiple observations at a certain point x, an estimate of f(x) can be obtained by just taking the average of the corresponding Y-values. In most cases, however, repeated observations at a given x are not available, and one has to resort to other solutions that deduce the value of f(x) using observations at other positions than x. In the trivial case where the regression function f is constant, estimation of f(x) reduces to taking the average over the response variables Y. In general situations, though, it is unlikely that the true regression curve is constant. Rather, the assumed function is modeled as a smooth continuous function which is nearly constant in a small neighborhood around x.

A natural approach is therefore to use the mean of the response variables near the point x. This local average should then be constructed so that it is defined only from observations in a small neighborhood around x. This local averaging can be seen as the basic idea of smoothing. Almost all smoothing methods can, at least asymptotically, be described as a weighted average of the Y’s near x,

f̂(x) = Σ_{i=1}^{N} wi Yi,   (3.2)

where wi is a sequence of weights that may depend on x and the predictor data Xi. The estimator f̂(x) is usually called a smoother, and the result of a smoothing operation is called the smooth [49]. A simple smooth can be obtained by defining the weights as constant over adjacent intervals. This is quite similar to the histogram concept. Therefore it is sometimes referred to as the regressogram [48]. In more sophisticated methods like the kernel estimator approach, the weights are chosen to follow a kernel function Kh(·) of fixed form

wi = Kh(Xi − x).

Kernel estimators will be described in more detail in the following section.

The fact that smoothers, by definition, average over observations with considerably different expected values has been paid special attention in the statistical


literature. The weights wi are typically tuned by a smoothing parameter whichcontrols the degree of local averaging, i.e., the size of the neighborhood around x.A too large neighborhood will include observations located far away from x, whoseexpected values may differ considerably from f(x), and as a result the estimator willproduce an “over-smoothed” or biased estimate. When using a too small neighbor-hood, on the other hand, only a few number of observations will contribute to theestimate at x, hence making it “under-smoothed” or noisy. The basic problem innonparametric methods is thus to find the optimal choice of smoothing parameterthat will balance the bias error against the variance error.

Before going into details about kernel regression models, we will give some basicterminology and notation. Nonparametric regression is studied in both fixed designand random design contexts. In the fixed design case, the predictor variables consistof ordered non-random numbers. A special case is the equally spaced fixed designwhere the difference Xi+1 − Xi is constant for all i, for example Xi = i/N, i =1, . . . , N . The random design occurs when the predictor variables instead areindependent, identically distributed random variables. The regression relationshipis in both cases assumed to be modeled as in (3.1), where ei are independentrandom variables with zero means and variances λ, which are independent of Xi.The overview is concentrated on the scalar case, because of its simpler notation.However, the results are generalized to the multivariable case in Section 3.6.

3.2 Local Polynomial Kernel Estimators

Local polynomial kernel estimators is a special class of nonparametric regressionmodels and was first discussed by Stone [46] and Cleveland [7]. The basic idea isto estimate the regression function f(·) at a particular point x, by locally fitting apth degree polynomial

θ0 + θ1(Xi − x) + . . .+ θp(Xi − x)p (3.3)

to the data (Yi, Xi) via weighted least squares, where the weights are chosenaccording to a kernel function, K(·), centered about x and scaled according to aparameter h. A kernel estimate, f(x, h), of the true regression function at the pointx, is thus obtained as

f(x, h) = θ0, (3.4)

where θ = (θ0, . . . , θp)T is the solution to the weighted least squares problem

θ = arg minθ

N∑i=1

Yi − θ0 − θ1(Xi − x)− . . .− θp(Xi − x)p2Kh(Xi − x),(3.5)

and

Kh(Xi − x) = h−1K((Xi − x)/h) . (3.6)

Page 38: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

24 Chapter 3 Nonparametric Methods

The parameter h is usually referred to as the bandwidth, and can be interpreted asa scaling parameter that controls the size of the neighborhood in (3.5).

The kernel is normally chosen as a symmetric function satisfying∫ 1

−1

K(u) du = 1,∫ 1

−1

uK(u) du = 0, (3.7)

K(u) ≥ 0, ∀u. (3.8)

A wide range of different kernel functions is possible in general, but often bothpractical and performance theoretic considerations limit the choice. A commonlyused kernel function, which has been proved to have some optimal properties, is ofparabolic shape [14],

K(u) =

0.75(1− u2), |u| < 1

0, |u| ≥ 1. (3.9)

A plot of this so called Epanechnikov kernel is shown in Figure 3.1.

−1.5 −1 −0.5 0 0.5 1 1.5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

u

K(u

)

Figure 3.1 The Epanechnikov kernel.

The expression for f(x, h) = θ0 in the solution to (3.5) is quite complicatedin general, but simple explicit formulas exist for the so-called Nadaraya-Watsonestimator [35, 51] (p = 0),

fNW(x, h) =∑Ni=1 Kh(Xi − x)Yi∑Ni=1 Kh(Xi − x)

, (3.10)

Page 39: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

3.2 Local Polynomial Kernel Estimators 25

which corresponds to fitting a local constant to the data, and the local linearestimator [15] (p = 1),

fLL(x, h) = N−1N∑i=1

s2(x, h)− s1(x, h)(Xi − x)Kh(Xi − x)Yis2(x, h)s0(x, h)− s2

1(x, h),

(3.11)

where

sr(x, h) = N−1N∑i=1

(Xi − x)rKh(Xi − x). (3.12)

Note that both the Nadaraya-Watson estimator (3.10) and the local linear estimator(3.11) can be written in the weighted average form (3.2), with weights

wi =Kh(Xi − x)∑Ni=1 Kh(Xi − x)

and

wi = N−1 s2(x, h)− s1(x, h)(Xi − x)Kh(Xi − x)s2(x, h)s0(x, h)− s2

1(x, h)

respectively.The selection of bandwidth parameter is crucial for the performance of the

kernel estimator. This fact is best illustrated with an example.

Example 3.1 (Adapted from Wand and Jones [50])Consider the regression function

f(x) = 3 e−x2/0.32

+ 2 e−(x−1)2/0.72, x ∈ [0, 1], (3.13)

which is represented by the dashed curve in Figure 3.2. A data set (Xi, Yi)(represented by crosses) was generated according to

Yi = f(Xi) + ei, i = 1, . . . , 100

where Xi = i/100, and the ei are independent Gaussian random variables withzero mean and variance 0.16. The solid line corresponds to a local linear estimateas in (3.11) using an Epanechnikov kernel and bandwidth 0.1. In Figure 3.3, thesame experiment is repeated using bandwidths 0.01 and 1 respectively.

As shown in Example 3.1, if the bandwidth h is small, the local linear fittingdepends heavily on the measurements that are close to x, thus producing an es-timate that is very noisy and wiggly. This is shown in Figure 3.3 (a) where thebandwidth h = 0.01 is used. A large bandwidth, on the other hand, tends to weightthe measurements more equally, hence resulting in an estimate that approaches a

Page 40: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

26 Chapter 3 Nonparametric Methods

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10.5

1

1.5

2

2.5

3

3.5

4

4.5

x

Reg

ress

ion

funct

ion

Figure 3.2 Local linear estimate (solid) of the regression function (3.13),based on 100 simulated and noisy observations (crosses), and using band-width 0.1. The dashed curve is the true regression function f .

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.5

1

1.5

2

2.5

3

3.5

4

4.5

x

Reg

ress

ion

funct

ion

(a) h = 0.01

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10.5

1

1.5

2

2.5

3

3.5

4

4.5

x

Reg

ress

ion

funct

ion

(b) h = 1

Figure 3.3 Local linear estimates based on the same data as in Figure 3.2,but with (a) a very small bandwidth (b) a very large bandwidth.

Page 41: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

3.3 K-Nearest Neighbor Estimators 27

straight line through the data. This is illustrated in Figure 3.3 (b), where a band-width h = 1 is used. A good choice of bandwidth is a compromise between thesetwo extremes, as shown in Figure 3.2 where a bandwidth h = 0.1 is used.

In practice, it is undesirable to have to specify the bandwidth explicitly whenusing a kernel estimator. A better idea is to make use of the available data set(Xi, Yi), and on basis on these data automatically estimate a good bandwidth.This procedure is usually referred to as bandwidth selection, and is described moredetailed in Section 3.5.

3.3 K-Nearest Neighbor Estimators

The k-nearest neighbor (k-NN) estimator is a variant of the kernel estimator. Thek-NN weight sequence has been introduced by [31] in the related density estimationproblem, and has been used by [8] for classification purposes.

The k-NN estimator is defined as

fk(x) =N∑i=1

wikYi, (3.14)

where the weights are defined through the index set,

Jk(x) = i : Xi is one of the k nearest neighbors of x. (3.15)

such that

wik =

1/k, if i ∈ Jk(x)0 otherwise.

(3.16)

In general the weights can be thought as being generated by a kernel function

wik = Kr(Xi − x), (3.17)

where r is the distance between x and its kth nearest neighbor,

r = maxi|Xi − x|, i ∈ Jk(x).

3.4 Statistical Properties of Kernel Estimators

It is in general of interest to investigate the performance and the statistical proper-ties of kernel estimators. Typically, this concerns questions regarding consistency,i.e. whether or not the estimate converges to the true regression function f , andconvergence rate, i.e. how fast the estimate tends to f with respect to the numberof samples N .

Page 42: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

28 Chapter 3 Nonparametric Methods

3.4.1 The MSE and MISE Criteria

When analyzing the performance it is necessary to have some kind of measure thatspecifies the accuracy of the estimator. An often used pointwise error measure isthe mean square error (MSE),

MSE(f(x, h)) = E(f(x, h)− f(x)

)2

. (3.18)

The MSE has the nice feature of being decomposable into a squared bias part anda variance error part,

MSE(f(x, h)) =(Ef(x, h)− f(x)

)2

︸ ︷︷ ︸bias2

+ Var(f(x, h)

)︸ ︷︷ ︸

variance

. (3.19)

where the variance error typically decreases and the bias error increases with in-creasing h. As we shall see in Section 3.4.2, this implies that a good choice ofbandwidth is one that balances the bias error versus the variance error.

If one instead is interested in a global error measure, it is natural to integratethe squared error over all x, and take expectation. This leads to the mean integratedsquare error (MISE) [50],

MISE(f(x, h)) = E

∫f(x, h)− f(x)2 dx. (3.20)

By changing the order of expectation and integration, (3.20) can be rewritten

MISE(f(x, h)) =∫Ef(x, h)− f(x)2 dx =

∫MSE(f(x, h)) dx,

(3.21)

i.e. the MISE can be obtained by integrating the MSE over all x.

3.4.2 Asymptotic MSE Approximation

A non-trivial problem with the MSE formula (3.19) is that it depends on thebandwidth h in a complicated way, which makes it difficult to analyze the influenceof the bandwidth on the performance of the kernel estimator. One way to overcomethis problem is to use large sample approximations (i.e., N large) for the bias andvariance terms. This leads to what is referred to as the asymptotic mean squareerror, AMSE [50].

To simplify the notation in the sequel of the section, the following notations areintroduced;

µ2(K) =∫u2K(u) du, R(K) =

∫K2(u) du.

For the local linear estimator (3.11), we then have the following asymptotic result:

Page 43: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

3.4 Statistical Properties of Kernel Estimators 29

Theorem 3.1 (Wand and Jones [50]) Consider the fixed design case

Yi = f(Xi) + ei, i = 1, . . . , N,

where Xi = i/N and ei are identically distributed random variables with zero

means and variances λ. Let fLL(x, hN ) be a local linear kernel estimator as in(3.11), and assume that

(i) The second order derivative f ′′(x) is continuous on [0, 1].

(ii) The kernel function K is symmetric about 0, and has support on [−1, 1].

(iii) The bandwidth h = hN is a sequence satisfying hN → 0 and NhN → ∞ asN →∞.

(iv) The estimation point x is an interior point satisfying h < x < 1− h.

Then the bias error is asymptotically given by

EfLL(x, hN )− f(x) = 12h

2Nf′′(x)µ2(K) + o(h2) +O(N−1),

(3.22)

and the variance by

Var(fLL(x, hN )) =1

NhNR(K)λ+ o((Nh)−1). (3.23)

These two expressions can be combined to form the AMSE

AMSE(fLL(x, hN )

)=(

12h

2Nf′′(x)µ2(K)

)2 +1

NhNR(K)λ.

(3.24)

Hence basic calculus yields that

infhN>0

AMSE(fLL(x, hN )

)=

54µ2(K)R2(K)f ′′(x)λ2

2/5N−4/5,

(3.25)

with asymptotic optimal bandwidth

hAMSE =(

R(K)λ(f ′′(x))2µ2

2(K)

)1/5

N−1/5. (3.26)

The proof is given in Appendix 3.A for future reference.Equation (3.24) shows that the squared bias error is asymptotically proportional

to h4, which means that in order to decrease the bias error, h has to be small.However, a small value of h yields that the variance error part becomes large, sinceit is asymptotically proportional to (Nh)−1. The bandwidth must thus be chosenso that the bias is balanced against the variance.

Page 44: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

30 Chapter 3 Nonparametric Methods

From equation (3.25) we see that the mean square error tends to zero as thesample size N tends to infinity. This implies that the kernel estimator convergesin probability to the true regression function f(x). The best achievable rate of thisconvergence is of order N−4/5, which is slower than the typical rate of order N−1

for parametric models as described in Section 2.4. To obtain the rate of N−4/5,the bandwidth must be selected in order of N−1/5.

Theorem 3.1 is stated for the pointwise MSE error measure. Determining thecorresponding asymptotic result for the MISE measure (3.20) is straightforward.Integrating (3.24) over all x yields

AMISE(fLL(x, hN )

)=

14h4NR(f ′′(x))µ2

2(K) +1

NhNR(K)λ.

(3.27)

Hence the asymptotic global optimal bandwidth is given by

hAMISE =(

R(K)λR(f ′′(x))µ2

2(K)

)1/5

N−1/5. (3.28)

Example 3.2Consider again the smoothing problem in Example 3.1, where we used a locallinear kernel estimator and an Epanechnikov kernel. Suppose we are interestedin minimizing the MSE for all x ∈ [0, 1]. The bandwidth that asymptoticallyminimizes the MISE is given by (3.28). Since

N = 100, λ = 0.16,

R(K) =35, µ2(K) =

15

and

R(f ′′(x)) =∫ 1

0

(f ′′(x))2 dx = 605.916

we gethAMISE = 0.13

which is close to the value h = 0.1 used in Figure 3.2.

Note that all results stated in this section are valid only if the point x is aninterior point of the interval. At the boundary the situation in general gets degen-erated, which results in slower convergence rates there.

3.5 Bandwidth Selection

Practical implementations of kernel estimators require that the bandwidth h isspecified, and as was shown in Section 3.2, this choice is crucial for the performanceof the kernel estimator. A method that uses the data set (Xi, Yi) to produce a

Page 45: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

3.6 Extensions to the Multivariable Case 31

bandwidth h is called a bandwidth selector [50]. Several methods for doing thisexist. One solution is the one leave-out cross-validation method [21], which isbased on estimators where one of the observations is left out when computing theestimate,

f(Xj , h) =∑i6=j

wi(Xj)Yi. (3.29)

A cross-validation function [47],

CV (h) = N−1N∑j=1

(Yj − f(Xj , h))2 (3.30)

is then formed, and the bandwidth h is taken as the minimizing argument of CV (h).Another class of bandwidth selectors, so-called direct plug-in methods [50], is

based on the simple idea of “plugging in” values of λ and R(f ′′(x)) into the asymp-totically optimal bandwidth formula (3.28).

In practice, however, the values of λ and R(f ′′(x)) are unknown and have tobe estimated from data. Quite often they are estimated on the basis of somepreliminary smoothing stage, which then raises a second order bandwidth selectionproblem. For example, an estimator for R(f ′′) is given by

R(f ′′) = N−1N∑i=1

(f ′′(Xi, g))2,

and an estimator for λ by

λ = ν−1N∑i=1

(Yi − f(Xi, l))2

where ν is chosen so that λ is conditionally unbiased when the true regressionfunction f is linear. Consult [50] for details.

The unknown quantities in (3.28) can thus be estimated using two other kernelestimators f ′′(Xi, g) and f(Xi, l) with auxiliary bandwidths g and l, respectively.

3.6 Extensions to the Multivariable Case

The theory for kernel estimators can easily be extended to the multivariable casewhere Xi ∈ Rn and Yi ∈ R. The multivariable local linear estimator is given by

fLL(x,H) = θ0, (3.31)

where θ = (θ0, θT1 )T is the solution to the weighted least squares problem

θ = arg minθ

N∑i=1

Yi − θ0 − (Xi − x)T θ1

2KH(Xi − x). (3.32)

Page 46: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

32 Chapter 3 Nonparametric Methods

Here

KH(u) = |H |−1/2K(H−1/2u

), (3.33)

and K(·) is an n-dimensional kernel function satisfying∫RnK(u) du = 1,

∫RnuK(u) du = 0, (3.34)

and

K(u) > 0, ∀u. (3.35)

The symmetric positive definite n× n matrix H1/2 is called the bandwidth matrix,since it is the multivariable counterpart to the usual scalar bandwidth parameter.

The kernel function is often taken as an n-dimensional probability density func-tion. There are two common methods for constructing multidimensional kernelfunctions from a scalar kernel κ; the product construction,

K(u) =n∏i=1

κ(ui) (3.36)

and the radial construction

K(u) = c · κ(

(uTu)1/2), (3.37)

where c is a normalizing constant. These correspond directly to the parametricconstruction methods described in Section 2.2.2.

A problem with multidimensional kernel estimators is the large number of pa-rameters. In general, the bandwidth matrix H has 1

2n(n+ 1) independent entries(H is symmetric), which, even for a quite moderate value of n, gives a non trivialnumber of smoothing parameters to choose. Considerable simplifications can beobtained by restricting H to be a diagonal matrix H = diag(h2

1, . . . , h2n), which

leads to a kernel of the type

KH(u) =

(n∏k=1

hk

)−1

K

(u1

h1, . . . ,

unhn

), (3.38)

or by letting H be specified by a single bandwidth parameter H = h2I, whichresults in the kernel

KH(u) = h−nK(u/h). (3.39)

The multivariable fixed design counterpart of the asymptotic mean square error inTheorem 3.1 is (see [39])

AMSE(fLL(x,H)) =(

12µ2(K) trHHf(x)

)2

+N−1λ|H |−1/2R(K),(3.40)

Page 47: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

Appendix 3.A: Proof of Theorem 3.1 33

where Hf (x) denotes the Hessian (i.e., the second order derivative matrix) of f .Using the single bandwidth kernel (3.39), this simplifies to

AMSE(fLL(x, h)) =(

12µ2(K)h2 trHf (x)

)2

+N−1λh−nR(K).(3.41)

Hence minimizing this expression w.r.t. h,

infh>0

AMSE(fLL(x, h)) =(14n4/(n+4) + n−n/(n+4)

)µn2 (K)R2(K) trnHf (x)λ2

2/(n+4) ·N−4/(n+4),(3.42)

with optimal bandwidth

hN =(

nR(K)λµ2

2(K) tr2Hf (x)

)1/(n+4)

N−1/(n+4). (3.43)

As shown, the convergence rate N−4/(n+4) for the multivariable kernel estimator isslower than for the corresponding scalar estimator (n = 1). This is a manifestationof the so-called curse of dimensionality which follows from the sparseness of datain higher dimensions.

Appendix 3.A: Proof of Theorem 3.1

For future reference the proof of Theorem 3.1 is given here.

Proof (Wand and Jones [50])

Introduce weights wi according to

wi = Kh(Xi − x) i = 1, . . . , N. (3.44)

Standard weighted least squares theory then gives that the solution to (3.5) is

θ =

(N∑i=1

wiXiXTi

)−1 N∑i=1

wiXiYi, (3.45)

where

Xi =(

1Xi − x

). (3.46)

Hence it follows that

EfLL(x, h) = (1 0)

(N∑i=1

wiXiXTi

)−1 N∑i=1

wiXif(Xi). (3.47)

Page 48: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

34 Appendix 3.A: Proof of Theorem 3.1

Using a Taylor series expansion of f at Xi,

f(Xi) = XTi

(f(x)f ′(x)

)+

12f ′′(x)(Xi − x)2 + . . . , (3.48)

this leads to the bias of fLL(x, h) being

EfLL(x, h)− f(x) =

=12f ′′(x)(1 0)

(N∑i=1

wiXiXTi

)−1 N∑i=1

wiXi(Xi − x)2 + . . . . (3.49)

Using the notation introduced in (3.12), we can write

N−1N∑i=1

wiXiXTi =

(s0(x, h) s1(x, h)s1(x, h) s2(x, h)

)(3.50)

and

N−1N∑i=1

wiXi(Xi − x)2 =(s2(x, h)s3(x, h)

). (3.51)

For sufficiently large N , it then follows

sl(x, h) =∫ 1

0

(y − x)lKh(y − x) dy +O(N−1)

= hl∫ (1−x)/h

−x/hulK(u) du+O(N−1)

= hl∫ 1

−1

ulK(u) du+O(N−1).

(3.52)

By the symmetry and compact support of the Kernel function, all odd moments ofK vanish and we have that

N−1N∑i=1

wiXiXTi =

(1 +O(N−1) O(N−1)O(N−1) h2µ2(K) +O(N−1)

), (3.53)

and

N−1N∑i=1

wiXi(Xi − x)2 =(h2µ2(K) +O(N−1)

O(N−1)

). (3.54)

where µl(K) =∫ulK(u) du. Hence (3.53) and (3.54) inserted into (3.49) lead to

the expression

EfLL(x, h)− f(x) = 12h

2f ′′(x)µ2(K) + o(h2) +O(N−1) (3.55)

Page 49: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

Appendix 3.A: Proof of Theorem 3.1 35

for the leading bias term.

For the variance part we have

Var(fLL(x, h)) = E

(1 0)

(N∑i=1

wiXiXTi

)−1( N∑i=1

wiXiei

)

×

N∑j=1

wjXTj ej

N∑j=1

wjXjXTj

−1(10

). (3.56)

Using similar approximations as used in the bias error case, we have that

N−1E

(

N∑i=1

wiXiei

) N∑j=1

wjXTj ej

= N−1λ

N∑i=1

w2i XiX

Ti

= N−1λ

N∑i=1

K2h(Xi − x)

(1 Xi − x

Xi − x (Xi − x)2

)

=(h−1R(K)λ+ o(h−1) O(N−1)

O(N−1) hµ2(K2)λ+O(N−1)

), (3.57)

where R(K) =∫K2(u) du. Hence simple algebraic manipulations result in

Var(fLL(x, h)) =1Nh

R(K)λ+ o((Nh)−1). (3.58)

2

Page 50: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

36 Appendix 3.A: Proof of Theorem 3.1

Page 51: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4Just-in-Time Models

In this chapter we present an alternative solution to the nonlinear regression prob-lem presented in Chapter 3, Just-in-Time models. The basic idea is to form alocal estimate at a certain operating point x, by retrieving a subset of the data(Xi, Yi) located closest to x (in some suitable norm), and on the basis of thatsubset compute an estimate f(x). It turns out that, using a weighted least squarescriterion with simple constraints on the weights, the predictor/estimator can bewritten as a weighted average of the response variables Yi belonging to the subset.

The outline of the chapter is as follows. The first section, Section 4.1, givesan introduction and describes the basic ideas behind the Just-in-Time modelingapproach. Section 4.2 presents the derivation of the particular Just-in-Time esti-mator. The following three sections, Section 4.3 to Section 4.5, describe and solvethe three subproblems weight computation, Hessian estimation, and neighborhoodselection, which are associated with Just-in-Time modeling. Section 4.6 gives asummary of the derived algorithm, and Section 4.7, finally, investigates the basicproperties of it. The chapter is based on [44].

4.1 The Just-in-Time Idea

Let us again return to the nonlinear regression problem. Assume that we havecollected a large set of observations S = (X1, Y1), . . . , (XN , YN ) from a systemthat can be modeled as

Yi = f(Xi) + ei, (4.1)

37

Page 52: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

38 Chapter 4 Just-in-Time Models

and that we want to compute an estimate of the true regression function f at acertain operating point x.

Traditionally in statistics and system identification, such kind of problems havebeen solved by global modeling methods, like kernel methods (see Chapter 3) andnonlinear parametric black-box models (see Chapter 2). However, when the numberof samples N becomes very large, this approach is less attractive to deal with. Forreal industrial applications, for example in the chemical process industry, it is notunusual that the volumes of data may occupy several Gigabytes.

The global modeling procedure is in general associated with an optimizationstep of the form

θ = arg minθ

N∑i=1

(Yi − f(Xi, θ))2 . (4.2)

This optimization problem is typically non-convex and will have a number of lo-cal minima which make the search for the global minimum difficult. Although theglobal model has the appealing feature of giving a high degree of data compression,it seems to be both inefficient and unnecessary to spend a large amount of calcu-lations to optimize a model which is valid over the whole regressor space, while itin most cases is more likely that we will only visit a very restricted subset of it.

Inspired and encouraged by recent progress in database technology, we will takea conceptually different point of view. The basic idea is to store all observationsin a database, and to build models dynamically as the actual need arises. Whena model is really needed in a neighborhood of an operating point x, a subset ΩMof the data S closest to the operating point is retrieved from the database, anda local modeling operating is performed on that subset, see Figure 4.1. For this

x

ΩM

X-space

Figure 4.1 The Just-in-Time idea: A subset ΩM of the data closest to theoperating point x is retrieved, and is used to compute an estimate of f(x).

concept, we have adopted the name Just-in-Time models, suggested by [9], and

Page 53: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.2 The Just-in-Time Estimator 39

to be consistent with the notation introduced in Chapter 3, we will denote thecorresponding estimator

fJIT(x).

Compared to global methods, the advantage with Just-in-Time models is that themodeling is optimized locally, which might improve the performance. A possibledrawback is the computational complexity, both since we have to search for a neigh-borhood of x in a multidimensional space, and since the derived estimator is quitecomputationally intensive. In this contribution however, we will only investigatethe properties of the modeling part of the problem.

Before commencing the derivation of the Just-in-Time estimator, we will in-troduce some basic assumptions and notation. As in Chapter 2 we assume thatthe data are generated according to (4.1), where Xi ∈ Rn and Yi ∈ R, and that eidenotes independent, identically distributed random variables with zero means andvariances λ. We further assume that the true regression function f is two timesdifferentiable. The symbol ΩM is used to denote a neighborhood of x containingM samples, i.e.,

ΩM = Xi : ‖Xi − x‖ ≤ h (4.3)

for some arbritrary h ∈ R and some suitable norm ‖ · ‖.

4.2 The Just-in-Time Estimator

When performing the local modeling operating discussed in the preceeding section,it is clear that a wide range of solutions is possible in general. However, referring tothe discussion in Chapter 1, we shall here mainly concentrate on two modeling al-ternatives – using a simple local parametric model, and using a local nonparametricmodel.

Considering the parametric approach, a straightforward solution is to locally fita hyperplane

XTi θ + φ

to the data in the neighborhood ΩM using weighted least squares with weights wi.The value of the estimate fJIT(x) is then given by

fJIT(x) = xT θ + φ, (4.4)

where θ and φ are the solution to the weighted least squares problem

arg minθ,φ

∑Xi∈ΩM

wi(Yi −XTi θ − φ)2. (4.5)

An alternative is instead to consider a local nonparametric approach, and per-form the modeling as weighted average of the Yi’s that corresponds to the Xi’s inΩM , i.e.,

fJIT(x) =∑

Xi∈ΩM

wiYi. (4.6)

Page 54: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

40 Chapter 4 Just-in-Time Models

An advantage with this solution is that we do not have to deal with structuralquestions as in the parametric modeling case, once we have chosen the model order(na, nb and nk in the dynamical system case). For instance if we are dealing witha regression function f that depends on two variables x1 and x2, the regressionvector can in the parametric case be constructed in infinitely many ways, where

x = (x1, x2)T

andx = (x1, x2,

√x1, x1x2, x

32)T

are two particular examples. When using the nonparametric approach, we onlyhave to consider the former variant. We will return to this type of questions inChapter 6.

Although the two methods (4.4) and (4.6) at a glance seem to be quite different,it is easy to show that they are in fact identical, provided certain properties of theweight sequence wi hold. It is natural to assume that the weights satisfy∑

Xi∈ΩM

wi = 1, (4.7)

∑Xi∈ΩM

wi(Xi − x) = 0, (4.8)

which are standard in the kernel estimation context (see Chapter 3) and for smooth-ing windows in spectral analysis [27]. These fundamental assumptions can be jus-tified by the fact that (4.6) then will produce unbiased estimates in the case thatthe true regression function is linear, and as a result thereof, the following theoremholds.

Theorem 4.1 Consider the two local models (4.4) and (4.6). Let

Xi =(

1Xi

)and x =

(1x

), (4.9)

and assume that the matrix ∑Xi∈ΩM

wiXiXTi ,

is invertible. Further assume that the weight sequence wi satisfies (4.7) and(4.8). Then

fJIT(x) = xT θ + φ =∑

Xi∈ΩM

wiYi. (4.10)

Proof Standard weighted least squares theory gives that the solution to (4.5) is(φ

θ

)=

( ∑Xi∈ΩM

wiXiXTi

)−1 ∑Xi∈ΩM

wiXiYi. (4.11)

Page 55: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.2 The Just-in-Time Estimator 41

From the local linear model (4.4) we thus have

fJIT(x) = xT

( ∑Xi∈ΩM

wiXiXTi

)−1 ∑Xi∈ΩM

wiXiYi. (4.12)

Equation (4.12) can be expanded as

fJIT(x) = xT

( ∑Xi∈ΩM

wiXiXTi

)−1 ∑Xi∈ΩM

wiXiYi

= (1 xT )

( ∑Xi∈ΩM

wi

(1 XT

i

Xi XiXTi

))−1 ∑Xi∈ΩM

wi

(1Xi

)Yi

= (1 xT )

1 xT

x∑

Xi∈ΩM

wiXiXTi

−1 ∑Xi∈ΩM

(wiYiwiXiYi

),

(4.13)

where the third equality follows from the constraints (4.7) and (4.8). The matrixinverse in (4.13) can be rewritten using the block matrix inversion formula [26].This gives 1 xT

x∑

Xi∈ΩM

wiXiXTi

−1

=(

1 + xT∆−1x −xT∆−1

−∆−1x ∆−1

)(4.14)

where

∆ =∑

Xi∈ΩM

wiXiXTi − xxT (4.15)

Equation (4.14) inserted into (4.13) finally gives

fJIT(x) = (1 xT )(

1 + xT∆−1x −xT∆−1

−∆−1x ∆−1

) ∑Xi∈ΩM

(wiYiwiXiYi

)

= (1 0)

Xi∈ΩM

wiYi∑Xi∈ΩM

wiXiYi

=∑

Xi∈ΩM

wiYi, (4.16)

and the theorem is proved. 2

That is, assuming that the weight sequence satisfies (4.7) and (4.8), the estimateis given as a weighted average of the Yi’s in the neighborhood. A great advantagewith this is that we do not have to deal with the structural issues as mentioned inthe beginning of the section. However, two major questions arise:

Page 56: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

42 Chapter 4 Just-in-Time Models

• How should the weights wi in (4.16) be chosen so that the prediction error isminimized?

• Which data should be used in the estimation, i.e., what is the optimal numberof data M in the neighborhood ΩM , and what shape should ΩM have?

These two questions are discussed in Section 4.3 and Section 4.5, respectively.

4.3 Optimal Weights

As shown in Theorem 4.1, the weighted mean predictor and the local linear modelare equivalent assuming the weight sequence satisfies the conditions (4.7) and (4.8).However, we have still not specified how to construct the actual weight sequencewi. This will be the subject of this section.

Since the prediction will be computed at one particular point x, it is naturalto investigate the pointwise error measure MSE as a function of the weights andon the basis of this determine how to choose the weights in order to minimize thismeasure.

4.3.1 The MSE Formula

In this section we will derive an expression for the mean square error of the estima-tor (4.16) as a function of the weights wi. As mentioned in Chapter 2, the meansquare prediction error (MSE) is defined by

MSEfJIT(x,w) = E(fJIT(x,w) − f(x)

)2

. (4.17)

Here we have added the argument w to the estimator to emphasize the fact that itsperformance strongly depends on the properties of the weight sequence. Inserting(4.16) and (4.1) to (4.17) gives

MSEfJIT(x,w) = E

( ∑Xi∈ΩM

wiYi − f(x)

)2

= E

( ∑Xi∈ΩM

wi (f(Xi) + ei)− f(x)

)2

= E

( ∑Xi∈ΩM

wif(Xi)− f(x) +∑

Xi∈ΩM

wiei

)2

. (4.18)

The bias error term, ∑Xi∈ΩM

wif(Xi)− f(x),

Page 57: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.3 Optimal Weights 43

in (4.18) can be simplified using a second order Taylor series expansion of f at x,

f(Xi) ≈ f(x) +DTf (x)(Xi − x) + 12 (Xi − x)THf (x)(Xi − x)T ,

(4.19)

where Df (x) and Hf (x) denote the Jacobian and the Hessian of f with respect tox, respectively. As a consequence of the constraints (4.7) and (4.8), the constantterm and the linear term will vanish, and we have that∑

Xi∈ΩM

wif(Xi)− f(x) ≈∑

Xi∈ΩM

wiβi, (4.20)

where

βi = 12 (Xi − x)THf (x)(Xi − x). (4.21)

Hence (4.18) can be approximated as

MSEfJIT(x,w) ≈ E

( ∑Xi∈ΩM

wi(βi + ei)

)2

= (βTw)2 + λ · wTw, (4.22)

where we have introduced the more compact vector notation

w = (w1, w2, . . . , wM )T , (4.23)

and

β = (β1, β2, . . . , βM )T . (4.24)

As expected, the MSE formula splits into two parts: a squared bias error part,(βTw)2, and a variance error part, λ · wTw.

4.3.2 Optimizing the Weights

Knowing the MSE expression as a function of the weight sequence wi, it seemsreasonable to try to minimize this expression subject to the constraints. Since thisis a square optimization problem with linear constraints, it can easily be solved asa linear equation system. From basic calculus, we have the result that in order tominimize an object function w.r.t. a constraint function, the gradients of the twofunctions have to be parallel.

Using the notation introduced in (4.9), the constraints (4.7) and (4.8) can bereformulated as

g(w) = XTw − x = 0, (4.25)

Page 58: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

44 Chapter 4 Just-in-Time Models

where

X = (X1, X2, . . . , XM )T . (4.26)

By introducing Lagrange multipliers, µk, k = 1, . . . , n+ 1, we thus obtain

∂MSEfJIT(x,w)∂wi

+n+1∑k=1

µk∂gk(w)∂wi

= 0, ∀wi (4.27)

where gk(w) denotes the kth component of the vector valued constraint functiong(w) in (4.25). The constrained minimization of MSEfJIT(x,w) can then bestated in matrix form as(

2(ββT + λI) XXT 0

)(wµ

)=(

0x

)(4.28)

whereµ = (µ1, µ2, . . . , µn+1)T .

This is a linear equation system in M + n+ 1 equations and M + n+ 1 variables,from which the weights wi can be uniquely solved. Solving such a large equationsystem might seem to be a quite desperate thing to do, especially in an on-linesituation, but since the matrix in (4.28) has a simple structure, the solution canfortunately be computed more efficiently.

From (4.28) we have (wµ

)=(A X

XT 0

)−1(0x

), (4.29)

where A = 2(ββT + λI). From the block matrix inversion formula [26], it thenfollows that(

A XXT 0

)−1

=(A−1 −A−1XD−1XTA−1 A−1XD−1

D−1XTA−1 D−1

),

(4.30)

where

D = XTA−1X. (4.31)

Hence the weights can be computed as

w = A−1XD−1x = A−1X(XTA−1X

)−1x, (4.32)

where

A−1 =12

(ββT + λI)−1 =1

(I − ββT

λ+ βTβ

)(4.33)

follows from the matrix inversion lemma [26]:

(A+BCD)−1 = A−1 −A−1B(DA−1B + C−1)−1DA−1.

Page 59: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.3 Optimal Weights 45

4.3.3 Properties of the Weight Sequence

It may be of interest to investigate how the Hessian and the noise variance affectthe shape of the weight sequence w. In Figure 4.2 a scalar example is shown, wherethree weight sequences have been computed at x = 0 using different values of theHessians but with fixed noise variance. The predictor data Xi consist of 100points, equidistantly distributed on the interval [−1, 1].

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1−0.015

−0.01

−0.005

0

0.005

0.01

0.015

0.02

0.025

x

Wei

ghtswi

Figure 4.2 Optimal weights computed for different choices of HessiansHf (0) but with fixed variance λ = 0.1. Solid: Hf (0) = 1, Dashed: Hf (0) =0.1, Dash-dotted: Hf (0) = 0.

For a large value of the Hessian, which corresponds to a regression function witha high degree of curvature at x, we see that the weight sequence gets narrow (solidcurve), hence giving the measurements close to x a larger impact on the resultingestimate than those far away from x. On the other hand, for a small value ofthe Hessian, which is equivalent to an approximately linear regression function,the weight sequence tends to weight all measurements almost equally (dash-dottedcurve).

In Figure 4.3, the same experiment as in Figure 4.2 is repeated using a fixedHessian but now with varying noise variance. As shown, a large variance, whichcorresponds to a high degree of uncertainty in the measurements, results in a weightsequence that pays almost equal attention to all measurements (solid curve). Asmaller variance, which means that the measurements are a little more reliable,again leads to a narrow weight sequence that will favour the local measurements(dash-dotted curve).

Page 60: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

46 Chapter 4 Just-in-Time Models

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1−0.01

−0.005

0

0.005

0.01

0.015

0.02

x

Wei

ghtswi

Figure 4.3 Optimal weights computed for different values of the varianceλ but with fixed Hessian Hf (0) = 0.1. Solid: λ = 1, Dashed: λ = 0.1,Dash-dotted: λ = 0.01.

4.3.4 Comparison with Kernel Weights

It may also be interesting to compare the optimal weight sequence (4.32) with thoseused in the corresponding kernel methods described in Chapter 3. A very usefultool for doing such comparisons is the effective kernel [22]. As discussed in Section3.1, almost all nonparametric methods can, at least asymptotically, be written inthe form

f(x) =N∑i=1

wi(x)Yi.

The effective kernel at x is defined to be the set wi(x), i = 1, . . . , N . For theNadaraya-Watson estimator (3.10) we have

wi(x) =Kh(Xi − x)∑Ni=1Kh(Xi − x)

,

and for the local linear estimator (3.11),

wi(x) = N−1 s2(x, h) − s1(x, h)(Xi − x)Kh(Xi − x)s2(x, h)s0(x, h)− s2

1(x, h),

where

sr(x, h) = N−1N∑i=1

(Xi − x)rKh(Xi − x).

Page 61: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.3 Optimal Weights 47

Important insight into the properties of a particular estimator can be obtained byplotting the effective weights as a function of Xi.

Figure 4.4 illustrates the effective kernel functions of the Just-in-Time estimatorand the local linear kernel estimator (3.11) (using the Epanechnikov kernel), whenestimating f(x) = sin(2πx) at x = 0.27 using the observations

Yi = f(Xi) + ei, i = 1, . . . , 100

where Xi = i/100 and ei ∈ N(0,√

0.05). The bandwidth is chosen according tothe AMISE measure (3.27) which results in h = 0.099. The Just-in-Time weightsare represented by the dashed line and the effective local linear weights by thedashed-dotted line. The corresponding estimates are indicated with a circle and across respectively.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−1.5

−1

−0.5

0

0.5

1

1.5

x

f(x

)

(a)

0.18 0.2 0.22 0.24 0.26 0.28 0.3 0.32 0.34 0.36−0.5

0

0.5

1

1.5

2

x

f(x

)

(b)

Figure 4.4 A comparison between the Just-in-Time weights and the ef-fective local linear weights, when estimating f(x) = sin(2πx) at x = 0.27.(a) True regression function (solid), simulated data (dots), Just-in-Time es-timate (circle) and local linear estimate (cross). The effective weights forthe Just-in-Time estimator (dashed) and the local linear estimator (dash-dotted) are also plotted. (b) Magnification of the plot in (a).

A Monte-Carlo simulation, where the estimates are averaged over 1000 differentnoise realizations, is shown in Table 4.1. For comparison the Nadaraya-Watsonestimate is also included. As indicated, the Just-in-Time weights give a slightlybetter result than with the other two methods. We also see that the Nadaraya-Watson and the local linear estimates coincide. This will always be the case whenusing the fixed design setting, and when x is an interior point of the interval.

In Figure 4.5 the same experiment is repeated but at the boundary point x = 0.As shown, the Nadaraya-Watson estimator results in a large bias error since the

Page 62: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

48 Chapter 4 Just-in-Time Models

Method Absolute errorJust-in-Time estimator 0.064Local linear estimator 0.070Nadaraya-Watson estimator 0.070

Table 4.1 Result of a Monte-Carlo simulation with 1000 iterations whenestimating f(x) at x = 0.27.

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2−0.5

0

0.5

1

x

f(x

)

Figure 4.5 A comparison between the Just-in-Time, local linear, andNadaraya-Watson weights, when estimating f(x) = sin(2πx) at the bound-ary point x = 0. True regression function (solid), simulated data (dots),Just-in-Time estimate (circle), local linear estimate (cross), and Nadaraya-Watson estimate (star). The Just-in-Time weights (dashed), the local linearweights (dash-dotted), and the Nadaraya-Watson weights (dotted) are alsoplotted.

weights are always positive. The Just-in-Time and local linear weights which areallowed to take negative values give almost equivalent results.

The result of a Monte-Carlo simulation using 1000 iterations is given in Table4.2. As indicated, the Just-in-Time and the local linear weights give essentiallythe same prediction error at the boundary.

Page 63: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.4 Estimation of Hessian and Noise Variance 49

Method Absolute ErrorJust-in-Time estimator 0.009Local linear estimator 0.010Nadaraya-Watson estimator 0.180

Table 4.2 Result of a Monte-Carlo simulation with 1000 iterations whenestimating f(x) at the boundary point x = 0.

4.4 Estimation of Hessian and Noise Variance

In Section 4.3 we saw that in order to compute the (in mean square sense) optimalweights, we need to know the Hessian Hf (x) of the true regression function andthe noise variance λ. In general, these quantities are not a priori known, and wehave to estimate them from the observations (Yi, Xi) in some way. This can ofcourse be done in a number of different ways, but we will here mainly focus on twomethods – using a linear least squares and using a weighted mean approach.

4.4.1 Using Linear Least Squares

A straightforward approach is to start with the Taylor series expansion (4.19),which can be rewritten

f(Xi) ≈ f(x) +DTf (x)(Xi − x) + 12 trHf (x)(Xi − x)(Xi − x)T

= f(x) +DTf (x)(Xi − x) + 1

2 vecT Hf (x) vec(Xi − x)(Xi − x)T ,(4.34)

where vec Hf (x) denotes the vector ofHf (x). SinceDf (x) andHf (x) enter linearlyin (4.34) andHf (x) is a symmetric matrix, Hf (x) can be estimated by least squarestheory as follows. Introduce the parameter and regression vectors

ϑ = (f(x), DTf (x), vechT Hf (x))T , (4.35)

φi = (1, (Xi − x)T , 12 vechT (Xi − x)(Xi − x)T DT

nDn)T , (4.36)

where vech Hf (x) denotes the vector-half of the symmetric matrix Hf (x) [24], i.e.,the vector of size 1

2n(n + 1) obtained from vec Hf (x) by eliminating the above-diagonal entries in Hf (x), and Dn is the duplication matrix of order d [32], i.e., then2 × 1

2n(n+ 1) matrix of zeros and ones such that

Dn vech Hf (x) = vecHf (x).

For example, the duplication matrix of order 2 is given by

D2 =

1 0 00 1 00 1 00 0 1

.

Page 64: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

50 Chapter 4 Just-in-Time Models

We can now state the estimation of Hf (x) (and f(x) and Df (x)) as a linearleast squares problem,

VM (ϑ,ΩM ) =∑

Xi∈ΩM

(Yi − φTi ϑ)2, (4.37)

ϑ = arg minϑ

VM (ϑ,ΩM ). (4.38)

This can be efficiently implemented using the recursive least squares (RLS) algo-rithm.

By this estimation approach, we also get an estimate of the noise variance λ.From Chapter 16 in [30] we have that

VM (ϑ,ΩM ) ≈Mλ− dλ,

where d = dimϑ. Hence a noise variance estimate is given by

λ = VM (ϑ,ΩM )1

M − d. (4.39)

It is worth pointing out that when using the least squares method of estimatingthe Hessian, we also get an initial estimate of the regression function f(x), sinceit is part of the regression vector (4.35) . However, this estimate is in generalmuch worse than the weighted average in (4.6). The reason is that we in (4.38) userelatively few data samples to estimate a large amount of parameters.

4.4.2 Using a Weighted Mean

In the spirit of nonparametric estimation, another way of estimating the Hessiancould be to use a weighted criterion

Hf (x) =∑

Xi∈ΩM

wiyi, (4.40)

using a symmetric n×n weight matrix wi with components w(k,l)i , with the imposed

constraints ∑Xi∈ΩM

w(k,l)i = 0, (4.41)

∑Xi∈ΩM

w(k,l)i (Xi − x) = 0 (4.42)

and ∑Xi∈ΩM

w(k,l)i (Xij − xj)(Xis − xs)T = δkjδls + δksδlj (4.43)

Page 65: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.5 Data selection 51

where δkj , δls, δlj and δks denote Kronecker symbols. Assuming that f belongs tothe class of two times continuously differentiable functions having a given Lipschitzconstant L for the Hessian,

|Hf (Xi)−Hf (x)| ≤ L|Xi − x|, (4.44)

the mean square error of the (k, l)th component in Hf (x) can be written [36]

E(Hf (x)(k,l) −Hf (x)(k,l)

)2

≤ L2

4

( ∑Xi∈ΩM

|w(k,l)i | · |Xi − x|3

)2

+ λ∑

Xi∈ΩM

(w

(k,l)i

)2

. (4.45)

On the basis of this evaluation one may choose the weights w(k,l)i for the Hessian

estimation by minimizing the right hand side of (4.45) under the constraints (4.41),(4.42) and (4.43).

4.5 Data selection

One question remains yet to be answered; what is the optimal size and shape ofthe neighborhood ΩM?

4.5.1 Neighborhood size

When estimating the Hessian using the least squares approach (4.37) and (4.38),it is clear that a small neighborhood would give the measurement noise a largeinfluence on the resulting estimate ϑ. On the other hand, a large neighborhoodwould make the Taylor expansion (4.19) inappropriate and would introduce a biaserror. A reasonable choice of region size could therefore be obtained as a trade-off between the bias error and the variance error when performing the Hessianestimation.

A commonly used approach in statistics and system identification is to evaluatethe loss function (4.37) on completely new datasets Ω′M , and choose Mopt as theM that minimizes VM (ϑ,Ω′M ) (so called cross-validation). Adopting this conceptto our framework, we get

Mopt = arg minM

VM (ϑ,Ω′M ) = arg minM

∑Xi∈Ω′

M

(Yi − φTi ϑ)2. (4.46)

In this context it would be more desirable to determine M by evaluating VM (·) onthe same data ΩM as used in estimation, since we do not want to wast measure-ments without cause. However, this approach yields a problem, since the estimatefor small values of M will adapt to the local noise realization. Hence when applyingthe estimate to the same data as used for estimation, the loss function will becomean increasing function in M , i.e., the optimal M will be the smallest one. A number

Page 66: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

52 Chapter 4 Just-in-Time Models

of methods has therefore been developed that penalize the loss function VN (ϑ,ΩM )for small M such that it imitates what we would have obtained if we had appliedthe evaluation on fresh data. One such method is Akaike’s Final Prediction Error(FPE) [1],

WFPEM = VM (ϑ,ΩM )

1 + d/M

1 − d/M , (4.47)

where d = dim ϑ = 1 + n+ 12n(n+ 1). We thus have a method of determining the

region size Mopt as

Mopt = arg minM

VM (ϑ,ΩM )1 + d/M

1− d/M . (4.48)

Note that this function is minimized w.r.t. M , and not w.r.t. the number of pa-rameters d as usual in the area of model structure selection which is its originallyintended application. In [7] and [38], the related Mallow’s Cp criterion is used toget a good bias/variance trade-off.

It may be illustrative to show how the FPE criterion (4.48) works in practice,using a linear and a nonlinear regression function.

Example 4.1 (Linear regression function)Consider the regression function

f(x) = 2x+ 1, x ∈ [−1, 1], (4.49)

which is depicted (dashed curve) in Figure 4.6 (a). The data (represented by stars)are generated in accordance to

Yi = f(Xi) + ei, i = 1, . . . , 41,

where Xi = (i − 21)/20, and the ei are independent Gaussian random variableswith Eei = 0 and Var ei = 0.2.

A Just-in-Time estimate is computed at x = 0 (represented by a circle), and theresulting FPE loss function is shown in Figure 4.6 (b).

For small M we get a large variance error due to the noise in the measurements.When M increases the variance error decays, but since the true regression functionis linear there will be no bias error, and the FPE loss function continues to decreasefor increasing M . Hence the optimal neighborhood size is selected as M = 41, i.e.the entire data set. Note that (4.47) is singular for M ≤ 3 since three parametersare estimated. Hence it is plotted only for M ≥ 4.

Page 67: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.5 Data selection 53

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1−1

−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

X

Y

(a)

0 5 10 15 20 25 30 35 400

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

M

WM

(ϑ,Ω

M)

(b)

Figure 4.6 An example of the FPE loss function when using a linear re-gression function. (a) True function (dashed), noisy data (stars), and JITestimate (circle). (b) Resulting FPE loss function.

Example 4.2 (Nonlinear regression function)Consider the nonlinear regression function

f(x) = sin(4x) x ∈ [−1, 1], (4.50)

which is depicted (dashed curve) in Figure 4.7 (a). The data (represented by stars)are generated using

Yi = f(Xi) + ei, i = 1, . . . , 41,

where Xi = (i − 21)/20, and the ei are independent Gaussian random variableswith Eei = 0 and Var ei = 0.2.

A Just-in-Time estimate is computed at x = 0 (represented by a circle), and theresulting FPE loss function is displayed in Figure 4.7 (b).

The regression function is approximately linear near x = 0, which gives that theFPE loss function initially decreases for increasing M . However, at M ≈ 13, thebias error term starts to grow, which yields that the loss function will increasefor M > 13. The FPE criterion hence results in a neighborhood size selected asM = 13.

4.5.2 Neighborhood shape

We have so far just considered Euclidian norms when retrieving the M closestregression vectors Xi that define ΩM . This leads to a spherical shape of ΩM .

Page 68: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

54 Chapter 4 Just-in-Time Models

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1−2

−1.5

−1

−0.5

0

0.5

1

1.5

X

Y

(a)

0 5 10 15 20 25 30 35 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

M

WM

(ϑ,Ω

M)

(b)

Figure 4.7 An example of the FPE loss function when using a nonlinearregression function. (a) True function (dashed), noisy data (stars), and JITestimate (circle). (b) Resulting FPE loss function.

Another possibility though, could be to use a norm that adapts to the propertiesof data. The neighborhood ΩM would then have the shape of an ellipsoid. Suchscaling is very important when the signal components in Xi have very differentmagnitudes. For example, if the predictor variables Xi consist of two components,where Xi1 is being measured in milli-Volt and Xi2 is being measured in kilo-Volt,a norm of the type

‖Xi − x‖ =∣∣∣∣(103 0

0 10−3

)(Xi − x)

∣∣∣∣can be used.

4.6 The Just-in-Time Algorithm

We propose a four-step algorithm which can be summarized as follows:

Algorithm 4.1 (The Just-in-Time Algorithm)

Input: A dataset (Yi, Xi)Ni=1 and an operating point x.

Output: A Just-in-Time estimate fJIT(x).

1. Sort the regressors in ascending order of the distance to x.

Page 69: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.7 Properties of the Just-in-Time Estimator 55

2. Estimate the Hessian Hf (x) using the linear least squares method

VM (ϑ,ΩM ) =∑

Xi∈ΩM

(Yi − φTi ϑ)2,

ϑ = arg minϑ

VM (ϑ,ΩM ),

where φi and ϑ are defined as in (4.35) and (4.36), using the M in Euclidiandistance closest regression vectors. The neighborhood size M is chosen byAkaike’s FPE method:

M = arg minM

VM (ϑ,ΩM )1 + d/M

1− d/M .

Compute the noise variance as

λ = VM (ϑ,ΩM )1

M − d.

3. Use the result from Step 2 to compute weights,

w = A−1X(XTA−1X

)−1x,

where

A−1 =1

(I − ββT

λ+ βTβ

),

and β, X and x are defined according to (4.24) and (4.26) respectively.

4. Form the resulting estimate as a weighted mean of the M corresponding re-sponse variables, i.e.,

fJIT(x) =∑

Xi∈ΩM

wiYi.

The estimate of f(x) from Step 2 can be used by itself, but Step 4 may increasethe accuracy even when the function f(·) is purely quadratic.

4.7 Properties of the Just-in-Time Estimator

It is clear that the Just-in-Time method in each estimation point requires a consid-erable amount of computations. Therefore it does not seem to have much to offerin the linear case where parametric models relatively easy can be estimated.

Nevertheless, it indeed has some interesting features in the nonlinear modelingcase. Although the algorithm is quite demanding in matter of computational effort,it is still comparable with other nonlinear modeling techniques. Common methodslike neural networks generally require considerable computational work in order to

Page 70: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

56 Chapter 4 Just-in-Time Models

train the networks. Since this training also concerns nonlinear optimization, it alsoleads to the problems of local minima etc, that were discussed in Chapter 2.

The largest distinction between Just-in-Time and traditional methods is thatthe Just-in-Time method is computationally intensive mainly when it is used. Thisis in contrast to traditional methods which typically are computationally intensivewhen the models are estimated, but not when the models are actually used. Thisimplies an advantage for the Just-in-Time approach in some considerations. Forinstance, if the number of data increases, one usually can afford a larger model. Thisincrease in model size is done automatically in the Just-in-Time method, whereasother approaches in principle has to start over again and estimating completelynew models.

The greatest advantage with the Just-in-Time algorithm is that it is an almostdesign variable free method. The only user choices required are which variablesto include in the regression vector Xi, and which norm to use when searching fornearest neighbors. Compare this to the parametric case where the user also has todecide upon a certain structure. The fact that the prediction is optimized locally,is also a great advantage compared to global methods, which normally optimizethe performance over the entire regressor space.

A possible drawback, however, is that the method can be sensitive to the dis-tribution of data. In regions of the regressor space where the distribution is sparseand the noise variance is high, bad estimates may be obtained. Similar problemshave also been noticed when estimating the regression function at the boundary ofthe regressor space. This is therefore clearly a subject for future research.

4.7.1 Computational Aspects

It may be appropriate to say something about the computational aspects of thealgorithm. According to the derivation of the Just-in-Time estimator in Section4.2, the computational work required in the algorithm can essentially be dividedinto three parts

1. Data set searching

2. Hessian estimation

3. Weight computation

We shall here briefly discuss these items in order.

Data set searching

So far, we have not paid much attention to the dataset searching problem in thealgorithm, but it can be worth to mention in a few words. Currently the Just-in-Time algorithm is implemented in Matlab [33] and uses a sorting routine thatis essentially equivalent to linear search. For real industrial applications, this isnot acceptable. Searching is in general a problem since there are lack of good data

Page 71: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

4.7 Properties of the Just-in-Time Estimator 57

structures that allow nearest neighbors to be retrieved fast and efficiently. The bestknown searching methods require a search that is exponential in the dimension ofthe regressor space, and they effectively reduce to brute-force, exhaustive searchesfor dimensions n greater than 20, on databases that consist of up to several millionsamples [9]. However, research interest for methods that perform approximateneighborhood searching has been growing in recent years [40].

It would be more than desirable to make use of a database management system(DBMS) when working with such huge data volumes. In recent years, databasesystem research in the computer science area has resulted in powerful main memorydatabase systems, which enable very large volumes of data to be stored in mainmemory of the computer [16]. Parts of the data can then be retrieved very efficientlyby queries to the database system, formulated in high level so-called structuredquery languages (SQL) [13].

Hessian estimation

Using the linear least squares approach (4.37) when estimating the Hessian, theparameter vector estimate ϑ can be efficiently computed using the recursive leastsquares (RLS) algorithm [30]:

ϑk = ϑk−1 + Lk(yk − φTk ϑk−1) (4.51)

Lk =Pkφk

1 + φTk Pk−1φk(4.52)

Pk = Pk−1 −Pk−1φkφ

Tk Pk−1

1 + φTk Pk−1φk(4.53)

It is also natural to implement the computation like this due to the iterative natureof the Just-in-Time algorithm. We start with a small neighborhood around x, andexpand it with more and more measurements (M increases) until we get the besttrade-off between bias error and variance error. According to [19], Appendix B.3,the loss function can also be updated incrementally,

VM (ϑ,ΩM ) =M∑k=1

(yk − φTk ϑM )2 + (ϑ− ϑM )TP−10 (ϑ− ϑM )

=M∑k=1

(yk − φTk ϑk−1)2

φTk Pk−1φk + 1(4.54)

Weight computation

A major computational part of the algorithm is the constrained optimization ofthe weight sequence. As we saw in Section 4.3.4 this step in general increasesthe performance compared to the corresponding kernel methods. However, theprice we pay for this is that the number of computations increases drastically forlarge neighborhood sizes M . Thus we have an upper limit of the neighborhood

Page 72: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

58 Chapter 4 Just-in-Time Models

size for the algorithm to be practically implementable. A reasonable solution tothis problem is to use the optimal weight sequence for M smaller than a certainnumber, and to turn to the corresponding kernel type sequences for M larger thanthis number.

The weights can be efficiently computed using the formula (4.32). This willrequire about M2 × (n+ 1) operations provided M (n+ 1).

4.7.2 Convergence

In Chapter 3 we mentioned that the rate of convergence w.r.t.N in general is slowerfor nonparametric models as compared to parametric models. Using parametricmethods, the mean square error decays asN−1 while using nonparametric methods,the mean square error decays as N−4/(n+4).

This is indeed true also for the Just-in-Time method, and in particular we showin Chapter 5 that the rate is

MSEfJIT(x) ∼ C ·N−4/(n+4) (4.55)

which is equivalent to using a local linear kernel estimator with an optimal kernelfunction, for instance the Epanechnikov kernel.

Page 73: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

5Asymptotic Properties of the

Just-in-Time Estimator

In Chapter 4, the Just-in-Time estimator was introduced as a method of gettinga local estimate of a regression function f at a point x, on the basis of noisyobservations (Xi, Yi)Ni=1. The estimator was constructed as a weighted averageof the Yi measurements in a neighborhood around the point x, where the weightswere obtained by minimizing the pointwise mean square error (MSE) subject tocertain constraints on the weights. These constraints were chosen in such a waythat the resulting estimate was equivalent to that of fitting a hyperplane to thedata in the local neighborhood.

In order to compare the Just-in-Time estimator with corresponding statisticalmethods, it may be of interest to investigate the consistency and convergence prop-erties of the resulting estimate fJIT(x). For example: Does it converge to the trueregression function when the number of samples N tends to infinity, and if so, howfast does it converge? Investigations concerning this type of questions will be thesubject of this chapter, which is based on [45].

The investigation is carried out in the scalar case in Section 5.1, mainly becauseof its simpler notation. However, it is likely that similar results can be achieved inthe multidimensional case.

5.1 Asymptotic Properties of the Scalar Estima-tor

In this section we investigate the asymptotical properties of Just-in-Time modelsfor the univariate (scalar) case where n = dim x = 1. To be able to compare our

59

Page 74: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

60 Chapter 5 Asymptotic Properties of the Just-in-Time Estimator

result with the corresponding asymptotic result for kernel estimators presented inSection 3.4, we here consider the same design setting. Hence we assume that theregression relation is modeled as

Yi = f(Xi) + ei, i = 1, . . . , N, (5.1)

where Xi = i/N , and ei are independent identically distributed random variableswith zero means and variances λ.

In Chapter 4 we introduced the Just-in-Time estimator/smoother as

fJIT(x) =∑

Xi∈ΩM

wiYi, (5.2)

where ΩM denotes a neighborhood of x containing M data points, and wherethe weight sequence wi is obtained by minimizing the pointwise mean squareprediction error

MSEfJIT(x,w) = E(fJIT(x,w) − f(x))2, (5.3)

subject to the constraints∑Xi∈ΩM

wi = 1, and∑

Xi∈ΩM

wi(Xi − x) = 0. (5.4)

These constraints are motivated by the fact that the Just-in-Time estimator thenproduces an unbiased estimate in the case that the true regression function is linearin the neighborhood ΩM .

The consistency of (5.2) and the speed of which the MSE (5.3) tends to zeroas a function of the sample size N , are given in the following proposition. Forsimplicity and to enable comparison with the kernel estimator result presented inTheorem 3.1, it is stated for the fixed design case, but it can be shown that italso holds for the random design case where the Xi’s are uniformly distributed on[0, 1].

Proposition 5.1 Consider the fixed equally spaced design regression model

Yi = f(Xi) + ei, i = 1, . . . , N

where Xi = i/N and ei are i.i.d. random variables with zero mean and variance λ.Assume that:

(i) The second order derivative f ′′(x) is continuous on [0, 1].

(ii) The neighborhood ΩM around x contains M = MN data and is defined as

ΩM = Xi : |Xi − x| ≤ hN.

(iii) The bandwidth parameter h = hN is a sequence satisfying hN → 0 andNhN →∞ as N →∞.

Page 75: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

5.1 Asymptotic Properties of the Scalar Estimator 61

(iv) The estimation point x is located at the grid and is an interior point of theinterval, i.e., x = l/N for some integer l satisfying hNN ≤ l ≤ (1− hN )N .

Let fJIT(x) be a Just-in-Time estimate according to (5.2) with weights satisfying(5.4), and let the neighborhood size be chosen as

MN ∼ 2(

15λ(f ′′(x))2

)1/5

N4/5. (5.5)

Then

MSEfJIT(x) ∼ 34

((f ′′(x))2λ4

15

)1/5

N−4/5. (5.6)

Proof Let hN be the bandwidth, i.e., the radius from x that defines ΩM . ThenM = MN = b2hNNc is the number of data in the neighborhood ΩM .

Introduce vectors

e = (1, . . . , 1)T ,

α = (α1, . . . , αMN )T ,

β = (β1, . . . , βMN )T ,

w = (w1, . . . , wMN )T ,

where

αi = Xi − x, and βi = 12f′′(x)α2

i . (5.7)

According to Section 4.3.1, the mean square error (5.3) is then given by

MSEfJIT(x,w) =

(∑Xi∈Ω

wif(Xi)− f(x)

)2

+ λ∑Xi∈Ω

w2i

∼ (βTw)2︸ ︷︷ ︸bias2

+λwTw︸ ︷︷ ︸variance

, (5.8)

where the similarity follows as a consequence of the constraints (5.4) and a secondorder Taylor expansion of f(·) at x. The error made in the Taylor expansionvanishes asymptotically since hN → 0. We now want to minimize the right handside of (5.8) subject to the constraints (5.4). Define a Lagrange function L as

L =12

MSEfJIT(x,w) + µ · (eTw − 1) + ρ · (αTw) (5.9)

Then

∂L∂wi

= λwi + (βTw)βi + µ+ αiρ = 0, ∀i. (5.10)

Page 76: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

62 Chapter 5 Asymptotic Properties of the Just-in-Time Estimator

Introduce the notation

βTw = κ, (5.11)

for the bias error term in (5.8). Hence

wi = − 1λ

(κβi + µ+ αiρ), (5.12)

and we get the equation systemeTw = − 1

λ(κ · βT e+ µ ·MN + ρ · eTα) = 1

αTw = − 1λ

(κ · βTα+ µ · eTα+ ρ · αTα) = 0

βTw = − 1λ

(κ · βTβ + µ · eTβ + ρ · βTα) = κ

(5.13)

for κ, µ and ρ.

The odd moments of αi in (5.13) vanish asymptotically, since

eTα =∑Xi∈Ω

(Xi − x) = O(MN/N) = O(hN )→ 0 as N →∞,(5.14)

and similarly

βTα =f ′′(x)

2

∑Xi∈Ω

(Xi − x)3 = O(M3N/N

3) = O(h3N )→ 0 as N →∞.

(5.15)

For the even moments of αi we have

eTβ =f ′′(x)

2

∑Xi∈Ω

(Xi − x)2 = f ′′(x)MN/2∑k=1

(k

N

)2

∼ f ′′(x)M3N

24N2,

(5.16)

αTα =∑Xi∈Ω

(Xi − x)2 ∼ M3N

12N2, (5.17)

and

βTβ =(f ′′(x))2

4

∑Xi∈Ω

(Xi − x)4 =(f ′′(x))2

2

MN/2∑k=1

(k

N

)4

∼ (f ′′(x))2M5N

320N4.(5.18)

See Appendix 5.A for the series formulas. Hence, when inserting equations (5.14)to (5.18), the equation system (5.13) has the asymptotic solution

ρ = 0

κ =30 f ′′(x)M2

NλN2

720λN4 + (f ′′(x))2M5N

µ = −9 λ

((f ′′(x))2M5

N + 320λN4)

4MN (720λN4 + (f ′′(x))2M5N )

(5.19)

Page 77: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

5.1 Asymptotic Properties of the Scalar Estimator 63

From (5.12) it follows that

w = −λ−1(κ · β + µ · e+ ρ · α) ∼ −λ−1(κ · β + µ · e). (5.20)

The variance error is thus given by

λwTw ∼ −(κ · wTβ + µ · wTw) = −(κ2 + µ). (5.21)

Hence (5.8) gives

infw

MSEfJIT(x,w) = −µ =9 λ((f ′′(x))2M5

N + 320λN4)

4MN (720λN4 + (f ′′(x))2M5N).

(5.22)

By substituting MN from (5.5) we thus finally arrive at

infw

MSEfJIT(x,w) ∼ 34

((f ′′(x))2λ4

15

)1/5

N−4/5, (5.23)

and (5.6) is proved. 2

The MSE formula in (5.22) is a decreasing function of MN , but for the given value(5.5) it has a stationary point. This can be explained as follows: If f(x) is aquadratic function, i.e., the second order derivative f ′′(x) is constant for all x, theTaylor expansion will be valid over the entire interval. The optimal neighborhoodshould thus be chosen as MN = N . However, the Taylor expansion is in most casesonly valid locally, i.e., if hN → 0. Then the stationary point will give a good choiceof MN since hN → 0 as N →∞.

From this result we can make the following observations; Equation (5.6) indi-cates that the mean square error (MSE) tends to zero as the number of samples Ntends to infinity, i.e. the Just-in-Time estimate (5.2) is a consistent estimate of thetrue regression function. It also shows that the speed of which the MSE decays isin the order of N−4/5, and that we thus achieve the same rate of convergence forthe Just-in-Time estimator, as using the local linear estimator (3.11). See equation(3.25) for comparison. From (5.5) we have that in order to obtain this convergencerate, the neighborhood size has to be chosen according to MN ∼ N4/5.

It is worth to point out that the result of the proposition, as in the kernelregression case, is valid only when x is an interior point of the interval. At theboundary the convergence rate usually gets slower.

Next we will motivate that Proposition 5.1 also holds for the random designcase when the regressor variablesXi are uniformly distributed on the interval [0, 1].Then the sums in (5.14) to (5.18) can be approximated with expectation accordingto ∑

Xi∈ΩM

(Xi − x)k ≈MN · E(Xi − x)k,

Page 78: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

64 Chapter 5 Asymptotic Properties of the Just-in-Time Estimator

where (Xi − x) is uniformly distributed on [−hN , hN ]. From Appendix 5.B, witha = −hN and b = hN , we also have that

MN · E(Xi − x) = 0,

MN · E(Xi − x)2 = MNh2N

3=

M3N

12N2,

MN · E(Xi − x)3 = 0,

MN · E(Xi − x)4 = MNh4N

5=

M5N

80N4.

Hence equations (5.14) to (5.18) still hold, which implies that the proposition isvalid even for the uniform random design case. Since all other distributions, atleast locally, can be seen as a uniform distribution, we have that the result to someextent also is applicable for general random design settings.

It may be of interest to investigate the shape of the optimal weight sequence inthe asymptotic case. From (5.20) we have the expression

wi ∼ −λ−1(κβi + µ) = −µλ

(1 +

κ

µβi

)= −µ

λ

(1 +

κf ′′(x)2µ

(Xi − x)2

)= c

(1− 5

3(f ′′(x))2h3

NN

(f ′′(x))2h3NN + 10λ/h2

N

(Xi − xhN

)2).

A plot of this function with x = 0, c = 1, λ = 0 and hN = 1 is given in Figure5.1 together with the Epanechnikov kernel function (3.9). Both functions are ofparabolic shape, but while the Epanechnikov kernel function is positive for all xin the interval, the Just-in-Time weight sequence takes negative values for |x| ≥√

3/5 ≈ 0.775.

Appendix 5.A: Some Power Series Formulas

The expressions for∑n

i=1 ik, k = 1, 2, 3, 4 can be summarized as follows:

n∑i=1

i =n2

2+n

2

n∑i=1

i2 =n3

3+n2

2+n

6

n∑i=1

i3 =n4

4+n3

2+n2

4

Page 79: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

5.1 Asymptotic Properties of the Scalar Estimator 65

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x

wei

ghts

Figure 5.1 Asymptotic optimal Just-in-Time weight sequence (solid) andthe Epanechnikov kernel function (dashed).

n∑i=1

i4 =n5

5+n4

2+n3

3− n

30

Appendix 5.B: Moments of a Uniformly DistributedRandom Variable

If a random variable X is uniformly distributed on the interval [a, b], then

EXk =1

k + 1(bk + bk−1a+ . . .+ bak−1 + ak) (5.24)

See for instance [11].

Page 80: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

66 Chapter 5 Asymptotic Properties of the Just-in-Time Estimator

Page 81: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

6Applications to Dynamical

Systems

In Chapter 4, the Just-in-Time estimator/smoother was introduced as a method ofobtaining nonparametric estimates of the function f(·) in the regression relationship

Yi = f(Xi) + ei, (6.1)

given noisy measurements (Xi, Yi). In this chapter we shall apply this conceptto the field of nonlinear system identification, which can be seen as a special caseof (6.1) where the variables Xi and Yi represent inputs and outputs of a dynamicalsystem.

We start in Section 6.1 by giving a brief review of the nonlinear system iden-tification problem, and how the Just-in-Time method can be applied to it. Thefollowing four sections illustrate the concept with examples. Section 6.2 investi-gates how the method performs on a simulated linear system. In Section 6.3 thisexample is extended to the nonlinear domain by applying a static nonlinearity onthe output. In the last two sections we apply the algorithm on two real data ex-amples; Section 6.4 considers a laboratory-scale tank level system, and Section 6.5the modeling of a water heating system.

6.1 Nonparametric System Identification

An important application for Just-in-Time models is within nonlinear time domainsystem identification. As mentioned in Chapter 2, system identification is a specialcase of the general regression relationship (6.1), where Yt represents the outputfrom a dynamical system and Xt = ϕ(t) consists of past inputs and outputs of the

67

Page 82: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

68 Chapter 6 Applications to Dynamical Systems

system,

ϕ(t) = (y(t− 1), . . . , y(t− na), u(t− nk), . . . , u(t− nb − nk + 1))T .(6.2)

Here na denotes the number of past outputs, nb denotes the number of past inputs,and nk is the time delay between the input and output signal.

Referring to Chapter 2, it is very common within the system identification areato model the function f(·) using parametric black-box models. In this chapterhowever, we shall instead concentrate on a nonparametric modeling approach, andin particular the Just-in-Time method introduced in Chapter 4.

We thus assume that all available observations up to time t are stored in adatabase as pairs,

y(k), ϕ(k), k = −∞, . . . , t− 1,

where y(k) ∈ R and ϕ(k) ∈ R(na+nb). If a prediction of the output is needed attime instant t, the regression vectors ϕ(k) located in a neighborhood around ϕ(t)are retrieved from the database, and the prediction y(t) is computed using thecorresponding y(k) values. This can conceptually be described as in Figure 6.1.Using the Just-in-Time algorithm, this implies that the prediction is formed as a

Databasey(k), ϕ(k)

JIT-estimator y(t)ϕ(t)

Figure 6.1 The Just-in-Time estimator.

weighted average of the outputs y(k) in the neighborhood,

y(t|t− 1) =t−1∑

k=−∞wky(k), (6.3)

where the weights are optimized such that the mean square prediction error (MSE)

Ey(t|t− 1)− f(ϕ(t))2

is minimized in a pointwise sense. As discussed in Section 4.3.3, this typicallyresults in that the measurements located close ϕ(t) are getting a larger influence

Page 83: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

6.2 A Linear System 69

on the prediction than those located far away from it. The weight sequence wkwill depend on the Hessian of f(·) and the noise variance λ. When estimatingthese quantities, the size of the neighborhood around ϕ(t) will be determined as abias/variance trade-off.

6.2 A Linear System

In this section the Just-in-Time algorithm is applied to simulated linear dynamicsystem. The system considered is the so-called Astrom system, which has earlierbeen investigated in [30]. It is defined according to

y(t)− 1.5y(t− 1) + 0.7y(t− 2) = u(t− 1) + 0.5u(t− 2) + e(t). (6.4)

The system was simulated for t = 1, 2, . . . , 1000 using a Gaussian distributed whitenoise input u(t) with variance 1. The noise sequence e(t) was taken as Gaussiandistributed white noise with zero mean and variance λ = 1. A subset of thesimulated data set is displayed in Figure 6.2.

0 20 40 60 80 100 120 140 160 180 200−3

−2

−1

0

1

2

0 20 40 60 80 100 120 140 160 180 200−20

−10

0

10

20

t

u(t

)

t

y(t

)

Figure 6.2 A subset of the input-output data generated using the Astromsystem.

The data set was used to create an estimation database of the form (y(k), ϕ(k)),where

ϕ(t) = (y(t− 1), y(t− 2), u(t− 1), u(t− 2))T .

Using this estimation database, a simulation was made using a new Gaussian inputsignal and the Just-in-Time algorithm. The result is represented by the dashedcurve in Figure 6.3 (a). The solid curve is the true noiseless output obtained fromthe system (6.4).

Page 84: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

70 Chapter 6 Applications to Dynamical Systems

0 10 20 30 40 50 60 70 80 90 100−15

−10

−5

0

5

10

t

y(t

)

(a)

0 10 20 30 40 50 60 70 80 90 100−10

−8

−6

−4

−2

0

2

4

6

8

10

t

y(t

)(b)

Figure 6.3 (a) Result of a Just-in-Time simulation using an estimationdatabase consisting of 1000 samples generated from the Astrom system (6.4).Solid: True output. Dashed: Simulated output. (b) Result of a simulationusing a linear ARX model with the same regressors. Solid: True output.Dashed: Simulated output.

For comparison, a linear ARX model with the same regressor configurationwas also tried. The simulation result is shown in Figure 6.3 (b). As shown thelinear ARX model performs slightly better, since it in contrast to the Just-in-Timemodel is optimized over the entire regressor space. The discrepancy in the Just-in-Time simulation case is due to that the Just-in-Time estimator at some ϕ(t)-pointsadapts to the local properties of the noise sequence e(t). However, the predictionerrors are quite small, so the Just-in-Time estimator gives an acceptable result.

6.3 A Nonlinear System

Consider again the Astrom system (6.4) but suppose that a static nonlinearity ψ(·)is applied on the output, i.e.

y(t) = ψ(y(t)).

The nonlinearity is assumed to be given by

ψ(x) = sgn(x)√|x| (6.5)

and is depicted in Figure 6.4.A data set consisting of 1000 samples of u(t) and y(t) was generated, again

using a Gaussian input u(t) with variance 1 and a Gaussian noise sequence e(t)with variance λ = 0.1. This data set was used to define an estimation database inthe same fashion as in Section 6.2. The simulation results, using a Just-in-Time

Page 85: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

6.3 A Nonlinear System 71

−5 −4 −3 −2 −1 0 1 2 3 4 5−2.5

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

x

ψ(x

)

Figure 6.4 The static nonlinearity ψ(·).

model and a linear ARX model, are shown in Figure 6.5 (a)-(b). In this case theJust-in-Time model gives a significantly better result than the linear one. The rootmean square error (RMSE) is 0.3823 compared to 0.8253 for the linear model.

0 10 20 30 40 50 60 70 80 90 100−4

−3

−2

−1

0

1

2

3

4

t

y(t

)

(a)

0 10 20 30 40 50 60 70 80 90 100−5

−4

−3

−2

−1

0

1

2

3

4

t

y(t

)

(b)

Figure 6.5 (a) Result of a Just-in-Time simulation, using an estimationdatabase consisting of 1000 samples generated from the Astrom system witha static nonlinearity applied on the output. Solid: True output, Dashed:Simulated output. (b) Result of a simulation using a linear ARX modelwith the same regressors. Solid: True output. Dashed: Simulated output.

Page 86: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

72 Chapter 6 Applications to Dynamical Systems

H = 35 [cm]

u(t)

h(t)

0

Figure 6.6 A laboratory-scale tank system.

6.4 Tank Level Modeling

In this example we will use the Just-in-Time algorithm to simulate the liquid levelof a laboratory-scale tank system as depicted in Figure 6.6. The system has earlierbeen investigated in [29], and was briefly discussed in the Example 1.1. The aimof the modeling is to describe how the level h(t) changes with the voltage u(t)that controls the pump. It is easily shown that this system is nonlinear. Thechange in level depends on the difference between the in- and outflow. The inflowis proportional to the pump voltage u(t). The outflow can by means of Bernoulli’slaw be approximated as proportional to the square root of h(t). Hence the truesystem is approximately of the form

h(t) = θ1h(t− 1) + θ2u(t− 1) + θ3

√h(t− 1). (6.6)

Two data records of 1000 samples each, one for estimation and one for valida-tion, were available. A plot of the estimation data set is shown in Figure 6.7. Theregression vector was defined using one past value of the water level and one pastvalue of the pump voltage, i.e.

ϕ(t) = (h(t− 1), u(t− 1))T . (6.7)

A simulation with the Just-in-Time approach and a linear ARX model, usingthe voltage signal u(t) from the validation dataset, gives the result illustrated inFigure 6.8. As shown, the linear model performs bad at low water levels (dottedline). The Just-in-Time model picks up the nonlinear characteristics well andgives a better result, but is at the same time more demanding with respect tocomputational effort.

Root mean square prediction errors (RMSE) have been computed and are dis-played in Table 6.1. Lindskog [29] has achieved similar results using semi-physicalmodel as in (6.6) and a fuzzy modeling approach.

Page 87: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

6.5 Water Heating Process 73

0 2 4 6 8 10 12 14 163

4

5

6

7

0 2 4 6 8 10 12 14 16

0

10

20

30

40

Time [min]

u(t

)[V

]

Time [min]

h(t

)[c

m]

Figure 6.7 The estimation data set used in the water tank modeling ex-ample.

Method RMSEARX 0.389

Just-in-Time 0.194

Table 6.1 Root mean square prediction errors for the models used in thetank example.

6.5 Water Heating Process

In this example we consider identification of a water heating process as depicted inFigure 6.9. This process has earlier been investigated in [28] and [29].

The system can be described as follows: Cold water flows into to the heaterwith a flow rate Qin(t), and is heated by a resistor element which is controlledby the voltage u(t). At the outlet, the water temperature T (t) is measured. Theinlet flow Qin(t) as well as the inlet water temperature is assumed to be constant.The modeling problem is to describe the temperature T (t) given the voltage u(t).As shown in [29], the system is nonlinear due to saturation characteristics of thethyristor.

The data set consists of 3000 samples, recorded every 3rd second, and originatesfrom a real time identification run (performed by Koivisto [28]), where the systemwas driven by an input signal of pseudo-random type. The data was divided intoan estimation set of 2000 samples, see Figure 6.10, and a validation set of 1000samples. The time delay from input to output is between 12 to 15 seconds, [28].

Page 88: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

74 Chapter 6 Applications to Dynamical Systems

0 2 4 6 8 10 12 14 16−5

0

5

10

15

20

25

30

35

Time [min]

h(t

)[c

m]

Figure 6.8 Simulation results for the water tank system. Solid: True waterlevel. Dashed: Simulated level using the Just-in-Time approach. Dotted:Simulated level using an ARX model.

u(t)

Thyristor

Qin(t)Qout(t)

Pt-100 T (t)

Figure 6.9 The water heating process.

This yields that useful regressors stemming from the input are u(t − 4), u(t − 5)and so on.

We apply our algorithm to the heater data using a second order model, i.e. theregression vector is defined as

ϕ(t) = (T (t− 1), T (t− 2), u(t− 4), u(t− 5))T . (6.8)

We then let the estimation dataset define our observation database, and use thevoltage signal u(t) in the validation dataset to obtain a simulation of the corre-sponding temperature T (t).

Page 89: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

6.5 Water Heating Process 75

0 1000 2000 3000 4000 5000 60000

20

40

60

80

100

0 1000 2000 3000 4000 5000 600010

20

30

40

50

Time [s]

u(t

)[%

]

Time [s]

T(t

)[C

]

Figure 6.10 Estimation data for the water heating system

6000 6500 7000 7500 8000 8500 900010

15

20

25

30

35

40

45

50

Time [s]

T(t

)[C

]

(a)

6000 6500 7000 7500 8000 8500 900010

15

20

25

30

35

40

45

50

Time [s]

T(t

)[C

]

(b)

Figure 6.11 (a) Result of a Just-in-Time simulation of the water heatingsystem using an estimation database consisting of 2000 samples. Solid:True system output. Dashed: True output. (b) Result of a simulation usinga second order linear ARX model. Solid: True system output. Dashed:Simulated output.

Method RMSEARX 2.082

Just-in-Time 0.886Fuzzy 1.02

Table 6.2 Root mean square errors for the tank simulations.

Page 90: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

76 Chapter 6 Applications to Dynamical Systems

For comparison we also tried a linear ARX model with the same structure asin (6.8). The result of the simulations is shown in Figure 6.11 (a)-(b), and theroot mean square errors (RMSE) are summarized in Table 6.2. Lindskog [29] hasachieved a RMSE of 1.02 using a fuzzy modeling approach, and Koivisto [28] hasreceived similar results using neural network modeling.

Page 91: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7Applications to Frequency

Response Estimation

A traditional application for nonparametric estimation in system identification isin the area of frequency response estimation. A nonparametric estimate of thefrequency response, the so-called ETFE [30], is formed as the ratio between theFourier transforms of the output and the input signals of the system. However,since the observations usually are corrupted by measurement noise, the noise willpropagate to the ETFE and make it noisy. This is normally solved by smoothingthe ETFE with a window function of fixed width, where the user has to choosethe width in order to get a good trade-off between variance reduction and resolu-tion. This is usually done in the time domain, using the so-called Blackman-Tukeyprocedure [3]. The problem with the fixed window approach, though, is that anacceptable noise reduction in one part of the frequency interval is achieved to theprice of a too low resolution in another part of the interval. Considerable improve-ments could therefore be obtained by allowing a smoothing window with variablewidth. We shall here investigate how the Just-in-Time method can be used for thispurpose.

The outline is as follows: In Section 7.1 a review of traditional methods isgiven. Section 7.2 derives a variant of the Just-in-Time smoother that can be usedfor frequency response estimation. Section 7.3 presents an example with real data,and Section 7.4 gives a summary and some conclusions.

7.1 Traditional Methods

This section gives a brief review of traditional methods in the area of frequencyresponse estimation. The methods emanate from spectral estimation techniques

77

Page 92: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

78 Chapter 7 Applications to Frequency Response Estimation

in signal processing and mathematical statistics. Good overviews of the topic aregiven in Chapter 10 in [25], in Chapter 6 in [30], in [5], and in [17].

Consider a stable linear system described by the input-output relation

y(t) =∞∑k=1

g(k)u(t− k) + e(t), (7.1)

where g(k) is the impulse response and e(t) is a disturbance being a stationarystochastic process with spectrum Φe(ω). The frequency response, G0(eiω), is thendefined by

G0(eiω) =∞∑k=1

g(k)e−iωk. (7.2)

Given observations (y(t), u(t))Nt=1, from the process, the goal is to get an estimateof G0(eiω) without imposing any parametric model structure.

Since the input-output relation (7.1) is linear, a straightforward estimate of thetrue frequency response is given by

ˆGN (eiω) =

YN (ω)UN(ω)

, (7.3)

where

YN (ω) =N∑t=1

y(t)e−iωt and UN(ω) =N∑t=1

u(t)e−iωt (7.4)

denote the discrete Fourier transform of y(t) and u(t), respectively. The estimate(7.3) is often called the empirical transfer function estimate (ETFE), since it isformed directly from the observations without any other assumptions than lin-earity of the system. It is well known that the ETFE is a very crude and noisyestimate of the true transfer function. This is due to the construction of the Fouriertransform; we determine as many independent estimates as we have data points.Or equivalently, we have no compression of data. As a consequence of this, thesystem’s properties at different frequencies may be completely unrelated.

7.1.1 Properties of the ETFE

The crudeness of the ETFE is caused by the fact that the observations (y(t), u(t))are corrupted by measurement noise, and that the noise propagates to the estimatethrough the Fourier transform. Hence the ETFE is also a random variable withcertain statistical properties.

From [30], Lemma 6.1, we have that

EˆGN (eiω) = G0(eiω) +R

(1)N , (7.5)

Page 93: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7.1 Traditional Methods 79

E[ ˆGN (eiω)−G0(eiω)][ ˆ

GN (e−iξ)−G0(e−iξ)]

=

1

|UN (ω)|2 [Φe(ω) +R(2)N ], if ξ = ω

R(2)N

UN (ω)UN (−ξ) , if |ξ − ω| = 2πkN , k = 1, . . . , N − 1,

(7.6)

where

R(i)N → 0, as N →∞. (7.7)

From this result we see that the ETFE is an asymptotically unbiased estimate ofthe true transfer function, but that the variance does not decay when N increases.Instead it approaches the noise-to-signal ratio at the frequency in question.

7.1.2 Smoothing the ETFE

One way to improve the poor properties of the ETFE is to assume that the valuesof the true transfer function G0(eiω) at neighboring frequencies are related. Thetransfer function estimate at a certain frequency ω0 can then be obtained as aweighted average of the neighboring ETFE values

G(eiω0) =

∫ π

−πWγ(ξ − ω0)|UN (ξ)|2 ˆ

GN (eiξ) dξ∫ π

−πWγ(ξ − ω0)|UN (ξ)|2 dξ

, (7.8)

where Wγ(ξ) is a function centered about ξ = 0 which is characterized by∫ π

−πWγ(ξ) dξ = 1,

∫ π

−πξWγ(ξ) dξ = 0,

∫ π

−πξ2Wγ(ξ) dξ = M(γ)∫ π

−π|ξ|3Wγ(ξ) dξ = C3(γ),

∫ π

−πW 2γ (ξ) dξ =

12πW (γ).

This function is usually referred to as the frequency window [30]. Here γ is aparameter that controls the width of the window. If the frequency window iswide, then many neighboring frequencies will be weighted together in (7.8). Thiswill reduce the variance of GN (eiω0). However, a wide window also implies thatfrequency estimates located far away from ω0, whose expected values may differsignificantly from G0(eiω0), will have a great deal of influence on the estimate. Thiswill cause large bias. The window width thus controls the trade-off between thebias and the variance errors.

A commonly used frequency window is the Hamming window [2, 20]

Wγ(ω) =12Dγ(ω) +

14Dγ

(ω − π

γ

)+

14Dγ

(ω +

π

γ

), (7.9)

Page 94: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

80 Chapter 7 Applications to Frequency Response Estimation

where

Dγ(ω) =sin(γ + 1

2ω)

sinω/2. (7.10)

It is depicted in Figure 7.1 for different values of γ.

−3 −2 −1 0 1 2 3

0

1

2

3

4

5

6

7

8

9

10

ξ

Wγ(ξ

)

Figure 7.1 Hamming window of different widths γ. Solid: γ = 5, Dashed:γ = 10, Dotted: γ = 3.

7.1.3 Asymptotic Properties of the Estimate

The transfer function estimate (7.8) has been thoroughly investigated in severalbooks on spectral estimation. See, for example, Chapter 10 of [25], Chapter 6of [5] or Chapter 6 of [30]. From [30] we have the following asymptotic formula(asymptotic both in N and γ).

Bias

EGN (eiω)−G0(eiω) = M(γ)[

12G′′0 (eiω) +G′0(eiω)

Φ′u(w)Φu(ω)

]+O(C3(γ)) +O(1/

√N), (7.11)

where the differentiation is with respect to ω.

Variance

E|GN (eiω)− EGN (eiω)|2 =1NW (γ)

Φe(ω)Φu(ω)

+ o(W (γ)/N). (7.12)

Page 95: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7.1 Traditional Methods 81

Hence the asymptotic mean square error (MSE) can be written

E|GN (eiω)−G0(eiω)|2

∼M2(γ)∣∣∣∣12G′′0 (eiω) +G′0(eiω)

Φ′u(w)Φu(ω)

∣∣∣∣2 +1NW (γ)

Φe(ω)Φu(ω)

(7.13)

The value of γ that minimizes (7.13) and thus gives a good trade-off between biasand variance error is

γopt(ω) =

4M2∣∣∣ 12G′′0(eiω) +G′0(eiω)Φ′u(w)

Φu(ω)

∣∣∣2 Φu(ω)

WΦe(ω)

1/5

·N1/5.(7.14)

Using this value of γ, the mean square error decays like

MSEGN(eiω) ∼ C ·N−4/5. (7.15)

If all quantities in (7.14) were known by the user, the window width γ could beallowed to be frequency dependent. In practice though, these are unknown and theuser has to choose a suitable value of γ manually.

7.1.4 An Example

The following example, taken from [4], illustrates the impact different windowwidths have on the resulting estimate.

Example 7.1 (Bodin [4])Consider the linear system

G(q) = C(q2 − 2r cos(φ+ ∆φ)q + r2)(q2 − 2r cos(φ−∆φ)q + r2)

(q − k)(q2 − 2r cos(φ)q + r2)2,

(7.16)

where

r = 0.95φ = 1.3π/4

∆φ = 0.03π/4k = 0.5C = 0.5

The system has a damped peak in the frequency response of the system.A data set was generated according to

y(t) = G(q)u(t) + e(t), t = 1, . . . , N, (7.17)

Page 96: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

82 Chapter 7 Applications to Frequency Response Estimation

where N = 212, u(t) is a unit PRBS signal, and e(t) is an identically distributedrandom sequence with zero mean and standard deviation 0.03.The amplitude and phase plots of the true transfer function are shown in Figure7.2 (a). The corresponding ETFE plots are shown in Figure 7.2 (b).

In Figure 7.3 (a)-(c), the ETFE has been windowed with a Hamming window ofdifferent widths. In Figure 7.3 (a) a wide window with γ = 64 is used. The plot issmooth, but the resolution at the peak is quite poor. In Figure 7.3 (b) a narrowerwindow with γ = 256 is used. The resolution at the peak is now better, butat other frequencies the plot is noisier. Figure 7.3 (c) shows a compromise withwindow width γ = 128.

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

(a) The true frequency response G(eiω).

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

(b) The ETFE etimateˆGN (eiω).

Figure 7.2 Frequency functions of the system in Example 7.1.

The example clearly shows the trade-off between resolution (narrow window)and variance reduction (wide window), which has to be done when using a frequencywindow with fixed width. A typical procedure is to start by taking γ = N/20 [30],and increase γ until the desired level of details is achieved.

7.2 Using the Just-in-Time Approach

The problem with the traditional methods described in Section 7.1 is that thewindow width γ in (7.8) in practice has to be chosen by the user, and that it isfixed over the whole frequency axis. The main reason for this is that the smoothingusually is performed in the time domain using the Blackman-Tukey procedure,which results in a fixed window in the frequency domain. As stressed in Section7.1.3, a better performance can be achieved by using a frequency window withvariable width, which adapts to the local properties of the ETFE. In Chapter 4, wederived a method, the Just-in-Time smoother, that is doing exactly this. Let us

Page 97: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7.2 Using the Just-in-Time Approach 83

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

(a) γ = 64

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

(b) γ = 256

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

(c) γ = 128

Figure 7.3 Windowed frequency responses using a Hamming window withdifferent width γ. Solid: Windowed estimates. Dashed: True frequencyresponse.

here discuss how the Just-in-Time predictor can be extended to perform smoothingof the transfer function estimate.

Suppose that the input-output relation of the system as in (7.1) is given by

y(t) =∞∑k=1

gku(t− k) + e(t), (7.18)

that a set of time domain data (u(t), y(t)) is collected, and that the ETFE isformed in accordance to (7.3). Then we have available a data set of noisy frequency

Page 98: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

84 Chapter 7 Applications to Frequency Response Estimation

response data(ωk, ˆ

GN (eiωk))N−1k=0 ,

where ωk ∈ R and ˆGN(eiωk) ∈ C. From Section 7.1.1, Equations (7.5)-(7.7), we

have that for a sufficiently large N , the regression relation can be modeled as

ˆG(eiωk) = G0(eiωk) + ρk, (7.19)

where ρk is a complex disturbance with zero mean and variance

λ = Φe(ω)/|UN (ω)|2.

Considering a nonparametric Just-in-Time approach in the spirit of Chapter 4,an estimate of the frequency response can be formed as a weighted mean of theETFE at neighboring frequencies

GN (eiω0) =∑

ωk∈ΩM

wkˆGN (eiωk), (7.20)

where ΩM denotes the set containing the M neighboring frequencies of ω0, andwk denotes a weight sequence which satisfies∑

ωk∈ΩM

wk = 1, (7.21)

∑ωk∈ΩM

wk(ωk − ω0) = 0. (7.22)

The fact that we now have partly complex data results in a slight complication.However, we have a number of possibilities:

1. Compute the weights based on | ˆGN (eiω)|, and apply them to the complex

function ˆGN (eiω).

2. Smooth the real part and complex part of ˆGN (eiω) separately.

3. Make appropriate modifications to the Just-in-Time smoother so that it ac-cepts complex data.

Among the possible solutions above, we have chosen alternative 3, since it is be-lieved that it would give the most accurate result. The Just-in-Time algorithm,presented in Section 4.6, can therefore be modified and summarized as follows,when it is applied on complex ETFE data:

Algorithm 7.1 (The Just-in-Time Algorithm for ETFE smoothing)

Input: An ETFE dataset ( ˆGN (eiωk), ωk)Nk=1 and an operating point ω0.

Output: A frequency response estimate G(eiω0).

Page 99: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7.2 Using the Just-in-Time Approach 85

1. Sort the frequency data ωk in ascending distance according to ω0.

2. Estimate the second order derivative G′′0 (eiω0) of G0(·) using the linear leastsquares method,

VM (ϑ,ΩM ) =∑

ωk∈ΩM

∣∣∣ ˆGN (eiωk)− φTk ϑ

∣∣∣2 ,ϑ = arg min

ϑVM (ϑ,ΩM ),

whereφk = (1, (ωk − ω0), 1

2 (ωk − ω0)2)T ,

ϑ = (G0(eiω0), G′0(eiω0), G′′0 (eiω0))T ,

and whereˆGN (eiωk) denotes the ETFE at the M closest neighboring frequen-

cies.

The number of samples M is chosen by Akaike’s FPE criterion,

Mopt = arg minM

VM (ϑ,ΩM )1 + 3/M1− 3/M

.

A variance estimate is then obtained by

λ = VM (ϑ,ΩM )1

M − 3.

3. Use the result from Step 2 to compute weights,

w = A−1X(X TA−1X

)−1x,

where

A−1 =1

(I − ββT

λ+ βTβ

)β = (β1, . . . , βM )T , βk = G′′0 (eiω0)(ωk − ω0)2

X =

1 ω1

......

1 ωM

, x =(

1ω0

).

4. Form the resulting transfer response estimate as a weighted mean of the Mcorresponding ETFE values, i.e.

G(eiω0) =∑

ωk∈ΩM

wkˆGN (eiωk).

Page 100: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

86 Chapter 7 Applications to Frequency Response Estimation

Let us investigate what we obtain if we apply the proposed algorithm to thedata set used in Example 7.1.

Example 7.2The Just-in-Time algorithm for ETFE smoothing was applied to the data set gen-erated in Example 7.1. The resulting smoothed estimate is depicted in Figure 7.4.

For comparison, the root mean square errors (RMSE) have been calculated for thedifferent smoothing methods. The results are summarized in Table 7.1.

Method RMSEHamming γ = 64 0.0074Hamming γ = 128 0.0069Hamming γ = 256 0.0094Just-in-Time 0.0061

Table 7.1 Comparison of RMSE for different methods.

As shown, the performance is improved using the Just-in-Time method since thesmoothing window adapts to the local properties of the data. However, the pricethat has to be paid for this improvement is that the computational work increases.

7.3 Aircraft Flight Flutter Data

In this section we consider ETFE smoothing of a data set that origins from a realindustrial application. The data set has earlier been investigated by [41] and [29].

When new aircrafts are developed they are normally evaluated through quiterigorous test flight programs. Among many other thing it is interesting to exam-ine the mechanical limitations of the different parts of the aircraft. A commonlyused measure of this is the so-called flight flutter condition in which an aircraftcomponent at a specific airspeed starts to oscillate.

The experiments are usually performed by attaching special transducers to var-ious points on the airframe, in this particular case the wings, thereby introducingmechanical vibrations artificially. Flying at a predetermined and constant speed,data is collected and analyzed off-line, giving information about whether to allowthe the aircraft to fly faster or not. See [41] for further experimental details.

Data recorded during flight flutter test are normally noisy with a quite lowsignal to noise ratio, and for economical reasons only short data sequences arepermitted. The data to be investigated origins from LMS International and wasfirst used by Schoukens and Pintelon [41].

Page 101: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

7.4 Summary 87

0 0.5 1 1.5 2 2.5 3−10

−8

−6

−4

−2

0

0 0.5 1 1.5 2 2.5 3

−150

−100

−50

0

Am

plitu

de

[dB

]

Frequency [rad/s]

Phase

[deg

]

Frequency [rad/s]

Figure 7.4 Just-in-Time frequency response estimate of the system in Ex-ample 7.1.

Flutter data were obtained using burst swept sine (4–40 Hz) excitations thatgenerated a force input u(t) leading to an acceleration response which was takenas the measured output y(t). The data was sampled at 100 Hz and consists of2048 samples. The goal was to model the frequencies in the frequency band of 4 to11 Hz. The raw data were therefore pre-filtered through a fifth order Butterworthband-pass filter. The resulting excitation signal u(t) and response signal y(t) areshown in Figure 7.5.

The ETFE of the data set was formed according to (7.3) and is shown as crossesin Figure 7.6. A Just-in-Time smoothed estimate G(eiω) was computed based onthe ETFE. It is represented by the solid line in the same figure.

7.4 Summary

It has been shown that the concept of Just-in-Time models, with minor modifica-tions, can be applied to the nonparametric frequency response estimation problem.It provides a way of letting the frequency window be frequency dependent, whichin general improves the quality of the estimate. It also has the advantage of beingdesign variable free, i.e., the user is not required to specify any smoothing param-eters as in the traditional case. A drawback, however, is that the method at thesame time requires considerable computational effort.

Page 102: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

88 Chapter 7 Applications to Frequency Response Estimation

0 1 2 3 4 5 6 7 8−10

−5

0

5

10

0 1 2 3 4 5 6 7 8−5

0

5

Exci

tati

onu

(t)

Time [s]

Res

pon

sey(t

)

Time [s]

Figure 7.5 Time domain flight flutter data.

4 5 6 7 8 9 10 11−25

−20

−15

−10

−5

0

4 5 6 7 8 9 10 110

50

100

150

Frequency [Hz]

Am

plitu

de

[dB

]

Frequency [Hz]

Phase

[deg

]

Figure 7.6 Just-in-Time smoothed ETFE of the flight flutter data set.

Page 103: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

8Summary & Conclusions

The problem of modeling systems when large data volumes are available has beenstudied. The solution proposed, the Just-in-Time estimator, stores all observationsin a database, and based on this dynamic models are computed as the need arises.When a model is really needed at a certain operating point, the data located in aneighborhood around the operating point are retrieved from the database, and amodel is determined based on these data.

The Just-in-Time estimator is formed in a nonparametric fashion as a weightedaverage of the observations in the neighborhood, where the weights are optimizedso that the local mean square error is minimized. The size of the neighborhood isdetermined as a bias/variance error trade-off.

It has been shown that the Just-in-Time concept with success can be applied tothe system identification problem, both in the time and the frequency domains. Italso has been concluded that the method on some problems gives smaller predictionerrors than other proposed methods, but that this performance improvement comesto the price of an increased computational complexity. Another possible drawbackis that the Just-in-Time method might be sensitive to the data distribution andeffects at the boundary of the regressor space.

There are a number of open questions that need to be further investigated.Some possible topics for future research are listed below.

• An obvious extension is the data set searching problem. This will concerninvestigations regarding suitable data structures that will enable efficientsearches for neighborhoods.

• Another extension involves how the method can be applied to control prob-

89

Page 104: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

90 Chapter 8 Summary & Conclusions

lems. A possible application in this area is predictive control, which is amethod of optimizing the control signal to a system based on predictions offuture outputs of the system. In this framework, the Just-in-Time conceptcould be used for computing the predictions.

• In some cases it has been noticed that the method for estimating the Hes-sian produces noisy estimates. A better and more robust method for this istherefore desirable.

• Also, it has been noticed that the method sometimes encounters problemswhen computing predictions at the boundary of the regression space. Thisclearly requires further investigations.

Page 105: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

Bibliography

[1] H. Akaike. Fitting autoregressive models for prediction. Ann. Inst. Statist.Math., 21:243–247, 1969.

[2] R.B. Blackman and J.W. Tukey. The Measurement of Power Spectra. Dover,New York, 1958.

[3] R.B. Blackman and R.W. Tukey. The measurement of power spectra from thepoint of view of communications engineering. Bell Syst. Tech. J., 37:183–282,485–569, 1958.

[4] P. Bodin. On wavelets and orthonormal bases in system identification. Licen-tiate Thesis TRITA-REG-9502, Automatic Control, Dept. of Signals, Sensorsand Systems, Royal Institute of Technology, Sweden, 1995.

[5] D.R. Brillinger. Time Series: Data Analysis and Theory. Holden-Day, SanFransisco, 1981.

[6] S. Chen, S.A. Billings, C.F.N. Cowan, and P.M. Grant. Non-linear systemidentification using radial basis functions. International Journal of SystemsScience, 21(12):2513–2539, 1990.

[7] W.S. Cleveland. Robust locally weighted regression and smoothing scatter-plots. Journal of the American Statistical Association, 74:829–836, 1979.

[8] T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEETransactions on Information Theory, 13:21–27, 1967.

91

Page 106: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

92 Bibliography

[9] G. Cybenko. Just-in-time learning and estimation. In S. Bittani and G. Picci,editors, Identification, Adaption, Learning, NATO ASI series, pages 423–434.Springer, 1996.

[10] I. Daubechies. The wavelet transform, time-frequency localization and signalanalysis. IEEE Transactions on Information Theory, 36(4):961–1005, 1990.

[11] W.B. Davenport, Jr. Probability and Random Processes. McGraw-Hill, 1975.

[12] J.E. Dennis, Jr. and R.B. Schnabel. Numerical Methods for UnconstrainedOptimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NewJersey, 1983.

[13] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Ben-jamin/Cummings Publishing Company, Redwood City, CA, second edition,1994.

[14] V.A. Epanechnikov. Non-parametric estimation of a multivariate probabilitydensity. Theory of Probability and Its Applications, 14:153–158, 1969.

[15] J. Fan. Design-adaptive nonparametric regression. Journal of the AmericanStatistical Association, 87:998–1004, 1992.

[16] H. Garcia-Molina and K. Salem. Main memory database systems: Anoverview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516, December 1992.

[17] W.A. Gardner. Statistical Spectral Analysis – A Nonprobabilistic Theory. Pren-tice Hall, Englewood Cliffs, New Jersey, 1988.

[18] T. Gasser and H.-G. Muller. Kernel estimation of regression functions. InT. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Esti-mation, pages 23–68. Springer-Verlag, Heidelberg, 1979.

[19] F. Gustafsson. Estimation of Discrete Parameters in Linear Systems. PhDthesis, Dept of EE, Linkoping University, S-581 83 Linkoping, Sweden, 1992.

[20] R.W. Hamming. Digital Filters. Prentice-Hall, Englewood Cliffs, New Jersey,1983.

[21] W. Hardle. Applied Nonparametric Regression. Number 19 in EconometricSociety Monographs. Cambridge University Press, 1990.

[22] T.J. Hastie and C. Loader. Local regression: Automatic kernel carpentry (withdiscussion). Statist. Sci., 8:120–143, 1993.

[23] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, 1994.

[24] H.V. Henderson and S.R. Searle. Vec and vech operators for matrices, withsome uses in Jacobians and multivariate statistics. Canadian Journal of Statis-tics, 7:65–81, 1979.

Page 107: R E G L E RTEKNIK AUT L - Automatic Control · vec A The vector of the matrix A, obtained by stacking the columns of Aunderneath each other in order from left to right vech A The

Bibliography 93

[25] G.M. Jenkins and D.G. Watts. Spectral Analysis and its Applications. Holden-Day, San Fransisco, California, 1968.

[26] T. Kailath. Linear Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1980.

[27] S.M. Kay. Modern Spectral Estimation. Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

[28] H. Koivisto. A Practical Approach to Model Based Neural Network Control. PhD thesis, Tampere University of Technology, Tampere, Finland, December 1995.

[29] P. Lindskog. Methods, Algorithms and Tools for System Identification Based on Prior Knowledge. PhD thesis, Linköping University, Linköping, Sweden, May 1996.

[30] L. Ljung. System Identification – Theory for the User. Prentice Hall, Englewood Cliffs, New Jersey, 1987.

[31] D.O. Loftsgaarden and G.P. Quesenberry. A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–1051, 1965.

[32] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, 1988.

[33] The Mathworks. MATLAB Reference Guide. The Mathworks Inc., 24 Prime Park Way, Natick, Mass. 01760, 1992.

[34] H.-G. Müller. Weighted local regression and kernel methods for nonparametric curve fitting. Journal of the American Statistical Association, 82:231–238, 1987.

[35] E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 10:186–190, 1964.

[36] A.V. Nazin, 1996. Private communication.

[37] M.B. Priestley and M.T. Chao. Non-parametric function fitting. Journal of the Royal Statistical Society, Series B, 34:385–392, 1972.

[38] D. Ruppert, S.J. Sheather, and M.P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1995.

[39] D. Ruppert and M.P. Wand. Multivariate locally weighted least squares regression. Annals of Statistics, 22, 1994.

[40] F.A. Sadjadi. An approximate k-nearest neighbor method. In Adaptive and Learning Systems II, Proc. SPIE 1962, 1993.


[41] J. Schoukens and R. Pintelon. Identification of Linear Systems – A Practical Guideline to Accurate Modeling. Pergamon Press, 1991.

[42] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31:1691–1724, 1995.

[43] T. Söderström and P. Stoica. System Identification. Prentice-Hall International, Hemel Hempstead, Hertfordshire, 1989.

[44] A. Stenman, F. Gustafsson, and L. Ljung. Just in time models for dynamical systems. In Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996.

[45] A. Stenman, A.V. Nazin, and F. Gustafsson. Asymptotic properties of Just-in-Time models. 1997. To be presented at SYSID ’97 in Fukuoka, Japan.

[46] C.J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–620, 1977.

[47] M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

[48] J.W. Tukey. Curves as parameters and touch estimation. In Proc. 4th Berkeley Symposium, pages 681–694, 1961.

[49] J.W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.

[50] M.P. Wand and M.C. Jones. Kernel Smoothing. Number 60 in Monographs on Statistics and Applied Probability. Chapman & Hall, 1995.

[51] G. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359–372, 1964.


Subject Index

A
ARX model, 15
asymptotic MISE, 30
asymptotic MSE, 29

B
bandwidth, 24
bandwidth matrix, 32
bandwidth selection, 31
basis function, 2, 15
bias error, 29, 42, 43
black-box models, 1, 5, 15

C
composition, 16
constrained minimization, 44
cross-validation, 31, 51
curse of dimensionality, 33

D
data set searching, 57
direct plug-in methods, 31

E
Epanechnikov kernel, 24
ETFE, 6, 78

F
final prediction error (FPE), 52
FIR model, 15
fixed design, 23
frequency window, 79

H
Hamming window, 4, 79
Hessian, 43, 51, 57

I
index set, 27

J
Jacobian, 43
Just-in-Time algorithm, 54
Just-in-Time estimator, 8, 39

K
k-nearest neighbor estimator, 27
kernel function, 4, 22, 24, 32

L
Lagrange multipliers, 44
linear least squares, 19
linear regression, 13
local linear estimator, 25, 47
local polynomial kernel estimator, 21, 23

M
matrix inversion lemma, 44
mean integrated square error (MISE), 28
mean square error (MSE), 28, 42
Monte Carlo simulation, 48

N
Nadaraya-Watson estimator, 21, 24, 47
nearest neighbor, 4
neighborhood, 54
neural networks, 16
Newton’s algorithm, 19
nonlinear least squares, 19
nonparametric regression, 1, 4

P
parameter estimation, 18
parametric regression, 2, 13
product construction, 32

Q
query language, 57

R
radial basis function, 17
radial construction, 16, 32
random design, 23
rate of convergence, 33, 58, 63
recursive least squares (RLS), 50, 57
regression function, 2
ridge construction, 16

S
sigmoid basis function, 16
smoothing, 22
system identification, 5, 14

T
Taylor series, 49

V
variance error, 29, 43

W
wavelets, 17