An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification


Y. Bhujwalla, V. Laurain, M. Gilson

55th IEEE Conference on Decision and Control
yusuf-michael.bhujwalla@univ-lorraine.fr

Introduction : Problem Description

Measured data : D_N = {(u_1, y_1), (u_2, y_2), . . . , (u_N, y_N)}

Describing an unknown system :

S_o : y_{o,k} = f_o(x_k),  f_o : X → R
      y_k = y_{o,k} + e_{o,k},  e_{o,k} ∼ N(0, σ_e²)

- x_k = [ y_{k−1} · · · y_{k−n_a}  u_{1,k} · · · u_{1,k−n_b}  u_{2,k} · · · u_{n_u,k−n_b} ]ᵀ ∈ X = R^{n_a + n_u(n_b+1)}
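A minimal sketch of how such regressor vectors could be stacked from recorded data (the helper name and the single-input restriction n_u = 1 are assumptions, not from the slides) :

```python
import numpy as np

def build_regressors(u, y, na, nb):
    """Stack NARX regressors x_k = [y_{k-1} ... y_{k-na}, u_k ... u_{k-nb}]
    for a single-input system (n_u = 1): a sketch of the setup above."""
    u, y = np.asarray(u), np.asarray(y)
    k0 = max(na, nb)                            # first k with a full regressor
    X = np.array([
        np.concatenate((y[k - na:k][::-1],      # y_{k-1}, ..., y_{k-na}
                        u[k - nb:k + 1][::-1])) # u_k, ..., u_{k-nb}
        for k in range(k0, len(y))
    ])
    return X, y[k0:]                            # regressors and matching targets
```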

Introduction : Modelling Objective

Aim : to choose the simplest model from a candidate set of models that accurately describes the system :

M_opt : Accuracy (Data) vs Simplicity (Model)
↓
V_f : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + g(f)

Q1 : How to choose the simplest accurate model ?
- Often g(f) = λ ∥f∥²_H, to ensure uniqueness of the solution
- λ → controls the bias-variance trade-off

Q2 : How to determine a suitable set of candidate models . . . ?

Outline

1. Kernel Methods in Nonlinear Identification

2. Model Selection Using Derivatives

3. Smoothness-Enforcing Regularisation

4. Application : Estimation of Locally Nonsmooth Functions


1. Kernel Methods in Nonlinear Identification

FIGURE : The estimate f̂ built from kernel sections k_x placed along the input domain (input vs output)

→ Model :

F_f : f(x) = Σ_{i=1}^{N} α_i k_{x_i}(x)

→ Nonparametric (n_θ ∼ N)
→ Flexible : M defined through choice of K
→ Height : α (model parameters)
→ Width : σ (kernel hyperparameter)
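A minimal sketch of evaluating F_f, assuming a Gaussian kernel (the slides leave K generic, so the kernel choice and names here are illustrative) :

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    """Gaussian kernel section k_c(x) = exp(-(x - c)^2 / (2 sigma^2))."""
    return np.exp(-(x - c)**2 / (2 * sigma**2))

def f_hat(x, centres, alpha, sigma):
    """Kernel expansion f(x) = sum_i alpha_i * k_{x_i}(x):
    alpha sets the heights, sigma the common width."""
    return gauss_kernel(np.asarray(x)[:, None], centres[None, :], sigma) @ alpha
```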

1. Kernel Methods in Nonlinear Identification : Identification in the RKHS

Reproducing Kernel Hilbert Spaces : the kernel function defines the model class :

K ↔ H

Hence, functions can be represented in terms of kernels :

f(x) = ⟨f, k_x⟩_H   (1)
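For V_f with g(f) = λ ∥f∥²_H, the representer theorem gives f = Σ_i α_i k_{x_i}, and since ∥f∥²_H = αᵀKα the problem reduces to one linear solve. A sketch under the same Gaussian-kernel assumption as above :

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    return np.exp(-(x - c)**2 / (2 * sigma**2))

def fit_vf(x, y, sigma, lam):
    """Solve min_f sum_k (y_k - f(x_k))^2 + lam * ||f||_H^2.
    With f = sum_i alpha_i k_{x_i} and ||f||_H^2 = alpha' K alpha,
    the minimiser is alpha = (K + lam I)^{-1} y."""
    K = gauss_kernel(x[:, None], x[None, :], sigma)  # Gram matrix K_ij = k(x_i, x_j)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)
```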

1. Kernel Methods in Nonlinear Identification : The Kernel Selection Problem

Choosing an overly flexible model class (a small kernel) :

FIGURE : Flexible Model Class → High Variance (f_o, y, f̂, k_x)

Choosing an overly constrained model class (a large kernel) :

FIGURE : Constrained Model Class → Model Biased (f_o, y, f̂, k_x)

Why not just choose the 'optimal' model class ?

FIGURE : Optimal Model Class → Optimal Model (f_o, y, f̂, k_x)

• This is, in general, what we try to do.
• However, H_opt is unknown.
• Optimisation over one hyperparameter : not that difficult.
• Optimisation over multiple model structures, kernel functions and hyperparameters → more difficult.


2. Model Selection Using Derivatives

But note that many properties of K are encoded in its derivatives, e.g.

Smoothness :   f(x) = ax² + bx + c  ⟹  d³f(x)/dx³ = 0 ∀x
               f(x) = g_1(x)[x < x*] + g_2(x)[x > x*]  ⟹  df(x)/dx exists ∀x ≠ x*
Linearity :    f(x_1, x_2) = x_1 h_1(x_2) + h_2(x_2)  ⟹  ∂²f(x_1, x_2)/∂x_1² = 0 ∀x_1
Separability : f(x_1, x_2) = g(x_1) + h(x_2)  ⟹  ∂²f(x_1, x_2)/∂x_1∂x_2 = 0 ∀x_1, x_2
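These implications are straightforward to verify symbolically; a small sanity check (sympy assumed available, with hypothetical function names h1, h2) :

```python
import sympy as sp

x, x1, x2, a, b, c = sp.symbols('x x1 x2 a b c')
h1, h2 = sp.Function('h1'), sp.Function('h2')

# Smoothness: a quadratic has a vanishing third derivative everywhere.
assert sp.diff(a*x**2 + b*x + c, x, 3) == 0

# Linearity in x1: f = x1*h1(x2) + h2(x2) has zero second derivative in x1.
assert sp.diff(x1*h1(x2) + h2(x2), x1, 2) == 0

# Separability: f = h1(x1) + h2(x2) has zero cross-derivative.
assert sp.diff(h1(x1) + h2(x2), x1, x2) == 0
```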

2. Model Selection Using Derivatives

Incorporating this information into the problem formulation allows the model selection to be transferred from an optimisation over K . . .

. . . to an explicit regularisation problem over derivatives, using an a priori flexible model class definition.


3. Smoothness-Enforcing Regularisation : Problem Formulation

Here we consider X = R, where the kernel optimisation is reduced to a smoothness selection problem.

What would we like to do ?

Replace the existing functional norm regularisation . . .

V_f : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + λ ∥f∥²_H

. . . with a smoothness penalty in the cost function . . .

V_D : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + λ ∥Df∥²_H

How ?
- ∥Df∥²_H → known (D. X. Zhou, 2008)
- f(x) for V_D → unknown
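A sketch of how V_D could be solved under the expansion f = Σ_i α_i k_{x_i} : by Zhou's (2008) result on derivative reproducing kernels, ∥Df∥²_H = αᵀGα with G_{ij} = ∂²k(u, v)/∂u∂v evaluated at (x_i, x_j). The Gaussian kernel and all names here are assumptions, not the paper's prescribed implementation :

```python
import numpy as np

def gauss_k(x, z, s):
    """Gaussian kernel k(x, z) = exp(-(x - z)^2 / (2 s^2))."""
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    """Cross-derivative d^2 k / dx dz; for the Gaussian kernel this is
    (1/s^2 - (x - z)^2 / s^4) * k(x, z), giving <D k_x, D k_z>_H (Zhou, 2008)."""
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_vd(x, y, s, lam):
    """Minimise ||y - K alpha||^2 + lam * alpha' G alpha (the V_D criterion):
    setting the gradient to zero gives (K'K + lam G) alpha = K'y."""
    K = gauss_k(x[:, None], x[None, :], s)     # K[k, i] = k(x_k, x_i)
    G = gauss_k_dd(x[:, None], x[None, :], s)  # G[i, j] = <D k_{x_i}, D k_{x_j}>_H
    return np.linalg.solve(K.T @ K + lam * G, K.T @ y)
```

Here λ now trades data fit against smoothness of f̂ directly, rather than against its norm.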

3. Smoothness-Enforcing Regularisation : An Extended Representer of f(x)

A finite representer for V_D does not exist. But, by adding kernels along X, an approximate formulation can be defined :

FIGURE : N = 2 (observations, observation kernels, ∥f∥²) vs (N, P) = (2, 8) (observations, observation kernels, added kernels, ∥Df∥²)

F_D : f(x) = Σ_{i=1}^{N} α_i k_{x_i}(x) + Σ_{j=1}^{P} α*_j k_{x*_j}(x)
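A sketch of F_D under the same assumptions, with the P added centres x*_j placed uniformly along X (the placement rule is an illustration, not prescribed by the slides); the loss is still evaluated only at the N observed inputs, so K becomes rectangular :

```python
import numpy as np

def gauss_k(x, z, s):
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_fd(x, y, s, lam, P):
    """V_D under the extended representer F_D: centres are the N observed
    inputs plus P kernels spread uniformly along X (a sketch)."""
    x_add = np.linspace(x.min(), x.max(), P)   # added centres x*_j
    c = np.concatenate((x, x_add))             # all N + P centres
    K = gauss_k(x[:, None], c[None, :], s)     # N x (N+P): loss at data only
    G = gauss_k_dd(c[:, None], c[None, :], s)  # (N+P) x (N+P) derivative Gram
    return c, np.linalg.solve(K.T @ K + lam * G, K.T @ y)
```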

3. Smoothness-Enforcing Regularisation : Choosing the Kernel Width

Examination of the kernel density allows us to make an a priori choice of kernel width :

FIGURE : estimates f̂ and kernel sections k_x for kernel densities ρ_k = 0.4, ρ_k = 0.5 and ρ_k = 0.6

Hence, for a given P, we can define the maximally flexible model class for a given problem.
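Appendix B defines the kernel density as ρ_k = σ/Δx* with Δx* = (x*_max − x*_min)/P; inverting this gives a simple a priori width rule (a sketch; the helper name is hypothetical) :

```python
def sigma_for_density(x_min, x_max, P, rho):
    """Invert rho = sigma / dx*, with dx* = (x*_max - x*_min) / P (Appendix B),
    to choose the kernel width sigma for a target density rho."""
    dx = (x_max - x_min) / P
    return rho * dx
```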


4. Application : Estimation of Locally Nonsmooth Functions

In V_D, smoothness ∼ regularisation.

Hence, by introducing weights into the loss function, the importance of the regularisation can be varied across X :

V_w : V(f) = Σ_{k=1}^{N} (w_k y_k − w_k f(x_k))² + λ ∥Df∥²_H

How to determine the weights ?

Relative to a particular modelling objective, e.g.
• w_k ∼ ∥D f̂^(0)(x_k)∥²_2 for piecewise constant structures, or
• w_k ∼ ∥D² f̂^(0)(x_k)∥²_2 for piecewise linear structures.
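Since V_w only reweights the residuals, the same normal equations apply with W = diag(w); a sketch under the previous assumptions, where the weights could come from a pilot estimate f̂^(0) as on the slide :

```python
import numpy as np

def gauss_k(x, z, s):
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_vw(x, y, w, s, lam):
    """Minimise ||W (y - K alpha)||^2 + lam * alpha' G alpha with W = diag(w):
    large w_k favours data fit over smoothness near x_k, so the penalty is
    relaxed where the pilot estimate suggests the function is nonsmooth."""
    K = gauss_k(x[:, None], x[None, :], s)
    G = gauss_k_dd(x[:, None], x[None, :], s)
    W2 = np.diag(np.asarray(w)**2)
    return np.linalg.solve(K.T @ W2 @ K + lam * G, K.T @ W2 @ y)
```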

4. Application : Estimation of Locally Nonsmooth Functions

FIGURE : Noise-Free System (y_o) | FIGURE : Noisy System (y)

4. Application : Estimation of Locally Nonsmooth Functions

FIGURE : V_f : R(f) (y_o, f̂_MED, bias + sdev)
FIGURE : V_D : R(Df) (y_o, f̂_MED, bias + sdev)
FIGURE : V_w : R(Df) (y_o, f̂_MED, bias + sdev)

Conclusions

Objectives :
• To simplify model selection in nonlinear identification.
• By shifting the problem to a regularisation over functional derivatives.
→ Allowing the definition of an a priori flexible model class.

This presentation :
• First step ⇒ consider a simple example.
→ Model selection ⇔ smoothness detection.
→ Kernel selection ⇔ hyperparameter optimisation.

Current/Future Research :
• Application to dynamical, control-oriented problems (e.g. linear parameter-varying identification).
• Investigation of more complex model selection problems (e.g. detection of linearities, separability . . . ).

A. Bibliography

• Sobolev spaces (Wahba, 1990 ; Pillonetto et al., 2014)

∥f∥²_{H_k} = Σ_{i=0}^{m} ∫_X ( dⁱf(x)/dxⁱ )² dx

• Identification using derivative observations (Zhou, 2008 ; Rosasco et al., 2010)

V_obs(f) = ∥y − f(x)∥²_2 + γ_1 ∥dy/dx − df(x)/dx∥²_2 + · · · + γ_m ∥dᵐy/dxᵐ − dᵐf(x)/dxᵐ∥²_2 + λ ∥f∥_H

• Regularisation using derivatives (Rosasco et al., 2010 ; Lauer, Le and Bloch, 2012 ; Duijkers et al., 2014)

V_D(f) = ∥y − f(x)∥²_2 + λ ∥Dᵐf∥_p .

B. Choosing the Kernel Width : The Smoothness-Tolerance Parameter

ρ_k = σ / Δx* ,   Δx* = (x*_max − x*_min) / P ,   ε̂_f = 100 × { 1 − ∥f̂∥_inf / C } %.

FIGURE : Selecting an appropriate kernel using ε (smoothness tolerance ε(ρ) and ε̂ against kernel density ρ, log-log axes)

C. Effect of the Regularisation

Each pair of figures compares V_f : R(f) (left) with V_D : R(Df) (right), showing y_o, f̂_MEAN and f̂_SD, as the regularisation is increased :

⇒ Negligible regularisation (very small λ_f, λ_D).
⇒ Light regularisation (small λ_f, λ_D).
⇒ Moderate regularisation.
⇒ Heavy regularisation (large λ_f, λ_D).
⇒ Excessive regularisation (very large λ_f, λ_D).

D. Further Examples : Detecting Piecewise Structures

S_o : Noise-free and observed data

FIGURE : y(x_1, x_2)

Results M1 : (V_f, F_f), M2 : (V_D, F_D), M3 : (V_w, F_D)

FIGURES (for each model) : MEDIAN, BIAS, SDEV

E. Further Examples : Enforcing Separability

f(x_1, x_2) →_λ f_1(x_1) + f_2(x_2)

FIGURE : V_DX : R(∂x_1 ∂x_2 f)
