An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification


Y. Bhujwalla, V. Laurain, M. Gilson

55th IEEE Conference on Decision and Control
yusuf-michael.bhujwalla@univ-lorraine.fr

Introduction : Problem Description

Measured data : D_N = {(u_1, y_1), (u_2, y_2), . . . , (u_N, y_N)}

Describing an unknown system :

S_o : y_{o,k} = f_o(x_k),  f_o : X → R
      y_k = y_{o,k} + e_{o,k},  e_{o,k} ∼ N(0, σ_e²)

- x_k = [ y_{k−1} · · · y_{k−n_a}  u_{1,k} · · · u_{1,k−n_b}  u_{2,k} · · · u_{n_u,k−n_b} ]ᵀ ∈ X = R^{n_a + n_u(n_b+1)}
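A minimal sketch of how such regressor vectors could be stacked from recorded data (the helper name and the single-input restriction n_u = 1 are assumptions, not from the slides) :

```python
import numpy as np

def build_regressors(u, y, na, nb):
    """Stack NARX regressors x_k = [y_{k-1} ... y_{k-na}, u_k ... u_{k-nb}]
    for a single-input system (n_u = 1): a sketch of the setup above."""
    u, y = np.asarray(u), np.asarray(y)
    k0 = max(na, nb)                            # first k with a full regressor
    X = np.array([
        np.concatenate((y[k - na:k][::-1],      # y_{k-1}, ..., y_{k-na}
                        u[k - nb:k + 1][::-1])) # u_k, ..., u_{k-nb}
        for k in range(k0, len(y))
    ])
    return X, y[k0:]                            # regressors and matching targets
```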

Introduction : Modelling Objective

Aim : to choose the simplest model from a candidate set of models that accurately describes the system :

M_opt : Accuracy (Data) vs Simplicity (Model)
↓
V_f : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + g(f)

Q1 : How to choose the simplest accurate model ?
- Often g(f) = λ ∥f∥²_H, to ensure uniqueness of the solution
- λ → controls the bias-variance trade-off

Q2 : How to determine a suitable set of candidate models . . . ?

Outline

1. Kernel Methods in Nonlinear Identification

2. Model Selection Using Derivatives

3. Smoothness-Enforcing Regularisation

4. Application : Estimation of Locally Nonsmooth Functions


1. Kernel Methods in Nonlinear Identification

FIGURE : The estimate f̂ built from kernel sections k_x placed along the input domain (input vs output)

→ Model :

F_f : f(x) = Σ_{i=1}^{N} α_i k_{x_i}(x)

→ Nonparametric (n_θ ∼ N)
→ Flexible : M defined through choice of K
→ Height : α (model parameters)
→ Width : σ (kernel hyperparameter)
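A minimal sketch of evaluating F_f, assuming a Gaussian kernel (the slides leave K generic, so the kernel choice and names here are illustrative) :

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    """Gaussian kernel section k_c(x) = exp(-(x - c)^2 / (2 sigma^2))."""
    return np.exp(-(x - c)**2 / (2 * sigma**2))

def f_hat(x, centres, alpha, sigma):
    """Kernel expansion f(x) = sum_i alpha_i * k_{x_i}(x):
    alpha sets the heights, sigma the common width."""
    return gauss_kernel(np.asarray(x)[:, None], centres[None, :], sigma) @ alpha
```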

1. Kernel Methods in Nonlinear Identification : Identification in the RKHS

Reproducing Kernel Hilbert Spaces : the kernel function defines the model class :

K ↔ H

Hence, functions can be represented in terms of kernels :

f(x) = ⟨f, k_x⟩_H   (1)
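For V_f with g(f) = λ ∥f∥²_H, the representer theorem gives f = Σ_i α_i k_{x_i}, and since ∥f∥²_H = αᵀKα the problem reduces to one linear solve. A sketch under the same Gaussian-kernel assumption as above :

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    return np.exp(-(x - c)**2 / (2 * sigma**2))

def fit_vf(x, y, sigma, lam):
    """Solve min_f sum_k (y_k - f(x_k))^2 + lam * ||f||_H^2.
    With f = sum_i alpha_i k_{x_i} and ||f||_H^2 = alpha' K alpha,
    the minimiser is alpha = (K + lam I)^{-1} y."""
    K = gauss_kernel(x[:, None], x[None, :], sigma)  # Gram matrix K_ij = k(x_i, x_j)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)
```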

1. Kernel Methods in Nonlinear Identification : The Kernel Selection Problem

Choosing an overly flexible model class (a small kernel) :

FIGURE : Flexible Model Class → High Variance (f_o, y, f̂, k_x)

Choosing an overly constrained model class (a large kernel) :

FIGURE : Constrained Model Class → Model Biased (f_o, y, f̂, k_x)

Why not just choose the 'optimal' model class ?

FIGURE : Optimal Model Class → Optimal Model (f_o, y, f̂, k_x)

• This is, in general, what we try to do.
• However, H_opt is unknown.
• Optimisation over one hyperparameter : not that difficult.
• Optimisation over multiple model structures, kernel functions and hyperparameters → more difficult.


2. Model Selection Using Derivatives

But note that many properties of K are encoded in its derivatives, e.g.

Smoothness :   f(x) = ax² + bx + c  ⟹  d³f(x)/dx³ = 0 ∀x
               f(x) = g_1(x)[x < x*] + g_2(x)[x > x*]  ⟹  df(x)/dx exists ∀x ≠ x*
Linearity :    f(x_1, x_2) = x_1 h_1(x_2) + h_2(x_2)  ⟹  ∂²f(x_1, x_2)/∂x_1² = 0 ∀x_1
Separability : f(x_1, x_2) = g(x_1) + h(x_2)  ⟹  ∂²f(x_1, x_2)/∂x_1∂x_2 = 0 ∀x_1, x_2
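These implications are straightforward to verify symbolically; a small sanity check (sympy assumed available, with hypothetical function names h1, h2) :

```python
import sympy as sp

x, x1, x2, a, b, c = sp.symbols('x x1 x2 a b c')
h1, h2 = sp.Function('h1'), sp.Function('h2')

# Smoothness: a quadratic has a vanishing third derivative everywhere.
assert sp.diff(a*x**2 + b*x + c, x, 3) == 0

# Linearity in x1: f = x1*h1(x2) + h2(x2) has zero second derivative in x1.
assert sp.diff(x1*h1(x2) + h2(x2), x1, 2) == 0

# Separability: f = h1(x1) + h2(x2) has zero cross-derivative.
assert sp.diff(h1(x1) + h2(x2), x1, x2) == 0
```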

2. Model Selection Using Derivatives

Incorporating this information into the problem formulation allows the model selection to be transferred from an optimisation over K . . .

. . . to an explicit regularisation problem over derivatives, using an a priori flexible model class definition.


3. Smoothness-Enforcing Regularisation : Problem Formulation

Here we consider X = R, where the kernel optimisation is reduced to a smoothness selection problem.

What would we like to do ?

Replace the existing functional norm regularisation . . .

V_f : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + λ ∥f∥²_H

. . . with a smoothness penalty in the cost function . . .

V_D : V(f) = Σ_{k=1}^{N} (y_k − f(x_k))² + λ ∥Df∥²_H

How ?
- ∥Df∥²_H → known (D. X. Zhou, 2008)
- f(x) for V_D → unknown
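A sketch of how V_D could be solved under the expansion f = Σ_i α_i k_{x_i} : by Zhou's (2008) result on derivative reproducing kernels, ∥Df∥²_H = αᵀGα with G_{ij} = ∂²k(u, v)/∂u∂v evaluated at (x_i, x_j). The Gaussian kernel and all names here are assumptions, not the paper's prescribed implementation :

```python
import numpy as np

def gauss_k(x, z, s):
    """Gaussian kernel k(x, z) = exp(-(x - z)^2 / (2 s^2))."""
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    """Cross-derivative d^2 k / dx dz; for the Gaussian kernel this is
    (1/s^2 - (x - z)^2 / s^4) * k(x, z), giving <D k_x, D k_z>_H (Zhou, 2008)."""
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_vd(x, y, s, lam):
    """Minimise ||y - K alpha||^2 + lam * alpha' G alpha (the V_D criterion):
    setting the gradient to zero gives (K'K + lam G) alpha = K'y."""
    K = gauss_k(x[:, None], x[None, :], s)     # K[k, i] = k(x_k, x_i)
    G = gauss_k_dd(x[:, None], x[None, :], s)  # G[i, j] = <D k_{x_i}, D k_{x_j}>_H
    return np.linalg.solve(K.T @ K + lam * G, K.T @ y)
```

Here λ now trades data fit against smoothness of f̂ directly, rather than against its norm.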

3. Smoothness-Enforcing Regularisation : An Extended Representer of f(x)

A finite representer for V_D does not exist. But, by adding kernels along X, an approximate formulation can be defined :

FIGURE : N = 2 (observations, observation kernels, ∥f∥²) vs (N, P) = (2, 8) (observations, observation kernels, added kernels, ∥Df∥²)

F_D : f(x) = Σ_{i=1}^{N} α_i k_{x_i}(x) + Σ_{j=1}^{P} α*_j k_{x*_j}(x)
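A sketch of F_D under the same assumptions, with the P added centres x*_j placed uniformly along X (the placement rule is an illustration, not prescribed by the slides); the loss is still evaluated only at the N observed inputs, so K becomes rectangular :

```python
import numpy as np

def gauss_k(x, z, s):
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_fd(x, y, s, lam, P):
    """V_D under the extended representer F_D: centres are the N observed
    inputs plus P kernels spread uniformly along X (a sketch)."""
    x_add = np.linspace(x.min(), x.max(), P)   # added centres x*_j
    c = np.concatenate((x, x_add))             # all N + P centres
    K = gauss_k(x[:, None], c[None, :], s)     # N x (N+P): loss at data only
    G = gauss_k_dd(c[:, None], c[None, :], s)  # (N+P) x (N+P) derivative Gram
    return c, np.linalg.solve(K.T @ K + lam * G, K.T @ y)
```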

3. Smoothness-Enforcing Regularisation : Choosing the Kernel Width

Examination of the kernel density allows us to make an a priori choice of kernel width :

FIGURE : estimates f̂ and kernel sections k_x for kernel densities ρ_k = 0.4, ρ_k = 0.5 and ρ_k = 0.6

Hence, for a given P, we can define the maximally flexible model class for a given problem.
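Appendix B defines the kernel density as ρ_k = σ/Δx* with Δx* = (x*_max − x*_min)/P; inverting this gives a simple a priori width rule (a sketch; the helper name is hypothetical) :

```python
def sigma_for_density(x_min, x_max, P, rho):
    """Invert rho = sigma / dx*, with dx* = (x*_max - x*_min) / P (Appendix B),
    to choose the kernel width sigma for a target density rho."""
    dx = (x_max - x_min) / P
    return rho * dx
```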


4. Application : Estimation of Locally Nonsmooth Functions

In V_D, smoothness ∼ regularisation.

Hence, by introducing weights into the loss function, the importance of the regularisation can be varied across X :

V_w : V(f) = Σ_{k=1}^{N} (w_k y_k − w_k f(x_k))² + λ ∥Df∥²_H

How to determine the weights ?

Relative to a particular modelling objective, e.g.
• w_k ∼ ∥D f̂^(0)(x_k)∥²_2 for piecewise constant structures, or
• w_k ∼ ∥D² f̂^(0)(x_k)∥²_2 for piecewise linear structures.
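Since V_w only reweights the residuals, the same normal equations apply with W = diag(w); a sketch under the previous assumptions, where the weights could come from a pilot estimate f̂^(0) as on the slide :

```python
import numpy as np

def gauss_k(x, z, s):
    return np.exp(-(x - z)**2 / (2 * s**2))

def gauss_k_dd(x, z, s):
    return (1.0 / s**2 - (x - z)**2 / s**4) * gauss_k(x, z, s)

def fit_vw(x, y, w, s, lam):
    """Minimise ||W (y - K alpha)||^2 + lam * alpha' G alpha with W = diag(w):
    large w_k favours data fit over smoothness near x_k, so the penalty is
    relaxed where the pilot estimate suggests the function is nonsmooth."""
    K = gauss_k(x[:, None], x[None, :], s)
    G = gauss_k_dd(x[:, None], x[None, :], s)
    W2 = np.diag(np.asarray(w)**2)
    return np.linalg.solve(K.T @ W2 @ K + lam * G, K.T @ W2 @ y)
```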

4. Application : Estimation of Locally Nonsmooth Functions

FIGURE : Noise-Free System (y_o) | FIGURE : Noisy System (y)

4. Application : Estimation of Locally Nonsmooth Functions

FIGURE : V_f : R(f) (y_o, f̂_MED, bias + sdev)
FIGURE : V_D : R(Df) (y_o, f̂_MED, bias + sdev)
FIGURE : V_w : R(Df) (y_o, f̂_MED, bias + sdev)

Conclusions

Objectives :
• To simplify model selection in nonlinear identification.
• By shifting the problem to a regularisation over functional derivatives.
→ Allowing the definition of an a priori flexible model class.

This presentation :
• First step ⇒ consider a simple example.
→ Model selection ⇔ smoothness detection.
→ Kernel selection ⇔ hyperparameter optimisation.

Current/Future Research :
• Application to dynamical, control-oriented problems (e.g. linear parameter-varying identification).
• Investigation of more complex model selection problems (e.g. detection of linearities, separability . . . ).

A. Bibliography

• Sobolev spaces (Wahba, 1990 ; Pillonetto et al., 2014)

∥f∥²_{H_k} = Σ_{i=0}^{m} ∫_X ( dⁱf(x)/dxⁱ )² dx

• Identification using derivative observations (Zhou, 2008 ; Rosasco et al., 2010)

V_obs(f) = ∥y − f(x)∥²_2 + γ_1 ∥dy/dx − df(x)/dx∥²_2 + · · · + γ_m ∥dᵐy/dxᵐ − dᵐf(x)/dxᵐ∥²_2 + λ ∥f∥_H

• Regularisation using derivatives (Rosasco et al., 2010 ; Lauer, Le and Bloch, 2012 ; Duijkers et al., 2014)

V_D(f) = ∥y − f(x)∥²_2 + λ ∥Dᵐf∥_p .

B. Choosing the Kernel Width : The Smoothness-Tolerance Parameter

ρ_k = σ / Δx* ,   Δx* = (x*_max − x*_min) / P ,   ε̂_f = 100 × { 1 − ∥f̂∥_inf / C } %.

FIGURE : Selecting an appropriate kernel using ε (smoothness tolerance ε(ρ) and ε̂ against kernel density ρ, log-log axes)

C. Effect of the Regularisation

Each pair of figures compares V_f : R(f) (left) with V_D : R(Df) (right), showing y_o, f̂_MEAN and f̂_SD, as the regularisation is increased :

⇒ Negligible regularisation (very small λ_f, λ_D).
⇒ Light regularisation (small λ_f, λ_D).
⇒ Moderate regularisation.
⇒ Heavy regularisation (large λ_f, λ_D).
⇒ Excessive regularisation (very large λ_f, λ_D).

D. Further Examples : Detecting Piecewise Structures

S_o : Noise-free and observed data

FIGURE : y(x_1, x_2)

Results M1 : (V_f, F_f), M2 : (V_D, F_D), M3 : (V_w, F_D)

FIGURES (for each model) : MEDIAN, BIAS, SDEV

E. Further Examples : Enforcing Separability

f(x_1, x_2) →_λ f_1(x_1) + f_2(x_2)

FIGURE : V_DX : R(∂x_1 ∂x_2 f)
