Gaussian process models and covariance function design
Nicolas Durrande (Ecole des Mines de St Etienne)
UJM – APSSE Seminar, the 28th of November 2014
N. Durrande ([email protected]), GP models and kernel design, 28th of November 2014
Introduction
Context

We consider a function f whose evaluation is costly. Examples:
- a physical experiment
- the output of a large computer code
- ...
Using a limited number of observations, we want to answer questions such as:
- What is the minimum of f?
- What is its mean value?
- What is the probability of being above a given threshold?
- Are there non-influential variables?
- ...
Outline:
1. Introduction to GP models
2. A few words on RKHS
3. What is a kernel?
4. Usual operations on kernels
5. Effect of a linear operator
6. Two examples of application:
   - sensitivity analysis
   - periodicity detection
We assume we have observed f at a limited number of points x1, ..., xn:
[Figure: observations fi = f(xi) plotted against x]
The observations are denoted by fi = f (xi) (or F = f (X )).
Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Y:
[Figure: sample paths of Y]
Y is characterised by its mean and covariance function.
We can look at the sample paths of Y that interpolate the data points:
[Figure: sample paths of Y interpolating the data points]
The conditional distribution is still Gaussian. It has mean and variance
m(x) = E(Y(x) | Y(X) = F) = k(x, X) k(X, X)⁻¹ F
v(x) = var(Y(x) | Y(X) = F) = k(x, x) − k(x, X) k(X, X)⁻¹ k(x, X)ᵗ
where k is the kernel: k(x , y) = cov(Y (x),Y (y)).
It can be represented as a mean function with confidence intervals.
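As a rough sketch, the two conditional formulas above translate directly into plain numpy; the observation points, kernel, and parameter values below are arbitrary illustrations, not the ones used in the slides:

```python
import numpy as np

def rbf(x, y, sigma2=1.0, theta=2.0):
    """Squared-exponential kernel k(x, y) = sigma2 * exp(-(x - y)^2 / (2 theta^2))."""
    return sigma2 * np.exp(-np.subtract.outer(x, y) ** 2 / (2 * theta ** 2))

def gp_posterior(x, X, F, k):
    """Conditional mean m(x) and variance v(x) of Y(x) given Y(X) = F."""
    KXX = k(X, X) + 1e-10 * np.eye(len(X))   # small jitter for numerical stability
    KxX = k(x, X)
    m = KxX @ np.linalg.solve(KXX, F)
    v = k(x, x).diagonal() - np.einsum('ij,ji->i', KxX, np.linalg.solve(KXX, KxX.T))
    return m, v

# toy data: four observations of sin
X = np.array([1.0, 4.0, 6.0, 10.0])
F = np.sin(X)
x = np.linspace(0, 15, 151)
m, v = gp_posterior(x, X, F, rbf)
```

At the observed points the mean interpolates the data and the variance drops to (almost) zero, which is exactly what the figure below shows.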
[Figure: conditional mean m(x) with confidence intervals]
Changing the kernel has a huge impact on the model:
[Figures: two models of the same data]

Gaussian kernel:
k(x, y) = σ² exp(−(x − y)² / (2θ²))

exponential kernel:
k(x, y) = σ² exp(−|x − y| / θ)
This is because changing the kernel means changing the prior on f:
[Figures: prior sample paths for the Gaussian kernel (left) and the exponential kernel (right)]
Given the observations, the model is entirely defined by the kernel. We will now focus on this object.
Theorem (Loève)
k corresponds to the covariance of a GP
⇔
k is a symmetric positive semi-definite function

Definition
A function k is positive semi-definite if it satisfies

∑_{i=1}^n ∑_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0

for all n ∈ N, for all x_i ∈ D, for all a_i ∈ R.
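This definition can be checked numerically on any finite set of points: the Gram matrix (k(xi, xj)) must be symmetric with nonnegative eigenvalues. A small sketch (the test points and the counter-example function are arbitrary choices):

```python
import numpy as np

def is_psd(k, xs, tol=1e-9):
    """Build the Gram matrix (k(xi, xj)) and check its smallest eigenvalue."""
    K = np.array([[k(a, b) for b in xs] for a in xs])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

xs = np.linspace(0, 15, 30)
gaussian = lambda x, y: np.exp(-(x - y) ** 2 / 2)   # a valid kernel
not_psd  = lambda x, y: -abs(x - y)                 # symmetric but not p.s.d.
```

Such a check can only refute positive semi-definiteness (on the chosen points), never prove it for all n, but it is a useful sanity test when designing new kernels.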
A symmetric positive semi-definite function is also the reproducing kernel of a RKHS:

Definition
H is a RKHS with kernel k if it is a Hilbert space such that:
- for all x, k(x, ·) ∈ H
- for all f ∈ H, ⟨f(·), k(x, ·)⟩_H = f(x)

Given a kernel k, the associated RKHS is the completion of

{ ∑_{i=1}^n a_i k(x_i, ·) ; n ∈ N, a_i ∈ R, x_i ∈ D }

for the inner product

⟨ ∑_{i=1}^n a_i k(x_i, ·), ∑_{j=1}^m b_j k(x_j, ·) ⟩ = ∑_{i=1}^n ∑_{j=1}^m a_i b_j k(x_i, x_j)
Given some observations, the best predictor is defined as the interpolator with minimal norm:

m = argmin_{h ∈ H, h(x_i) = f(x_i)} ||h||_H = · · · = k(x, X) k(X, X)⁻¹ F
The expression is the same as the conditional expectation of the GP!
[Diagram: correspondence between the kernel k, the RKHS H, and the GP Z]
In order to build m, we can use any off-the-shelf kernel:
- white noise: k(x, y) = δ_{x,y}
- bias: k(x, y) = 1
- linear: k(x, y) = xy
- exponential: k(x, y) = exp(−|x − y|)
- Brownian: k(x, y) = min(x, y)
- Gaussian: k(x, y) = exp(−(x − y)²)
- Matérn 3/2: k(x, y) = (1 + |x − y|) exp(−|x − y|)
- sinc: k(x, y) = sin(|x − y|) / |x − y|
- ...
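For illustration, the kernels of this list can be written as plain Python functions (scalar inputs assumed; the Brownian and linear kernels are taken on positive inputs):

```python
import numpy as np

# A few of the kernels listed above, as plain functions of two scalars.
kernels = {
    "white noise": lambda x, y: float(x == y),
    "bias":        lambda x, y: 1.0,
    "linear":      lambda x, y: x * y,
    "exponential": lambda x, y: np.exp(-abs(x - y)),
    "Brownian":    lambda x, y: min(x, y),
    "Gaussian":    lambda x, y: np.exp(-(x - y) ** 2),
    "Matern 3/2":  lambda x, y: (1 + abs(x - y)) * np.exp(-abs(x - y)),
    "sinc":        lambda x, y: np.sinc((x - y) / np.pi),  # = sin(x - y) / (x - y)
}
```

Each of these Gram matrices is positive semi-definite, which can be verified with the eigenvalue check from the previous slide.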
The kernel has to be chosen according to the prior belief about the function to approximate:
- What is the regularity of the phenomenon?
- Is it stationary?
- ...

If we have some knowledge about the behaviour of f, can we design a kernel accordingly? We will discuss three options:
- making new from old
- linear operators
- extracting RKHS subspaces
Making new from old:

Kernels can be:
- summed together
  - on the same space: k(x, y) = k1(x, y) + k2(x, y)
  - on the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)
- multiplied together
  - on the same space: k(x, y) = k1(x, y) × k2(x, y)
  - on the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)
- composed with a function: k(x, y) = k1(f(x), f(y))

All these operations preserve positive definiteness.
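A quick numerical sanity check of these closure properties, using a Gaussian and a Brownian kernel as building blocks (the choice of f = log for the composition is arbitrary):

```python
import numpy as np

k1 = lambda x, y: np.exp(-(x - y) ** 2)          # Gaussian kernel
k2 = lambda x, y: np.minimum(x, y)               # Brownian kernel (x, y > 0)

k_sum  = lambda x, y: k1(x, y) + k2(x, y)        # sum on the same space
k_prod = lambda x, y: k1(x, y) * k2(x, y)        # product on the same space
k_comp = lambda x, y: k1(np.log(x), np.log(y))   # composition with f = log

def min_eig(k, xs):
    """Smallest eigenvalue of the Gram matrix of k on the points xs."""
    K = np.array([[k(a, b) for b in xs] for a in xs])
    return float(np.linalg.eigvalsh(K).min())
```

All three Gram matrices remain positive semi-definite, as the closure properties predict.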
How can this be useful?
Sum of kernels over the same space
Example (the Mauna Loa observatory dataset): this famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.
[Figure: monthly CO2 concentration at Mauna Loa since 1958]
Let’s try to predict the concentration for the next 20 years.
We first consider a squared-exponential kernel:
[Figures: model with a single squared-exponential kernel]
The results are terrible!
What happens if we sum two squared-exponential kernels with different parameters?
k(x, y) = σ1² krbf1(x, y) + σ2² krbf2(x, y)
[Figure: model with the sum of two squared-exponential kernels]
The model is drastically improved!
We can try the following kernel:
k(x, y) = σ0² x²y² + σ1² krbf1(x, y) + σ2² krbf2(x, y) + σ3² kper(x, y)
[Figure: model with the four-term kernel]
Once again, the model is significantly improved.
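A sketch of such a composite kernel in Python. The slides do not give kper explicitly, so the standard periodic kernel exp(−2 sin²(π(x − y)/p)/θ²) is substituted here, and all parameter values are illustrative only:

```python
import numpy as np

def rbf(x, y, theta):
    return np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def per(x, y, period, theta):
    # standard periodic kernel; an assumption, the slides do not define kper
    return np.exp(-2 * np.sin(np.pi * (x - y) / period) ** 2 / theta ** 2)

def k_co2(x, y):
    """x, y in years since 1958; all variances and length-scales are made up."""
    return (1e-4 * x**2 * y**2                         # sigma0^2 x^2 y^2 : quadratic trend
            + 1.0 * rbf(x, y, theta=30.0)              # sigma1^2 krbf1  : slow variations
            + 0.1 * rbf(x, y, theta=2.0)               # sigma2^2 krbf2  : faster variations
            + 0.5 * per(x, y, period=1.0, theta=1.0))  # sigma3^2 kper   : yearly cycle
```

Each summand captures one feature of the data (trend, slow and fast variations, yearly cycle), and the sum is still a valid kernel.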
Sum of kernels over tensor space
Property

K(x, y) = K1(x1, y1) + K2(x2, y2)    (1)

is a valid covariance structure.
[Figures: k1(x1, y1) + k2(x2, y2) = K(x, y), shown as surfaces on [0, 1]²]
Remark: from a GP point of view, K is the kernel of Z(x) = Z1(x1) + Z2(x2).
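A sketch of this construction: sampling one path of Z from the additive kernel and checking the additivity property Z(x1, x2) + Z(y1, y2) = Z(x1, y2) + Z(y1, x2) that additive functions satisfy. The kernel choices and the grid are arbitrary:

```python
import numpy as np

k1 = lambda a, b: np.exp(-(a - b) ** 2 / 0.1)
k2 = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)

def k_add(x, y):
    """K(x, y) = k1(x1, y1) + k2(x2, y2) on [0, 1]^2."""
    return k1(x[0], y[0]) + k2(x[1], y[1])

# sample one path of Z(x) = Z1(x1) + Z2(x2) on an 8 x 8 grid
rng = np.random.default_rng(0)
g = np.linspace(0, 1, 8)
pts = [(a, b) for a in g for b in g]
K = np.array([[k_add(p, q) for q in pts] for p in pts])
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(pts)))   # jitter: K is low rank
Z = (L @ rng.standard_normal(len(pts))).reshape(8, 8)
```

The jitter is needed because the additive Gram matrix is rank-deficient (its rank is at most the sum of the ranks of the two univariate Gram matrices).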
We can have a look at a few sample paths from Z :
[Figures: sample paths of Z]
⇒ They are additive (up to a modification)
Additive kernels are very useful for:
- approximating additive functions
- building models over high-dimensional input spaces
Effect of a linear operator
Property
Let L be a linear operator that commutes with the covariance. Then k(x, y) = Lx(Ly(k1(x, y))) is a kernel.

Example
We want to approximate a function [0, 1] → R that is symmetric with respect to 0.5. We will consider two linear operators:

L1 : f(x) ↦ f(x) if x < 0.5, f(1 − x) if x ≥ 0.5

L2 : f(x) ↦ (f(x) + f(1 − x)) / 2
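Both operators turn a base kernel into a symmetry-enforcing kernel as in the property above: L1 warps both arguments by the fold s(x) = min(x, 1 − x), and L2 averages the kernel over the symmetry. A sketch with an arbitrary RBF base kernel:

```python
import numpy as np

def k(x, y, theta=0.2):
    """Base RBF kernel (arbitrary choice for illustration)."""
    return np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def s(x):
    """Fold [0, 1] onto [0, 0.5]: s(x) = x if x < 0.5 else 1 - x."""
    return np.minimum(x, 1 - x)

def k_L1(x, y):
    """k1 = L1(L1(k)): warp both arguments."""
    return k(s(x), s(y))

def k_L2(x, y):
    """k2 = L2(L2(k)): average k over the symmetry."""
    return (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y)) / 4
```

By construction both kernels satisfy k(x, y) = k(x, 1 − y), so every sample path of the associated GP is symmetric about 0.5.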
Effect of a linear operator: example (Ginsbourger, AFST 2013)
Examples of associated sample paths:

[Figures: sample paths for k1 = L1(L1(k)) (left) and k2 = L2(L2(k)) (right)]
The differentiability is not always respected!
Ideally, we want to extract the subspace of symmetric functions in H
[Diagram: the subspace Hsym of H, with a function f and its images L1f and L2f]
and to define L as the orthogonal projection onto Hsym
⇒ This can be difficult... but it raises interesting questions!
2 practical cases
We will now see how to construct the optimal linear operator on two real problems:
- kernels for sensitivity analysis
- kernels for periodicity detection
Sensitivity analysis
The analysis of the influence of the various variables of a d-dimensional function f is often based on the HDMR:

f(x) = f0 + ∑_{i=1}^d fi(xi) + ∑_{i<j} fij(xi, xj) + · · · + f1,...,d(x1, ..., xd)
A first idea is to consider ANOVA kernels [Stitson 97]:

k(x, y) = ∏_{i=1}^d (1 + k(xi, yi))
        = 1 + ∑_{i=1}^d k(xi, yi)   (additive part)
          + ∑_{i<j} k(xi, yi) k(xj, yj) + · · · + ∏_{i=1}^d k(xi, yi)   (interactions)
A decomposition of the best predictor is naturally associated to those kernels.

Example: in 2D we have K = 1 + K1 + K2 + K1K2, so the best predictor can be written as

m(x) = (1 + k(x1) + k(x2) + k(x1)k(x2))ᵗ K⁻¹ F
     = m0 + m1(x1) + m2(x2) + m12(x)
This decomposition looks like the ANOVA representation of m, but the mI do not satisfy

∫_{Di} mI(xI) dxi = 0

We need to build a kernel k0 such that ∫ k0(x, y) dx = 0 for all y.
The RKHS framework will be very useful here.

Can we extract the subspace H0 of zero-mean functions in H?

h ∈ H0 ⇔ ∫ h(x) dx = 0

The integral operator is linear, and it is bounded if ∫ k(x, x) dx < ∞.
Calculations give directly

R(x) = ⟨R, k(x, ·)⟩_H = ∫_D k(x, s) ds

L(h) = h − (⟨R, h⟩_H / ||R||²_H) R

k0(x, y) = k(x, y) − ( ∫ k(x, s) ds ∫ k(y, s) ds ) / ( ∫∫ k(s, t) ds dt )
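The zero-mean kernel k0 can be approximated numerically by replacing the integrals with a quadrature rule; the base kernel and the grid below are arbitrary illustrations:

```python
import numpy as np

def trap(f_vals, x):
    """Trapezoidal quadrature of sampled values f_vals over the grid x."""
    return float(np.sum((f_vals[1:] + f_vals[:-1]) / 2 * np.diff(x)))

D = np.linspace(0, 1, 201)                # quadrature grid on D = [0, 1]

def k(x, y):
    """Base RBF kernel (arbitrary length-scale)."""
    return np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))

Ik = np.array([trap(k(xi, D), D) for xi in D])   # integral of k(x, s) ds on the grid
IIk = trap(Ik, D)                                # double integral of k(s, t) ds dt

def k0(x, y):
    """Zero-mean kernel: k(x, y) minus the rank-one correction."""
    ix = trap(k(x, D), D)
    iy = trap(k(y, D), D)
    return k(x, y) - ix * iy / IIk
```

With consistent quadrature, the defining property ∫ k0(x, y) dx = 0 holds exactly at the discrete level, so the sub-models built from k0 inherit the zero-mean ANOVA property.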
Let us consider the random test function f : [0, 1]¹⁰ → R:

x ↦ 10 sin(πx1x2) + 20(x3 − 0.5)² + 10x4 + 5x5 + N(0, 1)

The steps for approximating f with GPR are:
1. evaluate f on a DoE (here a maximin LHS with 180 points),
2. get the optimal values of the kernel parameters using MLE,
3. build the kriging predictor m based on ∏(1 + k0).

As f̂ is a function of 10 variables, the model cannot easily be represented: it is usually considered as a "black box". However, the structure of the kernel allows m to be split into sub-models.
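A sketch of step 3, showing the product structure and how univariate sub-models are read off the predictor. A plain RBF stands in for the zero-mean kernel k0, and the design and test function are toy substitutes for the real DoE and f:

```python
import numpy as np

def k0_1d(x, y, theta=0.3):
    # stands in for a univariate zero-mean kernel k0 (see the construction above);
    # a plain RBF is used here only to show the structure
    return np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def k_anova(x, y):
    """ANOVA-type kernel k(x, y) = prod_i (1 + k0(x_i, y_i))."""
    return float(np.prod(1 + k0_1d(np.asarray(x), np.asarray(y))))

d = 4
rng = np.random.default_rng(1)
X = rng.uniform(size=(30, d))                  # stand-in for a LHS maximin design
F = np.sin(np.pi * X[:, 0]) + 2 * X[:, 1]      # toy function of x1, x2 only
K = np.array([[k_anova(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(X)), F)

def sub_model(i, xi):
    """Univariate sub-model m_i(x_i) = k0(x_i, X_i)^t K^{-1} F."""
    return float(k0_1d(xi, X[:, i]) @ alpha)
```

Each sub-model m_i is obtained by keeping only the i-th additive term of the expanded product kernel, which is what makes the 10-dimensional model interpretable despite being a "black box" globally.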
The univariate sub-models are:

[Figures: univariate sub-models mi]

(Recall f(x) = 10 sin(πx1x2) + 20(x3 − 0.5)² + 10x4 + 5x5 + N(0, 1).)
Periodicity detection
General problem: given a few observations, can we extract the periodic part of a signal?
[Figure: observations of a signal with a periodic component]
As previously, we will build an orthogonal decomposition of the RKHS:

H = Hp + Ha

where Hp is the subspace of H spanned by the Fourier basis B(t) = (sin(t), cos(t), ..., sin(nt), cos(nt))ᵗ.

Property
The reproducing kernel of Hp is

kp(x, y) = B(x)ᵗ G⁻¹ B(y)

where G is the Gram matrix associated to B.
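A sketch of the construction kp(x, y) = B(x)ᵗ G⁻¹ B(y). Note that the Gram matrix G of the property is taken with respect to the inner product of H; the sketch below substitutes the L2([0, 2π]) inner product, computed on a uniform grid, purely to illustrate the mechanics:

```python
import numpy as np

n = 3                                   # Fourier basis up to frequency n

def B(t):
    """Fourier basis B(t) = (sin(t), cos(t), ..., sin(nt), cos(nt))^t."""
    return np.array([f(j * t) for j in range(1, n + 1) for f in (np.sin, np.cos)])

# Gram matrix by quadrature; an L2 substitute for the RKHS inner product of H.
ts = np.linspace(0, 2 * np.pi, 400, endpoint=False)
Phi = np.stack([B(t) for t in ts])               # 400 x 2n design matrix
G = Phi.T @ Phi * (2 * np.pi / len(ts))          # quadrature Gram matrix

def kp(x, y):
    """Reproducing kernel of the span of the Fourier basis."""
    return float(B(x) @ np.linalg.solve(G, B(y)))
```

By construction kp(x, ·) lies in the span of B, so it is symmetric and 2π-periodic in each argument.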
We can deduce the following decomposition of the kernel:

k(x, y) = kp(x, y) + (k(x, y) − kp(x, y)) = kp(x, y) + ka(x, y)

Property (decomposition of the model)
The decomposition of the kernel gives directly

m(t) = (kp(t) + ka(t))ᵗ (Kp + Ka)⁻¹ F
     = kp(t)ᵗ (Kp + Ka)⁻¹ F + ka(t)ᵗ (Kp + Ka)⁻¹ F

where the first term is the periodic sub-model mp and the second the aperiodic sub-model ma, and we can associate a prediction variance to the sub-models:

vp(t) = kp(t, t) − kp(t)ᵗ (Kp + Ka)⁻¹ kp(t)
va(t) = ka(t, t) − ka(t)ᵗ (Kp + Ka)⁻¹ ka(t)
Example: for the observations shown previously we obtain:
[Figures: model m decomposed as the periodic sub-model mp plus the aperiodic sub-model ma]
Can we do better?
Previously, the kernels were parameterised by two variables:

k(x, y, σ², θ)

but writing k as a sum allows the parameters of the sub-kernels to be tuned independently. Let k* be defined as

k*(x, y, σp², σa², θp, θa) = kp(x, y, σp², θp) + ka(x, y, σa², θa)

Furthermore, we include a fifth parameter in k*, accounting for the period, by changing the Fourier basis:

Bω(t) = (sin(ωt), cos(ωt), ..., sin(nωt), cos(nωt))ᵗ
If we optimise the five parameters of k* with maximum likelihood estimation, we obtain:
[Figures: decomposition m = mp + ma after optimising the five parameters of k*]
The 24-hour cycle of days can be observed in the oscillations of many physiological processes of living beings.
Examples: body temperature, jet lag, sleep, ...; this is also observed in plants, micro-organisms, etc.
This phenomenon is called the circadian rhythm and the mechanismdriving this cycle is the circadian clock.
To understand how the circadian clock operates at the gene level, biologists look at the temporal evolution of gene expression.
The aim of gene expression measurements is to quantify the activity of various genes:
The mRNA concentration is measured with microarray experiments: the chip is scanned to determine the occupation of each cell, revealing the concentration of mRNA.
Experiments to study the circadian clock typically:
1. expose the organism to a 12h light / 12h dark cycle,
2. at t = 0, transfer it to constant light,
3. perform a microarray experiment every 4 hours to measure gene expression.
Regulators of the circadian clock are often rhythmically regulated.
⇒ identifying periodically expressed genes gives an insight into the overall mechanism.
We used data from Edward 2006, based on Arabidopsis.
The dimensions of the data are:
- 22810 genes
- 13 time points
Edward 2006 gives a list of the 3504 most periodically expressed genes. The comparison with our approach gives:
- 21767 genes with the same label (2461 periodic and 19306 non-periodic)
- 1043 genes with different labels
Let’s look at genes with different labels:
[Figures: expression profiles of the genes with different labels (At1g60810, At4g10040, At1g06290, At5g48900, At5g41480, At3g08000, At3g03900, At2g36400); labels: periodic for Edward vs periodic for our approach]
Conclusion
We have seen that:
- Gaussian process regression is a great tool for modelling.
- Kernels can (and should) be tailored to the problem at hand.
What cannot be done with GPs... It is rather difficult to:
- impose non-linear constraints
- deal with a (very) large number of observations
- ...
Bibliography
N. Aronszajn. Theory of reproducing kernels. Transactions of the AMS, 68, 1950.
A. Berlinet and C. Thomas-Agnan. RKHS in Probability and Statistics. Kluwer Academic Publishers, 2004.