Gaussian Process Methods in Machine Learning
Jonathan Scarlett ([email protected])
Lecture 2: Optimization with Gaussian Processes
CS6216, Semester 1, AY2021/22
Outline of Lectures
• Lecture 0: Bayesian Modeling and Regression
• Lecture 1: Gaussian Processes, Kernels, and Regression
• Lecture 2: Optimization with Gaussian Processes
• Lecture 3: Advanced Bayesian Optimization Methods
• Lecture 4: GP Methods in Non-Bayesian Settings
Outline: This Lecture
• This lecture:
  1. Black-box function optimization
  2. Gaussian processes
  3. Bayesian optimization algorithms
  4. Regret bounds
  5. Ongoing challenges in Bayesian optimization
Optimization (I)
• So far, we have talked about learning the entire input-output relation, e.g., y ≈ f(x)
• A potentially easier problem is optimization, where we seek to find x maximizing f:

  x* ∈ arg max_{x ∈ D} f(x)

  – It is potentially easier because we typically don't need to learn f accurately for inputs x that are far from x*
• In this context, instead of simply having access to a data set D = {(x_t, y_t)}_{t=1}^n, we are able to adaptively query x_t based on the past samples y_1, ..., y_{t-1} (a form of active learning)
Optimization (II)
• A more familiar optimization setting:
  – Function f is known
  – Algorithm can use derivatives (e.g., gradient descent/ascent)
  – Main bottleneck is computation
• What we will consider:
  – Function f is unknown
  – Algorithm only has access to "black-box" function queries
  – Main bottleneck is the high cost of such queries
Optimization (III)
Black-box function optimization:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• Setting:
  – Unknown "reward" function f
  – Expensive evaluations of f
  – Noisy evaluations:

    y_t = f(x_t) + z_t,  z_t ~ N(0, σ²)

  – Choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}

• Note: This problem with a GP approach is often called Bayesian optimization
The Bayesian Mechanics
[Figure: GP posterior — mean predictions ± 3 standard deviations]

• Processing noisy observations in the Bayesian setting:

  y_t = f(x_t) + z_t, where z_t ~ N(0, σ²)

  – Given samples y_t = [y_1, ..., y_t] at points x_{1:t}, the posterior is also a GP:

    μ_{t+1}(x) = k_t(x)^T (K_t + σ² I_t)^{-1} y_t
    σ²_{t+1}(x) = k(x, x) − k_t(x)^T (K_t + σ² I_t)^{-1} k_t(x),

    where K_t = [k(x, x')]_{x,x' ∈ x_{1:t}} and k_t(x) = [k(x_i, x)]_{i=1}^t
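As a concrete illustration of these posterior formulas, here is a minimal NumPy sketch for scalar inputs. It is not lecture code: the squared-exponential kernel choice, lengthscale, and noise level are illustrative assumptions.

import numpy as np

def se_kernel(a, b, ell=0.2):
    # Pairwise squared-exponential kernel: k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def gp_posterior(x_obs, y_obs, x_query, noise_var=0.01, ell=0.2):
    # Posterior mean and variance at x_query, following the formulas above
    K = se_kernel(x_obs, x_obs, ell)             # K_t (t x t)
    k_q = se_kernel(x_obs, x_query, ell)         # column j is k_t(x_query[j])
    w = np.linalg.solve(K + noise_var * np.eye(len(x_obs)), k_q)
    mu = w.T @ y_obs                             # k_t(x)^T (K_t + s^2 I)^{-1} y_t
    var = 1.0 - np.sum(k_q * w, axis=0)          # k(x, x) = 1 for this kernel
    return mu, var

# Toy usage: 5 noisy samples of a 1-D function on [0, 1]
rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, size=5)
y_obs = np.sin(6 * x_obs) + 0.1 * rng.standard_normal(5)
mu, var = gp_posterior(x_obs, y_obs, np.linspace(0, 1, 100))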
Exercise
Exercise. Let D = [0, 1], so we are looking for x* = arg max_{x ∈ [0,1]} f(x).
  1. Draw an f which will be extremely difficult to optimize based on expensive evaluations f(x_t) (even if they are noiseless)
  2. Draw an f where such optimization is relatively easy
Challenge: Optimizing in the Dark
There are (infinitely) many functions consistent with a finite number of samples.

[Figure: many candidate functions agreeing with the same finite set of samples]
Relevance
Black-box function optimization:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• Applications in a vast number of domains...
  – hyperparameter tuning for learning algorithms [Snoek et al., 2012]
  – environmental monitoring and sensor networks [Srinivas et al., 2012]
  – recommender systems and advertising [Vanchinathan et al., 2014]
  – robotics [Lizotte et al., 2007]
  – molecule discovery [Gómez-Bombarelli et al., 2018]
  – ...and many more...
Example 1: Parameter Tuning
• A typical example is parameter tuning:
  – Robotics
  – Deep neural networks
  – ...

[Figure: performance surfaces as a function of Parameter 1 and Parameter 2]
Image source: (i) https://www.youtube.com/watch?v=x90UjisDPjM, (ii) https://www.ez-robot.com/,
(iii) https://wp.wwu.edu/machinelearning/2017/02/12/deep-neural-networks/
Example 2: Molecule Discovery
• To give an idea of a "less standard" application of black-box optimization, here is a recent example from [Gómez-Bombarelli et al., 2018]
• Molecular design:
  – Represent molecule structures in some feature space
  – Explore that space to discover configurations with desirable properties (e.g., elasticity, power conversion properties)
  – Kernel encodes similarities between different configurations
Impact
• Impact of GP-based black-box optimization: Test-of-time award at ICML 2020
• Famous application: Hyperparameter tuning in AlphaGo
Towards Practical Solutions
A black-box function optimization problem:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• The above problem is hard in general
• Sequentially choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}
  – need a meaningful "success" metric: Regret
  – need additional assumptions: Smoothness
Simple and Cumulative Regret
• The true optimal point:

  x* ∈ arg max_{x ∈ D} f(x)

• Simple regret: After T rounds, report a point x and incur regret

  r(x) = f(x*) − f(x)

• Cumulative regret: After T rounds, the total regret incurred is

  R_T = Σ_{t=1}^T r(x_t)
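A tiny numeric example distinguishing the two notions; the function values and query sequence here are made up purely for illustration.

f = {0.0: 1.0, 0.25: 2.5, 0.5: 4.0, 0.75: 3.0, 1.0: 0.5}  # assumed toy values
f_star = max(f.values())                    # f(x*) = 4.0
queries = [0.0, 0.25, 0.5, 0.5, 0.5]        # x_1, ..., x_T with T = 5

instant = [f_star - f[x] for x in queries]  # r(x_t) for each round
R_T = sum(instant)                          # cumulative regret: 3.0 + 1.5 = 4.5
simple = f_star - f[queries[-1]]            # report x = x_T: simple regret 0.0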
Exercise
Question. In the following applications, are we more likely to care about simple regret¹ or cumulative regret²?

1. f(x) = performance of a deep neural network with hyperparameters x (e.g., x_1 = dropout rate, x_2 = learning rate, x_3 = number of layers, etc.)
2. f(x) = number of times a set of users clicked on the advertisement x shown on their Facebook page

¹ Suboptimality of the final reported point x
² Sum of suboptimality across all selected points x_1, ..., x_T
Exploration vs. Exploitation
• Black-box optimization algorithms naturally trade off two competing goals:
  – Exploration: Explore uncertain regions of space to learn more about them
    – Uncertain regions may contain very high values
  – Exploitation: If we already know some "good" regions:
    – Sampling there will incur less cumulative regret
    – Even if we are interested in simple regret, we might want to sample nearby to attain local improvements – this is important when we get close to the true maximum
Note:
To find the best x, we need to explore uncertain regions a sufficient amount.
Smoothness: Enter Gaussian Processes
Figure: Functions sampled from GP(0, k_SE) and from GP(0, k_Matérn)

• Model f by a Gaussian process

  f(·) ~ GP(μ(·), k(·, ·)),

  where μ(x) = E[f(x)] and k(x, x') = Cov[f(x), f(x')].

• Covariance specified by a kernel, e.g.,
  – Squared exponential (SE): k_SE(x, x') = exp(−‖x − x'‖² / (2ℓ²))
  – Matérn: k_Mat(x, x') = (2^{1−ν} / Γ(ν)) (√(2ν)‖x − x'‖ / ℓ)^ν K_ν(√(2ν)‖x − x'‖ / ℓ), where K_ν is the modified Bessel function
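A sketch of the two kernels as functions of the distance r = ‖x − x'‖; the default lengthscale ℓ and smoothness ν are arbitrary choices for illustration, and kv is SciPy's modified Bessel function K_ν.

import numpy as np
from scipy.special import gamma, kv   # kv(nu, z) is the modified Bessel K_nu

def k_se(r, ell=1.0):
    # Squared exponential: exp(-r^2 / (2 ell^2))
    return np.exp(-np.asarray(r, dtype=float) ** 2 / (2 * ell ** 2))

def k_matern(r, ell=1.0, nu=2.5):
    # Matern: (2^{1-nu} / Gamma(nu)) * s^nu * K_nu(s), with s = sqrt(2 nu) r / ell
    s = np.sqrt(2 * nu) * np.atleast_1d(np.asarray(r, dtype=float)) / ell
    out = np.ones_like(s)             # limit as r -> 0 is k(x, x) = 1
    pos = s > 0
    out[pos] = (2 ** (1 - nu) / gamma(nu)) * s[pos] ** nu * kv(nu, s[pos])
    return out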
Making use of Confidence Bounds
• A confidence region can be thought of as a region in which we believe the true function lies with high probability
  – Highest part of region: Upper confidence bound
  – Lowest part of region: Lower confidence bound
• Example from [de Freitas, 2013 GP lecture]
  – According to the confidence bounds, the maximizer should be in the white region
• How to construct the confidence bounds: We will see shortly
Note:
Confidence bounds allow us to rule out large regions of the input space as being suboptimal.
A Template for Sequentially Choosing x_t Based on {(x_{t'}, y_{t'})}_{t' < t}

1: for t = 1, 2, ..., T do
2:   choose new x_t by optimizing an acquisition function α(·):

       x_t ∈ arg max_{x ∈ D} α(x; D_{t−1}),

     where D_{t−1} is the data collected up to time t − 1
3:   query objective function f to obtain y_t = f(x_t) + z_t
4:   augment data D_t = D_{t−1} ∪ {(x_t, y_t)}
5:   update the GP model
6: end for
7: make final recommendation x
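A minimal sketch of this template over a finite candidate grid. The random first query and noise level are assumptions, and the acquisition rules on the following slides (UCB, EI, TS) plug in as the acquisition argument; those in turn reuse the assumed gp_posterior helper from the earlier sketch.

import numpy as np

def bayes_opt(f, candidates, acquisition, T, noise_std=0.1, seed=0):
    # Generic loop: acquisition(x_obs, y_obs, candidates) returns one score
    # per candidate; query the maximizer, then augment the data.
    rng = np.random.default_rng(seed)
    x0 = candidates[rng.integers(len(candidates))]   # random first query
    xs, ys = [x0], [f(x0) + noise_std * rng.standard_normal()]
    for t in range(1, T):
        scores = acquisition(np.array(xs), np.array(ys), candidates)
        x_t = candidates[int(np.argmax(scores))]     # maximize alpha(x; D_{t-1})
        xs.append(x_t)
        ys.append(f(x_t) + noise_std * rng.standard_normal())
    return xs[int(np.argmax(ys))]                    # final recommendation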
Upper Confidence Bound Algorithm
• Upper Confidence Bound (GP-UCB): [Srinivas et al., 2012]

  x_t = arg max_{x ∈ D} μ_{t−1}(x) + √β_t σ_{t−1}(x)

[Figure: GP-UCB illustration from M. Seeger's slides ("Gaussian Process Bandit Optimization"): model f by a Gaussian process and maximize the posterior upper confidence level, instead of the mean or variance alone; GP-UCB algorithm from Srinivas et al., arXiv:0912.3995]

• Intuition: Optimism in the face of uncertainty
  – We are obviously interested in high-mean points
  – At the same time, high-variance points have more chance of being much higher than the mean, so we should try to learn more about them
  – In other words, exploration vs. exploitation
  – Increasing β_t leads to more exploration
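The GP-UCB rule as an acquisition function for the loop sketched earlier; the fixed β here is a simplification (the theory lets β_t grow slowly with t), and gp_posterior is the assumed helper from the earlier sketch.

import numpy as np

def ucb_acquisition(x_obs, y_obs, candidates, beta=4.0):
    # alpha(x) = mu_{t-1}(x) + sqrt(beta) * sigma_{t-1}(x)
    mu, var = gp_posterior(x_obs, y_obs, candidates)
    return mu + np.sqrt(beta) * np.sqrt(np.maximum(var, 0.0))

# e.g.: bayes_opt(lambda x: np.sin(6 * x), np.linspace(0, 1, 200),
#                 ucb_acquisition, T=30)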
Note:
If we are optimistic in the face of uncertainty, we will naturally explore uncertain regions.
Expected Improvement Algorithm
• Expected Improvement (EI): [Mockus et al., 1978]

  x_t = arg max_{x ∈ D} E_{t−1}[(f(x) − ξ_{t−1}) 1{f(x) > ξ_{t−1}}]

  where
  – ξ_{t−1} is the highest observation up to time t − 1
  – E_{t−1}[·] is the average conditioned on the previous observations (i.e., the current GP posterior)

• Intuition:
  – We want to improve on the best value ξ_t found so far
  – If we sample x, the improvement over ξ_t is (f(x) − ξ_t) 1{f(x) > ξ_t}
  – Try to optimize this on average

• Advantage: No parameter tuning (unlike GP-UCB) – but sometimes ξ_t can be chosen differently to the above, especially in noisy settings; then it becomes a parameter.
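Under a Gaussian posterior with mean μ and standard deviation σ at x, the expectation above has the standard closed form EI(x) = (μ − ξ)Φ(z) + σφ(z) with z = (μ − ξ)/σ. A sketch as an acquisition function for the earlier loop, again reusing the assumed gp_posterior helper:

import numpy as np
from scipy.stats import norm

def ei_acquisition(x_obs, y_obs, candidates):
    # Closed form of E[(f(x) - xi) 1{f(x) > xi}] for a Gaussian posterior
    mu, var = gp_posterior(x_obs, y_obs, candidates)
    sigma = np.sqrt(np.maximum(var, 1e-12))
    xi = np.max(y_obs)                    # highest observation so far
    z = (mu - xi) / sigma
    return (mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)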
Thompson Sampling Algorithm
• Thompson sampling (TS): [Thompson, 1933]

  x_t = arg max_{x ∈ D} f̃(x)

  where f̃ is a random sample from the current posterior distribution

• Intuition:
  – Samples a point randomly according to the probability of it being optimal
  – This implicitly balances high mean and high variance, like the previous algorithms

• Advantage: Versatile (applies to general Bayesian settings beyond GPs); no parameter tuning as stated (but parameters may arise via the choice of prior/kernel)
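A sketch of one TS step on a finite candidate grid: draw a joint sample of f from the posterior over the candidates and score each point by its sampled value. The se_kernel helper and the noise level come from the earlier assumed sketch, and the jitter term is a standard numerical-stability trick.

import numpy as np

def thompson_acquisition(x_obs, y_obs, candidates, noise_var=0.01, seed=None):
    rng = np.random.default_rng(seed)
    K = se_kernel(x_obs, x_obs)                  # K_t
    k_q = se_kernel(x_obs, candidates)
    w = np.linalg.solve(K + noise_var * np.eye(len(x_obs)), k_q)
    mu = w.T @ y_obs                             # posterior mean on the grid
    cov = se_kernel(candidates, candidates) - k_q.T @ w
    cov += 1e-8 * np.eye(len(candidates))        # jitter for stability
    return rng.multivariate_normal(mu, cov)      # one sampled function

Maximizing this sampled function is exactly the TS choice, so it plugs into the earlier loop unchanged.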
Experimental Example 1
• Example plots of average regret from [Srinivas et al., 2012], for the GP-UCB rule

  x_t = arg max_{x ∈ D} μ_{t−1}(x) + √β_t σ_{t−1}(x)

  – (Left) Function drawn from a GP (synthetic)
  – (Middle) Activating sensors to find the highest temperature location in a room
  – (Right) Activating sensors to find the most congested region of a highway
Experimental Example 2
• Another example experiment from [Metzen, 2016]
  – Optimizing the parameters for a robot control task (ball throwing)
• Performance plots from [Metzen, 2016]:
  – We will see Entropy Search (ES) and Minimum Regret Search (MRS) next lecture
Note:
In practice, the most popular acquisition functions (UCB, EI, Thompson, etc.) tend to perform fairly similarly (at least in standard settings).

The choice of kernel (or class of kernels) impacts the performance more.
A Guarantee on the Cumulative Regret (I)
• The true optimum:

  x* ∈ arg max_{x ∈ D} f(x)

• Cumulative regret:

  R_T = Σ_{t=1}^T (f(x*) − f(x_t))
A Guarantee on the Cumulative Regret (II)
• Guarantee for GP-UCB [Srinivas et al., 2012]:

  R_T ≤ O(√(T γ_T)),   γ_T = max_{X ⊆ D : |X| = T} I(f; y_X),

  where y_X denotes the collection of noisy observations upon querying the points in X, and I(·; ·) denotes the mutual information (from information theory).

• Interpretation: γ_T is a kernel-dependent term capturing the difficulty of optimizing functions drawn according to that kernel
  – In particular, the regret is usually sub-linear in T
  – Specifically, it is typically upper bounded by something between O(√T) and o(T), depending on the kernel
  – e.g., O(√(T (log T)^{2d})) for the SE kernel, O(T^c) for some c ∈ (1/2, 1) for Matérn

• Similar guarantees exist for Thompson sampling [Russo and Van Roy, 2014]
A Commonly-Heard Quote
• Commonly-heard quote:

  "In theory, there is no difference between theory and practice. In practice, there is."

• Corrected quote:

  "In theory, there is no difference between theory and practice. In practice, there is."

• Limitations here: (i) Large constant/logarithmic factors that matter in practice; (ii) Perfect knowledge of the kernel is assumed
Proof Outline 1
• Consider the case that D is finite (continuous domains such as D = [0, 1]^d can be handled with a bit of extra effort)

• Key claim 1: If β_t = 2 log(|D| π² t² / (6δ)), then with probability at least 1 − δ:

  |f(x) − μ_{t−1}(x)| ≤ β_t^{1/2} σ_{t−1}(x),  ∀x ∈ D, t ≥ 1.

  That is, μ_{t−1}(x) ± β_t^{1/2} σ_{t−1}(x) gives valid confidence bounds on f(x).

  – Intuition: Random variables are within a few standard deviations of the mean, with high probability
  – Proof outline:
    – Recall each observation has N(0, σ²) noise.
    – By a Gaussian tail bound, the probability of violation for a single (x, t) is at most e^{−β_t/2}.
    – Apply the union bound over x ∈ D and all integers t; the choice of β_t gives e^{−β_t/2} = 6δ/(|D|π²t²), and we can apply Σ_{t=1}^∞ 1/t² = π²/6.
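Summing the per-(x, t) failure probabilities confirms the union bound calculation: the total failure probability is at most

\sum_{x \in D} \sum_{t=1}^{\infty} e^{-\beta_t/2}
  = \sum_{x \in D} \sum_{t=1}^{\infty} \frac{6\delta}{|D| \pi^2 t^2}
  = |D| \cdot \frac{6\delta}{|D| \pi^2} \cdot \frac{\pi^2}{6}
  = \delta.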
Proof Outline 2
• Key claim 2: Assuming valid confidence bounds, the instant regret at time t satisfies r_t ≤ 2 β_t^{1/2} σ_{t−1}(x_t)
  – Instant regret: r_t = f(x*) − f(x_t)
  – Proof:

    r_t = f(x*) − f(x_t)
        ≤ μ_{t−1}(x*) + β_t^{1/2} σ_{t−1}(x*) − f(x_t)   (by confidence bound)
        ≤ μ_{t−1}(x_t) + β_t^{1/2} σ_{t−1}(x_t) − f(x_t)  (by UCB rule)
        ≤ 2 β_t^{1/2} σ_{t−1}(x_t)                        (by confidence bound)
Proof Outline 3

• Wrapping up:
  – Sum the established bound r_t ≤ 2 β_t^{1/2} σ_{t−1}(x_t) over t = 1, ..., T to get

    R_T ≤ 2 β_T^{1/2} Σ_{t=1}^T σ_{t−1}(x_t),

    where we applied β_t ≤ β_T
  – By the Cauchy-Schwarz inequality (or Jensen's inequality, or the ℓ1/ℓ2-norm relation),

    Σ_{t=1}^T σ_{t−1}(x_t) ≤ √( T Σ_{t=1}^T σ²_{t−1}(x_t) )

  – Some simple (but less obvious) manipulations give

    Σ_{t=1}^T σ²_{t−1}(x_t) ≤ C₁ γ_T

    for a suitable constant C₁
    – This is shown by writing the mutual information (see the definition of γ_T) in terms of the values σ_{t−1}, using a well-known formula for the mutual information between Gaussians
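Chaining the three steps gives the regret bound stated earlier (with the β_T factor hidden in the O(·) notation there):

R_T \le 2 \beta_T^{1/2} \sum_{t=1}^{T} \sigma_{t-1}(x_t)
    \le 2 \beta_T^{1/2} \sqrt{T \sum_{t=1}^{T} \sigma_{t-1}^2(x_t)}
    \le 2 \sqrt{C_1 \beta_T T \gamma_T}.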
Proof Outline 4
• Bounds on γ_T for SE and Matérn kernels:
  – Much more complicated; see [Srinivas et al., 2012]
  – Requires careful bounding of kernel eigenvalues (a natural generalization of matrix eigenvalues to function spaces)
Advantages and Disadvantages of Bayesian Optimization
• There are also other methods for black-box function optimization, a prominent example being evolutionary methods

• Advantages of BO:
  – Rich GP modeling framework
  – Very efficient in terms of #points queried
  – Gives principled estimates of uncertainty

• Potential disadvantages of BO:
  – Can be computationally expensive (naively O(n³))
  – Often only suited to low #inputs (e.g., d < 10)
  – Choosing a good kernel can be very difficult
  – Standard algorithms don't immediately parallelize

There is plenty of research addressing all of these disadvantages!
Note:
• Bayesian optimization (with GPs) is great for optimizing low-dimensional smooth functions whose samples are expensive
• (and promising, but often a work in progress, for other settings)
Useful Programming Packages
• Useful libraries:
  – Python packages (some with other methods beyond GPs):
    – GPy and GPyOpt
    – Spearmint
    – BayesianOptimization
    – PyBo
    – HyperOpt
    – MOE
  – Packages for other languages:
    – GPML for MATLAB
    – GPFit and rBayesianOptimization for R
Ongoing Challenges I – Scaling Up the Number of Samples
• Computing the posterior after sampling n points (x_1, ..., x_n) typically requires O(n³) computation
  – Problematic if n becomes moderate to large in size
  – An active research direction has been approximating the posterior efficiently

• Approach 1: Approximate by a GP with fewer points (see the sketch after this list)
  – With k ≪ n such points, O(k³) computation may be much more tolerable
  – Illustration from [Bauer et al., 2016]

• Approach 2: Approximate the posterior directly using a deep neural network
  – Can reduce the dependence from n³ to just n (linear time!)
  – See, e.g., [Huang et al., 2015]
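A deliberately crude sketch of the spirit of Approach 1: condition on a random size-k subset of the data instead of all n points, reusing the assumed gp_posterior helper from the earlier sketch. Proper sparse-GP methods choose pseudo-points far more carefully (e.g., via variational inducing points).

import numpy as np

def subset_posterior(x_obs, y_obs, x_query, k=50, seed=0):
    # O(k^3) instead of O(n^3): keep only a random subset of the observations
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_obs), size=min(k, len(x_obs)), replace=False)
    return gp_posterior(x_obs[idx], y_obs[idx], x_query)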
Ongoing Challenges II – Scaling to High Dimensions
• Typical applications of BO consist of optimizing at most 10 or so variables, e.g.,

  f(x) = f(x_1, x_2, x_3, x_4, x_5).

• High-dimensional setting: Tens/hundreds/thousands/millions of variables, e.g.,

  f(x) = f(x_1, x_2, ..., x_999, x_1000).

• To circumvent the curse of dimensionality, assume low-dimensional structure:
  – Example 1. Only a few variables impact f significantly:

    f(x_1, x_2, ..., x_1000) ≈ g(x_4, x_32, x_99, x_256, x_965)

    (e.g., see REMBO [Wang et al., 2013])
  – Example 2. Additive structure:

    f(x_1, x_2, ..., x_10) = g(x_1, x_2) + h(x_3, x_4, x_5) + u(x_6, x_7, x_8) + v(x_9, x_10)

    (e.g., see Add-GP-UCB [Kandasamy et al., 2016])
Ongoing Challenges II – Scaling to High Dimensions (Cont.)

• Let's look more at the first example:

  f(x_1, x_2, ..., x_1000) ≈ g(x_4, x_32, x_99, x_256, x_965)

  – Challenge: In advance, we don't know exactly which (few) variables f depends on

• Embedding approach: Instead of optimizing the full function f(x) directly with x ∈ R^1000, optimize f(Az) over z ∈ R^10 (for example), where A is a 1000 × 10 matrix
  – Simplest choice: Generate A randomly (see the sketch below)
  – More sophisticated: Try to learn a good choice of A from data

• Illustration from [Wang et al., 2013]
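A sketch of the embedding approach with the simplest (random) choice of A; the dimensions and Gaussian entries are illustrative assumptions.

import numpy as np

def embedded_objective(f, d_high=1000, d_low=10, seed=0):
    # Optimize g(z) = f(A z) over the low-dimensional z instead of x directly
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d_high, d_low))   # random embedding matrix
    return lambda z: f(A @ z)

# Any low-dimensional BO routine (e.g., the bayes_opt sketch) can then be
# run on the returned objective over z in R^{d_low}.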
Ongoing Challenges III – Robustness Considerations
• In practice, any GP modeling assumption is only an approximation
• Minor deviations from the model can be treated as noise
• Rare but large deviations from the model should be treated as outliers
• To avoid degradation, one approach is to identify and remove such points
  – Illustration from [Martinez-Cantin et al., 2017]
• More recent variations on robustness:
  – Adversarial perturbations of the final point returned [Bogunovic et al., 2018]
  – Adversarial perturbations of the sampled points [Bogunovic et al., 2020]
References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.

[2] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Int. Conf. Art. Intel. Stats. (AISTATS), 2020.

[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. http://arxiv.org/abs/1012.2599, 2010.

[4] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[5] Daniel J. Lizotte, Tao Wang, Michael H. Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.

[6] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. arXiv preprint arXiv:1712.04567, 2017.

[7] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. In Int. Conf. Mach. Learn. (ICML), 2016.

[8] J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Vol. 2, 1978.

[9] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. http://arxiv.org/abs/1403.5341, 2014.

[10] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.

[11] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.

[12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Adv. Neur. Inf. Proc. Sys., 2012.

[13] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory, 58(5):3250–3265, May 2012.

[14] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[15] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio de Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proc. ACM Conf. Rec. Sys., pages 225–232, 2014.

[16] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas. Bayesian optimization in high dimensions via random embeddings. In Int. Joint. Conf. Art. Int., 2013.