
Page 1: Gaussian Process Methods in Machine Learning

Gaussian Process Methods in Machine Learning

Jonathan Scarlett ([email protected])

Lecture 2: Optimization with Gaussian Processes

CS6216, Semester 1, AY2021/22

Page 2: Gaussian Process Methods in Machine Learning

Outline of Lectures

• Lecture 0: Bayesian Modeling and Regression

• Lecture 1: Gaussian Processes, Kernels, and Regression

• Lecture 2: Optimization with Gaussian Processes

• Lecture 3: Advanced Bayesian Optimization Methods

• Lecture 4: GP Methods in Non-Bayesian Settings

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 2/ 46

Page 3: Gaussian Process Methods in Machine Learning

Outline: This Lecture

• This lecture:
  1. Black-box function optimization
  2. Gaussian processes
  3. Bayesian optimization algorithms
  4. Regret bounds
  5. Ongoing challenges in Bayesian optimization

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 3/ 46

Page 4: Gaussian Process Methods in Machine Learning

Optimization (I)

• So far, we have talked about learning the entire input-output relation, e.g., y ≈ f(x)

• A potentially easier problem is optimization, where we seek to find x maximizing f:

  x^\star \in \arg\max_{x \in D} f(x)

  - It is potentially easier because we typically don't need to learn f accurately for inputs x that are far from x^\star

• In this context, instead of simply having access to a data set D = {(x_t, y_t)}_{t=1}^{n}, we are able to adaptively query x_t based on the past samples y_1, . . . , y_{t-1} (a form of active learning)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 4/ 46

Page 5: Gaussian Process Methods in Machine Learning

Optimization (II)

• A more familiar optimization setting:
  - Function f is known
  - Algorithm can use derivatives (e.g., gradient descent/ascent)
  - Main bottleneck is computation

• What we will consider:
  - Function f is unknown
  - Algorithm only has access to "black-box" function queries
  - Main bottleneck is the high cost of such queries

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 5/ 46


Page 7: Gaussian Process Methods in Machine Learning

Optimization (III)

Black-box function optimization:

  x^\star \in \arg\max_{x \in D \subseteq \mathbb{R}^d} f(x)

• Setting:
  - Unknown "reward" function f
  - Expensive evaluations of f
  - Noisy evaluations: y_t = f(x_t) + z_t, where z_t \sim N(0, \sigma^2)
  - Choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}

• Note: This problem with a GP approach is often called Bayesian optimization

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 6/ 46


Page 9: Gaussian Process Methods in Machine Learning

The Bayesian Mechanics

[Figure: mean predictions ± 3 standard deviations of the GP posterior]

• Processing noisy observations in the Bayesian setting:

  y_t = f(x_t) + z_t, where z_t \sim N(0, \sigma^2)

  - Given samples y_t = [y_1, . . . , y_t] at points x_{1:t}, the posterior is also a GP:

    \mu_{t+1}(x) = k_t(x)^\top (K_t + \sigma^2 I_t)^{-1} y_t
    \sigma_{t+1}^2(x) = k(x, x) - k_t(x)^\top (K_t + \sigma^2 I_t)^{-1} k_t(x),

    where K_t = [k(x, x')]_{x, x' \in x_{1:t}} and k_t(x) = [k(x_i, x)]_{i=1}^{t}

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 7/ 46
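
To make the update concrete, here is a minimal NumPy sketch of these formulas, assuming a squared-exponential kernel; the function names, lengthscale, and noise level are illustrative choices, not from the lecture.

```python
import numpy as np

def k_se(A, B, ell=0.2):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * ell ** 2))

def gp_posterior(x_train, y_train, x_query, noise_var=0.01, ell=0.2):
    """Posterior mean/variance at x_query given noisy samples, as on the slide."""
    K_t = k_se(x_train, x_train, ell)            # K_t = [k(x, x')]
    k_q = k_se(x_train, x_query, ell)            # column j is k_t(x_query[j])
    M = K_t + noise_var * np.eye(len(x_train))   # K_t + sigma^2 I_t
    alpha = np.linalg.solve(M, y_train)          # (K_t + sigma^2 I_t)^{-1} y_t
    mean = k_q.T @ alpha
    V = np.linalg.solve(M, k_q)                  # (K_t + sigma^2 I_t)^{-1} k_t(x)
    var = k_se(x_query, x_query, ell).diagonal() - np.sum(k_q * V, axis=0)
    return mean, np.maximum(var, 0.0)            # clip tiny negative values

# Tiny usage example on a 1-D toy problem
rng = np.random.default_rng(0)
f = lambda x: np.sin(6 * x[:, 0])
x_train = rng.uniform(0, 1, size=(5, 1))
y_train = f(x_train) + 0.1 * rng.standard_normal(5)
x_query = np.linspace(0, 1, 101)[:, None]
mu, var = gp_posterior(x_train, y_train, x_query)
```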


Page 19: Gaussian Process Methods in Machine Learning

Exercise

Exercise. Let D = [0, 1], so we are looking for x^\star = \arg\max_{x \in [0,1]} f(x).
1. Draw an f which will be extremely difficult to optimize based on expensive evaluations f(x_t) (even if they are noiseless)
2. Draw an f where such optimization is relatively easy

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 8/ 46

Page 20: Gaussian Process Methods in Machine Learning

Challenge: Optimizing in the Dark

There are (infinitely) many functions consistent with a finite number of samples

[Figure: multiple functions, all consistent with the same finite set of sampled points]

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 9/ 46


Page 24: Gaussian Process Methods in Machine Learning

Relevance

Black-box function optimization:

  x^\star \in \arg\max_{x \in D \subseteq \mathbb{R}^d} f(x)

• Applications in a vast number of domains...
  - hyperparameter tuning for learning algorithms [Snoek et al., 2012]
  - environmental monitoring and sensor networks [Srinivas et al., 2012]
  - recommender systems and advertising [Vanchinathan et al., 2014]
  - robotics [Lizotte et al., 2007]
  - molecule discovery [Gómez-Bombarelli et al., 2018]
  - ...and many more...

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 10/ 46

Page 25: Gaussian Process Methods in Machine Learning

Example 1: Parameter Tuning

• A typical example is parameter tuning
  - Robotics
  - Deep neural networks
  - . . .

[Figures: performance as a function of Parameter 1 and Parameter 2]

Image source: (i) https://www.youtube.com/watch?v=x90UjisDPjM, (ii) https://www.ez-robot.com/,

(iii) https://wp.wwu.edu/machinelearning/2017/02/12/deep-neural-networks/

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 11/ 46

Page 26: Gaussian Process Methods in Machine Learning

Example 2: Molecule Discovery

• To give an idea of a "less standard" application of black-box optimization, here is a recent example from [Gómez-Bombarelli et al., 2018]

• Molecular design:
  - Represent molecule structures in some feature space
  - Explore that space to discover configurations with desirable properties (e.g., elasticity, power conversion properties)
  - Kernel encodes similarities between different configurations

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 12/ 46

Page 27: Gaussian Process Methods in Machine Learning

Impact

• Impact of GP-based black-box optimization: Test-of-time award at ICML 2020

• Famous application: Hyperparameter tuning in AlphaGo

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 13/ 46


Page 29: Gaussian Process Methods in Machine Learning

Towards Practical Solutions

A black-box function optimization problem:

  x^\star \in \arg\max_{x \in D \subseteq \mathbb{R}^d} f(x)

• The above problem is hard in general

• Sequentially choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}
  - Need a meaningful "success" metric: Regret
  - Need additional assumptions: Smoothness

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 14/ 46

Page 30: Gaussian Process Methods in Machine Learning

Simple and Cumulative Regret

• The true optimal point:

  x^\star \in \arg\max_{x \in D} f(x)

• Simple regret: After T rounds, report a point x and incur regret

  r(x) = f(x^\star) - f(x)

• Cumulative regret: After T rounds, the total regret incurred is

  R_T = \sum_{t=1}^{T} r(x_t)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 15/ 46
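
As a toy illustration (not from the slides) of how the two notions differ numerically, assuming a made-up objective, query sequence, and reported point:

```python
import numpy as np

f = lambda x: -(x - 0.3) ** 2        # toy objective on [0, 1], maximized at x* = 0.3
f_star = f(0.3)

queries = np.array([0.9, 0.7, 0.45, 0.32, 0.31])   # hypothetical points chosen over T = 5 rounds
reported = 0.31                                    # hypothetical final recommendation

instant_regret = f_star - f(queries)       # r(x_t) for each round
cumulative_regret = instant_regret.sum()   # R_T = sum_t r(x_t)
simple_regret = f_star - f(reported)       # r(x) for the reported point
print(cumulative_regret, simple_regret)
```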

Page 31: Gaussian Process Methods in Machine Learning

Exercise

Question. In the following applications, are we more likely to care about simple regret^1 or cumulative regret^2?

1. f(x) = performance of a deep neural network with hyperparameters x (e.g., x_1 = dropout rate, x_2 = learning rate, x_3 = number of layers, etc.)

2. f(x) = number of times a set of users clicked on the advertisement x shown on their Facebook page

^1 Suboptimality of the final reported point x
^2 Sum of suboptimality across all selected points x_1, . . . , x_T

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 16/ 46

Page 32: Gaussian Process Methods in Machine Learning

Exploration vs. Exploitation

• Black-box optimization algorithms naturally trade off two competing goals:

  - Exploration: Explore uncertain regions of space to learn more about them
    - Uncertain regions may contain very high values
  - Exploitation: If we already know some "good" regions:
    - Sampling there will incur less cumulative regret
    - Even if we are interested in simple regret, we might want to sample nearby to attain local improvements; this is important when we get close to the true maximum

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 17/ 46

Page 33: Gaussian Process Methods in Machine Learning

Note:

To find the best x, we need to explore uncertain regions a sufficient amount

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 18/ 46

Page 34: Gaussian Process Methods in Machine Learning

Smoothness: Enter Gaussian Processes

Figure: Functions sampled from GP(0, k_SE) and from GP(0, k_Matérn)

• Model f by a Gaussian process:

  f(\cdot) \sim GP(\mu(\cdot), k(\cdot, \cdot)),

  where \mu(x) = E[f(x)] and k(x, x') = Cov[f(x), f(x')].

• Covariance specified by a kernel, e.g.,
  - Squared exponential (SE): k_{SE}(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\ell^2} \right)
  - Matérn: k_{Mat}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\|x - x'\|}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,\|x - x'\|}{\ell} \right),

    where K_\nu is the modified Bessel function of the second kind

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 19/ 46
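
Below is a small Python sketch of these two kernels as functions of the distance r = ‖x − x′‖ (my own naming and default hyperparameters; the Matérn form uses SciPy's modified Bessel function kv):

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function of the second kind

def k_se(r, ell=1.0):
    """Squared-exponential kernel as a function of the distance r = ||x - x'||."""
    return np.exp(-np.asarray(r, dtype=float) ** 2 / (2.0 * ell ** 2))

def k_matern(r, ell=1.0, nu=2.5):
    """Matern kernel as a function of the distance r = ||x - x'||."""
    r = np.maximum(np.asarray(r, dtype=float), 1e-12)   # avoid 0 * inf at r = 0
    s = np.sqrt(2.0 * nu) * r / ell
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (s ** nu) * kv(nu, s)

# Both kernels are (approximately) 1 at r -> 0 and decay with distance;
# smaller nu gives rougher sample paths, and nu -> infinity recovers the SE kernel.
print(k_se(0.5), k_matern(0.5, nu=0.5), k_matern(0.5, nu=2.5))
```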


Page 36: Gaussian Process Methods in Machine Learning

Making use of Confidence Bounds

• A confidence region can be thought of as a region in which we believe the true function lies with high probability
  - Highest part of region: Upper confidence bound
  - Lowest part of region: Lower confidence bound

• Example from [de Freitas, 2013 GP lecture]
  - According to the confidence bounds, the maximizer should be in the white region

• How to construct the confidence bounds: We will see shortly

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 20/ 46


Page 38: Gaussian Process Methods in Machine Learning

Note:

Confidence bounds allow us to rule out large regions of the input as being suboptimal

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 21/ 46

Page 39: Gaussian Process Methods in Machine Learning

A Template for Sequentially Choosing x_t Based on {(x_{t'}, y_{t'})}_{t' < t}

1: for t = 1, 2, . . . , T do
2:   choose the new x_t by optimizing an acquisition function α(·):

       x_t \in \arg\max_{x \in D} \alpha(x; D_{t-1}),

     where D_{t-1} is the data collected up to time t − 1
3:   query the objective function f to obtain y_t = f(x_t) + z_t
4:   augment the data: D_t = D_{t-1} ∪ {(x_t, y_t)}
5:   update the GP model
6: end for
7: make the final recommendation x

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 22/ 46
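
A minimal Python sketch of this template, using scikit-learn's GP regressor and a finite grid of candidates in place of the inner maximization over D; the acquisition function α is passed in as an argument, and all names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayes_opt(f_noisy, candidates, acquisition, T=20, noise_var=0.01, seed=0):
    """Generic BO loop: maximize an acquisition over candidate points, query, refit."""
    rng = np.random.default_rng(seed)
    init = candidates[rng.choice(len(candidates), size=2, replace=False)]
    X = [x for x in init]                     # arbitrary initial queries
    y = [f_noisy(x) for x in init]
    for t in range(len(X), T):
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=noise_var)
        gp.fit(np.array(X), np.array(y))                       # update the GP model
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(acquisition(mu, sigma, np.array(y), t))]
        X.append(x_next)
        y.append(f_noisy(x_next))                              # y_t = f(x_t) + z_t
    return np.array(X), np.array(y)

# Example black-box objective with Gaussian observation noise
rng = np.random.default_rng(1)
f = lambda x: np.sin(6 * x[0]) * x[0]
f_noisy = lambda x: f(x) + 0.1 * rng.standard_normal()
candidates = np.linspace(0, 1, 200)[:, None]

ucb = lambda mu, sigma, y_seen, t: mu + 2.0 * sigma            # simple UCB acquisition
X, y = bayes_opt(f_noisy, candidates, ucb)
```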

Page 40: Gaussian Process Methods in Machine Learning

Upper Confidence Bound Algorithm

• Upper Confidence Bound (GP-UCB): [Srinivas et al., 2012]

  x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)

[Figure (from Srinivas et al.): GP-UCB illustrated on a 1-D posterior — model f by a Gaussian process, and maximize the posterior upper confidence level instead of the mean or variance alone]

• Intuition: Optimism in the face of uncertainty
  - We are obviously interested in high-mean points
  - At the same time, high-variance points have more chance of being much higher than the mean, so we should try to learn more about them
  - In other words, exploration vs. exploitation
  - Increasing β_t leads to more exploration

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 23/ 46
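
Plugging GP-UCB into the loop sketched after the template slide only requires the acquisition below; the β_t schedule shown is the theoretical choice appearing later in the proof outline (with |D| candidates and confidence parameter δ), though in practice a small constant is often used. This is a sketch with illustrative names, not the reference implementation from [Srinivas et al., 2012].

```python
import numpy as np

def gp_ucb(mu, sigma, y_seen, t, num_candidates=200, delta=0.1):
    """GP-UCB acquisition: mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)."""
    beta_t = 2.0 * np.log(num_candidates * (np.pi ** 2) * (t ** 2) / (6.0 * delta))
    return mu + np.sqrt(beta_t) * sigma

# e.g. X, y = bayes_opt(f_noisy, candidates, gp_ucb)   # using the earlier loop sketch
```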


Page 42: Gaussian Process Methods in Machine Learning

Note:

If we are optimistic in the face of uncertainty, we will naturally explore uncertain regions

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 24/ 46

Page 43: Gaussian Process Methods in Machine Learning

Expected Improvement Algorithm

• Expected Improvement (EI): [Mockus et al., 1978]

  x_t = \arg\max_{x \in D} \; \mathbb{E}_{t-1}\big[ (f(x) - \xi_{t-1}) \mathbb{1}\{f(x) > \xi_{t-1}\} \big],

  where
  - \xi_{t-1} is the highest observation up to time t − 1
  - \mathbb{E}_{t-1}[\cdot] is the average conditioned on the previous observations (i.e., the current GP posterior)

• Intuition:
  - We want to improve on the best value \xi_t found so far
  - If we sample x, the improvement over \xi_t is (f(x) - \xi_t) \mathbb{1}\{f(x) > \xi_t\}
  - Try to optimize this on average

• Advantage: No parameter tuning (unlike GP-UCB). However, \xi_t is sometimes chosen differently from the above, especially in noisy settings, in which case it becomes a parameter.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 25/ 46
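
Under the Gaussian posterior with mean μ and standard deviation s at x, the expectation above has the well-known closed form (μ − ξ)Φ(z) + s·φ(z) with z = (μ − ξ)/s. A short sketch (illustrative names, compatible with the earlier loop sketch):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_seen, t):
    """EI acquisition with xi = best observation so far."""
    xi = np.max(y_seen)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (mu - xi) / sigma
    return (mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# e.g. X, y = bayes_opt(f_noisy, candidates, expected_improvement)
```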


Page 45: Gaussian Process Methods in Machine Learning

Thompson Sampling Algorithm

• Thompson sampling (TS): [Thompson, 1933]

  x_t = \arg\max_{x \in D} f(x),

  where f is a random sample from the current posterior distribution

• Intuition:
  - Samples a point randomly according to the probability of it being optimal
  - This implicitly balances high mean and high variance, like the previous algorithms

• Advantage: Versatile (applies to general Bayesian settings beyond GPs); no parameter tuning as stated (but it may arise via the choice of prior/kernel)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 26/ 46
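
A sketch of one Thompson-sampling round over a finite candidate set, drawing a joint posterior sample with scikit-learn's sample_y; names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def thompson_step(X_seen, y_seen, candidates, noise_var=0.01, seed=0):
    """One TS round: sample f from the posterior on the candidates, pick its maximizer."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=noise_var)
    gp.fit(X_seen, y_seen)
    f_sample = gp.sample_y(candidates, n_samples=1, random_state=seed)[:, 0]
    return candidates[np.argmax(f_sample)]

# Example: next query point given 3 past observations of a toy function
X_seen = np.array([[0.1], [0.5], [0.9]])
y_seen = np.sin(6 * X_seen[:, 0]) * X_seen[:, 0]
candidates = np.linspace(0, 1, 200)[:, None]
x_next = thompson_step(X_seen, y_seen, candidates, seed=3)
```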


Page 47: Gaussian Process Methods in Machine Learning

Experimental Example 1

• Example plots of average regret from [Srinivas et al., 2012]:

  x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)

  - (Left) Function drawn from a GP (synthetic)
  - (Middle) Activating sensors to find the highest temperature location in a room
  - (Right) Activating sensors to find the most congested region of a highway

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 27/ 46

Page 48: Gaussian Process Methods in Machine Learning

Experimental Example 2

• Another example experiment from [Metzen, 2016]
  - Optimizing the parameters for a robot control task (ball throwing)

• Performance plots from [Metzen, 2016]:
  - We will see Entropy Search (ES) and Minimum Regret Search (MRS) next lecture

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 28/ 46

Page 49: Gaussian Process Methods in Machine Learning

Note:

In practice, the most popular acquisition functions (UCB, EI, Thompson, etc.) tend to perform fairly similarly (at least in standard settings).

The choice of kernel (or class of kernels) impacts the performance more.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 29/ 46

Page 50: Gaussian Process Methods in Machine Learning

A Guarantee on the Cumulative Regret (I)

• The true optimum:

  x^\star \in \arg\max_{x \in D} f(x)

• Cumulative regret:

  R_T = \sum_{t=1}^{T} \big( f(x^\star) - f(x_t) \big)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 30/ 46

Page 51: Gaussian Process Methods in Machine Learning

A Guarantee on the Cumulative Regret (II)

• Guarantee for GP-UCB [Srinivas et al., 2012]:

  R_T \le O\big( \sqrt{T \gamma_T} \big), \qquad \gamma_T = \max_{X \subseteq D \,:\, |X| = T} I(f; y_X),

  where y_X denotes the collection of noisy observations upon querying the points in X, and I(·; ·) denotes the mutual information (from information theory).

• Interpretation: \gamma_T is a kernel-dependent term capturing the difficulty of optimizing functions drawn according to that kernel
  - In particular, the regret is usually sub-linear in T
  - Specifically, it is typically upper bounded by something between O(\sqrt{T}) and o(T), depending on the kernel
  - e.g., O\big(\sqrt{T (\log T)^{2d}}\big) for the SE kernel, and O(T^c) for some c \in (\tfrac{1}{2}, 1) for the Matérn kernel

• Similar guarantees exist for Thompson sampling [Russo and Van Roy, 2014]

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 31/ 46
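
For a GP observed with N(0, σ²) noise, the information gain of a query set X has the closed form I(f; y_X) = ½ log det(I + σ⁻² K_X), with K_X the kernel matrix of X (the Gaussian-channel formula used in [Srinivas et al., 2012]). A small numerical sketch of this quantity (γ_T itself maximizes it over X, which is typically approximated greedily); the kernel and parameters below are illustrative:

```python
import numpy as np

def info_gain(K_X, noise_var=0.01):
    """I(f; y_X) = 0.5 * log det(I + sigma^{-2} K_X) for a kernel matrix K_X."""
    T = K_X.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + K_X / noise_var)
    return 0.5 * logdet

# Example: information gain of T equally-spaced points under an SE kernel
def k_se(A, B, ell=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * ell ** 2))

X = np.linspace(0, 1, 25)[:, None]
print(info_gain(k_se(X, X)))
```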


Page 53: Gaussian Process Methods in Machine Learning

A Commonly-Heard Quote

• Commonly-heard quote:

In theory, there is no difference between theory and practice. In practice, there is.

• Corrected quote:

In theory, there is no difference between theory and practice. In practice, there is.

• Limitations here: (i) Large constant/logarithmic factors that matter in practice; (ii) Perfect knowledge of the kernel is assumed

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 32/ 46


Page 55: Gaussian Process Methods in Machine Learning

Proof Outline 1

• Consider the case that D is finite (continuous domains such as D = [0, 1]^d can be handled with a bit of extra effort)

• Key claim 1: If \beta_t = 2 \log\big( |D| \pi^2 t^2 / (6\delta) \big), then with probability at least 1 − δ:

  |f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2} \sigma_{t-1}(x), \quad \forall x \in D, \; t \ge 1.

  That is, \mu_{t-1}(x) \pm \beta_t^{1/2} \sigma_{t-1}(x) gives valid confidence bounds on f(x).

  - Intuition: Random variables are within a few standard deviations of the mean, with high probability
  - Proof outline:
    - Recall each observation has N(0, \sigma^2) noise.
    - By a Gaussian tail bound, the probability of violation for a single (x, t) is at most e^{-\beta_t/2}.
    - Apply the union bound over x \in D and all integers t; the choice of \beta_t gives e^{-\beta_t/2} = \frac{6\delta}{|D| \pi^2 t^2}, and we can apply \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6}.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 33/ 46
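
Spelling out the union-bound arithmetic in the last step (my own rendering, following the slide's choice of β_t):

```latex
\Pr\Big[\exists\, x \in D,\ t \ge 1 : |f(x) - \mu_{t-1}(x)| > \beta_t^{1/2}\sigma_{t-1}(x)\Big]
\;\le\; \sum_{t=1}^{\infty} \sum_{x \in D} e^{-\beta_t/2}
\;=\; \sum_{t=1}^{\infty} |D| \cdot \frac{6\delta}{|D|\,\pi^2 t^2}
\;=\; \frac{6\delta}{\pi^2} \sum_{t=1}^{\infty} \frac{1}{t^2}
\;=\; \delta.
```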

Page 56: Gaussian Process Methods in Machine Learning

Proof Outline 2

• Key claim 2: Assuming valid confidence bounds, the instant regret at time t satisfies r_t \le 2\beta_t^{1/2} \sigma_{t-1}(x_t)
  - Instant regret: r_t = f(x^\star) - f(x_t)
  - Proof:

    r_t = f(x^\star) - f(x_t)
        \le \beta_t^{1/2} \sigma_{t-1}(x^\star) + \mu_{t-1}(x^\star) - f(x_t)   (by the confidence bound)
        \le \beta_t^{1/2} \sigma_{t-1}(x_t) + \mu_{t-1}(x_t) - f(x_t)           (by the UCB rule)
        \le 2\beta_t^{1/2} \sigma_{t-1}(x_t)                                    (by the confidence bound)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 34/ 46

Page 57: Gaussian Process Methods in Machine Learning

Proof Outline 3

• Wrapping up:
  - Sum the established bound r_t \le 2\beta_t^{1/2} \sigma_{t-1}(x_t) over t = 1, . . . , T to get

    R_T \le 2\beta_T^{1/2} \sum_{t=1}^{T} \sigma_{t-1}(x_t),

    where we applied \beta_t \le \beta_T
  - By the Cauchy–Schwarz inequality (or Jensen's inequality, or the \ell_1/\ell_2-norm relation),

    \sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{ T \sum_{t=1}^{T} \sigma_{t-1}^2(x_t) }

  - Some simple (but less obvious) manipulations give

    \sum_{t=1}^{T} \sigma_{t-1}^2(x_t) \le C_1 \gamma_T

    for a suitable constant C_1
    - Shown by writing the mutual information (see the definition of \gamma_T) in terms of the values \sigma_{t-1}, using a well-known formula for the mutual information between Gaussians

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 35/ 46
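
For reference, the "well-known formula" is I(f; y_T) = ½ Σ_{t=1}^T log(1 + σ⁻² σ²_{t−1}(x_t)) for the selected points [Srinivas et al., 2012]; a sketch of the manipulation, assuming the common normalization k(x, x) ≤ 1:

```latex
% Using z \le \frac{a}{\log(1+a)} \log(1+z) for z \in [0, a],
% with z = \sigma^{-2}\sigma_{t-1}^2(x_t) \le \sigma^{-2} and a = \sigma^{-2}:
\sum_{t=1}^{T} \sigma_{t-1}^2(x_t)
  = \sigma^2 \sum_{t=1}^{T} \sigma^{-2}\sigma_{t-1}^2(x_t)
  \le \frac{2}{\log(1+\sigma^{-2})} \cdot \frac{1}{2}\sum_{t=1}^{T} \log\!\big(1 + \sigma^{-2}\sigma_{t-1}^2(x_t)\big)
  = C_1\, I(f; y_T) \le C_1 \gamma_T,
  \qquad C_1 = \frac{2}{\log(1+\sigma^{-2})}.
```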

Page 58: Gaussian Process Methods in Machine Learning

Proof Outline 4

• Bounds on \gamma_T for the SE and Matérn kernels:
  - Much more complicated; see [Srinivas et al., 2012]
  - Requires careful bounding of kernel eigenvalues (a natural generalization of matrix eigenvalues to function spaces)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 36/ 46

Page 59: Gaussian Process Methods in Machine Learning

Advantages and Disadvantages of Bayesian Optimization

• There are also other methods for black-box function optimization, a prominent example being evolutionary methods

• Advantages of BO:
  - Rich GP modeling framework
  - Very efficient in terms of #points queried
  - Gives principled estimates of uncertainty

• Potential disadvantages of BO:
  - Can be computationally expensive (naively order n^3)
  - Often only suited to low #inputs (e.g., d < 10)
  - Choosing a good kernel can be very difficult
  - Standard algorithms don't immediately parallelize

There is plenty of research addressing all of these disadvantages!

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 37/ 46

Page 60: Gaussian Process Methods in Machine Learning

Note:

• Bayesian optimization (with GPs) is great for optimizing low-dimensional smooth functions whose samples are expensive

• (and promising, but often a work in progress, for other settings)

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 38/ 46

Page 61: Gaussian Process Methods in Machine Learning

Useful Programming Packages

• Useful libraries:
  - Python packages (some with other methods beyond GPs):
    - GPy and GPyOpt
    - Spearmint
    - BayesianOptimization
    - PyBo
    - HyperOpt
    - MOE
  - Packages for other languages:
    - GPML for MATLAB
    - GPFit and rBayesianOptimization for R

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 39/ 46

Page 62: Gaussian Process Methods in Machine Learning

Ongoing Challenges I – Scaling Up the Number of Samples

• Computing the posterior after sampling n points (x_1, . . . , x_n) typically requires O(n^3) computation
  - Problematic if n becomes moderate to large in size
  - An active research direction has been approximating the posterior efficiently

• Approach 1: Approximate by a GP with fewer points (a crude baseline is sketched below)
  - With k ≪ n such points, O(k^3) computation may be much more tolerable
  - Illustration from [Bauer et al., 2016]

• Approach 2: Approximate the posterior directly using a deep neural network
  - Can reduce the dependence from n^3 to just n (linear time!)
  - See, e.g., [Huang et al., 2015]

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 40/ 46
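
As a crude stand-in for Approach 1 (this is the simple "subset of data" baseline, not the inducing-point method illustrated in [Bauer et al., 2016]), one can fit the GP to k randomly chosen points out of the n collected; a sketch with illustrative names:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def subset_of_data_gp(X, y, k=100, noise_var=0.01, seed=0):
    """Fit a GP on a random subset of k points: roughly O(k^3) instead of O(n^3)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(k, len(X)), replace=False)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=noise_var)
    gp.fit(X[idx], y[idx])
    return gp

# Example with n = 5000 noisy samples of a 1-D toy function
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(5000, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(5000)
gp = subset_of_data_gp(X, y, k=200)
mu, std = gp.predict(np.linspace(0, 1, 11)[:, None], return_std=True)
```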

Page 63: Gaussian Process Methods in Machine Learning

Ongoing Challenges II – Scaling to High Dimensions

• Typical applications of BO consist of optimizing at most 10 or so variables, e.g.,

  f(x) = f(x_1, x_2, x_3, x_4, x_5).

• High-dimensional setting. Tens/hundreds/thousands/millions of variables, e.g.,

  f(x) = f(x_1, x_2, . . . , x_{999}, x_{1000}).

• To circumvent the curse of dimensionality, assume low-dimensional structure:
  - Example 1. Only a few variables impact f significantly:

    f(x_1, x_2, . . . , x_{1000}) \approx g(x_4, x_{32}, x_{99}, x_{256}, x_{965})

    (e.g., see REMBO [Wang et al., 2013])
  - Example 2. Additive structure:

    f(x_1, x_2, . . . , x_{10}) = g(x_1, x_2) + h(x_3, x_4, x_5) + u(x_6, x_7, x_8) + v(x_9, x_{10})

    (e.g., see Add-GP-UCB [Kandasamy et al., 2016])

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 41/ 46

Page 64: Gaussian Process Methods in Machine Learning

Ongoing Challenges II – Scaling to High Dimensions (Cont.)

• Let's look more at the first example:

  f(x_1, x_2, . . . , x_{1000}) \approx g(x_4, x_{32}, x_{99}, x_{256}, x_{965})

  - Challenge: In advance, we don't know exactly which (few) variables f depends on

• Embedding approach. Instead of optimizing the full function f(x) directly over x \in \mathbb{R}^{1000}, optimize over a low-dimensional variable: with a matrix A mapping \mathbb{R}^{10} into \mathbb{R}^{1000} (for example), optimize f(Az) over z \in \mathbb{R}^{10} (a sketch of this idea follows below)
  - Simplest choice: Generate A randomly
  - More sophisticated: Try to learn a good choice of A from data

• Illustration from [Wang et al., 2013]

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 42/ 46
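
A minimal sketch of the random-embedding idea, in the spirit of (but not reproducing) [Wang et al., 2013]: draw a random A, then run any low-dimensional search, here plain random search for brevity rather than a full BO loop, on g(z) = f(Az). Names are illustrative.

```python
import numpy as np

def random_embedding_search(f_highdim, D=1000, d=10, n_queries=50, seed=0):
    """Optimize f over R^D by searching a random d-dimensional embedding z -> A z."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D, d))            # random embedding matrix
    best_z, best_val = None, -np.inf
    for _ in range(n_queries):
        z = rng.uniform(-1, 1, size=d)         # low-dimensional candidate
        val = f_highdim(A @ z)                 # query the black box at A z
        if val > best_val:
            best_z, best_val = z, val
    return A @ best_z, best_val

# Toy high-dimensional objective that really only depends on a few coordinates
f = lambda x: -(x[4] - 0.5) ** 2 - (x[32] + 0.2) ** 2
x_best, val_best = random_embedding_search(f, D=1000, d=10)
```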

Page 65: Gaussian Process Methods in Machine Learning

Ongoing Challenges III – Robustness Considerations

• In practice, any GP modeling assumption is only an approximation

• Minor deviations from the model can be treated as noise

• Rare but large deviations from the model should be treated as outliers

• To avoid degradation, one approach is to identify and remove such points
  - Illustration from [Martinez-Cantin et al., 2017]

• More recent variations on robustness:
  - Adversarial perturbations of the final point returned [Bogunovic et al., 2018]
  - Adversarial perturbations of the sampled points [Bogunovic et al., 2020]

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 43/ 46


Page 67: Gaussian Process Methods in Machine Learning

References I

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.

[2] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Int. Conf. Art. Intel. Stats. (AISTATS), 2020.

[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. http://arxiv.org/abs/1012.2599, 2010.

[4] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[5] Daniel J. Lizotte, Tao Wang, Michael H. Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 44/ 46

Page 68: Gaussian Process Methods in Machine Learning

References II

[6] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. arXiv preprint arXiv:1712.04567, 2017.

[7] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. In Int. Conf. Mach. Learn. (ICML), 2016.

[8] J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Vol. 2, 1978.

[9] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. http://arxiv.org/abs/1403.5341, 2014.

[10] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.

[11] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 45/ 46

Page 69: Gaussian Process Methods in Machine Learning

References III

[12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Adv. Neur. Inf. Proc. Sys., 2012.

[13] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory, 58(5):3250–3265, May 2012.

[14] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[15] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio de Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proc. ACM Conf. Rec. Sys., pages 225–232, 2014.

[16] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas. Bayesian optimization in high dimensions via random embeddings. In Int. Joint. Conf. Art. Int., 2013.

CS6216 Advanced Topics in Machine Learning | Jonathan Scarlett ([email protected]) Slide 46/ 46