Gaussian Process Methods in Machine Learning
Jonathan Scarlett ([email protected])
Lecture 2: Optimization with Gaussian Processes
CS6216, Semester 1, AY2021/22
Outline of Lectures
• Lecture 0: Bayesian Modeling and Regression
• Lecture 1: Gaussian Processes, Kernels, and Regression
• Lecture 2: Optimization with Gaussian Processes
• Lecture 3: Advanced Bayesian Optimization Methods
• Lecture 4: GP Methods in Non-Bayesian Settings
Outline: This Lecture
• This lecture:
  1. Black-box function optimization
  2. Gaussian processes
  3. Bayesian optimization algorithms
  4. Regret bounds
  5. Ongoing challenges in Bayesian optimization
Optimization (I)
• So far, we have talked about learning the entire input-output relation, e.g., y ≈ f(x)
• A potentially easier problem is optimization, where we seek to find x maximizing f:

  x* ∈ arg max_{x ∈ D} f(x)

  – It is potentially easier because we typically don't need to learn f accurately for inputs x that are far from x*
• In this context, instead of simply having access to a data set D = {(x_t, y_t)}_{t=1}^n, we are able to adaptively query x_t based on the past samples y_1, ..., y_{t-1} (a form of active learning)
Optimization (II)
• A more familiar optimization setting:
  – Function f is known
  – Algorithm can use derivatives (e.g., gradient descent/ascent)
  – Main bottleneck is computation
• What we will consider:
  – Function f is unknown
  – Algorithm only has access to "black-box" function queries
  – Main bottleneck is the high cost of such queries
Optimization (III)
Black-box function optimization:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• Setting:
  – Unknown "reward" function f
  – Expensive evaluations of f
  – Noisy evaluations:

    y_t = f(x_t) + z_t,  z_t ~ N(0, σ²)

  – Choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}

• Note: This problem with a GP approach is often called Bayesian optimization
The Bayesian Mechanics
[Figure: GP posterior — mean predictions ± 3 standard deviations]

• Processing noisy observations in the Bayesian setting:

  y_t = f(x_t) + z_t, where z_t ~ N(0, σ²)

  – Given samples y_t = [y_1, ..., y_t] at points x_{1:t}, the posterior is also a GP:

    μ_{t+1}(x) = k_t(x)^T (K_t + σ² I_t)^{-1} y_t
    σ²_{t+1}(x) = k(x, x) − k_t(x)^T (K_t + σ² I_t)^{-1} k_t(x),

    where K_t = [k(x, x')]_{x,x' ∈ x_{1:t}} and k_t(x) = [k(x_i, x)]_{i=1}^t
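As a concrete illustration of these posterior formulas, here is a minimal NumPy sketch for scalar inputs. It is not lecture code: the squared-exponential kernel choice, lengthscale, and noise level are illustrative assumptions.

import numpy as np

def se_kernel(a, b, ell=0.2):
    # Pairwise squared-exponential kernel: k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def gp_posterior(x_obs, y_obs, x_query, noise_var=0.01, ell=0.2):
    # Posterior mean and variance at x_query, following the formulas above
    K = se_kernel(x_obs, x_obs, ell)             # K_t (t x t)
    k_q = se_kernel(x_obs, x_query, ell)         # column j is k_t(x_query[j])
    w = np.linalg.solve(K + noise_var * np.eye(len(x_obs)), k_q)
    mu = w.T @ y_obs                             # k_t(x)^T (K_t + s^2 I)^{-1} y_t
    var = 1.0 - np.sum(k_q * w, axis=0)          # k(x, x) = 1 for this kernel
    return mu, var

# Toy usage: 5 noisy samples of a 1-D function on [0, 1]
rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, size=5)
y_obs = np.sin(6 * x_obs) + 0.1 * rng.standard_normal(5)
mu, var = gp_posterior(x_obs, y_obs, np.linspace(0, 1, 100))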
Exercise
Exercise. Let D = [0, 1], so we are looking for x* = arg max_{x ∈ [0,1]} f(x).
  1. Draw an f which will be extremely difficult to optimize based on expensive evaluations f(x_t) (even if they are noiseless)
  2. Draw an f where such optimization is relatively easy
Challenge: Optimizing in the Dark
There are (infinitely) many functions consistent with a finite number of samples.

[Figure: many candidate functions agreeing with the same finite set of samples]
Relevance
Black-box function optimization:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• Applications in a vast number of domains...
  – hyperparameter tuning for learning algorithms [Snoek et al., 2012]
  – environmental monitoring and sensor networks [Srinivas et al., 2012]
  – recommender systems and advertising [Vanchinathan et al., 2014]
  – robotics [Lizotte et al., 2007]
  – molecule discovery [Gómez-Bombarelli et al., 2018]
  – ...and many more...
Example 1: Parameter Tuning
• A typical example is parameter tuning:
  – Robotics
  – Deep neural networks
  – ...

[Figure: performance surfaces as a function of Parameter 1 and Parameter 2]
Image source: (i) https://www.youtube.com/watch?v=x90UjisDPjM, (ii) https://www.ez-robot.com/,
(iii) https://wp.wwu.edu/machinelearning/2017/02/12/deep-neural-networks/
Example 2: Molecule Discovery
• To give an idea of a "less standard" application of black-box optimization, here is a recent example from [Gómez-Bombarelli et al., 2018]
• Molecular design:
  – Represent molecule structures in some feature space
  – Explore that space to discover configurations with desirable properties (e.g., elasticity, power conversion properties)
  – Kernel encodes similarities between different configurations
Impact
• Impact of GP-based black-box optimization: Test-of-time award at ICML 2020
• Famous application: Hyperparameter tuning in AlphaGo
Towards Practical Solutions
A black-box function optimization problem:

  x* ∈ arg max_{x ∈ D ⊆ R^d} f(x)

• The above problem is hard in general
• Sequentially choose x_t based on {(x_{t'}, y_{t'})}_{t' < t}
  – need a meaningful "success" metric: Regret
  – need additional assumptions: Smoothness
Simple and Cumulative Regret
• The true optimal point:

  x* ∈ arg max_{x ∈ D} f(x)

• Simple regret: After T rounds, report a point x and incur regret

  r(x) = f(x*) − f(x)

• Cumulative regret: After T rounds, the total regret incurred is

  R_T = Σ_{t=1}^T r(x_t)
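A tiny numeric example distinguishing the two notions; the function values and query sequence here are made up purely for illustration.

f = {0.0: 1.0, 0.25: 2.5, 0.5: 4.0, 0.75: 3.0, 1.0: 0.5}  # assumed toy values
f_star = max(f.values())                    # f(x*) = 4.0
queries = [0.0, 0.25, 0.5, 0.5, 0.5]        # x_1, ..., x_T with T = 5

instant = [f_star - f[x] for x in queries]  # r(x_t) for each round
R_T = sum(instant)                          # cumulative regret: 3.0 + 1.5 = 4.5
simple = f_star - f[queries[-1]]            # report x = x_T: simple regret 0.0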
Exercise
Question. In the following applications, are we more likely to care about simple regret¹ or cumulative regret²?

1. f(x) = performance of a deep neural network with hyperparameters x (e.g., x_1 = dropout rate, x_2 = learning rate, x_3 = number of layers, etc.)
2. f(x) = number of times a set of users clicked on the advertisement x shown on their Facebook page

¹ Suboptimality of the final reported point x
² Sum of suboptimality across all selected points x_1, ..., x_T
Exploration vs. Exploitation
• Black-box optimization algorithms naturally trade off two competing goals:
  – Exploration: Explore uncertain regions of space to learn more about them
    – Uncertain regions may contain very high values
  – Exploitation: If we already know some "good" regions:
    – Sampling there will incur less cumulative regret
    – Even if we are interested in simple regret, we might want to sample nearby to attain local improvements – this is important when we get close to the true maximum
Note:
To find the best x, we need to explore uncertain regions a sufficient amount.
Smoothness: Enter Gaussian Processes
Figure: Functions sampled from GP(0, k_SE) and from GP(0, k_Matérn)

• Model f by a Gaussian process

  f(·) ~ GP(μ(·), k(·, ·)),

  where μ(x) = E[f(x)] and k(x, x') = Cov[f(x), f(x')].

• Covariance specified by a kernel, e.g.,
  – Squared exponential (SE): k_SE(x, x') = exp(−‖x − x'‖² / (2ℓ²))
  – Matérn: k_Mat(x, x') = (2^{1−ν} / Γ(ν)) (√(2ν)‖x − x'‖ / ℓ)^ν K_ν(√(2ν)‖x − x'‖ / ℓ), where K_ν is the modified Bessel function
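A sketch of the two kernels as functions of the distance r = ‖x − x'‖; the default lengthscale ℓ and smoothness ν are arbitrary choices for illustration, and kv is SciPy's modified Bessel function K_ν.

import numpy as np
from scipy.special import gamma, kv   # kv(nu, z) is the modified Bessel K_nu

def k_se(r, ell=1.0):
    # Squared exponential: exp(-r^2 / (2 ell^2))
    return np.exp(-np.asarray(r, dtype=float) ** 2 / (2 * ell ** 2))

def k_matern(r, ell=1.0, nu=2.5):
    # Matern: (2^{1-nu} / Gamma(nu)) * s^nu * K_nu(s), with s = sqrt(2 nu) r / ell
    s = np.sqrt(2 * nu) * np.atleast_1d(np.asarray(r, dtype=float)) / ell
    out = np.ones_like(s)             # limit as r -> 0 is k(x, x) = 1
    pos = s > 0
    out[pos] = (2 ** (1 - nu) / gamma(nu)) * s[pos] ** nu * kv(nu, s[pos])
    return out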
Making use of Confidence Bounds
• A confidence region can be thought of as a region in which we believe the true function lies with high probability
  – Highest part of region: Upper confidence bound
  – Lowest part of region: Lower confidence bound
• Example from [de Freitas, 2013 GP lecture]
  – According to the confidence bounds, the maximizer should be in the white region
• How to construct the confidence bounds: We will see shortly
Note:
Confidence bounds allow us to rule out large regions of the input space as being suboptimal.
A Template for Sequentially Choosing x_t Based on {(x_{t'}, y_{t'})}_{t' < t}

1: for t = 1, 2, ..., T do
2:   choose new x_t by optimizing an acquisition function α(·):

       x_t ∈ arg max_{x ∈ D} α(x; D_{t−1}),

     where D_{t−1} is the data collected up to time t − 1
3:   query objective function f to obtain y_t = f(x_t) + z_t
4:   augment data D_t = D_{t−1} ∪ {(x_t, y_t)}
5:   update the GP model
6: end for
7: make final recommendation x
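A minimal sketch of this template over a finite candidate grid. The random first query and noise level are assumptions, and the acquisition rules on the following slides (UCB, EI, TS) plug in as the acquisition argument; those in turn reuse the assumed gp_posterior helper from the earlier sketch.

import numpy as np

def bayes_opt(f, candidates, acquisition, T, noise_std=0.1, seed=0):
    # Generic loop: acquisition(x_obs, y_obs, candidates) returns one score
    # per candidate; query the maximizer, then augment the data.
    rng = np.random.default_rng(seed)
    x0 = candidates[rng.integers(len(candidates))]   # random first query
    xs, ys = [x0], [f(x0) + noise_std * rng.standard_normal()]
    for t in range(1, T):
        scores = acquisition(np.array(xs), np.array(ys), candidates)
        x_t = candidates[int(np.argmax(scores))]     # maximize alpha(x; D_{t-1})
        xs.append(x_t)
        ys.append(f(x_t) + noise_std * rng.standard_normal())
    return xs[int(np.argmax(ys))]                    # final recommendation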
Upper Confidence Bound Algorithm
• Upper Confidence Bound (GP-UCB): [Srinivas et al., 2012]

  x_t = arg max_{x ∈ D} μ_{t−1}(x) + √β_t σ_{t−1}(x)

[Figure: GP-UCB illustration from M. Seeger's slides ("Gaussian Process Bandit Optimization"): model f by a Gaussian process and maximize the posterior upper confidence level, instead of the mean or variance alone; GP-UCB algorithm from Srinivas et al., arXiv:0912.3995]

• Intuition: Optimism in the face of uncertainty
  – We are obviously interested in high-mean points
  – At the same time, high-variance points have more chance of being much higher than the mean, so we should try to learn more about them
  – In other words, exploration vs. exploitation
  – Increasing β_t leads to more exploration
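The GP-UCB rule as an acquisition function for the loop sketched earlier; the fixed β here is a simplification (the theory lets β_t grow slowly with t), and gp_posterior is the assumed helper from the earlier sketch.

import numpy as np

def ucb_acquisition(x_obs, y_obs, candidates, beta=4.0):
    # alpha(x) = mu_{t-1}(x) + sqrt(beta) * sigma_{t-1}(x)
    mu, var = gp_posterior(x_obs, y_obs, candidates)
    return mu + np.sqrt(beta) * np.sqrt(np.maximum(var, 0.0))

# e.g.: bayes_opt(lambda x: np.sin(6 * x), np.linspace(0, 1, 200),
#                 ucb_acquisition, T=30)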
Note:
If we are optimistic in the face of uncertainty, we will naturally explore uncertain regions.
Expected Improvement Algorithm
• Expected Improvement (EI): [Mockus et al., 1978]

  x_t = arg max_{x ∈ D} E_{t−1}[(f(x) − ξ_{t−1}) 1{f(x) > ξ_{t−1}}]

  where
  – ξ_{t−1} is the highest observation up to time t − 1
  – E_{t−1}[·] is the average conditioned on the previous observations (i.e., the current GP posterior)

• Intuition:
  – We want to improve on the best value ξ_t found so far
  – If we sample x, the improvement over ξ_t is (f(x) − ξ_t) 1{f(x) > ξ_t}
  – Try to optimize this on average

• Advantage: No parameter tuning (unlike GP-UCB) – but sometimes ξ_t can be chosen differently to the above, especially in noisy settings; then it becomes a parameter.
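Under a Gaussian posterior with mean μ and standard deviation σ at x, the expectation above has the standard closed form EI(x) = (μ − ξ)Φ(z) + σφ(z) with z = (μ − ξ)/σ. A sketch as an acquisition function for the earlier loop, again reusing the assumed gp_posterior helper:

import numpy as np
from scipy.stats import norm

def ei_acquisition(x_obs, y_obs, candidates):
    # Closed form of E[(f(x) - xi) 1{f(x) > xi}] for a Gaussian posterior
    mu, var = gp_posterior(x_obs, y_obs, candidates)
    sigma = np.sqrt(np.maximum(var, 1e-12))
    xi = np.max(y_obs)                    # highest observation so far
    z = (mu - xi) / sigma
    return (mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)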
Thompson Sampling Algorithm
• Thompson sampling (TS): [Thompson, 1933]

  x_t = arg max_{x ∈ D} f̃(x)

  where f̃ is a random sample from the current posterior distribution

• Intuition:
  – Samples a point randomly according to the probability of it being optimal
  – This implicitly balances high mean and high variance, like the previous algorithms

• Advantage: Versatile (applies to general Bayesian settings beyond GPs); no parameter tuning as stated (but parameters may arise via the choice of prior/kernel)
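A sketch of one TS step on a finite candidate grid: draw a joint sample of f from the posterior over the candidates and score each point by its sampled value. The se_kernel helper and the noise level come from the earlier assumed sketch, and the jitter term is a standard numerical-stability trick.

import numpy as np

def thompson_acquisition(x_obs, y_obs, candidates, noise_var=0.01, seed=None):
    rng = np.random.default_rng(seed)
    K = se_kernel(x_obs, x_obs)                  # K_t
    k_q = se_kernel(x_obs, candidates)
    w = np.linalg.solve(K + noise_var * np.eye(len(x_obs)), k_q)
    mu = w.T @ y_obs                             # posterior mean on the grid
    cov = se_kernel(candidates, candidates) - k_q.T @ w
    cov += 1e-8 * np.eye(len(candidates))        # jitter for stability
    return rng.multivariate_normal(mu, cov)      # one sampled function

Maximizing this sampled function is exactly the TS choice, so it plugs into the earlier loop unchanged.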
Experimental Example 1
• Example plots of average regret from [Srinivas et al., 2012], for the GP-UCB rule

  x_t = arg max_{x ∈ D} μ_{t−1}(x) + √β_t σ_{t−1}(x)

  – (Left) Function drawn from a GP (synthetic)
  – (Middle) Activating sensors to find the highest temperature location in a room
  – (Right) Activating sensors to find the most congested region of a highway
Experimental Example 2
• Another example experiment from [Metzen, 2016]
  – Optimizing the parameters for a robot control task (ball throwing)
• Performance plots from [Metzen, 2016]:
  – We will see Entropy Search (ES) and Minimum Regret Search (MRS) next lecture
Note:
In practice, the most popular acquisition functions (UCB, EI, Thompson, etc.) tend to perform fairly similarly (at least in standard settings).

The choice of kernel (or class of kernels) impacts the performance more.
A Guarantee on the Cumulative Regret (I)
• The true optimum:

  x* ∈ arg max_{x ∈ D} f(x)

• Cumulative regret:

  R_T = Σ_{t=1}^T (f(x*) − f(x_t))
A Guarantee on the Cumulative Regret (II)
• Guarantee for GP-UCB [Srinivas et al., 2012]:

  R_T ≤ O(√(T γ_T)),   γ_T = max_{X ⊆ D : |X| = T} I(f; y_X),

  where y_X denotes the collection of noisy observations upon querying the points in X, and I(·; ·) denotes the mutual information (from information theory).

• Interpretation: γ_T is a kernel-dependent term capturing the difficulty of optimizing functions drawn according to that kernel
  – In particular, the regret is usually sub-linear in T
  – Specifically, it is typically upper bounded by something between O(√T) and o(T), depending on the kernel
  – e.g., O(√(T (log T)^{2d})) for the SE kernel, O(T^c) for some c ∈ (1/2, 1) for Matérn

• Similar guarantees exist for Thompson sampling [Russo and Van Roy, 2014]
A Commonly-Heard Quote
• Commonly-heard quote:

  "In theory, there is no difference between theory and practice. In practice, there is."

• Corrected quote:

  "In theory, there is no difference between theory and practice. In practice, there is."

• Limitations here: (i) Large constant/logarithmic factors that matter in practice; (ii) Perfect knowledge of the kernel is assumed
Proof Outline 1
• Consider the case that D is finite (continuous domains such as D = [0, 1]^d can be handled with a bit of extra effort)

• Key claim 1: If β_t = 2 log(|D| π² t² / (6δ)), then with probability at least 1 − δ:

  |f(x) − μ_{t−1}(x)| ≤ β_t^{1/2} σ_{t−1}(x),  ∀x ∈ D, t ≥ 1.

  That is, μ_{t−1}(x) ± β_t^{1/2} σ_{t−1}(x) gives valid confidence bounds on f(x).

  – Intuition: Random variables are within a few standard deviations of the mean, with high probability
  – Proof outline:
    – Recall each observation has N(0, σ²) noise.
    – By a Gaussian tail bound, the probability of violation for a single (x, t) is at most e^{−β_t/2}.
    – Apply the union bound over x ∈ D and all integers t; the choice of β_t gives e^{−β_t/2} = 6δ/(|D|π²t²), and we can apply Σ_{t=1}^∞ 1/t² = π²/6.
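Summing the per-(x, t) failure probabilities confirms the union bound calculation: the total failure probability is at most

\sum_{x \in D} \sum_{t=1}^{\infty} e^{-\beta_t/2}
  = \sum_{x \in D} \sum_{t=1}^{\infty} \frac{6\delta}{|D| \pi^2 t^2}
  = |D| \cdot \frac{6\delta}{|D| \pi^2} \cdot \frac{\pi^2}{6}
  = \delta.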
Proof Outline 2
• Key claim 2: Assuming valid confidence bounds, the instant regret at time t satisfies r_t ≤ 2 β_t^{1/2} σ_{t−1}(x_t)
  – Instant regret: r_t = f(x*) − f(x_t)
  – Proof:

    r_t = f(x*) − f(x_t)
        ≤ μ_{t−1}(x*) + β_t^{1/2} σ_{t−1}(x*) − f(x_t)   (by confidence bound)
        ≤ μ_{t−1}(x_t) + β_t^{1/2} σ_{t−1}(x_t) − f(x_t)  (by UCB rule)
        ≤ 2 β_t^{1/2} σ_{t−1}(x_t)                        (by confidence bound)
Proof Outline 3

• Wrapping up:
  – Sum the established bound r_t ≤ 2 β_t^{1/2} σ_{t−1}(x_t) over t = 1, ..., T to get

    R_T ≤ 2 β_T^{1/2} Σ_{t=1}^T σ_{t−1}(x_t),

    where we applied β_t ≤ β_T
  – By the Cauchy-Schwarz inequality (or Jensen's inequality, or the ℓ1/ℓ2-norm relation),

    Σ_{t=1}^T σ_{t−1}(x_t) ≤ √( T Σ_{t=1}^T σ²_{t−1}(x_t) )

  – Some simple (but less obvious) manipulations give

    Σ_{t=1}^T σ²_{t−1}(x_t) ≤ C₁ γ_T

    for a suitable constant C₁
    – This is shown by writing the mutual information (see the definition of γ_T) in terms of the values σ_{t−1}, using a well-known formula for the mutual information between Gaussians
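Chaining the three steps gives the regret bound stated earlier (with the β_T factor hidden in the O(·) notation there):

R_T \le 2 \beta_T^{1/2} \sum_{t=1}^{T} \sigma_{t-1}(x_t)
    \le 2 \beta_T^{1/2} \sqrt{T \sum_{t=1}^{T} \sigma_{t-1}^2(x_t)}
    \le 2 \sqrt{C_1 \beta_T T \gamma_T}.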
Proof Outline 4
• Bounds on γ_T for SE and Matérn kernels:
  – Much more complicated; see [Srinivas et al., 2012]
  – Requires careful bounding of kernel eigenvalues (a natural generalization of matrix eigenvalues to function spaces)
Advantages and Disadvantages of Bayesian Optimization
• There are also other methods for black-box function optimization, a prominent example being evolutionary methods

• Advantages of BO:
  – Rich GP modeling framework
  – Very efficient in terms of #points queried
  – Gives principled estimates of uncertainty

• Potential disadvantages of BO:
  – Can be computationally expensive (naively O(n³))
  – Often only suited to low #inputs (e.g., d < 10)
  – Choosing a good kernel can be very difficult
  – Standard algorithms don't immediately parallelize

There is plenty of research addressing all of these disadvantages!
Note:
• Bayesian optimization (with GPs) is great for optimizing low-dimensional smooth functions whose samples are expensive
• (and promising, but often a work in progress, for other settings)
Useful Programming Packages
• Useful libraries:
  – Python packages (some with other methods beyond GPs):
    – GPy and GPyOpt
    – Spearmint
    – BayesianOptimization
    – PyBo
    – HyperOpt
    – MOE
  – Packages for other languages:
    – GPML for MATLAB
    – GPFit and rBayesianOptimization for R
Ongoing Challenges I – Scaling Up the Number of Samples
• Computing the posterior after sampling n points (x_1, ..., x_n) typically requires O(n³) computation
  – Problematic if n becomes moderate to large in size
  – An active research direction has been approximating the posterior efficiently

• Approach 1: Approximate by a GP with fewer points (see the sketch after this list)
  – With k ≪ n such points, O(k³) computation may be much more tolerable
  – Illustration from [Bauer et al., 2016]

• Approach 2: Approximate the posterior directly using a deep neural network
  – Can reduce the dependence from n³ to just n (linear time!)
  – See, e.g., [Huang et al., 2015]
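A deliberately crude sketch of the spirit of Approach 1: condition on a random size-k subset of the data instead of all n points, reusing the assumed gp_posterior helper from the earlier sketch. Proper sparse-GP methods choose pseudo-points far more carefully (e.g., via variational inducing points).

import numpy as np

def subset_posterior(x_obs, y_obs, x_query, k=50, seed=0):
    # O(k^3) instead of O(n^3): keep only a random subset of the observations
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_obs), size=min(k, len(x_obs)), replace=False)
    return gp_posterior(x_obs[idx], y_obs[idx], x_query)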
Ongoing Challenges II – Scaling to High Dimensions
• Typical applications of BO consist of optimizing at most 10 or so variables, e.g.,

  f(x) = f(x_1, x_2, x_3, x_4, x_5).

• High-dimensional setting: Tens/hundreds/thousands/millions of variables, e.g.,

  f(x) = f(x_1, x_2, ..., x_999, x_1000).

• To circumvent the curse of dimensionality, assume low-dimensional structure:
  – Example 1. Only a few variables impact f significantly:

    f(x_1, x_2, ..., x_1000) ≈ g(x_4, x_32, x_99, x_256, x_965)

    (e.g., see REMBO [Wang et al., 2013])
  – Example 2. Additive structure:

    f(x_1, x_2, ..., x_10) = g(x_1, x_2) + h(x_3, x_4, x_5) + u(x_6, x_7, x_8) + v(x_9, x_10)

    (e.g., see Add-GP-UCB [Kandasamy et al., 2016])
Ongoing Challenges II – Scaling to High Dimensions (Cont.)

• Let's look more at the first example:

  f(x_1, x_2, ..., x_1000) ≈ g(x_4, x_32, x_99, x_256, x_965)

  – Challenge: In advance, we don't know exactly which (few) variables f depends on

• Embedding approach: Instead of optimizing the full function f(x) directly with x ∈ R^1000, optimize f(Az) over z ∈ R^10 (for example), where A is a 1000 × 10 matrix
  – Simplest choice: Generate A randomly (see the sketch below)
  – More sophisticated: Try to learn a good choice of A from data

• Illustration from [Wang et al., 2013]
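A sketch of the embedding approach with the simplest (random) choice of A; the dimensions and Gaussian entries are illustrative assumptions.

import numpy as np

def embedded_objective(f, d_high=1000, d_low=10, seed=0):
    # Optimize g(z) = f(A z) over the low-dimensional z instead of x directly
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d_high, d_low))   # random embedding matrix
    return lambda z: f(A @ z)

# Any low-dimensional BO routine (e.g., the bayes_opt sketch) can then be
# run on the returned objective over z in R^{d_low}.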
Ongoing Challenges III – Robustness Considerations
• In practice, any GP modeling assumption is only an approximation
• Minor deviations from the model can be treated as noise
• Rare but large deviations from the model should be treated as outliers
• To avoid degradation, one approach is to identify and remove such points
  – Illustration from [Martinez-Cantin et al., 2017]
• More recent variations on robustness:
  – Adversarial perturbations of the final point returned [Bogunovic et al., 2018]
  – Adversarial perturbations of the sampled points [Bogunovic et al., 2020]
References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.

[2] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Int. Conf. Art. Intel. Stats. (AISTATS), 2020.

[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. http://arxiv.org/abs/1012.2599, 2010.

[4] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[5] Daniel J. Lizotte, Tao Wang, Michael H. Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.

[6] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. arXiv preprint arXiv:1712.04567, 2017.

[7] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. In Int. Conf. Mach. Learn. (ICML), 2016.

[8] J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Vol. 2, 1978.

[9] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. http://arxiv.org/abs/1403.5341, 2014.

[10] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.

[11] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.

[12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Adv. Neur. Inf. Proc. Sys., 2012.

[13] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory, 58(5):3250–3265, May 2012.

[14] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[15] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio de Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proc. ACM Conf. Rec. Sys., pages 225–232, 2014.

[16] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas. Bayesian optimization in high dimensions via random embeddings. In Int. Joint. Conf. Art. Int., 2013.