A tutorial on Bayesian optimization
Zi Wang @ Google Brain


Page 1: A tutorial on Bayesian optimization

Zi Wang @ Google Brain

Page 2: Blackbox Function Optimization

Goal: find x* = arg max_{x ∈ X} f(x) for a blackbox function f.

[Calandra et al., 2015]

Page 3: Blackbox Function Optimization

Goal: find x* = arg max_{x ∈ X} f(x).

Example application: synthetic gene design.

Challenges:

  • f is expensive to evaluate
  • f is multi-peak
  • no gradient information
  • evaluations can be noisy

[Snoek et al., 2012; Gonzalez et al., 2015; Hernández-Lobato et al., 2017]

Page 4: Blackbox Function Optimization

Goal: find x* = arg max_{x ∈ X} f(x).

Challenges:

  • f is expensive to evaluate
  • f is multi-peak
  • no gradient information
  • evaluations can be noisy

Grid search? Random search? Many evaluations are wasted!

[Koch, 2016]

Page 5: Bayesian Optimization

Idea: build a probabilistic model of the function f.

LOOP:
  • choose new query point(s) to evaluate (decision criterion: an acquisition function)
  • update the model with the new observations

A minimal sketch of this loop follows below.
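To make the loop concrete, here is a minimal sketch in Python using scikit-learn's GP regressor and a simple upper-confidence-bound criterion maximized over random candidates. The objective f, the candidate count, and the UCB weight are illustrative choices, not from the slides.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                       # stand-in blackbox objective (illustrative only)
    return -np.sum((x - 0.3) ** 2, axis=-1)

rng = np.random.default_rng(0)
dim, n_init, n_iter = 2, 3, 20

X = rng.uniform(0, 1, size=(n_init, dim))   # a few initial random queries
y = f(X)

for t in range(n_iter):
    # 1) Fit / update the probabilistic model of f.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-4)
    gp.fit(X, y)

    # 2) Choose the next query by maximizing an acquisition function
    #    (here: a simple upper confidence bound over random candidates).
    candidates = rng.uniform(0, 1, size=(2000, dim))
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma
    x_next = candidates[np.argmax(ucb)]

    # 3) Evaluate the blackbox function and update the data.
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print("best value found:", y.max())
```

In practice the acquisition function is usually optimized with multi-start or gradient-based methods rather than random candidates, and the GP hyperparameters are re-estimated each iteration.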

Page 6: Gaussian Processes (GPs)

  • probability distribution over functions
  • any finite set of function values is a multivariate Gaussian

(Figure: sample functions plotted as output vs. input.)

Page 7: Gaussian Processes (GPs)

  • kernel function k(x, x'); mean function μ(x)
  • function f ~ GP(μ, k); observe noisy output y = f(x) + ε at input x
  • probability distribution over functions
  • any finite set of function values is a multivariate Gaussian

Page 8: Gaussian Processes (GPs)

Given observations D_t = {(x_i, y_i)}, predict the posterior mean and variance in closed form via Gaussian conditioning.

(Figures: samples from the prior; samples from the posterior.)
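The closed-form posterior referred to here is standard. With observations D_t = {(x_i, y_i)}, noise variance σ², kernel matrix K_t with entries k(x_i, x_j), and k_t(x) = [k(x, x_1), …, k(x, x_t)]ᵀ, the usual expressions (written with a zero prior mean for brevity) are:

```latex
\mu_t(x) = k_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} y_t,
\qquad
\sigma_t^2(x) = k(x, x) - k_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} k_t(x).
```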

Page 9: Bayesian Optimization

Idea: build a probabilistic model of the function f.

LOOP:
  • choose new query point(s) to evaluate (decision criterion: an acquisition function)
  • update the model with the new observations

Page 10: Acquisition functions in BayesOpt

How to design acquisition functions?

Page 11: Examples of acquisition functions in BayesOpt

Upper confidence bounds, expected improvement, probability of improvement, entropy search methods, …

[Auer, 2002; Srinivas et al., 2010; Kushner, 1964; Mockus, 1974; Hennig & Schuler, 2012; Hernández-Lobato et al., 2014; Wang & Jegelka, 2017; Hoffman & Ghahramani, 2015; …]

Page 12: GP-UCB: an example of acquisition functions

[Auer, 2002; Srinivas et al., 2010]
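The GP-UCB rule itself did not survive extraction; its standard form from the cited papers picks the point that maximizes an optimistic estimate of f,

```latex
x_{t+1} = \arg\max_{x \in \mathcal{X}} \; \mu_t(x) + \beta_t^{1/2} \, \sigma_t(x),
```

where β_t trades off exploration (σ_t) against exploitation (μ_t) and grows slowly with t in the theoretical analyses.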

Pages 13-26: GP-UCB: an example of acquisition functions

(Figure-only slides: a step-by-step animation of GP-UCB alternating between maximizing the upper confidence bound, evaluating f there, and updating the GP posterior.)

Page 27: Example of acquisition functions: PI [Kushner, 1964]

Probability of Improvement (PI)

  • Observations: D_t = {(x_i, y_i)}_{i=1}^t
  • The best observation so far is y* = max_i y_i
  • For each x, predict the posterior mean μ_t(x) and variance σ_t²(x)

The PI criterion in its standard form is given below.
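The PI formula on this slide was lost in extraction; the standard form, with Φ the standard normal CDF and an optional improvement margin θ ≥ 0, is:

```latex
\alpha_{\mathrm{PI}}(x) = \Pr\left(f(x) > y^* + \theta\right)
= \Phi\!\left(\frac{\mu_t(x) - y^* - \theta}{\sigma_t(x)}\right).
```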

Page 28: Example of acquisition functions: EI [Mockus, 1974]

Expected Improvement (EI)

  • Observations: D_t = {(x_i, y_i)}_{i=1}^t
  • The best observation so far is y* = max_i y_i
  • For each x, predict the posterior mean μ_t(x) and variance σ_t²(x)

Both improvement-based criteria are sketched numerically below.
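A small numerical sketch of both improvement-based criteria using their standard closed forms (the slide's own expressions did not extract); mu, sigma, and y_best would come from the GP posterior and the data:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best, theta=0.0):
    """PI: probability that f(x) exceeds the incumbent y_best (+ margin theta)."""
    gamma = (mu - y_best - theta) / sigma
    return norm.cdf(gamma)

def expected_improvement(mu, sigma, y_best):
    """EI: expected positive improvement over the incumbent,
    E[max(f(x) - y_best, 0)] under the Gaussian posterior N(mu, sigma^2)."""
    gamma = (mu - y_best) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Example: posterior at three candidate points.
mu = np.array([0.2, 0.5, 0.9])
sigma = np.array([0.3, 0.1, 0.4])
print(probability_of_improvement(mu, sigma, y_best=0.6))
print(expected_improvement(mu, sigma, y_best=0.6))
```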

Page 29: Entropy Search and Predictive Entropy Search

ES and PES choose the next point to query to be maximally informative about the location of the global optimum, given the observed data.

[Hennig & Schuler, 2012; Hernández-Lobato et al., 2014]

Page 30: Max-value Entropy Search

MES chooses the point to query that is most informative about the global max-value y* rather than about the location of the global optimum: the optimum location lives in the D-dimensional input space, while the max-value lives in the 1-dimensional output space. Conditioning the Gaussian predictive distribution on y* gives a truncated Gaussian, which yields a closed-form acquisition function.

[Wang & Jegelka, 2017; Hoffman & Ghahramani, 2015]
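The "closed form" mentioned here is the MES acquisition. As derived in Wang & Jegelka (2017), with γ_{y*}(x) = (y* − μ_t(x)) / σ_t(x), φ and Φ the standard normal pdf and CDF, and K sampled max-values y*, the acquisition is approximately:

```latex
\alpha_t(x) \approx \frac{1}{K} \sum_{y^*} \left[
\frac{\gamma_{y^*}(x)\,\phi\!\left(\gamma_{y^*}(x)\right)}{2\,\Phi\!\left(\gamma_{y^*}(x)\right)}
- \log \Phi\!\left(\gamma_{y^*}(x)\right) \right].
```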

Page 31: Sample y* with a Gumbel distribution

Intuition: each f(x) is a Gaussian under the GP posterior.

  • Sample representative points.
  • Approximate the distribution of the max-value over the representative points by a Gumbel distribution [Wang & Jegelka, 2017], as sketched below.
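A sketch of this Gumbel approximation, assuming mu and sigma are the GP posterior mean and standard deviation at the sampled representative points (function and variable names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def sample_max_values(mu, sigma, n_samples=10, rng=None):
    """Approximately sample the global max-value y* via a Gumbel fit.

    The CDF of max_i f(x_i) over representative points is approximated by
    prod_i Phi((z - mu_i) / sigma_i); a Gumbel(a, b) distribution is then
    fitted to that CDF by matching two of its quantiles."""
    rng = np.random.default_rng(rng)

    def max_cdf(z):
        return np.prod(norm.cdf((z - mu) / sigma))

    lo = np.max(mu - 5.0 * sigma)    # bracket for inverting the CDF
    hi = np.max(mu + 10.0 * sigma)

    def quantile(p):
        a, b = lo, hi
        for _ in range(100):         # bisection on the monotone CDF
            m = 0.5 * (a + b)
            if max_cdf(m) < p:
                a = m
            else:
                b = m
        return 0.5 * (a + b)

    # Match the 25% and 75% quantiles of Gumbel CDF exp(-exp(-(z - a) / b)).
    q1, q3 = quantile(0.25), quantile(0.75)
    b = (q3 - q1) / (np.log(np.log(4.0)) - np.log(np.log(4.0 / 3.0)))
    a = q1 + b * np.log(np.log(4.0))

    # Inverse-CDF sampling from the fitted Gumbel.
    u = rng.uniform(size=n_samples)
    return a - b * np.log(-np.log(u))
```

The sampled max-values are then plugged into the MES acquisition above.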

Page 32: MES gets faster and better empirical results than PES

(Figure: benchmark comparison of PES and MES using 1, 10, and 100 samples each.) [Wang & Jegelka, 2017]

Page 33: MES gets faster and better empirical results than PES

(Figure: further empirical comparisons.) [Wang & Jegelka, 2017]

Page 34: Understanding the acquisition function in MES

(Derivation slide; the equations relating the MES acquisition to an equivalent criterion did not survive extraction.) [Wang & Jegelka, 2017]

Page 35: Relations among GP-UCB, PI and MES

GP-UCB, PI and MES are equivalent to one another under special choices of their parameters.

[Jones, 2001; Wang & Jegelka, 2017]

Page 36: Relations among GP-UCB, PI and MES

They are also equivalent to other acquisition criteria under special choices of the corresponding parameters.

[Jones, 2001; Wang & Jegelka, 2017]

Page 37: Regret bounds for GP-UCB and related methods

Define regret as the gap between the global optimum of f and the values of the queried points (standard definitions below).

Key assumptions: the mean function and kernel are both given, and the optimization is over a d-dimensional compact space.

After T iterations, GP-UCB obtains a regret bound that is sublinear in T; MES obtains an analogous guarantee.*

* For simplicity, only regret bounds for Gaussian kernels are discussed here; regret for other kernels may look different.

[Srinivas et al., 2010; Wang et al., 2016]
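The slide's exact bound expressions did not survive extraction; for reference, the standard regret quantities such analyses bound are, with x* a global maximizer and x_t the point queried at iteration t:

```latex
r_t = f(x^*) - f(x_t), \qquad
R_T = \sum_{t=1}^{T} r_t \;\; \text{(cumulative regret)}, \qquad
r_T^{\mathrm{simple}} = f(x^*) - \max_{t \le T} f(x_t).
```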

Page 38: Summary of how BayesOpt works

  • Posterior estimation: update the GP posterior given the data collected so far.
  • Define an acquisition function: UCB, EI, PI, ES, and others (see the earlier slides for their forms).
  • Evaluate f at the maximizer of the acquisition function, then repeat.

Page 39: Selected topics in BayesOpt

Challenges, open problems and some attempts:

  ➔ High-dimensional search space
  ➔ Unknown GP prior
  ➔ Parallel evaluations
  ➔ Unknown constraints
  ➔ Applications in robotics

Page 40: Challenges, open problems and some attempts: high-dimensional search space

Page 41: Challenges in high-dimensional BO

  • Optimizing multi-peak acquisition functions in high dimensions is computationally challenging.
  • Estimating a nonlinear function in high input dimensions needs more observations, which is statistically challenging.

[Wang et al., 2013; Djolonga et al., 2013; Kandasamy et al., 2015]

Page 42: Possible solution: additive Gaussian processes

  • Optimize the acquisition function block-wise (computational efficiency).
  • Lower-complexity component functions (statistical efficiency).

But what is the additive structure? A kernel sketch follows below.

[Hastie & Tibshirani, 1990; Kandasamy et al., 2015]
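A minimal sketch of an additive kernel: the full kernel is a sum of low-dimensional kernels, each acting on one group of input dimensions. The helper below is illustrative (plain NumPy, RBF components with one lengthscale per group), not the implementation used in the cited papers.

```python
import numpy as np

def additive_rbf_kernel(X1, X2, groups, lengthscales):
    """k(x, x') = sum_j k_j(x[g_j], x'[g_j]), each k_j an RBF on group g_j.

    X1: (n1, D), X2: (n2, D); groups: list of lists of dimension indices;
    lengthscales: one RBF lengthscale per group."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for g, ell in zip(groups, lengthscales):
        diff = X1[:, g][:, None, :] - X2[:, g][None, :, :]
        K += np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / ell ** 2)
    return K
```

Because each component only depends on a few dimensions, the acquisition function decomposes into per-group terms that can be optimized block-wise.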

Page 43: Structural Kernel Learning (SKL) [Wang et al., 2017]

Decomposition indicator: z = [0 1 0 0 1 1 1 0 2] assigns each input dimension to an additive group. Learn z (or integrate it out); a small illustration follows below.
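Reading the decomposition indicator as a grouping of input dimensions, e.g. for the z shown on this slide:

```python
z = [0, 1, 0, 0, 1, 1, 1, 0, 2]
groups = [[i for i, zi in enumerate(z) if zi == g] for g in sorted(set(z))]
# groups == [[0, 2, 3, 7], [1, 4, 5, 6], [8]]
```

These groups are exactly what the additive kernel sketch above takes as input; SKL learns z from data rather than fixing it by hand.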

Page 44: Empirical results for Structural Kernel Learning (SKL)

(Figure: regret vs. iteration, comparing SKL against No Partition, Fully Partitioned, and heuristic decompositions.) [Wang et al., 2018]

Page 45: Connection to genetic algorithms?

Evolutionary/genetic algorithms:

  ● maintain an ensemble of promising points
  ● generate new points by randomly exchanging coordinates of good points (crossover)

[https://en.wikipedia.org/wiki/Crossover_(genetic_algorithm)]

Page 46: Connection to genetic algorithms? [Wang et al., 2018]

BayesOpt with additive GPs, 2D toy example:

  • Observed good points: [-1, 0], [2, 2]
  • Query points: [-1, 2], [2, 0] (obtained by exchanging coordinates)
  • The coordinate partition is learned instead of completely random.

Page 47: Other ideas to solve high-dim BayesOpt

  ● REMBO: low-dim embedding [Wang et al., JAIR 2016]
  ● BOCK: BO with cylindrical kernels [Oh et al., ICML 2018]
  ● Additive GPs with overlapping groups [Rolland et al., AISTATS 2018]
  ● ……

Joint problem: these methods assume special structure in high-dimensional functions, but with little data it is difficult to verify whether those assumptions hold.

Page 48: Challenges, open problems and some attempts: parallel evaluations

Page 49: BayesOpt with parallel compute resources

  ● GPUs running in parallel for hyperparameter tuning in deep learning;
  ● groups of robots for offline learning of control parameters;
  ● parallel wet-lab experiments for biology and chemistry applications; etc.

[https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04; https://d2l.ai/chapter_convolutional-modern/resnet.html; https://ai.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html; https://en.wikipedia.org/wiki/Wet_lab]

Page 50: Some ideas to propose a batch of queries

  ● Instead of optimizing the information gain over one input, optimize it over Q inputs jointly. [Shah & Ghahramani, 2015]
  ● Choose a new point based on the expected acquisition function under all possible outcomes of the pending evaluations. [Snoek et al., 2012]

Page 51: Some ideas to propose a batch of queries

  ● Use a determinantal point process (DPP) to generate a diverse set of queries. [Kathuria et al., 2016; Kulesza & Taskar, 2011]
  ● Use a Mondrian process to propose one query per partition. [Wang et al., 2018; Roy & Teh, 2009]

A greedy DPP-style selection sketch follows below.
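A rough sketch of the DPP idea: given a positive semi-definite similarity matrix over candidate points (e.g., the GP posterior covariance), pick a batch whose selected submatrix has large log-determinant, which favors diverse points. The greedy routine below is an illustrative approximation, not the exact procedure of the cited papers.

```python
import numpy as np

def greedy_diverse_batch(K, batch_size):
    """Greedily select candidate indices that maximize log det of the
    selected submatrix of the PSD kernel matrix K (a DPP-flavoured
    diversity heuristic)."""
    selected = []
    for _ in range(batch_size):
        best_i, best_logdet = None, -np.inf
        for i in range(K.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            _, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            if logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected
```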

Page 52: Potential issues with existing methods

  ● Computational cost is usually high.
  ● Not all methods adapt to asynchronous parallel BayesOpt settings.
  ● They are difficult to debug, especially in high-dimensional settings.
  ● Parallel BayesOpt typically co-occurs with large-scale, high-dimensional problems, but a satisfying joint solution for these conditions does not yet exist.

Page 53: Challenges, open problems and some attempts: unknown priors

Page 54: Bayesian optimization with an unknown prior

Estimate the "prior" from data:
  • maximum likelihood
  • hierarchical Bayes

However:
  • regret bounds exist only when the prior is assumed given
  • bad settings of the prior make BO perform poorly and seem like a bad approach

Page 55: Bayesian optimization with an unknown prior: meta / multi-task / transfer learning

  • Our idea: learn the "prior" from past experience with similar functions.
  • Assumption: we can collect data on functions sampled from the same prior.

[Wang* & Kim* & Kaelbling, NeurIPS 2018]

Page 56: Prior estimation with meta training data

How to learn the GP prior? A finite input space is used to illustrate; extending to the continuous case requires more assumptions. [Wang et al., 2018 + ongoing work]

  Task 1:   (x_1, y_11)  (x_2, y_12)  ...  (x_M, y_1M)
  Task 2:   (x_1, y_21)  (x_2, y_22)  ...  (x_M, y_2M)
  ...
  Task N:   (x_1, y_N1)  (x_2, y_N2)  ...  (x_M, y_NM)
  New task: ?            ?            ...  ?

Page 57: How to estimate the GP prior?

(The meta-training table from Page 56 is repeated here.)

Unbiased prior estimator.

Page 58: How to estimate the GP posterior?

(The meta-training table from Page 56 is repeated here.)

Unbiased posterior estimator; a sketch of the prior-estimation step follows below.
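A sketch of the "unbiased prior estimator" idea on a finite input set, assuming Y stacks the meta-training observations with one row per task and one column per input x_j. This is a simplified reading of the slides; the precise estimators and their analysis are in Wang*, Kim* & Kaelbling (NeurIPS 2018).

```python
import numpy as np

def estimate_gp_prior(Y):
    """Y: (N tasks, M inputs) matrix with Y[i, j] = observed value of task i at x_j.

    Treating each task as an i.i.d. draw from the same GP prior, the sample mean
    per column estimates the prior mean mu(x_j), and the unbiased sample covariance
    across tasks estimates the kernel matrix k(x_j, x_l)."""
    mu_hat = Y.mean(axis=0)
    K_hat = np.cov(Y, rowvar=False, bias=False)   # divides by N - 1
    return mu_hat, K_hat
```

The estimated mean vector and kernel matrix can then be plugged into the usual GP posterior formulas for the new task.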

Page 59: Regret bound without knowledge of the GP prior

[Wang* & Kim* & Kaelbling, NeurIPS 2018]

Page 60: Empirical results on block picking and placing

[Wang* & Kim* & Kaelbling, NeurIPS 2018]

Page 61: MetaBO gives better performance with fewer samples

[Wang et al., NeurIPS 2018]

Page 62: Challenges, open problems and some attempts: applications in robotics

Page 63: Type of tasks: complex, long-horizon, stochastic, fully observable

Page 64: Learning a pushing skill with multi-modal dynamics

Page 65: Computing the action values is expensive

(Figure: a start state S and goal G.) Evaluate on demand; solve via BO.

Page 66: Focusing on relevant states with RTDP [Barto et al., 1995]

(Figure: the same start S and goal G.) Evaluate on demand; solve via BO.

Page 67: How to plan with learned skills?

  ● Sample continuous skill parameters
  ● Tree search on both discrete and continuous variables

Page 68: How to sample skill parameters for the planner?

Page 69: More talks on BayesOpt

The 2021 Google BayesOpt Speaker Series is now publicly available on the Google TechTalks channel on YouTube.

Page 70: Questions?