Bayesian Optimization (BO)


Javad Azimi

Fall 2010

http://web.engr.oregonstate.edu/~azimi/

Outline

• Formal Definition
• Application
• Bayesian Optimization Steps
  – Surrogate Function (Gaussian Process)
  – Acquisition Function
    • PMAX
    • IEMAX
    • MPI
    • MEI
    • UCB
    • GP-Hedge

Formal Definition

• Input: a budget of experiments and an expensive, unknown function f over an input space X that can only be evaluated point by point.

• Goal: find the maximizer x* = argmax_x f(x) using as few evaluations as possible.

Fuel Cell Application

[Figure: schematic of a microbial fuel cell (MFC). Bacteria at the anode oxidize fuel (organic matter) into oxidation products (CO2), releasing electrons (e-) that travel to the cathode, where O2 and H+ combine into H2O. "This is how an MFC works."]

[Figure: SEM image of bacteria sp. on Ni nanoparticle enhanced carbon fibers.]

The nano-structure of the anode significantly impacts electricity production.

We want to optimize the anode nano-structure to maximize power output by selecting a set of experiments.

Big Picture

• Since running an experiment is very expensive, we use BO.
• Select one experiment to run at a time, based on the results of previous experiments (a minimal sketch of this loop follows below).

[Diagram: the BO loop: current experiments → our current model → select a single experiment → run the experiment → repeat.]
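A minimal sketch of this select-and-run loop, assuming a generic surrogate model and acquisition function; all names below are illustrative, not from the slides:

```python
def bayesian_optimization(f, candidates, surrogate, acquisition, n_experiments):
    """Generic BO loop: update the model, select one experiment, run it, repeat.

    f           : the expensive experiment (e.g. measured fuel-cell power)
    candidates  : pool of possible experiments (points in the input space)
    surrogate   : model with fit(X, y) and predict(x) -> (mean, variance)
    acquisition : scores a point from its posterior mean/variance and the best y so far
    """
    X, y = [], []
    for _ in range(n_experiments):
        surrogate.fit(X, y)                      # "our current model"
        best = max(y) if y else float("-inf")
        x_next = max(candidates,                 # "select single experiment"
                     key=lambda x: acquisition(*surrogate.predict(x), best))
        X.append(x_next)
        y.append(f(x_next))                      # "run experiment"
    return X, y
```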

BO Main Steps

• Surrogate Function (Response Surface, Model)
  – Makes a posterior over unobserved points based on the prior.
  – Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
  – Decides which sample should be selected next.

Surrogate Function

• Simulates the distribution of the unknown function based on the prior.
  – Deterministic (classical linear regression, …)
    • There is a single deterministic prediction for each point x in the input space.
  – Stochastic (Bayesian regression, Gaussian Process, …)
    • There is a distribution over the prediction for each point x in the input space (e.g. a normal distribution).
  – Example (illustrated in the snippet below)
    • Deterministic: f(x1) = y1, f(x2) = y2
    • Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
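A toy illustration of this difference in code, with made-up numbers mirroring the example above (the dictionaries are purely illustrative):

```python
# Deterministic surrogate: a single predicted value per input point.
def deterministic_predict(x):
    return {"x1": 3.0, "x2": 1.5}[x]                 # f(x1) = y1, f(x2) = y2

# Stochastic surrogate: a whole distribution (here mean and variance) per input point.
def stochastic_predict(x):
    return {"x1": (3.0, 2.0), "x2": (1.5, 5.0)}[x]   # f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
```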

Gaussian Process(GP)

• A Gaussian process is a collection number of random variables, any finite number of which have a joint Gaussian distribution.– Consistency requirement or marginalization

property.• Marginalization property:
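Concretely, for a jointly Gaussian block of variables, dropping (marginalizing out) one block leaves a Gaussian with the corresponding sub-mean and sub-covariance:

$$
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \;\Longrightarrow\; \mathbf{y}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11})
$$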

Gaussian Process (GP)

• Formal prediction: see the equations after this list.
• Interesting points:
  – The squared exponential covariance function corresponds to Bayesian linear regression with an infinite number of basis functions.
  – The variance is independent of the observed values.
  – The mean is a linear combination of the observed values.
  – If the covariance function specifies the entries of the covariance matrix, marginalization is satisfied!
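For reference, the standard noise-free GP predictive equations, which the points above describe, are (X are the training inputs, y the observed values, x_* a test point, and k the covariance function):

$$
\mu(x_*) = k(x_*, X)\, K(X, X)^{-1}\, \mathbf{y}
$$

$$
\sigma^2(x_*) = k(x_*, x_*) - k(x_*, X)\, K(X, X)^{-1}\, k(X, x_*)
$$

The mean is a weighted (linear) combination of the entries of y, while the variance does not involve y at all.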

Gaussian Process (GP)

• A Gaussian Process is:
  – An exact interpolating regression method.
    • It predicts the training data perfectly (not true in classical regression).
  – A natural generalization of linear regression.
    • A nonlinear regression approach!
  – A simple example of a GP can be obtained from Bayesian regression.
    • Identical results.
  – A specification of a distribution over functions.

Gaussian Process (2): Distribution over Functions

[Figure: GP posterior showing the 95% confidence interval for each point x, together with three sampled functions.]

Gaussian Process (2): GP vs. Bayesian Regression

• Bayesian regression:
  – Distribution over weights.
  – The prior is defined over the weights.
• Gaussian Process:
  – Distribution over functions.
  – The prior is defined over the function space.
• These are the same, but viewed from different perspectives.

Short Summary

• Given any unobserved point z, we can define a normal distribution over its predicted value such that:
  – Its mean is a linear combination of the observed values.
  – Its variance is related to its distance from the observed values (closer to the observed data, less variance); see the sketch below.
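A small numerical sketch of these two properties, using a squared exponential kernel; all function names and numbers are illustrative, not from the slides:

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(X_obs, y_obs, X_new, jitter=1e-8):
    """Noise-free GP posterior mean and variance at the points X_new."""
    K = sq_exp_kernel(X_obs, X_obs) + jitter * np.eye(len(X_obs))
    K_s = sq_exp_kernel(X_new, X_obs)
    alpha = np.linalg.solve(K, y_obs)
    mean = K_s @ alpha                       # linear combination of the observed values
    v = np.linalg.solve(K, K_s.T)
    var = sq_exp_kernel(X_new, X_new).diagonal() - np.sum(K_s * v.T, axis=1)
    return mean, var

X_obs = np.array([0.0, 1.0, 3.0])
y_obs = np.array([1.0, 2.0, 0.5])
X_new = np.array([1.1, 2.0, 5.0])            # near, between, and far from the data
mean, var = gp_predict(X_obs, y_obs, X_new)
print(mean, var)                             # variance grows with distance from the data
```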

BO Main Steps

• Surrogate Function (Response Surface, Model)
  – Makes a posterior over unobserved points based on the prior.
  – Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
  – Decides which sample should be selected next.

Bayesian Optimization: Acquisition Criterion

• Remember: we are looking for the maximizer x* = argmax_x f(x).
• Input:
  – The set of observed data.
  – A set of candidate points with their corresponding means and variances.
• Goal: choose which point should be selected next so that we get to the maximizer of the function faster.
• Different acquisition criteria (acquisition functions, or policies).

Policies

• Maximum Mean (MM)
• Maximum Upper Interval (MUI)
• Maximum Probability of Improvement (MPI)
• Maximum Expected Improvement (MEI)

Policies: Maximum Mean (MM)

• Returns the point with the highest expected value (see the formula below).
• Advantage:
  – If the model is stable and has been learnt very well, it performs very well.
• Disadvantage:
  – There is a high chance of falling into a local optimum (it only exploits).
• Can it eventually converge to the global optimum?
  – No.
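In terms of the GP posterior mean μ(x), MM selects:

$$
x_{\text{MM}} = \arg\max_{x \in \mathcal{X}} \mu(x)
$$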

Policies: Maximum Upper Interval (MUI)

• Returns the point with the highest 95% upper interval (see the formula below).
• Advantage:
  – A combination of mean and variance (exploitation and exploration).
• Disadvantage:
  – Dominated by the variance, so it mainly explores the input space.
• Can it eventually converge to the global optimum?
  – Yes, but it needs an almost infinite number of samples.
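Using the upper end of the 95% interval of the Gaussian posterior (the constant 1.96 below is the usual two-sided 95% choice, a convention rather than something fixed by the slides):

$$
x_{\text{MUI}} = \arg\max_{x \in \mathcal{X}} \left[ \mu(x) + 1.96\,\sigma(x) \right]
$$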

Policies: Maximum Probability of Improvement (MPI)

• Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m (see the formula below).
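Under a Gaussian posterior with mean μ(x) and standard deviation σ(x), this probability has the standard closed form (Φ is the standard normal CDF):

$$
\mathrm{PI}(x) = P\big(f(x) \ge y_{\max} + m\big) = \Phi\!\left( \frac{\mu(x) - y_{\max} - m}{\sigma(x)} \right), \qquad x_{\text{MPI}} = \arg\max_{x \in \mathcal{X}} \mathrm{PI}(x)
$$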

Policies: Maximum Probability of Improvement (MPI)

• Advantage:
  – Considers the mean, the variance, and ymax in the policy (smarter than MUI).
• Disadvantage:
  – The ad-hoc parameter m.
  – Large value of m?
    • Exploration.
  – Small value of m?
    • Exploitation.

Policies: Maximum Expected Improvement (MEI)

• Maximum expected improvement (see the formula below).
• Question: expectation over which variable?
  – The margin of improvement, m.
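Taking the expectation of the improvement over the Gaussian posterior gives the standard closed form (Φ and φ are the standard normal CDF and PDF):

$$
\mathrm{EI}(x) = \big(\mu(x) - y_{\max}\big)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{\mu(x) - y_{\max}}{\sigma(x)}, \qquad x_{\text{MEI}} = \arg\max_{x \in \mathcal{X}} \mathrm{EI}(x)
$$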

Policies: Upper Confidence Bounds (UCB)

• Selects based on the variance and the mean of each point (see the formula below).
  – The selection of k is left to the user.
  – Recently, a principled approach to selecting this parameter has been proposed (GP-UCB).
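With a user-chosen trade-off parameter κ (the k on the slide):

$$
x_{\text{UCB}} = \arg\max_{x \in \mathcal{X}} \left[ \mu(x) + \kappa\,\sigma(x) \right]
$$

GP-UCB replaces the fixed κ with a schedule κ_t that grows slowly with the iteration count t (roughly like sqrt(log t)), which is what gives it its regret guarantees.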

Summary

• We introduced several approaches, each of which has advantages and disadvantages:
  – MM
  – MUI
  – MPI
  – MEI
  – GP-UCB

• Which one should be selected for an unknown model?

GP-Hedge

• GP-Hedge (2010).
• It selects one of the baseline policies based on theoretical results from the multi-armed bandit problem, although the objective is a bit different!
• They show that it can perform better than (or as well as) the best baseline policy in some frameworks (a rough sketch of the selection step follows below).
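A rough sketch of the hedge-style selection step, assuming each baseline policy nominates one candidate point per round and is rewarded by the posterior mean at its nominee; the function names and the reward choice are illustrative:

```python
import numpy as np

def hedge_select(gains, eta=1.0, rng=None):
    """Pick the index of one nominated point, with probability proportional to exp(eta * gain)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.exp(eta * (gains - np.max(gains)))   # subtract the max for numerical stability
    p /= p.sum()
    return rng.choice(len(gains), p=p)

def update_gains(gains, nominees, posterior_mean):
    """Reward every baseline policy by the (updated) posterior mean at its own nominee."""
    return gains + np.array([posterior_mean(x) for x in nominees])

# One BO round with a portfolio of baseline policies (MM, MUI, MPI, MEI, UCB, ...):
#   1. each baseline policy nominates one candidate point (the `nominees`)
#   2. hedge_select picks which nominee to run as the next experiment
#   3. the surrogate GP is refit with the new observation
#   4. update_gains credits each policy with the new posterior mean at its nominee
```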

Future Works

• Method selection that is smarter than GP-Hedge, with theoretical analysis.
• Batch Bayesian optimization.
• Scheduling Bayesian optimization.
