
Practical Bayesian Optimization of Machine Learning Algorithms

Jasper Snoek, Ryan Adams, Hugo Larochelle – NIPS 2012

“... (Gaussian Processes) are inadequate for doing speech and vision. I still think they're inadequate for doing speech and vision. But when you're in a domain where you have no prior knowledge and the only thing that you can expect is that similar inputs should have similar outputs, then Gaussian Processes are ideal.”

“... Gaussian processes are a way of using Machine Learning to simulate the graduate student”

- Geoff Hinton

Motivation


Deep Neural Networks Require Skill to Set Hyperparameters

Common Strategies

- Grid Search

- Random Search
  - Sometimes better than grid search, because some hyperparameters have little or no effect on the result

Can we use Machine Learning instead?

- To predict regions of the hyperparameter space that might give better results.

- To predict how well a new combination of hyperparameters will do, and to model the uncertainty of that prediction.

Bayesian Optimization

- Frame hyperparameter search as an optimization problem

- Model the unknown function from high-level parameters (hyperparameters) to the error metric as a regression problem

- Use a GP prior (“similar inputs have similar outputs”) to build a statistical model of the function. The prior is weak, but general and effective.

- Use the resulting statistics to tell us:
  • the location of the expected minimum of the function
  • the expected improvement from trying other parameters

Bayesian Optimization (Mockus '78)

- Method for the global optimization of multi-modal, computationally expensive black box functions

- Assumes that the unknown function was sampled from a Gaussian Process (prior) and uses the observations (likelihood) to maintain a posterior

- Observations are measurements of generalization performance under different settings of the hyperparameters we wish to optimize.

- The next set of hyperparameters is selected using the maintained posterior, following a strategy determined by the acquisition function.
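A minimal sketch of this loop, as one might code it with scikit-learn (the library choice, the one-dimensional toy objective and every name below are illustrative assumptions, not the paper's implementation): fit a GP posterior to the observations so far, score a grid of candidate hyperparameters with expected improvement, and run the most promising candidate next.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive black-box evaluation (e.g. validation error).
    return (x - 0.3) ** 2 + 0.05 * np.random.randn()

def expected_improvement(mu, sigma, best):
    # Closed-form EI for minimization (see the acquisition-function slide).
    sigma = np.maximum(sigma, 1e-9)
    gamma = (best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Candidate grid over a single hyperparameter in [0, 1].
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

# A few random observations to start from.
X = np.random.rand(3, 1)
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3, normalize_y=True)

for _ in range(20):
    gp.fit(X, y)                                   # posterior given observations so far
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y.min())  # acquisition over candidates
    x_next = candidates[np.argmax(ei)]             # next experiment to run
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best hyperparameter:", X[np.argmin(y)], "best value:", y.min())

In practice the candidate grid would be replaced by a proper optimization of the acquisition function over the full hyperparameter space.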

Gaussian Processes

A Gaussian Process specifies a distribution over functions such that any finite subset of N points follows a multivariate Gaussian distribution.

The properties of the resulting distribution over functions are specified by a mean function and a positive-definite covariance function.

The predictive mean and covariance given the observations are:
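(Written here in standard notation, since the slide shows these only as an image: X are the observed inputs, \mathbf{y} the observed targets, k the covariance function, K = k(X, X) the kernel matrix, \sigma_n^2 the observation noise variance, and x_* a test point.)

\mu(x_*) = k(x_*, X)\,\left[K + \sigma_n^2 I\right]^{-1} \mathbf{y}

\sigma^2(x_*) = k(x_*, x_*) - k(x_*, X)\,\left[K + \sigma_n^2 I\right]^{-1} k(X, x_*)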

Intuition

• GPs are a prior for smooth functions

• Similar inputs (high covariance) should have similar outputs

Intuition

Exploration: Seek places with high variance
Exploitation: Seek places in the locality of places where you are already doing well

The acquisition function balances these to determine the point of next evaluation

Acquisition Functions

The acquisition function tells us which experiment to run next and how good we expect it to be

1. GP Upper Confidence Bound
   Idea: Minimize regret over the course of the optimization. Balance exploration and exploitation.

2. Expected Improvement
   Idea: How much can I expect to improve over the best I've seen so far by running an experiment with these parameters?
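For minimization, with posterior mean \mu(x) and standard deviation \sigma(x) (notation assumed here, not taken from the slides), these acquisition functions have standard closed forms:

a_{\mathrm{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)

a_{\mathrm{EI}}(x) = \sigma(x)\left[\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right], \qquad \gamma(x) = \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)}

where \Phi and \phi are the standard normal CDF and PDF, f_{\mathrm{best}} is the best value observed so far, and \kappa trades off exploration against exploitation (when minimizing, the "upper" confidence bound becomes a lower bound).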


An Eggsperiment

Parameters:

Boiling Time (1-12 min)
Cooling Time (1-12 min)
Salt (0-10 pinches)
Pepper (0-10 pinches)

Optimal 'Soft Boiled Egg'
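An illustrative sketch only, not from the slides: the four-dimensional search space above could be handed to an off-the-shelf Bayesian optimization library such as scikit-optimize; the egg_badness function and all settings below are hypothetical stand-ins for a human taster scoring each egg.

from skopt import gp_minimize
from skopt.space import Integer, Real

# Search space from the slide: boiling/cooling time in minutes, salt/pepper in pinches.
space = [
    Real(1, 12, name="boil_minutes"),
    Real(1, 12, name="cool_minutes"),
    Integer(0, 10, name="salt_pinches"),
    Integer(0, 10, name="pepper_pinches"),
]

def egg_badness(params):
    # Hypothetical scoring function: in the real experiment this would be a
    # human taster rating the cooked egg (lower is better).
    boil, cool, salt, pepper = params
    return abs(boil - 6) + abs(cool - 3) + 0.1 * abs(salt - 2) + 0.1 * abs(pepper - 1)

result = gp_minimize(egg_badness, space, n_calls=25, acq_func="EI", random_state=0)
print("best settings:", result.x, "badness:", result.fun)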

After 5 Iterations....

After 10 Iterations....

After 12 Iterations....

After 14 Iterations....

After 16 Iterations....

After 20 Iterations....

After 25 Iterations....

Practical Bayesian Optimization

• Integrate out all parameters of the Gaussian Process in Bayesian optimization
• Choose an appropriate covariance function
• The choice of acquisition function is important
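The covariance the paper settles on is the Matérn 5/2 kernel with automatic relevance determination (ARD) lengthscales \ell_d; writing r^2(x, x') = \sum_d (x_d - x'_d)^2 / \ell_d^2:

k_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5\,r^2(x, x')} + \tfrac{5}{3}\,r^2(x, x')\right) \exp\left(-\sqrt{5\,r^2(x, x')}\right)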

Accounting for additional cost – Expected Improvement per Second

Incorporate a preference towards choosing points that are not only good, but likely to be evaluated quickly
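Concretely, the paper models the (log) duration of each evaluation with a second, independent GP and then maximizes expected improvement per second; with c(x) the predicted duration at x, this amounts roughly to

a_{\mathrm{EI/s}}(x) = \frac{a_{\mathrm{EI}}(x)}{c(x)}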

Parallelizing Bayesian Optimization

'N' completed evaluations
'J' pending evaluations

[Figure: posterior samples after 3 observations; expected improvement under individual samples; integrated expected improvement]
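The integrated expected improvement shown in the figure is obtained by fantasizing outcomes for the J pending evaluations from the current posterior and averaging the resulting expected improvements, i.e. a Monte Carlo estimate of

\hat{a}(x) = \int a_{\mathrm{EI}}\left(x \mid \{x_n, y_n\}_{n=1}^{N}, \{x_j, y_j\}_{j=1}^{J}\right) \, p\left(\{y_j\}_{j=1}^{J} \mid \{x_j\}_{j=1}^{J}, \{x_n, y_n\}_{n=1}^{N}\right) \, dy_1 \cdots dy_J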

Implications

Impossible to find by hand!!

CIFAR-10, 9 Hyperparameters

Benefits

- For each input dimension, an appropriate scale for measuring similarity is learned. (Are 200 and 300 as similar as 2.0 and 3.0?)

- What is the sensitivity to each dimension? Which dimensions don't matter?

- Reproducible research: it levels the playing field, and it's a lot more honest than human beings.

- If you have the resources to run a fairly large number of experiments, Bayesian optimization is better than a person at finding good combinations of hyperparameters.

References:

[Paper] Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

[Talk/Slides] Jasper Snoek: "Bayesian Optimization for Machine Learning and Science". https://www.youtube.com/watch?v=a79klpzaPgY

[Book] Machine Learning: A Probabilistic Perspective. Kevin Murphy. http://www.cs.ubc.ca/~murphyk/MLbook/index.html
