
Practical Bayesian Optimization of Machine Learning Algorithms

Jasper Snoek, Ryan Adams, Hugo Larochelle – NIPS 2012

“... (Gaussian Processes) are inadequate for doing speech and vision. I still think they're inadequate for doing speech and vision. But when you're in a domain where you have no prior knowledge and the only thing that you can expect is that similar inputs should have similar outputs, then Gaussian Processes are ideal.”

“... Gaussian processes are a way of using Machine Learning to simulate the graduate student”

- Geoff Hinton

Motivation


Deep Neural Networks Require Skill to Set Hyperparameters

Common Strategies

- Grid Search

- Random Search
  - Sometimes better than grid search, because some hyperparameters have little or no effect on the result

Can we use Machine Learning instead?

- To predict regions of the hyperparameter space that might give better results.

- To predict how well a new combination of hyperparameters will do, and to model the uncertainty of that prediction.

Bayesian Optimization

- Frame hyperparameter search as an optimization problem

- Model the unknown function from high-level parameters (hyperparameters) to the error metric as a regression problem

- Use a GP prior (“similar inputs have similar outputs”) to build a statistical model of the function. The prior is weak, but general and effective.

- Use the resulting statistics to tell us:
  • the location of the expected minimum of the function
  • the expected improvement from trying other parameters

Bayesian Optimization (Mockus '78)

- Method for the global optimization of multi-modal, computationally expensive black box functions

- Assumes that the unknown function was sampled from a Gaussian Process (prior) and uses the observations (likelihood) to maintain a posterior

- Observations are measurements of generalization performance under different settings of the hyperparameters we wish to optimize.

- The next set of hyperparameters is selected using the maintained posterior, following a strategy determined by the acquisition function.
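A minimal sketch of this loop, as one might code it with scikit-learn (the library choice, the one-dimensional toy objective and every name below are illustrative assumptions, not the paper's implementation): fit a GP posterior to the observations so far, score a grid of candidate hyperparameters with expected improvement, and run the most promising candidate next.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive black-box evaluation (e.g. validation error).
    return (x - 0.3) ** 2 + 0.05 * np.random.randn()

def expected_improvement(mu, sigma, best):
    # Closed-form EI for minimization (see the acquisition-function slide).
    sigma = np.maximum(sigma, 1e-9)
    gamma = (best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Candidate grid over a single hyperparameter in [0, 1].
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

# A few random observations to start from.
X = np.random.rand(3, 1)
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3, normalize_y=True)

for _ in range(20):
    gp.fit(X, y)                                   # posterior given observations so far
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y.min())  # acquisition over candidates
    x_next = candidates[np.argmax(ei)]             # next experiment to run
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best hyperparameter:", X[np.argmin(y)], "best value:", y.min())

In practice the candidate grid would be replaced by a proper optimization of the acquisition function over the full hyperparameter space.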

Gaussian Processes

A Gaussian Process specifies a distribution over functions such that any finite subset of N points follows a multivariate Gaussian distribution.

The properties of the resulting distribution over functions are specified by a mean function and a positive-definite covariance function.

The predictive mean and covariance given the observations are:
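(Written here in standard notation, since the slide shows these only as an image: X are the observed inputs, \mathbf{y} the observed targets, k the covariance function, K = k(X, X) the kernel matrix, \sigma_n^2 the observation noise variance, and x_* a test point.)

\mu(x_*) = k(x_*, X)\,\left[K + \sigma_n^2 I\right]^{-1} \mathbf{y}

\sigma^2(x_*) = k(x_*, x_*) - k(x_*, X)\,\left[K + \sigma_n^2 I\right]^{-1} k(X, x_*)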

Intuition

• GPs are a prior for smooth functions

• Similar inputs (high covariance) should have similar outputs

Intuition

Exploration: Seek places with high variance
Exploitation: Seek places in the locality of places where you are already doing well

The acquisition function balances these to determine the point of next evaluation

Acquisition Functions

The acquisition function tells us which experiment to run next and how good we expect it to be

1. GP Upper Confidence Bound
   Idea: Minimize regret over the course of the optimization. Balance exploration and exploitation.

2. Expected Improvement
   Idea: How much can I expect to improve over the best I've seen so far by running an experiment with these parameters?
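For minimization, with posterior mean \mu(x) and standard deviation \sigma(x) (notation assumed here, not taken from the slides), these acquisition functions have standard closed forms:

a_{\mathrm{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)

a_{\mathrm{EI}}(x) = \sigma(x)\left[\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right], \qquad \gamma(x) = \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)}

where \Phi and \phi are the standard normal CDF and PDF, f_{\mathrm{best}} is the best value observed so far, and \kappa trades off exploration against exploitation (when minimizing, the "upper" confidence bound becomes a lower bound).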


An Eggsperiment

Parameters:

Boiling Time (1-12 min)
Cooling Time (1-12 min)
Salt (0-10 pinches)
Pepper (0-10 pinches)

Optimal 'Soft Boiled Egg'
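An illustrative sketch only, not from the slides: the four-dimensional search space above could be handed to an off-the-shelf Bayesian optimization library such as scikit-optimize; the egg_badness function and all settings below are hypothetical stand-ins for a human taster scoring each egg.

from skopt import gp_minimize
from skopt.space import Integer, Real

# Search space from the slide: boiling/cooling time in minutes, salt/pepper in pinches.
space = [
    Real(1, 12, name="boil_minutes"),
    Real(1, 12, name="cool_minutes"),
    Integer(0, 10, name="salt_pinches"),
    Integer(0, 10, name="pepper_pinches"),
]

def egg_badness(params):
    # Hypothetical scoring function: in the real experiment this would be a
    # human taster rating the cooked egg (lower is better).
    boil, cool, salt, pepper = params
    return abs(boil - 6) + abs(cool - 3) + 0.1 * abs(salt - 2) + 0.1 * abs(pepper - 1)

result = gp_minimize(egg_badness, space, n_calls=25, acq_func="EI", random_state=0)
print("best settings:", result.x, "badness:", result.fun)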

After 5 Iterations....

After 10 Iterations....

After 12 Iterations....

After 14 Iterations....

After 16 Iterations....

After 20 Iterations....

After 25 Iterations....

Practical Bayesian Optimization

• Integrate out all parameters of the Gaussian Process in Bayesian optimization
• Choose an appropriate covariance function
• The choice of acquisition function is important
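The covariance the paper settles on is the Matérn 5/2 kernel with automatic relevance determination (ARD) lengthscales \ell_d; writing r^2(x, x') = \sum_d (x_d - x'_d)^2 / \ell_d^2:

k_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5\,r^2(x, x')} + \tfrac{5}{3}\,r^2(x, x')\right) \exp\left(-\sqrt{5\,r^2(x, x')}\right)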

Accounting for additional cost – Expected Improvement per Second

Incorporate a preference towards choosing points that are not only good, but likely to be evaluated quickly
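Concretely, the paper models the (log) duration of each evaluation with a second, independent GP and then maximizes expected improvement per second; with c(x) the predicted duration at x, this amounts roughly to

a_{\mathrm{EI/s}}(x) = \frac{a_{\mathrm{EI}}(x)}{c(x)}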

Parallelizing Bayesian Optimization

'N' completed evaluations
'J' pending evaluations

[Figure: posterior samples after 3 observations; expected improvement under individual samples; integrated expected improvement]
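The integrated expected improvement shown in the figure is obtained by fantasizing outcomes for the J pending evaluations from the current posterior and averaging the resulting expected improvements, i.e. a Monte Carlo estimate of

\hat{a}(x) = \int a_{\mathrm{EI}}\left(x \mid \{x_n, y_n\}_{n=1}^{N}, \{x_j, y_j\}_{j=1}^{J}\right) \, p\left(\{y_j\}_{j=1}^{J} \mid \{x_j\}_{j=1}^{J}, \{x_n, y_n\}_{n=1}^{N}\right) \, dy_1 \cdots dy_J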

Implications

Impossible to find by hand!!

CIFAR-10, 9 Hyperparameters

Benefits

- For each input dimension, an appropriate scale for measuring similarity is learned. (Are 200 and 300 as similar as 2.0 and 3.0?)

- What is the sensitivity to each dimension? Which dimensions don't matter?

- Reproducible research: it levels the playing field, and it's a lot more honest than human beings.

- If you have the resources to run a fairly large number of experiments, Bayesian optimization is better than a person at finding good combinations of hyperparameters.

References:

[Paper] Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

[Talk/Slides] Jasper Snoek: "Bayesian Optimization for Machine Learning and Science". https://www.youtube.com/watch?v=a79klpzaPgY

[Book] Machine Learning: A Probabilistic Perspective. Kevin Murphy. http://www.cs.ubc.ca/~murphyk/MLbook/index.html
