
Bayesian Optimization and Semiparametric Models

with Applications to Assistive Technology

by

Jasper Snoek

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

Copyright by Jasper Snoek (2013)


Abstract

Bayesian Optimization and Semiparametric Models
with Applications to Assistive Technology

Jasper Snoek
Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

2013

Advances in machine learning are having a profound impact on disciplines spanning the sciences. Assistive technology and health informatics are fields in which even minor improvements achieved by leveraging more advanced machine learning algorithms can translate to major real-world impact. However, successful application of machine learning currently requires broad domain knowledge to determine which model is appropriate for a given task, and model-specific expertise to configure a model to a problem of interest. A major motivation for this thesis was: how can we make machine learning more accessible to assistive technology and health informatics researchers? Naturally, a complementary goal is to make machine learning more accessible in general. Specifically, in this thesis we explore how to automate the role of a machine learning expert by automatically adapting models and adjusting parameters to a given task of interest. This thesis consists of a number of contributions towards solving this challenging open problem in machine learning, and these are empirically validated on four real-world applications.

Through an interesting theoretical link between two seemingly disparate latent variable models, we create a hybrid model that allows one to flexibly interpolate over a parametric unsupervised neural network, a classification neural network and a non-parametric Gaussian process. We demonstrate empirically that this nonparametrically guided autoencoder allows one to learn a latent representation that is more useful for a given task of interest.


We establish methods for automatically configuring machine learning model hyperparameters using Bayesian optimization. We develop Bayesian methods for integrating over parameters, explore the use of different priors over functions, and develop methods to run experiments in parallel. We demonstrate empirically that these methods find better hyperparameters on recent benchmark problems spanning machine learning in significantly fewer experiments than the methods employed by the problems' authors. We further establish methods for incorporating parameter-dependent variable cost into the optimization procedure. These methods find better hyperparameters at lower cost, such as time, or within a bounded cost, such as before a deadline. Additionally, we develop a constrained Bayesian optimization variant and demonstrate its superiority over the standard procedure in the presence of unknown constraints.


Acknowledgements

This dissertation is thus far my greatest personal academic achievement. However, it is difficult to consider it as such without acknowledging the many people without whom it would not have been possible.

I have had the rare fortune of having a supervisor who considered the purpose of the Ph.D. to be solely for his student's personal academic development. I thank Alex Mihailidis for wholeheartedly supporting me both financially and academically as I forged my own path, which at times diverged significantly from his objectives and those of his group. In many ways, Alex was the ideal supervisor for me, giving me freedom to pursue my own interests while guiding and pressuring me just enough to fulfill all the requirements of my degree in a reasonable amount of time. I am greatly indebted to him for his patience and guidance as I pursued my interest in machine learning, at times with at best a vague notion of where it would lead and how I could personally contribute. Alex encouraged me to study Gaussian processes for my qualifying exam simply because I was fascinated by them, without promise of their direct applicability to his field of assistive technology.

It is difficult to overstate the influence of Ryan Adams and Hugo Larochelle on the work in this thesis and my development as a researcher. I was extremely fortunate to have two people of such brilliance get excited about my ideas. It was through their guidance and collaboration that I was able to do work at a level of quality and rigor that I otherwise would never have achieved. They taught me how to develop a model, validate it empirically and describe it in high-quality writing. They taught me how to be a researcher in machine learning.

I am very thankful to the machine learning group at the University of Toronto. The faculty, Richard Zemel, Geoffrey Hinton and Sam Roweis, inspired me and taught me machine learning. I am fortunate to have been able to participate in group meetings, seminars and tea talks with such a strong group of brilliant, world-class researchers and students. Some of the fundamental ideas introduced in this thesis were developed through discussion in Rich Zemel's group meetings and others through personal discussion with Rich and Geoff. Many of the ideas were developed through personal discussion with fellow students in this group, some of whom are now among my closest personal friends.

I also thank all of the members of the Intelligent Assistive Technology and Systems Lab, who without exception are wonderful people and helped me develop as a researcher. I thank Jennifer Boger for her unwavering support and friendship and Babak Taati for being both a great friend and collaborator throughout my studies. Yani Ioannou, Babak Taati and Stephen Czarnuch were instrumental in developing the applications in Chapter 7.

A great friend once told me that it is the people around you that make any experience or achievement worthwhile. I was lucky to share this experience with Danny Tarlow, Laurent Charlin and Kevin Regan, who kept me sane both inside and outside of university. I am also grateful for sharing this experience and being able to discuss ideas with Ilya Sutskever, Kevin Swersky, George Dahl, Charlie Tang and Fernando Flores-Mangas.

I would like to thank my committee, Allan Jepson, Richard Zemel, David Fleet, Ruslan Salakhutdinov, Anna Goldenberg and Nando de Freitas, for their guidance and thoughtful comments that have improved my research and this document.

Finally, most of all I am grateful for my parents and brothers. Their support, encouragement and friendship have made me who I am and brought meaning to my endeavors.


Contents

Abstract

Acknowledgements

Table of Contents

List of Figures

List of Tables

List of Equations

Mathematical Conventions

1 Introduction
  1.1 Overview
  1.2 Motivation
  1.3 Relationship to Published Work
  1.4 Code

2 Background
  2.1 Gaussian Processes
  2.2 Gaussian Process Latent Variable Models
    2.2.1 Specifying a prior for the latent variable space
    2.2.2 Mapping from the data to the latent space using back constraints
  2.3 Autoencoders
  2.4 Bayesian Optimization

3 Nonparametrically Guided Autoencoder
  3.1 Introduction
  3.2 Unsupervised Learning of Latent Representations
    3.2.1 Autoencoder Neural Networks
    3.2.2 Gaussian Process Latent Variable Models
    3.2.3 Gaussian Process Priors
  3.3 Covariance Functions
    3.3.1 The Back-Constrained GPLVM
    3.3.2 GPLVM as an Infinite Autoencoder
  3.4 Supervised Guidance of Latent Representations
    3.4.1 Nonparametrically Guided Autoencoder
    3.4.2 Related Models
  3.5 Empirical Analyses
    3.5.1 Oil Flow Data
    3.5.2 CIFAR 10 Image Data
    3.5.3 Small NORB Image Data
  3.6 Conclusion

4 Practical Bayesian Optimization of Machine Learning Algorithms
  4.1 Introduction
  4.2 Bayesian Optimization with Gaussian Process Priors
    4.2.1 Gaussian Processes
    4.2.2 Acquisition Functions for Bayesian Optimization
  4.3 Practical Considerations for Bayesian Optimization of Hyperparameters
    4.3.1 Covariance Functions and Treatment of Covariance Hyperparameters
    4.3.2 Modeling Costs
    4.3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization
    4.3.4 Optimizing the Acquisition Function
    4.3.5 Hyperparameter Priors
  4.4 Empirical Analyses
    4.4.1 Branin-Hoo and Logistic Regression
    4.4.2 Online LDA
    4.4.3 Motif Finding with Structured Support Vector Machines
    4.4.4 Convolutional Networks on CIFAR-10
  4.5 Conclusion

5 Opportunity Cost in Bayesian Optimization
  5.1 Introduction
  5.2 Expected Improvement
  5.3 Expected Improvement with a Deadline
    5.3.1 Modeling Time
    5.3.2 Expected Improvement per Second
    5.3.3 Monte Carlo Multi-Step Myopic EI (MCMS)
  5.4 Experiments
    5.4.1 Simple Branin-Hoo Example
    5.4.2 Multiple Kernel Learning with Support Vector Machines
    5.4.3 Training a Neural Network
  5.5 Conclusion

6 Bayesian Optimization under Unknown Constraints
  6.1 Introduction
  6.2 A Constraint Weighted Acquisition Function
    6.2.1 Optimizing the Acquisition Function
    6.2.2 Obtaining Labels
  6.3 Related Work
  6.4 Empirical Analysis
    6.4.1 Constrained Branin-Hoo
    6.4.2 Deep Neural Networks
  6.5 Conclusion

7 Applications to Assistive Technology and Health Informatics
  7.1 Rehabilitation
    7.1.1 Introduction
    7.1.2 Empirical Analysis
    7.1.3 Conclusion
  7.2 Mortality in Bone Marrow Transplant Patients
    7.2.1 Introduction
    7.2.2 Data
    7.2.3 Empirical Analysis
    7.2.4 Conclusion
  7.3 Fall Detection
    7.3.1 Introduction
    7.3.2 Empirical Analysis
    7.3.3 Conclusion
  7.4 Prompting Alzheimer's Patients
    7.4.1 Introduction
    7.4.2 Empirical Analysis
    7.4.3 Conclusion

8 Discussion
  8.1 Limitations of the Nonparametrically Guided Autoencoder and Future Directions
    8.1.1 Non-Gaussian Outputs
    8.1.2 Computational Complexity and Mini-Batch Learning
  8.2 Limitations of Bayesian Optimization and Interesting Future Directions
    8.2.1 Complexity and Large Scale Learning
    8.2.2 High Dimensional Bayesian Optimization
    8.2.3 Priors over Functions

9 Conclusion

Bibliography


List of Figures

3.1 An empirical analysis of the nonparametrically guided autoencoder on oil data.
3.2 A comparison of the latent representations learned by the GPLVM and NPGA.
3.3 A sample of filters learned by the NPGA on the CIFAR 10 data set.
3.4 Visualisations of the NORB training and test data latent space representations in the NPGA.

4.1 Illustration of integrated expected improvement.
4.2 Illustration of the acquisition with pending evaluations.
4.3 Comparison of various Bayesian optimization strategies on the Branin-Hoo function (4.3a) and training logistic regression on MNIST (4.3b).
4.4 Different strategies of optimization on the Online LDA problem compared in terms of function evaluations (4.4a), walltime (4.4b) and constrained to a grid or not (4.4c).
4.5 A comparison of various strategies for optimizing the hyperparameters of M3E models on the protein motif finding task in terms of walltime (4.5a), function evaluations (4.5b) and different covariance functions (4.5c).
4.6 Validation error on the CIFAR-10 data for different optimization strategies.

5.1 Performance of cost-based strategies and standard EI on the Branin-Hoo function.
5.2 Comparison of different Bayesian optimization strategies in terms of walltime on an SVM training problem.
5.3 A comparison of the minimum error achieved before the deadline by the various algorithms on the neural network problem.

6.1 A comparison of GP EI MCMC and Constrained GP EI MCMC on the constrained Branin-Hoo function.
6.2 A comparison of GP EI MCMC and Constrained GP EI MCMC on the constrained deep neural network example.

7.1 The rehabilitation robot setup and sample data captured by the sensor.
7.2 The posterior mean learned by Bayesian optimization over the validation set classification error (in percent) for α and β with H fixed at 2 and three different settings of autoencoder hidden units: (a) 10, (b) 500, and (c) 1000. This shows how the relationship between validation error and the amount of nonparametric guidance, α, and parametric guidance, β, is expected to change as the number of autoencoder hidden units is increased. The red x's indicate points that were explored by the Bayesian optimization routine.
7.3 The fall detection unit.
7.4 Progression over time of the loss being minimized by the constrained Bayesian optimization on the fall detection problem.
7.5 An image of the COACH system experimental set-up.
7.6 A qualitative demonstration of the hand tracking results.

8.1 Running the treed Bayesian optimization on the Branin-Hoo function. The translucent rectangles are different tree splits. Each blue dot is an observation that the Bayesian optimization routine suggested.
8.2 Effect of different smoothness priors.
8.3 Results of running a deep neural network trained on a subset of the MNIST digits data according to the "dropout" strategy of Hinton et al. [2012] one hundred times, plotting the mean error (Figure 8.3a) and variance in error (Figure 8.3b) per learning epoch (iteration).


List of Tables

3.1 Results on CIFAR 10 for various training strategies, varying the nonparametric guidance α. Recently published convolutional results are shown for comparison.
3.2 Experimental results on the small NORB data test set. Relevant published results are shown for comparison.

5.1 Minimum cross-validation error values found by the optimization routine for various Bayesian optimization algorithms performing hyperparameter optimization on multiple kernel SVMs.

7.1 Experimental results on the rehabilitation data.
7.2 Classification error on the leukemia transplant patient classification task.
7.3 Body joint classification accuracy on the COACH task.


List of Equations

2.1 Gaussian process product over observations
2.2 Gaussian process posterior over parameters
2.3 Gaussian process marginal likelihood
2.4 Gaussian process test predictions
2.5 The exponentiated quadratic covariance
2.6 Matérn 5/2 covariance
2.7 Summed covariance functions
3.1 Autoencoder squared reconstruction loss
3.2 NPGA neural network covariance
3.2 Neural network covariance function
3.3 GPLVM objective with back constraints
3.4 GPLVM covariance with back constraints
3.5 Blended NPGA loss function
4.1 The probability of improvement (PI) acquisition function
4.2 The expected improvement (EI) acquisition function
4.3 The GP upper confidence bound (UCB) acquisition function
4.4 The Bayesian optimization exponentiated quadratic covariance
4.5 The Matérn covariance function
4.6 The integrated acquisition function for Bayesian optimization
4.7 The expected acquisition function for parallelized Bayesian optimization
5.1 The expected improvement per second acquisition function
6.1 The constraint weighted acquisition function
6.2 A latent Gaussian process for probit regression


Mathematical Conventions

In this thesis, bold lowercase letters, such as x, will denote vectors; lowercase letters, such as x, will denote scalars; and uppercase bold letters, such as X, will denote matrices. Subscripts will be used to index into matrices and vectors; for example, Xi,j will index the element at the i-th row and j-th column of the matrix X, and xj will index the j-th element of the vector x. I will use N(µ, Σ) to denote the probability density function of the Gaussian or normal distribution with mean µ and covariance matrix Σ. Standard notation will be followed for functions; for example, f(x) will denote that f is a function operating on x. I will use I to denote the identity matrix.
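As a quick illustration of these conventions, a zero-mean Gaussian process prior over a vector of function values f evaluated at inputs x1, …, xN could be written as below. This is an assumed example constructed for illustration, not an equation reproduced from the thesis.

```latex
% Illustrative use of the notational conventions above:
% a bold lowercase vector of function values, a bold uppercase
% covariance matrix, and subscript indexing into that matrix.
\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}),
\qquad K_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)
```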


Chapter 1

Introduction

Advances in machine learning are having a profound impact on disciplines spanning the sciences. Assistive technology and health informatics in general are fields for which achieving even minor improvements through leveraging more advanced machine learning algorithms can translate to major real-world impact. However, machine learning algorithms are frequently challenging for non-experts to use. Successful application of machine learning currently requires broad domain knowledge to determine which model is appropriate for a given task, and model-specific expertise to configure a given model to a problem of interest. A major motivation for this thesis was: how can we make machine learning more accessible to assistive technology and health informatics researchers? Naturally, a complementary goal is to make machine learning more accessible in general. Specifically, in this thesis we attempt to automate the role of a machine learning expert by automatically adapting models and adjusting parameters to a given task of interest. This thesis consists of a number of contributions towards solving this challenging open problem in machine learning.

1.1 Overview

Latent variable models provide an automated manner of extracting salient structure from empirical data. The resulting latent representation is often found to produce features that facilitate discriminative tasks. We consider the task of creating an effective model for discrimination in the context of exploring latent variable models of varying complexity followed by a simple linear discriminative model. The rationale is to replace a complicated search over discrete combinations of feature extraction, feature selection and classification models with a simpler search over a continuous space of latent variable models. In Chapter 3, we establish an interesting theoretical connection between two types of seemingly disparate latent variable models. A hybridization of these yields a type of semiparametric latent variable model for extracting features that are more useful for a given discriminative task. Certain instantiations of the model, which we call the nonparametrically guided autoencoder, allow one to flexibly interpolate between a fully supervised neural network classifier, an unsupervised autoencoder and a non-parametric distribution over functional mappings by adjusting a small number of model hyperparameters. In empirical analysis we show that, with an appropriate setting of the hyperparameters, this model outperforms several common models on challenging machine learning benchmark problems.
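The interpolation idea can be sketched with a toy objective. The sketch below is a hypothetical illustration of how a small number of blending hyperparameters move a model between regimes; the names `alpha` and `beta` and the exact convex weighting are assumptions for illustration, not the thesis's actual formulation (which is developed in Chapter 3).

```python
def blended_objective(recon_loss, guidance_loss, supervised_loss, alpha, beta):
    """Hypothetical blended training objective.

    alpha weights a nonparametric (Gaussian process) guidance term,
    beta weights a parametric supervised term, and the remaining mass
    goes to plain unsupervised reconstruction. Setting alpha = beta = 0
    recovers a pure autoencoder, while beta = 1 recovers a fully
    supervised network.
    """
    assert alpha >= 0.0 and beta >= 0.0 and alpha + beta <= 1.0
    return ((1.0 - alpha - beta) * recon_loss
            + alpha * guidance_loss
            + beta * supervised_loss)
```

For example, `blended_objective(r, g, s, 0.0, 0.0)` reduces to the reconstruction loss alone, while `blended_objective(r, g, s, 0.0, 1.0)` reduces to the supervised loss alone, so a continuous sweep over (alpha, beta) traces out the family of models in between.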

In Chapter 4, we formulate procedures for using Bayesian optimization as a statistically rigorous methodology for setting the hyperparameters of machine learning models. We present methods for performing Bayesian optimization for hyperparameter selection of general machine learning algorithms. A fully Bayesian treatment of the expected improvement of running an experiment is introduced, as well as algorithms for dealing with variable time regimes and running experiments in parallel. The resulting Bayesian optimization is empirically shown to find better hyperparameters significantly faster than top experts in machine learning on competitive benchmark tasks.
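The expected improvement criterion at the heart of these procedures has a closed form when the surrogate's predictive distribution at a candidate point is Gaussian. The sketch below is the generic textbook version for a minimization problem, not the fully Bayesian integrated variant developed in Chapter 4.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form expected improvement for minimization, given a Gaussian
    predictive distribution N(mu, sigma^2) at a candidate point and the
    best (lowest) objective value observed so far."""
    if sigma <= 0.0:
        return 0.0  # no predictive uncertainty means no expected improvement
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return sigma * (z * cdf + pdf)
```

Bayesian optimization then proposes, at each iteration, the candidate that maximizes this quantity under the Gaussian process posterior.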

In Chapter 5, two algorithms are introduced for incorporating the notion of parameter-dependent variable cost in Bayesian optimization. In the common scenario where the cost of performing a function evaluation depends on the parameters being optimized over, these algorithms are significantly more efficient than standard Bayesian optimization. We demonstrate empirically that in the scenario of a fixed cost budget, for example a deadline, our algorithms find significantly better results than the standard greedy myopic Bayesian optimization strategy.
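One simple way to realize cost sensitivity, shown here purely as an illustrative sketch (the thesis's actual criteria are expected improvement per second and a Monte Carlo multi-step variant), is to rank candidates by acquisition value per predicted second and discard candidates that cannot finish before the deadline:

```python
def pick_next(candidates, remaining_seconds):
    """Select the next experiment by expected improvement per second.

    `candidates` maps a candidate name to an (ei, predicted_seconds) pair,
    where ei is its expected improvement and predicted_seconds comes from
    some model of evaluation cost. Candidates whose predicted duration
    exceeds the remaining budget are skipped entirely.
    """
    best_name, best_rate = None, float("-inf")
    for name, (ei, seconds) in candidates.items():
        if seconds > remaining_seconds:
            continue  # would not finish before the deadline
        rate = ei / seconds
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name
```

For instance, with `{"small_net": (0.10, 60.0), "big_net": (0.30, 3600.0)}` and two hours remaining, the cheap experiment wins despite its lower raw expected improvement, since its improvement per second is far higher.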

In Chapter 6, we consider the common scenario where there are unknown and complex constraints on the parameters being optimized over using Bayesian optimization. We introduce a constrained acquisition function that allows Bayesian optimization to estimate the probability that a candidate experiment is a constraint violation. This constrained Bayesian optimization significantly outperforms the other Bayesian optimization strategies in the presence of unknown parameter-dependent constraints.

Finally, in Chapter 7, we demonstrate empirically the impact of the algorithms and methods introduced in this thesis on well-motivated real-world applications in assistive technology and health informatics. In Section 7.1 we apply the nonparametrically guided autoencoder and Bayesian optimization towards the automation of rehabilitation therapy. Section 7.2 demonstrates the use of Bayesian optimization to set hyperparameters in a leukemia transplantation survival classification problem. In Section 7.3 the parallel constrained Bayesian optimization algorithm is used to tune numerous parameters of a complex computer-vision-based fall detection system, and in Section 7.4 the Bayesian optimization methods introduced in this thesis are used to significantly improve the performance of a hand-tracking algorithm critical to an Alzheimer's patient prompting system.

1.2 Motivation

As better healthcare worldwide is improving longevity and the baby boomer generation is

aging, the proportion of elderly adults within the population is rapidly growing. The proportion of the world's population over the age of 60 is expected to grow from one out of every nine people to one out of every five over the next forty years [United Nations, 2012]. Over the same period, the ratio of working-age adults to retirement-age adults in developed countries is predicted to fall from approximately eight-to-one to two-to-one [United Nations, 2008].

Healthcare systems and governments are seeking new ways to alleviate the burden on society

of caring for this aging population. Artificial intelligence has been shown to be a promising

solution, as many of the simpler tasks that burden caregivers can be automated. Such tasks

include monitoring older adults and issuing reminders, guiding them through activities of

daily living, and customizing rehabilitation exercises to the abilities of the subject. The

automation of these tasks through artificial intelligence also suggests solutions for promoting

independence and aging in place, because it alleviates the need for the constant presence of

a caregiver in the home. The benefits of the application of machine learning to problems

in assistive technology are becoming ever more clear. However, applying machine learning in this domain remains challenging. In particular, it is often

unclear what machine learning model or approach is most appropriate for a given task. A

common paradigm is to apply multiple standard machine learning tools in a black box manner

and compare the results. This proceeds according to the following steps:

1. Collect data representative of the problem of interest.

2. Extract a set of features from these data.


3. Apply a selection of standard discriminative machine learning algorithms and compare

results.
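To make the three steps concrete, here is a minimal sketch of this black-box paradigm on synthetic data, with two simple stand-in classifiers (nearest centroid and one-nearest-neighbour); the data, features, and classifier choices are hypothetical and for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 -- collect data: two Gaussian blobs stand in for a real dataset.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Step 2 -- extract features: here, simply standardize each dimension.
feats = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3 -- apply a selection of classifiers and compare held-out accuracy.
def nearest_centroid(Xtr, ytr, Xte):
    cents = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    return ((Xte[:, None, :] - cents[None]) ** 2).sum(-1).argmin(axis=1)

def one_nearest_neighbour(Xtr, ytr, Xte):
    return ytr[((Xte[:, None, :] - Xtr[None]) ** 2).sum(-1).argmin(axis=1)]

idx = rng.permutation(len(y))
train, test = idx[:70], idx[70:]
for name, clf in [("nearest centroid", nearest_centroid),
                  ("1-NN", one_nearest_neighbour)]:
    acc = (clf(feats[train], y[train], feats[test]) == y[test]).mean()
    print(f"{name}: {acc:.2f}")
```

In practice each model in step 3 also has hyperparameters to set, which is precisely the difficulty this thesis addresses.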

This strategy is unsatisfying for a number of reasons. The performance of each machine

learning algorithm, for example, is dependent on the method used for feature extraction.

Consider a classification task where the structure of interest within the data lies on some

nonlinear latent manifold. This structure can be captured by a nonlinear feature extraction

followed by a linear classifier or conversely linear feature extraction followed by a nonlinear

classifier. Although this is a simple example, it illustrates that underlying complexities can make the comparison of various approaches challenging or less meaningful.

In general, each machine learning algorithm also requires the setting of non-trivial hyperparameters. Often these parameters govern the complexity of the model or the amount of

regularisation and require expert domain knowledge and time-consuming cross-validation

procedures to select. Some examples of these hyperparameters include the number of hidden

units in a neural network, the regularisation term in support vector machines and the number

of dimensions in principal components analysis. The combinations of feature extraction methods, discriminative machine learning models and the corresponding hyperparameters of each form a vast space to explore in search of the best result and strategy.

Researchers in assistive technology in general do not possess the advanced machine learning

domain knowledge necessary to intuitively explore the vast space of machine learning models

and parameterizations. However, assistive technology is a domain that requires high accuracy, and one where relatively small improvements in performance can translate to significant real world

impact. Consider, for example, the difference between 95% and 99.5% accuracy for a classifier

that detects falls in an older adult’s home from sensor data. Such a discrepancy in classification

accuracy can translate into lives saved, or into a reduction in false positive classifications large enough to make the classifier useful rather than irritating. Such improvements can be garnered

through more appropriate combinations of feature extraction, discriminative learning and

better hyperparameters.


1.3 Relationship to Published Work

Significant portions of this thesis overlap with published and peer-reviewed work. I will

outline here the relevant peer-reviewed publications for each chapter.

Chapter 3

J. Snoek, R. P. Adams, and H. Larochelle. On nonparametric guidance for learning

autoencoder representations. In International Conference on Artificial Intelligence and

Statistics, 2012a. Contribution: Primary author, original ideas, major mathematical

derivations, all empirical evaluation.

J. Snoek, R. P. Adams, and H. Larochelle. Nonparametric guidance of autoencoder

representations using label information. Journal of Machine Learning Research, 13:2567–

2588, 2012b. Contribution: Primary author, original ideas, major mathematical

derivations, all empirical evaluation.

J. Snoek, R. Adams, and H. Larochelle. Semiparametric latent variable models for

guided representation. In The Learning Workshop (Snowbird), 2011a. Contribution:

Primary author, original ideas, major mathematical derivations, all empirical evaluation.

Chapter 4

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine

learning algorithms. In Advances in Neural Information Processing Systems, 2012c.

Contribution: Primary author, original ideas, major mathematical derivations, all

empirical evaluation.

Chapter 5

J. Snoek, H. Larochelle, and R. P. Adams. Opportunity cost in Bayesian optimization.

In Neural Information Processing Systems Workshop on Bayesian Optimization, 2011b.

Contribution: Primary author, original ideas, major mathematical derivations, all

empirical evaluation.


Chapter 7

Section 7.1

B. Taati, R. Wang, R. Huq, J. Snoek, and A. Mihailidis. Vision-based posture

assessment to detect and categorize compensation during robotic rehabilitation therapy. In International Conference on Biomedical Robotics and Biomechatronics, 2012c.

Contribution: Machine learning component, some empirical evaluation.

J. Snoek, B. Taati, and A. Mihailidis. An automated machine learning approach

applied to robotic stroke rehabilitation. AAAI Symposium on Gerontechnology, 2012d.

Contribution: Primary author, machine learning component, all empirical evaluation.

Section 7.2

B. Taati, J. Snoek, D. Aleman, A. Mihailidis, and A. Ghavamzadeh. Machine learning

techniques for data mining in bone marrow transplant records. In The 54th Annual

Conference of the Canadian Operational Research Society (CORS), May 2012a. Contribution: Machine learning component, code and experimental framework for Bayesian

optimization, some empirical evaluation.

B. Taati, J. Snoek, D. Aleman, A. Mihailidis, and A. Ghavamzadeh. Applying

collaborative filtering techniques to data mining in bone marrow transplant records.

In INFORMS, Oct 2012b. Contribution: Machine learning component, code and

experimental framework for Bayesian optimization, some empirical evaluation.

B. Taati, J. Snoek, D. Aleman, and A. Ghavamzadeh. Data mining in bone marrow

transplant records to identify patients with high odds of survival. In Submission,

2013. Contribution: Machine learning component, code and experimental framework

for Bayesian optimization, some empirical evaluation.

Section 7.3

M. Belshaw, B. Taati, J. Snoek, and A. Mihailidis. Towards a single sensor passive

solution for automated fall detection. In International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, 2011b. Contribution: Machine

learning component, code and experimental framework for Bayesian optimization.


1.4 Code

In an effort to make the work in this thesis more useful and accessible to the wider research

community, and in the interest of reproducible research, code to run all of the methods

introduced in this thesis is made publicly available. The code is provided “open source”

at http://www.cs.toronto.edu/~jasper/software.html.


Chapter 2

Background

For completeness, this section provides an in-depth review of some of the more advanced

concepts that are utilized in the thesis.

2.1 Gaussian Processes

A Gaussian process (GP) is a probabilistic model that defines a distribution over continuous

functions. Recent advances in machine learning have developed an elegant and principled

methodology for using Gaussian processes to express uncertainty over the space of functions

that fit some set of empirical data. This section will briefly review this theory in the context

of regression. A comprehensive survey is provided in [Rasmussen and Williams, 2006].

The term Gaussian process belongs to the nomenclature of the statistics and machine learning

community. It is always important to acknowledge that similar theory exists under different

terminology in various scientific fields. The fields of geostatistics and physics have used

Gaussian processes for interpolation and extrapolation under the term Kriging for many

years. The first instance of Kriging is often attributed to Danie G. Krige, who defined a

similar method for interpolation in his master’s thesis in 1951[Krige, 1951]. However, inspired

by Krige’s work, Kriging was in fact developed, and the term coined, by Matheron [1962].

Perhaps the first motivation for the use of GPs in the context of Bayesian machine learning

was provided by Neal [1996], who observed that in the limit of infinitely many hidden units, under a Bayesian formulation and with Gaussian priors, neural networks converge to GPs.

Neal provides a nice discussion and history of GPs in [Neal, 1994]. In recent years, the


contribution of the machine learning community to the further understanding and definition

of GPs has been significant [Rasmussen and Williams, 2006]. In this thesis I will follow the

notation and formulation of GPs from Rasmussen and Williams [2006].

Consider a functional mapping X → Y from some real valued D-dimensional input data

vector x ∈ X, i.e. X = ℝ^D, to a real valued target y ∈ Y:

y = f(x; w) + ε

where f is some continuous function, w is a set of parameters or weights and ε is additive

measurement error or noise. In the case of a simple linear mapping, f(x; w) may be defined as f(x; w) = xᵀw. Assuming spherical zero-mean Gaussian noise with variance σ², i.e. ε ∼ N(0, σ²), gives rise to a Gaussian likelihood over the functional mapping:

p(y|x, w) = N(xᵀw, σ²)

Often we have multiple observed inputs and corresponding targets. In that case, we can

specify a likelihood over the entire data set (assuming independence) as the product of the

single case likelihood over the N observations:

p(y|X, w) = ∏_{i=1}^{N} p(y_i|x_i, w) = N(Xᵀw, σ²I)    (2.1)

where y and X are created by stacking each of the N target values and input vectors

respectively.

The above linear formulation can be used to form a much richer, non-linear mapping through

projecting the inputs into some feature space and computing a weighted combination of

features instead of inputs. For example, a cubic polynomial mapping from the inputs to

targets can be specified as y = φ(x)ᵀw + ε, where φ(x) = [1, x, x², x³]ᵀ. φ(x) can be

considered to be any fixed projection of x into some higher-dimensional space.

In the above formulation we specify a functional mapping from x to y, but we accept that

there may be some amount of noise involved in the mapping. Setting the noise in this manner

means that for a given input x, there is now a distribution over possible mappings to y.

Consider linearly interpolating along x using a finite number of equally spaced points x1...n

between two distinct inputs (x1 and xn). Now consider computing the targets y1...n for each


of those interpolated points using the above linear mapping with zero-mean Gaussian noise.

Repeating the mapping again will possibly result in different values of y for the same points

x, as the mapping is corrupted by the noise - a stochastic process. Plotting y1...n allows one

to visualize a function defining y (or alternatively interpolating an infinite number of equally

spaced points on x). Note that under a Gaussian noise model, an infinite number of possible

mappings into y for any given input x are possible, although most are extremely unlikely.

Thus repeating the above procedure an infinite number of times will produce an infinite

number of unique functions over y. Intuitively, many of these functions are extremely unlikely

to have produced the set of observed targets. Which of the infinite functions is the true

function (that would be realized if there was no noise)? How does one express a preference for

one function over another? One could for example assume independence between observations

and take the product of the probability of all the interpolated points (Equation 2.1) to obtain

a probability for each function. However, this assigns high probability to some extremely

unlikely (real-world) functions. Gaussian processes address how one defines a distribution

over this infinite set of functions in a principled way using Bayesian probability theory.

GPs take advantage of an a priori assumption to specify a preference over functions. This

is expressed through the use of a Bayesian prior over the parameters, w, of the function.

Specifically, a preference for smooth functions is expressed through setting a zero-mean

Gaussian prior over the parameters, p(w) = N(0,Σ), with some positive definite covariance

matrix Σ. As the functions are defined by their parameters, specifying a distribution over

the parameters is analogous to specifying a distribution over functions. Thus following Bayes’

rule, a posterior distribution over parameters is defined by:

p(w|y, X) = p(y|X, w) p(w) / p(y|X)    (2.2)

where p(y|X), the marginal likelihood, is given by integrating out the weights:

p(y|X) = ∫ p(y|X, w) p(w) dw    (2.3)
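For the linear mapping f(x; w) = xᵀw, this posterior is available in closed form. The following sketch is a standard Bayesian linear regression computation (the particular dimensions, noise level, and ground-truth weights are made up for illustration), recovering the weight posterior from noisy observations:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, sigma2 = 2, 200, 0.25
w_true = np.array([1.5, -0.7])        # hypothetical ground-truth weights

# Inputs stacked column-wise (X is D x N), so that y = X^T w + noise.
X = rng.normal(size=(D, N))
y = X.T @ w_true + rng.normal(0.0, np.sqrt(sigma2), N)

Sigma = np.eye(D)                     # Gaussian prior over weights: w ~ N(0, Sigma)

# Closed-form Gaussian posterior p(w | y, X) obtained from Bayes' rule:
# A = X X^T / sigma^2 + Sigma^{-1}, mean = A^{-1} X y / sigma^2, cov = A^{-1}.
A = X @ X.T / sigma2 + np.linalg.inv(Sigma)
post_cov = np.linalg.inv(A)
post_mean = post_cov @ X @ y / sigma2

print(post_mean)   # concentrates near w_true as N grows
```

Note how the posterior covariance shrinks as more data arrive, while the prior Σ dominates when data are scarce.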

This posterior over parameters (or functions) can be used to make much richer predictions

for the targets of a given data point. That is, the predictive distribution over targets y∗ for a

novel data point x∗ can be specified through averaging the targets given over each possible


set of parameters, weighted by the probability of the parameters:

p(y∗|x∗, X, y) = ∫ p(y∗|x∗, w) p(w|X, y) dw

The predictive distribution can be analytically solved [Rasmussen and Williams, 2006] to be

a Gaussian distribution:

p(y∗|x∗, X, y) = N(k∗ᵀ(K + σ²I)⁻¹y, k(x∗, x∗) − k∗ᵀ(K + σ²I)⁻¹k∗)    (2.4)

where K = ΦᵀΣΦ, k∗ = ΦᵀΣφ∗, and k(x∗, x∗) = φ∗ᵀΣφ∗, with Φ the matrix of stacked training features and the shorthand φ(x∗) = φ∗. An interesting observation is that the posterior distribution of the target (Equation

2.4) is specified entirely in terms of inner products with respect to Σ. A nice property is that if Σ is positive-semidefinite, then these inner products specify the covariance of the distribution of functions over y in terms of x. That is, if f(x) = φ(x)ᵀw and w ∼ N(0, Σ), then f is itself zero-mean Gaussian [Rasmussen and Williams, 2006], where the covariance between any two points is specified by the inner product of the basis functions in feature space. That is, cov(f(x_p), f(x_q)) = φ(x_p)ᵀΣφ(x_q). An attractive property of this form of covariance is that one can use the kernel trick, which makes Support Vector Machines [Cristianini and Scholkopf, 2002] so powerful, to compute the dot product in very high (or even infinite) dimensional feature spaces without ever projecting the input into the feature space. Thus, the covariance between two points is often referred to as a kernel: cov(f(x_p), f(x_q)) = k(x_p, x_q). A Gaussian process is specified as a zero-mean Gaussian distribution over f(x) where the covariance matrix K(X, X) is the kernel matrix or Gram matrix of size N×N (where N is the number of observed training points), and entry K_{p,q} is computed using the covariance function k(x_p, x_q).

Intuitively, this means that the covariance between two data points in function space is

specified by their relative distance in input space. This results in a preference for very smooth

continuous functions (assuming a reasonable covariance function).
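The predictive distribution of Equation 2.4 can be implemented directly. The sketch below does so on a noisy one-dimensional toy problem using an exponentiated quadratic covariance (discussed formally below); the test function, noise level, and kernel settings are illustrative only:

```python
import numpy as np

def k_se(a, b, theta0=1.0, length=1.0):
    # Exponentiated quadratic covariance between two sets of 1-D inputs.
    return theta0 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

rng = np.random.default_rng(2)
sigma2 = 0.01                                  # observation noise variance
x_train = np.linspace(0.0, 2.0 * np.pi, 20)
y_train = np.sin(x_train) + rng.normal(0.0, np.sqrt(sigma2), x_train.size)
x_test = np.array([np.pi / 2.0, np.pi])

K = k_se(x_train, x_train)                     # N x N Gram matrix
K_star = k_se(x_train, x_test)                 # covariances to the test points
Ky = K + sigma2 * np.eye(x_train.size)

# Equation 2.4: Gaussian predictive mean and covariance over y*.
pred_mean = K_star.T @ np.linalg.solve(Ky, y_train)
pred_cov = k_se(x_test, x_test) - K_star.T @ np.linalg.solve(Ky, K_star)

print(pred_mean)   # close to sin(pi/2) = 1 and sin(pi) = 0
```

Using `np.linalg.solve` rather than explicitly inverting K + σ²I is the numerically preferable way to evaluate Equation 2.4.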

The Gaussian process covariance functions provide a very flexible and elegant methodology

for establishing priors over random functions. Any positive-semidefinite kernel can be used

to specify the covariance function. Probably the most widely used one is the exponentiated

quadratic covariance, also commonly referred to as the squared exponential, Gaussian or


radial basis function (RBF) kernel:

K_SE(x, x′) = θ₀ exp(−½ r²(x, x′)),    where    r²(x, x′) = ∑_{d=1}^{D} (x_d − x′_d)²/θ_d²    (2.5)

where D is the number of input dimensions and θ_{0...D} are parameters of the kernel. The first parameter, θ₀, is a scale parameter that indicates the amount by which locality should influence the covariance. The remaining parameters θ_{1...D} are the length scales, which specify the

smoothness of the function with respect to each dimension of the input. Often these are each

constrained to be the same value, resulting in a spherical or isotropic covariance. However,

allowing for separate values of each performs Automatic Relevance Determination [MacKay,

1994a] as it allows the model to select the relevant dimensions of the input during learning.

The exponentiated quadratic covariance has the very attractive property that it is the dot

product between x_p and x_q in a feature space containing an infinite number of centered

radial basis functions [Rasmussen and Williams, 2006]. The covariance is unity if the two

inputs are close and decreases smoothly to zero as they move farther away, thus realizing a

preference that data points that are close in input space should be close in the target function

space. Since the exponentiated quadratic is infinitely differentiable, as a prior it expresses

a preference for extremely smooth functions. Sometimes this assumption of smoothness is

appropriate, but often it is too restrictive to properly model a given function of interest.

Thus, Stein [1999] argues for the use of the Matern family of functions which, through the

specification of a real valued parameter, permit tuning of the smoothness of the prior. For

example, samples from the ARD Matern 5/2 covariance,

K_M52(x, x′) = θ₀ (1 + √(5r²(x, x′)) + (5/3) r²(x, x′)) exp(−√(5r²(x, x′)))    (2.6)

are only twice differentiable.
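A direct transcription of Equations 2.5 and 2.6 (with arbitrary illustrative parameter values) makes the behaviour of the two covariances easy to probe numerically:

```python
import numpy as np

def r2(x, xp, theta):
    # ARD squared distance: one length scale theta_d per input dimension.
    return np.sum((x - xp) ** 2 / theta ** 2)

def k_se(x, xp, theta0, theta):
    # Equation 2.5: exponentiated quadratic (squared exponential) covariance.
    return theta0 * np.exp(-0.5 * r2(x, xp, theta))

def k_m52(x, xp, theta0, theta):
    # Equation 2.6: ARD Matern 5/2 covariance.
    s = r2(x, xp, theta)
    return theta0 * (1.0 + np.sqrt(5.0 * s) + 5.0 * s / 3.0) * np.exp(-np.sqrt(5.0 * s))

theta0, theta = 1.0, np.array([1.0, 2.0])   # scale and per-dimension length scales
x = np.zeros(2)
for xp in (np.zeros(2), np.array([0.5, 0.5]), np.array([3.0, 3.0])):
    print(k_se(x, xp, theta0, theta), k_m52(x, xp, theta0, theta))
```

Both covariances equal θ₀ when the inputs coincide and decay smoothly toward zero with distance; the Matérn form decays with a heavier tail, reflecting its weaker smoothness assumption.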

Different covariance functions can be combined (as a sum or product) to produce a richer

covariance. For example, often a combination of the RBF kernel, a linear kernel and a noise

process are combined into a single covariance function [Williams and Rasmussen, 1996]:

K_sum(x, x′) = θ₀ exp(−½ r²(x, x′)) + αxᵀx′ + δσ²    (2.7)

where α indicates the weighting of the linear portion of the kernel and δ is the Kronecker

delta that is one only if x = x′. Learning the parameters of the covariance allows the model


to use a weighted combination of each of the combined kernels. A full treatment of the rich

family of covariance functions and kernels is too broad to cover in this thesis. However, an

excellent discussion of covariance functions and their properties is given in [Rasmussen and

Williams, 2006, Chapter 5].

2.2 Gaussian Process Latent Variable Models

Intuitively, it may be desirable to consider observed raw data as being the noisy result of a

smooth mapping from a highly structured latent space. A Gaussian process can provide a

distribution over the family of possibly infinite continuous mappings from the latent space

to the data. The innovation of the Gaussian process latent variable model (GPLVM) is to

consider such a Gaussian process prior over functions and then optimise the latent variables

with respect to the marginal likelihood of the GP. The GPLVM is derived from the dual

problem of probabilistic principal components analysis (PPCA [Tipping and Bishop, 1999]).

Instead of marginalizing the latent variables and optimizing the weights, the weights are

marginalized and the latent variables are optimized. The result is a generalization of PPCA

allowing any smooth non-linear mapping from the latent space to the data.

First, a zero-mean spherical unit-variance prior is specified for the parameters, or weights,

of the mapping, W ∼ N(0, I). The marginal likelihood for Y becomes the product over the

dimensions D of the data:

p(Y|X, σ²) = ∏_{d=1}^{D} p(y_{:,d}|X, σ²)    (2.8)

p(y_{:,d}|X, σ²) = N(0, XXᵀ + σ²I)    (2.9)

where we have borrowed the notation from [Lawrence, 2005] that y_{:,d} represents the dth

column of Y.

The log-likelihood can be analytically derived as:

L = −(DN/2) ln 2π − (D/2) ln|K| − (1/2) tr(K⁻¹YYᵀ)    (2.10)

where K = XXᵀ + σ²I.

Lawrence [2005] shows that the value X_ML that maximizes the likelihood with respect to X


can be solved for analytically as:

X = ULVᵀ    (2.11)

where U contains columnwise the first q eigenvectors of YYᵀ, L is a q×q diagonal matrix with the first q eigenvalues of D⁻¹YYᵀ along the diagonal (strictly, the square roots of these eigenvalues with the noise subtracted out) and V is some arbitrary rotation matrix. This analytic solution can thus be obtained by solving an eigenvalue problem (equivalently to that in PCA [Lawrence, 2005]).
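A small sketch of Equation 2.11 on synthetic data (dimensions, noise level, and the latent model are chosen arbitrarily for illustration), taking the rotation V to be the identity and ignoring the noise correction:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, q = 100, 5, 2

# Observed data Y (N x D) generated from a q-dimensional linear latent model.
X_true = rng.normal(size=(N, q))
W = rng.normal(size=(q, D))
Y = X_true @ W + 0.05 * rng.normal(size=(N, D))

# Equation 2.11: the maximum-likelihood latent positions come from the top-q
# eigenvectors and eigenvalues of Y Y^T -- an eigenvalue problem, as in PCA.
evals, evecs = np.linalg.eigh(Y @ Y.T)
top = np.argsort(evals)[::-1][:q]
U = evecs[:, top]
L = np.diag(np.sqrt(evals[top] / D))   # sqrt of eigenvalues of D^{-1} Y Y^T
X_ml = U @ L                            # V = I: the rotation is arbitrary
```

The recovered X_ml spans, up to rotation and scale, the same subspace as the true latent positions, which is all the likelihood can identify.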

Examining the marginal likelihood of the dual of PPCA (Equation 2.8), the relationship to

Gaussian Processes becomes clear. It is the product of D Gaussian Processes, each mapping

the latent data to a different dimension of the observed data with the same linear covariance

kernel (K = XXᵀ + σ²I). Thus the dual formulation of PPCA is exactly optimizing the inputs

and hyperparameters of a Gaussian Process mapping from the inputs X to the observed data

Y.

It is this Gaussian Process mapping that makes the GPLVM such a powerful and flexible

model. GP theory provides a framework for specifying a much richer mapping to the data

space with Bayesian model averaging for regularization. Non-linearity can be achieved by

specifying a non-linear kernel within the GP. Again, in practice the RBF kernel (Equation 2.5)

is generally used as it exhibits the desirable properties of preferring smooth functions consisting

of effectively infinite basis functions.

Optimization of this model proceeds through computing the gradient of the marginal likelihood

(Equation 2.10) with respect to the kernel [Lawrence, 2005]:

∂L/∂K = K⁻¹YYᵀK⁻¹ − DK⁻¹    (2.12)

This gradient provides flexibility with respect to the choice of covariance kernel. Any kernel

can be used as long as one can compute the gradient of the kernel with respect to the latent variables (∂K/∂X). Combining this with Equation 2.12, the parameters of the kernel can then

be jointly optimized with the latent variables.
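To see Equations 2.10 and 2.12 in action, the sketch below checks the resulting analytic gradient against finite differences for the linear kernel K = XXᵀ + σ²I, chaining the kernel gradient through ∂K/∂X; the data are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, q, sigma2 = 30, 4, 2, 0.1
Y = rng.normal(size=(N, D))
X = 0.1 * rng.normal(size=(N, q))      # latent positions to be optimized

def loglik(X):
    # Equation 2.10 with the linear kernel K = X X^T + sigma^2 I.
    K = X @ X.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * D * N * np.log(2.0 * np.pi) - 0.5 * D * logdet
            - 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)))

def grad_X(X):
    # Equation 2.12 chained through dK/dX for the linear kernel.
    K = X @ X.T + sigma2 * np.eye(N)
    Kinv = np.linalg.inv(K)
    return (Kinv @ Y @ Y.T @ Kinv - D * Kinv) @ X

# Finite-difference check on a single latent coordinate.
eps = 1e-5
E = np.zeros_like(X)
E[3, 1] = eps
numeric = (loglik(X + E) - loglik(X - E)) / (2.0 * eps)
print(numeric, grad_X(X)[3, 1])   # the two should agree closely
```

This kind of finite-difference check is a cheap safeguard when implementing gradients for richer, non-linear kernels.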

Unfortunately, the objective function is complex and multi-modal. No analytic solution has

been found and thus gradient-based techniques are used to follow the gradients (Equation 2.12) to a (local) maximum of the log marginal likelihood (Equation 2.10). Though any gradient-based optimization package can be used, it should be noted that computing the gradients is computationally expensive; it involves computing the inverse of the kernel matrix (an O(n³) operation, where n is the size of the data set). Thus an optimization package that is

conservative in the number of gradient evaluations (and instead considers e.g. second order

gradient information) should be preferred. This is why Lawrence [2005] advocates the use of

scaled conjugate gradients [Møller, 1993]. However, Yao et al. [2011] have shown stochastic

gradient descent to outperform other techniques when optimizing GPLVMs, possibly due to

its ability to jump out of small local minima.

The quality of the resulting model depends highly on the initialisation of the latent variables.

A good starting point (as shown by [Lawrence, 2005]) is to initialize the latent variables using

the analytic solution to dual-PPCA and then follow the gradients from there. A number

of publications [van der Maaten, 2009, Geiger et al., 2009, Bitzer and Williams, 2010] have

focused on the initialisation of the GPLVM and shown improved results by following various

initialisation methodologies.

2.2.1 Specifying a prior for the latent variable space

While it is true that the GPLVM provides regularization through Bayesian model selection, it

is important to note that overfitting can still occur. The kernel hyperparameters are optimized

through maximizing the marginal likelihood following the Type II Maximum Likelihood

procedure (MacKay’s evidence framework [MacKay, 1992]), which has been shown to prevent

overfitting through Bayesian model selection. This is, however, an approximate technique that works well when only a small number of hyperparameters are being optimized and a large number of parameters have been integrated out (MacKay provides an illuminating discussion in MacKay [1999]). As each of the latent variables is a parameter to

optimize, this again opens up the possibility of overfitting.

It is thus a good idea to specify a prior over the latent positions P (X) instead of using a

log-uniform prior as is done in the Type II ML procedure. Optimizing the latent variables

with respect to the marginal likelihood multiplied by this prior corresponds to finding the

maximum a-posteriori (MAP) solution for X rather than the maximum likelihood. Lawrence

[2005] specifies a spherical zero-mean unit-variance prior P(X) = ∏_{n=1}^{N} N(x_n|0, I) over the latent

space.

A much richer prior can be used if one knows additional information about the structure of

the data or how it should be embedded in the latent space. This is the foundational idea

behind a number of advances in using GPLVMs, and has resulted in state-of-the-art results

on a number of machine learning tasks.


Incorporating Dynamics

If the observed data form a time series or are sequential in nature, e.g. motion capture data,

then it makes sense to incorporate this sequential structure in the latent mapping. This can

be incorporated in the form of a dynamical model as the prior over X. That is, one conditions

the latent mapping x_t of an observed variable on the latent position of the previous observed variable x_{t−1}, i.e. P(x_t|x_{t−1}). This idea forms the basis of the Gaussian Process Dynamical Model (GPDM) [Wang et al., 2008]. In the GPDM, Wang et al. use a Gaussian process mapping over the dynamics of the latent space to form the prior distribution over X. A latent position x_t

of a data case at time t is assumed to be some mapping of the latent position of the previous

data case x_{t−1}:

x_t = f(x_{t−1}; w) + e_t    (2.13)

with a Gaussian process prior over f . They simultaneously optimize the parameters of the

model and the dynamics, through combining the gradients of both to learn the model. This

prior encodes a preference in the model for placing observed points that are close in time,

closer in latent space. As a result, the sequential observed data tends to be placed along a

very smooth non-linear manifold in latent space, where distance along the manifold is given

by time. Wang et al. [2008] showed that one can use the latent dynamics to create new

sequences in latent space, and then use the GPLVM mapping P (Y|X) to generate a new

sequence in data space. This was used by Urtasun et al. [2006] to very accurately infer the

dynamics of a person in a visual tracking setting even in the presence of full occlusions.
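As a toy illustration of the effect of the dynamics prior in Equation 2.13, the sketch below substitutes a simple linear contraction for the GP-distributed f (a hypothetical stand-in; in the GPDM f itself carries a GP prior) and samples a latent trajectory:

```python
import numpy as np

rng = np.random.default_rng(5)
T, q = 50, 2
A = 0.95 * np.eye(q)             # linear stand-in for the mapping f(x; w)
noise_scale = 0.05

x = np.zeros((T, q))
x[0] = np.array([2.0, -2.0])     # arbitrary starting latent position
for t in range(1, T):
    # Equation 2.13: x_t = f(x_{t-1}; w) + e_t with Gaussian noise e_t.
    x[t] = A @ x[t - 1] + noise_scale * rng.normal(size=q)

# Consecutive latent positions stay close, while distant time steps drift apart.
step = np.linalg.norm(np.diff(x, axis=0), axis=1).mean()
spread = np.linalg.norm(x - x[0], axis=1).mean()
print(step, spread)
```

Under such a prior, latent points adjacent in time are encouraged to lie close together, which is exactly the smooth latent manifold structure described above.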

Topologically Constrained GPLVMs

Urtasun et al. [2008] identified that one can manipulate the topology of the latent space

through using domain specific a priori knowledge about which data points should be neighbors

(close) in the latent space. They specify the prior over latent variables using Locally Linear

Embedding (LLE) [Roweis and Saul, 2000], an algorithm that learns an encoding of data points in

terms of weighted combinations of their neighbors. The LLE prior gives the GPLVM a strong

preference to place neighboring data points close in the latent space. The authors show that

if, for example, the data is highly cyclic in nature (such as human gait) the neighbors can be

set to be those that are closest in terms of phase. This results in a highly cyclic embedding

in latent space. A major shortcoming of the Gaussian Process Dynamical Model is that

the dynamics model is unimodal; there are no transitions between different types of motion. The


authors address this by setting points in the latent space where transitions are possible to be

neighbors in the LLE prior.

van der Maaten [2009] similarly uses a dimensionality reduction technique (t-SNE [van der

Maaten and Hinton, 2008]) to initialize the GPLVM and as the prior for the latent space, and

demonstrates improved results (in a nearest neighbor experiment) over the original zero-mean

Gaussian prior over the latent variables.

Classification

In a classification setting, the observed training data are generally associated with a discrete

class label. With such data it makes sense to embed data points of the same class nearby

in latent space, and similarly separate data points of different classes as much as possible.

This is exactly the motivation for the Discriminative GPLVM [Urtasun and Darrell, 2007].

Urtasun and Darrell [2007] form the prior over the latent space p(X) using Generalized

Discriminant Analysis (GDA). GDA is a non-linear kernel based method that attempts to

maximize between-class separability while minimizing within-class variability. The resulting

model prefers to encode in the low-dimensional latent space primarily the high-dimensional

structure of the data that most clearly separates the classes. Urtasun and Darrell show that

classification in the latent space produces results that outperform even standard Gaussian

process classification. Also, the Discriminative GPLVM allows for some interesting future

extensions, such as semi-supervised learning through only using the GDA prior on labeled

data, and classification of dynamical sequences through combining the discriminative prior

with a dynamical one.

Hierarchical and Shared Gaussian Process Latent Variable Models

In many situations, sequential data is not well modeled using first order dynamics as is done

using Gaussian Process Dynamical Models (e.g. Wang et al. [2008]). For example, in the

case of visual tracking or motion capture data, often the data is collected at a variable frame

rate. Assuming each data sample directly follows the last is obviously not an optimal model

of the data in this case. The dynamics assumes the role of a prior for the latent variables

in a GPLVM, causing the GPLVM to place data points that sequentially follow each other

nearby in latent space. This will introduce error into the model if there is a large variance

in the times between temporal neighbors in training data. Specifically, the prior will try to


push sequential neighbors close together equally, regardless of how close they are in time.

To overcome such issues is part of the motivation for the Hierarchical GPLVM [Lawrence

and Moore, 2007]. The Hierarchical GPLVM models the prior over X as another GPLVM.

Multiple GPLVMs can then be connected through sharing a parent. The parent process

encodes its children in a joint latent space while the child processes can be encoded in

their own latent topology. This allows one to relate separate processes using a single shared

prior. However, this added representational power also adds more potential for overfitting.

Thus a very strong prior must be provided for the parent of the hierarchy and the intermediate

GPLVM must be very constrained. The authors showed that the HGPLVM could include

time explicitly as the prior for the top level GPLVM, thus constraining data points that are

close in time to be close in latent space. Ek et al. [2008] show that a form of GPLVM where

two GPLVMs are connected with a single parent GPLVM as the prior can be used to select

the maximally correlated features from the two data spaces. Navaratnam et al. [2007] used a

similar model to perform pose estimation, and showed that the GPLVM performed well on

an unsupervised learning task.

2.2.2 Mapping from the data to the latent space using back constraints

An interesting observation about the GPLVM is that although it provides a smooth mapping

from the latent space to the data space, the converse is not true. Instead the GPLVM ensures

that points that are distant in the data space will be distant in the latent space (to preserve

continuity in the mapping). Lawrence and Quinonero Candela [2006] provide an interesting

discussion and propose the use of back constraints to ensure a smooth mapping from data

space to latent space. This constraint is imposed on the optimization of the model by replacing

each element of X in the model with a functional mapping of the inputs x = g(y; w). Any

mapping for g(y; w) can be used, as long as it enforces a smooth mapping and the gradients of

X with respect to the mapping can be computed. Lawrence and Quinonero Candela [2006]

demonstrate examples where the back constraints enforce more visually pleasing manifolds in

latent space and improved results in a nearest neighbors experiment. Urtasun and Darrell

[2007] show that the back constraints can be used to provide a very quick mapping from

the data to the latent space (as using the GPLVM would involve optimizing for the latent

position of a given data point). This is a very interesting result, as it opens up the possibility

of directly computing the latent position of a test input as part of a real time system (perhaps


for classification). Urtasun et al. [2008] show that the back constraints can

be used similarly to the prior to constrain the topology of the GPLVM.

2.3 Autoencoders

In the machine learning nomenclature, the term autoencoder [Cottrell et al., 1987, Saund,

1989, Hinton and Zemel, 1994] was originally used to describe a particular variant of neural

network that was trained to minimize the reconstruction error of the input after projecting

it through a hidden layer of lower dimensionality. Specifically, to compress some data, an

autoencoder employs a parametric functional mapping g(y; W_1) : Y → X from the data to the latent space and then a similar mapping f(x; W_2) : X → Y back from the latent space to the data, where X = R^J, Y = R^K and J < K. The mapping is of the form:

y = f(g(y; W_1); W_2) + ε    (2.14)

where ε is considered noise, g is a linearly weighted combination of the inputs, y, that is

projected through an elementwise nonlinearity such as the logistic function, σ

g(y; W_1) = σ(y^T W_1)    (2.15)

and f is a linear mapping back to the inputs, i.e. f(x; W_2) = x^T W_2. Here W_1 ∈ R^{K×J} and W_2 ∈ R^{J×K} are weight matrices. Note that it is assumed that biases are

incorporated through appending a 1 to each instance of both y and x, but these are omitted

for notational simplicity.

A popular method for finding an explicit parameterisation for this model is through maximum

likelihood estimation under a Gaussian noise model. That is, if we assume spherical zero-mean Gaussian noise, i.e. ε ∼ N(0, σ²), and perform maximum likelihood estimation for the

parameters, the objective corresponds to finding the W_1 and W_2 that minimize the squared

reconstruction error:

W_1^⋆, W_2^⋆ = arg min_{W_1, W_2} ∑_{n=1}^{N} ∑_{k=1}^{K} (y_k^{(n)} − f_k(g(y^{(n)}; W_1); W_2))².    (2.16)
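As a concrete illustration of Equations (2.14)–(2.16), a minimal numpy sketch of the forward pass and reconstruction objective might look as follows; the shapes, initialization, and function names are illustrative assumptions, and no training loop is included:

```python
import numpy as np

def encode(Y, W1):
    """g(y; W_1) of Eq. (2.15): logistic nonlinearity on a linear map."""
    Y1 = np.hstack([Y, np.ones((Y.shape[0], 1))])   # append a 1 for the bias
    return 1.0 / (1.0 + np.exp(-(Y1 @ W1)))

def decode(X, W2):
    """f(x; W_2): linear map back to the data space (with bias)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ W2

def reconstruction_error(Y, W1, W2):
    """The squared reconstruction error of Eq. (2.16)."""
    return np.sum((Y - decode(encode(Y, W1), W2)) ** 2)

rng = np.random.RandomState(0)
N, K, J = 20, 5, 2               # N examples, K-dim data, J-dim code, J < K
Y = rng.randn(N, K)
W1 = 0.1 * rng.randn(K + 1, J)   # extra row for the appended bias
W2 = 0.1 * rng.randn(J + 1, K)
loss = reconstruction_error(Y, W1, W2)
```

Minimizing this loss with respect to W_1 and W_2 (e.g. by gradient descent) recovers the maximum likelihood estimate under the Gaussian noise model.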

Intuitively, this network learns an encoding of the input in the hidden layer that compresses

the data while preserving some unifying structure. More recently, the term autoencoder has


been extended to include models with overcomplete hidden representations. However, the

training objective must be altered to prevent the network from simply learning the identity mapping, which trivially yields a perfect reconstruction. Vincent et al. [2008] have shown

this can be done by corrupting the input to the encoder through adding Gaussian-distributed or 'salt and pepper' noise and minimizing the reconstruction error of the uncorrupted

data. This denoising autoencoder must learn structure from the inputs to infer what the

noisy inputs were. Ranzato et al. [2007] achieved an overcomplete representation through

enforcing sparsity in the hidden layer. Rifai et al. [2011] propose a contractive autoencoder

through incorporating a penalty relative to the Frobenius norm of the Jacobian matrix of

the encoder activations with respect to the input. This can be considered a cost inversely

proportional to the smoothness of the mapping from inputs to the hidden representation.

Training of autoencoder networks with more than a single hidden layer had been unsuccessful

due to the complexity of optimizing the objective until Hinton and Salakhutdinov [2006]

demonstrated that unsupervised pre-training, ’stacking’ the weights of restricted Boltzmann

machines, provided a good starting point for training a deep autoencoder. Intuitively, the

training objective of a deep multilayer neural network is extremely complex and reaching a

good local minimum from a random initialisation is quite challenging. Arguments for why

such an initialisation is so effective vary. According to Martens [2010] the training objective

suffers not from poor local minima, but rather pathological curvature and therefore second

order optimisation methods are required to properly explore the objective. Erhan et al.

[2010] argue that training deep networks suffers from poor generalisation, particularly in the

early stages of learning, and pre-training provides some robustness to this early overfitting.

Recently, Hinton et al. [2012] demonstrated that a procedure of stochastically dropping out

hidden units during training of the neural network could both provide regularization and

make the optimization more robust to local optima.

Regardless of the reason why, the resulting network learns a nonlinear parametric mapping

into a feature space that is particularly well suited to discriminative tasks. Using the

codes resulting from these deep autoencoders has facilitated state-of-the-art performance on

numerous discriminative tasks due to the models’ ability to capture complex structure from

the input [Bengio, 2009].


2.4 Bayesian Optimization

Bayesian optimization [Mockus et al., 1978] is a methodology for finding the extremum of noisy

black-box functions. Given some small number of observed inputs and corresponding outputs

of a function of interest, Bayesian optimization iteratively suggests the next input to explore

such that the optimum of the function is reached in as few function evaluations as possible.

Bayesian optimization generally operates on expensive, noisy and multi-modal functions. This

is a domain that is considered inappropriate for common gradient based local optimizers in the

well established domain of convex optimization [Boyd and Vandenberghe, 2004]. In Bayesian

optimization, a full statistical model of the function of interest is developed. Combining

prior assumptions over the structure of the function and a history of all observations, one

can reason about the location of optima while taking uncertainty into account. This is

what makes the procedure inherently Bayesian. Under the rationale that a function to be

optimized is computationally, physically or morally expensive, we can justify the additional

computational expense of developing a full statistical model. The statistical model acts as a

surrogate function, which can be queried exhaustively at relatively low computational expense.

Bayesian optimization selects the next point at which to query the real function of interest

by densely evaluating a heuristic proxy acquisition function, which captures an exploitation

vs exploration trade-off. We will briefly explore the literature of Bayesian optimization but

delay mathematical definition in the context of Gaussian processes until Chapter 4.
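As a flavor of the procedure described above, the following is a minimal sketch of a Bayesian optimization loop: a Gaussian process surrogate with a fixed exponentiated quadratic kernel, and an expected improvement acquisition function evaluated densely on a grid. The kernel length scale, the one-dimensional test objective, and the grid size are all illustrative assumptions; the formal treatment is deferred to Chapter 4.

```python
import numpy as np
from math import erf

def rbf(a, b, ell=0.3):
    """Exponentiated quadratic kernel between 1-D input arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(x_obs, y_obs, x_star, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    v = np.linalg.solve(L, Ks)
    mu = Ks.T @ alpha
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sd, best):
    """EI for minimization: E[max(best - f, 0)] under the GP posterior."""
    z = (best - mu) / sd
    cdf = np.array([0.5 * (1.0 + erf(t / np.sqrt(2))) for t in z])
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sd * pdf

f = lambda x: np.sin(3 * x) + x ** 2        # illustrative black-box objective
x_obs = np.array([0.1, 0.9])
y_obs = f(x_obs)
grid = np.linspace(-1, 1, 200)              # dense acquisition evaluation
for _ in range(10):                         # a small evaluation budget
    mu, sd = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))
```

Each iteration the surrogate is queried cheaply over the whole grid, and only the single point maximizing the acquisition is evaluated on the expensive objective, capturing the exploration–exploitation trade-off mentioned above.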

Although originally suggested by Mockus et al. [1978], Bayesian optimization was not generally

popular for many years. However, recently Bayesian optimization has been rediscovered,

presumably due to the development of more mature statistical models of functions and the

computational power required to simulate them. Naturally, the effectiveness of Bayesian

optimization relies on our ability to accurately model distributions over functions. Therefore,

the advancement of Bayesian optimization has been tied to the development of Gaussian

processes as a rich statistical model over continuous functions. This rediscovery is largely

attributable to Jones et al. [1998] and Jones [2001], who derived closed form expressions for

the expected improvement acquisition function and demonstrated the efficiency of Bayesian

optimization compared to other global optimization strategies. Brochu et al. [2010] and

Lizotte [2008] provide a thorough review of the Bayesian optimization literature.

The Bayesian optimization methods and objectives are closely related to those from the

multi-armed bandit setting. This perhaps confusing name derives from the nickname for gambling slot machines as 'one-armed bandits'. In the multi-armed bandit


literature, the objective is generally to formulate a strategy for iteratively gambling with

multiple metaphorical slot machines, each with a different expected return, such that the greatest possible return, or least total cost, is achieved. The cost is generally quantified as simple regret,

which represents the difference in expected return for playing a chosen slot machine instead of

the one with the highest expected return. Cumulative regret then represents the cumulative

cost of all evaluations of a given strategy, or simple regret summed over all iterations. The

even more confusingly named continuous armed bandit setting thus refers to the scenario

where one can evaluate a noisy function over a continuous space rather than evaluate a

discrete set of choices. Srinivas et al. [2010] have proven that in the continuous armed bandit

setting, Gaussian process optimization with the upper confidence bound acquisition function

optimizes a bound on cumulative regret. The objective of minimizing cumulative regret,

the sum of all previous function values, is however fundamentally different from that of

optimization, where only the best value observed is of interest. This separates the objectives

of Bayesian optimization and much of the continuous armed bandit literature. Some work,

such as Gabillon et al. [2012], has considered the problem of finding the “best arm” through

optimizing simple regret in the discrete case under interesting scenarios such as a fixed budget.

The objectives of Bayesian optimization can be considered analogous to those of the “best

arm” setting but in a continuous rather than discrete space.
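The regret quantities defined above can be illustrated with a toy discrete bandit; the arm means and play sequence below are hypothetical:

```python
import numpy as np

means = np.array([0.2, 0.5, 0.9])   # hypothetical expected payoffs; arm 2 is best
plays = [0, 2, 1, 2, 2]             # arms chosen at each round

# Per-round simple regret: shortfall relative to always playing the best arm.
simple = np.array([means.max() - means[a] for a in plays])
# Cumulative regret: simple regret summed over all iterations.
cumulative = np.cumsum(simple)
```

Here the cumulative regret after five rounds is 0.7 + 0 + 0.4 + 0 + 0 = 1.1, whereas the optimization ("best arm") view only cares that arm 2 was eventually identified, regardless of the cost incurred along the way.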

Some effort has been made to prove the convergence properties of Bayesian optimization in

the multidimensional setting. Vazquez and Bect [2010] have shown that using the expected

improvement criterion, Bayesian optimization is guaranteed to converge to the optimum

under strong assumptions over the function, a fixed covariance function and fixed noise.

Bull [2011] has proven that under some assumptions over the ability of the statistical prior

to model the function of interest, a fixed covariance and using the expected improvement

criterion, Bayesian optimization converges to the optimum efficiently. de Freitas et al. [2012]

demonstrated exponentially vanishing simple regret in the Gaussian process bandit setting

when there are deterministic observations. All these theoretical results, however, rely on

assumptions that are difficult to achieve in practice. For example, they assume an ability to

find the true optimum of the underlying acquisition function, which is highly multimodal, at

each iteration.

In Section 4.2 we formally define Bayesian optimization and corresponding acquisition

functions in the Gaussian process setting.


Chapter 3

Nonparametrically Guided

Autoencoder

3.1 Introduction

One of the central tasks of machine learning is the inference of latent representations. Most

often these can be interpreted as representing aggregate features that explain various properties

of the data. In probabilistic models, such latent representations typically take the form of

unobserved random variables. Often this latent representation is of direct interest and may

reflect, for example, cluster identities. It may also be useful as a way to explain statistical

variation as part of a density model. In this work, we are interested in the discovery of latent

features which can be later used as alternate representations of data for discriminative tasks.

That is, we wish to find ways to extract statistical structure that will make it as easy as

possible for a classifier or regressor to produce accurate labels.

We are particularly interested in methods for learning latent representations that result

in fast feature extraction for out-of-sample data. We can think of these as devices that

have been trained to perform rapid approximate inference of hidden values associated with

data. Neural networks have proven to be an effective way to perform such processing, and

autoencoder neural networks, specifically, have been used to find representations for a variety

of downstream machine learning tasks, for example, image classification [Vincent et al., 2008],

speech recognition [Deng et al., 2010], and Bayesian nonparametric modeling [Adams et al.,

2010].


The critical insight of the autoencoder neural network is the idea of using a constrained

(typically either sparse or low-dimensional) representation within a feedforward neural network.

The training objective induces the network to learn to reconstruct its input at its output.

The constrained central representation at the bottleneck forces the network to find a compact

way to explain the statistical variation in the data. While this often leads to representations

that are useful for discriminative tasks, it does require that the salient variations in the data

distribution be relevant for the eventual labeling. This assumption does not necessarily always

hold; often irrelevant factors can dominate the input distribution and make it poorly-suited

for discrimination [Larochelle et al., 2007]. In previous work to address this issue, Bengio

et al. [2007] introduced weak supervision into the autoencoder training objective by adding

label-specific output units in addition to the reconstruction. This approach was also followed

by Ranzato and Szummer [2008] for learning document representations.

The difficulty of this approach is that it complicates the task of learning the autoencoder

representation. The objective now is to learn not only a hidden representation that is good for

reconstruction, but also one that is immediately good for discrimination under the simplified

choice of model, for example, logistic regression. This is undesirable because it potentially

prevents us from discovering informative representations for the more sophisticated nonlinear

classifiers that we might wish to use later. We are forced to solve two problems at once, and

the result of one of them (the classifier) will be immediately thrown away.

Here we propose a different take on the issue of introducing supervised guidance into

autoencoder representations. We consider Gaussian process priors on the discriminative

function that maps the latent codes into labels. The result of this choice is a Gaussian process

latent variable model (GPLVM) [Lawrence, 2005] for the labels. This not only allows us to

flexibly represent a wide class of classifiers, but also prevents us from having to commit to a

particular function at training time. We are then able to combine the efficient parametric

feed-forward aspects of the autoencoder with a flexible Bayesian nonparametric model for

the labels. This also leads to an interesting interpretation of the back-constrained GPLVM

itself as a limiting case of an autoencoder in which the decoder has been marginalized out. In

Section 3.5, we empirically examine our proposed approach on three data sets. In Section 7.1,

we demonstrate the effectiveness of our approach on a real-world rehabilitation problem.

We also examine a data set that highlights the value of our approach, in which we can not

only use guidance from desired labels, but also introduce guidance away from irrelevant

representations.


3.2 Unsupervised Learning of Latent Representations

The nonparametrically-guided autoencoder presented in this paper is motivated largely by

the relationship between two different approaches to latent variable modeling. In this section,

we review these two approaches, the GPLVM and autoencoder neural network, and examine

precisely how they are related.

3.2.1 Autoencoder Neural Networks

The autoencoder [Cottrell et al., 1987] is a neural network architecture that is designed to

create a latent representation that is informative of the input data. Through training the

model to reproduce the input data at its output, a latent embedding must arise within the

hidden layer of the model. Its computations can intuitively be separated into two parts:

• An encoder, which maps the input into a latent (often lower-dimensional) representation.

• A decoder, which reconstructs the input through a map from the latent representation.

We will denote the latent space by X and the visible (data) space by Y and assume they

are real valued with dimensionality J and K respectively, that is, X = R^J and Y = R^K. The

encoder, then, is defined as a function g(y ; φ) : Y → X and the decoder as f(x ; ψ) : X → Y .

Given N data examples D = {y^{(n)}}_{n=1}^{N}, y^{(n)} ∈ Y, we jointly optimize the parameters of the

encoder φ and decoder ψ over the least-squares reconstruction cost:

φ^⋆, ψ^⋆ = arg min_{φ,ψ} ∑_{n=1}^{N} ∑_{k=1}^{K} (y_k^{(n)} − f_k(g(y^{(n)}; φ); ψ))²,    (3.1)

where f_k(·) is the kth output dimension of f(·). It is easy to demonstrate that this model is

equivalent to principal components analysis when f and g are linear projections. However,

nonlinear basis functions allow for a more powerful nonlinear mapping. In our empirical

analysis we use sigmoidal

g(y; φ) = (1 + exp(−y_+^T φ))^{−1}

and noisy rectified linear

g(y; φ) = max(0, y_+^T φ + ε),  ε ∼ N(0, 1)


basis functions for the encoder, where y_+ denotes y with a 1 appended to account for a bias term. The noisy rectified linear units, or NReLUs [Nair and Hinton, 2010], exhibit the

property that they are more equivariant to the scaling of the inputs (the non-noisy version

being perfectly equivariant when the bias term is fixed to 0). This is a useful property for

image data, for example, as (in contrast to sigmoidal basis functions) global lighting changes

will cause uniform changes in the activations across hidden units.
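The scaling equivariance of the non-noisy rectified linear unit with the bias fixed to zero, in contrast to the sigmoid, can be checked directly; the unit functions below are a minimal sketch (no bias term is appended):

```python
import numpy as np

def sigmoid_unit(y, w):
    """Sigmoidal basis function (bias fixed to zero here)."""
    return 1.0 / (1.0 + np.exp(-(y @ w)))

def relu_unit(y, w):
    """Non-noisy rectified linear basis function (bias fixed to zero)."""
    return np.maximum(0.0, y @ w)

rng = np.random.RandomState(0)
y, w = rng.randn(5), rng.randn(5)
c = 3.0  # a global scaling of the input, e.g. a lighting change

scaled_relu = relu_unit(c * y, w)        # equals c * relu_unit(y, w) exactly
scaled_sigmoid = sigmoid_unit(c * y, w)  # saturates; not a scaled response
```

Because max(0, c·yᵀw) = c·max(0, yᵀw) for c > 0, a global rescaling of the input rescales all ReLU activations uniformly, whereas the bounded sigmoid response does not share this property.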

Recently, autoencoders have regained popularity as they have been shown to be an effective

module for “greedy pre-training” of deep neural networks [Bengio et al., 2007]. Denoising

autoencoders [Vincent et al., 2008] are of particular interest, as they are robust to the trivial

“identity” solutions that can arise when trying to learn overcomplete representations. Over-

complete representations, which are of higher dimensionality than the input, are considered

to be ideal for discriminative tasks. However, these are difficult to learn because a trivial

minimum of the autoencoder reconstruction objective is reached when the autoencoder learns

the identity transformation. The denoising autoencoder forces the model to learn more

interesting structure from the data by providing as input a corrupted training example, while

evaluating reconstruction on the noiseless original. The objective of Equation (3.1) then

becomes

φ^⋆, ψ^⋆ = arg min_{φ,ψ} ∑_{n=1}^{N} ∑_{k=1}^{K} (y_k^{(n)} − f_k(g(ỹ^{(n)}; φ); ψ))²,

where ỹ^{(n)} is the corrupted version of y^{(n)}. Thus, in order to infer missing components of the

input or fix the corruptions, the model must extract a richer latent representation.
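A minimal sketch of the corruption step is given below; the noise types follow the description above (Gaussian noise and a masking, 'salt and pepper'-style corruption), but the function name and corruption level are hypothetical:

```python
import numpy as np

def corrupt(Y, rng, noise='masking', level=0.3):
    """Corrupt inputs for a denoising autoencoder (assumed noise types).

    'gaussian' adds N(0, level^2) noise; 'masking' zeroes a random
    fraction `level` of the entries.
    """
    if noise == 'gaussian':
        return Y + level * rng.randn(*Y.shape)
    mask = rng.rand(*Y.shape) >= level   # keep each entry with prob 1 - level
    return Y * mask

rng = np.random.RandomState(0)
Y = rng.rand(4, 6)
Y_tilde = corrupt(Y, rng)
# Reconstruction error is then measured against the clean Y, not Y_tilde.
```

The key point is that the encoder sees Y_tilde while the loss compares against the clean Y, so the model cannot succeed by copying its input.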

3.2.2 Gaussian Process Latent Variable Models

While the denoising autoencoder learns a latent representation that is distributed over the

hidden units of the model, an alternative strategy is to consider that the data intrinsically

lie on a lower-dimensional latent manifold that reflects their statistical structure. Such a

manifold is difficult to define a priori, however, and thus the problem is often framed as

learning the latent embedding under an assumed smooth functional mapping between the

visible and latent spaces. Unfortunately, a major challenge arising from this strategy is the

simultaneous optimization of the latent embedding and the functional parameterization. The

Gaussian process latent variable model [Lawrence, 2005] addresses this challenge under a

Bayesian probabilistic formulation. Using a Gaussian process prior, the GPLVM marginalizes

over the infinitely many possible mappings from the latent to visible spaces and optimizes the latent


embedding over a distribution of mappings. The GPLVM results in a powerful nonparametric

model that analytically integrates over the infinite number of functional parameterizations

from the latent to the visible space.

Similar to the autoencoder, linear kernels in the GPLVM recover principal components

analysis. Under a nonlinear basis, however, the GPLVM can represent an arbitrarily complex

continuous mapping, depending on the functions supported by the Gaussian process prior.

Although GPLVMs were initially introduced for the visualization of high dimensional data,

they have been used to obtain state-of-the-art results for a number of tasks, including modeling

human motion [Wang et al., 2008], classification [Urtasun and Darrell, 2007] and collaborative

filtering [Lawrence and Urtasun, 2009].

The GPLVM assumes that the N data examples D = {y^{(n)}}_{n=1}^{N} are the image of a homologous set {x^{(n)}}_{n=1}^{N} arising from a vector-valued “decoder” function f(x) : X → Y. Analogously to the squared loss of the previous section, the GPLVM assumes that the observed data have been corrupted by zero-mean Gaussian noise: y^{(n)} = f(x^{(n)}) + ε with ε ∼ N(0, σ²I_K). The

innovation of the GPLVM is to place a Gaussian process prior on the function f(x) and then

optimize the latent representation {x^{(n)}}_{n=1}^{N}, while marginalizing out the unknown f(x).

3.2.3 Gaussian Process Priors

Rather than requiring a specific finite basis, the Gaussian process provides a distribution

over random functions of a particular family, the properties of which are specified via a

positive definite covariance function. Typically, Gaussian processes are defined in terms of a

distribution over scalar functions and in keeping with the convention for the GPLVM, we

shall assume that K independent GPs are used to construct the vector-valued function f(x).

We denote each of these functions as fk(x) : X → R. The GP requires a covariance kernel

function, which we denote as C(x,x′) : X×X → R. The defining characteristic of the GP

is that for any finite set of N data in X there is a corresponding N -dimensional Gaussian

distribution over the function values, which in the GPLVM we take to be the components of Y .

The N×N covariance matrix of this distribution is the matrix arising from the application of

the covariance kernel to the N points in X . We denote any additional parameters governing

the behavior of the covariance function by θ.

Under the component-wise independence assumptions of the GPLVM, the Gaussian process

prior allows one to analytically integrate out the K latent scalar functions from X to Y.


Allowing each of the K Gaussian processes to have unique hyperparameters θ_k, we write the

marginal likelihood, that is, the probability of the observed data given the hyperparameters

and the latent representation, as

p({y^{(n)}}_{n=1}^{N} | {x^{(n)}}_{n=1}^{N}, {θ_k}_{k=1}^{K}, σ²) = ∏_{k=1}^{K} N(y_k^{(·)} | 0, Σ_{θ_k} + σ²I_N),

where y_k^{(·)} refers to the vector [y_k^{(1)}, . . . , y_k^{(N)}] and where Σ_{θ_k} is the matrix arising from {x^{(n)}}_{n=1}^{N} and θ_k. In the basic GPLVM, the optimal x^{(n)} are found by maximizing this marginal likelihood.
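The log of this marginal likelihood can be computed directly via a Cholesky decomposition; the sketch below assumes, for brevity, a single shared hyperparameter setting across the K output dimensions rather than a unique θ_k per dimension, and uses an exponentiated quadratic kernel as an illustrative choice:

```python
import numpy as np

def eq_kernel(A, B, ell=1.0):
    """Exponentiated quadratic kernel matrix between the rows of A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gplvm_log_marginal(X, Y, kernel, sigma2):
    """Sum over the K output dimensions of log N(y_k | 0, Sigma + sigma2 I)."""
    N, K = Y.shape
    S = kernel(X, X) + sigma2 * np.eye(N)   # Sigma_theta + sigma^2 I_N
    L = np.linalg.cholesky(S)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    total = 0.0
    for k in range(K):
        a = np.linalg.solve(L.T, np.linalg.solve(L, Y[:, k]))
        total += -0.5 * (N * np.log(2.0 * np.pi) + logdet + Y[:, k] @ a)
    return total

rng = np.random.RandomState(1)
X = rng.randn(10, 2)   # latent positions (the quantities being optimized)
Y = rng.randn(10, 5)   # observed data
ll = gplvm_log_marginal(X, Y, eq_kernel, 0.1)
```

Optimizing this quantity with respect to X (e.g. with gradient-based methods) recovers the basic GPLVM latent embedding.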

3.3 Covariance Functions

Here we will briefly describe the covariance functions used in this work. For a more thorough

treatment, we direct the reader to Rasmussen and Williams [2006, Chapter 4].

A common choice of covariance function for the GP is the automatic relevance determination

(ARD) exponentiated quadratic (also known as squared exponential) kernel

K_EQ(x, x′) = exp(−r²(x, x′)/2),    r²(x, x′) = (x − x′)^T Ψ (x − x′)

where the covariance between outputs of the GP depends on the distance between correspond-

ing inputs. Here Ψ is a symmetric positive definite matrix that defines the metric [Vivarelli

and Williams, 1999]. Typically, Ψ is a diagonal matrix with Ψ_{d,d} = 1/ℓ_d², where the length-scale parameters, ℓ_d, scale the contribution of each dimension of the input independently. In

the case of the GPLVM, these parameters are made redundant as the inputs themselves are

learned. Thus, in this work we assume these kernel hyperparameters are set to a fixed value.

The exponentiated quadratic construction is not appropriate for all functions. Consider a

function that is periodic in the inputs. The covariance between outputs should then depend

not on the Euclidian distance between inputs but rather on their phase. A solution is to

warp the inputs to capture this property and then apply the exponentiated quadratic in

this warped space. To model a periodic function, MacKay [1998] suggests applying the

exponentiated quadratic covariance to the output of an embedding function u(x), where for

a single dimensional input x, u(x) = [sin(x), cos(x)] expands from R to R². The resulting


periodic covariance becomes

K_PER(x, x′) = exp(−2 sin²((x − x′)/2) / ℓ²).

The exponentiated quadratic covariance can be shown [MacKay, 1998] to be the similarity

between inputs after they are projected into a feature space by an infinite number of centered

radial basis functions. Williams [1998] derived a kernel that similarly, under a specific

activation function, reflects a feature projection by a neural network in the limit of infinite

units. This results in the neural network covariance

K_NN(x, x′) = (2/π) sin⁻¹( 2x̃^T Ψ x̃′ / √((1 + 2x̃^T Ψ x̃)(1 + 2x̃′^T Ψ x̃′)) ),    (3.2)

where x̃ is x with a 1 prepended. An important distinction from the exponentiated quadratic is that the neural network covariance is non-stationary: it is not invariant to translation of the inputs. We use the neural network covariance

primarily to draw a theoretical connection between GPs and autoencoders. However, the

non-stationary properties of this covariance in the context of the GPLVM, which can allow

the GPLVM to capture more complex structure, warrant further investigation.
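As a small sketch (ours, not part of the original development), Equation (3.2) can be computed directly; the final assertion illustrates the non-stationarity noted above, since translating both inputs by the same amount changes the covariance.

```python
import numpy as np

def k_nn(x, xp, Psi):
    """Neural network covariance of Equation (3.2); Psi is the prior
    covariance of the input-to-hidden weights (including the bias)."""
    xt = np.concatenate(([1.0], np.atleast_1d(x)))    # prepend a 1 (bias)
    xpt = np.concatenate(([1.0], np.atleast_1d(xp)))
    num = 2.0 * xt @ Psi @ xpt
    den = np.sqrt((1.0 + 2.0 * xt @ Psi @ xt) * (1.0 + 2.0 * xpt @ Psi @ xpt))
    return (2.0 / np.pi) * np.arcsin(num / den)

Psi = np.eye(2)
# Non-stationary: shifting both inputs by the same amount changes the value.
assert not np.isclose(k_nn(0.0, 1.0, Psi), k_nn(5.0, 6.0, Psi))
```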

3.3.1 The Back-Constrained GPLVM

Although the GPLVM constrains the mapping from the latent space to the data to be smooth,

it does not enforce smoothness in the inverse mapping. This can be an undesirable property, as

data that are intrinsically close in observed space need not be close in the latent representation.

Not only does this introduce arbitrary gaps in the latent manifold, but it also complicates the

encoding of novel data points into the latent space as there is no direct mapping. The latent

representations of out-of-sample data must thus be optimized, conditioned on the latent

embedding of the training examples. Lawrence and Quinonero Candela [2006] reformulated

the GPLVM to address these issues, with the constraint that the hidden representation be

the result of a smooth map from the observed space. They proposed multilayer perceptrons

and radial-basis-function networks as possible implementations of this smooth mapping. We

will denote this “encoder” function, parameterized by φ, as g(y ; φ) : Y → X . The marginal

likelihood objective of this back-constrained GPLVM can now be formulated as finding the


optimal φ under:

$$\phi^\star = \arg\min_{\phi} \sum_{k=1}^{K} \left[\, \ln\left|\Sigma_{\theta_k,\phi} + \sigma^2 I_N\right| + \mathbf{y}_k^{(\cdot)\mathsf T} \left(\Sigma_{\theta_k,\phi} + \sigma^2 I_N\right)^{-1} \mathbf{y}_k^{(\cdot)} \,\right], \qquad (3.3)$$

where the kth covariance matrix $\Sigma_{\theta_k,\phi}$ now depends not only on the kernel hyperparameters $\theta_k$, but also on the parameters of $g(\mathbf{y}\,;\,\phi)$, that is,

$$[\Sigma_{\theta_k,\phi}]_{n,n'} = C\!\left(g(\mathbf{y}^{(n)};\phi),\, g(\mathbf{y}^{(n')};\phi)\,;\,\theta_k\right). \qquad (3.4)$$
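A minimal NumPy sketch of evaluating this objective (ours; for simplicity a single shared kernel stands in for the per-dimension hyperparameters θ_k, and `encode` plays the role of the back constraint g(y; φ)):

```python
import numpy as np

def bcgplvm_objective(Y, encode, kernel, sigma2=1e-2):
    """Objective of Equation (3.3), up to additive constants.
    encode: the back constraint g, mapping N x D data to N x Q latents.
    kernel: builds the N x N covariance of Equation (3.4) from the latents."""
    N, K = Y.shape
    X = encode(Y)
    C = kernel(X) + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(C)
    Cinv = np.linalg.inv(C)
    # Sum the log-determinant and quadratic terms over output dimensions.
    return sum(logdet + Y[:, k] @ Cinv @ Y[:, k] for k in range(K))

rng = np.random.RandomState(0)
Y = rng.randn(10, 3)
W = rng.randn(3, 2)                      # a linear "encoder" for illustration
rbf = lambda X: np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, -1))
val = bcgplvm_objective(Y, lambda Y: Y @ W, rbf)
```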

Lawrence and Quinonero Candela [2006] motivate the back-constrained GPLVM partially

through the NeuroScale algorithm of Lowe and Tipping [1997]. The NeuroScale algorithm is

a radial basis function network that creates a one-way mapping from data to a latent space

using a heuristic loss that attempts to preserve pairwise distances between data cases. Thus,

the back-constrained GPLVM can be viewed as a combination of NeuroScale and the GPLVM

where the pairwise distance loss is removed and the loss is instead backpropagated from the

GPLVM.

3.3.2 GPLVM as an Infinite Autoencoder

The relationship between Gaussian processes and artificial neural networks was established

by Neal [1996], who showed that the prior over functions implied by many parametric neural

networks becomes a GP in the limit of an infinite number of hidden units. Williams [1998]

subsequently derived a GP covariance function corresponding to such an infinite neural

network (Equation 3.2) with a specific activation function.

An interesting and overlooked consequence of this relationship is that it establishes a connection between autoencoders and the back-constrained Gaussian process latent variable model.

A GPLVM with the covariance function of Williams [1998], although it does not impose

a density over the data, is similar to a density network [MacKay, 1994b] with an infinite

number of hidden units in the single hidden layer. We can transform this density network into

a semiparametric autoencoder by applying a neural network as the backconstraint network

of the GPLVM. The encoder of the resulting model is a parametric neural network and the

decoder a Gaussian process.


We can alternatively derive this model starting from an autoencoder. With a least-squares

reconstruction cost and a linear decoder, one can integrate out the weights of the decoder

assuming a zero-mean Gaussian prior over the weights. This results in a Gaussian process

for the decoder and learning thus corresponds to the minimization of Equation (3.3) with a

linear kernel for Equation (3.4). Incorporating any non-degenerate positive definite kernel,

which corresponds to a decoder of infinite size, also recovers the general back-constrained

GPLVM algorithm.
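This derivation can be checked numerically. The sketch below (ours) draws many linear decoders w ~ N(0, I) and confirms that the induced covariance of the outputs is the linear kernel XXᵀ plus observation noise, i.e., the decoder has been integrated out into a Gaussian process:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 3)        # latent codes for 5 examples (N x Q)
sigma = 0.1                # observation noise
n_samples = 200000

# Sample linear decoders w ~ N(0, I) for one output dimension and
# form the outputs y = X w + noise for each sampled decoder.
W = rng.randn(3, n_samples)
Ys = X @ W + sigma * rng.randn(5, n_samples)

# Empirical covariance across decoders vs. the linear-kernel GP covariance.
emp_cov = np.cov(Ys)
lin_kernel = X @ X.T + sigma ** 2 * np.eye(5)
assert np.allclose(emp_cov, lin_kernel, atol=0.1)
```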

This infinite autoencoder exhibits some attractive properties. After training, the decoder

network of an autoencoder is generally superfluous. Learning a parametric form for this

decoder is thus a nuisance that complicates the objective. The infinite decoder network, as

realized by the GP, obviates the need to learn a parameterization and instead marginalizes

over all possible decoders. The parametric encoder offers a rapid encoding and persists

as the training data changes, permitting, for example, stochastic gradient descent. A

disadvantage, however, is that the decoder naturally inherits the computational costs of the

GP by memorizing the data. Thus, for very high dimensional data, a standard autoencoder

may be more desirable.

3.4 Supervised Guidance of Latent Representations

Unsupervised learning has proven to be effective for learning latent representations that

excel in discriminative tasks. However, when the salient statistics of the data are only

weakly informative about a desired discriminative task, it can be useful to incorporate label

information into unsupervised learning. Bengio et al. [2007] demonstrated, for example,

that while a purely supervised signal can lead to overfitting, mild supervised guidance can

be beneficial when initializing a discriminative deep neural network. Therefore, Bengio

et al. [2007] proposed a hybrid approach under which the unsupervised model’s latent

representation is also trained to predict the label information, by adding a parametric

mapping c(x ; Λ) : X → Z from the latent space X to the labels Z and backpropagating

error gradients from the output. Bengio et al. [2007] used a linear logistic regression classifier

for this parametric mapping. This “partial supervision” thus encourages the model to

encode statistics within the latent representation that are useful for a specific (but learned)

parameterization of such a linear mapping. Ranzato and Szummer [2008] adopted a similar

strategy to learn compact representations of documents.


There are disadvantages to this approach. The assumption of a specific parametric form

for the mapping c(x ; Λ) restricts the supervised guidance to classifiers within that family

of mappings. Also, the learned representation is committed to one particular setting of the

parameters Λ. Consider the learning dynamics of gradient descent optimization for this

strategy. At every iteration t of descent (with current state φt, ψt,Λt), the gradient from

supervised guidance encourages the latent representation (currently parametrized by φt, ψt)

to become more predictive of the labels under the current label map c(x ; Λt). Such behavior

discourages moves in φ, ψ space that make the latent representation more predictive under

some other label map c(x ; Λ?) where Λ? is potentially distant from Λt. Hence, while the

problem would seem to be alleviated by the fact that Λ is learned jointly, this constant pressure

towards representations that are immediately useful increases the difficulty of learning the

unsupervised component.

3.4.1 Nonparametrically Guided Autoencoder

Instead of specifying a particular discriminative regressor for the supervised guidance and

jointly optimizing for its parameters and those of an autoencoder, it seems more desirable to

enforce only that a mapping to the labels exists while optimizing for the latent representation.

That is, rather than learning a latent representation that is tied to a specific parameterized

mapping to the labels, we would instead prefer to find a latent representation that is consistent

with an entire class of mappings. One way to arrive at such a guidance mechanism is to

marginalize out the parameters Λ of a label map c(x ; Λ) under a distribution that permits a

wide family of functions. We have seen previously that this can be done for reconstructions

of the input space with a decoder f(x ; ψ). We follow the same reasoning and do this instead

for c(x ; Λ). Integrating out the parameters of the label map yields a back-constrained

GPLVM acting on the label space Z, where the back constraints are determined by the input

space Y. The positive definite kernel specifying the Gaussian process then determines the

properties of the distribution over mappings from the latent representation to the labels.

The result is a hybrid of the autoencoder and back-constrained GPLVM, where the encoder

is shared across models. For notation, we will refer to this approach to guided latent

representation as a nonparametrically guided autoencoder, or NPGA.

Let the label space Z be an M-dimensional real space,¹ that is, Z = R^M, and the nth training

example has a label vector z(n) ∈ Z. The covariance function that relates label vectors in

1For discrete labels, we use a “one-hot” encoding.


the NPGA is

$$[\Sigma_{\theta_m,\phi,\Gamma}]_{n,n'} = C\!\left(\Gamma\, g(\mathbf{y}^{(n)};\phi),\, \Gamma\, g(\mathbf{y}^{(n')};\phi)\,;\,\theta_m\right),$$

where $\Gamma \in \mathbb{R}^{H\times J}$ is an H-dimensional linear projection of the encoder output. For $H \ll J$,

this projection improves efficiency and reduces overfitting. Learning in the NPGA is then

formulated as finding the optimal φ, ψ, Γ under the combined objective:

$$\phi^\star, \psi^\star, \Gamma^\star = \arg\min_{\phi,\psi,\Gamma}\; (1-\alpha)\,L_{\text{auto}}(\phi,\psi) + \alpha\,L_{\text{GP}}(\phi,\Gamma),$$

where α ∈ [0, 1] linearly blends the two objectives

$$L_{\text{auto}}(\phi,\psi) = \frac{1}{K}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(y_k^{(n)} - f_k\!\left(g(\mathbf{y}^{(n)};\phi);\psi\right)\right)^2,$$

$$L_{\text{GP}}(\phi,\Gamma) = \frac{1}{M}\sum_{m=1}^{M}\left[\ln\left|\Sigma_{\theta_m,\phi,\Gamma} + \sigma^2 I_N\right| + \mathbf{z}_m^{(\cdot)\mathsf T}\left(\Sigma_{\theta_m,\phi,\Gamma} + \sigma^2 I_N\right)^{-1}\mathbf{z}_m^{(\cdot)}\right].
$$
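The combined objective above can be sketched in a few lines of NumPy (ours; the encoder, decoder and kernel below are arbitrary stand-ins for the learned components):

```python
import numpy as np

def npga_objective(Y, Z, encode, decode, kernel, Gamma, alpha, sigma2=1e-2):
    """Sketch of the blended NPGA objective (1-alpha)*L_auto + alpha*L_GP."""
    N, K = Y.shape
    M = Z.shape[1]
    X = encode(Y)                                   # N x J encoder outputs
    L_auto = np.sum((Y - decode(X)) ** 2) / K       # reconstruction cost
    C = kernel(X @ Gamma.T) + sigma2 * np.eye(N)    # GP on the projected codes
    _, logdet = np.linalg.slogdet(C)
    Cinv = np.linalg.inv(C)
    L_gp = sum(logdet + Z[:, m] @ Cinv @ Z[:, m] for m in range(M)) / M
    return (1 - alpha) * L_auto + alpha * L_gp

rng = np.random.RandomState(0)
Y, Z = rng.randn(8, 4), np.eye(2)[rng.randint(0, 2, 8)]   # one-hot labels
A, B, Gamma = rng.randn(4, 3), rng.randn(3, 4), rng.randn(2, 3)
rbf = lambda X: np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, -1))
val = npga_objective(Y, Z, lambda Y: np.tanh(Y @ A), lambda X: X @ B,
                     rbf, Gamma, alpha=0.5)
```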

We use a linear decoder for f(x ; ψ), and the encoder g(y;φ) is a linear transformation

followed by a fixed element-wise nonlinearity. As is common for autoencoders and to reduce

the number of free parameters in the model, the encoder and decoder weights are tied.

As proposed in the denoising autoencoder variant of Vincent et al. [2008], we always add

noise to the encoder inputs in cost Lauto(φ, ψ), keeping the noise fixed during each iteration

of learning. That is, we update the denoising autoencoder noise every three iterations of

conjugate gradient descent optimization. For the larger data sets, we divide the training data

into mini-batches of 350 training cases and perform three iterations of conjugate gradient

descent per mini-batch. The optimization proceeds sequentially over the batches such that

model parameters are updated after each mini-batch.

3.4.2 Related Models

An example of the hybridization of an unsupervised connectionist model and Gaussian

processes has been explored in previous work. Salakhutdinov and Hinton [2008] used restricted

Boltzmann machines (RBMs) to initialize a multilayer neural network mapping into the

covariance kernel of a Gaussian process regressor or classifier. They then adjusted the

mapping through backpropagating gradients from the Gaussian process through the neural


network. In contrast to the NPGA, this model did not use a Gaussian process in the initial

learning of the latent representation and relies on a Gaussian process for inference at test

time. Unfortunately, this poses significant practical issues for large data sets such as NORB

or CIFAR-10, as the computational complexity of GP inference is cubic in the number of

data examples. Note also that when the salient variations of the data are not relevant to a

given discriminative task, the initial RBM training will not encourage the encoding of the

discriminative information in the latent representation. The NPGA circumvents these issues

by applying a GP to small mini-batches during the learning of the latent representation and

uses the GP to learn a representation that is better even for a linear discriminative model.

Previous work has merged parametric unsupervised learning and nonparametric supervised

learning. Salakhutdinov and Hinton [2007] combined autoencoder training with neighborhood

component analysis [Goldberger et al., 2004], which encouraged the model to encode similar

latent representations for inputs belonging to the same class. Hadsell et al. [2006] employ a

similar objective in a fully supervised setting to preserve distances in label space in a latent

representation. They used this method to visualize the different latent embeddings that can

arise from using additional labels on the NORB data set. Note that within the NPGA, the

back-constrained GPLVM performs an analogous role. In Equation 3.3, the first term, the log

determinant of the kernel, regularizes the latent space. Since the determinant is minimized

when the covariance between all pairs is maximized, it pulls all examples together in the

latent space. The second term, however, pushes examples that are distant in label space

apart in the latent space. For example, when a one-hot coding is used, the labels act as

indicator variables reflecting same-class pairs in the concentration matrix. This pushes apart

examples that are of different class and pulls together examples of the same class. Thus, the

GPLVM enforces that examples close in label space will be closer in the latent representation

than examples that are distant in label space.

There are several important differences, however, between the aforementioned approaches

and the NPGA. First, the NPGA can be intuitively interpreted as using a marginalization

over mappings to labels. Second, the NPGA naturally accommodates continuous labels and

enables the use of any covariance function within the wide library from the Gaussian process

literature. Incorporating periodic labels, for example, is straightforward through using a

periodic covariance. Encoding such periodic signals in a parametric neural network and

blending this with unsupervised learning can be challenging [Zemel et al., 1995]. Similarly

to a subset of the aforementioned work, the NPGA exhibits the property that it not only

enables the learning of latent representations that encode information that is relevant for


discrimination, but, as we show in Section 3.5.3, it can also ignore salient information that is not

relevant to the discriminative representation.

Although it was originally developed as a model for unsupervised dimensionality reduction, a

number of approaches have explored the addition of auxiliary signals within the GPLVM.

The Discriminative GPLVM [Urtasun and Darrell, 2007], for example, added a discriminant

analysis based prior that enforces inter-class separability in the latent space. The DGPLVM

is, however, restricted to discrete labels, requires that the latent dimensionality be smaller

than the number of classes and uses a GP mapping to the data, which is computationally

prohibitive for high dimensional data. A GPLVM formulation in which multiple GPLVMs

mapping to different signals share a single latent space, the shared GPLVM (SGPLVM), was

introduced by Shon et al. [2005]. Wang et al. [2007] showed that using product kernels within

the context of the GPLVM results in a generalisation of multilinear models and allows one

to separate the encoding of various signals in the latent representation. As discussed above,

the reliance on a Gaussian process mapping to the data prohibits the application of these

approaches to large and high dimensional data sets. Our model overcomes these limitations

through using a natural parametric form of the GPLVM, the autoencoder, to map to the

data.

3.5 Empirical Analyses

We now present experiments with NPGA on three different classification data sets. In all

experiments, the discriminative value of the learned representation is evaluated by training a

linear (logistic) classifier, a standard practice for evaluating latent representations.

3.5.1 Oil Flow Data

We begin our empirical analysis by exploring the benefits of using the NPGA on a multi-phase

oil flow classification problem [Bishop and James, 1993]. The data are twelve-dimensional,

real-valued gamma densitometry measurements from a simulation of multi-phase oil flow. The

relatively small sample size of these data—1,000 training and 1,000 test examples—makes

this problem useful for exploring different models and training procedures. We use these data

primarily to explore two questions:


[Figure 3.1 appears here: panel (a) is a heatmap over α (horizontal axis, 0 to 1) and β (vertical axis, 0 to 1); panel (b) plots percent classification error against α for the fully parametric model and the NPGA with RBF and linear kernels.]

Figure 3.1: We explore the benefit of the NPGA on the oil data through adjusting the relative contributions of the autoencoder, logistic regressor and GP costs in the hybrid objective by modifying α and β. (a) Classification error on the test set on a linear scale from 6% (dark) to 1% (light). (b) Cross-sections of (a) at β=0 (a fully parametric model) and β=1 (NPGA).

• To what extent does the nonparametric guidance of an unsupervised parametric au-

toencoder improve the learned feature representation with respect to the classification

objective?

• What additional benefit is gained through using nonparametric guidance over simply

incorporating a parametric mapping to the labels?

In order to address these questions, we linearly blend our nonparametric guidance cost

LGP(φ,Γ) with the one Bengio et al. [2007] proposed, referred to as LLR(φ,Λ):

$$L(\phi,\psi,\Lambda,\Gamma\,;\,\alpha,\beta) = (1-\alpha)L_{\text{auto}}(\phi,\psi) + \alpha\left((1-\beta)L_{\text{LR}}(\phi,\Lambda) + \beta L_{\text{GP}}(\phi,\Gamma)\right), \qquad (3.5)$$

where β ∈ [0, 1] and Λ are the parameters of a multi-class logistic regression mapping to the

labels.

Thus, α allows us to adjust the relative contribution of the unsupervised guidance while β

weighs the relative contributions of the parametric and nonparametric supervised guidance.
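The blended cost and the grid over the blending weights can be sketched as follows (illustrative only; the three callables stand in for the actual autoencoder, logistic regression and GP losses):

```python
# Blended loss of Equation (3.5); L_auto, L_LR and L_GP are callables
# standing in for the autoencoder, logistic regression and GP costs.
def blended_loss(L_auto, L_LR, L_GP, alpha, beta):
    return (1 - alpha) * L_auto() + alpha * ((1 - beta) * L_LR() + beta * L_GP())

# Grid over alpha and beta at intervals of 0.1, as in the experiments.
grid = [(a / 10.0, b / 10.0) for a in range(11) for b in range(11)]
assert len(grid) == 121
```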

To assess the benefit of the nonparametric guidance, we perform a grid search over the range

of settings for α and β at intervals of 0.1. For each grid point, a model was trained for

100 iterations of conjugate gradient descent and classification performance was assessed by



Figure 3.2: Latent projections of the 1000 test cases within the two dimensional latent space of the GP, Γ, for (a) an NPGA (α = 0.5) and (b) a back-constrained GPLVM.

applying logistic regression on the hidden units of the encoder. 250 NReLU units were used

in the encoder, and zero-mean Gaussian noise with a standard deviation of 0.05 was added to

the inputs of the denoising autoencoder cost. The GP label mapping used an RBF covariance

with H=2. To make the problem more challenging, a subset of 100 training samples was

used. Each experiment was repeated over 20 different random initializations.

The results of this analysis are visualized in Figure 3.1. Figure 3.1b demonstrates that,

even when compared with direct optimization under the discriminative family that will

be used at test time (logistic regression), performance improves by integrating out the

label map. However, in Figure 3.1a we can see that some parametric guidance can be

beneficial, presumably because it is from the same discriminative family as the final classifier.

A visualisation of the latent representation learned by an NPGA and a standard back-

constrained GPLVM is provided in Figures 3.2a and 3.2b. The former clearly embeds much

more class-relevant structure than the latter.

We observe also that using a GP with a linear covariance function within the NPGA

outperforms the parametric guidance (see Fig. 3.1b). While the performance of the model

does depend on the choice of kernel, this helps to confirm that the benefit of our approach

is achieved mainly through integrating out the label mapping, rather than having a more

powerful nonlinear mapping to the labels. We also observe that the results of the


Figure 3.3: A sample of filters learned by the NPGA on the CIFAR 10 data set. This model achieved a test accuracy of 65.71%. The filters are sorted by norm.

linear covariance NPGA are significantly noisier than the RBF mapping. Presumably, this is

due to the long-range global support of the linear covariance causing noisier batch updates.

3.5.2 CIFAR 10 Image Data

We also apply the NPGA to a much larger data set that has been widely studied in the

connectionist learning literature. The CIFAR 10 object classification data set2 is a labeled

subset of the 80 million tiny images data [Torralba et al., 2008] with a training set of 50,000

32×32 color images and a test set of an additional 10,000 images. The data are labeled into

ten classes. As GPs scale poorly on large data sets we consider it pertinent to explore the

following:

Are the benefits of nonparametric guidance still observed in a larger scale classification problem, when mini-batch training is used?

To answer this question, we evaluate the use of nonparametric guidance on three different

combinations of preprocessing, architecture and convolution. For each experiment, an

autoencoder is compared to an NPGA by modifying α. Experiments³ were performed following

²CIFAR data set at http://www.cs.utoronto.ca/~kriz/cifar.html.
³When PCA preprocessing was used for autoencoder training, the inputs were corrupted with zero-mean Gaussian noise with standard deviation 0.05. Otherwise, raw pixels were corrupted by deleting (i.e., setting to zero) 10% of the pixels. Autoencoder training then corresponds to reconstructing the original input. Each model used a neural net (MLP) covariance with fixed hyperparameters.


Experiment            α     Accuracy
1. Full Images        0.0   46.91%
                      0.1   56.75%
                      0.5   52.11%
                      1.0   45.45%
2. 28×28 Patches      0.0   63.20%
                      0.8   65.71%
3. Convolutional      0.0   73.52%
                      0.1   75.82%

Sparse Autoencoder [Coates et al., 2011]    73.4%
Sparse RBM [Coates et al., 2011]            72.4%
K-means (Hard) [Coates et al., 2011]        68.6%
K-means (Triangle) [Coates et al., 2011]    77.9%

Table 3.1: Results on CIFAR 10 for various training strategies, varying the nonparametric guidance α. Recently published convolutional results are shown for comparison.

three different strategies:

1. Full images: A one-layer autoencoder with 2400 NReLU units was trained on the raw

data (which was reduced from 32×32×3 = 3072 to 400 dimensions using PCA). A GP

mapping to the labels operated on an H = 25 dimensional space.

2. 28 × 28 patches: An autoencoder with 1500 logistic hidden units was trained on

28×28×3 patches subsampled from the full images, then reduced to 400 dimensions

using PCA. All models were fine tuned using backpropagation with softmax outputs

and predictions were made by taking the expectation over all patches (i.e., to classify

an image, we consider all 28×28 patches obtained from that image and then average

the label distributions over all patches). An H=25 dimensional latent space was used

for the GP.

3. Convolutional: Following Coates et al. [2011], 6×6 patches were subsampled and

each patch was normalized for lighting and contrast. This resulted in a 6×6×3 = 108

dimensional feature vector as input to the autoencoder. For classification, features were

computed densely over all 6×6 patches. The images were divided into 4×4 blocks

and features were pooled through summing the feature activations in each block. 1600

NReLU units were used in the autoencoder but the GP was applied to only 400 of

them. The GP used an H=10 dimensional space.
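The block sum-pooling step in the convolutional strategy can be sketched as follows (a simplified version, ours; rows and columns that do not divide evenly into the grid are discarded):

```python
import numpy as np

def sum_pool(fmap, blocks=4):
    """Sum-pool an (H, W, F) dense feature map over a blocks x blocks grid,
    returning one pooled feature vector for the classifier."""
    H, W, F = fmap.shape
    hs, ws = H // blocks, W // blocks
    pooled = np.zeros((blocks, blocks, F))
    for i in range(blocks):
        for j in range(blocks):
            pooled[i, j] = fmap[i*hs:(i+1)*hs, j*ws:(j+1)*ws].sum(axis=(0, 1))
    return pooled.ravel()

features = sum_pool(np.ones((8, 8, 2)))   # each 2x2 block of ones sums to 4
assert features.shape == (32,) and np.all(features == 4.0)
```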


After training, a logistic regression classifier was applied to the features resulting from the

hidden layer of each autoencoder to evaluate their quality with respect to the classification

objective. The results, presented in Table 3.1, show that supervised guidance helps in

all three strategies. The use of different architectures, methodologies and hidden unit

activations demonstrates that the nonparametric guidance can be beneficial for a wide variety

of formulations, even though we do not achieve state-of-the-art results on this data set. We note that the optimal amount of guidance differs for each experiment

and setting α too high can often be detrimental to performance. This is to be expected,

however, as the amount of discriminative information available in the data differs for each

experiment. The small patches in the convolutional strategy, for example, likely encode very

weak discriminative information. Figure 3.3 visualises the encoder weights learned by the

NPGA on the CIFAR data.

3.5.3 Small NORB Image Data

In the following empirical analysis, the use of the NPGA is explored on the small NORB data

[LeCun et al., 2004]. The data are stereo image pairs of fifty toys belonging to five generic

categories. Each toy was imaged under six lighting conditions, nine elevations and eighteen

azimuths. The 96×96 images were subsampled to half their size to yield a 48×48×2

dimensional input vector per example. The objects were divided evenly into test and training

sets yielding 24,300 examples each. The objective is to classify to which object category each

of the test examples belongs.

This is an interesting problem as the variations in the data due to the different imaging

conditions are the salient ones and will be the strongest signal learned by the autoencoder.

This is nuisance structure that will influence the latent embedding in undesirable ways. For

example, neighbors in the latent space may reflect lighting conditions in observed space

rather than objects of the same class. Certainly, the squared pixel difference objective of

the autoencoder will be affected more by significant lighting changes than object categories.

Fortunately, the variations due to the imaging conditions are known a priori. In addition to

an object category label, there are two real-valued vectors (elevation and azimuth) and one

discrete vector (lighting type) associated with each example. In our empirical analysis we

examine the following:

As the autoencoder attempts to coalesce the various sources of structure into its



Figure 3.4: Visualisations of the NORB training (top) and test (bottom) data latent space representations in the NPGA, corresponding to class (first column), elevation (second column), and lighting (third column). The visualizations are in the GP latent space, Γ, of a model with H = 2 for each GP. Colors correspond to the respective labels.

hidden layer, can the NPGA guide the learning in such a way as to separate the

class-invariant transformations of the data from the class-relevant information?

An NPGA was constructed with Gaussian processes mapping to each of the four label types

to address this question. In order to separate the latent embedding of the salient information

related to each label, the GPs were applied to disjoint subsets of the hidden units of the

autoencoder. The autoencoder’s 2400 NReLU units were partitioned such that half were used

to encode structure relevant for classification and the other half were evenly divided to encode

the remaining three labels. Thus a GP mapping from a four dimensional latent space, H=4,

to class labels was applied to 1200 hidden units. GPs, with H=2, mapping to the three

auxiliary labels were each applied to 400 hidden units. As the lighting labels are discrete,

we used a one-hot coding, similarly to the class labels. The elevation labels are continuous,

so the GP was mapped directly to the labels. Finally, because the azimuth is a periodic


Model                                     Accuracy
Autoencoder + 4 (Log)reg (α = 0.5)        85.97%
GPLVM                                     88.44%
SGPLVM (4 GPs)                            89.02%
NPGA (4 GPs Lin – α=0.5)                  92.09%
Autoencoder                               92.75%
Autoencoder + Logreg (α = 0.5)            92.91%
NPGA (1 GP NN – α=0.5)                    93.03%
NPGA (1 GP Lin – α=0.5)                   93.12%
NPGA (4 GPs Mix – α=0.5)                  94.28%

K-Nearest Neighbors [LeCun et al., 2004]               83.4%
Gaussian SVM [Salakhutdinov and Larochelle, 2010]      88.4%
3 Layer DBN [Salakhutdinov and Larochelle, 2010]       91.69%
DBM: MF-FULL [Salakhutdinov and Larochelle, 2010]      92.77%
Third Order RBM [Nair and Hinton, 2009]                93.5%

Table 3.2: Experimental results on the small NORB data test set. Relevant published results are shown for comparison. NN, Lin and Mix indicate neural network, linear and a combination of neural network and periodic covariances respectively. Logreg indicates that a parametric logistic regression mapping to labels is blended with the autoencoder.

signal, a periodic kernel was used for the azimuth GP. This highlights a major advantage of

our approach, as the broad library of GP covariance functions facilitate a flexibility to the

mapping that would be challenging with a parametric model.

To validate this configuration, we empirically compared it to a standard autoencoder (i.e.,

α=0), an autoencoder with parametric logistic regression guidance and an NPGA with a

single GP applied to all hidden units mapping to the class labels. For comparison, we also

provide results obtained by a back-constrained GPLVM and SGPLVM.4 For all models, a

validation set of 4300 training cases was withheld for parameter selection and early stopping.

Neural net covariances were used for each GP except the one applied to azimuth, which

used a periodic RBF kernel. GP hyperparameters were held fixed as their influence on the

objective would confound the analysis of the role of α. For denoising autoencoder training,

the raw pixels were corrupted by setting 20% of pixels to zero in the inputs. Each image was

lighting- and contrast-normalized and the error on the test set was evaluated using logistic

⁴The GPLVM and SGPLVM were applied to a 96 dimensional PCA of the data for computational tractability, used a neural net covariance mapping to the data, and otherwise used the same back-constraints, kernel configuration, and mini-batch training as the NPGA. The SGPLVM consisted of a GPLVM with a latent space that is shared by multiple GPLVM mappings to the data and each of the labels.


regression on the hidden units of each model. A visualisation of the structure learned by the

GPs is shown in Figure 3.4. Results of the empirical comparison are presented in Table 3.2.

The NPGA model with four nonlinear kernel GPs significantly outperforms all other models,

with an accuracy of 94.28%. This is to our knowledge the best (non-convolutional) result

for a shallow model on this data set. The model indeed appears to separate the irrelevant

transformations of the data from the structure relevant to the classification objective. In

fact, a logistic regression classifier applied to only the 1200 hidden units on which the class

GP was applied achieves a test accuracy of 94.02%. This implies that the half of the latent

representation that encodes the information to which the model should be invariant can

be discarded with virtually no discriminative penalty. Given the significant difference in

accuracy between this formulation and the other models, it appears to be very important to

separate the encoding of different sources of variation within the autoencoder hidden layer.

The NPGA with four linear covariance GPs performed more poorly than the NPGA with a

single linear covariance GP to class labels (92.09% compared to 93.03%). This interesting

observation highlights the importance of using an appropriate mapping to each label. For

example, it is unlikely that a linear covariance would be able to appropriately capture the

structure of the periodic azimuth signal. An autoencoder with parametric guidance to all four

labels, mimicking the configuration of the NPGA, achieved the poorest performance of the

models tested, with 86% accuracy. This model incorporated two logistic and two Gaussian

outputs applied to separate partitions of the hidden units. These results demonstrate the

advantage of the GP formulation for supervised guidance, which gives the flexibility of

choosing an appropriate kernel for different label mappings (e.g., a periodic kernel for the

rotation label).

3.6 Conclusion

In this chapter we present an interesting theoretical link between the autoencoder neural

network and the back-constrained Gaussian process latent variable model. A particular

formulation of the back-constrained GPLVM can be interpreted as an autoencoder in which

the decoder has an infinite number of hidden units. This formulation exhibits some attractive

properties as it allows one to learn the encoder half of the autoencoder while marginalizing

over decoders. We examine the use of this model to guide the latent representation of

an autoencoder to encode auxiliary label information without instantiating a parametric


mapping to the labels. The resulting nonparametric guidance encourages the autoencoder

to encode a latent representation that captures salient structure within the input data

that is harmonious with the labels. Conceptually, this approach simply enforces that a

smooth mapping exists from the latent representation to the labels rather than choosing

or learning a specific parameterization. The approach is empirically validated on four

data sets, demonstrating that the nonparametrically guided autoencoder encourages latent

representations that are better with respect to a discriminative task. Code to run the NPGA

is available at http://hips.seas.harvard.edu/files/npga.tar.gz. We demonstrate on

the NORB data that this model can also be used to discourage latent representations that

capture statistical structure that is known to be irrelevant through guiding the autoencoder

to separate multiple sources of variation. This achieves state-of-the-art performance for a

shallow non-convolutional model on NORB. In Section 7.1, we show that the hyperparameters

introduced in this formulation can be optimized automatically and efficiently using Bayesian

optimization. With these automatically selected hyperparameters, the model achieves state-of-the-art performance on a real-world applied problem in rehabilitation research.


Chapter 4

Practical Bayesian Optimization of

Machine Learning Algorithms

4.1 Introduction

Machine learning algorithms are rarely parameter-free: parameters controlling the rate of

learning or the capacity of the underlying model must often be specified. These parameters

are often considered nuisances, making it appealing to develop machine learning algorithms

with fewer of them. Another, more flexible take on this issue is to view the optimization of

such parameters as a procedure to be automated. Specifically, we could view such tuning

as the optimization of an unknown black-box function and invoke algorithms developed for

such problems. A good choice is Bayesian optimization [Mockus et al., 1978], which has

been shown to outperform other state of the art global optimization algorithms on a number

of challenging optimization benchmark functions [Jones, 2001]. For continuous functions,

Bayesian optimization typically works by assuming the unknown function was sampled from

a Gaussian process and maintains a posterior distribution for this function as observations are

made or, in our case, as the results of running learning algorithm experiments with different

hyperparameters are observed. To pick the hyperparameters of the next experiment, one can

optimize the expected improvement (EI) [Mockus et al., 1978] over the current best result or

the Gaussian process upper confidence bound (UCB)[Srinivas et al., 2010]. EI and UCB have

been shown to be efficient in the number of function evaluations required to find the global

optimum of many multimodal black-box functions [Bull, 2011, Srinivas et al., 2010].

Machine learning algorithms, however, have certain characteristics that distinguish them


from other black-box optimization problems. First, each function evaluation can require a

variable amount of time: training a small neural network with 10 hidden units will take less

time than a bigger network with 1000 hidden units. Even without considering duration, the

advent of cloud computing makes it possible to quantify economically the cost of requiring

large-memory machines for learning, changing the actual cost in dollars of an experiment

with a different number of hidden units. Second, machine learning experiments are often

run in parallel, on multiple cores or machines. In both situations, the standard sequential

approach of GP optimization can be suboptimal.

In this chapter, we identify good practices for Bayesian optimization of machine learning

algorithms. We argue that a fully Bayesian treatment of the underlying GP kernel is preferred

to the approach based on optimization of the GP hyperparameters, as previously proposed

[Bergstra et al., 2011]. Our second contribution is the description of new algorithms for taking

into account the variable and unknown cost of experiments or the availability of multiple

cores to run experiments in parallel.

Gaussian processes have proven to be useful surrogate models for computer experiments

and good practices have been established in this context for sensitivity analysis, calibration

and prediction [Kennedy and O’Hagan, 2001]. While these strategies are not considered

in the context of optimization, they can be useful to researchers in machine learning who

wish to understand better the sensitivity of their models to various hyperparameters. Osborne et al. [2009] and Garnett et al. [2010] have explored alternative strategies for using

Gaussian processes for global optimization and over sets of sensors where they integrate over

GP hyperparameters using sampling approaches and Bayesian quadrature. They consider

methods for incorporating derivative observations and novel strategies for dealing with common conditioning problems that hinder practical implementations of Bayesian optimization.

Hutter et al. [2011] have developed sequential model-based optimization strategies for the

configuration of satisfiability and mixed integer programming solvers using random forests.

The machine learning algorithms we consider, however, warrant a fully Bayesian treatment

as their expensive nature necessitates minimizing the number of evaluations. Bayesian

optimization strategies have also been used to tune the parameters of Markov chain Monte

Carlo algorithms by Mahendran et al. [2012]. Azimi et al. [2011] consider the problem of

scheduling multiple experiments in Bayesian optimization concurrently within a fixed budget

of experiments and time where the running time is stochastic. We instead consider the

case where experiments are run concurrently and the running time or cost is an unknown

function of the inputs. Recently, Bergstra et al. [2011] have explored various strategies for


optimizing the hyperparameters of machine learning algorithms. They demonstrated that

grid search strategies are inferior to random search [Bergstra and Bengio, 2012] and suggested

the use of Gaussian process Bayesian optimization, optimizing the hyperparameters of a

squared-exponential covariance, and proposed the Tree Parzen Algorithm.

4.2 Bayesian Optimization with Gaussian Process

Priors

As in other kinds of optimization, in Bayesian optimization we are interested in finding the

minimum of a function f(x) on some bounded set X, which we will take to be a subset of R^D.

What makes Bayesian optimization different from other procedures is that it constructs a

probabilistic model for f(x) and then exploits this model to make decisions about where in X to next evaluate the function, while integrating out uncertainty. The essential philosophy

is to use all of the information available from previous evaluations of f(x) and not simply

rely on local gradient and Hessian approximations. This results in a procedure that can

find the minimum of difficult non-convex functions with relatively few evaluations, at the

cost of performing more computation to determine the next point to try. When evaluations

of f(x) are expensive to perform — as is the case when each evaluation requires training a machine learning algorithm — then it is easy to justify some extra computation to make better decisions. For

an overview of the Bayesian optimization formalism and a review of previous work, see, e.g.,

Brochu et al. [2010]. In this section we briefly review the general Bayesian optimization

approach, before discussing our novel contributions in Section 4.3.

There are two major choices that must be made when performing Bayesian optimization.

First, one must select a prior over functions that will express assumptions about the function

being optimized. For this we choose the Gaussian process prior, due to its flexibility and

tractability. Second, we must choose an acquisition function, which is used to construct a

utility function from the model posterior, allowing us to determine the next point to evaluate.

4.2.1 Gaussian Processes

The Gaussian process (GP) is a convenient and powerful prior distribution on functions,

which we will take here to be of the form f : X → R. The GP is defined by the property

that any finite set of N points {x_n ∈ X}_{n=1}^N induces a multivariate Gaussian distribution


on R^N. The nth of these points is taken to be the function value f(x_n), and the elegant

marginalization properties of the Gaussian distribution allow us to compute marginals and

conditionals in closed form. The support and properties of the resulting distribution on

functions are determined by a mean function m : X → R and a positive definite covariance

function K : X × X → R. We will discuss the impact of covariance functions in Section 4.3.1.

For an overview of Gaussian processes, see Rasmussen and Williams [Rasmussen and Williams,

2006] or Section 2.1.
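The closed-form Gaussian conditionals can be made concrete with a short sketch. The following is illustrative code, not part of the thesis: it assumes a zero mean function and an isotropic squared-exponential kernel for brevity, whereas the chapter itself advocates a constant mean m and ARD covariances.

```python
import numpy as np

def se_kernel(A, B, theta0=1.0, ell=1.0):
    # Squared-exponential kernel matrix between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ell ** 2
    return theta0 * np.exp(-0.5 * d2)

def gp_predict(X, y, Xstar, noise=1e-6):
    # Condition a zero-mean GP on (X, y) and return the predictive mean and
    # variance at the test inputs Xstar, via a Cholesky factorization.
    K = se_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = se_kernel(X, Xstar)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(se_kernel(Xstar, Xstar)) - (v * v).sum(axis=0)
    return mu, var
```

With a near-zero noise level, the predictive mean interpolates the observations and the predictive variance grows back toward the prior amplitude far from the data, which is exactly the behavior the acquisition functions below exploit.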

4.2.2 Acquisition Functions for Bayesian Optimization

We assume that the function f(x) is drawn from a Gaussian process prior and that our

observations are of the form {x_n, y_n}_{n=1}^N, where y_n ∼ N(f(x_n), ν) and ν is the variance of noise

introduced into the function observations. This prior and these data induce a posterior over

functions; the acquisition function, which we denote by a : X → R+, determines what point

in X should be evaluated next via a proxy optimization x_next = argmax_x a(x), where several

different functions, and carefully selected combinations thereof [Hoffman et al., 2011], have

been proposed. In general, these acquisition functions depend on the previous observations,

as well as the GP hyperparameters; we denote this dependence as a(x; {x_n, y_n}, θ). There

are several popular choices of acquisition function. Under the Gaussian process prior, these

functions depend on the model solely through its predictive mean function µ(x; {x_n, y_n}, θ) and predictive variance function σ²(x; {x_n, y_n}, θ). In what follows, we will denote the best current value as x_best = argmin_{x_n} f(x_n) and Φ(·) as the cumulative distribution function of the standard normal distribution.

Probability of Improvement One intuitive strategy is to maximize the probability of

improving over the best current value [Kushner, 1964]. Under the GP this can be computed

analytically as

\[
a_{\mathrm{PI}}(x;\{x_n,y_n\},\theta)=\Phi(\gamma(x)),\qquad
\gamma(x)=\frac{f(x_{\mathrm{best}})-\mu(x;\{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}. \tag{4.1}
\]

Expected Improvement Alternatively, one could choose to maximize the expected improvement (EI) over the current best. This also has a closed form under the Gaussian process:

\[
a_{\mathrm{EI}}(x;\{x_n,y_n\},\theta)=\sigma(x;\{x_n,y_n\},\theta)\bigl(\gamma(x)\,\Phi(\gamma(x))+\mathcal{N}(\gamma(x);0,1)\bigr) \tag{4.2}
\]
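Equations (4.1) and (4.2) translate directly into code. Here is a minimal sketch for the minimization setting; µ and σ are assumed to come from the GP predictive distribution at the candidate x, and only the standard normal pdf and cdf are needed.

```python
import math

def _pdf(z):
    # Standard normal density N(z; 0, 1).
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal cumulative distribution Phi(z).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_improvement(mu, sigma, f_best):
    # Probability of improvement, Eq. (4.1), for minimization.
    return _cdf((f_best - mu) / sigma)

def expected_improvement(mu, sigma, f_best):
    # Expected improvement, Eq. (4.2), for minimization.
    if sigma <= 0.0:
        return 0.0
    g = (f_best - mu) / sigma
    return sigma * (g * _cdf(g) + _pdf(g))
```

As a sanity check, a prediction far below the incumbent with negligible variance yields EI approximately equal to the predicted gain f_best − µ, while a prediction far above it yields EI near zero.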


GP Upper Confidence Bound A more recent development is the idea of exploiting lower

confidence bounds (upper, when considering maximization) to construct acquisition functions

that minimize regret over the course of their optimization [Srinivas et al., 2010, de Freitas

et al., 2012]. These acquisition functions have the form

\[
a_{\mathrm{LCB}}(x;\{x_n,y_n\},\theta)=\mu(x;\{x_n,y_n\},\theta)-\kappa\,\sigma(x;\{x_n,y_n\},\theta), \tag{4.3}
\]

with a tunable κ to balance exploitation against exploration.

In this work we will focus on the EI criterion, as it has been shown to be better-behaved

than probability of improvement, but unlike the method of GP upper confidence bounds

(GP-UCB), it does not require its own tuning parameter. Although the EI algorithm performs

well in minimization problems, we wish to note that the regret formalization may be more

appropriate in some settings. We perform a direct comparison between our EI-based approach

and GP-UCB in Section 4.4.1.

4.3 Practical Considerations for Bayesian Optimization

of Hyperparameters

Although Bayesian optimization is an elegant framework for optimizing expensive functions, several limitations have prevented it from becoming a widely used technique for optimizing hyperparameters

in machine learning problems. First, it is unclear for practical problems what an appropriate

choice is for the covariance function and its associated hyperparameters. Second, as the

function evaluation itself may involve a time-consuming optimization procedure, problems

may vary significantly in duration and this should be taken into account. Third, optimization

algorithms should take advantage of multi-core parallelism in order to map well onto modern

computational environments. In this section, we propose solutions to each of these issues.

4.3.1 Covariance Functions and Treatment of Covariance

Hyperparameters

The power of the Gaussian process to express a rich distribution on functions rests solely

on the shoulders of the covariance function. While non-degenerate covariance functions

correspond to infinite bases, they nevertheless can correspond to strong assumptions regarding


likely functions. In particular, the automatic relevance determination (ARD) exponentiated

quadratic (squared exponential) kernel

\[
K_{\mathrm{SE}}(x,x')=\theta_0\exp\Bigl(-\tfrac{1}{2}r^2(x,x')\Bigr),\qquad
r^2(x,x')=\sum_{d=1}^{D}(x_d-x'_d)^2/\theta_d^2 \tag{4.4}
\]

is often a default choice for Gaussian process regression. However, sample functions with

this covariance function are unrealistically smooth for practical optimization problems. We

instead propose the use of the ARD Matern 5/2 kernel:

\[
K_{\mathrm{M52}}(x,x')=\theta_0\Bigl(1+\sqrt{5\,r^2(x,x')}+\tfrac{5}{3}r^2(x,x')\Bigr)\exp\Bigl(-\sqrt{5\,r^2(x,x')}\Bigr). \tag{4.5}
\]

This covariance function results in sample functions which are twice-differentiable, an assumption that corresponds to those made by, e.g., quasi-Newton methods, but without requiring

the smoothness of the squared exponential.
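Both covariances can be written down directly from Eqs. (4.4) and (4.5). The following is an illustrative sketch in which the inputs and the length scales θ_1:D are plain arrays:

```python
import numpy as np

def ard_r2(x, xp, lengthscales):
    # Squared ARD distance r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2, Eq. (4.4).
    d = (np.asarray(x, float) - np.asarray(xp, float)) / np.asarray(lengthscales, float)
    return float(d @ d)

def k_se(x, xp, theta0, lengthscales):
    # ARD squared-exponential (exponentiated quadratic) kernel, Eq. (4.4).
    return theta0 * np.exp(-0.5 * ard_r2(x, xp, lengthscales))

def k_m52(x, xp, theta0, lengthscales):
    # ARD Matern 5/2 kernel, Eq. (4.5): sample paths are twice-differentiable
    # but rougher than under the squared exponential.
    r2 = ard_r2(x, xp, lengthscales)
    s = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + s + (5.0 / 3.0) * r2) * np.exp(-s)
```

Both kernels equal θ0 at zero distance and are symmetric in their arguments; they differ in how quickly correlation decays as the ARD distance grows, which is what controls the assumed smoothness.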

After choosing the form of the covariance, we must also manage the hyperparameters that govern its behavior (note that these "hyperparameters" are distinct from those being subjected to the overall Bayesian optimization), as well as that of the mean function. For our problems of interest, typically we would have D + 3 Gaussian process hyperparameters: D length scales θ_1:D,

the covariance amplitude θ0, the observation noise ν, and a constant mean m. The most

commonly advocated approach is to use a point estimate of these parameters by optimizing the

marginal likelihood under the Gaussian process, p(y | {x_n}_{n=1}^N, θ, ν, m) = N(y | m1, Σ_θ + νI), where y = [y_1, y_2, ..., y_N]^T and Σ_θ is the covariance matrix resulting from the N input points under the hyperparameters θ.

However, for a fully-Bayesian treatment of hyperparameters (summarized here by θ alone),

it is desirable to marginalize over hyperparameters and compute the integrated acquisition

function:

\[
a(x;\{x_n,y_n\})=\int a(x;\{x_n,y_n\},\theta)\,p(\theta\mid\{x_n,y_n\}_{n=1}^{N})\,d\theta, \tag{4.6}
\]

where a(x) depends on θ and all of the observations. For probability of improvement and EI,

this expectation is the correct generalization to account for uncertainty in hyperparameters.

We can therefore blend acquisition functions arising from samples from the posterior over GP

hyperparameters and have a Monte Carlo estimate of the integrated expected improvement.

These samples can be acquired efficiently using slice sampling, as described in Murray and


(a) Posterior samples under varying hyperparameters

(b) Expected improvement under varying hyperparameters

(c) Integrated expected improvement

Figure 4.1: Illustration of integrated expected improvement. (a) Three posterior samples are shown, each with different length scales, after the same five observations. (b) Three expected improvement acquisition functions, with the same data and hyperparameters. The maximum of each is shown. (c) The integrated expected improvement, with its maximum shown.

(a) Posterior samples after three data

(b) Expected improvement under three fantasies

(c) Expected improvement across fantasies

Figure 4.2: Illustration of the acquisition with pending evaluations. (a) Three data have been observed and three posterior functions are shown, with "fantasies" for three pending evaluations. (b) Expected improvement, conditioned on each joint fantasy of the pending outcomes. (c) Expected improvement after integrating over the fantasy outcomes.

Adams [2010]. As both optimization and Markov chain Monte Carlo are computationally

dominated by the cubic cost of solving an N -dimensional linear system (and our function

evaluations are assumed to be much more expensive anyway), the fully-Bayesian treatment is

sensible and our empirical evaluations bear this out. Figure 4.1 shows how the integrated

expected improvement changes the acquisition function.

4.3.2 Modeling Costs

Ultimately, the objective of Bayesian optimization is to find a good setting of our hyperparameters as quickly as possible. Greedy acquisition procedures such as expected improvement


try to make the best progress possible in the next function evaluation. From a practical point

of view, however, we are not so concerned with function evaluations as with wallclock time.

Different regions of the parameter space may result in vastly different execution times, due to

varying regularization, learning rates, etc. To improve our performance in terms of wallclock

time, we propose optimizing with the expected improvement per second, which prefers to

acquire points that are not only likely to be good, but that are also likely to be evaluated

quickly. This notion of cost can be naturally generalized to other budgeted resources, such as

reagents or money.

Just as we do not know the true objective function f(x), we also do not know the duration

function c(x) : X → R+. We can nevertheless employ our Gaussian process machinery to

model ln c(x) alongside f(x). In this work, we assume that these functions are independent of

each other, although their coupling may be usefully captured using GP variants of multi-task

learning (e.g., Teh et al. [2005], Bonilla et al. [2008]). Under the independence assumption,

we can easily compute the predicted expected inverse duration and use it to compute the

expected improvement per second as a function of x.
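Because ln c(x) is modeled with a GP, its predictive distribution at any candidate is Gaussian, say N(µ_c, σ_c²), and the expected inverse duration has the closed lognormal-moment form E[1/c(x)] = exp(−µ_c + σ_c²/2). A sketch, assuming (µ_c, σ_c²) come from the cost GP:

```python
import math

def ei_per_second(ei_value, mu_log_cost, var_log_cost):
    # Expected improvement per second: scale EI by the expected inverse
    # duration. With ln c(x) ~ N(mu, var), E[1/c(x)] = exp(-mu + var/2).
    return ei_value * math.exp(-mu_log_cost + 0.5 * var_log_cost)
```

Candidates with the same EI are then ranked by predicted speed, so cheap regions of the hyperparameter space are explored first.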

4.3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization

With the advent of multi-core computing, it is natural to ask how we can parallelize our

Bayesian optimization procedures. More generally than simply batch parallelism, however,

we would like to be able to decide what x should be evaluated next, even while a set of points

are being evaluated. Clearly, we cannot use the same acquisition function again, or we will

repeat one of the pending experiments. Ideally, we could perform a roll-out of our acquisition

policy, to choose a point that appropriately balanced information gain and exploitation.

However, such roll-outs are generally intractable. Instead we propose a sequential strategy

that takes advantage of the tractable inference properties of the Gaussian process to compute

Monte Carlo estimates of the acquisition function under different possible results from pending

function evaluations.

Consider the situation in which N evaluations have completed, yielding data {x_n, y_n}_{n=1}^N, and in which J evaluations are pending at locations {x_j}_{j=1}^J. Ideally, we would choose a

new point based on the expected acquisition function under all possible outcomes of these


pending evaluations:

\[
a(x;\{x_n,y_n\},\theta,\{x_j\})=\int_{\mathbb{R}^J}a(x;\{x_n,y_n\},\theta,\{x_j,y_j\})\;p(\{y_j\}_{j=1}^{J}\mid\{x_j\}_{j=1}^{J},\{x_n,y_n\}_{n=1}^{N})\,dy_1\cdots dy_J. \tag{4.7}
\]

This is simply the expectation of a(x) under a J-dimensional Gaussian distribution, whose

mean and covariance can easily be computed. As in the covariance hyperparameter case, it is

straightforward to use samples from this distribution to compute the expected acquisition

and use this to select the next point. Figure 4.2 shows how this procedure would operate with

queued evaluations. We note that a similar approach is touched upon briefly by Ginsbourger

and Riche [2010b], but they view it as too intractable to warrant attention. Azimi et al.

[2012] consider the case where batches of experiments can be selected and run concurrently,

but this ignores the highly variable running times of machine learning algorithms. We have

found our Monte Carlo estimation procedure to be highly effective in practice, however, as

will be discussed in Section 4.4.
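The estimate of Eq. (4.7) can be sketched as follows. Both `sample_fantasies` and `acq_with` are assumed helpers, not thesis code: the former draws one joint sample of the pending outcomes {y_j} from the GP posterior at the pending inputs, and the latter evaluates the acquisition at x after conditioning the GP on that fantasized data.

```python
import numpy as np

def mc_pending_acquisition(x, sample_fantasies, acq_with, n_fantasies=16, seed=0):
    # Monte Carlo estimate of Eq. (4.7): average the acquisition at x over
    # joint "fantasy" outcomes for the pending evaluations.
    rng = np.random.default_rng(seed)
    vals = [acq_with(x, sample_fantasies(rng)) for _ in range(n_fantasies)]
    return float(np.mean(vals))
```

Since the fantasies are drawn jointly from a J-dimensional Gaussian, correlations among the pending outcomes are respected, which is what discourages the acquisition from re-proposing a pending location.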

4.3.4 Optimizing the Acquisition Function

The Bayesian optimization routine relies on replacing the optimization problem over an

expensive and potentially noisy function with a relatively cheap alternative over a surrogate

function. That is, at each iteration of Bayesian optimization we must find the optimum of

the acquisition function in order to pick the next input to evaluate. Although the expected

improvement criterion and its derivative are analytic expressions, this optimization is non-

trivial as the expected improvement is highly multimodal (see Figure 4.2c). Other researchers

such as Bardenet and Kégl [2010] and Mahendran et al. [2012] have used alternative global

optimization algorithms for this purpose. We follow a strategy similar to Lizotte et al. [2012]

and Hutter et al. [2011] where we locally optimize the acquisition function over multiple

candidate points from a dense grid. We start by projecting the observations to the unit

hypercube, as defined by the bounds of the optimization. Gaussian process hyperparameters, θ,

are sampled using the slice sampling algorithm of Murray and Adams [2010]. In order to

find the maximum of the multimodal acquisition function a(x; {x_n, y_n}, θ) in a continuous

domain, first discrete candidate points are densely sampled in the unit hypercube using a

low discrepancy Sobol sequence [Bratley and Fox, 1988]. Each of these candidates is then

subjected to a bounded optimization over the integrated acquisition function. Precisely, the


maximum of the acquisition function, averaged over GP hyperparameter samples, is computed

with the input initialized at each of the candidate points. As the acquisition functions in this

work can all be expressed analytically in closed form, standard gradient descent techniques

can be used. This yields a new set of candidate points, each of which is located at a local

optimum of the integrated acquisition function. The next point to be evaluated in the

Bayesian optimization procedure is then selected as the candidate point with the highest

integrated acquisition function. Algorithm 1 outlines the procedure for selecting the next

candidate point to evaluate while integrating over hyperparameter samples. In the event of

pending experiments, {x_j}_{j=1}^J, corresponding fantasized outcomes, {y_j}_{j=1}^J, can be efficiently

sampled from the Gaussian process posterior for each hyperparameter sample and added to

the observation set, before computing the integrated acquisition function.

4.3.5 Hyperparameter Priors

After choosing the form of the Gaussian process covariance, we must also manage the hyperparameters that govern its behavior. In our empirical evaluation, unless otherwise specified, we have D + 3 Gaussian process hyperparameters: D length scales θ_1:D, the covariance amplitude θ0, the observation noise ν, and a constant mean m. For a fully-Bayesian treatment

of hyperparameters, it is desirable to marginalize over hyperparameters and compute the

integrated acquisition function. As stated in the paper, a Monte Carlo estimate of the

integrated acquisition function is computed via slice sampling [Murray and Adams, 2010].

Appropriate priors for each of the hyperparameters are chosen for use within the context

of the slice sampling algorithm, which we will clarify here. We specify a uniform prior for

the mean, m, and width 2 top-hat priors for each of the D length scale parameters. As we

expect the observation noise generally to be close to or exactly zero, ν is given a horseshoe

prior [Carvalho et al., 2009]. The covariance amplitude θ0 is given a zero mean, unit variance

lognormal prior, θ0 ∼ lnN(0, 1).
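An unnormalized log-prior corresponding to these choices might look as follows. This is an illustrative sketch only: the horseshoe density has no closed form, so the standard unnormalized surrogate log log(1 + (s/ν)²), with an assumed scale s, stands in for it here.

```python
import math

def log_prior(m, lengthscales, theta0, nu, horseshoe_scale=0.1):
    # Uniform prior on the constant mean m: contributes a constant (zero).
    # Width-2 top-hat priors on each of the D length scales.
    for ell in lengthscales:
        if not 0.0 < ell < 2.0:
            return -math.inf
    if theta0 <= 0.0 or nu <= 0.0:
        return -math.inf
    # Lognormal(0, 1) prior on the covariance amplitude theta0.
    lp = -math.log(theta0) - 0.5 * math.log(theta0) ** 2 - 0.5 * math.log(2.0 * math.pi)
    # Horseshoe prior on the noise nu, via an unnormalized surrogate density
    # (assumed scale horseshoe_scale); it places heavy mass near nu = 0.
    lp += math.log(math.log1p((horseshoe_scale / nu) ** 2))
    return lp
```

Within slice sampling, only such an unnormalized log density is needed, so the missing horseshoe normalizer is harmless.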

4.4 Empirical Analyses

In this section, we empirically analyse1 the algorithms introduced in this chapter and

compare to existing strategies and human performance on a number of challenging machine

1All experiments were conducted on identical machines using the Amazon EC2 service.


Algorithm 1 Selecting the next point to evaluate

    Input: observations {x_n, y_n}_{n=1}^N
    Generate a set of M candidate points from the Sobol sequence [Bratley and Fox, 1988]: X_cand = Sobol(M)
    for h = 1 to H do
        Sample a Gaussian process hyperparameter sample θ_h [Murray and Adams, 2010] (see Section 4.3.5)
    end for
    for each x_m in X_cand do
        x_m ← argmax_x Σ_{h=1}^H a(x ; {x_n, y_n}, θ_h), locally optimized starting from x_m
    end for
    x_next = argmax_{x_m ∈ X_cand} Σ_{h=1}^H a(x_m ; {x_n, y_n}, θ_h)
    return x_next
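Algorithm 1 can be sketched in a few lines. This illustrative code is not the thesis implementation: uniform random candidates stand in for the Sobol sequence, finite-difference ascent stands in for the analytic acquisition gradients, and inputs are assumed to be scaled to the unit hypercube.

```python
import numpy as np

def select_next_point(acq_sum, dim, n_cand=32, n_steps=100, lr=0.05, seed=0):
    # acq_sum(x): acquisition summed over GP hyperparameter samples,
    # evaluated at a point x in the unit hypercube.
    rng = np.random.default_rng(seed)
    cands = rng.uniform(size=(n_cand, dim))  # stand-in for Sobol candidates
    eye, eps = np.eye(dim), 1e-5
    for i in range(n_cand):
        x = cands[i]
        for _ in range(n_steps):
            # Finite-difference gradient ascent, clipped to the bounds.
            grad = np.array([(acq_sum(x + eps * e) - acq_sum(x - eps * e)) / (2 * eps)
                             for e in eye])
            x = np.clip(x + lr * grad, 0.0, 1.0)
        cands[i] = x
    # Return the locally optimized candidate with the highest acquisition.
    return cands[int(np.argmax([acq_sum(x) for x in cands]))]
```

On a toy concave acquisition such as acq = lambda x: -float(((x - 0.3) ** 2).sum()), the routine recovers the maximizer at 0.3 in each coordinate.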

learning problems. We refer to our method of expected improvement while marginalizing

GP hyperparameters as “GP EI MCMC”, optimizing hyperparameters as “GP EI Opt”,

EI per second as “GP EI per Second”, and N times parallelized GP EI MCMC as “Nx

GP EI MCMC”. Each results figure plots the progression of min_n f(x_n) over the number

of function evaluations or time, averaged over multiple runs of each algorithm. If not

specified otherwise, x_next = argmax_x a(x) is computed using gradient-based search with

multiple restarts (see supplementary material for details). The code used is made publicly

available at http://www.cs.toronto.edu/~jasper/software.html.

4.4.1 Branin-Hoo and Logistic Regression

We first compare to standard approaches and the recent Tree Parzen Algorithm2 (TPA) of

Bergstra et al. [2011] on two standard problems. The Branin-Hoo function is a common

benchmark for Bayesian optimization techniques [Jones, 2001] that is defined over x ∈ R^2, where 0 ≤ x_1 ≤ 15 and −5 ≤ x_2 ≤ 15. We also compare to TPA on a logistic regression

classification task on the popular MNIST data. The algorithm requires choosing four hyperparameters: the learning rate for stochastic gradient descent, on a log scale from 0 to 1; the ℓ2 regularization parameter, between 0 and 1; the mini-batch size, from 20 to 2000; and the number of learning epochs, from 5 to 2000. Each algorithm was run on the Branin-Hoo and

logistic regression problems 100 and 10 times, respectively, and the mean and standard error are

reported. The results of these analyses are presented in Figures 4.3a and 4.3b in terms of the

2Using the publicly available code from https://github.com/jaberg/hyperopt/wiki


[Figure 4.3 panels (a)-(c): curves of minimum function value versus function evaluations (a, b) and minutes (c) for GP EI Opt, GP EI MCMC, GP-UCB, TPA, and GP EI per Second.]

Figure 4.3: Comparisons on the Branin-Hoo function (4.3a) and training logistic regression on MNIST (4.3b). (4.3c) shows GP EI MCMC and GP EI per Second from (4.3b), but in terms of time elapsed.

number of times the function is evaluated. On Branin-Hoo, integrating over hyperparameters is superior to using a point estimate, and GP EI significantly outperforms TPA in both cases, finding the minimum in less than half as many evaluations. For logistic regression, Figures 4.3b and 4.3c show that although EI per second is less efficient in function evaluations, it outperforms standard EI in time.
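For reference, the Branin-Hoo benchmark used above has a simple closed form; the sketch below uses the standard constants from the optimization literature (they are not stated in this chapter):

```python
import math

def branin_hoo(x1, x2):
    """Standard Branin-Hoo benchmark; three global minima with value ~0.397887."""
    a, b, c = 1.0, 5.1 / (4.0 * math.pi ** 2), 5.0 / math.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s

# One of the three global minimizers:
print(round(branin_hoo(math.pi, 2.275), 6))  # -> 0.397887
```

Its multiple equally good optima and cheap evaluation make it a convenient testbed for acquisition functions.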

4.4.2 Online LDA

Latent Dirichlet Allocation (LDA) is a directed graphical model for documents in which

words are generated from a mixture of multinomial “topic” distributions. Variational Bayes

is a popular paradigm for learning and, recently, Hoffman et al. [2010] proposed an online

learning approach in that context. Online LDA requires two learning parameters, τ0 and κ, which control the learning rate ρ_t = (τ0 + t)^(−κ) used to update the variational parameters of


[Figure 4.4 panels (a)-(c): curves of minimum function value versus function evaluations (a) and time in days (b) for GP EI MCMC, GP EI per second, GP EI Opt, Random Grid Search, and 3x/5x/10x GP EI MCMC; panel (c) compares 3x and 5x GP EI MCMC on and off the grid.]

Figure 4.4: Different strategies of optimization on the Online LDA problem compared in terms of function evaluations (4.4a), walltime (4.4b) and constrained to a grid or not (4.4c).

LDA based on the t-th minibatch of document word count vectors. The minibatch size is a third parameter that must be chosen. Hoffman et al. [2010] relied on an exhaustive grid search of size 6 × 6 × 8, for a total of 288 hyperparameter configurations.
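The learning-rate schedule and the size of the exhaustive search can be sketched as follows; the particular grid values below are hypothetical placeholders, as only the 6 × 6 × 8 shape of the grid is stated in the text:

```python
import itertools

def lda_learning_rate(t, tau0, kappa):
    """Online LDA step size rho_t = (tau0 + t)^(-kappa) for minibatch t."""
    return (tau0 + t) ** (-kappa)

# Hypothetical placeholder settings; only the grid shape (6 x 6 x 8 = 288)
# is taken from the text, not the actual values searched.
tau0_grid = [1, 4, 16, 64, 256, 1024]                     # 6 settings
kappa_grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]               # 6 settings
batch_grid = [4, 16, 64, 256, 1024, 4096, 16384, 65536]   # 8 settings

grid = list(itertools.product(tau0_grid, kappa_grid, batch_grid))
print(len(grid))  # -> 288 configurations, one expensive evaluation each
```

The 288 cells of this grid are the function evaluations that the grid search must pay for, which is what makes Bayesian optimization attractive here.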

We used the code made publicly available by Hoffman et al. [2010] to run experiments

with online LDA on a collection of Wikipedia articles. We downloaded a random set of

249 560 articles, split into training, validation and test sets of size 200 000, 24 560 and 25 000

respectively. The documents are represented as vectors of word counts from a vocabulary

of 7702 words. As reported in Hoffman et al. [2010], we used a lower bound on the per

word perplexity of the validation set documents as the performance measure. One must also

specify the number of topics and the hyperparameters η for the symmetric Dirichlet prior over

the topic distributions and α for the symmetric Dirichlet prior over the per document topic

mixing weights. We followed Hoffman et al. [2010] and used 100 topics and η = α = 0.01

in our experiments in order to emulate their analysis and repeated exactly the grid search


reported in the paper3. Each online LDA evaluation generally took between five and ten hours to converge; the grid search thus requires approximately 60 to 120 processor-days to complete.

In Figures 4.4a and 4.4b, we compare our various strategies of optimization over the same

grid on this expensive problem. That is, the algorithms were restricted to only the exact

parameter settings as evaluated by the grid search. Each optimization was then repeated 100

times (each time picking two different random experiments to initialize the optimization with)

and the mean and standard error are reported4. Figure 4.4c also presents a 5-run average of

optimization with 3 and 5 times parallelized GP EI MCMC, but without restricting the new

parameter setting to be on the pre-specified grid (see supplementary material for details). A

comparison with their “on grid” versions is illustrated.

Clearly integrating over hyperparameters is superior to using a point estimate in this case.

While GP EI MCMC is the most efficient in terms of function evaluations, we see that

parallelized GP EI MCMC finds the best parameters in significantly less time. Finally, in

Figure 4.4c we see that the parallelized GP EI MCMC algorithms find a significantly better

minimum value than was found in the grid search used by Hoffman et al. [2010] while running

a fraction of the number of experiments.

4.4.3 Motif Finding with Structured Support Vector Machines

In this example, we consider optimizing the learning parameters of Max-Margin Min-Entropy

(M3E) Models [Miller et al., 2012], which include Latent Structured Support Vector Ma-

chines [Yu and Joachims, 2009] as a special case. Latent structured SVMs outperform SVMs

on problems where they can explicitly model problem-dependent hidden variables. A popular

example task is the binary classification of protein DNA sequences [Miller et al., 2012, Kumar

et al., 2010, Yu and Joachims, 2009]. The hidden variable to be modeled is the unknown

location of particular subsequences, or motifs, that are indicators of positive sequences.

Setting the hyperparameters, such as the regularisation term, C, of structured SVMs remains

a challenge and these are typically set through a time consuming grid search procedure as is

done in Miller et al. [2012], Yu and Joachims [2009]. Indeed, Kumar et al. [2010] avoided

hyperparameter selection for this task as it was too computationally expensive. However,

3. That is, the only difference was the randomly sampled collection of articles in the data set and the choice of the vocabulary. We ran each evaluation for 10 hours or until convergence.

4. The restriction of the search to the same grid was chosen for efficiency reasons: it allowed us to repeat the experiments several times efficiently, by first computing all function evaluations over the whole grid and reusing these values within each repeated experiment.


[Figure 4.5 panels (a)-(c): minimum function value versus time in hours (a) and function evaluations (b) for GP EI MCMC, GP EI per Second, their 3x parallel versions, and Random Grid Search; panel (c) compares covariance functions (Matern 52 ARD, SqExp, SqExp ARD, Matern 32 ARD).]

Figure 4.5: A comparison of various strategies for optimizing the hyperparameters of M3E models on the protein motif finding task in terms of walltime (4.5a), function evaluations (4.5b) and different covariance functions (4.5c).

Miller et al. [2012] demonstrate that results depend highly on the setting of the parameters,

which differ for each protein. M3E models introduce an entropy term, parameterized by α,

which enables the model to outperform latent structured SVMs. This additional performance,

however, comes at the expense of an additional problem-dependent hyperparameter. We

emulate the experiments of Miller et al. [2012] for one protein with approximately 40,000

sequences. We explore 25 settings of the parameter C, on a log scale from 10⁻¹ to 10⁶; 14 settings of α, on a log scale from 0.1 to 5; and four settings of the model convergence tolerance, ε ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. We ran a grid search over the 1400 possible combinations of these parameters, evaluating each over 5 random 50-50 training and test splits.
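This search space can be enumerated directly; a sketch, where grid values beyond the stated ranges are interpolated on a log scale as the text implies:

```python
import itertools
import math

def log_grid(lo, hi, n):
    """n settings spaced evenly on a log scale between lo and hi."""
    a, b = math.log10(lo), math.log10(hi)
    return [10 ** (a + (b - a) * i / (n - 1)) for i in range(n)]

C_grid = log_grid(1e-1, 1e6, 25)      # 25 settings of C
alpha_grid = log_grid(0.1, 5.0, 14)   # 14 settings of alpha
eps_grid = [1e-4, 1e-3, 1e-2, 1e-1]   # 4 convergence tolerances

grid = list(itertools.product(C_grid, alpha_grid, eps_grid))
print(len(grid))  # -> 1400 combinations, each run over 5 random splits
```

At 5 splits per cell this is 7000 training runs, which is why the comparison below constrains the Bayesian optimizers to the same grid.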

In Figures 4.5a and 4.5b, we compare the randomized grid search to GP EI MCMC, GP

EI per Second and their 3x parallelized versions, all constrained to the same points on the

grid. Each algorithm was repeated 100 times and the mean and standard error are shown.

We observe that the Bayesian optimization strategies are considerably more efficient than


[Figure 4.6: two plots of minimum function value versus function evaluations and versus hours for GP EI MCMC, GP EI Opt, GP EI per Second, GP EI MCMC 3x Parallel, and the human expert baseline.]

Figure 4.6: Validation error on the CIFAR-10 data for different optimization strategies.

grid search, which is the status quo. In this case, GP EI MCMC is superior to GP EI per Second in terms of function evaluations, but GP EI per Second finds better parameters faster than GP EI MCMC, as it learns to use a less strict convergence tolerance early on while exploring the other parameters. Indeed, 3x GP EI per Second is the least efficient in terms of function evaluations but finds better parameters faster than all the other algorithms. Figure

4.5c compares the use of various covariance functions in GP EI MCMC optimization on this

problem, again repeating the optimization 100 times. It is clear that the selection of an

appropriate covariance significantly affects performance and that the estimation of length-scale parameters is critical. The assumption of infinite differentiability imposed by the commonly used squared exponential covariance is too restrictive for this problem.
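The two covariance families compared in Figure 4.5c differ precisely in the smoothness they impose; a minimal sketch of the standard textbook forms, written as functions of a lengthscale-weighted distance r (with ARD, r² = Σ_d (x_d − x'_d)² / l_d²):

```python
import math

def sq_exp(r):
    """Squared exponential: infinitely differentiable sample paths."""
    return math.exp(-0.5 * r ** 2)

def matern52(r):
    """Matern 5/2: twice-differentiable sample paths, a weaker smoothness assumption."""
    s = math.sqrt(5.0) * r
    return (1.0 + s + s ** 2 / 3.0) * math.exp(-s)

# Both equal 1 at zero distance and decay with r; the Matern decays more slowly
# near zero, reflecting its rougher sample paths.
for r in (0.0, 0.5, 1.0, 2.0):
    print(round(sq_exp(r), 4), round(matern52(r), 4))
```

Swapping the kernel changes only this one function of distance, yet, as Figure 4.5c shows, it can dominate the quality of the resulting optimization.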

4.4.4 Convolutional Networks on CIFAR-10

Neural networks and deep learning methods notoriously require careful tuning of numerous

hyperparameters. Multi-layer convolutional neural networks are an example of such a model

for which a thorough exploration of architectures and hyperparameters is beneficial, as demonstrated in Saxe et al. [2011], but often computationally prohibitive. While Saxe et al. [2011] demonstrate a methodology for efficiently exploring model architectures, numerous

hyperparameters, such as regularisation parameters, remain. In this empirical analysis, we

tune nine hyperparameters of a three-layer convolutional network [Krizhevsky, 2009] on the

CIFAR-10 benchmark dataset using the code provided5. This model has been carefully tuned

by a human expert [Krizhevsky, 2009] to achieve a highly competitive result of 18% test error

on the unaugmented data, which matches the published state of the art result [Coates and

5. Available at: http://code.google.com/p/cuda-convnet/


Ng, 2011] on CIFAR-10. The parameters we explore include the number of epochs to run

the model, the learning rate, four weight costs (one for each layer and the softmax output

weights), and the width, scale and power of the response normalization on the pooling layers

of the network.

We optimize over the nine parameters for each strategy on a withheld validation set and

report the mean validation error and standard error over five separate randomly initialized

runs. Results are presented in Figure 4.6 and contrasted with the average results achieved

using the best parameters found by the expert. The best hyperparameters found by the GP EI MCMC approach achieve an error on the test set of 14.98%, which is over 3% better than the expert and the state of the art on CIFAR-10. The same procedure was repeated on the CIFAR-10

data augmented with horizontal reflections and translations, similarly improving on the

expert from 11% to 9.5% test error and achieving, to our knowledge, the lowest error reported

on the competitive CIFAR-10 benchmark.

4.5 Conclusion

We presented methods for performing Bayesian optimization for hyperparameter selection of

general machine learning algorithms. We introduced a fully Bayesian treatment for EI, and

algorithms for dealing with variable time regimes and running experiments in parallel. The

effectiveness of our approaches was demonstrated on three challenging, recently published

problems spanning different areas of machine learning. The resulting Bayesian optimization

finds better hyperparameters significantly faster than the approaches used by the authors

and surpasses a human expert at selecting hyperparameters on the competitive CIFAR-10

dataset, beating the state of the art by over 3%.


Chapter 5

Opportunity Cost in Bayesian

Optimization

A major advantage of Bayesian optimization is that it generally requires fewer function

evaluations than optimization methods that do not exploit the intrinsic uncertainty associated

with the task. The ability to perform well with fewer evaluations of the target function makes

the Bayesian approach to optimization particularly compelling when the target function is expensive to evaluate. The notion of expense, however, depends on the problem and may

even depend on the location in the search space. For example, we may be under a time

deadline and the experiments we wish to run may have varying duration, as when training

neural networks or finding the hyperparameters of support vector machines. In this section we

develop a new idea for selecting experiments in this setting, that builds in information about

the opportunity cost of some experiments over others. Specifically, we consider Bayesian

optimization where 1) there are limited resources, 2) function evaluations vary in resource

cost across the search space, and 3) the costs are unknown and must be learned.

5.1 Introduction

The optimization of machine learning models frequently involves careful tuning of hyperparam-

eters. Unfortunately, however, this tuning is often a “black art”, requiring expert experience,

rules of thumb, or sometimes brute-force search. Bayesian optimization would seem to provide

an elegant approach to this “meta-learning” problem, as the hyperparameter-specific learning

may be an expensive procedure in time and other resources. Algorithms optimizing expected


improvement [Mockus et al., 1978] and the Gaussian process upper confidence bound [Srinivas

et al., 2010] are appealing in this setting as they have been shown to be efficient in the

number of function evaluations required to find the global optimum of many multimodal

black-box functions [Bull, 2011].

An important real-world caveat in hyperparameter learning and other problems, however,

is that the cost of function evaluations may vary over the space. For example, in a neural

network, the learning algorithm will be trained to convergence, but how long this takes may

change with the learning rate, regularization strength, and number of hidden units. Even

without considering duration, the advent of cloud computing makes it possible to quantify

economically the cost of requiring large-memory machines for learning, changing the actual

cost in dollars of an experiment with a different number of hidden units. Finally, network

communication between processes and in loading data may also have direct costs, or may

determine whether an experiment can be conducted using, e.g., a GPU.

Our framework for thinking about varying expenses under resource constraints is to model

the opportunity cost associated with candidate experiments. We would like our Bayesian

optimization algorithm to consider running a larger number of cheap experiments, when

the cost-ignorant algorithm would otherwise only be able to run a few expensive ones. We

examine two different ideas along these lines. When there is only a single constrained resource

(e.g., time to a deadline), we have found that a simple myopic variant of expected improvement

performs well. As we will discuss, however, this approach does not generalize well when there

are multiple resource constraints. For this more general case, we develop a new algorithm

that attempts to directly compute an approximation to the opportunity cost.

As our goal is to develop effective meta-learning algorithms, we also address the case where

the resource costs are unknown a priori. In keeping with the overall philosophy of Bayesian

optimization, we learn these costs and represent their associated uncertainty, enabling us to

make choices that incorporate our own ignorance appropriately. We use Gaussian process

priors to model these cost functions.

5.2 Expected Improvement

Bayesian optimization refers to a Bayesian motivated method for seeking the extrema of

expensive functions [Brochu et al., 2010]. Intuitively, the Bayesian approach suggests that

given a prior belief over the form of a function and a finite number of observations, one can


make an educated guess about where the optima of such a function should be. Much of the

Bayesian approach was introduced and developed by Mockus et al. [1978], Mockus [1994,

1989]. In particular, Mockus et al. [1978] argued that when searching for the optima of a

function, rather than assume this corresponds to the optima of the estimated function (i.e.

the predictive mean in the Gaussian case), one should search where the expected improvement

is greatest. Schonlau et al. [1998] derive an analytic form for the case where a single next

point is selected, while an approximate approach is suggested for the less myopic next n-point

alternative.
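Under a Gaussian predictive distribution N(µ(x), σ²(x)), the single-point analytic form can be sketched as follows; it is written here for minimization, and `mu`, `sigma`, and `best` are assumed to come from the GP posterior and the incumbent observation:

```python
import math

def expected_improvement(mu, sigma, best):
    """Analytic EI for minimization: EI = sigma * (gamma * Phi(gamma) + phi(gamma)),
    with gamma = (best - mu) / sigma, under a Gaussian predictive N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    gamma = (best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))            # normal CDF
    phi = math.exp(-0.5 * gamma ** 2) / math.sqrt(2.0 * math.pi)    # normal pdf
    return sigma * (gamma * Phi + phi)

# At a point predicted equal to the incumbent, EI reduces to sigma * phi(0):
print(round(expected_improvement(mu=0.0, sigma=1.0, best=0.0), 4))  # -> 0.3989
```

Note how EI grows with the predictive uncertainty sigma, which is the mechanism that trades off exploration against exploitation.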

5.3 Expected Improvement with a Deadline

Ginsbourger and Riche [2010a] explored the scenario of optimization using the expected

improvement criterion when the number of available function evaluations is fixed and known.

They demonstrate that choosing the next experiment to run based on the standard myopic

expected improvement algorithm is inferior to choosing it based on the joint expected improve-

ment of the available function evaluations. Computing the joint EI, however, is intractable in

general. Therefore, we explore a greedy algorithm and a Monte Carlo approximation.

5.3.1 Modeling Time

Just as we do not know the true objective function f(x), we also do not know the duration

function c(x) : X → R+. Naturally, we can assume our duration observations to be corrupted

by noise. The duration of a computer simulation, for example, generally varies due to

numerous external sources such as other concurrent processes. Assuming that the cost of

any given evaluation of f(x) is always positive and that the duration is a smooth function of

the inputs, we can employ our Gaussian process machinery to model ln c(x) alongside f(x).

Formally, we assume that the function ln c(x) is drawn from a Gaussian process prior and that

our observations are of the form {x_n, t_n}_{n=1}^N, where t_n ∼ N(ln c(x_n), ν) and ν is the variance

of noise introduced into the log duration observations. Using the posterior mean of the GP

gives the expected log duration of a function evaluation f(x) given the parameterization x.


5.3.2 Expected Improvement per Second

As a baseline greedy approach, we use the expected improvement per second acquisition

function introduced in Section 4.3.2. Here we modify the acquisition function such that the

next point to be chosen is the one for which the expected improvement per second is highest.

That is, we greedily choose the experiment that is expected to give us the most efficiency in

terms of improvement over time. Following the notation of Section 4.2.2 this gives the EI per

Second acquisition function,

a_{EI/S}(x; {x_n, y_n, t_n}, θ, ψ) = a_{EI}(x; {x_n, y_n}, θ) / exp(µ(x; {x_n, t_n}, ψ)),   (5.1)

where µ(x; {x_n, t_n}, ψ) denotes the posterior mean function of the GP over log durations, parameterized by ψ.
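A minimal sketch of selecting a candidate under Equation 5.1, assuming the per-candidate EI values and posterior mean log durations have already been computed from the two GPs:

```python
import math

def ei_per_second(ei_values, log_duration_means):
    """Pick the candidate index maximizing EI / exp(predicted log duration)."""
    scores = [ei / math.exp(m) for ei, m in zip(ei_values, log_duration_means)]
    return max(range(len(scores)), key=scores.__getitem__)

# A slightly worse EI can win if the experiment is predicted to be much cheaper:
ei = [1.0, 0.8]                              # candidate 0 has higher EI ...
log_dur = [math.log(100.0), math.log(5.0)]   # ... but is predicted 20x slower
print(ei_per_second(ei, log_dur))  # -> 1
```

Dividing by the exponentiated posterior mean keeps the predicted duration strictly positive, which is exactly why the GP is placed over ln c(x) rather than c(x).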

5.3.3 Monte Carlo Multi-Step Myopic EI (MCMS)

Rather than evaluate the joint EI of each possible subset of experiments that fits within the

time horizon, which is intractable in general, we develop a Monte Carlo approximation to

running multiple steps of myopic EI optimization. That is, we sample paths conditionally on

each candidate point, recursively sampling function values for each experiment, computing

EI and adding the top EI candidate until the total estimated time taken to evaluate the path

exceeds the time deadline. The minimum function value observed along the path is taken

to be the improvement achieved by following that path. The expected improvement is then

taken to be the average improvement over multiple sampled paths. This algorithm is outlined

in Algorithm 2.

5.4 Experiments

5.4.1 Simple Branin-Hoo Example

To illustrate the advantage of our approach, we first present a simple toy example. The

Branin-Hoo function is a common benchmark for Bayesian optimization techniques [Jones,

2001] that is defined over x ∈ R2 where 0 ≤ x1 ≤ 15 and −5 ≤ x2 ≤ 15. Rather than assume

that each function evaluation is equally expensive, we assume that half of the search space is


Algorithm 2 A Monte Carlo algorithm for computing the next experiment to run given a bounded amount of time:
1:  maxtot ← 0
2:  for m = 1 → M do
3:      best_m ← max(y)                     [Initialize to current maximum.]
4:      timeleft ← T                        [Initialize to total time left.]
5:      z_cur ← z_j                         [Starting point is the next-step candidate.]
6:      while timeleft > 0 do
7:          Sample y_cur = f(z_cur)         [Sample the function from the GP, given history.]
8:          if y_cur > best_m then
9:              best_m = y_cur              [Update the maximum.]
10:         end if
11:         timeleft = timeleft − g(z_cur)  [Subtract the runtime of this expt.]
12:         Update the GP posterior for this path.
13:         z_cur = argmax_z EI(z)          [Choose next point with myopic EI.]
14:     end while
15:     maxtot = maxtot + best_m            [Accumulate for later sample average.]
16: end for
17: Ψ(z_j) = maxtot / M                     [Divide to get average.]
Return max(Ψ)
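A simplified sketch of this rollout follows. Sampling from the GP posterior and the inner myopic EI maximization are replaced by random stand-ins, so this illustrates only the structure of the Monte Carlo estimate, not the full algorithm:

```python
import random

def mcms_estimate(z_start, sample_value, duration, deadline, n_paths=500, seed=0):
    """Monte Carlo estimate of the best value reachable within `deadline`,
    starting from candidate z_start (maximization, as in Algorithm 2).

    Stand-ins: `sample_value` plays the role of sampling from the GP posterior,
    and the next point is drawn at random instead of maximizing myopic EI."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        best, left, z = float("-inf"), deadline, z_start
        while left > 0:
            best = max(best, sample_value(z, rng))  # fantasized outcome
            left -= duration(z)                     # charge this experiment's cost
            z = rng.random()                        # stand-in for inner EI search
        total += best
    return total / n_paths

# Longer deadlines should yield a better expected best value:
value = lambda z, rng: rng.random()
cost = lambda z: 1.0
print(mcms_estimate(0.5, value, cost, deadline=10) >
      mcms_estimate(0.5, value, cost, deadline=2))  # -> True
```

The candidate z_j with the highest rollout score Ψ(z_j) is the one selected, so cheap candidates that leave budget for many follow-up experiments naturally score well.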

exactly ten times more expensive to evaluate than the other half, i.e. function evaluations in

the expensive half, x2 < 2.5, take ten fantasy seconds rather than one. We further assume

that there is a deadline of 50 seconds to complete the optimization. The halves are split

in such a way that equally good optima exist in either half. Thus, an algorithm that is aware of the time required to evaluate each candidate point should prefer running cheaper experiments and reach a better

optimum than a time agnostic algorithm. Each algorithm was evaluated on this toy scenario

and the results are presented in Figure 5.1. Notice that the MCMS algorithm is capable of

running many more experiments in the same amount of time, and therefore is able to find a

better optimum than standard EI.
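The toy cost structure in this example can be stated directly; the sketch below simply restates the arithmetic of the 50-second deadline under the fantasized costs:

```python
def fantasy_cost(x1, x2):
    """Toy cost on the Branin-Hoo domain: the half with x2 < 2.5 takes
    ten fantasy seconds per evaluation, the other half only one."""
    return 10.0 if x2 < 2.5 else 1.0

def evaluations_within_deadline(x2, deadline=50.0):
    """How many evaluations fit in the deadline when staying at one cost level."""
    count, left = 0, deadline
    while left >= fantasy_cost(0.0, x2):
        left -= fantasy_cost(0.0, x2)
        count += 1
    return count

print(evaluations_within_deadline(x2=10.0))  # -> 50 (cheap half)
print(evaluations_within_deadline(x2=0.0))   # -> 5 (expensive half)
```

Since equally good optima exist in both halves, a cost-aware strategy gets a tenfold larger evaluation budget by preferring the cheap half.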

5.4.2 Multiple Kernel Learning with Support Vector Machines

A natural application for Bayesian optimization is the optimization of hyperparameters within

multiple kernel support vector machines. Traditionally, the hyperparameters of support vector

machines are optimized using cross-validation. For a small number of parameters a grid

search over parameter values is feasible. However, in a multiple kernel learning setting,

where potentially many kernel hyperparameters must be set, exhaustive grid search becomes


[Figure 5.1 panels (a)-(d): points evaluated by standard EI (a) and MCMS EI (b), average time per experiment (c) and average minimum value found (d) for MCMS, EI per Second, and EI.]

Figure 5.1: An example sequence of points evaluated within the 50 second deadline by (a) standard EI and (b) MCMS EI. Note that inputs less than 2.5 on the horizontal axis are ten times more expensive to evaluate. Thus, the time-aware algorithms spend most of their evaluations in the cheaper half. Each algorithm was used to optimize the Branin-Hoo function with fantasized time costs with a deadline of 50 seconds. (c) shows the average time taken per experiment by each algorithm and (d) shows the average minimum found within the deadline.

Data Set     MCMS          EI/S          EI            Deadline (s)
Ionosphere   10.4 ± 0.74   9.40 ± 1.27   13.7 ± 0.96   10
w1a          2.36 ± 0.14   2.20 ± 0.08   2.61 ± 0.06   150
Australian   30.8 ± 1.50   30.0 ± 0.44   32.0 ± 0.04   250
Sonar        12.8 ± 0.19   13.0 ± 0.18   13.8 ± 0.21   2

Table 5.1: Minimum cross-validation error values in percent found by the optimization routine for various Bayesian optimization algorithms performing hyperparameter optimization on multiple kernel SVMs applied to standard UCI datasets.

computationally infeasible. In this example, expected improvement is used to optimize

the parameters of a support vector machine with multiple kernels. In this case, we use the

sum of an RBF and a linear kernel, each scaled by a scale parameter. The parameters to be optimized are the kernel scale parameters, the RBF kernel width, the slack penalty, C, and the

convergence tolerance parameter. Experiments were conducted on four binary classification

datasets from the UCI data repository, where we use EI to find the hyperparameter settings

that minimize ten-fold cross validation error. Results are presented in Table 5.1 and curves

demonstrating the average minimum function value found by each algorithm versus the amount

of time are presented in Figure 5.2.


[Figure 5.2: minimum function value versus seconds of computation for EI per second, EI, and MCMS.]

Figure 5.2: Example optimization curves for optimization performed on the w1a data demonstrating the minimum function values found by each algorithm as a function of time. In this case, the optimization deadline was chosen to be 150 seconds.

[Figure 5.3: minimum value found by MCMS, EI per Second, and EI.]

Figure 5.3: A comparison of the minimum error achieved before the deadline by the various algorithms on the neural network problem. The error values are averaged over twenty evaluations and the error bars are standard error. In this case MCMS outperforms both algorithms.

5.4.3 Training a Neural Network

As a final experiment, we use Bayesian optimization to optimize the hyperparameters of

a neural network. This scenario is borrowed from an undergraduate course assignment at

the University of Toronto, where students are given neural network code and are asked to

tune the hyperparameters to reach a validation classification error of 500. Thus, the idea is

that this is a challenge for someone with introductory knowledge of neural networks. The

parameters required to tune are the number of hidden units to use, the number of epochs

of unsupervised pretraining, a weight decay for pretraining, the learning rate for stochastic

gradient descent, the number of epochs of backpropagation, and the weight decay during

backpropagation. All algorithms were run on this problem 20 times, with a deadline chosen

such that the algorithms could run approximately 10 to 40 evaluations, and the

results are presented in Figure 5.3.


5.5 Conclusion

Bayesian optimization has been established as a highly effective approach to optimizing noisy

black-box functions when parsimony in the number of function evaluations is desirable. The

desire for efficiency in terms of the number of function evaluations is, however, a proxy for

the real cost of running each function evaluation. Given that the cost of function evaluations

is often dependent on the parameters being optimized, it is important to make the distinction

between the real cost and simply the number of function evaluations. For setting the

hyperparameters of machine learning models, this cost can often be quantified as the amount

of time it takes to retrieve the result of a function evaluation. Given finite computational

resources and temporally or computationally “expensive” function evaluations, the true desire

of Bayesian optimization is generally to achieve the best hyperparameters in a bounded

amount of time. Analogously, we wish to find the best result in a bounded amount of resources.

There is no value to potentially finding a better result after the resources are exhausted or

the deadline has passed (a notion well observed by machine learning researchers adjusting

model hyperparameters before a publication deadline). Thus, in Bayesian optimization we

should expect to do better, in terms of the best value found in bounded resources, if we take

the opportunity cost of each function evaluation into account. Naturally, as the function

being optimized is unobserved, we expect the parameter-dependent cost to be unknown as well. However, given some number of observations, we can estimate the expected cost in the same way that we estimate the function itself: through the distribution over functions afforded to us by the Gaussian process. In Section 5.4, we show that incorporating the expected cost,

when optimizing the hyperparameters of popular machine learning models, always achieves a

better result in less time than standard Bayesian optimization.


Chapter 6

Bayesian Optimization under

Unknown Constraints

6.1 Introduction

The aforementioned approach to Bayesian optimization, and the literature in general, assumes

that the bounds of the optimization for each parameter are independent and known a priori.

Often this is not the case. For many problems, complex combinations of the parameters being

optimized over can cause numerical instabilities. When optimizing deep neural networks, as

in the deep convolutional network example of Section 4.4.4, parameters in the optimization

routine exhibit this property. That is, the learning rate and momentum parameters common

to stochastic gradient descent optimization routines should each be bounded individually but

also interact in non-trivial ways. Often numerical instability can be avoided if at least one

parameter is a small value. However, if both are large values, relative to their bounds, the

optimization can become unmanageable and quickly reach poor or numerically undefined

results. Clearly, such results will violate the smoothness prior induced by the Gaussian

process and result in a poor model of the function being optimized.

Resource constraints can similarly be challenging to define with independent bounds on

parameters. In keeping with the deep neural network example, while training such a model

one can violate memory constraints in numerous ways. For example, the number of hidden

layers and the number of hidden units in each layer cannot be treated independently when

considering memory requirements. However, bounding each in such a manner as to avoid


violating memory requirements altogether would prevent the Bayesian optimization routine

from exploring many valid parameter settings.

6.2 A Constraint Weighted Acquisition Function

The optimization of an unknown, multi-modal and noisy function subject to unknown

constraints may seem challenging at best. However, if there is a smooth, continuous boundary

between the constraint violating region and valid points, we can model the constraint boundary

using similar tools to the Bayesian optimization itself. That is, for each candidate point in the

Bayesian optimization we assign a probability that the point is a constraint violation given

ground truth examples of valid and invalid points. Under an assumption of independence,

the expected improvement of a candidate point becomes weighted by the probability that the

point is valid and not a constraint violation. Denoting x̃ as a superset of x, i.e. x ⊆ x̃, with

corresponding labels c ∈ {0, 1} indicating which points are not constraint violations, we can

write the constraint-weighted integrated acquisition function as:

a(x ; {x̃_n, y_n, c_n}) = ∫ a(x ; {x_n, y_n}, θ) p(θ | {x_n, y_n}_{n=1}^N) p(c | x, {x̃_n, c_n}_{n=1}^N, Ω) p(Ω | {x̃_n, c_n}_{n=1}^N) dθ dΩ,   (6.1)

where c is the validity label corresponding to x and Ω are the hyperparameters of a discriminative

probabilistic model. Note that the relative complement of x in x̃ is the set of constraint-violating

points. As their observed values are treated as undefined, they are not included in

the computation of the standard acquisition function a(x ; {x_n, y_n}, θ) p(θ | {x_n, y_n}_{n=1}^N).
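A minimal sketch of how such a constraint-weighted acquisition might be computed, averaging over hyperparameter samples as in the integrated acquisition above. The `gp_samples` and `clf_samples` callables are hypothetical stand-ins for posterior predictions under sampled θ and Ω; they are not part of the method's actual interface.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """Vectorized EI for minimization under one GP hyperparameter sample."""
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def integrated_constrained_acquisition(candidates, gp_samples, clf_samples, best):
    """Monte Carlo estimate of the constraint-weighted integrated acquisition:
    for each pair of sampled hyperparameters, weight the EI under the
    regression GP by the classifier's probability that the point is valid,
    then average over the samples."""
    total = np.zeros(len(candidates))
    for gp, clf in zip(gp_samples, clf_samples):
        mu, sigma = gp(candidates)   # predictive mean and std per candidate
        p_valid = clf(candidates)    # P(c = 1 | x) per candidate
        total += p_valid * expected_improvement(mu, sigma, best)
    return total / len(gp_samples)
```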

While there are numerous options for the form of p(c | x, {x̃_n, c_n}_{n=1}^N, Ω), a natural choice

is to use a Gaussian process classifier as it makes similar assumptions about the data and

much computation can be shared with the GP regression. To implement the GP classifier,

we use a latent Gaussian process with a probit likelihood. This corresponds to modeling a

Gaussian process prior over latent function values, f = [f_1, f_2, . . . , f_N]^T, which are “squashed”

through the cumulative distribution function of the standard normal Φ(·) to produce class

probabilities:

p(c | x, {x̃_n, c_n}_{n=1}^N, Ω) = ∫ Φ(f) N(f | 0, Σ_Ω + νI) df ,   (6.2)


Algorithm 3 Selecting the next point to evaluate in constrained optimization

Input: Observations {x̃_n, y_n, c_n}_{n=1}^N
  Generate a set of M candidate points from the Sobol sequence [Bratley and Fox, 1988]: X_cand = Sobol(M)
  Generate H Gaussian process hyperparameter samples [Murray and Adams, 2010]
  for h = 1 to H do
      Sample θ_h (see Section 4.3.5)
      Sample Ω_h (Murray et al. [2010])
  end for
  Adjust each candidate to locally optimize the constrained acquisition function:
  for each x_m in X_cand do
      x_m ← local argmax_x Σ_{h=1}^H a(x ; {x_n, y_n, c_n}, θ_h, Ω_h), initialized at x_m
  end for
  x_next = argmax_{x ∈ X_cand} Σ_{h=1}^H a(x ; {x_n, y_n, c_n}, θ_h, Ω_h)
  return x_next

where Σ_Ω indicates the GP covariance matrix over x̃ and x with hyperparameters given by

Ω. We estimate the integral of Equation 6.2 following the elliptical slice sampling algorithm

of Murray et al. [2010]. Observe that the noise term, ν, in the Gaussian process classifier

allows the model to elegantly deal with label noise and mis-labelings.
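The integral in Equation 6.2 can be illustrated at a single test point, where the latent predictive distribution reduces to a one-dimensional Gaussian N(m, v). The following Monte Carlo sketch is our own illustration; for the probit likelihood this integral also has the closed form Φ(m / √(1 + v)), which the estimate can be checked against.

```python
import numpy as np
from scipy.stats import norm

def probit_class_probability(m, v, n_samples=200_000, seed=0):
    """Monte Carlo version of Equation 6.2 at one test point: draw latent
    function values f ~ N(m, v) and average the probit squashing Phi(f)."""
    rng = np.random.default_rng(seed)
    f = rng.normal(m, np.sqrt(v), size=n_samples)
    return norm.cdf(f).mean()
```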

6.2.1 Optimizing the Acquisition Function

In the constrained case, the Bayesian optimization routine proceeds similarly to standard

Bayesian optimization as explained in Section 4.3.4. We can similarly optimize a dense

grid of candidate points, but instead over the new acquisition function a(x_m ; {x_n, y_n, c_n}). GP hyperparameters for the classifier are sampled according to the elliptical slice sampling

strategy of Murray et al. [2010]. An outline of the algorithm used to select the next point at

which to evaluate the function being optimized is provided in Algorithm 3.

6.2.2 Obtaining Labels

It is assumed that the above algorithm is given observations, {x̃_n, y_n, c_n}_{n=1}^N, consisting of

function values, y_n, and constraint label observations, c_n. It is naturally unsatisfying to assume

that a set of class-labeled examples {x̃_n, c_n}_{n=1}^N is available before the optimization proceeds.

As the function of interest is assumed to be unknown a priori, we also assume that the


constraint violating region is unknown. Thus, in this work and in the empirical evaluation in

Section 6.4 we accumulate labels during the optimization routine. We assume that evaluating

a function at input x reveals the function value y and a binary value indicating if the point

is valid, c. In practice, this binary value is obtained simply by, for example, checking if y

is a real number or under a predefined threshold. The optimization routine proceeds by

accumulating both labels and function observations at each iteration.
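A label of this kind might be derived as follows; this is a sketch of the check described in the text, and the threshold is an illustrative knob rather than a value specified in this work.

```python
import math

def constraint_label(y, threshold=float("inf")):
    """Derive the binary validity label c from a function evaluation: the
    point is valid (c = 1) only if y is a real, finite number below a
    predefined threshold."""
    if not isinstance(y, (int, float)):
        return 0
    return 1 if (math.isfinite(y) and y < threshold) else 0
```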

6.3 Related Work

Gramacy and Lee [2010] suggest a similar approach to constrained Bayesian optimization,

also weighting the expected improvement criterion by the predictions of a Gaussian process

constraint classifier. We extend the approach to use elliptical slice sampling, integrating

out the latent function values, f , and hyperparameters, θ and Ω. Given the importance of

integrating out the Gaussian process hyperparameters in Bayesian optimization, as developed

in Section 4.3 and empirically demonstrated in Section 4.4, we believe that the fully Bayesian

treatment will be equally necessary here. Note that the “Constrained Bayesian Optimization”

approach of Azimi [2012] addresses a very different problem from the constrained problem in this

work. Azimi [2012] considers the case where individual experiments or function evaluations are

given constrained regions, as sub-rectangles of the unit hypercube, of input space to operate

on and return a function value from that region.

6.4 Empirical Analysis

In this section we will empirically validate the constrained Bayesian optimization algorithm

introduced above in two example problems. In Section 7.3, we demonstrate the effectiveness

of the approach on the real world problem of tuning a complex computer vision based fall

detection system from assistive technology.

6.4.1 Constrained Branin-Hoo

As an initial empirical analysis we construct a constrained variant of the Branin-Hoo function.

The Branin-Hoo function was augmented such that it was padded with an invalid region and

the bounds were increased to include this constraint violating region. Any function evaluations


[Figure 6.1 contains two panels, (a) and (b), each plotting Min Function Value against Function evaluations for GP EI MCMC and Constrained GP EI MCMC.]

Figure 6.1: A comparison of GP EI MCMC and Constrained GP EI MCMC on the constrained Branin-Hoo function. The Branin-Hoo function was augmented such that it was padded with an invalid region and the bounds were increased to include this invalid region. Any function evaluations in the invalid region were treated as returning the worst possible (highest) value of the Branin-Hoo function. Figures 6.1a and 6.1b respectively show a comparison of the progression of the minimum value (averaged over 100 runs) found by the GP EI MCMC algorithm and constrained GP EI MCMC algorithm when the bounds are increased by 10% and 50% respectively to include an invalid region. For the Constrained GP EI MCMC algorithm any function evaluations in the invalid region were treated as constraint violations.

in the invalid region were treated as returning the worst possible (highest) value of the Branin-

Hoo function. For the Constrained GP EI MCMC algorithm any function evaluations in the

invalid region were treated as constraint violations. Figures 6.1a and 6.1b respectively show

a comparison of the progression of the minimum value (averaged over 100 runs) found by

the GP EI MCMC algorithm and constrained GP EI MCMC algorithm when the bounds

are increased by 10% and 50% respectively to include an invalid region. In this case, the

standard Bayesian optimization strategy performs poorly because the Gaussian process has

trouble modeling the discontinuities in the function. The GP attempts to compensate for the

violation of the smoothness prior by learning extremely small length-scales, resulting in a

very poor model of the actual Branin-Hoo function. The constrained variant, however, learns

the constraint boundary allowing the regression GP to model the actual Branin-Hoo function.

Note that one may consider simply not including observations from the invalid region in the

model. However, this would result in the pathological case where uncertainty in the invalid

region will increase such that the model will only explore therein and never converge.
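For concreteness, a sketch of how such a padded Branin-Hoo variant might be constructed. The standard function and its box are well known; the padding convention and helper names here are our own illustration of the setup described above.

```python
import numpy as np

def branin(x1, x2):
    """The standard Branin-Hoo function; its usual box is x1 in [-5, 10],
    x2 in [0, 15], with global minimum value ~0.397887."""
    b = 5.1 / (4.0 * np.pi ** 2)
    c = 5.0 / np.pi
    t = 1.0 / (8.0 * np.pi)
    return (x2 - b * x1 ** 2 + c * x1 - 6.0) ** 2 \
        + 10.0 * (1.0 - t) * np.cos(x1) + 10.0

def padded_branin(x1, x2):
    """Constrained variant: points outside the original box count as
    constraint violations and return no usable function value."""
    valid = (-5.0 <= x1 <= 10.0) and (0.0 <= x2 <= 15.0)
    return (branin(x1, x2), 1) if valid else (float("nan"), 0)

def widened_bounds(pad=0.1):
    """The box handed to the optimizer: each dimension (width 15) widened
    by `pad` (10% here) so that it includes the invalid padding region."""
    return ((-5.0 - 15.0 * pad, 10.0 + 15.0 * pad),
            (0.0 - 15.0 * pad, 15.0 + 15.0 * pad))
```

For the unconstrained baseline, an evaluation in the padding would instead return the worst Branin-Hoo value, as in the experiment above.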


[Figure 6.2 contains two panels, (a) and (b), each plotting Min Function Value against Function evaluations for GP EI MCMC and Constrained GP EI MCMC.]

Figure 6.2: A comparison of GP EI MCMC and Constrained GP EI MCMC on the constrained deep neural network example. The classification error, averaged over 40 runs, is shown per function evaluation. Figure 6.2a shows the classification error when there are no constraints and 6.2b when there is a constraint on the total number of parameters in the two-layer neural network.

6.4.2 Deep Neural Networks

In this example we consider the training of a deep neural network, trained according to the

“dropout” strategy of Hinton et al. [2012], on a subset of the popular MNIST digits data. We

take a sample of one thousand training examples (from fifty thousand) to make the training

faster and enable the experiment to be repeated 40 times. We optimize the error over a

validation set of ten thousand examples with respect to four parameters. These are the initial

learning rate (from 0 to 4) for stochastic gradient descent, the number of hidden units in each

of two hidden layers (from 1 to 500) and the number of epochs of stochastic gradient descent

training (from 1 to 350). In the first experiment, we compare the constrained variant of the

GP EI MCMC algorithm to the GP EI MCMC algorithm introduced in Chapter 4 on this

problem. For the GP EI MCMC algorithm we consider any undefined results or constraint

violations to be 90% error. In this experiment, adjusting the learning rate can result in

a constraint violation as certain settings will result in numerical instability or arbitrarily

poor results. We contrast this experiment to one in which there is a simulated memory

constraint that is violated when there are over one hundred thousand parameters. Such a

memory constraint could arise, for example, in an embedded application. As the input

dimensionality is 784, this constraint can easily be violated by numerous combinations of the

layer sizes. However, setting bounds on the layer sizes would discard a significant proportion

of the valid configurations. Figures 6.2a and 6.2b demonstrate the results of the two Bayesian

optimization strategies on the two experiments.
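The parameter count that triggers the simulated constraint can be sketched as follows. Whether bias terms are included in the count is our assumption; the text only states a 100,000-parameter budget, 784 inputs, and two hidden layers of up to 500 units each.

```python
def n_parameters(h1, h2, n_in=784, n_out=10):
    """Total weights and biases of a fully connected network with two
    hidden layers of sizes h1 and h2."""
    return (n_in * h1 + h1) + (h1 * h2 + h2) + (h2 * n_out + n_out)

def violates_memory_constraint(h1, h2, budget=100_000):
    """The simulated memory constraint from the experiment above."""
    return n_parameters(h1, h2) > budget
```

Even moderate layer sizes exceed the budget because of the 784-dimensional input, which is why rectangular bounds on h1 and h2 would discard many valid configurations.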


We see that in the initial experiment, GP EI MCMC and constrained GP EI MCMC both

appear to converge to the same best result within approximately 20 iterations. However,

we notice in Figure 6.2b that once the memory constraint is added to the problem, the

constrained GP EI MCMC algorithm significantly outperforms GP EI MCMC, as expected.

6.5 Conclusion

In this chapter we introduce a methodology for performing Bayesian optimization in the

presence of unknown constraints. We formulate a constrained acquisition function through

incorporating a GP classifier that weights the acquisition function by the probability that

a candidate point is valid. We demonstrate empirically on an analytic function and a deep

neural network example that the resulting algorithm significantly outperforms the standard

approach to Bayesian optimization in the presence of unknown constraints. In Section 7.3

we apply the constrained Bayesian optimization algorithm to a well-motivated real world

problem from the domain of assistive technology.


Chapter 7

Applications to Assistive Technology

and Health Informatics

Assistive technology is a field with extremely well-motivated problems which, although

originally appearing insurmountable, are increasingly solvable with the careful application of

artificial intelligence. This is a result of the simple and highly predictable notion that there

will be a major shortage in caregivers for an ever growing population of elderly adults needing

care. Any solution will have to replace caregivers in tasks that can be automated, or at a

minimum use technology to make caregivers more efficient. Thus, applying and developing

artificial intelligence would seem to be the clear path towards a solution. Indeed, significant

effort has been made to this end in a number of key projects. The COACH system [Mihailidis

et al., 2008] uses artificial intelligence to guide dementia patients through completing activities

of daily living, which they would otherwise need a caregiver for. The HELPER [Belshaw et al.,

2011a] system uses computer vision, speech recognition and a machine learning classifier to

detect if someone has fallen down and coordinate emergency response. Snoek et al. [2009]

developed a system to detect falls and other unsafe events on stairs using computer vision

and machine learning techniques. The Autominder system [Pollack et al., 2003], deployed as

a robot, uses a variety of AI techniques to model a subject’s daily planning, reason about the

execution of their plans, and then issue reminders if appropriate. Chu et al. [2012] similarly

employ partially observable Markov decision processes to model the state of a subject, reason

about their schedule and plans, and issue prompts if necessary. The NURSEBOT [Montemerlo

et al., 2002] project uses artificial intelligence to develop a robotic guide for the elderly. Adami

et al. [2011] use machine learning to classify movement in bed for long term monitoring.


Cook [2012] and Rashidi and Cook [2009] advocate and use machine learning techniques to

monitor elderly adults in smart homes. Although machine learning algorithms have only been

employed to solve problems in assistive technology and health informatics in recent years,

they are poised to make a major impact across a wide variety of domains and applications,

of which the aforementioned are only a small sample.

However, more advanced artificial intelligence techniques, and machine learning in particular,

are inaccessible to many of the highly multidisciplinary assistive technology researchers.

Machine learning remains difficult to use because it requires significant domain knowledge

to apply. Answering the questions of which feature extraction method or which classifier to

use requires a broad knowledge of the domain. Knowing how to set the hyperparameters of

particular machine learning models often requires a deep understanding of the particular

models (although, as shown in Chapter 4, even those with the most expertise cannot currently

optimally set these parameters using standard techniques). These issues are not only

frustrating to assistive technology researchers but in aggregate result in a tremendous loss in

the effectiveness of machine learning algorithms applied to problems of critical importance to

society. A major goal of the work in this thesis is to make machine learning more accessible

to assistive technology researchers. The contribution of this thesis to the domain of assistive

technology is through statistically rigorous means to automate the role of an expert in

the domain of machine learning. The result is code that researchers can use to perform

feature extraction and classification in the context of a single model and especially to set

the hyperparameters of machine learning models. This section describes how the methods

introduced in this thesis have been and are being used within four well motivated applications

from assistive technology.

7.1 Rehabilitation

7.1.1 Introduction

This section highlights the findings of Snoek et al. [2012a] and Snoek et al. [2012c] (Chapters 3

and 4) in the context of assistive technology. In this section, we explore the use of an

integrated and entirely automatic approach to perform feature extraction, classification and

hyperparameter selection applied to a real-world rehabilitation research problem. In particular,

we explore the use of an unsupervised machine learning model that is guided to learn a


representation that is more appropriate for a given discriminative task. This nonparametrically

guided autoencoder (Chapter 3) uses a neural network to learn a nonlinear encoding that

captures the underlying structure of the input data while being constrained to maintain an

encoding for which there exists a mapping to some discriminative label information. This

model integrates the feature extraction and discriminative tasks and reduces the complexity

induced by exploring various combinations of feature extraction algorithms and classifiers.

The nonparametrically guided autoencoder (NPGA) requires a number of hyperparameters,

such as the number of hidden units of the neural network, that are nontrivial to select.

However, recent advances in machine learning have developed extremely effective methods

for automatically optimizing such hyperparameters. The black-box optimization strategy

known as Bayesian Optimization has recently been shown (Bergstra et al. [2011], Snoek et al.

[2012c], Chapter 4) to be particularly well suited to this task, consistently finding better

parameters, and doing so more efficiently, than machine learning experts.

In this work it is shown that such a fully automated strategy arising from the combination

of the NPGA and Bayesian Optimization can outperform a carefully hand tuned machine

learning approach on a real-world rehabilitation problem [Taati et al., 2012c]. This is significant

as it suggests that assistive technology researchers can achieve state-of-the-art results on their

problems without requiring expert knowledge or tedious algorithm and parameter tuning.

7.1.2 Empirical Analysis

The approach outlined in this work is empirically validated on a real-world application in

assistive technology for rehabilitation. About 15 million people suffer stroke worldwide each

year, according to the World Health Organization. Up to 65% of stroke survivors have

difficulty using their upper limbs in daily activities and thus require rehabilitation therapy

[Dobkin, 2005]. The frequency at which rehabilitation patients can perform rehabilitation

exercises, a significant factor determining the rate of recovery, is often limited due to a

shortage of rehabilitation therapists. This motivated the development of a robotic system to

automate the role of a therapist providing guidance to patients performing repetitive upper

limb rehabilitation exercises by Kan et al. [2011], Huq et al. [2011], Lu et al. [2012] and Taati

et al. [2012c]. The system allows a user to perform upper limb reaching exercises with a

robotic arm (see Figures 7.1a, 7.1b) while it dynamically adjusts the amount of resistance

to match the user’s ability level. The system can thus alleviate the burden on therapists

and allow patients to perform exercises as frequently as desired, significantly expediting


[Figure 7.1 contains four panels: (a) The Robot, (b) Using the Robot, (c) Depth Image, (d) Skeletal Joints.]

Figure 7.1: The rehabilitation robot setup and sample data captured by the sensor.

rehabilitation.

Critical to the effectiveness of the system is its ability to discriminate between various types of

incorrect postures and prompt the user accordingly. The current system [Taati et al., 2012c]

uses a Microsoft Kinect sensor to observe a patient performing upper limb reaching exercises

and records their posture as a temporal sequence of seven estimated upper body skeletal

joint angles (see Figures 7.1c and 7.1d for an example depth image and corresponding pose

skeleton captured by the system). A classifier is then employed to discriminate between five

different classes of posture, consisting of good posture and four common forms of improper

posture resulting from compensation due to limited agility. Taati et al. [2012c] obtained a

data set of seven users each performing each class of action at least once, creating a total of 35

sequences (23,782 frames). They compare the use of a multiclass support vector machine and

a hidden Markov support vector machine in a leave-one-subject-out test setting to distinguish

these classes and report best per-frame classification accuracy rates of 80.0% and 85.9%


Model                                                      Accuracy

SVM Multiclass [Taati et al., 2012c]                         80.0%
Hidden Markov SVM [Taati et al., 2012c]                      85.9%
ℓ2-Regularized Logistic Regression                           86.1%
NPGA (α = 0.8147, β = 0.3227, H = 3, 242 hidden units)       91.7%

Table 7.1: Experimental results on the rehabilitation data. Per-frame classification accuracies are provided for different classifiers on the test set. Bayesian optimization was performed on a validation set to select hyperparameters for the ℓ2-regularized logistic regression and the best performing NPGA algorithm.

respectively.

In our analysis of this problem we use an NPGA to encode a latent embedding of postures that

facilitates better discrimination between different posture types. The same formulation as

presented in Section 3.5.1 is applied here. We interpolate between a standard autoencoder (α =

0), a classification neural net (α = 1, β = 1), and a nonparametrically guided autoencoder by

linear blending of their objectives according to Equation 3.5. Rectified linear units were used

in the autoencoder. As in Taati et al. [2012c], the input to the model is the seven skeletal

joint angles, that is, Y = R7, and the label space Z is over the five classes of posture.

In this setting, rather than perform a grid search for parameter selection as in Section 3.5.1,

we optimize validation set error over the hyperparameters of the model using Bayesian

optimization [Mockus et al., 1978]. Bayesian optimization is a methodology for globally opti-

mizing noisy, black-box functions based on the principles of Bayesian statistics. Particularly,

given a prior over functions and a limited number of observations, Bayesian optimization

explicitly models uncertainty over functional outputs and uses this to determine where to

search for the optimum. For a more in-depth overview of Bayesian optimization see Brochu

et al. [2010]. We use the Gaussian process expected improvement algorithm with a Matérn 5/2

covariance, as described in Snoek et al. [2012c], to search over α ∈ [0, 1], β ∈ [0, 1], 10–1000

hidden units in the autoencoder and the GP latent dimensionality H ∈ {1, . . . , 10}. The best

validation set error observed by the algorithm, on the twelfth of thirty-seven iterations,

was at α=0.8147, β=0.3227, H=3 and 242 hidden units. These settings correspond to a

per-frame classification accuracy of 91.70%, which is significantly higher than that reported

by Taati et al. [2012c]. Results obtained using various models are presented in Table 7.1.

The relatively low number of experiments required by Bayesian optimization to find a state-

of-the-art result implies that the validation error is a well behaved function of the various


[Figure 7.2 contains three heat-map panels over Alpha (α) and Beta (β), each with H = 2 and 10, 500, or 1000 autoencoder hidden units; the colour scale spans roughly 10–50% validation error.]

Figure 7.2: The posterior mean learned by Bayesian optimization over the validation set classification error (in percent) for α and β with H fixed at 2 and three different settings of autoencoder hidden units: (a) 10, (b) 500, and (c) 1000. This shows how the relationship between validation error and the amount of nonparametric guidance, α, and parametric guidance, β, is expected to change as the number of autoencoder hidden units is increased. The red x’s indicate points that were explored by the Bayesian optimization routine.

hyperparameters. The relationship between the model hyperparameters and validation error

is challenging to visualize, but it is important to assess their relative effect on the performance

of the model. Thus, in Figure 7.2 we explore how the relationship between validation error and

the amount of nonparametric guidance α, and parametric guidance β is expected to change as

the number of autoencoder hidden units is varied. That is, we show the expected value of the

validation error for unobserved points under a Gaussian process regression. Similarly to the

results observed in Section 3.5.1, it seems clear that the best region in hyperparameter space

is a combination of all three objectives, the parametric guidance, nonparametric guidance and

unsupervised learning. This reinforces the theory that although incorporating a parametric

logistic regressor to the labels more directly reflects the ultimate goal of the model, it is

more prone to overfit the training data than the GP. Also, as we increase the number of

hidden units in the autoencoder, the amount of guidance required appears to decrease. As the

capacity of the autoencoder is increased, it is likely that the autoencoder encodes increasingly

subtle statistical structure in the data. When there are fewer hidden units, this structure

is not encoded unless the autoencoder objective is augmented to reflect a preference for it.

Interestingly, the validation error for H = 2 was significantly better than H = 1 but it did

not appear to change significantly for 2 ≤ H ≤ 10.

An additional interesting result is that, on this problem, the classification performance is

worse when the nearest neighbors algorithm is used on the learned representation of the


NPGA for discrimination. With the best performing NPGA reported above, a nearest

neighbors classifier applied to the hidden units of the autoencoder achieved an accuracy of

85.05%. Adjusting β in this case also did not improve accuracy. This likely reflects the fact

that the autoencoder must still encode information that is useful for reconstruction but not

discrimination.

In this example, the resulting classifier must operate in real time to be useful for the rehabilitation

task. The final product of our system is a simple softmax neural network, which is

directly applicable to this problem. It is unlikely that a Gaussian process based classifier

would be feasible in this context.

7.1.3 Conclusion

In this section, a methodology was presented to combine feature extraction and classification

into a single model and to optimize the model hyperparameters automatically. The approach

was empirically validated on a real-world rehabilitation research problem, for which state-

of-the-art results were achieved. The approach is very general, and as such can be applied

to potentially many problems in the domain of assistive technology. This is valuable as the

need for careful exploration of machine learning models and model parameters, which often

requires significant domain knowledge, is obviated.

7.2 Mortality in Bone Marrow Transplant Patients

In this problem, the Bayesian optimization procedures developed in this thesis are applied to

optimizing the hyperparameters of machine learning algorithms involved in the prediction of

successful bone marrow transplant operations. This section summarizes Taati et al. [2012a],

Taati et al. [2012b] and Taati et al. [2013].

7.2.1 Introduction

Bone Marrow Stem Cell Transplants (BMTs) are a common procedure for treating certain

types of cancer, such as leukemia and lymphoma, and other diseases such as thalassemia.

Globally, over 50,000 first hematopoietic stem cell transplants are performed each year [Grat-

wohl et al., 2010]. The procedure replaces destroyed or damaged bone marrow with healthy


stem cells. BMTs are not always a successful treatment and involve risk factors and compli-

cations such as infection, relapse of the disease, or graft-versus-host disease, each of which

could cause death.

In this problem we attempt to predict BMT patient post-transplant survival based on features

describing the patient and their condition before and directly after the transplant. A major

issue in this work is that many of the features are either missing or noisy. Thus as an initial

step, missing data is inferred using collaborative filtering techniques such as probabilistic

principal components analysis with missing values (PPCA; Roweis [1996]), Robust PCA with ℓ1-norm minimization (RPCA; Becker et al. [2011]) and probabilistic matrix factorization (PMF; Salakhutdinov and Mnih [2008]). Analogously to predicting, for example, missing

movie ratings based on subjects with similar ratings, we can attempt to infer missing features

based on the features that are present for similar subjects. With the missing data filled in, a

classifier is trained to predict patient survival based on ground truth annotations.
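As an illustrative sketch of this imputation step, the following fits a rank-1 matrix factorization to the observed entries of a small feature matrix and fills the missing cells from the learned factors. This is in the spirit of PMF-style imputation but is not the thesis' exact PPCA, RPCA or PMF models; the toy matrix and function name are hypothetical.

```python
def mf_impute(X, n_iters=100):
    """Fill missing entries (None) of X by fitting a rank-1 factorization
    X[i][j] ~ u[i] * v[j] to the observed entries only, via alternating
    least squares.  Illustrative stand-in for PMF-style imputation."""
    n_rows, n_cols = len(X), len(X[0])
    u, v = [1.0] * n_rows, [1.0] * n_cols
    for _ in range(n_iters):
        for i in range(n_rows):  # refit each row factor to its observed cells
            obs = [j for j in range(n_cols) if X[i][j] is not None]
            u[i] = sum(v[j] * X[i][j] for j in obs) / sum(v[j] ** 2 for j in obs)
        for j in range(n_cols):  # refit each column factor likewise
            obs = [i for i in range(n_rows) if X[i][j] is not None]
            v[j] = sum(u[i] * X[i][j] for i in obs) / sum(u[i] ** 2 for i in obs)
    # keep observed values; fill only the missing cells from the factors
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]

# A toy patient-by-feature matrix whose observed entries are rank-1 consistent:
X = [[1.0, 3.0, 5.0],
     [2.0, 6.0, 10.0],
     [4.0, 12.0, None]]
filled = mf_impute(X)
```

In the actual pipeline the imputed feature matrix, rather than this toy example, would then be handed to the survival classifier.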

7.2.2 Data

Records from 1750 transplants performed in the largest BMT centre in the world (Shariati

Hospital / Tehran University of Medical Sciences) have been collected for analysis.¹ The

data was collected over the course of 19 years and contains over 120 pre- and post-transplant

measurements per patient. There are a total of 90 pre-transplant attributes, 33 post-transplant

attributes and the survival status of each patient. Almost a quarter, 22.3%, of the features

are missing. There are known errors in the data. In two cases, for instance, the date of

diagnosis is noted as prior to the date of birth. Preprocessing is done to remove and fix

obvious errors, but it is assumed that a significant and unknown proportion remains.

The records include 1010 male and 740 female patients of ages ranging from 2 to 68 years

old at the time of the transplant. The underlying diseases leading to the transplant include

thalassemia and various forms of leukemia, categorized into five classes of Acute Lymphoblastic

Leukemia (ALL), Acute Myelogenous Leukemia (AML), Chronic Myelogenous Leukemia

(CML), Plasma cell leukemia (PCL), or Other.

The pre-transplant attributes consist of categorical variables, ordinal variables and calendar

dates. The named categorical variables are derived from records and measurements and

include the patient’s gender and blood type. The ordinal values include, for example, the

¹ The ethics review board approval for the study was obtained via Shariati Hospital (Tehran, Iran).


level of various antigens, such as Human Leukocyte Antigens, in the patient, the donor, and

the patient’s parents. The date of birth and that of the diagnosis and the transplant are also

included.

7.2.3 Empirical Analysis

The Bayesian optimization techniques introduced in Section 4 were applied to optimize the

parameters of the various collaborative filtering techniques to minimize final classification

error. This includes two parameters that are of particular importance to all the collaborative

filtering techniques. These are the dimensionality of a latent feature space, d, and a parameter

that sets a limit on the sparsity of features before they are discarded, m. The latent space

dimensionality controls how complex the underlying factorization of the feature matrix can

be. The setting of this variable involves a tradeoff between limiting modeling capacity and

overfitting. The sparsity limit, or threshold, reflects the notion that some features are so

sparse that incorporating them into the model hinders performance, as they possess no useful

information. For example, there are twenty features for which there are known values for

fewer than forty patients.

The optimization algorithm was set to run over 300 iterations to search for the best pair of

integer values within the ranges of d ∈ [2, 35] and m ∈ [1, 1750]. At each iteration, following

the training of the collaborative filtering algorithm with a given set of parameters, a random

forest classifier was applied to the filled-in data to classify patient survival. A 10-fold cross-validation procedure was followed to obtain the loss that the optimizer minimized.

Experiments were repeated to optimize the classification accuracy, the F1 score, and the area

under the ROC curve (AUC). Maximizing each of these values resulted in similar overall

classification results. For brevity, only the results from maximizing the F1 score are reported

here.
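The loss handed to the optimizer at each iteration can be sketched as follows. The fold splitting and F1 computation are generic rather than the study's exact implementation, and `train_and_predict` is a hypothetical hook standing in for the imputation-plus-random-forest pipeline.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def cv_loss(y, train_and_predict, k=10):
    """Loss minimized by the Bayesian optimizer: 1 - mean F1 over k folds."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        y_pred = train_and_predict(train_idx, test_idx)
        scores.append(f1_score([y[j] for j in test_idx], y_pred))
    return 1.0 - sum(scores) / len(scores)
```

For a perfect classifier the loss is zero; optimizing accuracy or AUC instead only changes the scoring function inside the loop.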

The overall process took about 20 hours to run on available hardware and converged on

the values of d = 21 and m = 1556. The selected value of m resulted in discarding the 22

most sparse features and keeping the remaining 170 features. Table 7.2 shows the prediction

accuracy of the resulting classifier for various thresholds on the classifier’s confidence measure.


Confidence Threshold  # Patients  Accuracy  Precision  Recall  F1
70%                   1199        80.6%     80.7%      99.6%   89.2
80%                   712         83.1%     83.1%      100%    90.8
90%                   240         85.0%     85.0%      100%    91.9
95%                   74          91.9%     91.9%      100%    95.8
98%                   31          96.8%     96.8%      100%    98.4
99%                   7           100%      100%       100%    100

Table 7.2: Estimating the survival states for a subset of patients by adjusting threshold levels on the prediction confidence value.
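The selective-prediction scheme behind the table can be sketched as below: only predictions whose confidence clears the threshold are evaluated, so higher thresholds trade coverage for accuracy. The confidence/label triples here are made-up illustrative data, not the study's records.

```python
def metrics_above_threshold(records, threshold):
    """records: (confidence, predicted_label, true_label) triples.
    Evaluate accuracy, precision and recall only on the subset of
    predictions whose confidence is at least `threshold`."""
    kept = [(p, t) for c, p, t in records if c >= threshold]
    if not kept:
        return None  # no prediction is confident enough at this threshold
    tp = sum(p == 1 and t == 1 for p, t in kept)
    fp = sum(p == 1 and t == 0 for p, t in kept)
    fn = sum(p == 0 and t == 1 for p, t in kept)
    correct = sum(p == t for p, t in kept)
    return {
        "n": len(kept),
        "accuracy": correct / len(kept),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Illustrative predictions: raising the cutoff keeps only the confident ones.
records = [(0.99, 1, 1), (0.95, 1, 1), (0.80, 1, 0), (0.60, 0, 1)]
strict = metrics_above_threshold(records, 0.90)
```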

7.2.4 Conclusion

Bayesian optimization was used to set the hyperparameters involved in the prediction of

bone marrow transplant patient survival. The hyperparameters are highly dependent and as

such are difficult to set by hand. For a subset of patients, the resulting model can predict

a patient’s survival with very high accuracy. This is promising as it can already facilitate

the prioritization of resources but also will aid further study into what factors contribute to

patient survival.

7.3 Fall Detection

In this problem, we use the constrained Bayesian optimization method introduced in Section 6

to tune the parameters of a computer vision based fall detection system to minimize the

classification error of the fall detector.

7.3.1 Introduction

With improved medical care, nutrition and the aging of the baby boomer generation, the

world’s elderly population is rapidly growing relative to other demographics. As the older

generation grows, so does the burden of their care on healthcare systems and caregivers. A

natural solution to alleviate this burden is to facilitate aging in place and enable older adults to live independently for as long as possible. Falls in the home, however, pose a major challenge to independent living and are becoming increasingly prevalent. A recent study

has identified falls as the most expensive category of injury for the Canadian healthcare


system, costing in total over $6.2 billion in 2004 alone [Smartrisk, 2009]. Adults over the

age of 65 accounted for the majority of the costs, 84% of fall-related deaths and 59% of

hospitalizations, and the prevailing type of injury-causing fall was falls on the same level,

followed by falls on stairs. According to the same study, falls accounted for 50% of all injuries

leading to hospitalization and for 31% of total injury costs in Canada. Although these statistics are already alarming, the proportion of the world's population over the age of 60 is expected to grow from one in nine people to one in five over the next forty years [United Nations, 2012].

Under the hypothesis that immediate access to emergency healthcare after a fall will reduce

the severity of injuries, promote greater independence and allow older adults to age in place

for longer, there have been a number of efforts to create automatic fall-detection systems.

A promising strategy is through the use of computer vision. One reason is that a camera,

a passive sensor, does not require the subject being monitored to remember to wear it or

activate it once an adverse event occurs. A number of computer vision based fall detection

methods have been proposed [Rougier et al., 2006, Anderson et al., 2006, Spehr et al., 2008,

Nait-Charif and McKenna, 2004, Lee and Mihailidis, 2005, Snoek et al., 2010b]. However,

these methods do not address the complexities of real-world environments with variable

lighting conditions, furniture and clutter obstructing certain angles of view and in which

there may be an arbitrary number of people.

The Intelligent Assistive Technology and Systems lab at the University of Toronto has

developed a number of increasingly complex and effective vision fall detection systems. The

initial system [Lee and Mihailidis, 2005] used a ceiling-mounted camera and simply classified based on human-set thresholds on simple features derived from a background-subtracted silhouette. Once the thresholds were violated, the system would initiate a speech-recognition-based dialogue to establish the state of the subject and call for help if necessary. The

system has since evolved with Belshaw et al. [2011a] adding more sophisticated background

subtraction methods, shadow detection and a machine learning based classifier and Belshaw

et al. [2011b] incorporating optical flow methods and a wide angle lens. A diagram of the full

system is presented in Figure 7.3c.

The system [Belshaw et al., 2011b] initially was evaluated in a highly constrained laboratory

environment. It has since been fabricated into a customized unit, shown in Figure 7.3a, and

deployed in a far more realistic simulated home environment, the Toronto Rehab "HomeLab",

Figure 7.3b. The re-evaluation of the system in this more complex and realistic environment

has elucidated some of the challenges that would need to be addressed to deploy the system in


(a) Fall detection unit on HomeLab ceiling tile. (b) Toronto Rehab HomeLab. © Good Robot.
(c) Block diagram of fall detection method/unit: frame capture and a video frame buffer feed background modelling, active region modelling and optical flow; features (silhouette/Hu moments, lighting statistics, flow) are extracted over an N-frame window and classified by the fall detection (NN) classifier; a detected fall triggers speech verification ("Do you need help?") and, if necessary, a SIP call, alongside a help button and verbal "help" keyword detection. (d) Undistorted image from the device.

Figure 7.3: The fall detection device is mounted on the ceiling tiles (Figure 7.3a) of the various rooms within the Toronto Rehab Institute HomeLab (Figure 7.3b). Figures 7.3c and 7.3d show the logic of the system in a block diagram and a post-processed image from the unit.

real home environments. A major issue is that the system has numerous parameters that were

painstakingly set by hand to improve empirical performance in the laboratory environment.

Each component of the system (Figure 7.3c) has parameters that significantly affect overall

system performance. These include, for example, the background model’s adaptive blending

parameter, the minimum background subtracted silhouette size, thresholds on the optical

flow and the hyperparameters of the classifier.

The setting of the aforementioned parameters poses a major hurdle towards building an

effective system that can be deployed in arbitrary environments. The complexity of the

problem is compounded by the fact that the parameters do not contribute independently to


classifier’s fall detection accuracy. Furthermore, certain combinations of the parameters result

in numerically undefined results or arbitrarily poor classification accuracy. For example, there

is a certain threshold on the adaptive background subtraction blending parameter for which

there are no longer any silhouettes of the minimum size to be considered human subjects.
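A constrained treatment of such settings can be sketched as a wrapper that reports infeasibility separately from the loss, so the optimizer can model the constraint rather than receive an arbitrary penalty value. Here `run_pipeline` and `toy_pipeline` are hypothetical stand-ins for the actual fall detection pipeline.

```python
import math

def evaluate_config(params, run_pipeline):
    """Evaluate one parameter setting.  Returns (feasible, loss): settings
    that crash or yield an undefined loss (e.g. a background-blending rate
    that leaves no silhouettes of the minimum size) are marked infeasible
    instead of being assigned a huge penalty loss."""
    try:
        loss = run_pipeline(params)
    except ValueError:
        return False, None
    if loss is None or math.isnan(loss):
        return False, None
    return True, loss

def toy_pipeline(params):
    # Hypothetical stand-in: blending rates above 0.9 leave no silhouettes.
    if params["blend"] > 0.9:
        raise ValueError("no silhouettes above minimum size")
    return params["blend"] ** 2  # pretend classification loss

ok, loss = evaluate_config({"blend": 0.5}, toy_pipeline)
bad, _ = evaluate_config({"blend": 0.95}, toy_pipeline)
```

Treating the violation as a separate observation, rather than folding it into the loss, is what distinguishes the constrained formulation from simply penalizing bad settings.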

7.3.2 Empirical Analysis

The HomeLab, while situated inside the research space of a hospital, is a fully functional home with four rooms: a kitchen, living room, bedroom, and washroom (see Figure 7.3b).

In each room, a fall detection unit was installed on custom ceiling tiles at different heights.

Over a two month period, each time sufficient change in the background was detected, the

system collected video at a resolution of 640x480 pixels, and a frame rate of 30 frames per

second. Each of these movement sequences was kept as an example of a negative fall sequence.

Positive fall sequences were recorded by having twenty subjects simulate falls. Each subject

simulated three fall sequences in each of the four rooms. In addition, participants were

given the option of performing extra falls while using either a walker or a cane. In total, 3625 sequences, each a contiguous set of frames with people present, were collected. Of these sequences,

887 contained simulated falls. Each frame of video was annotated to indicate whether that

frame depicts a fall or not. Frames occurring after the subject gets up were discarded.

A random forest classifier was used to classify each video frame as a fall or not a fall.

Constrained Bayesian optimization was used to minimize the fall detection loss. As each run of the classification pipeline to evaluate a given setting of parameters takes between three and six hours, the optimization was parallelized ten times according to the procedure outlined in

Section 4.3.3. Figure 7.4 shows the progression of the best fall detector, as quantified by the

loss measure, found by the Bayesian optimization procedure over time.

7.3.3 Conclusion

In this problem, the constrained Bayesian optimization techniques introduced in Section 6

were used to optimize the parameters involved in a complex computer vision system. The

resulting strategy obviates the need for an expert to carefully adjust the parameters of the

system to optimize performance in new environments. It is clear from Figure 7.4 that the

system effectively minimizes the fall detection loss. However, the results were qualitatively

also highly interesting. Originally, the engineer who developed the system had first carefully


Figure 7.4: Progression over time (in hours) of the loss being minimized by the constrained Bayesian optimization on the fall detection problem.

tuned the parameters of the background subtraction algorithm to be effective at tracking

people as they walked around within the view of the camera. This reflects a simple bias

that a successful fall detection device must first be good at tracking people at all times.

The Bayesian optimization algorithm, however, tuned the background subtraction in such a

manner as to be extremely poor at tracking people in normal gait. Instead, it set the adaptive

background subtraction to only mark as foreground large objects which were moving very

rapidly. In the training data, this consists mostly of people during and shortly after falls.

The optimization algorithm tuned the background subtraction to effectively act as a classifier

of falls and non-falls based on how fast people are moving. Thus, the Bayesian optimization

algorithm found a highly effective mode which the human expert would never have explored.

7.4 Prompting Alzheimer’s Patients

In this problem, the Bayesian optimization techniques proposed in Section 4 are used to

improve the performance of COACH, a system built to prompt Alzheimer’s patients and

older adults with other forms of dementia through activities of daily living. In particular,


the Bayesian optimization demonstrably improves the performance of a computer vision and

machine learning based hand tracker on which the system critically depends.

7.4.1 Introduction

As motivated above in Section 7.3.1, one of the major challenges facing contemporary societies

is the care of the proportionally ever growing older adult population. Of the challenges

associated with caring for this population, the most difficult is perhaps caring for those

facing cognitive decline. Older adults with Alzheimer’s disease and other forms of dementia

require more resources and care and are less able to live independently. The U.S. Census

Bureau [U.S. Census Bureau, 2011] estimates that in 2011 about 29 percent of Americans

aged 65 or older were living alone. However, only one in seven people with Alzheimer’s

disease live alone [Alzheimer’s Association, 2012]. The Alzheimer’s Association estimates

the total health care, long-term care and hospice costs for people with Alzheimer’s disease to

increase from $200 billion in 2012 to $1.1 trillion by 2050 [Alzheimer’s Association, 2012]. In

order to manage the burden of caring for this challenging and growing population we must

find ways to facilitate independence and aging in place.

The COACH system [Mihailidis et al., 2008] attempts to facilitate independence by using

artificial intelligence to emulate the presence of a caregiver helping people with dementia

through tasks of daily living. Given that the inability to complete tasks associated with using

the restroom is among the most degrading and difficult challenges to overcome, both for older

adults with dementia and caregivers, the system focuses on the task of hand washing. The

system observes people as they wash their hands and prompts them if they get stuck in a

step or stray from a sequence of events that will lead to successful completion of the task.

Figure 7.5 shows an example of the COACH system experimental environment.

The effectiveness of COACH depends heavily on its ability to accurately estimate the location

of a subject’s hands relative to the wash basin, water and a towel. This was done through

the use of a carefully designed computer vision based hand tracking algorithm [Hoey et al.,

2010]. However, any errors made by the hand tracking algorithm could lead to erroneous

prompts that would confuse the user and render the system counterproductive. Recent

studies [Czarnuch and Mihailidis, 2012, 2013] have emphasized this issue, and motivated the

use of depth cues as well as color information to track the hands.

The ceiling mounted camera has therefore been replaced with a Microsoft Kinect sensor. The

sensor has proven to be highly effective at facilitating pose estimation and the complementary


Figure 7.5: An image of the COACH system experimental set-up.

Figure 7.6: A qualitative demonstration of the hand tracking results. Hands from the viewpoint of the overhead Kinect sensor (left) are first classified into hand pixels (center) and then the skeletal joints are proposed (right).

problem of tracking body parts [Shotton et al., 2011]. Emulating the successful methods

of Shotton et al. [2011], a top-down-view approach to arm and hand tracking has been

developed for the COACH system. Similarly to Shotton et al. [2011], the body part inference

component of this system uses randomized decision forests to classify which body part a given

video pixel corresponds to based on depth and color features. The per-pixel classifications

are then used to generate proposals for 3D skeletal joint positions. The location of the hands

can then be trivially extracted from these.


Algorithm                                             Small Data Set   Full Data Set
Grid Search (following Shotton et al. [2011])         85%              33%
Constrained Bayesian Optimization (6× Parallelized)   92%              73%

Table 7.3: Body joint classification accuracy on the COACH task: a comparison of the standard grid search procedure to Bayesian optimization.

7.4.2 Empirical Analysis

The performance of the aforementioned body part classifier and joint proposal algorithm

depends on a set of non-trivial and highly dependent hyperparameters. The random forest

classifier requires one to set a threshold on the minimum information gain for which a decision

tree split can be considered. Setting this parameter either too high or low significantly

adversely affects the performance of the classifier. The joint proposal algorithm involves the

setting of a confidence threshold on the predictions from the body part classifier and a pixel

offset parameter. Shotton et al. [2011] follow an expensive grid search procedure on a subset

of their data to select values for these parameters.

We compare here the standard grid search procedure for setting these parameters to setting

them using the Bayesian optimization algorithms from this thesis. In this setting, nine body

parts, which are defined by the joints that connect them, are classified. These joints are the left and right hands, left and right elbows, shoulders, and head. The mean per-pixel

classification accuracy over all body parts is reported on a withheld validation set of three

hundred images.

Starting from the values given by Shotton et al. [2011], the grid search procedure resulted in an average body part classification accuracy of 85% on a subset of the data (150 training

images). In contrast, the Bayesian optimization procedure found parameters that achieved

92% and required only the specification of bounds. Initially, the “best” hyperparameters

found using grid search on a subset of the data were then used on a much larger data set of

4800 images to train the final classifier because a grid search on the larger data set would

be prohibitively expensive. However, this resulted in a poor 33% classification accuracy

on the larger data set. Instead, the parallel Bayesian optimization procedure (parallelized

6 times) was repeated on the larger data set and achieved 72.8% classification accuracy

within 95 experiments. A grid search procedure would not even be able to explore 5 distinct

values for each of the three parameters within 95 experiments. Figure 7.6 qualitatively demonstrates the output of the resulting hand tracker on an example video frame.
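The budget comparison can be made concrete: a full factorial grid over the three hyperparameters (the information-gain threshold, the classifier-confidence threshold and the pixel offset; the bounds below are hypothetical) already exceeds the 95-evaluation budget at just five values per axis.

```python
from itertools import product

def full_grid(points_per_axis, bounds):
    """Evenly spaced full factorial grid over the given (low, high) bounds."""
    axes = [[lo + (hi - lo) * i / (points_per_axis - 1)
             for i in range(points_per_axis)]
            for lo, hi in bounds]
    return list(product(*axes))

# Hypothetical bounds for the three hyperparameters: even a coarse grid of
# five values per axis requires 5**3 = 125 evaluations, more than the 95
# used by the parallel Bayesian optimization on the full data set.
configs = full_grid(5, [(0.0, 0.1), (0.0, 1.0), (1.0, 50.0)])
```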


7.4.3 Conclusion

In this problem, we used the Bayesian optimization procedures introduced in this thesis to

significantly improve the performance of a hand tracking algorithm on which the COACH

Alzheimer’s patient prompting system relies. The procedure improved the accuracy of the

hand classifier from 33% to 72.8% on a computationally expensive problem. The larger scale

experiments take three to six hours to evaluate a single setting of the parameters (compared to three to six minutes for the smaller-scale experiments). Thus, parsimony in function evaluations is

highly desirable. A grid search procedure can be highly wasteful as it will evaluate parameter

settings in poor regimes which the Bayesian optimization can avoid. Furthermore, the

Bayesian optimization can take advantage of the strong dependencies between parameters.

As a result, the Bayesian optimization significantly outperformed grid search procedures on a

subset of the data and was highly effective in a regime for which grid search was prohibitively

expensive.


Chapter 8

Discussion

8.1 Limitations of the Nonparametrically Guided

Autoencoder and Future Directions

The NPGA developed in Chapter 3 was empirically shown to be a highly effective model

for guiding the latent representation learned by an autoencoder to encode structure that is

more relevant to a discriminative task of interest. In this section we will briefly discuss some

limitations of this approach and possible future work.

8.1.1 Non-Gaussian Outputs

In the formulation presented in Chapter 3 and corresponding empirical analysis we used a GP

to map from the latent representation to some auxiliary information such as labels. A natural

criticism of this approach is that these outputs are not always well modeled by a Gaussian

distribution. Class labels, for example, would be better modeled with logistic or probit

outputs. This can be added to the NPGA but would add complexity both analytically and

computationally. Interesting future work may extend this approach to incorporate arbitrary

outputs through the use of latent Gaussian processes. Inference and learning can be difficult

in this scenario, however, so methods such as expectation propagation [Minka, 2001] or

sampling [Murray et al., 2010] must be used.


8.1.2 Computational Complexity and Mini-Batch Learning

The computational complexity of the NPGA depends significantly on the number of observations that are modeled by the Gaussian process at a given time, as the GP computationally

scales cubically in the number of observations. In Chapter 3, we find that an effective learning

and optimization strategy is to divide the data into mini-batches and perform gradient

updates following mini-batch stochastic gradient descent. This changes the computational complexity to be cubic in the mini-batch size but linear in the total number of batches: for N observations and batch size b, a full pass over the data costs O((N/b) · b³) = O(N b²) rather than O(N³). This

is necessary for the NORB data, for example, as computing and inverting a covariance matrix

over all data cases in the GP is generally intractable on modern computers. While this method

has been widely adopted for the training of autoencoders and neural networks in general,

it is not standard practice for GPs. Yao et al. [2011] have shown that stochastic gradient

descent significantly outperforms other optimization techniques in the learning of latent

representations using GPLVMs. An interesting question that warrants further investigation

is the impact that this mini-batch stochastic gradient descent has on the modeling abilities

of the GP. In particular, an interesting empirical and theoretical question to consider is the

impact of the batch size and what the trade-off is between the batch size and the quality of

the model.

8.2 Limitations of Bayesian Optimization

and Interesting Future Directions

While the Bayesian optimization algorithms developed in this work are quite effective, and

offer methods for setting the hyperparameters of machine learning models which are vastly

superior to current strategies, there are limitations and drawbacks which are important to

acknowledge. In general these are tied to the statistical model used to model distributions

over functions, the Gaussian process.

8.2.1 Complexity and Large Scale Learning

Due to the computation and inversion of a covariance matrix between observations, the

computational complexity of sampling and inference in a GP is cubic in the number of

observations and the memory requirements scale quadratically. The fact that GP hyperparameter estimation and inference are themselves computationally expensive is one of the


Figure 8.1: Running the treed Bayesian optimization on the Branin-Hoo function. The translucent rectangles are different tree splits. Each blue dot is an observation that the Bayesian optimization routine suggested.

major reasons why the use of Bayesian optimization is advocated for optimizing expensive

black box functions. A drawback is that, as a result of the GP, as more observations are made

the Bayesian optimization routine becomes increasingly slow. For very difficult, for example

high-dimensional, optimization problems where many observations are needed, the parallel

Bayesian optimization routine can become prohibitively slow before finding the optimum.

There has been significant effort to develop approximate algorithms for Gaussian processes

that scale to larger data sets through, for example, selecting subsets of observations [Lawrence

et al., 2002] or learning optimal pseudo-observations [Snelson and Ghahramani, 2005]. Preserving the high-quality uncertainty measurements of the GP remains a challenge, however,

as selecting subsets of observations or learning pseudo-inputs naturally discards observations

and thus increases uncertainty in observed regions. This may not be harmful for problems

where only the expected value (the predictive mean) is important, but Bayesian optimization

depends critically on high-quality variance estimates. However, there are interesting sparse GP schemes that may be effective for Bayesian optimization. For example, one

could develop a methodology that selects subsets of observations such that the GP finely

models regions with high expected improvement but coarsely models regions that are not

promising. This would be closely related to the work of Snelson and Ghahramani [2005] and Lawrence et al. [2002], but with a slightly different criterion for selecting observations or

pseudo-observations.

An interesting alternative is to model fixed subsets of the observations with different Gaussian

processes. This would change the complexity to scale linearly in the number of GPs. One


could partition the observed space in such a manner that different regions are modeled by

different Gaussian processes. An interesting by-product of this strategy is the ability to learn

different length scales and noise terms for each of the regions, thus achieving non-stationarity

and heteroscedasticity. Indeed this is a major motivation for the Bayesian treed GP approach

developed by Gramacy and Lee [2008]. A major consideration is how to determine partitioning

points for the various GPs. Gramacy and Lee [2008] adopt a Bayesian approach and integrate

over trees. This provides a very elegant framework for dealing with non-stationary functions

with change points, but does not scale to large data sets as a consequence of the sampling over trees. A compromise would be to use a heuristic for partitioning the observations. A

simple heuristic would be to split the space modeled by a tree leaf, such that each partition has an equal number of observations, whenever the total number of observations in the leaf exceeds some limit, and then recurse. Preliminary experiments following this strategy are promising but

reveal a possibly problematic pathology. In Figure 8.1 we demonstrate the points chosen by

performing Bayesian optimization with this treed GP strategy on the Branin-Hoo function.

While the optimization scheme has selected the vast majority of points on and around

the optima of the function, it clearly selects too many points at the bounds

of the leaf GPs. This is not surprising: the variance of a GP grows in proportion to the

distance to observations, and each leaf cannot see observations that are modeled by other

leaves. However, this results in a significant number of redundant experiments as can be seen

by the clusters of observations at the bounds of the optimization in Figure 8.1.
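The equal-split heuristic described above can be sketched as follows; the leaf capacity of 50 points and the median split along the widest input dimension are illustrative assumptions for the sketch, not choices made in this thesis.

```python
import numpy as np

def partition(X, max_points=50):
    """Recursively split the observations modeled by a leaf whenever it holds
    more than `max_points` of them, giving each child an equal number of
    observations.  Splitting at the median of the widest dimension is one
    simple way to choose the partitioning point."""
    if len(X) <= max_points:
        return [X]  # small enough for a single leaf GP
    dim = np.argmax(X.max(axis=0) - X.min(axis=0))  # widest input dimension
    order = np.argsort(X[:, dim])
    mid = len(X) // 2                               # equal-sized halves
    return (partition(X[order[:mid]], max_points) +
            partition(X[order[mid:]], max_points))

# 500 observations in two dimensions yield leaves of at most 50 points each,
# so exact inference in each leaf GP scales with the leaf size, not with 500.
X = np.random.RandomState(0).rand(500, 2)
leaves = partition(X)
```

Because each leaf is modeled independently, the total cost is linear in the number of leaf GPs, which is the complexity change described above.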

8.2.2 High Dimensional Bayesian Optimization

Bayesian optimization has thus far been limited to relatively low-dimensional (e.g., ten-dimensional)

problems. The Bayesian optimization routines developed in this thesis have been used

successfully on a twelve dimensional speech recognition problem [Dahl et al., 2013] and the

twenty-six dimensional vision system optimization problem described in Section 7.3. However,

as more complex machine learning models are requiring more hyperparameters, there has been

interest in using Bayesian optimization for significantly higher dimensional problems. Some

contemporary deep learning methods, for example, have independent learning rates, capacity

parameters (e.g. numbers of hidden units) and regularization parameters for each layer of

the model. Thus, it may be desirable to optimize up to fifty hyperparameters for a deep

convolutional network such as the one studied in Section 4.4.4. Naturally, one would expect

the number of observations required to accurately model a function and find its optimum


to scale with respect to the input dimensionality. Certainly with ARD kernels, the number

of hyperparameters in the Gaussian process grows with the input dimensionality, and these

hyperparameters require more observations to estimate accurately. If the inputs are redundant

or highly dependent, or if some input dimensions are irrelevant to the optimization, then it

is reasonable to consider methods for reducing their dimensionality.

That is, one may consider that the relevant structure within the inputs exists on a low

dimensional manifold. However, simultaneously learning a parameterized low dimensional

projection of the inputs while exploring the function may be difficult to justify. Chen et al.

[2012] explore the case where the input is considered sparse and one wishes to select the

relevant input dimensions while performing the optimization. However, in many problems of

interest we cannot expect the inputs to be sparse.

In excellent recent work by Wang et al. [2013], the authors use random projections to

project the inputs into a lower dimensional space over which the optimization is performed.

Provided that the relevant input structure indeed lies on a lower dimensional manifold, this is

significantly more efficient than performing the optimization in a high dimensional space. In

particular, Wang et al. [2013] augment the covariance function to first project the inputs to a

lower-dimensional space with a random parameterized linear projection and then compute

covariances in the lower dimensional space. For the exponentiated quadratic covariance this

can be written as:

K_EQRP(x, x′) = exp(−(1/2) r²(x, x′)),   r²(x, x′) = (Ax − Ax′)ᵀΨ(Ax − Ax′)   (8.1)

where A ∈ ℝ^{d×D}, with D ≫ d and A_ij ∼ N(0, 1), is a random projection matrix and Ψ is a

diagonal matrix with Ψ_dd = 1/ℓ_d².
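A minimal numerical sketch of the projected covariance of Equation 8.1; the dimensions D = 50 and d = 3 and the unit length-scales are illustrative assumptions.

```python
import numpy as np

def k_eqrp(x, xp, A, lengthscales):
    """Exponentiated quadratic covariance after a random linear projection
    (Equation 8.1): distances are computed between Ax and Ax' under the
    diagonal matrix Psi of inverse squared length-scales."""
    psi = np.diag(1.0 / lengthscales ** 2)
    diff = A @ x - A @ xp
    return np.exp(-0.5 * diff @ psi @ diff)

rng = np.random.RandomState(0)
D, d = 50, 3                      # illustrative ambient and projected dimensions
A = rng.randn(d, D)               # entries drawn i.i.d. from N(0, 1)
ell = np.ones(d)                  # length-scales in the projected space
x, xp = rng.rand(D), rng.rand(D)

k_xx = k_eqrp(x, x, A, ell)       # unit covariance at zero distance
k_xxp = k_eqrp(x, xp, A, ell)     # symmetric in its arguments
```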

An interesting extension to this would be to model Ψ as a full covariance matrix and then

constrain it so as to perform the random projection implicitly rather than include

A. Vivarelli and Williams [1999] point out that if Ψ is a general positive definite covariance

matrix, then the covariance is computed in a latent feature space resulting from an implicit

linear projection of the input data. This can be shown from an eigendecomposition of the

positive definite matrix Ψ into Ψ = VᵀSV, where V is the matrix of eigenvectors and S is the

diagonal matrix of eigenvalues. When Ψ is a full covariance matrix we can thus equivalently

write Equation 8.1 as r²(x, x′) = (x − x′)ᵀVᵀSV(x − x′) = (Vx − Vx′)ᵀS(Vx − Vx′). Through

manipulating or constraining the eigenvalues and eigenvectors of Ψ one can change the

effective dimensionality of the projection. An interesting Bayesian alternative to performing


the random projections of Wang et al. [2013] would be to place a prior on Ψ that prefers low

dimensional embeddings, through a prior on the eigenspectrum of Ψ, and integrate it out

through sampling. This would have the effect of starting the optimization with low dimensional

random projections but refining the latent projection as more observations are acquired. This

may be achievable through sampling Ψ from an Inverse-Wishart distribution or sampling the

elements of its Cholesky decomposition.
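The equivalence between a full positive definite Ψ and an implicit linear projection of the inputs is easy to verify numerically; the particular Ψ below is an arbitrary illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
D = 5

# An arbitrary positive definite Psi and its eigendecomposition Psi = V^T S V,
# where the rows of V are eigenvectors and S holds the eigenvalues.
B = rng.randn(D, D)
Psi = B @ B.T + D * np.eye(D)
eigvals, eigvecs = np.linalg.eigh(Psi)
S = np.diag(eigvals)
V = eigvecs.T                     # so that Psi == V.T @ S @ V

x, xp = rng.rand(D), rng.rand(D)
diff = x - xp

r2_direct = diff @ Psi @ diff                 # (x - x')^T Psi (x - x')
r2_latent = (V @ diff) @ S @ (V @ diff)       # (Vx - Vx')^T S (Vx - Vx')
```

Shrinking the smallest eigenvalues in S toward zero reduces the effective dimensionality of the implicit projection, which is the mechanism a prior on the eigenspectrum would exploit.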

[Figure 8.2 plots the minimum function value against the number of function evaluations (0–100) for four covariance functions: Matern 5/2 ARD, SqExp, SqExp ARD and Matern 3/2 ARD.]

Figure 8.2: A comparison of various covariance functions used in Bayesian optimization for optimizing the hyperparameters of M3E models on the protein motif finding task from Section 4.4.3.

8.2.3 Priors over Functions

Naturally, the performance of Bayesian optimization depends critically on how well the

prior distribution over functions fits the actual function being modeled. In Section 4.3.1

we argued that simply the degree of smoothness in the prior, as realized by the covariance

function of the GP, significantly affects the performance of Bayesian optimization. This was

empirically validated on a task involving the Bayesian optimization of the hyperparameters

of a structured support vector machine in Section 4.4.3. A figure demonstrating the results

from this empirical validation is reproduced in Figure 8.2 above. Clearly the choice of an

appropriate prior is extremely important to the performance of Bayesian optimization. While

we advocate the use of a particular covariance function, the ARD Matern 5/2 kernel, it is

important to emphasize that this is not necessarily the most appropriate prior for

all problems. It is a very general prior that is appropriate if very little is known about the

actual structure of the function of interest. It should be assumed that Bayesian optimization

will likely perform better given a stronger and more appropriate prior for a specific task.

There are many examples where the ARD Matern 5/2 covariance and other covariances

discussed in this work are not appropriate priors, and such cases certainly arise in the optimization


[Figure 8.3 contains two panels: (a) validation error against learning epoch, and (b) variance in error against learning epoch.]

Figure 8.3: Results of running a deep neural network trained on a subset of the MNIST digits data according to the “dropout” strategy of Hinton et al. [2012] one hundred times and plotting the mean error (Figure 8.3a) and variance in error (Figure 8.3b) per learning epoch (iteration).

of hyperparameters of machine learning algorithms. A common example is in the case of

nonstationary and heteroscedastic functions. In Figure 8.3 we show the results of running

a deep neural network trained on a subset of the MNIST digits data according to the

“dropout” strategy of Hinton et al. [2012] one hundred times and plotting the mean error

(Figure 8.3a) and variance in error (Figure 8.3b) per learning epoch (iteration). This function

may correspond to the one being optimized by a Bayesian optimization routine attempting to

optimize the error of the deep network with respect to only the number of learning epochs run

(note that it is likely that the number of learning epochs would simply be a single dimension in

a higher dimensional optimization). In this simple example it is easy to see that the function

is both nonstationary and heteroscedastic. Specifically, the degree of smoothness of the error

curve and the noise are both a function of the inputs. The heteroscedasticity in particular can

significantly harm the Bayesian optimization routine. The exponentiated quadratic

and Matern class of covariances assume a single noise term and length-scale. Thus, the GP

must account for the drastic difference in noise using a single term, likely significantly

overestimating the noise near the end of learning and underestimating it in the beginning. This

will have the effect that the Bayesian optimization will consider relatively small, although

realistically very significant, improvements in error to be accounted for by noise rather than

structure in the function. There have been numerous attempts to create nonstationary

covariance functions [see Rasmussen and Williams, 2006, chap. 4] and interesting methods for

incorporating heteroscedasticity and non-Gaussian residuals are emerging [Wang and Neal,

2012]. An important avenue of future work may incorporate these methods into the Gaussian


processes used in Bayesian optimization or develop very problem specific priors.
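As a minimal synthetic illustration of the problem (the learning-curve model below is an assumption for the sketch, not the MNIST data of Figure 8.3), the epoch-wise noise level can vary by an order of magnitude, so any single noise estimate is too small early in learning and too large near convergence:

```python
import numpy as np

rng = np.random.RandomState(0)
epochs = np.arange(1, 301)

# Synthetic "validation error" curves: the error decays with epoch and the
# noise level decays as well, making the function heteroscedastic.
noise_std = 10.0 / np.sqrt(epochs)
curves = (20.0 * np.exp(-epochs / 50.0) + 2.0
          + noise_std * rng.randn(100, epochs.size))

early = curves[:, :50].std(axis=0).mean()   # noise early in learning
late = curves[:, -50:].std(axis=0).mean()   # noise near convergence
single = curves.std(axis=0).mean()          # the one level a stationary GP must pick
```

Here `single` falls between `late` and `early`, so a stationary covariance with one noise term necessarily misestimates both regimes.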


Chapter 9

Conclusion

Much of the motivation for the work in this thesis was to make advances in machine learning

more accessible to researchers in assistive technology and health informatics. The goal of

assistive technology is, increasingly, to automate the role of a caregiver

by leveraging machine learning techniques. Problems in health informatics are similarly

increasingly being solved through the application of machine learning. In problems ranging

from the detection of falls in the home to the prediction of bone marrow transplant patient

survival, machine learning is having a profound impact on the critically important applications

in these domains. However, significant machine learning expertise is required to accomplish

these tasks effectively. A goal of this thesis is to automate the role of a machine learning

expert through automatically adapting models and adjusting parameters to a given task of

interest. This thesis consists of a number of contributions towards solving this challenging

open problem in machine learning.

In Chapter 3, through an interesting theoretical connection between GPLVMs and

autoencoders, we present a new semiparametric latent variable model for extracting features that

are more useful for a given discriminative task. The model allows one to flexibly interpolate

between a fully supervised neural network classifier, an unsupervised autoencoder and a

non-parametric distribution over functional mappings through adjusting a small number of

hyperparameters. In empirical analysis we show that, with an appropriate setting of the

parameters, this model outperforms several common models on challenging machine learning

benchmark problems.

In Chapter 4, we motivate the use of Bayesian optimization as a statistically rigorous

methodology for setting the hyperparameters of machine learning models. We present


methods for performing Bayesian optimization for hyperparameter selection of general machine

learning algorithms. We introduce a fully Bayesian treatment of EI, and algorithms for

dealing with variable time regimes and running experiments in parallel. The effectiveness

of our approaches is demonstrated on three challenging, recently published problems

spanning different areas of machine learning. The resulting Bayesian optimization finds better

hyperparameters significantly faster than the approaches used by the authors and surpasses

a human expert at selecting hyperparameters on the competitive CIFAR-10 dataset, beating

the state of the art by over 3%.
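For reference, the standard closed-form expected improvement for a single Gaussian predictive distribution, which the fully Bayesian treatment of Chapter 4 averages over samples of the GP hyperparameters, can be sketched as:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected improvement of a Gaussian prediction N(mu, sigma^2) over the
    best value observed so far, f_best, for a minimization problem."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)  # no predictive uncertainty
    gamma = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * gamma ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))
    return sigma * (gamma * cdf + pdf)

# A candidate predicted well below the incumbent has far higher EI than one
# predicted well above it.
ei_good = expected_improvement(mu=0.1, sigma=0.05, f_best=0.3)
ei_bad = expected_improvement(mu=0.5, sigma=0.05, f_best=0.3)
```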

In Chapter 5, we introduce two algorithms for incorporating the notion of parameter-dependent

variable cost in Bayesian optimization. In the common scenario where the cost of performing

a function evaluation depends on the parameters being optimized over, the efficiency of

Bayesian optimization should be considered in terms of this cost and not function evaluations.

We demonstrate empirically that in the scenario of a fixed cost budget, for example a deadline,

our algorithms find significantly better results than the standard greedy myopic Bayesian

optimization strategy.

Many problems that would seem appropriate for Bayesian optimization can be considered to

have unknown and complex constraints. Standard GP based Bayesian optimization performs

poorly in this scenario as the discontinuities resulting from constraint violations in the

function violate the smoothness prior of the Gaussian process. In Chapter 6, we introduce a

constrained acquisition function that allows Bayesian optimization to estimate the probability

that a candidate experiment is a constraint violation. In empirical analysis we demonstrate

that this constrained Bayesian optimization significantly outperforms the other Bayesian

optimization strategies in the presence of unknown parameter-dependent constraints.

Finally, in Chapter 7, we demonstrate empirically the impact of the algorithms and methods

introduced in this thesis on well-motivated real world applications in assistive technology and

health informatics. In Section 7.1 we apply the NPGA and Bayesian optimization towards

the automation of rehabilitation therapy. Section 7.2 demonstrates the use of Bayesian

optimization to set hyperparameters in a leukemia transplantation survival classification

problem. In Section 7.3 the parallel constrained Bayesian optimization algorithm is used

to tune numerous parameters to optimize a complex computer vision based fall detection

system and in Section 7.4 the Bayesian optimization methods introduced in this thesis are

used to significantly improve the performance of a hand-tracking algorithm critical to an

Alzheimer’s patient prompting system.


As a consequence of this work, it has become apparent that more sophisticated methods

for setting the hyperparameters of machine learning models can improve over the methods

currently being employed even by top experts in the field. The Bayesian optimization methods

introduced in this thesis have been empirically shown to find better hyperparameters faster

for state-of-the-art algorithms than the experts who developed them. The NPGA has also

been shown to achieve state of the art results on benchmark machine learning tasks and an

application in assistive technology.

There are numerous interesting avenues for future research, particularly in the further

development of Bayesian optimization for machine learning algorithms. Although the Bayesian

optimization methods in this work apply to a wide variety of general machine learning

algorithms, it is unlikely that the general approach will outperform strategies tailored to

specific problems. Gaussian process based Bayesian optimization depends heavily on the

ability to accurately model a function of interest. The statistical priors over functions

used in this work rely only on an assumption of smoothness. It is likely that stronger and

more appropriate priors tailored to a problem of interest would result in better models

of the function being optimized over. An interesting result may be that specific Bayesian

optimization strategies will be developed for the different common models in machine learning.

Hennig and Schuler [2012] propose to explicitly model a distribution over the optimum of a

function and translate the optimization problem into the problem of minimizing the entropy

of this model. However, currently this model is especially expensive to evaluate and requires

the optimization of expected improvement as an intermediary step. This is an interesting

new direction that warrants further research.

The loss, error or cost functions of machine learning models conditioned on some data are

frequently not smooth, stationary or homoscedastic functions of the model hyperparameters.

The error of a machine learning classifier, for example, is clearly a non-stationary and

heteroscedastic function of the learning rate used in stochastic gradient descent. Another

area for improvement in terms of Bayesian optimization may be in modeling these more

complex functions. This represents an interesting trade-off that warrants further research.

Specifically, capturing more complex structure in the distribution over functions will require

more parameters and thus likely more observations. An interesting question is whether the

additional modeling power justifies requiring more function evaluations.

Another interesting question is how to combine the results of multiple runs of Bayesian

optimization on different, but closely related, objective functions. It would seem inefficient,


and un-Bayesian, to start each new run of Bayesian optimization as if nothing is known about

the function being optimized when the results of previous optimization runs can be used.

There are numerous potential ways of incorporating the results of previous and concurrent

optimization runs to reduce the uncertainty over the optimum of a function of interest.

Finally, the use of Bayesian optimization for hyperparameter tuning offers a solution to

another significant problem in the domain of machine learning. In terms of reporting results,

using Bayesian optimization offers a more fair comparison of algorithms and baselines on

benchmark tasks. As many algorithms require significant parameter tuning and expertise to

achieve good performance, it is difficult to compare algorithms, particularly when researchers

have expertise in only one model. It is often unclear if a given model is naturally better for

a given task or if simply better hyperparameters were found. Using Bayesian optimization

offers a fair comparison and makes results reproducible by anyone applying the same procedure.

Furthermore, the underlying statistical models, Gaussian processes, offer an understanding

of the sensitivity of a model to a given hyperparameter. More research into using Bayesian

optimization for reporting and model comparison would be a benefit to the field of machine

learning.


Bibliography

A. Adami, M. Pavel, T. Hayes, A. Adami, and C. Singer. A method for classification of movements in bed. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2011. 77

R. P. Adams, Z. Ghahramani, and M. I. Jordan. Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems, 2010. 23

Alzheimer's Association. Alzheimer's disease facts and figures. Alzheimer's and Dementia: The Journal of the Alzheimer's Association, 8:131–168, March 2012. 91

D. Anderson, J. Keller, M. Skubic, X. Chen, and Z. He. Recognizing falls from silhouettes. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2006. 87

J. Azimi. Bayesian Optimization with Empirical Constraints. PhD thesis, Oregon State University, Oregon, United States, 2012. 73

J. Azimi, A. Fern, and X. Fern. Budgeted optimization with concurrent stochastic-duration experiments. In Advances in Neural Information Processing Systems, 2011. 46

J. Azimi, A. Jalali, and X. Z. Fern. Hybrid batch Bayesian optimization. In International Conference on Machine Learning, 2012. 53

R. Bardenet and B. Kegl. Surrogating the surrogate: accelerating Gaussian process based global optimization with a mixture cross-entropy algorithm. In International Conference on Machine Learning, 2010. 53

S. Becker, E. J. Candes, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011. 84

M. Belshaw, B. Taati, D. Giesbrecht, and A. Mihailidis. Intelligent vision-based fall detection system: preliminary results from a real-world deployment. In Rehabilitation Engineering and Assistive Technology Society of North America (RESNA), 2011a. 77, 87

M. Belshaw, B. Taati, J. Snoek, and A. Mihailidis. Towards a single sensor passive solution for automated fall detection. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2011b. 87

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. 20

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 2007. 24, 26, 31, 36

J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012. 47

J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011. 46, 55, 79

C. M. Bishop and G. D. James. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research, pages 580–593, 1993. 35

S. Bitzer and C. Williams. Kick-starting GPLVM optimization via a connection to metric MDS. In Advances in Neural Information Processing Systems Workshop on Visualisation, December 2010. 15

E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, 2008. 52

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. 21

P. Bratley and B. Fox. Algorithm 659: Implementing Sobol's quasirandom sequence generator. ACM Transactions on Mathematical Software, 14:88–100, 1988. 53, 55, 72

E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. pre-print, 2010. arXiv:1012.2599. 21, 47, 63, 81

A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(3-4):2879–2904, 2011. 22, 45, 63

C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. In International Conference on Artificial Intelligence and Statistics, 2009. 54

B. Chen, R. Castro, and A. Krause. Joint optimization and variable selection of high-dimensional Gaussian processes. In International Conference on Machine Learning, 2012. 99

Y. Chu, Y. C. Song, R. Levinson, and H. A. Kautz. Interactive activity recognition and prompting to assist people with cognitive disabilities. Journal of Ambient Intelligence and Smart Environments, 4(5):443–459, 2012. 77

A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, 2011. 60

A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Conference on Artificial Intelligence and Statistics, 2011. 39

D. J. Cook. How smart is your home? Science, 335(6076):1579–1581, 2012. 78

G. W. Cottrell, P. Munro, and D. Zipser. Learning internal representations from gray-scale images: An example of extensional programming. In Conference of the Cognitive Science Society, pages 462–473, 1987. 19, 25

N. Cristianini and B. Scholkopf. Support vector machines and kernel methods: the new generation of learning machines. Association for the Advancement of Artificial Intelligence AI Magazine, 23(3):31–41, 2002. 11

S. Czarnuch and A. Mihailidis. The coach: A real-world effectiveness study. In Canadian Student Health Research Forum, 2012. 91

S. Czarnuch and A. Mihailidis. An efficacy study of the coach in a real-world deployment. (in review) Journal of Ambient Intelligence and Smart Environments (Thematic Issue on Designing and Deploying Intelligent Environments), 2013. 91

G. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In International Conference on Acoustics, Speech, and Signal Processing, 2013. 98

N. de Freitas, A. J. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In International Conference on Machine Learning, 2012. 22, 49

L. Deng, M. Seltzer, D. Yu, A. Acero, A.-R. Mohamed, and G. E. Hinton. Binary coding of speech spectrograms using a deep autoencoder. In Interspeech, 2010. 23

B. H. Dobkin. Clinical practice, rehabilitation after stroke. New England Journal of Medicine, 352:1677–1684, 2005. 79

C. H. Ek, P. H. Torr, and N. D. Lawrence. GP-LVM for data consolidation. In Advances in Neural Information Processing Systems Workshop on Learning from Multiple Sources, 2008. 18

D. Erhan, Y. Bengio, A. C. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010. 20

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, 2012. 22

R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In International Conference on Information Processing in Sensor Networks, 2010. 46

A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear dimensionality reduction. In CVPR, pages 880–887, June 2009. 15

D. Ginsbourger and R. Riche. Towards Gaussian process-based optimization with finite time horizon. In Advances in Model-Oriented Design and Analysis, Contributions to Statistics, pages 89–96. Physica-Verlag HD, 2010a. 64

D. Ginsbourger and R. L. Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010b. 53

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, 2004. 34

R. B. Gramacy and H. K. Lee. Optimization under unknown constraints. pre-print, 2010. URL http://arxiv.org/pdf/1004.4027v2.pdf. arXiv:1004.4027. 73

R. B. Gramacy and H. K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008. 98

A. Gratwohl, H. Baldomero, M. Aljurf, et al. Hematopoietic stem cell transplantation: A global perspective. The Journal of the American Medical Association, 303(16):1617–1624, 2010. 83

R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. 34

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, pages 1809–1837, 2012. 105

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. 20

G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, 1994. 19

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. pre-print, 2012. URL http://arxiv.org/abs/1207.0580. x, 20, 75, 101

J. Hoey, A. V. Bertoldi, T. Craig, P. Poupart, and A. Mihailidis. Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process. Computer Vision and Image Understanding (Special Issue on Computer Vision Systems), 114:503–519, 2010. 91

M. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 2010. 56, 57, 58

M. D. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, 2011. 48

R. Huq, P. Kan, R. Goetschalckx, D. Hebert, J. Hoey, and A. Mihailidis. A decision-theoretic approach in the design of an adaptive upper-limb stroke rehabilitation robot. In International Conference of Rehabilitation Robotics (ICORR), 2011. 79

F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011. 46, 53

D. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001. 21, 45, 55, 65

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998. 21

P. Kan, R. Huq, J. Hoey, R. Goestschalckx, and A. Mihailidis. The development of an adaptive upper-limb stroke rehabilitation robotic system. Neuroengineering and Rehabilitation, 2011. 79

M. C. Kennedy and A. O'Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3), 2001. 46

D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52:119–139, 1951. 8

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009. 60

M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advancesin Neural Information Processing Systems, 2010. 58

H. J. Kushner. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964. 48

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In International Conference on Machine Learning, 2007. 24

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005. 13, 14, 15, 24, 26

N. D. Lawrence and A. J. Moore. Hierarchical Gaussian process latent variable models. In International Conference on Machine Learning, 2007. 18

N. D. Lawrence and J. Quinonero Candela. Local distance preservation in the GP-LVM through back constraints. In International Conference on Machine Learning, 2006. 18, 29, 30

N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In International Conference on Machine Learning, 2009. 27

N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, 2002. 97

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004. 40, 42

T. Lee and A. Mihailidis. An intelligent emergency response system: preliminary development and testing of automated fall detection. Journal of Telemedicine and Telecare, 11:194–198, 2005. 87

D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2008. 21

D. J. Lizotte, R. Greiner, and D. Schuurmans. An experimental methodology for response surface optimization methods. Journal of Global Optimization, 53(4):699–736, 2012. 53

D. Lowe and M. E. Tipping. Neuroscale: Novel topographic feature extraction using RBF networks. In Advances in Neural Information Processing Systems, 1997. 30

E. Lu, R. Wang, R. Huq, D. Gardner, P. Karam, K. Zabjek, D. Hebert, J. Boger, and A. Mihailidis. Development of a robotic device for upper limb stroke rehabilitation: A user-centered design approach. Journal of Behavioral Robotics, 2012. 79

D. J. MacKay. Bayesian non-linear modeling for the prediction competition. ASHRAE Transactions,100:1053–1062, 1994a. 12

D. J. MacKay. Bayesian neural networks and density networks. In Nuclear Instruments and Methods in Physics Research, A, pages 73–80, 1994b. 30

D. J. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11:1035–1068, 1999. 15

D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992. 15

D. J. C. MacKay. Introduction to Gaussian processes. Neural Networks and Machine Learning, 1998. 28, 29

N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas. Adaptive MCMC with Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, 2012. 46, 53

J. Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning, 2010. 20

G. Matheron. Traité de géostatistique appliquée, volume 14. Éditions Technip, Paris, 1962. 8

A. Mihailidis, J. Boger, T. Craig, and J. Hoey. The COACH prompting system to assist older adults with dementia through handwashing: An efficacy study. BMC Geriatrics, 8, 2008. 77, 91

K. Miller, M. P. Kumar, B. Packer, D. Goodman, and D. Koller. Max-margin min-entropy models. In International Conference on Artificial Intelligence and Statistics, 2012. 58, 59

T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence, 2001. 95

J. Mockus. Bayesian Approach to Global Optimization. Kluwer, Dordrecht, Netherlands, 1989. 64

J. Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347–365, 1994. 64

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978. 21, 45, 63, 64, 81

M. F. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks,6(4):525–533, 1993. 15

M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Association for the Advancement of Artificial Intelligence National Conference, 2002. 77

I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, 2010. 50, 53, 54, 55, 72

I. Murray, R. P. Adams, and D. J. MacKay. Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, 2010. 72, 95

V. Nair and G. E. Hinton. 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems, 2009. 42

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010. 26

H. Nait-Charif and S. McKenna. Activity summarization and fall detection in a supportive home environment. In International Conference on Pattern Recognition, 2004. 87

R. Navaratnam, A. W. Fitzgibbon, and R. Cipolla. The joint manifold model for semi-supervised multi-valued regression. In International Conference on Computer Vision, 2007. 18

R. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, Toronto, Canada, 1994. 8

R. Neal. Bayesian learning for neural networks. Lecture Notes in Statistics, 118, 1996. 8, 30

M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian Processes for Global Optimization. In International Conference on Learning and Intelligent Optimization, 2009. 46

M. E. Pollack, L. Brown, D. Colbry, C. E. McCarthy, C. Orosz, B. Peintner, S. Ramakrishnan, and I. Tsamardinos. Autominder: An intelligent cognitive orthotic system for people with memory impairment, 2003. 77

M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. In International Conference on Machine Learning, 2008. 24, 31

M. Ranzato, Y.-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, 2007. 20

P. Rashidi and D. Cook. Keeping the resident in the loop: Adapting the smart home to the user. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 39(5):949–959, 2009. 78

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. 8, 9, 11, 12, 13, 28, 48, 101

S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning, 2011. 20

C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau. Monocular 3D head tracking to detect falls of elderly people. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2006. 87

S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, 1996. 84

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000. 16

R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In International Conference on Artificial Intelligence and Statistics, 2007. 34

R. Salakhutdinov and G. Hinton. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, 2008. 33

R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Conference on Artificial Intelligence and Statistics, 2010. 42

R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008. 84

E. Saund. Dimensionality-reduction using connectionist networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:304–314, 1989. 19

A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng. On random weights and unsupervised feature learning. In International Conference on Machine Learning, 2011. 60

M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes Monograph Series, 34:11–25, 1998. 64

A. P. Shon, K. Grochow, A. Hertzmann, and R. P. N. Rao. Learning shared latent structure for image synthesis and robotic imitation. In Advances in Neural Information Processing Systems, 2005. 35

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. 92, 93

Smartrisk. The economic burden of injury in Canada, 2009. 87

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2005. 97

J. Snoek, J. Hoey, L. Stewart, R. S. Zemel, and A. Mihailidis. Automated detection of unusual events on stairs. Image and Vision Computing, 27(1-2):153–166, 2009. 77

J. Snoek, B. Taati, Y. Eskin, and A. Mihailidis. Automatic segmentation of video to aid the study of faucet usability for older adults. In IEEE Conference on Computer Vision and Pattern Recognition Workshop for Human Communicative Behavior Analysis, 2010a.

J. Snoek, B. Taati, and A. Mihailidis. Automated detection of falls in the home - current challenges and future directions. In International Society of Gerontechnology World Conference, 2010b. 87

J. Snoek, R. Adams, and H. Larochelle. Semiparametric latent variable models for guided representation. In The Learning Workshop (Snowbird), 2011a.

J. Snoek, H. Larochelle, and R. P. Adams. Opportunity cost in Bayesian optimization. In Neural Information Processing Systems Workshop on Bayesian Optimization, 2011b.

J. Snoek, R. P. Adams, and H. Larochelle. On nonparametric guidance for learning autoencoder representations. In International Conference on Artificial Intelligence and Statistics, 2012a. 78

J. Snoek, R. P. Adams, and H. Larochelle. Nonparametric guidance of autoencoder representations using label information. Journal of Machine Learning Research, 13:2567–2588, 2012b.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012c. 78, 79, 81

J. Snoek, B. Taati, and A. Mihailidis. An automated machine learning approach applied to robotic stroke rehabilitation. AAAI Symposium on Gerontechnology, 2012d.

J. Spehr, M. Gvercin, S. Winkelbach, E. Steinhagen-Thiessen, and F. Wahl. Visual fall detection in home environments. In International Conference of the International Society for Gerontechnology, 2008. 87

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010. 22, 45, 49, 63

M. Stein. Statistical Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999. 12

B. Taati, J. Snoek, D. Aleman, A. Mihailidis, and A. Ghavamzadeh. Machine learning techniques for data mining in bone marrow transplant records. In The 54th Annual Conference of the Canadian Operational Research Society (CORS), May 2012a. 83

B. Taati, J. Snoek, D. Aleman, A. Mihailidis, and A. Ghavamzadeh. Applying collaborative filtering techniques to data mining in bone marrow transplant records. In INFORMS, Oct 2012b. 83

B. Taati, R. Wang, R. Huq, J. Snoek, and A. Mihailidis. Vision-based posture assessment to detect and categorize compensation during robotic rehabilitation therapy. In International Conference on Biomedical Robotics and Biomechatronics, 2012c. 79, 80, 81

B. Taati, J. Snoek, D. Aleman, and A. Ghavamzadeh. Data mining in bone marrow transplant records to identify patients with high odds of survival. Journal of Biomedical and Health Informatics (to appear), 2013. 83

Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In International Conference on Artificial Intelligence and Statistics, 2005. 52

M. E. Tipping and C. M. Bishop. Probabilistic principal components analysis. Journal of the Royal Statistical Society, Series B, 61(3):611–622, 1999. 13

A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008. 38

United Nations. World population prospects, 2008. 3

United Nations. Population ageing and development, 2012. 3, 87

R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In International Conference on Machine Learning, 2007. 17, 18, 27, 35

R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. 16

R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. J. Darrell, and N. D. Lawrence. Topologically-constrained latent variable models. In International Conference on Machine Learning, 2008. 16, 19

U.S. Census Bureau. America's families and living arrangements: 2011. Table A2: Family status and household relationship of people 15 years and over, by marital status, age, and sex, 2011. 91

L. van der Maaten. Preserving local structure in Gaussian process latent variable models. In 18th Annual Belgian-Dutch Conference on Machine Learning, pages 88–91, 2009. 15, 17

L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008. 17

E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010. 22

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pages 1096–1103, 2008. 20, 23, 26, 33

F. Vivarelli and C. K. I. Williams. Discovering hidden features with Gaussian process regression. In Advances in Neural Information Processing Systems, 1999. 28, 99

C. Wang and R. M. Neal. Gaussian process regression with heteroscedastic or non-Gaussian residuals. pre-print, 2012. URL http://arxiv.org/abs/1212.6246. 101

J. M. Wang, D. J. Fleet, and A. Hertzmann. Multifactor Gaussian process models for style-content separation. In International Conference on Machine Learning, 2007. 35

J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008. 16, 17, 27

Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in a billion dimensions via random embeddings. pre-print, 2013. URL http://arxiv.org/abs/1207.0580. 99, 100

C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998. 29, 30

C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, 1996. 12

A. Yao, J. Gall, L. V. Gool, and R. Urtasun. Learning probabilistic non-linear latent variable models for tracking complex activities. In Advances in Neural Information Processing Systems, 2011. 15, 96

C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning, 2009. 58

R. S. Zemel, C. K. I. Williams, and M. C. Mozer. Lending direction to neural networks. Neural Networks, 8:503–512, 1995. 34